Error Monitoring & Alerting

Build workflows that triage errors, process alerts, and dispatch notifications across channels.

This guide covers workflows that monitor external systems, classify errors, and route alerts. For handling errors that occur inside your own workflows, see Errors & Retrying.

Error monitoring is one of the most common workflow use cases. A typical pipeline receives error events from external systems, classifies them, deduplicates repeat occurrences, and dispatches notifications to the right channels. Workflows are a natural fit because they survive failures, retry flaky notification APIs, and maintain state across long-running monitoring loops.

Error Triage Workflow

The simplest error monitoring workflow receives an error event via webhook, classifies it by severity, and routes it to the appropriate handler.

workflows/error-triage.ts
import { createWebhook } from "workflow";

interface ErrorEvent {
  source: string;
  message: string;
  stack?: string;
  metadata?: Record<string, unknown>;
}

async function classifyError(event: ErrorEvent) {
  "use step";

  // Classify based on error patterns
  if (event.message.includes("FATAL") || event.message.includes("OOM")) {
    return "critical" as const;
  }
  if (event.message.includes("timeout") || event.message.includes("rate limit")) {
    return "warning" as const;
  }
  return "info" as const;
}

async function handleCritical(event: ErrorEvent) { 
  "use step";
  // Page on-call, create incident ticket, etc.
  console.log(`CRITICAL: ${event.source} - ${event.message}`);
}

async function handleWarning(event: ErrorEvent) {
  "use step";
  // Post to team Slack channel
  console.log(`WARNING: ${event.source} - ${event.message}`);
}

async function handleInfo(event: ErrorEvent) {
  "use step";
  // Log for later review
  console.log(`INFO: ${event.source} - ${event.message}`);
}

export async function errorTriageWorkflow() {
  "use workflow";

  const webhook = createWebhook(); 
  console.log("Listening for errors at:", webhook.url);

  for await (const request of webhook) { 
    const event: ErrorEvent = await request.json();
    const severity = await classifyError(event);

    if (severity === "critical") {
      await handleCritical(event);
    } else if (severity === "warning") {
      await handleWarning(event);
    } else {
      await handleInfo(event);
    }
  }
}

The workflow creates a persistent webhook endpoint, and external systems POST error events to it. Each event is classified in a step (with full Node.js access for pattern matching, database lookups, or ML inference), then routed to the correct handler. Because the workflow iterates the webhook with for await...of, it stays alive and processes errors as they arrive.

Webhooks implement AsyncIterable, so a single workflow instance can process an unlimited stream of events over time. See Hooks & Webhooks for details on iteration and custom tokens.
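
To feed the pipeline, an external system only needs to POST JSON matching the ErrorEvent shape to the webhook URL. Here is a hypothetical reporter on the sending side (the environment variable and helper name are placeholders, not part of the framework):

// Hypothetical error reporter in an external service; any HTTP client works.
// ERROR_WEBHOOK_URL holds the URL printed by errorTriageWorkflow, stored
// wherever your services keep configuration.
const WEBHOOK_URL = process.env.ERROR_WEBHOOK_URL!;

export async function reportError(source: string, error: Error) {
  await fetch(WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      source,
      message: error.message,
      stack: error.stack,
      metadata: { reportedAt: Date.now() },
    }),
  });
}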

Alert Processing Pipeline

Real alert pipelines need deduplication. When the same error fires hundreds of times in a minute, you want one alert, not hundreds. Use custom hook tokens to route duplicate events to the same workflow instance.

workflows/alert-pipeline.ts
import { createHook, getStepMetadata } from "workflow";

interface Alert {
  alertId: string;
  source: string;
  message: string;
  timestamp: number;
}

interface EnrichedAlert extends Alert {
  service: string;
  owner: string;
  runbook: string;
}

async function enrichAlert(alert: Alert): Promise<EnrichedAlert> {
  "use step";

  // Look up service metadata from your registry
  const service = alert.source.split("/")[0];
  return {
    ...alert,
    service,
    owner: `team-${service}`,
    runbook: `https://runbooks.internal/${service}/${alert.alertId}`,
  };
}

async function dispatchNotification(alert: EnrichedAlert) {
  "use step";

  const { stepId } = getStepMetadata(); 

  await fetch("https://hooks.slack.com/services/...", {
    method: "POST",
    headers: { "Idempotency-Key": stepId }, 
    body: JSON.stringify({
      text: `[${alert.source}] ${alert.message}\nOwner: ${alert.owner}\nRunbook: ${alert.runbook}`,
    }),
  });
}

export async function alertPipelineWorkflow(alertId: string) { 
  "use workflow";

  // Custom token ensures duplicate alerts route here
  const hook = createHook<Alert>({ token: `alert:${alertId}` }); 

  // Process the first alert
  const alert = await hook;
  const enriched = await enrichAlert(alert);
  await dispatchNotification(enriched);
}

The key pattern here is the custom hook token. When your ingestion layer receives an alert, it uses resumeHook() to deliver the payload to the waiting workflow. The hook token alert:${alertId} is deterministic, so the sender does not need to know the workflow's run ID.

app/api/alerts/route.ts
import { start, resumeHook } from "workflow/api";
declare function alertPipelineWorkflow(alertId: string): Promise<void>; // @setup

export async function POST(request: Request) {
  const alert = await request.json();
  const token = `alert:${alert.alertId}`;

  // Start the workflow first, then deliver the alert data
  await start(alertPipelineWorkflow, [alert.alertId]); 

  // The hook may not be registered yet — retry until it is
  let delivered = false;
  for (let i = 0; i < 5 && !delivered; i++) {
    try {
      await resumeHook(token, alert); 
      delivered = true;
    } catch {
      await new Promise((r) => setTimeout(r, 500 * (i + 1)));
    }
  }

  return Response.json({ delivered });
}

The workflow needs time to register the hook after start() returns. The retry loop above handles this race. For high-throughput scenarios, consider using createWebhook() instead, which provides an HTTP endpoint that handles delivery automatically.
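
As a sketch of that webhook-based alternative, assuming the pipeline workflow created its endpoint with createWebhook() and you persisted the URL somewhere the ingestion layer can read it (the environment variable below is a placeholder), the route becomes a plain forward with no retry loop:

export async function POST(request: Request) {
  const alert = await request.json();

  // Placeholder lookup; store the webhook URL however suits your setup.
  const webhookUrl = process.env.ALERT_PIPELINE_WEBHOOK_URL!;

  // Forward the alert as-is; the webhook endpoint accepts the POST on behalf
  // of the waiting workflow.
  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(alert),
  });

  return Response.json({ delivered: response.ok });
}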

Real-Time Alert Dispatch

When a critical event needs immediate attention, fan out notifications to multiple channels in parallel using Promise.all. Each channel is its own step, so a failure in one (e.g., Slack API is down) does not block the others, and each is retried independently.

workflows/instant-alert.ts
import { createWebhook, FatalError, RetryableError, getStepMetadata } from "workflow";

interface CriticalEvent {
  title: string;
  description: string;
  severity: "P1" | "P2";
  source: string;
}

async function sendSlackAlert(event: CriticalEvent) {
  "use step";

  const { stepId } = getStepMetadata(); 

  const response = await fetch("https://hooks.slack.com/services/...", {
    method: "POST",
    headers: { "Idempotency-Key": stepId }, 
    body: JSON.stringify({
      text: `*${event.severity}: ${event.title}*\n${event.description}`,
    }),
  });

  if (response.status === 429) {
    throw new RetryableError("Slack rate limited", { retryAfter: "30s" }); 
  }

  if (!response.ok) {
    throw new FatalError(`Slack API error: ${response.status}`); 
  }
}

async function sendEmailAlert(event: CriticalEvent) {
  "use step";

  await fetch("https://api.sendgrid.com/v3/mail/send", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.SENDGRID_KEY}` },
    body: JSON.stringify({
      to: "oncall@example.com",
      subject: `${event.severity}: ${event.title}`,
      text: event.description,
    }),
  });
}

async function createPagerDutyIncident(event: CriticalEvent) {
  "use step";

  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_KEY,
      event_action: "trigger",
      payload: {
        summary: `${event.severity}: ${event.title}`,
        source: event.source,
        severity: event.severity === "P1" ? "critical" : "error",
      },
    }),
  });
}

export async function instantAlertWorkflow() {
  "use workflow";

  const webhook = createWebhook();

  const request = await webhook;
  const event: CriticalEvent = await request.json();

  // Fan out to all channels in parallel
  await Promise.all([ 
    sendSlackAlert(event), 
    sendEmailAlert(event), 
    createPagerDutyIncident(event), 
  ]); 
}

Because each notification is a separate step, the framework retries failures independently. If PagerDuty returns a 500, Slack and email still succeed, and the PagerDuty step retries on its own schedule.
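
If you would rather the workflow complete even when one channel fails permanently (for example, the FatalError thrown above for a non-429 Slack error), one option is to collect per-channel outcomes with Promise.allSettled instead of letting a single rejection fail the fan-out. A minimal sketch reusing the same steps; the workflow name is illustrative and this assumes a rejected step promise behaves like an ordinary rejection:

export async function instantAlertTolerantWorkflow() {
  "use workflow";

  const webhook = createWebhook();
  const request = await webhook;
  const event: CriticalEvent = await request.json();

  // allSettled waits for every channel to settle, so one permanently failed
  // step does not reject the whole fan-out.
  const results = await Promise.allSettled([
    sendSlackAlert(event),
    sendEmailAlert(event),
    createPagerDutyIncident(event),
  ]);

  const failures = results.filter((r) => r.status === "rejected");
  if (failures.length > 0) {
    console.log(`${failures.length} notification channel(s) failed permanently`);
  }
}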

External System Monitoring

Not all monitoring is event-driven. Sometimes you need to poll external systems on a schedule. The core pattern is a sleep() loop that checks a service in a step and alerts on failure:

workflows/monitor-service.ts
import { sleep } from "workflow";
declare function checkServiceHealth(endpoint: string): Promise<{ healthy: boolean; latency: number }>; // @setup
declare function sendAlert(service: string, latency: number): Promise<void>; // @setup

export async function monitorServiceWorkflow(service: string, endpoint: string) {
  "use workflow";

  while (true) {
    const status = await checkServiceHealth(endpoint); 

    if (!status.healthy) {
      await sendAlert(service, status.latency);
    }

    await sleep("5m"); 
  }
}

This uses the durable polling pattern — sleep() consumes no compute while waiting, and the workflow resumes at the correct time even after restarts. For more advanced scheduling patterns including consecutive failure tracking, graceful shutdown, and cron-like dispatching, see Scheduling & Cron.
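
As a small taste of consecutive-failure tracking, the sketch below only alerts after three failed checks in a row, so a single blip does not page anyone. The workflow name and threshold are illustrative; it reuses the steps declared above:

export async function monitorWithThresholdWorkflow(service: string, endpoint: string) {
  "use workflow";

  let consecutiveFailures = 0;

  while (true) {
    const status = await checkServiceHealth(endpoint);

    if (status.healthy) {
      consecutiveFailures = 0;
    } else {
      consecutiveFailures++;
      // Alert only when the streak first crosses the threshold.
      if (consecutiveFailures === 3) {
        await sendAlert(service, status.latency);
      }
    }

    await sleep("5m");
  }
}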

sleep() accepts duration strings like "5m", "1h", or "30s", as well as Date objects for sleeping until a specific time. See the sleep() API reference for all supported formats.

Content Security Scanning

Workflows can also monitor content against security or policy rules. This pattern receives content via webhook, scans it in a step, and takes action on violations.

workflows/content-security.ts
import { createWebhook } from "workflow";

interface ContentEvent {
  contentId: string;
  body: string;
  author: string;
  type: "post" | "comment" | "message";
}

interface ScanResult {
  passed: boolean;
  violations: string[];
}

async function scanContent(event: ContentEvent): Promise<ScanResult> {
  "use step";

  const violations: string[] = [];

  // Check against policy rules
  const blockedPatterns = [/credential/i, /api[_-]?key/i, /password\s*=/i];
  for (const pattern of blockedPatterns) {
    if (pattern.test(event.body)) {
      violations.push(`Blocked pattern: ${pattern.source}`);
    }
  }

  return { passed: violations.length === 0, violations };
}

async function quarantineContent(contentId: string, violations: string[]) {
  "use step";

  // Move content to review queue
  await fetch("https://api.internal/content/quarantine", {
    method: "POST",
    body: JSON.stringify({ contentId, violations }),
  });
}

async function notifySecurityTeam(event: ContentEvent, result: ScanResult) {
  "use step";

  await fetch("https://hooks.slack.com/services/...", {
    method: "POST",
    body: JSON.stringify({
      text: `Content violation in ${event.type} by ${event.author}: ${result.violations.join(", ")}`,
    }),
  });
}

export async function contentSecurityWorkflow() {
  "use workflow";

  const webhook = createWebhook();

  for await (const request of webhook) {
    const event: ContentEvent = await request.json();
    const result = await scanContent(event); 

    if (!result.passed) { 
      await Promise.all([
        quarantineContent(event.contentId, result.violations),
        notifySecurityTeam(event, result),
      ]);
    }
  }
}

The scanning step has full Node.js access, so it can call external scanning APIs, run regex-based rules, or invoke ML models. When a violation is found, the workflow quarantines the content and notifies the security team in parallel.
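
For instance, the scan could be delegated to an external service instead of inline regexes. The endpoint and response shape below are placeholders for whatever scanner you use:

async function scanContentExternally(event: ContentEvent): Promise<ScanResult> {
  "use step";

  // Hypothetical scanning API; adjust the URL and payload to your scanner.
  const response = await fetch("https://scanner.internal/v1/scan", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: event.body }),
  });

  // Assumes the scanner returns { findings: string[] }.
  const { findings } = (await response.json()) as { findings: string[] };
  return { passed: findings.length === 0, violations: findings };
}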