Engineering
2026-02-17

Alerting on OpenClaw Incidents: Slack, Discord, and PagerDuty in One Afternoon

AUTHOR
Operations Team

Monitoring a fleet of autonomous agents is only useful if you can react to failure in real time. If an agent starts burning $100/minute in a reasoning loop, checking a dashboard once an hour is a $6,000 mistake. You need automated, high-fidelity alerting.

This guide walks through setting up the three most critical alerts for any OpenClaw deployment, with examples you can implement today.

1. The "Heartbeat Stale" Alert (Offline Nodes)

The most basic failure mode: an agent process dies or loses its network connection. In a distributed fleet, you need to know the moment a node goes dark.

Example Webhook Payload:

{
  "event": "agent.offline",
  "agent_id": "agent-v1-882",
  "reason": "heartbeat_timeout",
  "severity": "high",
  "timestamp": "2026-02-17T14:30:00Z"
}
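To make the payload concrete, here is a minimal Python sketch of a receiver that routes this event to a channel. The routing rules and the `route_alert` name are illustrative assumptions, not part of any documented OpenClaw or ClawTrace API.

```python
import json

# Payload matching the example above; field names follow this post,
# not a documented schema.
PAYLOAD = """{
  "event": "agent.offline",
  "agent_id": "agent-v1-882",
  "reason": "heartbeat_timeout",
  "severity": "high",
  "timestamp": "2026-02-17T14:30:00Z"
}"""

def route_alert(raw: str) -> str:
    """Pick a notification channel based on the event's severity."""
    event = json.loads(raw)
    if event["event"] != "agent.offline":
        return "ignore"
    # High-severity offline events page someone; the rest go to chat.
    return "pagerduty" if event["severity"] == "high" else "slack"

print(route_alert(PAYLOAD))  # -> pagerduty
```

The key design choice is routing on severity at ingestion time, so a flapping dev-environment agent never pages the on-call engineer.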

2. The "Cost Spike" Alert (Financial Safety)

Agents can be expensive stochastic oscillators. If an agent's token consumption spikes to a multiple of its rolling average, you need a PagerDuty incident immediately.

CLI Configuration Example:

clawtrace alert create \
  --metric token_spend_rate \
  --threshold 5.00 \
  --period 1m \
  --channel pagerduty_critical
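The CLI above configures the check server-side. If you were rolling your own detector, the rolling-average comparison might look like this sketch; the `SpendMonitor` class is hypothetical, and treating the threshold as a multiplier over the rolling average is an assumption about the flag's semantics.

```python
from collections import deque

class SpendMonitor:
    """Flag when per-minute spend exceeds a multiple of the rolling
    average. The 5.0 multiplier mirrors the --threshold flag above."""

    def __init__(self, window: int = 30, multiplier: float = 5.0):
        self.samples = deque(maxlen=window)  # recent $/minute readings
        self.multiplier = multiplier

    def observe(self, dollars_per_minute: float) -> bool:
        """Record a sample; return True if it should trigger an alert."""
        if len(self.samples) >= 5:  # need a baseline before alerting
            avg = sum(self.samples) / len(self.samples)
            if dollars_per_minute > avg * self.multiplier:
                return True  # spike: do not pollute the baseline with it
        self.samples.append(dollars_per_minute)
        return False

monitor = SpendMonitor()
for spend in [1.0, 1.2, 0.9, 1.1, 1.0]:
    monitor.observe(spend)       # normal traffic builds the baseline
print(monitor.observe(100.0))    # -> True (reasoning-loop spike)
```

Note that an alerting sample is deliberately excluded from the baseline, so a sustained runaway loop keeps firing instead of normalizing itself into the average.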

3. The "Tool Error Rate" Alert (Reasoning Failure)

Sometimes an agent is online and "thinking," but every tool it calls is failing. This usually indicates a broken API connection or a hallucinated tool signature.

Logic Pattern: Check for error_rate > 30% over a 5-minute window. This prevents alerting on transient network flutters while catching serious logic failures.
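A sliding-window version of that logic can be sketched as follows; the class and method names are illustrative, and the `min_calls` guard is an added assumption to avoid alerting on a handful of calls.

```python
import time
from collections import deque
from typing import Optional

class ToolErrorWindow:
    """Sliding 5-minute window of tool-call outcomes; alert when the
    error rate crosses 30%, per the thresholds in the post."""

    def __init__(self, window_seconds: int = 300, threshold: float = 0.30,
                 min_calls: int = 10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_calls = min_calls  # avoid alerting on tiny samples
        self.calls = deque()        # (timestamp, succeeded) pairs

    def record(self, succeeded: bool, now: Optional[float] = None) -> bool:
        """Record one tool call; return True if the alert should fire."""
        now = time.time() if now is None else now
        self.calls.append((now, succeeded))
        # Drop calls that have aged out of the window.
        while self.calls and self.calls[0][0] < now - self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.min_calls:
            return False
        errors = sum(1 for _, ok in self.calls if not ok)
        return errors / len(self.calls) > self.threshold

w = ToolErrorWindow()
for i in range(8):
    w.record(True, now=i)                       # healthy calls
print([w.record(False, now=8 + j) for j in range(4)])
# -> [False, False, False, True]
```

Because old calls age out of the deque, a burst of transient failures five minutes ago no longer counts against the agent, which is exactly the "network flutter" immunity the pattern is after.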

Bridging to Your Rails

Setting up these bridges manually means writing dozens of webhook handlers, retry logic, and secret management. You have to handle Discord's rate limits, Slack's Block Kit formatting, and PagerDuty's Events API.
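For a sense of what even the simplest bridge involves, here is a sketch of Slack delivery with retries. The webhook URL is a placeholder (Slack issues the real one per workspace), and the `opener` parameter is an illustrative seam for testing, not part of any library API.

```python
import json
import time
import urllib.request

# Placeholder URL; Slack issues the real incoming-webhook URL when you
# create an app in your workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_to_slack(text: str, opener=urllib.request.urlopen,
                  retries: int = 3) -> bool:
    """Deliver a plain-text Slack message with exponential backoff.
    A production bridge also needs HTTP 429 rate-limit handling, Block
    Kit formatting, and secret storage; this shows only the retry core."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=body,
        headers={"Content-Type": "application/json"})
    for attempt in range(retries):
        try:
            with opener(req, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # network failure or HTTP error; back off and retry
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    return False
```

Multiply this by three providers, each with its own payload shape and failure modes, and the "dozens of handlers" estimate above starts to look conservative.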

Implementation Shortcut

With ClawTrace, you can connect these channels in the Settings -> Alerts panel in minutes. We handle the formatting, the debouncing, and the secure delivery, so you can focus on building the agents, not the plumbing.

Conclusion: Reactive to Proactive

Move your operations from "searching for logs after a crash" to "receiving a Slack notification before the crash affects your users." Real-time alerting is the final piece of the production agent puzzle.