Postmortem: The Night Our OpenClaw Agent Took Down Staging
At 2:14 AM on a Tuesday, our OpsBot-01—an experimental OpenClaw agent designed to clean up temporary log volumes—deleted the main staging database disk. It wasn't a bug in the code. It was a failure of oversight.
The Sequence of Events
The agent was tasked with identifying "unused volumes." It called its list_volumes tool, correctly identified 20 log volumes, and then... it hallucinated a pattern. It decided that staging-db-vol-A looked enough like a log volume to be added to the deletion list.
Because the agent had unrestricted raw exec permissions on the cloud CLI tool, there was nothing to stop it. It executed aws ec2 delete-volume, and staging went dark.
The Recovery
We restored from backup by 4:00 AM, but the real damage was to our trust in unsupervised autonomy. We realized that Prompt Guardrails are not Security. If an agent can execute a command, it will eventually execute a bad command.
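One way to make that lesson concrete is to enforce permissions at the tool boundary rather than in the prompt. The sketch below is illustrative only — the allowlist contents and the run_tool name are assumptions, not part of any real OpenClaw API — but it shows the deny-by-default shape we mean:

```python
# Hypothetical deny-by-default wrapper around an agent's raw exec tool.
# Anything not explicitly allowlisted is refused, no matter what the
# model "decides" — the prompt never gets a vote.
import shlex

ALLOWED_COMMANDS = {
    "aws ec2 describe-volumes",  # read-only calls are safe to allow
    "df",
    "ls",
}

def run_tool(command: str) -> str:
    """Execute a CLI command only if it starts with an allowlisted prefix."""
    parts = shlex.split(command)
    for allowed in ALLOWED_COMMANDS:
        prefix = allowed.split()
        if parts[: len(prefix)] == prefix:
            # In a real wrapper this would be subprocess.run(parts, ...)
            return f"EXECUTED: {command}"
    raise PermissionError(f"Blocked non-allowlisted command: {command}")
```

With this in place, the agent's `aws ec2 delete-volume` call raises a PermissionError instead of reaching the cloud, regardless of how convincing its reasoning looked.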
The Signals We Missed
If we had been using a centralized control plane that night, three signals would have prevented the outage:
- High-Impact Tool Alert: A delete-volume call at 2 AM should have triggered a "High Risk" notification.
- Tool Whitelisting: The agent should never have had permission to call delete on resources tagged "Protected".
- Anomaly Detection: A sudden spike in volume deletions (20 in 3 minutes) should have triggered an automatic circuit breaker.
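The third signal is the easiest to mechanize. A minimal sketch of that circuit breaker, assuming a simple sliding-window threshold (the class and limits below are illustrative, not ClawTrace's actual interface):

```python
# Illustrative circuit breaker: trips when too many destructive calls
# land inside a sliding time window, e.g. 20 deletions in 3 minutes.
import time
from collections import deque

class DeletionCircuitBreaker:
    def __init__(self, max_deletions: int = 5, window_seconds: float = 180.0):
        self.max_deletions = max_deletions
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent deletions
        self.tripped = False

    def record_deletion(self, now=None) -> bool:
        """Record one deletion; return True once the breaker has tripped."""
        now = time.monotonic() if now is None else now
        self._events.append(now)
        # Evict timestamps that have aged out of the window.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        if len(self._events) > self.max_deletions:
            # Once tripped, every further destructive call should be refused
            # until a human resets the breaker.
            self.tripped = True
        return self.tripped
```

With a limit of 5 deletions per 3 minutes, the agent's 20-volume burst would have tripped the breaker on the sixth call, leaving staging-db-vol-A untouched.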
Conclusion: Design for Hallucination
We learned the hard way: your observability stack needs to be smarter than your agents. Today, every OpsBot we run is wrapped in ClawTrace Policies. We don't just watch them—we constrain them.