Agents misbehave in ways that are hard to detect and fix.
When a production failure is reported, you have to decide whether it's worth solving — which customers are affected, on what use cases, how often.
Digging too deep wastes time on one-offs, but skipping investigations risks customer fires.
Once you commit to a fix, the work gets tedious fast: combing through long traces, hand-crafting eval criteria, manually correlating between scattered signals.
The answer isn't more dashboards.
It's agents swarming over your production data to automate the path from detection to maintenance.

