Prompts hope. Policy enforces.
You give an AI agent full access to your codebase — file reads, writes, shell commands, git operations. You write careful instructions. But you can't watch every tool call across a 100-turn session. Neither can your team, across dozens of repos and hundreds of sessions a week.
The failure mode isn't dramatic. Agents don't crash — they degrade. They quietly widen scope, re-read files they just read, guess at CLI flags that don't exist, loop on the same failing approach, and fill the context window with noise. By turn 40, the session is burning tokens and producing nothing useful. By turn 80, you're starting over.
The cost isn't just wasted compute. It's broken commits that look right but aren't. It's senior engineers reviewing AI-generated changes that a deterministic layer should have caught. It's the gap between what agents could do and what they actually deliver without guardrails.
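What would that deterministic layer look like? A minimal sketch, in Python, of a policy gate that sits between the agent and its tools and checks each call against fixed rules. All names here (`ToolCall`, the allow-list and deny-list) are hypothetical, chosen for illustration; a real policy would be richer and likely default-deny.

```python
# Hypothetical sketch of a deterministic policy gate for agent tool calls.
# ToolCall, ALLOWED_WRITE_PREFIXES, and BLOCKED_SHELL_SUBSTRINGS are
# illustrative names, not part of any real agent framework.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str    # e.g. "write_file", "shell"
    target: str  # the path or command the agent wants to touch


# Assumed repo scope: the agent may only write under these prefixes.
ALLOWED_WRITE_PREFIXES = ("src/", "tests/")

# Assumed deny-list of obviously destructive shell fragments.
BLOCKED_SHELL_SUBSTRINGS = ("rm -rf", "git push --force")


def check(call: ToolCall) -> bool:
    """Return True if the call passes policy; False means block it."""
    if call.tool == "write_file":
        return call.target.startswith(ALLOWED_WRITE_PREFIXES)
    if call.tool == "shell":
        return not any(s in call.target for s in BLOCKED_SHELL_SUBSTRINGS)
    # Reads and other tools pass by default here; a production policy
    # would more likely default-deny and enumerate what is allowed.
    return True


print(check(ToolCall("write_file", "src/app.py")))   # True
print(check(ToolCall("shell", "rm -rf /")))          # False
```

The point isn't the specific rules — it's that this check runs the same way on turn 1 and turn 100, which no amount of careful prompting can guarantee.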