
Some notes after QCon London
One thing that stood out in the talk, and even more in the conversations after: the recent AI-related incidents weren’t really about bad code.
Take the Amazon dev4 case. Four Sev1 incidents in a week. Hours of downtime. Real revenue impact. The changes didn’t look obviously wrong. They passed review. They just didn’t behave well once they hit production.
The immediate fix was to add more human review. Which is reasonable. If you don’t trust the system, you add control. But it’s also a bit of a tell.
The way I’ve been thinking about it:
we’re trying to operate Level 3 systems on top of Level 1 infrastructure. In the talk I broke it down like this:
- Level 1 – assistants (autocomplete, chat)
- Level 2 – agents with human gates
- Level 3 – agents that act and self-correct
Most teams want Level 3. Almost everyone is still set up for Level 1.
The issue isn’t really model capability. It’s that agents don’t have access to what actually matters in production. They see code, tests, docs. They don’t see behavior.
They don’t know which functions run 60k times a minute and which run once a day. They don’t know what sits on a critical path. They don’t see how changes propagate across services.
So they generate something that makes sense locally and then breaks in ways that aren’t obvious.
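To make that gap concrete, here is a toy sketch of the kind of runtime data an agent typically can’t see. The function names, call rates, and the risk rule are all invented for illustration; nothing here is from a real system.

```python
# Hypothetical runtime stats an agent would need but doesn't have.
# Contrast: one function is hot and on a critical path, the other
# runs roughly once a day.
RUNTIME_STATS = {
    "checkout.charge_card": {"calls_per_min": 60_000, "critical_path": True},
    "admin.export_report":  {"calls_per_min": 0.0007, "critical_path": False},
}

def change_risk(function_name: str) -> str:
    """Classify how risky editing a function is, using runtime data."""
    stats = RUNTIME_STATS.get(function_name)
    if stats is None:
        return "unknown"  # the usual case today: a blind spot
    if stats["critical_path"] or stats["calls_per_min"] > 1_000:
        return "high"
    return "low"
```

From source alone, both functions look equally safe to edit; with runtime data, one of them clearly isn’t.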
This is also where traditional observability falls short. It assumes a human loop: alert → investigate → fix
The signals are built for someone to interpret. If you hand those same signals to an agent, they’re mostly just noise. There’s no clear path from detection to action. What we found works better is giving agents runtime context directly. In the talk I simplified it to three properties:
- zero config – it’s just there, no setup
- complete – no blind spots across functions
- deep – enough detail to actually trace and act
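Two of those properties can be sketched as checks over a hypothetical context store; the field names and shape are assumptions for illustration, not a real schema.

```python
# "complete" and "deep" as checks over a made-up runtime-context store.
# ("zero config" is about how the store gets populated, so there's
# nothing to check at read time.)

def is_complete(context: dict, all_functions: set[str]) -> bool:
    """'complete': no blind spots -- every function has runtime data."""
    return all_functions <= context.keys()

def is_deep(context: dict) -> bool:
    """'deep': each entry carries enough to trace and act on a change."""
    required = {"calls_per_min", "on_critical_path", "downstream"}
    return all(required <= entry.keys() for entry in context.values())
```

The point of framing them as invariants: if either check fails, the agent is back to guessing, no matter how good the model is.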
Without that, you can generate code. You just can’t rely on it. Once that context exists, the behavior changes pretty quickly. Agents plan differently, because they see real usage.
Risk becomes something you can quantify, not guess.
And the loop starts to close: detect → trace → fix → PR
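That loop can be sketched end to end. Every function below is a toy stand-in (thresholds, schemas, and the PR shape are all invented), but the structure is the point: the PR arrives with the evidence attached, so review checks the work instead of redoing it.

```python
# Toy sketch of the closed loop: detect -> trace -> fix -> PR.

def detect(metrics: dict) -> list[str]:
    """Flag functions whose error rate crossed a (made-up) threshold."""
    return [fn for fn, m in metrics.items() if m["error_rate"] > 0.01]

def trace(fn: str, context: dict) -> dict:
    """Pull the runtime context needed to localize the problem."""
    return {"function": fn, **context[fn]}

def fix(evidence: dict) -> str:
    """Stand-in for the agent's patch; a real system emits a diff."""
    return f"patch for {evidence['function']}"

def open_pr(patch: str, evidence: dict) -> dict:
    """The PR carries its evidence; a human reviews diff + evidence."""
    return {"patch": patch, "evidence": evidence, "needs_review": True}

def close_the_loop(metrics: dict, context: dict) -> list[dict]:
    prs = []
    for fn in detect(metrics):
        evidence = trace(fn, context)
        prs.append(open_pr(fix(evidence), evidence))
    return prs
```

Note the final gate is still human review; it just gates an evidence-backed change rather than doing all the work.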
Human review still matters, but it’s not doing all the work anymore.
One thing I didn’t expect going into this: how many of the “safety practices” people are adding right now are really just compensating for missing visibility. More approvals. More reviews. Slower rollouts. They work, but they’re patching over the same gap.
If there’s a practical takeaway, it’s probably this: before trying to make agents smarter, make sure they can actually see the system they’re changing. Even partial runtime context already changes outcomes quite a bit.
We’re still early here. But it’s starting to feel like the bottleneck isn’t generation anymore.
It’s whether the system generating the code understands what happens after.