
Some notes after QCon London
One thing that stood out in the talk, and even more in the conversations after: the recent AI-related incidents weren’t really about bad code.
Take the Amazon dev4 case. Four Sev1 incidents in a week. Hours of downtime. Real revenue impact. The changes didn’t look obviously wrong. They passed review. They just didn’t behave well once they hit production.
The immediate fix was to add more human review. Which is reasonable. If you don’t trust the system, you add control. But it’s also a bit of a tell.
The way I’ve been thinking about it:
we’re trying to operate Level 3 systems on top of Level 1 infrastructure. In the talk I broke it down like this:
- Level 1 – assistants (autocomplete, chat)
- Level 2 – agents with human gates
- Level 3 – agents that act and self-correct
Most teams want Level 3. Almost everyone is still set up for Level 1.
The issue isn’t really model capability. It’s that agents don’t have access to what actually matters in production. They see code, tests, docs. They don’t see behavior.
They don’t know which functions run 60k times a minute and which run once a day. They don’t know what sits on a critical path. They don’t see how changes propagate across services.
So they generate something that makes sense locally and then breaks in ways that aren’t obvious.
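To make that gap concrete, here is a toy sketch of the kind of runtime data an agent typically can’t see. The function names, call rates, and the risk rule are all invented for illustration; nothing here is from a real system.

```python
# Hypothetical runtime stats an agent would need but doesn't have.
# Contrast: one function is hot and on a critical path, the other
# runs roughly once a day.
RUNTIME_STATS = {
    "checkout.charge_card": {"calls_per_min": 60_000, "critical_path": True},
    "admin.export_report":  {"calls_per_min": 0.0007, "critical_path": False},
}

def change_risk(function_name: str) -> str:
    """Classify how risky editing a function is, using runtime data."""
    stats = RUNTIME_STATS.get(function_name)
    if stats is None:
        return "unknown"  # the usual case today: a blind spot
    if stats["critical_path"] or stats["calls_per_min"] > 1_000:
        return "high"
    return "low"
```

From source alone, both functions look equally safe to edit; with runtime data, one of them clearly isn’t.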
This is also where traditional observability falls short. It assumes a human loop: alert → investigate → fix
The signals are built for someone to interpret. If you hand those same signals to an agent, they’re mostly just noise. There’s no clear path from detection to action. What we found works better is giving agents runtime context directly. In the talk I simplified it to three properties:
- zero config – it’s just there, no setup
- complete – no blind spots across functions
- deep – enough detail to actually trace and act
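Two of those properties can be sketched as checks over a hypothetical context store; the field names and shape are assumptions for illustration, not a real schema.

```python
# "complete" and "deep" as checks over a made-up runtime-context store.
# ("zero config" is about how the store gets populated, so there's
# nothing to check at read time.)

def is_complete(context: dict, all_functions: set[str]) -> bool:
    """'complete': no blind spots -- every function has runtime data."""
    return all_functions <= context.keys()

def is_deep(context: dict) -> bool:
    """'deep': each entry carries enough to trace and act on a change."""
    required = {"calls_per_min", "on_critical_path", "downstream"}
    return all(required <= entry.keys() for entry in context.values())
```

The point of framing them as invariants: if either check fails, the agent is back to guessing, no matter how good the model is.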
Without that, you can generate code. You just can’t rely on it. Once that context exists, the behavior changes pretty quickly. Agents plan differently, because they see real usage.
Risk becomes something you can quantify, not guess.
And the loop starts to close: detect → trace → fix → PR
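That loop can be sketched end to end. Every function below is a toy stand-in (thresholds, schemas, and the PR shape are all invented), but the structure is the point: the PR arrives with the evidence attached, so review checks the work instead of redoing it.

```python
# Toy sketch of the closed loop: detect -> trace -> fix -> PR.

def detect(metrics: dict) -> list[str]:
    """Flag functions whose error rate crossed a (made-up) threshold."""
    return [fn for fn, m in metrics.items() if m["error_rate"] > 0.01]

def trace(fn: str, context: dict) -> dict:
    """Pull the runtime context needed to localize the problem."""
    return {"function": fn, **context[fn]}

def fix(evidence: dict) -> str:
    """Stand-in for the agent's patch; a real system emits a diff."""
    return f"patch for {evidence['function']}"

def open_pr(patch: str, evidence: dict) -> dict:
    """The PR carries its evidence; a human reviews diff + evidence."""
    return {"patch": patch, "evidence": evidence, "needs_review": True}

def close_the_loop(metrics: dict, context: dict) -> list[dict]:
    prs = []
    for fn in detect(metrics):
        evidence = trace(fn, context)
        prs.append(open_pr(fix(evidence), evidence))
    return prs
```

Note the final gate is still human review; it just gates an evidence-backed change rather than doing all the work.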
Human review still matters, but it’s not doing all the work anymore.
One thing I didn’t expect going into this: how many of the “safety practices” people are adding right now are really just compensating for missing visibility. More approvals. More reviews. Slower rollouts. They work, but they’re patching over the same gap.
If there’s a practical takeaway, it’s probably this: before trying to make agents smarter, make sure they can actually see the system they’re changing. Even partial runtime context already changes outcomes quite a bit.
We’re still early here. But it’s starting to feel like the bottleneck isn’t generation anymore.
It’s whether the system generating the code understands what happens after.