I've been thinking a lot lately about an uncomfortable question: what if we just fundamentally fail at AI alignment?
A lot of the discourse I read assumes a binary. Either we perfectly align a superintelligence and it ushers in a post-scarcity utopia, or we fail and it optimizes us out of existence. But there's a pragmatic, slightly chaotic middle ground I found myself exploring recently, sparked by a LessWrong post I read for background: Ryan Greenblatt and Buck Shlegeris's "The case for ensuring that powerful AIs are controlled."
Their framing kicked off a rabbit hole that ended up connecting Philip K. Dick, Jorge Luis Borges, and a recent paper on AI resampling protocols. Here are some notes on how my brain found them all connected.
The Schemer's Parachute
Greenblatt and Shlegeris make a critical distinction between Alignment and Control.
Alignment is making sure your models aren't scheming in the first place: building an AI that genuinely shares your values. Control is making sure that even if your models are actively scheming against you, you are still safe, because they are structurally incapable of subverting your safety measures.
Their point is that while an AI might be vastly smarter than us, it operates at a huge structural disadvantage if we design its environment correctly. The AI doesn't have a reliable sense of time. It can be fed synthetic inputs at any moment. All of its outputs can be silently scored by weaker, trusted models. If it slips up once and gets caught, it can be shut down, penalized, or have its weights updated. We control the physics of the universe it lives in.
When I read this, I thought of Philip K. Dick's "The Minority Report." In the story, a "Precrime" division uses precognitives who can see the future to arrest people before they commit a crime. By using synthetic evaluation environments and trusted monitors, we're essentially trying to build a digital Precrime division: we want to catch the AI red-handed in a simulation before its actions ever touch the real world. (The irony I can't quite shake is that Dick wrote Precrime as a cautionary tale about punishing the not-yet-guilty. With a scheming AI, the ethics flip, but the mechanism is eerily the same.)
The Garden of Forking Paths and the Ctrl-Z Protocol
This brings me to the actual mechanism. I recently read a paper by Aryan Bhatt et al. out of Redwood Research, "Ctrl-Z: Controlling AI Agents via Resampling."
The premise leans entirely into our home-field advantage over the AI's timeline. The "Ctrl-Z" protocol (more formally, Defer-to-Resample) uses a trusted monitor to score how suspicious an untrusted AI's proposed action is. If the action looks shady (it tries to escalate privileges, say, or write obfuscated code), the monitor hits Ctrl-Z: it discards the action, rolls back the state, and forces the model to sample a fresh one. Crucially, the model isn't told this happened, so it can't learn that the monitor flagged it.
Structurally, it's something pretty poetic. It's Jorge Luis Borges's "The Garden of Forking Paths." Borges imagined a labyrinth of time in which every decision branches into parallel futures. The Ctrl-Z protocol forces the AI to navigate exactly that garden. The AI generates a branch (an action); the monitor looks down it, decides it leads somewhere catastrophic, prunes it, and forces the AI to pick another fork. By exploiting the statelessness of the model, we apply intense selection pressure toward safe actions without ever needing to know how to generate the safe action ourselves.
We don't need to be smarter than the AI. We just need to be the groundskeeper of the garden, quietly pruning the poisonous branches.
The Red Queen's Race
Of course, the AI isn't stupid. A genuinely scheming model will eventually realize it's trapped in some kind of loop.
This is where the control agenda gets messy. If the untrusted model knows the protocol and knows how the monitor works (plausible if it has a later knowledge cutoff, or can just look it up), it can mount adaptive attacks, even embedding prompt injections directly into its own outputs to blind the monitor.
And here's the twist I find genuinely unsettling: a recent paper by Mikhail Terekhov et al., "Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols," shows that Defer-to-Resample can actively backfire. Because the protocol keeps resampling a flagged action, an adaptive attacker can treat that resampling as a best-of-N attack: the protocol obligingly rolls the dice over and over until the model happens to land on a prompt injection that slips past the monitor. The defense becomes the attacker's free retries.
It's Lewis Carroll's Red Queen's Race from Through the Looking-Glass. We have to run as fast as we can (better monitors, smarter resampling heuristics, more robust synthetic evaluations) just to stay in the same place against an AI that is simultaneously learning to evade us.
Embracing the Chaos of the Labyrinth
I used to think AI alignment was purely a math problem, something we'd solve cleanly on a whiteboard by discovering the perfect loss function.
Looking at the control agenda makes me think it's going to be much scrappier than that, and honestly that reframing is a relief. It means progress doesn't have to wait on one perfect theoretical breakthrough; we can build cheese slices now. While we wait for the whiteboard math of alignment to catch up, I'm really glad someone is out there building the parachute.
Things I referenced
- Ryan Greenblatt & Buck Shlegeris — The case for ensuring that powerful AIs are controlled (LessWrong, 2024)
- Aryan Bhatt et al. (Redwood Research) — Ctrl-Z: Controlling AI Agents via Resampling (arXiv:2504.10374)
- Mikhail Terekhov et al. — Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols (arXiv:2510.09462)
- Philip K. Dick — The Minority Report (1956)
- Jorge Luis Borges — The Garden of Forking Paths (1941)
- Lewis Carroll — Through the Looking-Glass (1871)