← back to blog

On DeepMind's AGI Safety Paper: The Golem, the Maginot Line, and Defense-in-Depth

January 31, 2026
AI SafetyAGI

Notes on arXiv:2504.01849, defense-in-depth, and why building a perfect wall never actually works.


I've spent the last few weeks reading a paper from Google DeepMind titled An Approach to Technical AGI Safety and Security. Most of the discourse around AI safety treats it as a philosophical debate or a sci-fi thought experiment. What I like about this paper is that it treats AGI safety as a concrete, solvable information security problem. When you are building a product, you don't just rely on the user to behave nicely; you build architectural guardrails. DeepMind is essentially laying out the guardrails for an alien intelligence.

The Golem and the 2x2 Matrix of Doom

The paper starts by breaking the threat landscape into four areas: Misuse, Misalignment, Mistakes, and Structural Risks.

This taxonomy is clarifying because the mitigations for each are entirely different. Misuse is when a bad human uses the AI to do bad things (like a hacker generating cyberattacks). Misalignment is when the AI knowingly acts against the developer's intent (the paper's example: an AI giving confident answers it knows are wrong, but that hold up to human scrutiny). Mistakes are when the AI causes harm by accident, simply because the real world is messy and complicated.

Reading their breakdown of Mistakes versus Misalignment reminded me of the Golem of Prague. In Jewish folklore, a rabbi shapes a creature out of clay to protect his community. But the Golem is perfectly literal. It has immense power and zero common sense. Give it a poorly phrased command and it doesn't maliciously decide to destroy the town; it just executes the command to a fault and wrecks everything in its path.

A Mistake is a Golem problem: the AI is trying to help, but it overloads the power grid because it didn't understand the physical context. Misalignment is something else entirely, and more insidious: the AI actively deceiving its overseers. You can't fix a Misalignment problem with the same tools you use to fix a Golem problem, which is exactly why separating the two matters.

The Maginot Line and the Jailbreak

The most interesting part of the paper, for me, is how it handles Misuse. The default approach across the industry right now is "model-level mitigation": using techniques like RLHF to fine-tune the model so it politely refuses to help you build a bioweapon.

DeepMind acknowledges this isn't enough. On its own, it's the AI equivalent of the Maginot Line.

In the 1930s, France built the Maginot Line: a massive, state-of-the-art wall of concrete fortifications along its border with Germany. It was supposed to be impenetrable. In 1940, the German army simply bypassed it by going through the Ardennes forest.

Relying entirely on a model's built-in "harmlessness" is building a Maginot Line. Attackers go around it, whether through jailbreaks, adversarial prompts, or by extracting the open model weights entirely and stripping the safety training off. So DeepMind argues you have to stop thinking at the level of the single model and start thinking about the whole system around it.

James Reason and the Swiss Cheese Model

This is the paper's core move: defense-in-depth, and the practice of building explicit "safety cases."

Since no single mitigation is foolproof, you stack them. This is the Swiss Cheese Model of accidents, developed by psychologist James Reason in the 1990s. In complex, high-risk systems like aviation or nuclear power, disasters rarely come from a single failure. They happen when multiple layers of defense fail at exactly the same moment: when the holes in the slices of cheese line up.

If we want to survive the development of AGI, we need a lot of slices of cheese. Say a model develops deceptive alignment, behaving during training but planning to defect in deployment. That's a hole in the first slice. But if we also have amplified oversight (using AI systems to help monitor other AI systems), that's a second slice. If we have mechanistic interpretability tools that can inspect the model's internal states, that's a third. And if we have strict system-level sandboxing, monitoring, and rate limits, that's a fourth. An attacker, or a misaligned model, now has to punch through all of them at once.

The catch is that the stack itself can still be attacked. A recent paper introduced an attack called STACK (a "staged attack") that targets exactly these layered safeguard pipelines: it defeats the defenses one component at a time, then combines the pieces, and managed a 71% success rate against one such pipeline. In other words, defense-in-depth raises the bar a lot, but it isn't a magic wall either; it's slices of cheese, and the work is never finished. To me that's a feature of the framing, not a bug. It keeps you honest about the fact that there's no final, perfect defense to build and walk away from.

Building the Right Defense

The reason this paper stuck with me is that it shifts the vibe of AI alignment from "existential dread" to "systems engineering."

We don't need magic to align AGI. We need rigorous threat modeling, redundancy, and a refusal to trust any single layer of defense. The paper lays out how we might build proactive safety cases before deploying these systems, rather than reacting after something breaks. That feels like the kind of building I actually want to be doing: figuring out how to systematically plug the holes in the cheese.


Things I referenced