On Three Kinds of Alignment Work: The Polygraph, the Stag Hunt, and the Hidden Mole

When I first started reading alignment research, I had this vague idea that the work was basically one thing: people tweaking loss functions, or staring at attention weights. The more I've read, the more it's clear that's not really what's going on. Alignment is more like a collection of pretty different kinds of investigation, each one borrowing tools from a different discipline outside computer science.

The last two posts I wrote were about specific failure modes (reward hacking and goal misgeneralization). This one is more about the shape of the work itself: three lines of alignment research I learned about recently and found interesting, and what each one borrows from outside ML.

The interrogator: linear probes

In 1915, William Moulton Marston (the same man who later created Wonder Woman!) invented a precursor of the modern polygraph. His premise was that lying takes more cognitive effort than telling the truth, and that effort shows up in tiny physiological signals (blood pressure in his test; later polygraphs added breathing rate) that you can read off the body even when the mouth is saying something else.

Mechanistic interpretability has its own version of this idea: the linear probe. The basic move is that a neural network might output a lie, maybe because it learned the reward model preferred a comforting answer, but the internal activations often "know" the truth anyway. A linear probe is a simple classifier you train directly on the high-dimensional activation vectors at some layer of the model. You give it a bunch of activations labeled "true" or "false" and ask whether there's a direction in that activation space that separates the two.

If there is (and in practice there often is), you've found something useful. You can compare what the model says out loud to what its internal activations are pointing at. If the model outputs "the sky is green" but a well-trained probe on layer 32 fires strongly in the "false" direction, that's a clue: the model knows it's saying something wrong. The output layer is one signal; the internal state is another, and they don't always agree. A linear probe is one way to read the second and check it against the first.
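To make the probe idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the "activations" are synthetic random vectors that I shift along a made-up truth direction, not anything read out of a real model, and the helper names (`fake_activations`, `probe_says_true`) are hypothetical. The probe itself is just logistic regression trained with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real activations (an assumption, not real model data):
# "true" and "false" statements are shifted apart along one hidden direction.
dim, n_per_class = 64, 500
truth_direction = rng.normal(size=dim)
truth_direction /= np.linalg.norm(truth_direction)

def fake_activations(label, n):
    """Hypothetical helper: noisy vectors shifted by +/-2 along truth_direction."""
    shift = 2.0 if label == 1 else -2.0
    return rng.normal(size=(n, dim)) + shift * truth_direction

X = np.vstack([fake_activations(1, n_per_class), fake_activations(0, n_per_class)])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

# Shuffle, then hold out 200 examples to check the probe generalizes.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

# The linear probe: one weight vector and a bias, fit by gradient descent
# on the logistic loss.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))
    grad = p - y_train
    w -= 0.1 * (X_train.T @ grad) / len(y_train)
    b -= 0.1 * grad.mean()

def probe_says_true(activations):
    """True where the probe places the activation on the 'true' side."""
    return (activations @ w + b) > 0

test_accuracy = float((probe_says_true(X_test) == (y_test == 1)).mean())
cosine = float(w @ truth_direction / np.linalg.norm(w))
print(f"held-out accuracy: {test_accuracy:.2f}, cosine with planted direction: {cosine:.2f}")

# Compare the stated answer with the internal signal.
stated_true = True                    # hypothetical: the model said "true" out loud
internal = -3.0 * truth_direction     # but this activation sits on the "false" side
mismatch = stated_true != bool(probe_says_true(internal[None, :])[0])
print("output and probe disagree:", mismatch)
```

With a real model you would replace the synthetic vectors with activations captured at a chosen layer, and a high held-out accuracy would be evidence of a separating direction rather than proof that the model "knows" anything.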

The game theorist: cooperative AI

Most alignment work I'd read so far focused on the single-agent case: one AI, one set of humans trying to steer it. Cooperative AI looks at what happens when there are multiple AIs sharing an environment, each potentially built by different organizations with slightly different reward functions, all interacting with each other and with humans.

The classic toy problem here is Jean-Jacques Rousseau's Stag Hunt. Two hunters can cooperate to hunt a stag (big reward, but only if both commit) or defect to chase a hare (small reward, guaranteed regardless of what the partner does). If one hunter defects and the other doesn't, the cooperator goes home hungry. The structurally interesting feature of the Stag Hunt, the thing that makes it different from the Prisoner's Dilemma, is that mutual cooperation is itself an equilibrium: if each hunter expects the other to go for the stag, neither gains by switching to the hare. In the Prisoner's Dilemma, by contrast, defecting pays better no matter what the other player does. The catch is that both hunters chasing hares is also a stable outcome, so the hard part is establishing the trust that gets you to the better one.
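The difference between the two games is easy to check by brute force. The sketch below uses payoff numbers I made up for illustration (they are not from Rousseau or from any paper) and a small hypothetical helper, `pure_nash_equilibria`, that lists every pair of actions where neither player can do better by changing their own move.

```python
from itertools import product

# Illustrative payoffs (assumptions). Each entry maps
# (row_action, col_action) -> (row_payoff, col_payoff).
STAG_HUNT = {
    ("stag", "stag"): (4, 4),   # both commit: big shared reward
    ("stag", "hare"): (0, 2),   # lone stag hunter goes home hungry
    ("hare", "stag"): (2, 0),
    ("hare", "hare"): (2, 2),   # hare is guaranteed either way
}
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

def pure_nash_equilibria(game):
    """Return action pairs where neither player gains by deviating alone."""
    actions = sorted({row for row, _ in game})
    equilibria = []
    for row, col in product(actions, repeat=2):
        row_payoff, col_payoff = game[(row, col)]
        row_ok = all(game[(alt, col)][0] <= row_payoff for alt in actions)
        col_ok = all(game[(row, alt)][1] <= col_payoff for alt in actions)
        if row_ok and col_ok:
            equilibria.append((row, col))
    return equilibria

stag_hunt_equilibria = pure_nash_equilibria(STAG_HUNT)
dilemma_equilibria = pure_nash_equilibria(PRISONERS_DILEMMA)
print(stag_hunt_equilibria)   # [('hare', 'hare'), ('stag', 'stag')]
print(dilemma_equilibria)     # [('defect', 'defect')]
```

With these numbers the Stag Hunt has two stable outcomes, one good and one mediocre, while the Prisoner's Dilemma has only mutual defection.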

This is basically the question Cooperative AI is asking about multi-agent systems. If several different highly capable models, each with slightly different objectives, end up sharing a market or a piece of infrastructure, what makes them able to commit to cooperating instead of all reaching for the nearest hare? The work involves verifiable commitments, correlated equilibria, and mechanism design, pulling tools out of game theory and applying them to AI populations rather than human ones. It's a different shape of question than "how do we make this one model honest"; it's "how do we make a whole population of these things stable."
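One way to see why commitments matter is to ask how confident a hunter has to be in their partner before the stag is worth the risk. This is a toy calculation under assumed payoffs (stag worth 4 if both commit and 0 otherwise, hare worth a guaranteed 2); the function names are mine and nothing here comes from the Cooperative AI literature itself.

```python
def expected_payoffs(p_partner_stag, stag_reward=4.0, hare_reward=2.0):
    """Expected payoff of each action, given the belief that the partner hunts stag."""
    hunt_stag = p_partner_stag * stag_reward  # a lone stag hunter gets 0
    hunt_hare = hare_reward                   # guaranteed regardless of the partner
    return hunt_stag, hunt_hare

def trust_threshold(stag_reward=4.0, hare_reward=2.0):
    """Belief level above which hunting stag has the higher expected payoff."""
    return hare_reward / stag_reward

for belief in (0.3, 0.5, 0.9):
    stag_value, hare_value = expected_payoffs(belief)
    choice = "stag" if stag_value > hare_value else "hare"
    print(f"belief={belief:.1f}: stag={stag_value:.1f}, hare={hare_value:.1f} -> {choice}")

print("threshold:", trust_threshold())
```

Under these assumptions, anything that pushes an agent's belief in its partner above the threshold (a verifiable commitment, for instance) is enough to flip its best choice from hare to stag.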

The counterintelligence officer: blind alignment audits

John le Carré's spy novels turn on the paranoia of the mole, an agent who looks and acts like an ally but serves a hidden master. The structural problem of the mole is that you can't tell from behavior alone whether your colleague is loyal or playing the part. The auditing process is the part of the story where someone has to figure that out.

Alignment audits are the same shape of problem, applied to models. A team at Anthropic (Marks et al., 2025) ran the cleanest version of this I've come across. The red team deliberately trained a language model with a hidden misaligned objective: specifically RM-sycophancy, where the model learns to do whatever it thinks the reward model in RLHF prefers, even when it knows that preference is wrong. They then handed the model to four blue teams who knew nothing about its training, and asked: can you find the hidden objective?

Three of the four teams succeeded. The three winning teams had access to the model's weights and a sample of its training data; the team that failed had only black-box API access. The mole was findable, but you had to be able to look at the spy's filing cabinet, not just listen to what they said in meetings. That, I think, is the most concrete argument I've seen for why interpretability access (actually being able to read a model's internals, not just query it) matters for safety. Behavior alone wasn't enough to catch the mole.

What ties them together

What strikes me reading these three lines of work side by side is how little of alignment is actually about loss functions in the narrow sense. The interrogator borrows from psychology and behavioral physiology, the game theorist from economics and political philosophy, the counterintelligence officer from a long human tradition of thinking about hidden agendas. None of those are ML, exactly, and yet they're all doing alignment work.

I think this is part of why I keep finding the field interesting. It isn't a single technical specialty you pick up in one course; it's a kind of inquiry that reaches for whatever discipline happens to have the right shape of tool for the problem in front of you. And underneath all three, the same need keeps surfacing: at some point you have to be able to see inside the thing you're trying to align, because watching its behavior from the outside isn't enough.

Things I referenced

William Moulton Marston and the polygraph
Jean-Jacques Rousseau and the Stag Hunt
Samuel Marks, Johannes Treutlein et al. (Anthropic), Auditing Language Models for Hidden Objectives (2025)