
On Causal Abstraction: Aunt Hillary, Borges' Map, and Implementing an Algorithm

April 14, 2026
Mechanistic Interpretability, AI Safety

Notes on a talk by Thomas Icard: what does it mean for a physical system to implement a high-level algorithm, and why does it matter for mechanistic interpretability?


I recently listened to a talk at UT's AI+Human Objectives Initiative (AHOI) by Stanford's Thomas Icard that sits between cognitive science and mechanistic interpretability. He was trying to answer a question that sounds simple but is really difficult to answer rigorously: What does it actually mean for a physical system to "implement" a high-level algorithm? When a biological brain (or a cluster of GPUs) makes a decision, how do we show mathematically that it is running a specific logical process, rather than just spitting out a correlated reflex?

Aunt Hillary and the Implementation Problem

To understand why this is hard, we have to look back at the early days of cognitive science. Philosophers like Hilary Putnam and John Searle pointed out that "implementation" becomes trivial if you aren't careful about which mappings you allow. If you map things cleverly enough, you can argue that the wind blowing against a brick wall is computing Microsoft Word.

Listening to Icard frame this, my brain instantly thought of Douglas Hofstadter's book, Gödel, Escher, Bach: An Eternal Golden Braid. In it, Hofstadter introduces a character named "Aunt Hillary" (not to be confused with philosopher Hilary Putnam), who happens to be a conscious ant colony.

Individual ants aren't smart. They just follow pheromone gradients (low-level physics). But the colony as a whole can hold conversations and express high-level thoughts. Icard is basically asking: How do we draw a mathematically rigorous line between the chaotic movement of the ants (matrix multiplications in a neural network) and the coherent thoughts of Aunt Hillary (the algorithm)?

Causal Abstraction and Borges' Useless Map

The solution Icard and his collaborators developed is a framework called Causal Abstraction.

He grounds this in a task called Relational Match-to-Sample (RMTS). Imagine an organism or a neural net is shown two pairs of objects. It has to decide if the relationship in Pair A (e.g., "they are the same shape") matches the relationship in Pair B. Algorithmically, this requires three high-level steps: evaluate Pair A, evaluate Pair B, and compare them. It is essentially a 2-bit XOR gate. (Technically the comparison is XNOR, since it's true when the two relations are equal, but that's just XOR's complement, so it's the same non-linearly-separable operation either way.)
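To make the three steps concrete, here is a minimal sketch of the high-level algorithm in Python. The function names (same, rmts) and the use of strings as "objects" are my own assumptions for illustration, not anything from the talk.

```python
def same(pair):
    """High-level variable: do the two objects in a pair match?"""
    a, b = pair
    return a == b


def rmts(pair_a, pair_b):
    """Relational match-to-sample: do both pairs show the same relation?"""
    rel_a = same(pair_a)    # step 1: evaluate Pair A
    rel_b = same(pair_b)    # step 2: evaluate Pair B
    return rel_a == rel_b   # step 3: compare the two bits (XNOR)


# Both pairs are "same", so the relations match.
print(rmts(("circle", "circle"), ("square", "square")))
# Both pairs are "different", so the relations still match.
print(rmts(("circle", "square"), ("star", "triangle")))
# One "same" pair and one "different" pair: no match.
print(rmts(("circle", "circle"), ("square", "triangle")))
```

The final comparison is just the negation of XOR on the two intermediate bits, which is why neither pair alone tells you the answer.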

But if you look at the weights of a trained neural network, you don't see an XOR gate. You see a bunch of floating-point numbers.

To bridge this gap, Icard introduces a "Commuting Diagram" relying on two mappings: τ (Tau) and ω (Omega).

τ maps the low-level states (the raw neuron activations) to the high-level states (the abstract concept of "sameness").

ω maps the low-level operations (the math) to the high-level operations (the XOR logic).

If a network actually implements the algorithm, the diagram must commute. That means both paths through the diagram agree: if you take the raw network, run it, and then apply τ to extract the final concept, you should get the exact same result as if you applied τ first to extract the initial concepts and then ran the high-level XOR algorithm yourself.
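Here is a toy sketch of that check, under heavy assumptions. The "low-level system" is a hand-written function on numbers rather than a trained network, τ is split into two hypothetical helpers (tau_in for input states and tau_out for output states), and ω is left implicit as the pairing of low_run with high_run. None of these names come from Icard's framework; they only illustrate what "both paths agree" means.

```python
from itertools import product


def low_run(x):
    """Toy low-level system: four numeric features in, one float out."""
    d_a = abs(x[0] - x[1])        # distance inside Pair A
    d_b = abs(x[2] - x[3])        # distance inside Pair B
    s_a = min(d_a, 1.0)           # 0 = same, 1 = different
    s_b = min(d_b, 1.0)
    return 1.0 - abs(s_a - s_b)   # 1.0 when the two relations agree


def reflex(x):
    """A shortcut system that only ever looks at Pair A."""
    return 1.0 if x[0] == x[1] else 0.0


def tau_in(x):
    """Map a low-level input to the two high-level 'sameness' bits."""
    return (x[0] == x[1], x[2] == x[3])


def tau_out(y):
    """Map a low-level output to the high-level answer."""
    return y > 0.5


def high_run(rel_a, rel_b):
    """High-level algorithm: compare the two relations."""
    return rel_a == rel_b


def commutes(low_fn):
    """Check that 'run low, then tau' equals 'tau, then run high' on every input."""
    inputs = product(range(3), repeat=4)  # objects encoded as 0, 1, 2
    return all(tau_out(low_fn(x)) == high_run(*tau_in(x)) for x in inputs)


print(commutes(low_run))  # True
print(commutes(reflex))   # False
```

The reflex function agrees with the algorithm on many inputs, but the diagram fails to commute as soon as Pair B differs in a way Pair A does not. The real framework also checks that intermediate states line up, which this sketch skips.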

This necessity of the τ mapping reminded me of Jorge Luis Borges' famous one-paragraph short story, On Exactitude in Science. Borges writes about an empire so obsessed with accurate cartography that they create a Map with a 1:1 scale of the Empire itself. The map covers the entire territory, and is therefore completely useless.

If we don't have τ to abstract the low-level neural activations into high-level causal variables, our only "map" of the neural network is just the neural network itself. Causal abstraction, in my mind, is basically how we fold Borges' map into something we can actually fit in our pockets.

Flatland and Abstraction Under Translation

Here is the coolest part of the lecture (imo).

Icard points out that when we look for these high-level causal variables (like the concept of "sameness"), they are almost never found in a single, standard-basis neuron. They are distributed across thousands of neurons.

To make the Commuting Diagram work (aka to actually find the XOR gate inside the matrix) you have to apply linear transformations. You have to rotate the vector space. Icard calls this Abstraction under Translation (referencing foundational work like Geiger et al.'s paper on Causal Abstraction and Paul Smolensky's 1986 work on Parallel Distributed Processing).

This made me think of Edwin A. Abbott's novella Flatland: A Romance of Many Dimensions (I know it's kinda random, but I swear I have a point lol). In the book, a two-dimensional square cannot comprehend a 3D sphere. When the sphere passes through Flatland, it just looks like a circle growing and shrinking. The Square can only understand the truth by being pulled out of his plane and viewing his world from a new dimensional axis.

That is exactly what we are doing with neural networks. If we just look at the raw activations (Flatland), the model's internals look like random noise. But if we rotate the axes of the vector space to the right basis, the hidden, high-level algorithm can suddenly come into view.
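A tiny two-neuron sketch of the rotation idea, with made-up numbers. I assume a binary variable s ("sameness") is written along an axis rotated by a hypothetical angle THETA, with an unrelated nuisance value n on the orthogonal axis. The angle is known here by construction; in a real network it would have to be searched for or learned.

```python
import math

THETA = math.radians(45)  # hypothetical encoding angle, chosen for illustration


def encode(s, n, theta=THETA):
    """Write s along a rotated axis and the nuisance n along the orthogonal one."""
    c, si = math.cos(theta), math.sin(theta)
    return (s * c - n * si, s * si + n * c)


def rotate_back(h, theta=THETA):
    """Rotate a 2-D activation vector by -theta, undoing the encoding."""
    c, si = math.cos(theta), math.sin(theta)
    return (h[0] * c + h[1] * si, -h[0] * si + h[1] * c)


for s in (0, 1):
    for n in (-2.0, 0.0, 2.0):
        h = encode(s, n)
        raw = h[0]                      # reading neuron 0 directly
        recovered = rotate_back(h)[0]   # reading the rotated axis
        print(f"s={s} n={n:+.1f} neuron0={raw:+.3f} rotated={recovered:+.3f}")
```

In the standard basis, neuron 0 mixes s with the nuisance, so no single threshold on it separates s=0 from s=1 across all the rows. After rotating back, the first coordinate equals s up to floating-point error.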

Xenolinguistics and the Path Forward

In my first post, I compared mechanistic interpretability to xenolinguistics and said it was like building translation manuals for an alien mind. Icard's framework of Causal Abstraction feels like the Rosetta Stone for that translation. It gives us a rigorous way to check whether these alien minds are running the same high-level algorithms we are, just encoded on axes we haven't learned how to look at yet.

If we can master how to systematically find these rotations (how to mathematically verify that a model's internal "thoughts" causally link to its outputs), we might actually have a shot at surviving the adolescence of technology.


Things I referenced