Paper deep dive
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
Models: Pythia (multiple sizes)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 6:48:10 PM
Summary
The paper introduces the 'non-linear representation dilemma,' arguing that causal abstraction, a popular method for mechanistic interpretability, becomes vacuous when alignment maps are not restricted to linear functions. The authors prove that arbitrarily powerful non-linear maps can align any neural network to any algorithm, rendering the abstraction trivial. They provide empirical evidence showing that even randomly initialized models can achieve perfect interchange-intervention accuracy using non-linear alignment maps, suggesting that causal abstraction requires strong prior assumptions about information encoding to be meaningful.
Entities (5)
Relation Signals (3)
Distributed Alignment Search → is used to evaluate → Causal Abstraction
confidence 95% · How do we evaluate if an algorithm is an input-restricted distributed abstraction of a DNN? Geiger et al. (2024b) propose an efficient method to answer this, called distributed alignment search (DAS).
Causal Abstraction → suffers from → Non-linear Representation Dilemma
confidence 95% · This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses...
Linear Representation Hypothesis → motivates → Causal Abstraction
confidence 90% · Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis
Cypher Suggestions (2)
Identify the impact of non-linear maps on causal abstraction. · confidence 95% · unvalidated
MATCH (a:Methodology {name: 'Causal Abstraction'})-[r:SUFFERS_FROM]->(d:Concept {name: 'Non-linear Representation Dilemma'}) RETURN r
Find all methodologies related to mechanistic interpretability. · confidence 90% · unvalidated
MATCH (e:Entity)-[:RELATED_TO]->(m:Methodology {name: 'Mechanistic Interpretability'}) RETURN e
Abstract
The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
Tags
Links
Full Text
146,948 characters extracted from source content.
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
ETH Zürich, EPFL
densutter@ethz.ch, julian.minder@epfl.ch, {thomas.hofmann, tiago.pimentel}@inf.ethz.ch
densutter/non-linear-representation-dilemma

Abstract

The concept of causal abstraction got recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function which allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., on an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.

1 Introduction

The increasing popularity of machine learning (ML) models has led to a surge in their deployment across various industries. However, the lack of interpretability in these models raises significant concerns, particularly in high-stakes applications where understanding the decision-making process is crucial (Goodman and Flaxman, 2017; Tonekaboni et al., 2019; Zhu et al., 2020; Gao and Guan, 2023). Unsurprisingly, this opacity has motivated a multitude of research on mechanistic (or causal) interpretability, which tries to analyse and understand the hidden algorithms that underlie these models (Olah et al., 2020; Elhage et al., 2021; Mueller et al., 2024; Ferrando et al., 2024; Sharkey et al., 2025). A promising approach to address this challenge is causal abstraction (Beckers and Halpern, 2019; Geiger et al., 2024a), which tries to map the behaviour of a model to a higher-level (and conceptually simpler) algorithm which solves the task. At the core of this concept is the idea that if an intervention is found to change a model's behaviour in a way that aligns with a specific algorithm, then that algorithm can be considered implemented by the model. Recent research, however, has raised considerable issues with this approach (e.g., Makelov et al., 2024; Mueller, 2024; Sun et al., 2025). Among those, Méloux et al.
(2025) note that a model's causal abstraction is not necessarily unique, showing that many algorithms can be aligned to the same neural network. Additionally, most work on causal abstraction (Wu et al., 2023; Geiger et al., 2024b; Minder et al., 2025; Sun et al., 2025) implicitly assumes information is linearly encoded in models' representations, relying on the linear representation hypothesis (Alain and Bengio, 2016; Bolukbasi et al., 2016). Linearity, however, is not required by the definition of causal abstraction (Beckers and Halpern, 2019), and increasing evidence suggests that not all representations may be linearly encoded (White et al., 2021; Olah and Jermyn, 2024; Mueller, 2024; Csordás et al., 2024; Engels et al., 2025a,b; Kantamneni and Tegmark, 2025).

39th Conference on Neural Information Processing Systems (NeurIPS 2025). arXiv:2507.08802v2 [cs.LG]

[Figure 1: A visualisation of what happens when analysing causal abstractions with increasingly complex alignment maps $\phi$. The more complex $\phi$ is, the higher the intervention accuracy—and, consequently, the stronger the algorithm–DNN alignment. In Theorem 1, we show that given arbitrarily complex alignment maps, we can always find a perfect alignment (under reasonable assumptions).]

In this paper, we first prove that, once we drop the linearity constraint, any model can be perfectly mapped to any algorithm under relatively weak assumptions—e.g., hidden activations' input-injectivity and output-surjectivity, which we will define formally. This renders causal abstraction vacuous when used without constraints. If we restrict alignment maps to only consider, e.g., linear functions, this problem does not arise though. It follows that causal abstraction implicitly relies on strong assumptions about how features are encoded in deep neural networks (DNNs), and becomes trivial without such assumptions. This puts us at an impasse: we may want to rely on stronger notions of causal abstraction which may leverage non-linearly encoded information, but this may make our analyses vacuous; we call this the non-linear representation dilemma (schematised in Fig. 1).

To empirically validate our theoretical results, we reproduce the original distributed alignment search (DAS) experiments (Geiger et al., 2024b), but while leveraging more complex alignment maps. We find that key empirical patterns they observed—such as the first layer being easier to map to the tested algorithms in a hierarchical equality task—vanish when we use more powerful maps. Additionally, we find that we can achieve over 80% interchange intervention accuracy (IIA) using non-linear alignment maps in randomly initialised models.
Extending our experiments to language models from the Pythia suite (Biderman et al., 2023), we show that near-perfect maps can be found for randomly initialised models in the indirect object identification (IOI) task (Wang et al., 2023); notably, as training progresses, the complexity of the alignment maps needed to achieve perfect IIA in this task decreases. Overall, our results show that causal abstraction, while promising in theory, suffers from a fundamental limitation: without a priori constraints on the used alignment maps, it becomes vacuous as a method for understanding neural networks.

2 Background

In this section, we formally define algorithms (§2.1) and deep neural networks (§2.2). These will then be used to define a causal abstraction (§3). First, we formalise a task as a function $T: X \to Y$, where $x \in X$ represents a set of input features and $y \in Y$ denotes the corresponding output.

2.1 Algorithms

Given a task $T$, we may hypothesise different ways it can be solved. We term each such hypothesis an algorithm $A$,¹ which we represent as a deterministic causal model—a directed acyclic graph that implements a function $f_A : X \to Y$.² These causal models have a set of nodes $\eta_{\text{all}}$ which can be decomposed into three disjoint sets: (i) input nodes $\eta_x$ representing elements in $x$, (ii) output nodes $\eta_y$ representing elements in $y$, and (iii) inner nodes $\eta_{\text{inner}}$ representing intermediate variables used in the computation of $f_A$. As we focus on acyclic causal models, the edges in this graph induce a partial ordering on nodes $\eta_x \prec \eta_{\text{inner}} \prec \eta_y$. Let $v_\eta$ denote the value held by node $\eta$, and let $v_{\boldsymbol{\eta}}$ denote the values taken by the set of nodes $\boldsymbol{\eta}$. The set of incoming edges to a node $\eta$ represents a direct causal relationship between that node and its parents $\mathrm{par}_A(\eta)$, denoted: $v_\eta = f^\eta_A(v_{\mathrm{par}_A(\eta)})$. We can compute algorithm $f_A$ by iteratively solving the value of its nodes while respecting their partial ordering:

$$v_{\eta_x} \stackrel{\text{def}}{=} x, \qquad \forall_{\eta \in \eta_{\text{inner}} \cup \eta_y}\; v_\eta = f^\eta_A(v_{\mathrm{par}_A(\eta)}), \qquad f_A(x) \stackrel{\text{def}}{=} v_{\eta_y} \quad (1)$$

where we define $v_{\eta_x}$ to be the input value $x$, and take the value of the output nodes $v_{\eta_y}$ as our algorithm's output. Importantly, for an algorithm $A$ to represent a task $T$, its output under "normal" operation must be $f_A(x) = T(x)$. For an example of a task and related algorithms, see App. I.1.

These causal models, however, allow us to go beyond "normal" operations and investigate the behaviour of our algorithm under counterfactual settings. We can, for instance, investigate what its behaviour would be if we enforce a node $\eta'$'s value to be a constant $v_{\eta'} = c$, which we write as:

$$v_{\eta_x} = x, \quad v_{\eta'} = c, \quad \forall_{\eta \in (\eta_{\text{inner}} \cup \eta_y) \setminus \eta'}\; v_\eta = f^\eta_A(v_{\mathrm{par}_A(\eta)}), \quad f_A(x, (v_{\eta'} \leftarrow c)) = v_{\eta_y} \quad (2)$$

Now, let $f^{:\eta}_A(x)$ represent a function which runs our algorithm with input $x$ until it reaches node $\eta$, outputting its value $v_\eta$. We can use such interventions to investigate the behaviour of algorithm $A$ under input $x$, when node $\eta'$ is forced to assume the value it would have under $x'$, as: $f_A(x, (v_{\eta'} \leftarrow f^{:\eta'}_A(x')))$. Now, let $I_A$ be a multi-node intervention, e.g., $I_A = (v_{\boldsymbol{\eta}'} \leftarrow c')$ where $\boldsymbol{\eta}' = [\eta'_1, \eta'_2]$ and $c' = [f^{:\eta'_1}_A(x'), c]$. We can observe how our model operates under those interventions by running $f_A(x, I_A)$. See App. B for a pseudo-code implementation.

[Footnote 1] "Algorithm" here need not match a formal definition as, e.g., the considered functions may be uncomputable.
[Footnote 2] Geiger et al. (2024a) also consider cyclic deterministic causal models and Beckers and Halpern (2019) consider cyclic and stochastic causal models. We leave the expansion of our work to such models for future work.
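As a concrete illustration of eq. (2), here is a minimal sketch of our own (App. B gives the authors' pseudo-code) that runs the hierarchical-equality algorithm reused later in §5.1 as a causal model and performs an interchange intervention on one of its inner nodes; all names are illustrative:

```python
import numpy as np

def run_algorithm(x, interventions=None):
    """Hierarchical equality task: y = (x1 == x2) == (x3 == x4).

    `x` is a length-16 input; `interventions` maps inner-node names
    to fixed values, implementing f_A(x, (v_eta <- c)) from eq. (2).
    Returns the output node's value and all inner-node values.
    """
    interventions = interventions or {}
    x1, x2, x3, x4 = np.split(np.asarray(x, dtype=float), 4)
    v = {}  # node values, filled in topological order
    v["eta1"] = interventions.get("eta1", bool(np.array_equal(x1, x2)))
    v["eta2"] = interventions.get("eta2", bool(np.array_equal(x3, x4)))
    return v["eta1"] == v["eta2"], v  # (output node eta_y, inner nodes)

rng = np.random.default_rng(0)
x_base = rng.uniform(-0.5, 0.5, size=16)        # eta1 almost surely False here
x_cf = np.concatenate([x_base[:4], x_base[:4],  # counterfactual with x1 == x2
                       rng.uniform(-0.5, 0.5, size=8)])

# f_{:eta1}(x_cf): the value eta1 takes under the counterfactual input.
_, v_cf = run_algorithm(x_cf)
# f_A(x_base, (v_eta1 <- f_{:eta1}(x_cf))): an interchange intervention.
y_intervened, _ = run_algorithm(x_base, {"eta1": v_cf["eta1"]})
print(y_intervened)
```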
2.2 Deep Neural Networks

Deep neural networks (DNNs) are the driving force behind recent advances in ML and can be defined as a sequence of functions $f^\ell_N : H_{\psi_\ell} \to H_{\psi_{\ell+1}}$, where $\psi_\ell$ denotes the set of neurons in layer $\ell$ and $H_{\psi_\ell}$ is the corresponding vector space. A DNN $N$ with $L$ layers can be specified as follows:

$$h_{\psi_1} = f^0_N(x), \qquad h_{\psi_{\ell+1}} = f^\ell_N(h_{\psi_\ell}), \qquad p_N(y \mid x) = f^L_N(h_{\psi_L}) \quad (3)$$

where $h_{\psi_\ell}$ denotes the vector of activations for neurons $\psi_\ell$. We focus on DNNs with real-valued neurons and probabilistic outputs, so that $H_{\psi_0} = X$, $H_{\psi_\ell} = \mathbb{R}^{|\psi_\ell|}$, and $H_{\psi_{L+1}} = \Delta^{|Y|-1}$. We define $f^{\psi'}_N(x')$ as the function that computes the activations of the subset of neurons $\psi'$ when the network is evaluated on input $x'$. In particular, $f^{\psi_\ell}_N(x)$ returns the activations at layer $\ell$ for input $x$. Thus, the standard computation of the DNN corresponds to evaluating $p_N(y \mid x) = f^{\psi_{L+1}}_N(x)$. This formulation allows us to instantiate common architectures, such as multi-layer perceptrons (MLPs) or transformers, by specifying the form of each $f^\ell_N$ and the structure of the neuron sets $\psi_\ell$. The parameters of these models are typically optimised to minimise the cross-entropy loss.

Notably, similarly to the algorithms above, a DNN's architecture induces a partial ordering on its neurons, respecting the order in which they are computed: $\psi_0 \prec \psi_\ell \prec \psi_L$. We can thus analogously define a DNN intervention as follows: given a set of neurons $\psi$ in the network and a corresponding set of values $c_\psi$, we denote the intervention by $f^{\psi_{L+1}}_N(x, (h_\psi \leftarrow c_\psi))$. This notation means that, during the forward computation of the DNN, the activations of the neurons in $\psi$ are fixed to the specified values $c_\psi$, while the rest of the network operates as usual.
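Such a DNN intervention is what the interpretability literature often calls activation patching. A minimal PyTorch sketch under assumptions of our own (a generic MLP; the sizes, layer index, and neuron choice are purely illustrative, not the paper's setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# A small MLP standing in for N; sizes and layer indices are illustrative.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 16), nn.ReLU(),
                      nn.Linear(16, 2))

def forward_with_patch(model, x, layer, neuron_idx, values):
    """f_N(x, (h_psi <- c_psi)): fix the activations of neurons
    `neuron_idx` at module `layer` to `values`, then let the rest of
    the forward pass run as usual."""
    def hook(_module, _inputs, output):
        output = output.clone()
        output[..., neuron_idx] = values
        return output  # returning a tensor overrides the module's output
    handle = model[layer].register_forward_hook(hook)
    try:
        return model(x)
    finally:
        handle.remove()

# c_psi = f^psi_N(x_src): record the source input's activations at layer 2.
x_base, x_src = torch.randn(1, 16), torch.randn(1, 16)
cache = {}
handle = model[2].register_forward_hook(
    lambda _m, _i, out: cache.update(h=out.detach()))
model(x_src)
handle.remove()

# Patch neurons {0, 1} of layer 2 during the forward pass on x_base.
logits = forward_with_patch(model, x_base, 2, [0, 1], cache["h"][..., [0, 1]])
```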
3 Causal Abstraction

To define causal abstraction, we will base ourselves on the definitions in Beckers and Halpern (2019) and Geiger et al. (2024a). Let an abstraction map be defined as $\tau : H \to N$, where $H$ and $N$ are, respectively, the Cartesian products of the hidden state-spaces in a neural network $N$ (i.e., $\psi_{\text{int}} \stackrel{\text{def}}{=} \psi_{1:L+1}$) and the node value-spaces in an algorithm $A$ (i.e., $\eta_{\text{int}} \stackrel{\text{def}}{=} \eta_{\text{inner}} \cup \eta_y$), both excluding the inputs $\psi_0$ and $\eta_x$. In words, an abstraction map translates the inner states of a neural network into an algorithm's inner states. Now, consider the DNN intervention $I_N = (h_\psi \leftarrow c_\psi)$ and the algorithm intervention $I_A = (v_\eta \leftarrow c_\eta)$. Further, let $H_{h_\psi = c_\psi}$ be the set of states in a DNN for which $h_\psi = c_\psi$ holds, and equivalently for $N_{v_\eta = c_\eta}$. Under abstraction map $\tau$, we can define an intervention map as:

$$\omega_\tau(h_\psi \leftarrow c_\psi) = \begin{cases} v_\eta \leftarrow c_\eta & \text{if } N_{v_\eta = c_\eta} = \{\tau(h) \mid h \in H_{h_\psi = c_\psi}\} \\ \text{undefined} & \text{else} \end{cases} \quad (4)$$

Intuitively, $\omega_\tau$ maps a DNN intervention $I_N$ to an algorithmic one $I_A$ if the sets of states they induce on $N$ and $A$, respectively, are the same. Further, let $\mathcal{I}_A$ be a set of interventions $I_A$ which can be performed on algorithm $A$. We can use $\omega_\tau$ to derive a set of equivalent DNN interventions as:

$$\mathcal{I}_N = \{h_\psi \leftarrow c_\psi \mid v_\eta \leftarrow c_\eta \in \mathcal{I}_A,\ \omega_\tau(h_\psi \leftarrow c_\psi) = v_\eta \leftarrow c_\eta\} \quad (5)$$

Given these definitions, we now put forward a first notion of causal abstraction.

Definition 1 (from Beckers and Halpern, 2019). An algorithm $A$ is a τ-abstraction of a neural network $N$ iff: $\tau$ is surjective; $\mathcal{I}_A = \omega_\tau(\mathcal{I}_N)$;³ and there exists a surjective $\tau_{\eta_x}$ such that:

$$\forall_{x \in X}\ \forall_{I_N \in \mathcal{I}_N}:\ \tau(f^{\psi_{\text{int}}}_N(x, I_N)) = f^{:\eta_{\text{int}}}_A(\tau_{\eta_x}(x), I_A) \quad \text{where } I_A = \omega_\tau(I_N) \quad (6)$$

In words, the first condition in this definition enforces that all states in an algorithm are needed to abstract the DNN, while the second and third enforce that interventions in the algorithm have the same effect as interventions in the DNN. We further say that $A$ is a strong τ-abstraction of $N$ if it is a τ-abstraction and $\mathcal{I}_A$ is maximal, meaning that any intervention is allowed on algorithm $A$.

However, while strong τ-abstractions give us a notion of equivalence between algorithms and DNNs, the maps $\tau$ may be highly entangled and provide little intuition about the DNN's behaviour. To ensure algorithmic information is disentangled in the DNN, we say $\tau$ is a constructive abstraction map if there exists a partition of $N$'s neurons $\{\psi_\eta \mid \eta \in \eta_{\text{int}}\} \cup \{\psi_\perp\}$—where the $\psi_\eta$ are non-empty—and there exist maps $\tau_\eta$ such that $\tau$ is equivalent to the block-wise application of $\tau_\eta(h_{\psi_\eta})$. In other words, constructive abstraction maps compute the value $v_\eta$ of each node $\eta \in \eta_{\text{int}}$ in $A$ using non-overlapping sets of neurons $\psi_\eta$ from $N$, with set $\psi_\perp$ being left unused. We now define a second notion of causal abstraction.

Definition 2 (from Beckers and Halpern, 2019). An algorithm $A$ is a constructive abstraction of a neural network $N$ iff there exists a $\tau$ for which $A$ is a strong τ-abstraction of $N$ and $\tau$ is constructive.

As we will deal with algorithm–DNN pairs which share the same input and output spaces, we will impose an additional constraint on $\tau$—one that is not present in Beckers and Halpern's (2019) definition. Namely, we restrict: $\psi_{\eta_x}$ to be the neurons in layer zero and $\tau_{\eta_x}$ to be the identity; and $\psi_{\eta_y}$ to be the neurons in layer $L+1$ and $\tau_{\eta_y}$ to be the argmax operation.⁴

[Footnote 3] We overload function $\omega_\tau$ here, with $\omega_\tau(\mathcal{I}_N)$ simply applying $\omega_\tau$ elementwise to the interventions in set $\mathcal{I}_N$.
[Footnote 4] We note that this implies algorithm $A$ and network $N$ must have the same outputs on the input set $X$.

3.1 Information Encoding in Neural Networks

The definition of constructive abstraction above maps non-overlapping sets of neurons in $N$, i.e., $\psi_\eta$, to nodes in $A$, i.e., $\eta$. However, much research in ML interpretability highlights that concept information is not always neuron-aligned and that neurons are often polysemantic (Olah et al., 2017, 2020; Arora et al., 2018; Elhage et al., 2022). In fact, there is a large debate about how information is encoded in DNNs. We highlight what we see as the three most prominent hypotheses here.

Definition 3. The privileged bases hypothesis (Elhage et al., 2023) argues that neurons form privileged bases to encode information in a neural network.

Most evidence in favour of this hypothesis is indirect: i.e., the presence of neuron-aligned outlier features or activations in DNNs (Kovaleva et al., 2021; Elhage et al., 2023; He et al., 2024; Sun et al., 2024). Going back to 2015, Karpathy (2015) already showed that a single neuron in a language model could carry meaningful information. Importantly, this hypothesis is consistent with the notion of constructive abstraction above, as it argues each node $\eta$'s information should be encoded in separate, non-overlapping sets of neurons. Several researchers, however, question the special status of neurons assumed by this hypothesis, assuming instead that information is encoded in linear subspaces of the representation space, of which neurons are only a special case.

Definition 4. The linear representation hypothesis (Alain and Bengio, 2016) argues that information is encoded in linear subspaces of a neural network.
A large literature has developed, backed by the linear representation hypothesis, including: concept erasure methods (Ravfogel et al., 2020, 2022), probing methodologies (Elazar et al., 2021; Ravfogel et al., 2021; Lasri et al., 2022), and work on disentangling activations (Yun et al., 2021; Elhage et al., 2022; Huben et al., 2024; Templeton et al., 2024). Some, however, still question this idea that all information must be encoded linearly in DNNs: as neural networks implement non-linear functions, there is no a priori reason why information should be linearly encoded in them (Conneau et al., 2018; Hewitt and Liang, 2019; Pimentel et al., 2020b,a, 2022). Further, recent research presents strong evidence that some concepts are indeed non-linearly encoded in DNNs (White et al., 2021; Pimentel et al., 2022; Olah and Jermyn, 2024; Csordás et al., 2024; Engels et al., 2025a,b; Kantamneni and Tegmark, 2025).

Definition 5. The non-linear representation hypothesis (Pimentel et al., 2020b) argues that information may be encoded in arbitrary non-linear subspaces of a neural network.

3.2 Distributed Causal Abstractions

Following the discussion above, the definition of constructive abstraction may be too strict, as it assumes $\tau$ must decompose across neurons—and thus that node information is encoded in non-overlapping neurons. With this in mind, Geiger et al. (2024a,b) proposed the notion of distributed interventions: they expose the subspaces where node information is encoded in a DNN $N$ by applying a bijective function to its hidden states; this function's output is then itself a constructive abstraction of algorithm $A$. Here, we make this notion a bit more formal.

We define $\tau$ as a distributed abstraction map if the following two conditions hold. First, there exists a bijective function $\phi$—termed here an alignment map—that maps the inner neurons $\psi_{\text{int}}$ of $N$ block-wise to an equal-sized set of latent variables $\psi^\phi_{\text{int}}$, in a manner that respects the partial ordering of computations in the network. Specifically, for each layer $\ell$, there exists a bijection $\phi_\ell : \mathbb{R}^{|\psi_\ell|} \to \mathbb{R}^{|\psi_\ell|}$ on its neurons such that $\phi$ is defined as the concatenation of these layer-wise bijections. Similarly to the neurons' activations $h_\psi$, we will denote latent variables as $h_{\psi^\phi} = \phi(h_\psi)$. Second, there exists a partition $\{\psi^\phi_\eta \mid \eta \in \eta_{\text{int}}\} \cup \{\psi^\phi_\perp\}$ of the resulting latent variables $\psi^\phi_{\text{int}}$—where the $\psi^\phi_\eta$ are non-empty—and a set of maps $\tau_\eta$ such that $\tau$ is equivalent to the block-wise application of $\tau_\eta(h_{\psi^\phi_\eta})$. In words, a distributed abstraction map computes the value $v_\eta$ of each node $\eta \in \eta_{\text{int}}$ in $A$ using non-overlapping partitions of latent variables $\psi^\phi_\eta$ from $N$, with partition $\psi^\phi_\perp$ remaining unused.

Given an alignment map $\phi$, we can perform distributed interventions: $h_{\psi^\phi_\eta} \leftarrow c_{\psi^\phi_\eta}$. These interventions are performed by first mapping the hidden state $h_\psi$ to the latent variables $h_{\psi^\phi} = \phi(h_\psi)$, intervening on a subset $\psi^\phi_\eta$ by replacing $h_{\psi^\phi_\eta}$ with desired values $c_{\psi^\phi_\eta}$, and then mapping these intervened latent variables back to the original neuron basis via $h'_\psi = \phi^{-1}(h'_{\psi^\phi})$. Thus, interventions are applied in the latent space defined by $\phi$, generalising privileged-bases interventions to arbitrary (possibly non-linear) subspaces.
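Before the formal definition, a minimal sketch of a distributed intervention (our own illustration; `phi` can be any invertible map, and the random orthogonal matrix below is just one choice, with identity `phi` recovering plain neuron-aligned patching):

```python
import torch

def distributed_patch(h_base, h_src, phi, phi_inv, latent_idx):
    """h'_psi = phi^{-1}(patch(phi(h_psi))): intervene on the latent
    variables psi^phi_eta rather than directly on neurons."""
    z = phi(h_base).clone()
    z[..., latent_idx] = phi(h_src)[..., latent_idx]  # h_{psi^phi_eta} <- c
    return phi_inv(z)

# For an orthogonal (linear) phi, the inverse is simply the transpose.
torch.manual_seed(0)
d = 16
Q, _ = torch.linalg.qr(torch.randn(d, d))  # a random orthogonal matrix
phi, phi_inv = (lambda h: h @ Q), (lambda z: z @ Q.T)
h_base, h_src = torch.randn(1, d), torch.randn(1, d)
h_patched = distributed_patch(h_base, h_src, phi, phi_inv, [0, 1, 2, 3])
```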
We are now in a position to define distributed abstractions.

Definition 6. An algorithm $A$ is a distributed abstraction of a neural network $N$ iff there exists a $\tau$ for which $A$ is a strong τ-abstraction of $N$ and $\tau$ is a distributed abstraction map.

Finally, we note that the sets of all possible interventions $\mathcal{I}_A$ and $\mathcal{I}_N$ may be hard to analyse in practice. Geiger et al. (2024b) thus restrict their analyses to what we term here input-restricted interventions: the set of interventions whose values are themselves producible by a set of other input-restricted interventions. In other words, we restrict interventions $h_{\psi'} \leftarrow c_{\psi'}$ (where $\psi' \subseteq \psi_{\text{int}}$ or $\psi' \subseteq \psi^\phi_{\text{int}}$) and $v_{\eta'} \leftarrow c_{\eta'}$ to values $c_{\psi'}$ and $c_{\eta'}$ which are a product of other input-restricted interventions, e.g., $c_{\psi'} = f^{\psi'}_N(x')$ or $c_{\eta'} = f^{:\eta'}_A(x', (v_{\eta''} \leftarrow f^{:\eta''}_A(x'')))$. This leads to the definition of input-restricted τ-abstraction: a weakened notion of strong τ-abstraction, where intervention sets are restricted to input-restricted interventions. Finally, we define an analogous version of distributed abstraction which is input-restricted; this is the notion typically used in practice by machine learning practitioners.

Definition 7 (inspired by Geiger et al., 2024b). An algorithm $A$ is an input-restricted distributed abstraction of a neural network $N$ iff there exists a $\tau$ for which $A$ is an input-restricted τ-abstraction of $N$ and $\tau$ is a distributed abstraction map.

A visual representation of how these causal abstraction definitions are related is given in App. D as Fig. 6. Finally, we further introduce input-restricted V-abstractions: input-restricted distributed abstractions for which we restrict alignment maps $\phi$ to be in a specific variational family $V$. The case of linear alignment maps will be particularly important here—as it relates to the linear representation hypothesis—and we will thus explicitly label it as input-restricted linear abstraction.

3.3 Finding Distributed Abstractions

How do we evaluate if an algorithm is an input-restricted distributed abstraction of a DNN? Geiger et al. (2024b) propose an efficient method to answer this, called distributed alignment search (DAS). Before applying DAS, one must assume a partitioning $\{\psi^\phi_\eta \mid \eta \in \eta_{\text{int}}\} \cup \{\psi^\phi_\perp\}$ which remains fixed during the method's application; we term $|\psi^\phi_\eta|$ the intervention size. The principle behind DAS is then to leverage the constraint on $\tau_{\eta_y}$, which is fixed as $v_{\eta_y} = \operatorname{argmax}_{y \in Y} p_N(y \mid x)$, since $p_N(y \mid x) = h_{\psi_{L+1}}$. Given this constraint, we can initialise a parametrised function $\phi$, which we train to predict this equality under possible interventions; this is done via gradient descent, minimising the cross-entropy between the DNN and the algorithm. Specifically, we first select a set of nodes to be intervened $\boldsymbol{\eta} \in \mathcal{P}(\eta_{\text{inner}})$, where $\mathcal{P}$ is the powerset operation, along with corresponding counterfactual inputs $x_\eta \in X$ for each $\eta \in \boldsymbol{\eta}$ and a base input $x_\varnothing \in X$. We then define the following two interventions:

$$I_N = \left([h_{\psi^\phi_\eta}]_{\eta \in \boldsymbol{\eta}} \leftarrow [f^{\psi^\phi_\eta}_N(x_\eta)]_{\eta \in \boldsymbol{\eta}}\right) \quad \text{and} \quad I_A = \left([v_\eta]_{\eta \in \boldsymbol{\eta}} \leftarrow [f^{:\eta}_A(x_\eta)]_{\eta \in \boldsymbol{\eta}}\right) \quad (7)$$

Finally, we run our algorithm under base input $x_\varnothing$ and intervention $I_A$ to get a ground-truth output: $y = f_A(x_\varnothing, I_A)$. Repeating this process $N$ times, we build a dataset $D = \{(x^{(n)}_\varnothing, I^{(n)}_N, y^{(n)})\}_{n=1}^N$ on which we can train the alignment map $\phi$ such that the DNN matches the algorithm:

$$\mathcal{L} = -\sum_{(x^{(n)}_\varnothing,\, I^{(n)}_N,\, y^{(n)}) \in D} \log p^\phi_N(y^{(n)} \mid x^{(n)}_\varnothing, I^{(n)}_N), \quad \text{where } p^\phi_N(y \mid x_\varnothing, I_N) = f_N(x_\varnothing, I_N) \quad (8)$$

Notably, DAS mostly ignores how function $\tau$ is constructed, relying solely on the assumed definition of $\tau_{\eta_y}$. Finding a low-loss alignment map $\phi$ is then assumed as sufficient evidence that $A$ is an input-restricted distributed abstraction of $N$.
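A minimal sketch of this training loop, under our own assumptions (not the pyvene implementation): `model` is the illustrative MLP from the §2.2 sketch, the patch is applied at one fixed layer, and `loader` is a hypothetical iterator over pre-built $(x_\varnothing, x_{\text{src}}, y_{\text{cf}})$ triples with $y_{\text{cf}} = f_A(x_\varnothing, I_A)$:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

# phi_lin: a trainable orthogonal alignment map, one option for phi.
d = 16
phi = orthogonal(nn.Linear(d, d, bias=False))
optimiser = torch.optim.Adam(phi.parameters(), lr=1e-3)
latent_idx = [0, 1, 2, 3]  # the assumed partition psi^phi_eta (kept fixed)

def das_forward(model, layer, x_base, x_src):
    """p^phi_N(. | x_base, I_N): run x_base through the DNN, swapping the
    latent block psi^phi_eta at `layer` for its value under x_src
    (the interchange intervention I_N of eq. (7))."""
    cache = {}
    handle = model[layer].register_forward_hook(
        lambda _m, _i, out: cache.update(h=out.detach()))
    model(x_src)                  # f^psi_N(x_src); no gradients needed here
    handle.remove()

    def hook(_m, _i, out):
        z = phi(out).clone()      # map to the latent basis
        z[..., latent_idx] = phi(cache["h"])[..., latent_idx]
        return z @ phi.weight     # phi^{-1}: for orthogonal W, inverse = W
    handle = model[layer].register_forward_hook(hook)
    try:
        return model(x_base)
    finally:
        handle.remove()

loss_fn = nn.CrossEntropyLoss()
for x_base, x_src, y_cf in loader:  # `loader` is hypothetical, see lead-in
    logits = das_forward(model, 2, x_base, x_src)
    loss = loss_fn(logits, y_cf)    # eq. (8), summed over a mini-batch of D
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Only $\phi$'s parameters are optimised; the DNN and the assumed partition stay fixed, matching the method's intent of searching over alignment maps rather than retraining the model.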
4 Unbounded Abstractions are Vacuous

In this section, we provide our main theorem: that under reasonable assumptions, any algorithm $A$ can be shown to be an input-restricted distributed abstraction of any DNN $N$, making this notion of causal abstraction vacuous. To show that, we need a few assumptions (for their formal definition, see App. F). Our first assumption (Assump. 1) is that we have a countable input-space $X$. While this may not hold in general, it holds for common applications such as language modelling (where the input-space is the countably infinite set of finite strings) or computer vision (where the input-space is a countable union of pixels, which can assume a finite set of values). The second assumption (Assump. 2) is that DNNs are input-injective in all layers: i.e., $f^{\psi_\ell}_N$ is injective for all layers. This guarantees that no information about a DNN's input $x$ is lost when computing the hidden states $h_{\psi_\ell}$. This assumption is also present in prior work (e.g., Pimentel et al., 2020b), and we show in App. G—assuming real-valued weights and activations—that this is almost surely true for transformers at initialisation.⁵ Due to floating-point precision and neural collapse (Papyan et al., 2020), it is likely not to hold fully in practice; however, it still seems to be well-approximated in many empirical settings (Morris et al., 2023, and App. H). The third assumption (Assump. 3) is strict output-surjectivity in all layers. This assumption guarantees that in each layer there is at least one choice of $h_{\psi_\ell}$ that will produce the desired output. Notably, this assumption may not hold in theory, due to issues like the softmax bottleneck (Yang et al., 2018). In practice, however, even with large vocabulary sizes, it seems that almost all outputs can still be produced by language models (Grivas et al., 2022), which is sufficient for these DNNs to be abstracted by many algorithms. Our fourth assumption (Assump. 4) is that the algorithm $A$ and DNN $N$ have matchable partial-orderings, meaning that there is a partitioning of neurons in $N$ which would match the partial-ordering of nodes in $A$; this is likely to be the case for most reasonable algorithms given the size of state-of-the-art deep neural networks. Finally, our last assumption (Assump. 5) is that the DNN $N$ solves the given task $T$. We believe this assumption to be reasonable, as it would be impractical to evaluate a neural network that does not perform the task correctly.⁶ Given these assumptions, we can now present our main theorem.

Theorem 1. Given any algorithm $A$ and any neural network $N$ such that Assumps. 1 to 5 hold, we can show that $A$ is an input-restricted distributed abstraction of $N$.

Proof. We refer to App. F for the proof.

[Footnote 5] Also see Nikolaou et al. (2025), who show almost sure injectivity holds for transformers throughout training.
[Footnote 6] If the model does not solve the task, perfect IIA is impossible since non-intervened inputs yield incorrect outputs. Thus, assuming the model solves the task is necessary. In practice, however, even when the DNN is imperfect, an alignment map could produce correct outputs for all intervened inputs, achieving near-perfect IIA scores.
5 Experimental Setup

Building on the previous section's proof that alignment maps between DNNs and algorithms always exist, we now demonstrate their practical learnability, and how increasingly complex alignment maps reveal various causal abstractions for different tasks, even on DNNs that do not solve them.

Alignment Maps. To assess how complexity impacts causal-abstraction analyses, we explore three ways to parameterise $\phi$. First, we will consider the simplest identity maps: $\phi_{\text{id}}(h) = h$. This is the least expressive $\phi$ we consider, and if we find that $A$ abstracts $N$ under this map, we can say that $A$ is a constructive abstraction of $N$; further, this map implicitly assumes the privileged bases hypothesis. For $\phi_{\text{id}}$, we greedily search for the optimal partition $\{\psi^\phi_\eta \mid \eta \in \eta_{\text{int}}\}$ (instead of keeping it fixed) by iteratively adding neurons to it. For all $\eta_{\text{inner}}$ simultaneously, one neuron is added at a time to each $\psi^\phi_\eta$, up to a maximum allowed intervention size; these neurons are chosen to minimise the loss in eq. (8). Second, we will consider linear maps: $\phi_{\text{lin}}(h) = W_{\text{orth}} h$, where $W_{\text{orth}} \in \mathbb{R}^{d_\ell \times d_\ell}$ is an orthogonal matrix. This is the type of alignment map originally considered by Geiger et al. (2024b),⁷ and it implicitly assumes the linear representation hypothesis, evaluating input-restricted linear abstractions. Finally, we consider non-linear maps: $\phi_{\text{nonlin}}(h) = \text{revnet}[L_{\text{rn}}, d_{\text{rn}}](h)$, where $\text{revnet}[L_{\text{rn}}, d_{\text{rn}}]$ is a reversible residual network (RevNet; Gomez et al., 2017) with $L_{\text{rn}}$ layers and hidden size $d_{\text{rn}}$. We can modulate the complexity of this final map by increasing $L_{\text{rn}}$ and $d_{\text{rn}}$, assuming the non-linear representation hypothesis. We note that all three maps are bijective and easily invertible.

[Footnote 7] We note that, while Geiger et al. (2024b) describe the used $\phi$ as a rotation, their pyvene (Wu et al., 2024a) implementation uses orthogonal matrices. This, however, makes no difference in the power of the alignment map.

Evaluation Metric. We evaluate the effectiveness of an alignment map $\phi$ using the interchange intervention accuracy (IIA) metric proposed by Geiger et al. (2024b). For a held-out test set $D_{\text{test}}$ with the same structure as the training set $D$ defined in §3.3, we compute the accuracy of our model (i.e., $\operatorname{argmax}_{y' \in Y} p^\phi_N(y' \mid x^{(n)}_\varnothing, I^{(n)}_N)$) when predicting the intervened $y^{(n)} = f_A(x^{(n)}_\varnothing, I^{(n)}_A)$. We compare this to the DNN's accuracy on the test set $D_{\text{test}}$ without interventions.
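Under the same illustrative setup as the DAS sketch in §3.3 (the hypothetical `das_forward` and $(x_\varnothing, x_{\text{src}}, y_{\text{cf}})$ triples), IIA can be computed as plain accuracy against the algorithm's counterfactual outputs:

```python
import torch

@torch.no_grad()
def interchange_intervention_accuracy(model, layer, test_triples):
    """IIA: the fraction of interchange interventions for which the
    patched DNN predicts the algorithm's intervened output y_cf."""
    hits, total = 0, 0
    for x_base, x_src, y_cf in test_triples:
        logits = das_forward(model, layer, x_base, x_src)
        hits += (logits.argmax(dim=-1) == y_cf).sum().item()
        total += y_cf.numel()
    return hits / total
```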
5.1 Tasks, Algorithms, and DNNs

Hierarchical equality task (Geiger et al., 2024b). We will showcase our results primarily on this task. Let $x = x_1 \circ x_2 \circ x_3 \circ x_4$ be a 16-dimensional vector, with $x_1$ to $x_4$ each a 4-dimensional vector, where $\circ$ represents vector concatenation. Further, let $X = [-.5, .5]^{16}$. This task consists of evaluating $y = (x_1 == x_2) == (x_3 == x_4)$. As our DNN $N$, we investigate a 3-layer multi-layer perceptron (MLP) with hidden size 16, trained to perform this task; we describe this DNN and its training procedure in more detail in App. I.1. Finally, we explore three algorithms for this task. The both equality relations algorithm first computes the two equalities ($v_{\eta_1} = (x_1 == x_2)$ and $v_{\eta_2} = (x_3 == x_4)$) separately; it then determines whether they are equivalent as a second step. The left equality relation algorithm first computes the left equality ($v_{\eta_1} = (x_1 == x_2)$), and then determines in a single step if this is equivalent to $(x_3 == x_4)$. Finally, the identity of first argument algorithm assumes we copy the first input to a node ($v_{\eta_1} = x_1$) and then compute the output directly. These three algorithms are more rigorously defined in App. I.1.

Indirect object identification (IOI) task. In a second set of experiments, we explore this task, inspired by Wang et al. (2023) and using the dataset of Muhia (2022). This task is more realistic and relies on larger (language) models. Here, inputs $x \in X$ are strings where two people are first introduced, and later one of them assumes the role of subject (S), giving or saying something to the other, the indirect object (IO). The task is then to predict the first token of the IO, with the output set $Y$ containing the first token of each person's name. E.g., $x =$ "Friends Juana and Kristi found a mango at the bar. Kristi gave it to " and $y =$ "Juana". As our DNN, we use models from the Pythia suite (Biderman et al., 2023) across different sizes (from 31M to 410M parameters) and training stages. We evaluate the ABAB-ABBA algorithm where, given two names A and B, an inner node $v_{\eta_1}$ captures whether the sentence structure is ABAB (e.g., "A and B ... A gave to B") or ABBA (e.g., "A and B ... B gave to A"), and the algorithm outputs prediction B if $v_{\eta_1}$ is ABAB and A otherwise. This algorithm is more rigorously defined in App. I.2.

6 Experiments and Results

We now proceed with our empirical study, applying alignment maps of varying complexity on both "toy" and real neural networks to evaluate their effects on the causal abstraction method DAS.

[Figure 2: IIA in the hierarchical equality task for causal abstractions trained with different alignment maps $\phi$ (identity, linear, non-linear). The figure shows results for all three analysed algorithms for this task (both equality relations, left equality relation, identity of first argument), across layers L1–L3. The bars represent the max IIA across 10 runs with different random seeds. The black lines represent mean IIA with 95% confidence intervals. $|\psi^\phi_\eta|$ denotes the intervention size per node. Without interventions, all DNNs reach almost perfect accuracy (>0.99). The used $\phi_{\text{nonlin}}$ has $L_{\text{rn}} = 10$ and $d_{\text{rn}} = 16$.]

[Figure 3: IIA of alignment between the both equality relations algorithm and an MLP, with interventions at layer 1. Left: Mean IIA over 5 seeds using $\phi_{\text{nonlin}}$ ($L_{\text{rn}} = 1$) on the trained DNN; performance improves with larger hidden dimension $d_{\text{rn}}$ and intervention size $|\psi^\phi_\eta|$. Right: Maximum IIA across 5 seeds using $\phi_{\text{lin}}$ and $\phi_{\text{nonlin}}$ with $|\psi^\phi_\eta| = 8$; complex alignment maps achieve high IIA even with randomly initialised DNNs, while simpler maps gradually improve as training progresses.]

Hierarchical equality task, main results.⁸ Fig. 2 presents IIA results across different alignment maps $\phi$ for all three algorithms. As expected, the identity map $\phi_{\text{id}}$ generally results in the worst performance. Using linear alignments ($\phi_{\text{lin}}$), we observe patterns consistent with Geiger et al. (2024b): IIA for both equality relations and left equality relation decreases substantially in the third layer, indicating information becomes difficult to manipulate using linear transformations at deeper layers. With the non-linear alignment ($\phi_{\text{nonlin}}$), this layer-dependent degradation vanishes, yielding near-optimal IIA across all layers.

[Footnote 8] As an additional task similar to hierarchical equality, we also explore the distributive law task in App. I.3.
Consequently, while assuming linear representations seems to enable us to identify the location of certain variables in our DNN, many of these insights fail to generalise when more powerful non-linear alignment maps are employed. The identity of first argument algorithm's IIA consistently hovers around 50% for $\phi_{\text{id}}$, $\phi_{\text{lin}}$, and $\phi_{\text{nonlin}}$. Additional experiments (App. H) suggest this is caused by insufficient capacity of the used revnet model, as the identity of $x_1$ seems to be encoded in the model's hidden states.

Hierarchical equality task, exploring $\phi_{\text{nonlin}}$'s complexity. Fig. 3 (left) illustrates how varying the hidden size $d_{\text{rn}}$ and intervention size $|\psi^\phi_\eta|$ affects IIA with the both equality relations algorithm on layer 1 of our MLP. Fig. 3 (right) shows IIA evolution as alignment complexity increases throughout the MLP's training (evaluated on its layer 1). Remarkably, even with randomly initialised DNNs, we achieve over 80% IIA using the most complex alignment map. As training progresses, simpler alignment maps gradually attain higher IIA values. Additional results in App. I.1.3 extend these findings to other MLP layers, intervention sizes, and algorithms, consistently revealing similar patterns that reinforce our conclusion about the impact of alignment map complexity on IIA dynamics.

Indirect object identification task, main results. Fig. 4 (left) presents the results of trying to find causal abstractions between the ABAB-ABBA algorithm and Pythia language models, exploring how model size affects alignment capabilities. Notably, despite only the larger models (160M and 410M parameters) successfully learning the IOI task, we can align the algorithm to models of all sizes—including the 31M and 70M parameter models that fail to learn the task. Further, and somewhat surprisingly, this alignment is perfect even for randomly initialised models across all sizes; smaller fully trained models (31M, 70M), though, show slightly reduced alignment accuracy. This reduction may stem from these smaller models saturating late in training (Godey et al., 2024), becoming highly anisotropic and making it harder for $\phi_{\text{nonlin}}$ to access the information needed to match the algorithm.

[Figure 4: IIA of alignment between the ABAB-ABBA algorithm and Pythia language models. Left: IIA across model sizes (31M–410M) at initialisation (Init.) or after full training (Full), with intervention at the middle layer, for linear and non-linear maps. Right: IIA with increasingly complex alignment maps ($L_{\text{rn}} \in \{1, 2, 4, 8\}$) during Pythia-410M's training. Results show complex alignment maps yield near-perfect IIA. All $\phi_{\text{nonlin}}$ use $d_{\text{rn}} = 64$.]

Indirect object identification task, exploring $\phi_{\text{nonlin}}$'s complexity. Fig. 4 (right) illustrates the interplay between model training progression and algorithmic alignment for the Pythia model with 410M parameters. Notably, while this model begins to acquire task proficiency only around training step 3000 (as indicated by model accuracy), employing an 8-layer $\phi_{\text{nonlin}}$ as alignment map yields near-perfect IIA across all training steps, including for randomly initialised models.
This pattern partially extends to a 4-layer $\phi_{\text{nonlin}}$ configuration; however, there is a noticeable dip in IIA at step 1000 for this configuration, which may be due to the model over-fitting to unigram statistics (Chang and Bergen, 2022; Belrose et al., 2024) at this point—thereby making context (and hidden states) be mostly ignored when producing model outputs. Interestingly, as training advances, even less complex alignment maps (1- and 2-layer $\phi_{\text{nonlin}}$) eventually attain perfect alignment. In contrast, linear maps only approximate perfect alignment in the fully trained model, following a similar trend to the DNN's performance.

7 Discussion

Our results show that when we lift the assumption of linear representations, sufficiently complex alignment maps can achieve near-perfect alignment across all models—regardless of their ability to solve the underlying task. This provides compelling evidence for the non-linear representation dilemma, suggesting causal alignment may be possible even when the model lacks task capability. We now discuss our results in the context of prior literature, with additional related work in App. C.

Causal Abstraction is not Enough. Causal abstraction (Geiger et al., 2024a) has gained traction as a theoretical framework for mechanistic interpretability, promising to overcome probing limitations by analysing DNN behaviour through interventions: if you intervene on a DNN's representations and its behaviour changes in a predictable way, you have identified how the DNN "truly" encodes that feature (Elazar et al., 2021; Ravfogel et al., 2021; Lasri et al., 2022). Recent critiques of causal abstraction (e.g., Mueller, 2024) highlight practical shortcomings, including the non-uniqueness of identified algorithms (Méloux et al., 2025) and the risk of "interpretability illusions" (Makelov et al., 2024). Despite counterarguments to some of these critiques (Wu et al., 2024b; Jørgensen et al., 2025), concerns have emerged that methods based on causal abstraction may introduce new information rather than accurately reflect the behaviour of the DNN (Wu et al., 2023; Sun et al., 2025); as an example, causal abstraction methods applied to random models sometimes yield above-chance performance (Geiger et al., 2024b; Arora et al., 2024). By examining the implications of assuming arbitrarily complex ways in which features may be encoded in a DNN, we show that nearly any neural network can be aligned to any algorithm. Together, our results thus suggest that the shift in interpretability research to causal abstractions does not, by itself, resolve the core challenge of understanding how representations are encoded. Additionally, we note that early causal abstraction methods (Geiger et al., 2021) implicitly rely on the privileged bases hypothesis, while recent advancements (Geiger et al., 2024b) rely on the linear representation hypothesis instead.

Balancing the Accuracy vs. Complexity of $\phi$. Diagnostic probing was a previously popular method for interpretability research (Alain and Bengio, 2016), where a probe was applied to the hidden representations of a DNN and trained to predict a specific variable. Notably, the architecture chosen for this probe implicitly reflected assumptions about representation encoding, and the absence of a universally accepted model of representation encoding precluded a theoretically founded choice of probe architecture (Belinkov, 2022).
The debate regarding the trade-off between probing complexity and accuracy (Hewitt and Liang, 2019; Pimentel et al., 2020b,a; Voita and Titov, 2020) underscores the risk of complex probes merely memorising variable-specific relations, instead of revealing which information the DNN "truly" encodes and uses. In this paper, we revive this debate by showing a clear analogue in causal abstraction methodologies: the effect of $\phi$'s complexity on IIA. Unfortunately, this debate was never settled by the probing literature, and solutions ranged from: controlling for the probe's memorisation capacity (Hewitt and Liang, 2019),⁹ explicitly measuring a probe's complexity–accuracy trade-off (Pimentel et al., 2020a), training minimum description length probes (Voita and Titov, 2020), to leveraging unsupervised probes (Burns et al., 2023).¹⁰

[Footnote 9] Notably, this method was previously applied to causal abstraction analysis by Arora et al. (2024).
[Footnote 10] The complexity–accuracy trade-off in probing arises mainly in supervised settings, where more complex probes can extract richer features from model representations. Unsupervised probing avoids this, lacking the supervision that enables such "gerrymandered" mappings.

The Role of Generalisation. We now highlight that Theorem 1 provides an existence proof for a perfect abstraction map (thus guaranteeing perfect IIA) between a DNN and an algorithm. This existence proof, however, leverages complex interactions between the intervened hidden states and the DNN's structure, requiring perfect information about both and thus representing a form of extreme overfitting. Crucially, this theorem offers no guarantees regarding the learnability of the alignment map $\phi$ from limited data or its generalisation to unseen inputs. This gap between theoretical existence and practical learnability becomes evident in practice. For instance, in an additional experiment on the IOI task (in App. I.2.3), we show that when training and test sets contain disjoint sets of names, the learned alignment map fails to generalise, resulting in low IIA on the test set. This suggests that generalisation should play a crucial role in causal abstraction analysis, as the ability to learn abstraction maps that transfer beyond training data seems fundamental to interpreting a model, distinguishing genuine understanding of its inner workings from mere training-pattern memorisation.

Investigating Representation Encoding in DNNs. How neural networks encode variables/concepts is a long-standing question in interpretability, with three main hypotheses standing out: the privileged bases, linear representation, and non-linear representation hypotheses (see §3.1). One way to try to distinguish between these hypotheses is with causal abstraction analyses, but what can we learn about these hypotheses if our methods themselves rely on them as assumptions? One solution could be to compare results using $\phi$ with different architectures. Our Fig. 4 (right), for instance, shows that while $\phi_{\text{nonlin}}$ achieves consistently near-perfect results throughout model training, $\phi_{\text{lin}}$ accompanies the actual DNN's performance more closely. Intuitively, we may thus be inclined to support the linear representation hypothesis here. We (the authors), however, cannot make this intuition formal to justify why we believe this is the case. Furthermore, $\phi_{\text{lin}}$ still manages to sometimes achieve IIA higher than the DNN's accuracy, implying it may also "learn the task". We expect future work will propose novel methodologies to analyse information encoding and try to answer these questions.

8 Conclusion

This paper critically examines causal abstraction in machine learning when no assumptions are imposed on how representations are encoded. We show that, under mild conditions, any algorithm can be perfectly aligned with any DNN, leading to the non-linear representation dilemma.
Empirical validation through experiments on the hierarchical equality and indirect object identification tasks corroborates our theoretical insights, demonstrating near-perfect IIA even in randomly initialised DNNs. So, what should you do if you want to perform a causal analysis of your DNN? We believe this must be decided on a case-by-case basis. If you have reason to believe the linear representation hypothesis holds for the features you wish to extract, constraining $\phi$ to linear functions may be advised. If you do not, however, you may face the non-linear representation dilemma, and be forced to investigate some kind of trade-off between $\phi$'s accuracy and complexity.

Limitations. Our proof that any algorithm can be aligned with any DNN (Theorem 1) relies on a form of overfitting. Yet, our experiments show that the learned alignment maps $\phi$ generalise to unseen test data; studying the factors behind this generalisation would be valuable. Further, our theorem relies on two strong assumptions: input-injectivity (Assump. 2) and strict output-surjectivity (Assump. 3) in all layers. While we justify both, there are settings—related to, e.g., the softmax bottleneck—where they may fail; studying these failure modes could clarify our assumptions' limitations.

Contributions

Denis Sutter led the project, implemented the base version of the DAS code, conducted the MLP experiments, and derived the base proof of Theorem 1 as well as the proof of Theorem 2. Julian Minder implemented and ran the language model experiments, produced all plots, and helped refine the proof of Theorem 2. Thomas Hofmann provided guidance throughout the project. Tiago Pimentel supervised the project, giving initial intuitions for the proofs of both Theorem 1 and Theorem 2, refining the proof of Theorem 1, and defining the main notation in the paper, integrating feedback from Denis and Julian. All authors wrote the paper together.

Acknowledgments

This work was mostly done in the Data Analytics Lab at ETH Zürich. We would like to thank Pietro Lesci, Julius Cheng, Marius Mosbach, Chris Potts, and Atticus Geiger for their thoughtful feedback. We would also like to thank Frederik Hytting Jørgensen for bringing to our attention a mistake in our original Definition 1 and for his feedback on our manuscript. We thank Zhengxuan Wu and Kevin Du for early discussions related to the ideas presented here. We are grateful to the Data Analytics Lab at ETH for providing access to their computing cluster. Julian Minder is supported by the ML Alignment Theory Scholars (MATS) program. Denis Sutter gratefully acknowledges the financial support of his parents, Renate and Wendelin Sutter, throughout his graduate studies, during which this work was carried out, as well as the technical support of Urban Moser and Leo Schefer.

References

Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv.

Aryaman Arora, Dan Jurafsky, and Christopher Potts. 2024. CausalGym: Benchmarking causal interpretability methods on linguistic tasks.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14638–14663, Bangkok, Thailand. Association for Computational Linguistics.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

Sander Beckers and Joseph Y. Halpern. 2019. Abstracting causal models. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):2678–2685.

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.

Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Fern. 2024. Neural networks learn statistics of increasing complexity. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, Usvsn Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. 2023. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, volume 29.

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2023. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.

Tyler A. Chang and Benjamin K. Bergen. 2022. Word acquisition in neural language models. Transactions of the Association for Computational Linguistics, 10:1–16.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res., 24(1).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.
Róbert Csordás, Christopher Potts, Christopher D Manning, and Atticus Geiger. 2024. Recurrent neural networks learn to store and generate sequences using non-linear representations. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 248–262, Miami, Florida, US. Association for Computational Linguistics.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy models of superposition. Transformer Circuits Thread.

Nelson Elhage, Robert Lasenby, and Christopher Olah. 2023. Privileged bases in the transformer residual stream. Transformer Circuits Thread, page 24.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread.

Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025a. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations.

Joshua Engels, Logan Riggs Smith, and Max Tegmark. 2025b. Decomposing the dark matter of sparse autoencoders. Transactions on Machine Learning Research.

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-jussà. 2024. A primer on the inner workings of transformer-based language models. arXiv.

Lei Gao and Ling Guan. 2023. Interpretability of machine learning: Recent advances and future prospects. IEEE MultiMedia, 30(4):105–118.

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, and Thomas Icard. 2024a. Causal abstraction: A theoretical foundation for mechanistic interpretability. arXiv.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. 2021. Causal abstractions of neural networks. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21, Red Hook, NY, USA. Curran Associates Inc.

Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. 2022. Inducing causal structure for interpretable neural networks. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338. PMLR.

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. 2024b. Finding alignments between interpretable causal variables and distributed neural representations. arXiv.

Nathan Godey, Éric Villemonte de la Clergerie, and Benoît Sagot. 2024. Why do small language models underperform? Studying language model saturation via the softmax bottleneck. In First Conference on Language Modeling.

Satvik Golechha and James Dao. 2024. Challenges in mechanistically interpreting model representations. arXiv.
Aidan N Gomez, Mengye Ren, Raquel Urtasun, and Roger B Grosse. 2017. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision making and a "right to explanation". AI Magazine, 38(3):50–57.
Andreas Grivas, Nikolay Bogoychev, and Adam Lopez. 2022. Low-rank softmax can have unargmaxable classes in theory but rarely in practice. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6738–6758, Dublin, Ireland. Association for Computational Linguistics.
Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, and Thomas Hofmann. 2024. Understanding and minimising outlier features in transformer training. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China.
Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2024. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations.
Frederik Hytting Jørgensen, Luigi Gresele, and Sebastian Weichwald. 2025. What is causal about causal models and representations? arXiv.
Subhash Kantamneni and Max Tegmark. 2025. Language models use trigonometry to do addition. arXiv.
Andrej Karpathy. 2015. The unreasonable effectiveness of recurrent neural networks.
Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky. 2021. BERT busters: Outlier dimensions that disrupt transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3392–3405, Online. Association for Computational Linguistics.
Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, and Ryan Cotterell. 2022. Probing for the usage of grammatical number. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8818–8831, Dublin, Ireland. Association for Computational Linguistics.
Aleksandar Makelov, Georg Lange, Atticus Geiger, and Neel Nanda. 2024. Is this the subspace you are looking for? An interpretability illusion for subspace activation patching. In The Twelfth International Conference on Learning Representations.
Maxime Méloux, Silviu Maniu, François Portet, and Maxime Peyrard. 2025. Everything, everywhere, all at once: Is mechanistic interpretability identifiable? In The Thirteenth International Conference on Learning Representations.
Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, and Ryan Cotterell. 2025. Controllable context sensitivity and the knob behind it. In The Thirteenth International Conference on Learning Representations.
John Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander Rush. 2023. Text embeddings reveal (almost) as much as text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12448–12460, Singapore. Association for Computational Linguistics.
Aaron Mueller. 2024. Missed causes and ambiguous effects: Counterfactuals pose challenges for interpreting neural networks. arXiv.
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, and Yonatan Belinkov. 2024. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability. arXiv.
Brian Muhia. 2022. ioi (revision 223da8b).
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML'10, page 807–814, Madison, WI, USA. Omnipress.
Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, and Emanuele Rodolá. 2025. Language models are injective and hence invertible. Preprint, arXiv:2510.15511.
Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in.
Chris Olah and Adam Jermyn. 2024. What is a linear representation? What is a multidimensional feature? Transformer Circuits Thread.
Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. 2017. Feature visualization. Distill. https://distill.pub/2017/feature-visualization.
Vardan Papyan, X. Y. Han, and David L. Donoho. 2020. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663.
Tiago Pimentel, Naomi Saphra, Adina Williams, and Ryan Cotterell. 2020a. Pareto probing: Trading off accuracy for complexity. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3138–3153, Online. Association for Computational Linguistics.
Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020b. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4609–4622, Online. Association for Computational Linguistics.
Tiago Pimentel, Josef Valvoda, Niklas Stoehr, and Ryan Cotterell. 2022. The architectural bottleneck principle. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11459–11472, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog.
Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. 2020. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7237–7256, Online. Association for Computational Linguistics.
Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg. 2021. Counterfactual interventions reveal the causal effect of relative clause representations on agreement prediction. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 194–209, Online. Association for Computational Linguistics.
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D Cotterell. 2022. Linear adversarial concept erasure. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18400–18421. PMLR.
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath. 2025. Open problems in mechanistic interpretability. arXiv.
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, and Atticus Geiger. 2025. HyperDAS: Towards automating mechanistic interpretability with hypernetworks. arXiv.
Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. 2024. Massive activations in large language models. In First Conference on Language Modeling.
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread.
Sana Tonekaboni, Shalmali Joshi, Melissa D McCradden, and Anna Goldenberg. 2019. What clinicians want: Contextualizing explainable machine learning for clinical end use. arXiv.
Oskar van der Wal, Pietro Lesci, Max Müller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. 2025. PolyPythias: Stability and outliers across fifty language model pre-training runs. In The Thirteenth International Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 6000–6010, Red Hook, NY, USA. Curran Associates Inc.
Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196, Online. Association for Computational Linguistics.
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations.
Jennifer C. White, Tiago Pimentel, Naomi Saphra, and Ryan Cotterell. 2021. A non-linear structural probe. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 132–138, Online. Association for Computational Linguistics.
Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah Goodman, Christopher Manning, and Christopher Potts. 2024a. pyvene: A library for understanding and improving PyTorch models via interventions. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations), pages 158–165, Mexico City, Mexico. Association for Computational Linguistics.
Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, and Noah D. Goodman. 2024b. A reply to Makelov et al. (2023)'s "interpretability illusion" arguments. arXiv.
Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. 2023. Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems, volume 36, pages 78205–78226. Curran Associates, Inc.
Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. 2018. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations.
Zeyu Yun, Yubei Chen, Bruno Olshausen, and Yann LeCun. 2021. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 1–10, Online.
Wenwu Zhu, Xin Wang, and Wen Gao. 2020. Multimedia intelligence: When multimedia meets artificial intelligence. IEEE Transactions on Multimedia, 22(7):1823–1835.

A Reproducibility

We provide the code to reproduce our experiments at https://github.com/densutter/non-linear-representation-dilemma. Refer to the README.md for instructions.

B Pseudo-code for Running an Intervention on an Algorithm

In Fig. 5, we present pseudo-code that demonstrates how algorithms execute under our intervention framework.

    def f(A):
        def f_A(x, I_A=None):
            v = {A.eta_x: x}
            for eta in A.topological_sort(A.eta_inner | {A.eta_y}):
                if I_A and eta in I_A:
                    v[eta] = I_A[eta]                            # intervened node: take the forced value
                else:
                    v[eta] = A.f[eta](*(v[p] for p in A.par(eta)))  # compute node from its parents
            return v[A.eta_y]
        return f_A

Figure 5: Pseudo-code implementation of an algorithm with interventions, where interventions I_A are specified as a Python dictionary mapping nodes to their intervened values.
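For concreteness, the snippet below drives Fig. 5's f with a toy two-argument equality algorithm. The EqualityAlgorithm container and its node names are hypothetical stand-ins that merely satisfy the interface assumed above; they are not part of the paper's released code.

```python
# Hypothetical toy container matching the interface assumed by Fig. 5's f(A).
class EqualityAlgorithm:
    eta_x, eta_y = "x", "eq"
    eta_inner = {"a", "b"}

    def __init__(self):
        self.f = {
            "a": lambda x: x[0],          # first input argument
            "b": lambda x: x[1],          # second input argument
            "eq": lambda a, b: a == b,    # equality of the two arguments
        }

    def par(self, eta):
        return {"a": ["x"], "b": ["x"], "eq": ["a", "b"]}[eta]

    def topological_sort(self, nodes):
        return [n for n in ("a", "b", "eq") if n in nodes]

A = EqualityAlgorithm()
run = f(A)                                # f as defined in Fig. 5
print(run((3, 3)))                        # True: both arguments equal
print(run((3, 3), I_A={"b": 7}))          # False: intervention overwrites node b
```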
C Additional Related Work

The concept of causal abstraction was ported to deep neural networks by Geiger et al. (2024a), providing a generalised framework for understanding how neural networks can be abstracted to higher-level algorithms. Early work by Geiger et al. (2021) explored direct interventions on neuron-aligned activations, laying the groundwork for more sophisticated approaches. Building on this, Geiger et al. (2024b) introduced distributed alignment search (DAS), which uses an alignment map to align distributed representations in neural networks with causal graphs. Several improvements to DAS have been proposed: Sun et al. (2025) developed HyperDAS, which automates the search for node information using hypernetworks, while Wu et al. (2023) introduced Boundless DAS, which automatically determines intervention size through gradient descent, scaling to larger models.

However, recent work has raised important critiques of causal alignment methods. Méloux et al. (2025) demonstrated that multiple algorithms can be causally aligned with the same neural network, and conversely, that a single algorithm can align with different network subspaces. Mueller (2024) identified fundamental limitations in counterfactual theories, showing they may miss certain causes and that causal dependencies in neural networks are not necessarily transitive. Makelov et al. (2024) showed that subspace interventions such as those used in DAS can lead to "interpretability illusions"—cases where manipulating a subspace changes the behaviour of the model through activating parallel pathways, rather than directly controlling the target feature. In their response, Wu et al. (2024b) argued these illusions may be artefacts of specific evaluation approaches rather than fundamental flaws, and that they depend on the definition of causality being used, a point also made by Jørgensen et al. (2025).

Recent work has also raised significant challenges to the linear representation hypothesis. White et al. (2021) demonstrated that syntactic structure in language models is encoded non-linearly, showing that kernelised structural probes outperform linear ones while maintaining parameter count. Similarly, Csordás et al. (2024) found that recurrent neural networks use fundamentally non-linear representations for sequence tasks. Engels et al. (2025a) provided concrete examples of non-linear feature representations in language models, such as days of the week being encoded on a circular manifold. While Golechha and Dao (2024) argued that some language modelling behaviours may be represented linearly due to next-token prediction and LayerNorm folding, Mueller et al. (2024) advocated for exploring non-linear mediators to uncover more sophisticated abstractions. Additional evidence comes from Kantamneni and Tegmark (2025), who found that language models represent numbers on a helical manifold. Olah and Jermyn (2024) offered an important clarification: the linear representation hypothesis is not about dimensionality but rather about features behaving mathematically linearly through addition and scaling, allowing for multidimensional features with constrained geometry. This represents a relaxation of the strongest form of the hypothesis.

D Schematic of the Relation Between Notions of Causal Abstraction

[Figure 6: schematic diagram; its graphical content is not reproducible in text form.]

Figure 6: A schematic of the definitions of causal abstraction in §3. The axes represent an increase in how restricted the notion of causal abstraction is, based on: y-axis, constraints placed on τ; and x-axis, constraints placed on the set of allowed interventions. Grey arrows symbolise a superset→subset relationship: if an A–N pair fulfils the conditions in the subset, it also fulfils them in the superset.

E DNN Definitions

E.1 MLP

A multi-layer perceptron (MLP) consists of a sequence of linear transformations interleaved with non-linear activation functions.

Submodule 1. We can define a multi-layer perceptron (mlp) by choosing:
$$h_{\psi_1} = f^0_N(x) = W_0 x, \qquad h_{\psi_{\ell+1}} = f^\ell_N(h_{\psi_\ell}) = W_\ell\,\sigma(h_{\psi_\ell}) + b_\ell \tag{9}$$
where $W_0 \in \mathbb{R}^{|\psi_1|\times|x|}$, $W_\ell \in \mathbb{R}^{|\psi_{\ell+1}|\times|\psi_\ell|}$, and $b_\ell \in \mathbb{R}^{|\psi_{\ell+1}|}$ are trainable parameters, and $\sigma$ is a non-linearity like ReLU. For this model, $\mathcal{H}_\ell = \mathbb{R}^{|\psi_\ell|}$ for $0 < \ell < L$ and $\mathcal{H}_L = \mathbb{R}^{|y|}$.

In this work, we focus on MLPs used for classification tasks, whose final layer includes a softmax transformation.

DNN 1. A classification multi-layer perceptron (MLP) is defined like Submodule 1, but with a softmax on the last layer:
$$p_N(y \mid x) = f^L_N(h_{\psi_L}) = \mathrm{softmax}(W_L h_{\psi_L}) \tag{10}$$
where $W_L \in \mathbb{R}^{|y|\times|\psi_L|}$ is a trainable parameter.
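A minimal NumPy sketch of eqs. (9) and (10) follows; the class and parameter names are ours, chosen for illustration rather than taken from the paper's codebase.

```python
import numpy as np

def relu(h):
    return np.maximum(h, 0.0)

def softmax(z):
    z = z - z.max()                              # numerical stability
    e = np.exp(z)
    return e / e.sum()

class ClassificationMLP:
    """h_{psi_1} = W_0 x; h_{psi_{l+1}} = W_l sigma(h_{psi_l}) + b_l; p(y|x) = softmax(W_L h_{psi_L})."""

    def __init__(self, sizes, n_classes, rng=np.random.default_rng(0)):
        # sizes = [|x|, |psi_1|, ..., |psi_L|]
        self.W0 = rng.normal(size=(sizes[1], sizes[0]))
        self.Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(1, len(sizes) - 1)]
        self.bs = [rng.normal(size=sizes[l + 1]) for l in range(1, len(sizes) - 1)]
        self.WL = rng.normal(size=(n_classes, sizes[-1]))

    def forward(self, x):
        h = self.W0 @ x                          # eq. (9), first layer: no bias, no activation
        for W, b in zip(self.Ws, self.bs):
            h = W @ relu(h) + b                  # eq. (9), subsequent layers
        return softmax(self.WL @ h)              # eq. (10), softmax head

mlp = ClassificationMLP(sizes=[4, 8, 8], n_classes=3)
print(mlp.forward(np.ones(4)))                   # a probability vector over 3 classes
```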
E.2 Transformer Language Model

In this section, we provide a definition of decoder-only autoregressive language models (Radford et al., 2019; Vaswani et al., 2017). While many variations of transformer architectures have been developed, we focus on the original GPT-2 architecture (Radford et al., 2019). We highlight that the Pythia models explored in our experiments are slightly different from the original GPT-2 and use parallel attention (Chowdhery et al., 2023); however, we do not expect this change to strongly affect our results. We now define the different submodules that compose a transformer.

Submodule 2. The Embedding layer in a transformer maps input tokens to vectors:
$$f^0_N(x) = e_x, \quad \text{where } e_x \in \mathbb{R}^{|x|\times|\psi_1|} \tag{11}$$
In this equation, $e \in \mathbb{R}^{|\mathcal{X}|\times|\psi_1|}$ is a learned parameter matrix and $x$ indexes into its rows.^11

Submodule 3. Multi-Head Self-Attention with $H$ heads is defined as:
$$\mathrm{attn}(h) = \mathrm{concat}(\eta_1(h), \dots, \eta_H(h))\, W^O \tag{12}$$
where each head operates in dimension $d_\eta \ll |\psi_\ell|$ and computes:
$$\eta_i(h) = \mathrm{softmax}\!\left(\frac{(hW^Q_i)(hW^K_i)^\top}{\sqrt{d_k}}\right) hW^V_i \tag{13}$$
with learned parameters $W^O \in \mathbb{R}^{H d_\eta \times |\psi_\ell|}$ and $W^Q_i, W^K_i, W^V_i \in \mathbb{R}^{|\psi_\ell|\times d_\eta}$.

Submodule 4. Layer Normalization applies per-feature normalisation:
$$\mathrm{LN}(h) = \gamma \odot \frac{h-\mu}{\sigma} + \beta \tag{14}$$
where $\mu$ and $\sigma$ are the mean and standard deviation across all features for a single input, and $\gamma, \beta$ are learned parameters.

Using these submodules, we define a transformer block.

Submodule 5. A Transformer Block chains together attention and MLP layers with residual connections:
$$h' = h + \mathrm{attn}(\mathrm{LN}(h)), \qquad f^\ell_N(h) = h' + \mathrm{mlp}(\mathrm{LN}(h')) \tag{15}$$
where mlp is applied to each token's activations separately, as defined in Submodule 1.

Finally, we define the complete transformer language model.

DNN 2. A transformer language model consists of an embedding layer, transformer blocks, and an output layer:
$$h_{\psi_1} = f^0_N(x) = e_x \tag{16a}$$
$$h_{\psi_{\ell+1}} = f^\ell_N(h_{\psi_\ell}) \tag{16b}$$
$$p_N(y \mid x) = f^L_N(h_{\psi_L}) = \mathrm{softmax}\big(\mathrm{LN}(h_{\psi_L})_{-1}\, W_L\big) \tag{16c}$$
where $\mathrm{LN}(h_{\psi_L})_{-1}$ selects the final token's position after layernorm is applied. For this model, $\mathcal{H}_{\psi_\ell} = \mathbb{R}^{|x|\times|\psi_\ell|}$ for $1 \le \ell \le L$ and $\mathcal{H}_{\psi_{L+1}} = \Delta^{|\mathcal{Y}|-1}$.

^11 We ignore positional embeddings here for simplicity, as they do not affect our proofs of injectivity in App. G. Note that, in that section, we show injectivity on an entire layer's activations $h_{\psi_\ell}$. Position embeddings would be needed to show injectivity on a single position, e.g., the last token's position $(h_{\psi_\ell})_{-1}$; a property which we conjecture should also hold. Further, we note that position embeddings are used in our experiments.
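To make Submodules 3–5 concrete, here is a hedged NumPy sketch of a pre-LN transformer block. It simplifies to a single attention head (H = 1, with d_η = |ψ_ℓ|) and omits the causal mask; all parameter names and shapes below are our illustrative choices, not the paper's.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    # eq. (14): normalise each token's features, then scale and shift
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn(h, Wq, Wk, Wv, Wo):
    # eqs. (12)-(13) with a single head and no causal mask, for brevity
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return (scores @ v) @ Wo

def transformer_block(h, p):
    # eq. (15): attention and MLP sublayers, each with pre-LN and a residual
    h = h + attn(layer_norm(h, *p["ln1"]), *p["attn"])
    h2 = layer_norm(h, *p["ln2"])
    return h + np.maximum(h2 @ p["W1"], 0.0) @ p["W2"]   # per-token two-layer MLP

rng = np.random.default_rng(0)
d, T = 8, 5                                      # model width and sequence length
p = {
    "ln1": (np.ones(d), np.zeros(d)),
    "ln2": (np.ones(d), np.zeros(d)),
    "attn": tuple(0.1 * rng.normal(size=(d, d)) for _ in range(4)),
    "W1": 0.1 * rng.normal(size=(d, 4 * d)),
    "W2": 0.1 * rng.normal(size=(4 * d, d)),
}
print(transformer_block(rng.normal(size=(T, d)), p).shape)   # (5, 8)
```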
F Proof of Theorem 1

In this section, we prove our main theorem. For notational simplicity, we write in this section:
$$f^{:\ell}_N \overset{\mathrm{def}}{=} f^{\psi_\ell}_N = \underbrace{f^{\ell-1}_N \circ \cdots \circ f^0_N}_{\text{run DNN up to layer } \ell}, \qquad f^{\ell:}_N \overset{\mathrm{def}}{=} \underbrace{f^{L}_N \circ \cdots \circ f^{\ell}_N}_{\text{run DNN from layer } \ell} \tag{17}$$

We start by formally stating our assumptions.

Assumption 1 (Countable input-space). We assume that the space of inputs (i.e., $\mathcal{X}$) is countable.

Assumption 2 (Input-injectivity in all layers). We assume that $f^{:\ell}_N$ is injective for all layers.

Assumption 3 (Strict output-surjectivity in all layers). We assume that the composition of $f^{\ell:}_N$ and $\tau_{\eta_y}$ is strictly surjective for all layers (we define strict surjectivity in Definition 10).

Assumption 4 (Algorithm and DNN have matchable partial-orderings). We assume that there exists a partitioning $\{\psi_\eta \mid \eta\in\eta_{\mathrm{int}}\} \cup \{\psi_\bot\}$ of $N$'s neurons $\psi_{\mathrm{int}}$—where the $\psi_\eta$ are single neurons—which respects the partial ordering of algorithm $A$, i.e., $\eta \prec \eta' \Rightarrow \psi_\eta \prec \psi_{\eta'}$. Further, for each layer at least one neuron is left unused in this partitioning, i.e., $\psi_\bot \cap \psi_\ell \neq \emptyset$.

Assumption 5 (DNN solves the task). We assume that for any input $x\in\mathcal{X}$, the neural network solves the task correctly, satisfying $T(x) = \operatorname{argmax}_{y\in\mathcal{Y}} p_N(y \mid x)$.

We provide a longer discussion of why we think these assumptions are reasonable in App. F.1. For convenience, we also put a self-contained version of Definition 7 (input-restricted distributed abstraction) in App. F.2. Now, we restate our theorem and present its proof.

Theorem 1. Given any algorithm $A$ and any neural network $N$ such that Assumps. 1 to 5 hold, we can show that $A$ is an input-restricted distributed abstraction of $N$.

Proof. To show that an algorithm $A$ is an input-restricted distributed abstraction of a neural network $N$, we must show (according to Definition 7) that there exists a $\tau$ for which: $A$ is an input-restricted $\tau$-abstraction of $N$; and $\tau$ is a distributed abstraction map. For $\tau$ to be a distributed abstraction map, we need a partition of hidden variables which allows us to compute it independently per node. Further, we need the partitioned hidden variables $\psi^\phi_{\mathrm{int}}$ to be the output of an alignment map $\phi$ which is layer-wise decomposable. We thus have:
$$\underbrace{h_{\psi^\phi} = \phi(h) = [\phi_\ell(h_{\psi_\ell})]_{\ell=0}^{L+1}}_{\text{align DNN}}, \qquad \underbrace{\Psi^\phi = \{\psi^\phi_\eta \mid \eta\in\eta_{\mathrm{int}}\} \cup \{\psi^\phi_\bot\}}_{\text{partition hidden variables}}, \qquad \underbrace{\tau(h) = [\tau_\eta(h_{\psi^\phi_\eta})]_{\eta\in\eta_{\mathrm{int}}}}_{\text{compute abstraction}} \tag{18}$$

Therefore, to define a distributed abstraction map $\tau$, we must define the following three terms: (i) a set of layer-wise alignment maps $\{\phi_\ell\}_{\ell=1}^{L}$ (note that the alignment maps $\phi_0$ and $\phi_{L+1}$ are fixed by definition); (ii) a partition of hidden variables $\Psi^\phi$; and (iii) a set of per-node functions $\{\tau_\eta\}_{\eta\in\eta_{\mathrm{int}}}$. To prove this theorem, then, we must show that there exists a way to define these terms while ensuring that $A$ is an input-restricted $\tau$-abstraction of $N$.

We now note that—given Assump. 4 and independently of our choice of alignment map $\phi$—there exists at least one partition $\Psi^\phi$ of the hidden variables $\psi^\phi_{\mathrm{int}}$ in $N$ for which:
$$\underbrace{\forall_{\eta,\eta'\in\eta_{\mathrm{int}}}: \eta\prec\eta' \Rightarrow \psi^\phi_\eta \prec \psi^\phi_{\eta'}}_{\text{respect partial ordering}}, \qquad \underbrace{\forall_{\eta\in\eta_{\mathrm{int}}}: |\psi^\phi_\eta| = 1}_{\text{1 neuron per partition}}, \qquad \underbrace{\forall_{0<\ell\le L}: \psi^\phi_\ell \not\subseteq \bigcup_{\eta\in\eta_{\mathrm{int}}} \psi^\phi_\eta}_{\text{no layer is fully occupied}} \tag{19}$$
where we define $\psi^\phi_\ell$ as the latent variables given when applying $\phi_\ell$ to $\psi_\ell$. To facilitate our proof, we choose one such partition $\Psi^\phi$, which we will keep fixed independently of our choice of alignment map $\phi$.

Given partition $\Psi^\phi$, we can assign each node $\eta$ to a specific layer $\ell$, as $\psi^\phi_\eta$ contains a single hidden variable and therefore trivially belongs to a single layer. We can therefore define $\eta_\ell$ as all nodes associated with layer $\ell$:
$$\eta \in \eta_\ell \;\Leftrightarrow\; \psi^\phi_\eta \subseteq \psi^\phi_\ell \tag{20}$$

We now consider the application of interventions on $N$ as layer-wise on $\psi^\phi_\eta \subseteq \psi^\phi_\ell$ for $\eta\in\eta_\ell$. Let us therefore define $\mathcal{I}^{\ell}_N$ as the set of all interventions on $\psi^\phi_\eta$ for $\eta\in\eta_\ell$, where we note that $\mathcal{I}^{\ell}_N$ also includes an empty intervention (i.e., no intervention). For notational convenience, we will write the set of all interventions up to layer $\ell$ as $\mathcal{I}^{:\ell}_N$, and the set of all nodes associated with those layers as $\eta_{:\ell}$:
$$\mathcal{I}^{:\ell}_N = \mathcal{I}^{0}_N \times \mathcal{I}^{1}_N \times \cdots \times \mathcal{I}^{\ell}_N, \qquad \eta_{:\ell} = \eta_0 \cup \eta_1 \cup \cdots \cup \eta_\ell \tag{21}$$
where $\times$ denotes a Cartesian product. We analogously define $\mathcal{I}^{\ell}_A$ and $\mathcal{I}^{:\ell}_A$. Finally, we get to an induction proof that will complete this theorem.
We will iteratively construct abstraction and alignment maps for each layer such that the following holds:
$$\forall_{I_N\in\mathcal{I}^{:\ell-1}_N}\forall_{x\in\mathcal{X}}\,\forall_{\eta\in\eta_\ell}: \underbrace{\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = \phi_\ell(f^{:\ell}_N(x, I_N))}_{\text{pre-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell-1}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{Condition 1}$$
where "int." stands for intervention. Note that if this holds for all layers, we have proven that $A$ is an input-restricted $\tau$-abstraction of $N$, as we can perfectly reconstruct the behaviour of algorithm $A$ from $N$'s states under any intervention.^12 Also note, however, that our definition of abstraction map restricts $\tau_{\eta_y}(h_{\psi_{L+1}}) = \operatorname{argmax} h_{\psi_{L+1}}$, so special care must be taken to guarantee that this last identity will be preserved. We thus also require an additional condition to hold at each step:
$$\forall_{I_N\in\mathcal{I}^{:\ell}_N}\forall_{x\in\mathcal{X}}: \underbrace{\tau_{\eta_y}(f^{\ell:}_N(h'_{\psi_\ell})) = f_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi_\ell} = f^{:\ell}_N(x, I_N)}_{\text{post-int. neurons, as } I_N\in\mathcal{I}^{:\ell}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{Condition 2}$$
Finally, for convenience, we add a third condition to our inductive proof, which will make the other two conditions easier to guarantee:
$$\forall_{\eta\in\eta_{:\ell-1}}\exists_{g_\eta}\forall_{I_N\in\mathcal{I}^{:\ell-1}_N}\forall_{x\in\mathcal{X}}: \underbrace{g_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = \phi_\ell(f^{:\ell}_N(x, I_N))}_{\text{pre-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell-1}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{Condition 3}$$
This condition guarantees that information about previous nodes (i.e., $\eta\in\eta_{:\ell-1}$) is preserved in each layer's non-intervened neurons (i.e., $\psi^\phi_\ell \cap \psi^\phi_\bot$). This final condition will be useful to guarantee that Conditions 1 and 2 are preserved in future layers.

^12 The attentive reader may note Condition 1 only guarantees that we can reconstruct the behaviour of algorithm $A$ from pre-intervention hidden variables. Lemma 1 shows the same holds for post-intervention hidden variables.

Statement. Conditions 1, 2, and 3 hold for all layers $\ell$ in a DNN.

Base Case ($\ell = 0$). For layer $\ell = 0$, we have $\eta_\ell = \eta_x$. We also have both $\phi_\ell$ and $\tau_\eta$ as the identity function. Further, we consider $\mathcal{I}^{:-1}_N = \{\varnothing\}$ and $\mathcal{I}^{:0}_N = \{\varnothing\}$—where the symbol $\varnothing$ here denotes an empty intervention—and we consider $f^{:0}_N$ to be the identity on $x$. (Note that layer $f^{0}_N$ is not applied in $f^{:0}_N$.) Now, it is easy to prove our base case:
- Condition 1 follows trivially, as $f^{\eta_x}_A(x, I_A) = x$ and $f^{:0}_N(x, I_N) = x$.
- Condition 2 follows from Assump. 5.
- Condition 3 follows trivially, given $\eta_{:-1}$ is an empty set.

Induction Step (given $\ell-1$, then $\ell$). Now, due to the inductive hypothesis, we assume that Conditions 1, 2 and 3 hold for layer $(\ell-1)$. Given this, we must now prove that these conditions also hold for layer $\ell$. We will consider two cases: $\eta_\ell$ is either empty or not. Before doing so, however, we note that Conditions 1 and 3 hold for layer $(\ell-1)$'s pre-intervention hidden variables. In Lemmas 1 and 2, we show that the same applies for the post-intervention hidden variables.

Let us first consider the case where $\eta_\ell$ is empty. In this case, we can simply define $\phi_\ell$ as the identity map. Further, given an empty $\eta_\ell$, we know that there are no interventions in this layer, i.e., $\mathcal{I}^{\ell}_N = \{\varnothing\}$, and, as such, we have that $\mathcal{I}^{:\ell}_N = \mathcal{I}^{:\ell-1}_N \times \mathcal{I}^{\ell}_N = \mathcal{I}^{:\ell-1}_N$. We can now prove the induction step for this case.
- Condition 1 is true trivially, since $\eta_\ell$ is empty.
- Condition 2 follows using the inductive hypothesis. Let $I_N\in\mathcal{I}^{:\ell}_N$ and $x\in\mathcal{X}$. Now, let $h'_{\psi_\ell} = f^{:\ell}_N(x, I_N)$, $h'_{\psi_{\ell-1}} = f^{:\ell-1}_N(x, I_N)$, and $I_A = \omega_\tau(I_N)$.
We can now show that:
$$\begin{aligned}
\tau_{\eta_y}\big(f^{\ell:}_N(h'_{\psi_\ell})\big) &= \tau_{\eta_y}\big(f^{\ell:}_N(f^{:\ell}_N(x, I_N))\big) && \text{definition of } h'_{\psi_\ell} & (22a)\\
&= \tau_{\eta_y}\big(f^{\ell:}_N(f^{\ell-1}_N(f^{:\ell-1}_N(x, I_N)))\big) && \text{no intervention at layer } \ell & (22b)\\
&= \tau_{\eta_y}\big(f^{\ell-1:}_N(f^{:\ell-1}_N(x, I_N))\big) && \text{definition of } f^{\ell-1:}_N & (22c)\\
&= \tau_{\eta_y}\big(f^{\ell-1:}_N(h'_{\psi_{\ell-1}})\big) && \text{definition of } h'_{\psi_{\ell-1}} & (22d)\\
&= f_A(x, I_A) && \text{inductive hypothesis on Condition 2} & (22e)
\end{aligned}$$
This shows Condition 2 holds for layer $\ell$ when $\eta_\ell$ is empty.
- Condition 3 follows using the inductive hypothesis. Let $I_N\in\mathcal{I}^{:\ell-1}_N$, $x\in\mathcal{X}$, and $\eta\in\eta_{:\ell-1}$. Now, let $h'_{\psi^\phi_\ell} = \phi_\ell(f^{:\ell}_N(x, I_N))$, $h'_{\psi^\phi_{\ell-1}} = \phi_{\ell-1}(f^{:\ell-1}_N(x, I_N))$, and $I_A = \omega_\tau(I_N)$. Further, let $g^{\ell-1}_\eta(h'_{\psi^\phi_{\ell-1}}) = f^{:\eta}_A(x, I_A)$; we know such a function exists due to the inductive hypothesis on Conditions 1 and 3, together with Lemmas 1 and 2. Finally, since $f^{:\ell}_N$ is injective (by Assump. 2) and since $f^{:\ell}_N = f^{\ell-1}_N \circ f^{:\ell-1}_N$, we know that each $h'_{\psi^\phi_{\ell-1}}$ is mapped to a unique $h'_{\psi^\phi_\ell}$ in the next layer. We can thus define a function $\overline{f}^{\ell-1}_N$ on the domain formed by these hidden variables which, given a hidden variable $h'_{\psi^\phi_\ell}$, returns its "parent" $h'_{\psi^\phi_{\ell-1}}$; in other words, $\overline{f}^{\ell-1}_N$ is a partial inverse of $f^{\ell-1}_N$, defined only on its image. Defining now the function $g^{\ell}_\eta = g^{\ell-1}_\eta \circ \overline{f}^{\ell-1}_N$, we can show:
$$\begin{aligned}
g^{\ell}_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) &= g^{\ell}_\eta(h'_{\psi^\phi_\ell}) && \eta_\ell \text{ is empty} & (23a)\\
&= g^{\ell}_\eta\big(\phi_\ell(f^{\ell-1}_N(h'_{\psi^\phi_{\ell-1}}))\big) && \text{no intervention at layer } \ell & (23b)\\
&= g^{\ell}_\eta\big(f^{\ell-1}_N(h'_{\psi^\phi_{\ell-1}})\big) && \phi_\ell \text{ is the identity function} & (23c)\\
&= g^{\ell-1}_\eta\big(\overline{f}^{\ell-1}_N(f^{\ell-1}_N(h'_{\psi^\phi_{\ell-1}}))\big) && \text{definition of } g^{\ell}_\eta & (23d)\\
&= g^{\ell-1}_\eta(h'_{\psi^\phi_{\ell-1}}) && \text{definition of } \overline{f}^{\ell-1}_N & (23e)\\
&= f^{:\eta}_A(x, I_A) && \text{inductive hypothesis on Conditions 1 and 3} & (23f)
\end{aligned}$$

Let us now consider the case where $\eta_\ell$ is not empty. To show Conditions 1, 2 and 3 for layer $\ell$, we need to find a suitable bijective $\phi_\ell$. We now show a careful way to construct this map which satisfies these conditions. To do so, we will again split this step of the proof into two parts. We will first take care of the case in which no interventions are applied to layer $\ell$, guaranteeing that the model behaves correctly in those cases. In that case, we must handle the set of input-restricted pre-intervention hidden states in layer $\ell$, which we define as:
$$\mathcal{H}^{◆}_{\psi_\ell} \overset{\mathrm{def}}{=} \underbrace{\{f^{\psi_\ell}_N(x, I_N) \mid x\in\mathcal{X},\, I_N\in\mathcal{I}^{:\ell-1}_N\}}_{\text{pre-int., as we do not include } \mathcal{I}^{\ell}_N} \tag{24}$$
Notably, instead of defining the entire alignment map $\phi_\ell$ at once, we will first define its behaviour only on those hidden states. We will denote this domain-restricted function as $\phi^{◆}_\ell$. Given this function, we will be able to define a set of input-restricted pre-intervention hidden variables in layer $\ell$ as:
$$\mathcal{H}^{◆}_{\psi^\phi_\ell} \overset{\mathrm{def}}{=} \{\phi^{◆}_\ell(h_{\psi_\ell}) \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\} \tag{25}$$
where ◆ represents the non-intervened hidden states and variables, and we will use ❖ to represent the intervened instances. Note that $\mathcal{H}^{◆}_{\psi^\phi_\ell}$ is the set of representations output by alignment map $\phi^{◆}_\ell$. The second case we will consider will then handle interventions on this layer, and will again guarantee that the model behaves as expected in those cases. We thus define the set of input-restricted post-intervention hidden variables as:
$$\mathcal{H}_{\psi^\phi_\ell} \overset{\mathrm{def}}{=} \underbrace{\{f^{\psi^\phi_\ell}_N(x, I_N) \mid x\in\mathcal{X},\, I_N\in\mathcal{I}^{:\ell}_N\}}_{\text{post-int., as we include } \mathcal{I}^{\ell}_N} \tag{26}$$
Notably, what an intervention on layer $\ell$ does is re-combine the representations in $\mathcal{H}^{◆}_{\psi^\phi_\ell}$. We can thus write $\mathcal{H}_{\psi^\phi_\ell}$ in terms of $\mathcal{H}^{◆}_{\psi^\phi_\ell}$ as:
$$\mathcal{H}_{\psi^\phi_\ell} = \bigtimes_{\eta\in\eta_\ell} \underbrace{\{[\phi_\ell(h_{\psi_\ell})]_{\psi^\phi_\eta} \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\}}_{\text{pre-int. h.v., projected to } \psi^\phi_\eta} \;\times\; \underbrace{\{[\phi_\ell(h_{\psi_\ell})]_{\psi^\phi_\bot\cap\psi^\phi_\ell} \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\}}_{\text{pre-int. h.v., projected to } \psi^\phi_\bot} \tag{27}$$
We further define the set of input-restricted intervention-only hidden variables as:
$$\mathcal{H}^{❖}_{\psi^\phi_\ell} = \mathcal{H}_{\psi^\phi_\ell} \setminus \mathcal{H}^{◆}_{\psi^\phi_\ell} \tag{28}$$
By carefully defining the behaviour of $\phi_\ell$ on this set, we can guarantee the conditions above hold. In particular, we will define this part of the function via its inverse $(\phi^{❖}_\ell)^{-1}$, which maps these hidden variables back to hidden states. We therefore have $\phi_\ell$ and its partial inverse defined as:
$$\phi_\ell(h) = \begin{cases} \phi^{◆}_\ell(h) & \text{if } h\in\mathcal{H}^{◆}_{\psi_\ell} \\ \phi^{❖}_\ell(h) & \text{if } h\in\mathcal{H}^{❖}_{\psi_\ell} \end{cases} \qquad \phi^{-1}_\ell(h) = \begin{cases} (\phi^{◆}_\ell)^{-1}(h) & \text{if } h\in\mathcal{H}^{◆}_{\psi^\phi_\ell} \\ (\phi^{❖}_\ell)^{-1}(h) & \text{if } h\in\mathcal{H}^{❖}_{\psi^\phi_\ell} \end{cases} \tag{29}$$
We now define $\phi^{◆}_\ell$.

Definition 8. Partial map $\phi^{◆}_\ell: \mathcal{H}^{◆}_{\psi_\ell} \to \mathbb{R}^{|\psi_\ell|}$ is some fixed function that is injective on each dimension, i.e., $\forall_{i\in\{1,\dots,|\psi^\phi_\ell|\}}\forall_{h_1,h_2\in\mathcal{H}^{◆}_{\psi_\ell}}: h_1\neq h_2 \Rightarrow \phi^{◆}_\ell(h_1)_i \neq \phi^{◆}_\ell(h_2)_i$.

Such a function exists because $\mathcal{H}^{◆}_{\psi_\ell}$ is countable (Lemma 4) and $\mathbb{R}$ is uncountable. Further, its partial inverse $(\phi^{◆}_\ell)^{-1}$, defined on the image $\mathcal{H}^{◆}_{\psi^\phi_\ell}$, exists because $\phi^{◆}_\ell$ is injective. We can now prove that Conditions 1 and 3 hold. We also prove that Condition 2 holds when $I_N\in\mathcal{I}^{:\ell-1}_N$, i.e., when there is no intervention in layer $\ell$.
- Condition 1 follows using the inductive hypothesis. Let $I_N\in\mathcal{I}^{:\ell-1}_N$, $x\in\mathcal{X}$, and $\eta\in\eta_\ell$. Further, let $h'_{\psi^\phi_\ell} = \phi_\ell(f^{:\ell}_N(x, I_N))$, $h'_{\psi^\phi_{\ell-1}} = \phi_{\ell-1}(f^{:\ell-1}_N(x, I_N))$, and $I_A = \omega_\tau(I_N)$. Now, note that there exists a function $g_\eta$ for which $v_\eta = g_\eta(v_{\eta_{:\ell-1}})$, as the parents of $\eta$ are a subset of $\eta_{:\ell-1}$. It now suffices to show that $h'_{\psi^\phi_\eta}$ encodes information about $v_{\eta_{:\ell-1}} = [f^{:\eta}_A(x, I_A)]_{\eta\in\eta_{:\ell-1}}$. By the inductive hypothesis on Conditions 1 and 3, together with Lemmas 1 and 2, we know that $h'_{\psi^\phi_{\ell-1}}$ encodes information about $v_{\eta_{:\ell-1}}$; let $g_{\eta_{:\ell-1}}$ be a function that extracts this information, i.e., $g_{\eta_{:\ell-1}}(h'_{\psi^\phi_{\ell-1}}) = v_{\eta_{:\ell-1}}$. Now, since $f^{:\ell}_N$ is injective, and $\phi^{◆}_\ell$ is injective on each output dimension, we know that $h'_{\psi^\phi_\eta} = [\phi^{◆}_\ell(f^{\ell-1}_N(\phi^{-1}_{\ell-1}(h'_{\psi^\phi_{\ell-1}})))]_{\psi^\phi_\eta}$ contains the same information as $h'_{\psi^\phi_{\ell-1}}$. We can thus construct (partial) inverses $\overline{f}^{\ell-1}_N$ and $(\phi^{◆}_{\ell,\psi^\phi_\eta})^{-1}$ and define $\tau_\eta$ as the composition $g_\eta \circ g_{\eta_{:\ell-1}} \circ \phi_{\ell-1} \circ \overline{f}^{\ell-1}_N \circ (\phi^{◆}_{\ell,\psi^\phi_\eta})^{-1}$, which concludes this step of the proof:
$$\begin{aligned}
\tau_\eta([h'_{\psi^\phi_\ell}]_{\psi^\phi_\eta}) &= \tau_\eta\big([\phi^{◆}_\ell(f^{\ell-1}_N(\phi^{-1}_{\ell-1}(h'_{\psi^\phi_{\ell-1}})))]_{\psi^\phi_\eta}\big) && \text{definition of } h'_{\psi^\phi_\ell} & (30a)\\
&= g_\eta\Big(g_{\eta_{:\ell-1}}\Big(\phi_{\ell-1}\Big(\overline{f}^{\ell-1}_N\big((\phi^{◆}_{\ell,\psi^\phi_\eta})^{-1}\big([\phi^{◆}_\ell(f^{\ell-1}_N(\phi^{-1}_{\ell-1}(h'_{\psi^\phi_{\ell-1}})))]_{\psi^\phi_\eta}\big)\big)\Big)\Big)\Big) && & (30b)\\
&= g_\eta\big(g_{\eta_{:\ell-1}}(h'_{\psi^\phi_{\ell-1}})\big) && & (30c)\\
&= f^{:\eta}_A(x, I_A) && & (30d)
\end{aligned}$$
- Condition 2 when $I_N\in\mathcal{I}^{:\ell-1}_N$ follows using the inductive hypothesis. Let $I_N\in\mathcal{I}^{:\ell-1}_N$ and $x\in\mathcal{X}$. Now, let $h'_{\psi_\ell} = f^{:\ell}_N(x, I_N)$, $h'_{\psi_{\ell-1}} = f^{:\ell-1}_N(x, I_N)$, and $I_A = \omega_\tau(I_N)$. We can show that:
$$\begin{aligned}
\tau_{\eta_y}\big(f^{\ell:}_N(h'_{\psi_\ell})\big) &= \tau_{\eta_y}\big(f^{\ell:}_N(f^{:\ell}_N(x, I_N))\big) && \text{definition of } h'_{\psi_\ell} & (31a)\\
&= \tau_{\eta_y}\big(f^{\ell:}_N(f^{\ell-1}_N(f^{:\ell-1}_N(x, I_N)))\big) && \text{no intervention at layer } \ell & (31b)\\
&= \tau_{\eta_y}\big(f^{\ell-1:}_N(f^{:\ell-1}_N(x, I_N))\big) && \text{definition of } f^{\ell-1:}_N & (31c)\\
&= \tau_{\eta_y}\big(f^{\ell-1:}_N(h'_{\psi_{\ell-1}})\big) && \text{definition of } h'_{\psi_{\ell-1}} & (31d)\\
&= f_A(x, I_A) && \text{inductive hypothesis on Condition 2} & (31e)
\end{aligned}$$
This shows Condition 2 holds for layer $\ell$ when there is no intervention in layer $\ell$.
- Condition 3 follows using the inductive hypothesis. Let $I_N\in\mathcal{I}^{:\ell-1}_N$, $x\in\mathcal{X}$, and $\eta\in\eta_{:\ell-1}$. Now, let $h'_{\psi^\phi_\ell} = \phi_\ell(f^{:\ell}_N(x, I_N))$, $h'_{\psi^\phi_{\ell-1}} = \phi_{\ell-1}(f^{:\ell-1}_N(x, I_N))$, and $I_A = \omega_\tau(I_N)$. Further, let $g^{\ell-1}_\eta(h'_{\psi^\phi_{\ell-1}}) = f^{:\eta}_A(x, I_A)$; we know such a function exists due to the inductive hypothesis on Conditions 1 and 3, together with Lemmas 1 and 2.
Finally, since $f^{:\ell}_N$ and $\phi^{◆}_\ell$ are injective and $\phi_{\ell-1}$ is bijective, we can define the partial inverse function $\overline{f}^{\ell-1}_N$ of their composition $\phi^{◆}_\ell \circ f^{\ell-1}_N \circ \phi^{-1}_{\ell-1}$ (applied only to their image) which, given the hidden variable $h'_{\psi^\phi_\ell}$, returns its "parent" $h'_{\psi^\phi_{\ell-1}}$. Defining now a function $\widehat{g}^{\ell}_\eta = g^{\ell-1}_\eta \circ \overline{f}^{\ell-1}_N$ and $g^{\ell}_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = \widehat{g}^{\ell}_\eta(h'_{\psi^\phi_\ell})$—which exists, as $\phi^{◆}_\ell$ is injective on each dimension—we can show:
$$\begin{aligned}
g^{\ell}_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) &= \widehat{g}^{\ell}_\eta(h'_{\psi^\phi_\ell}) && \phi^{◆}_\ell \text{ is injective on all dimensions} & (32a)\\
&= \widehat{g}^{\ell}_\eta\big(\phi^{◆}_\ell(f^{\ell-1}_N(\phi^{-1}_{\ell-1}(h'_{\psi^\phi_{\ell-1}})))\big) && \text{no intervention at layer } \ell & (32b)\\
&= g^{\ell-1}_\eta\big(\overline{f}^{\ell-1}_N(\phi^{◆}_\ell(f^{\ell-1}_N(\phi^{-1}_{\ell-1}(h'_{\psi^\phi_{\ell-1}}))))\big) && \text{definition of } g^{\ell}_\eta & (32c)\\
&= g^{\ell-1}_\eta(h'_{\psi^\phi_{\ell-1}}) && \text{definition of } \overline{f}^{\ell-1}_N & (32d)\\
&= f^{:\eta}_A(x, I_A) && \text{inductive hypothesis on Conditions 1 and 3} & (32e)
\end{aligned}$$

We have now proved Conditions 1 and 3. We have also partially proved Condition 2, for cases where there is no intervention in layer $\ell$.^13 We now finish our proof by considering cases where there is an intervention in this layer $\ell$. In this second case, we need to handle the intervention-only representations $\mathcal{H}^{❖}_{\psi^\phi_\ell}$. We will now define $(\phi^{❖}_\ell)^{-1}$ on this domain $\mathcal{H}^{❖}_{\psi^\phi_\ell}$ to fulfil Condition 2.

Definition 9. Partial map $(\phi^{❖}_\ell)^{-1}: \mathcal{H}^{❖}_{\psi^\phi_\ell} \to \mathbb{R}^{|\psi_\ell|}$ is some fixed function such that:
1. $(\phi^{❖}_\ell)^{-1}$ maps to the set $\mathcal{H}_{\psi_\ell} \setminus \mathcal{H}^{◆}_{\psi_\ell}$;
2. $(\phi^{❖}_\ell)^{-1}$ is an injective map;
3. Let $I_N \in \mathcal{I}^{:\ell}_N \setminus \mathcal{I}^{:\ell-1}_N$ and $x\in\mathcal{X}$. Now, let $h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)$, and $I_A = \omega_\tau(I_N)$. We have that $f^{\ell:}_N((\phi^{❖}_\ell)^{-1}(h'_{\psi^\phi_\ell})) = f_A(x, I_A)$.

The first two characteristics ensure the necessary bijectivity of $\phi_\ell$, and the last ensures Condition 2. Now, let $x\in\mathcal{X}$ be any input and $I_N\in\mathcal{I}^{:\ell}_N$ any intervention. Further, let $h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)$, and $I_A = \omega_\tau(I_N)$. We now note that—given Conditions 1 and 3, and Lemmas 1 and 2—the values $v_\eta = f^{:\eta}_A(x, I_A)$ for all nodes $\eta\in\eta_{:\ell}$ are encoded in $h'_{\psi^\phi_\ell}$. This is enough information to determine the algorithm's output $y^\star = f_A(x, I_A)$. Now, define a function $g^\star$ which maps an element $h'_{\psi^\phi_\ell}\in\mathcal{H}^{❖}_{\psi^\phi_\ell}$ to the output algorithm $A$ expects. Further, by Lemma 6 there exists an uncountably infinite set of hidden states $\mathcal{H}^{(y^\star)}_{\psi_\ell}$ such that:
$$\forall h\in\mathcal{H}^{(y^\star)}_{\psi_\ell}: y^\star = \operatorname{argmax}_{y'\in\mathcal{Y}} [f^{\ell:}_N(h)]_{y'} \tag{33}$$
We define $\widehat{\mathcal{H}}^{(y^\star)}_{\psi_\ell} = \mathcal{H}^{(y^\star)}_{\psi_\ell} \setminus \mathcal{H}^{◆}_{\psi_\ell}$, which—as $\mathcal{H}^{◆}_{\psi_\ell}$ is countable—is still uncountably infinite. We can now map any $h\in\mathcal{H}^{❖}_{\psi^\phi_\ell}$ to an element in $\widehat{\mathcal{H}}^{(g^\star(h))}_{\psi_\ell}$, fulfilling the third characteristic of Definition 9. That such a mapping exists while adhering to the first and second characteristics of Definition 9 is ensured by the fact that $\widehat{\mathcal{H}}^{(y^\star)}_{\psi_\ell}$ is uncountable and $\mathcal{H}^{❖}_{\psi^\phi_\ell}$ is countable (shown in Lemma 5). Further, as $(\phi^{❖}_\ell)^{-1}$ is injective, its partial inverse $\phi^{❖}_\ell$ on its image $\mathcal{H}^{❖}_{\psi_\ell}$ exists.

^13 This also proves the result for any input $x\in\mathcal{X}$ and intervention $I_N\in\mathcal{I}^{:\ell}_N$ where $h\in\mathcal{H}^{◆}_{\psi_\ell}$. By Condition 3 at layer $\ell$, there exists $I'_N\in\mathcal{I}^{:\ell-1}_N$—constructed by applying only the interventions from $I_N$ on layers $<\ell$—such that $f^{:\ell}_N(x, I_N) = f^{:\ell}_N(x, I'_N)$. Since both encode the same values on $\eta_{:\ell-1}$ according to Condition 3, which fully determine the output of $f_A$, and since Condition 2 holds for $(x, I'_N)$ at layer $\ell$, it must also hold for $(x, I_N)$.

    def extract_unused_rep(h, S):   # S = H◆_{ψφ_ℓ} ∪ H❖_{ψφ_ℓ}
        rep = h
        while rep in S:
            rep = phi_ell_inverse(rep)   # apply φ⁻¹_ℓ until leaving S
        return rep

Figure 7: Pseudo-code for extract_unused_rep. This function returns a unique element in $(\mathcal{H}^{◆}_{\psi_\ell}\cup\mathcal{H}^{❖}_{\psi_\ell})\setminus(\mathcal{H}^{◆}_{\psi^\phi_\ell}\cup\mathcal{H}^{❖}_{\psi^\phi_\ell})$ for each $h\in(\mathcal{H}^{◆}_{\psi^\phi_\ell}\cup\mathcal{H}^{❖}_{\psi^\phi_\ell})\setminus(\mathcal{H}^{◆}_{\psi_\ell}\cup\mathcal{H}^{❖}_{\psi_\ell})$, ensuring bijectivity of $\phi'_\ell$.

The attentive reader may have noticed that we defined $\phi_\ell$ only over the domain $\mathcal{H}^{◆}_{\psi_\ell}\cup\mathcal{H}^{❖}_{\psi_\ell}$ instead of over $\mathbb{R}^{|\psi_\ell|}$.
We note that it is simple to extend $\phi_\ell$ to a $\phi'_\ell$ defined over $\mathbb{R}^{|\psi_\ell|}$. Let id be the identity function and extract_unused_rep be defined by the algorithm given in Fig. 7. A bijective function $\phi'_\ell$ over $\mathbb{R}^{|\psi_\ell|}$ mapping to $\mathbb{R}^{|\psi_\ell|}$ can be defined as:
$$\phi'_\ell(h) = \begin{cases} \phi_\ell(h) & \text{if } h\in\mathcal{H}^{◆}_{\psi_\ell}\cup\mathcal{H}^{❖}_{\psi_\ell} \\ \mathrm{extract\_unused\_rep}(h, \mathcal{H}^{◆}_{\psi^\phi_\ell}\cup\mathcal{H}^{❖}_{\psi^\phi_\ell}) & \text{if } h\in(\mathcal{H}^{◆}_{\psi^\phi_\ell}\cup\mathcal{H}^{❖}_{\psi^\phi_\ell})\setminus(\mathcal{H}^{◆}_{\psi_\ell}\cup\mathcal{H}^{❖}_{\psi_\ell}) \\ h & \text{else} \end{cases} \tag{34}$$
which completes the proof. ∎

F.1 Discussion about Assumptions

Assump. 1 (Countable input-space). While this assumption does not hold for all neural networks (e.g., MLPs operating on real-valued inputs), it holds for models working on language and images. The set of all images with a specific resolution is finite, as it considers a finite number of pixels, where each pixel has a finite number of channels (e.g., values for red, green and blue) and each channel is a number between 0 and 255. The set of all sequences in a language model is also countably infinite, as each set of sequences of some length is finite given finite tokens; so we have a set made out of the countable union of finite sets, which is still countable.

Assump. 2 (Input-injectivity in all layers). Neural network layers (e.g., MLP blocks) are not necessarily injective. The usage of learnable weights, activation functions like ReLU (Nair and Hinton, 2010), and information bottlenecks makes it possible to have a non-injective model. However, we prove in App. G that transformers, at least, are almost surely injective at initialisation on their inputs. Further, Nikolaou et al. (2025) recently published a proof—as well as empirical evidence—that transformers are almost surely injective in the hidden states of their last token, both at initialisation and after training. We also see in our empirical experiments in App. H that the MLPs we analyse are also, in practice, injective—or close enough to it that we observe no collisions in embedding space.

Assump. 3 (Strict output-surjectivity in all layers). Surjectivity can be defined on the output distribution $f^{\ell:}_N: \mathcal{H}_{\psi_\ell}\to\Delta^{|\mathcal{Y}|-1}$, but that is a rather strong assumption. For our proofs, we instead rely on strict surjectivity on the classification space ($\tau_{\eta_y}\circ f^{\ell:}_N: \mathcal{H}_{\psi_\ell}\to\mathcal{Y}$), such that every class can be predicted. However, surjectivity on the classification space still does not necessarily hold for DNNs. LLMs have problems like the softmax bottleneck (Yang et al., 2018), which can leave a model with insufficient capacity to predict all possible tokens. Grivas et al. (2022) also evaluate and find this problem, but show that surjectivity on the tokens is still likely in practice, making this a reasonable assumption in LLM settings.

Assump. 4 (Algorithm and DNN have matchable partial-orderings). We assume this since, for a neural network $N$ to be abstracted by the algorithm $A$, we need it to have this minimal width and depth.

Assump. 5 (DNN solves the task). We assume this because, if a neural network does not solve the given task, it will also not be abstracted by an algorithm which implements it.

F.2 Detailed Version of Definition 7

Definition 7 can also be written without referring to previous definitions as follows:

Alternative Definition 1 (Equivalent to Definition 7). An algorithm $A$ is an input-restricted distributed abstraction of a neural network $N$ iff there exist a $\tau$, $\mathcal{I}_A$, and $\mathcal{I}_N$ such that:
- $\tau$ is a distributed abstraction map.
  I.e., there exists an alignment map $\phi$, a latent-variable partition $\{\psi^\phi_\eta \mid \eta\in\eta_{\mathrm{int}}\}\cup\{\psi^\phi_\bot\}$ of $\psi^\phi_{\mathrm{int}}$ (with non-empty $\psi^\phi_\eta$), and subabstraction maps $\{\tau_\eta \mid \eta\in\eta_{\mathrm{int}}\}$ such that $\tau$ is equivalent to computing the value of each node block-wise with $\tau_\eta(h_{\psi^\phi_\eta})$. An alignment map $\phi$ is a bijective function that maps the inner neurons $\psi_{\mathrm{int}}$ of $N$ onto an equal-sized set of latent variables $\psi^\phi_{\mathrm{int}}$, with $\phi$ respecting the network's computational order by being the combination of layer-wise bijections $\phi_\ell: \mathbb{R}^{|\psi_\ell|}\to\mathbb{R}^{|\psi_\ell|}$ applied to the neurons of each of the DNN's layers ($\ell$);
- $\mathcal{I}_A$ and $\mathcal{I}_N$ are a maximal input-restricted intervention set. A maximal input-restricted intervention set is composed of all interventions produced from other input-restricted interventions, i.e., it is a set with $h_{\psi^\phi}\leftarrow c_{\psi^\phi}$ (where $\psi^\phi\subseteq\psi^\phi_{\mathrm{int}}$) or $v_\eta\leftarrow c_\eta$ (where $\eta\subseteq\eta_{\mathrm{int}}$), where $c_{\psi^\phi}$ or $c_\eta$ arise from valid input-restricted computations (e.g., $c_{\psi^\phi} = f^{\psi^\phi}_N(x)$ or $c_\eta = f^{:\eta}_A(x, c_\eta\leftarrow f^{:\eta}_A(x))$);
- $\tau$ is surjective;
- $\mathcal{I}_A = \omega_\tau(\mathcal{I}_N)$;
- There exists a surjective $\tau_{\eta_x}$ such that
$$\forall_{x\in\mathcal{X}}\forall_{I_N\in\mathcal{I}_N}: \tau(f^{\psi^\phi_{\mathrm{int}}}_N(x, I_N)) = f^{:\eta_{\mathrm{int}}}_A(\tau_{\eta_x}(x), I_A) \quad \text{where } I_A = \omega_\tau(I_N) \tag{35}$$

F.3 Useful Definitions and Lemmas for Theorem 1

Definition 10. We say the composition of a function $f:\mathbb{R}^d\to\Delta^{|\mathcal{Y}|-1}$ with argmax is strictly surjective if, for any output $y^\star\in\mathcal{Y}$, there exists an input $h\in\mathbb{R}^d$ for which $f$ outputs $y^\star$ no matter how ties are broken in the argmax. Formally:
$$\forall y^\star\in\mathcal{Y}, \exists h\in\mathbb{R}^d, \forall y'\in\mathcal{Y}\setminus\{y^\star\}: [f(h)]_{y^\star} > [f(h)]_{y'} \tag{36}$$

Lemma 1. Let $N$ be a DNN and $A$ be an algorithm. Further, let $\tau$ be a distributed abstraction map with partition $\Psi$ and maps $\{\tau_\eta\}_{\eta\in\eta_{\mathrm{int}}}$. Suppose that, for all $\eta\in\eta_\ell$, $\tau_\eta$ satisfies Condition 1 (defined in Theorem 1's proof) applied to layer $\ell$'s pre-intervention hidden variables, i.e.:
$$\forall_{I_N\in\mathcal{I}^{:\ell-1}_N}\forall_{x\in\mathcal{X}}: \underbrace{\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)}_{\text{pre-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell-1}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{37}$$
where we note that $f^{\psi^\phi_\ell}_N = \phi_\ell\circ f^{:\ell}_N$ when no intervention is applied to layer $\ell$. Then $\tau_\eta$ also satisfies this condition when applied to layer $\ell$'s post-intervention hidden variables:
$$\forall_{I_N\in\mathcal{I}^{:\ell}_N}\forall_{x\in\mathcal{X}}: \underbrace{\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)}_{\text{post-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{38}$$

Proof. Let $\tau_\eta$ be the abstraction map of $\eta\in\eta_\ell$. By assumption, Condition 1 holds for all pre-intervention hidden variables, i.e., hidden variables of the form $h'_{\psi^\phi_\eta} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\eta}$. We can show the same function applies to post-intervention hidden variables, i.e., hidden variables of the form:
$$h'_{\psi^\phi_\eta} = \begin{cases} c_{\psi^\phi_\eta} & \text{if } h'_{\psi^\phi_\eta}\leftarrow c_{\psi^\phi_\eta}\in I_N \\ [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\eta} & \text{else} \end{cases} \tag{39}$$
Now let $I_N\in\mathcal{I}^{:\ell}_N$ be any intervention and $x\in\mathcal{X}$ be any input. Further, let $I_A = \omega_\tau(I_N)$. If $I_N\in\mathcal{I}^{:\ell-1}_N$, then the post-intervention hidden variable is identical to a pre-intervention one, and the conditions in Condition 1 still hold, i.e., $h'_{\psi^\phi_\eta} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\eta}$ and $\tau_\eta$ is such that $\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)$. If $I_N\notin\mathcal{I}^{:\ell-1}_N$, then for each node's hidden variables $\psi^\phi_\eta$ we might or might not intervene on it. If we do not intervene on node $\eta$, then we still have the case $h'_{\psi^\phi_\eta} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\eta}$, and thus $\tau_\eta$ still gives us the correct solution, i.e., $\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)$. If we do intervene on $\eta$, then we know there exists an intervention of the form $h_{\psi^\phi_\eta}\leftarrow f^{:\ell}_N(x', I'_N)$ in $I_N$, for which $I'_N\in\mathcal{I}^{:\ell-1}_N$, as our interventions are input-restricted.
We also know (by eq. (4)) that there exists an equivalent intervention $v_\eta\leftarrow\tau_\eta(f^{:\ell}_N(x', I'_N))$ in $I_A$. We thus have that $\tau_\eta(h'_{\psi^\phi_\eta}) = f^{\eta}_A(x, I_A)$. ∎

Lemma 2. Let $N$ be a DNN and $A$ be an algorithm. Further, let $\tau$ be a distributed abstraction map with partition $\Psi$. Suppose that, for all $\eta\in\eta_{:\ell-1}$, there exists a function $g_\eta$ which satisfies Condition 3 (defined in Theorem 1's proof) applied to layer $\ell$'s pre-intervention hidden variables, i.e.:
$$\forall_{I_N\in\mathcal{I}^{:\ell-1}_N}\forall_{x\in\mathcal{X}}: \underbrace{g_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)}_{\text{pre-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell-1}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{40}$$
where we note that $f^{\psi^\phi_\ell}_N = \phi_\ell\circ f^{:\ell}_N$ when no intervention is applied to layer $\ell$. Then $g_\eta$ also satisfies this condition when applied to layer $\ell$'s post-intervention hidden variables:
$$\forall_{I_N\in\mathcal{I}^{:\ell}_N}\forall_{x\in\mathcal{X}}: \underbrace{g_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = f^{\eta}_A(x, I_A)}_{\text{tested condition}}, \quad \text{where } \underbrace{h'_{\psi^\phi_\ell} = f^{\psi^\phi_\ell}_N(x, I_N)}_{\text{post-int. hidden variable, as } I_N\in\mathcal{I}^{:\ell}_N} \text{ and } I_A = \omega_\tau(I_N) \tag{41}$$

Proof. Let $\eta\in\eta_{:\ell-1}$ and let $g_\eta$ be a function that satisfies Condition 3 for it. Note that Condition 3 holds for all pre-intervention hidden variables, i.e., hidden variables of the form $h'_{\psi^\phi_\ell\cap\psi^\phi_\bot} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\ell\cap\psi^\phi_\bot}$. We can show the same function $g_\eta$ also applies to post-intervention hidden variables, i.e., hidden variables of the form:
$$h'_{\psi^\phi_\ell\cap\psi^\phi_\bot} = \begin{cases} c_{\psi^\phi_\ell\cap\psi^\phi_\bot} & \text{if } h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}\leftarrow c_{\psi^\phi_\ell\cap\psi^\phi_\bot}\in I_N \\ [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\ell\cap\psi^\phi_\bot} & \text{else} \end{cases} \tag{42}$$
Now, let $I_N\in\mathcal{I}^{:\ell}_N$ and $x\in\mathcal{X}$. Further, let $I_A = \omega_\tau(I_N)$. If $I_N\in\mathcal{I}^{:\ell-1}_N$, then the post-intervention hidden variable is identical to a pre-intervention one, and Condition 3 still holds, i.e., $h'_{\psi^\phi_\ell\cap\psi^\phi_\bot} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\ell\cap\psi^\phi_\bot}$ and $g_\eta$ is such that $g_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = f^{\eta}_A(x, I_A)$. If $I_N\notin\mathcal{I}^{:\ell-1}_N$, it means that we intervene on at least one hidden variable in this layer $\psi^\phi_\ell$. However, we never intervene on neurons in $\psi^\phi_\bot$, meaning that for those we still have the case $h'_{\psi^\phi_\ell\cap\psi^\phi_\bot} = [\phi_\ell(f^{:\ell}_N(x, I_N))]_{\psi^\phi_\ell\cap\psi^\phi_\bot}$, and thus the same function $g_\eta$ still satisfies our condition $g_\eta(h'_{\psi^\phi_\ell\cap\psi^\phi_\bot}) = f^{\eta}_A(x, I_A)$. ∎

Lemma 3. Under Assump. 1 and given a fixed $\phi$, the set of input-restricted interventions $\mathcal{I}_N$ is countable.

Proof. This can be shown by induction. More specifically, we show that for any layer $\ell$, the set of input-restricted interventions $\mathcal{I}^{:\ell}_N$ is countable for a specific $\phi$.

Base Case ($\ell = 0$). The base case can be proved trivially, as $\mathcal{I}^{:0}_N = \{\varnothing\}$.

Induction step ($\ell$ given $\ell-1$). By the induction hypothesis, $\mathcal{I}^{:\ell-1}_N$ is countable. Now, note that $\mathcal{I}^{:\ell}_N$ can be decomposed as:
$$\mathcal{I}^{:\ell}_N = \mathcal{I}^{:\ell-1}_N \times \mathcal{I}^{\ell}_N \tag{43}$$
As the Cartesian product of two countable sets is itself countable, and as $\mathcal{I}^{:\ell-1}_N$ is countable by the inductive hypothesis, we only need to show that $\mathcal{I}^{\ell}_N$ is countable to complete our proof. The set $\mathcal{I}^{\ell}_N$ is defined as the set of all input-restricted interventions on layer $\ell$. Given a set of neurons or hidden variables $\psi'$ in this layer, we are thus dealing with interventions of the form $h_{\psi'}\leftarrow f^{\psi'}_N(x, I_N)$, where: (i) $\psi'\subseteq\psi_\ell$ or $\psi'\subseteq\psi^\phi_\ell$; (ii) $x\in\mathcal{X}$; and (iii) $I_N\in\mathcal{I}^{:\ell-1}_N$. The set of all input-restricted interventions on this layer is thus bounded in size by the Cartesian product $\bigtimes_{\psi\in\psi^\phi_\ell}(\mathcal{X}\times\mathcal{I}^{:\ell-1}_N)$. These three sets are countable, and thus so is $\mathcal{I}^{\ell}_N$. This concludes our proof. ∎

Lemma 4. Under Assump. 1 and given a fixed $\phi$, the set of input-restricted pre-intervention hidden states in layer $\ell$, i.e., $\mathcal{H}^{◆}_{\psi_\ell}$, is countable.
Proof. The set of input-restricted hidden states $\mathcal{H}^{◆}_{\psi_\ell}$ is formed by hidden states $h_{\psi_\ell} = f^{\psi_\ell}_N(x, I_N)$, which we can write as:
$$\mathcal{H}^{◆}_{\psi_\ell} = \{f^{\psi_\ell}_N(x, I_N) \mid x\in\mathcal{X},\, I_N\in\mathcal{I}^{:\ell-1}_N\} \tag{44}$$
We thus have that the size of $\mathcal{H}^{◆}_{\psi_\ell}$ is bounded by the size of the Cartesian product $\mathcal{X}\times\mathcal{I}^{:\ell}_N$. As both of these sets are countable (by Assump. 1 and Lemma 3, respectively), $\mathcal{H}^{◆}_{\psi_\ell}$ is also countable. This completes our proof. ∎

Lemma 5. Under Assump. 1 and given a fixed $\phi$, the set of input-restricted intervention-only hidden variables in layer $\ell$, i.e., $\mathcal{H}^{❖}_{\psi^\phi_\ell}$, is countable.

Proof. A similar proof to Lemma 4 applies here. In short, we have three relevant sets for this proof. First, the set of input-restricted pre-intervention hidden variables:
$$\mathcal{H}^{◆}_{\psi^\phi_\ell} = \{\phi_\ell(h_{\psi_\ell}) \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\} \tag{45}$$
Second, we have the set of input-restricted post-intervention hidden variables:
$$\mathcal{H}_{\psi^\phi_\ell} = \bigtimes_{\eta\in\eta_\ell} \underbrace{\{[\phi_\ell(h_{\psi_\ell})]_{\psi^\phi_\eta} \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\}}_{\text{pre-int. h.v., projected to } \psi^\phi_\eta} \;\times\; \underbrace{\{[\phi_\ell(h_{\psi_\ell})]_{\psi^\phi_\bot\cap\psi^\phi_\ell} \mid h_{\psi_\ell}\in\mathcal{H}^{◆}_{\psi_\ell}\}}_{\text{pre-int. h.v., projected to } \psi^\phi_\bot} \tag{46}$$
Both sets above are countable, since $\mathcal{H}^{◆}_{\psi_\ell}$ is countable (by Lemma 4) and $\eta_\ell$ is finite. Third, we have the set of input-restricted intervention-only hidden variables, defined as:
$$\mathcal{H}^{❖}_{\psi^\phi_\ell} = \mathcal{H}_{\psi^\phi_\ell} \setminus \mathcal{H}^{◆}_{\psi^\phi_\ell} \tag{47}$$
Since $\mathcal{H}_{\psi^\phi_\ell}$ is countable, $\mathcal{H}^{❖}_{\psi^\phi_\ell}$ is clearly also countable. This completes the proof. ∎

Lemma 6. Under Assump. 3 and given a target output $y^\star\in\mathcal{Y}$, we know that there is an uncountably infinite set $\mathcal{H}^{(y^\star)}_{\psi_\ell}$ which predicts it, i.e.:
$$h\in\mathcal{H}^{(y^\star)}_{\psi_\ell} \;\Leftrightarrow\; y^\star = \operatorname{argmax}_{y'\in\mathcal{Y}} [f^{\ell:}_N(h)]_{y'} \tag{48}$$

Proof. Under Assump. 3, we know that—for any target output $y^\star\in\mathcal{Y}$—there is at least one hidden state $h^\star\in\mathbb{R}^{|\psi_\ell|}$ which predicts it, i.e.:
$$y^\star = \operatorname{argmax}_{y'\in\mathcal{Y}} [f^{\ell:}_N(h^\star)]_{y'} \tag{49}$$
where we note that $f^{\ell:}_N(h^\star)$ outputs a probability distribution over $\mathcal{Y}$, i.e., $p_N(y'\mid h^\star)$. To show that we have an uncountably infinite set, let us first notice that:
$$\forall h'\in\mathbb{R}^{|\psi_\ell|}: \|f^{\ell:}_N(h^\star) - f^{\ell:}_N(h')\|_2 < \frac{m_1 - m_2}{2} \;\Rightarrow\; y^\star = \operatorname{argmax}_{y'\in\mathcal{Y}} [f^{\ell:}_N(h')]_{y'} \tag{50}$$
where $m_1$ is the max value of $f^{\ell:}_N(h^\star)$ and $m_2$ the second-highest value of $f^{\ell:}_N(h^\star)$. That $m_1 > m_2$ follows from the strict surjectivity mentioned in Assump. 3. Eq. (50) follows from the definition of the Euclidean norm ($\|\cdot\|_2$), argmax, and $f^{\ell:}_N(h')\in\Delta^{|\mathcal{Y}|-1}$: $m_1$ has to be lowered by at least $\frac{m_1-m_2}{2}$ and $m_2$ increased by $\frac{m_1-m_2}{2}$ for those two values to become equal; increasing any other value of $f^{\ell:}_N(h^\star)$ above $m_1$ would require $m_1$ being lowered by more than $\frac{m_1-m_2}{2}$ or that value being increased by more than $\frac{m_1-m_2}{2}$. Now, given the continuity of neural networks, we know that:
$$\forall\varepsilon>0, \exists\delta>0, \forall h'\in\mathbb{R}^{|\psi_\ell|}: 0<\|h^\star-h'\|_2<\delta \;\Rightarrow\; \|f^{\ell:}_N(h^\star)-f^{\ell:}_N(h')\|_2<\varepsilon \tag{51}$$
Therefore, we see that:
$$\exists\delta>0, \forall h'\in\mathbb{R}^{|\psi_\ell|}: 0<\|h^\star-h'\|_2<\delta \;\Rightarrow\; \|f^{\ell:}_N(h^\star)-f^{\ell:}_N(h')\|_2<\frac{m_1-m_2}{2} \tag{52a}$$
$$\Rightarrow\; y^\star = \operatorname{argmax}_{y'\in\mathcal{Y}} [f^{\ell:}_N(h')]_{y'} \tag{52b}$$
We notice that $\|h^\star-h'\|_2<\delta$ for $\delta>0$ denotes a continuous region in $\mathbb{R}^{|\psi_\ell|}$, which therefore includes uncountably infinitely many points. ∎
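As a concrete numerical instance of the margin argument in eqs. (50) to (52) (our illustration, not an example from the paper): suppose
$$f^{\ell:}_N(h^\star) = (0.6,\; 0.3,\; 0.1), \qquad m_1 = 0.6,\quad m_2 = 0.3,\quad \frac{m_1-m_2}{2} = 0.15.$$
Then any $h'$ with $\|f^{\ell:}_N(h^\star) - f^{\ell:}_N(h')\|_2 < 0.15$ perturbs each coordinate by less than $0.15$, so the first coordinate stays above $0.45$ while every other coordinate stays below $0.45$, and the argmax is unchanged.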
G Transformers at Initialisation are Almost Surely Injective on each Layer

Theorem 2. Transformers as in DNN 2, with weights randomly, independently initialised from a continuous distribution (riicd.), are almost surely injective at initialisation up to each layer $0\le\ell<L$.

Proof. To show injectivity up to a layer $\ell'$ in a transformer, it suffices to show that $f^{\ell}_N$ is injective on any countable subset $\mathcal{H}$ of its domain, for all layers $\ell$ ($0\le\ell\le\ell'$). This suffices as we assume the set of inputs $\mathcal{X}$ is countable, and the composition of injective functions is injective.

Let $\Theta$ be the random variable representing the transformer's weights. To show (almost sure) injectivity at layer $\ell$ for any fixed input set $\mathcal{H}$, we need that (because of Lemma 10):^14
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, h_1\neq h_2: f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) = 1 \tag{53}$$
Since the transformer operates over sequences of tokens, any element $h\in\mathcal{H}$ has its first dimension indexing the sequence length. Let $|h|$ denote the sequence length and $[h]_t$ refer to the $t$-th element in $h$. Let $T$ be the set of token positions $T = \{1,\dots,\min(|h_1|,|h_2|)\}$. For injectivity, it suffices to show that:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, t\in T, [h_1]_t\neq[h_2]_t: [f^{\ell}_N(h_1)]_t \neq [f^{\ell}_N(h_2)]_t\big) = 1 \tag{54}$$
Note that eq. (54) only ensures injectivity when $|h_1| = |h_2|$. However, this is sufficient because, when $|h_1|\neq|h_2|$, eq. (53) follows trivially: since $|f^{\ell}_N(h_1)|\neq|f^{\ell}_N(h_2)|$, we immediately have $f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)$. When $|h_1| = |h_2|$, we can show that eq. (54) implies eq. (53) as follows: if $h_1\neq h_2$, then there exists at least one token position $t'\in T$ where $[h_1]_{t'}\neq[h_2]_{t'}$. By eq. (54), this implies $[f^{\ell}_N(h_1)]_{t'}\neq[f^{\ell}_N(h_2)]_{t'}$ almost surely, and therefore $f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)$ almost surely.

We observe that a transformer's input set $\mathcal{X}$ consists of all sequences formed from a finite token vocabulary, which is countably infinite. Since transformers are deterministic functions, the input set $\mathcal{H}$ encountered at any sublayer is also countably infinite. Therefore, it suffices to prove eq. (54) for any fixed countably infinite input set $\mathcal{H}$. This is established for the embedding layer ($f^{\ell}_N(h) = e_h$), the MLP sub-block ($f^{\ell}_N(h) = h + \mathrm{mlp}(\mathrm{LN}(h))$), and the attention sub-block ($f^{\ell}_N(h) = h + \mathrm{attn}(\mathrm{LN}(h))$) by Lemma 7, Lemma 8, and Lemma 9, respectively. ∎

^14 We note that $p_Z(\forall z\in Z: z)$, where $Z$ is a set of events, formally denotes $p_Z(\cap_{z\in Z}\, z)$.

The three lemmas facilitating the proof above are:

Lemma 7. Assume we have an embedding layer with weights randomly, independently initialised from a continuous distribution (riicd.), and denote the set of random variables over the weights as $\Theta$. Then, for any fixed countably infinite input set $\mathcal{H}$ (of token-index sequences), this layer is injective almost surely:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, t\in T, [h_1]_t\neq[h_2]_t: [e_{h_1}]_t \neq [e_{h_2}]_t\big) = 1 \tag{55}$$
Proof. See App. G.2.

Lemma 8. Assume we have a sub-block consisting of an MLP with a residual connection and layer norm (i.e., $h + \mathrm{mlp}(\mathrm{LN}(h))$) with riicd. weights. Then, for any fixed countably infinite input set $\mathcal{H}$, this layer is injective almost surely:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, t\in T, [h_1]_t\neq[h_2]_t: [h_1+\mathrm{mlp}(\mathrm{LN}(h_1))]_t \neq [h_2+\mathrm{mlp}(\mathrm{LN}(h_2))]_t\big) = 1 \tag{56}$$
Proof. See App. G.3.

Lemma 9. Assume we have a sub-block consisting of self-attention with a residual connection and layer norm (i.e., $h + \mathrm{attn}(\mathrm{LN}(h))$) with riicd. weights. Then, for any fixed countably infinite input set $\mathcal{H}$, this layer is injective almost surely:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, t\in T, [h_1]_t\neq[h_2]_t: [h_1+\mathrm{attn}(\mathrm{LN}(h_1))]_t \neq [h_2+\mathrm{attn}(\mathrm{LN}(h_2))]_t\big) = 1 \tag{57}$$
Proof. See App. G.4.

G.1 Fundamental Lemmas

In this section, we present some fundamental lemmas used to prove the results in Apps. G.2 to G.4.

Lemma 10. For a layer's function $f^{\ell}_N$ to be injective on its input set $\mathcal{H}$, it has to hold that:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}: h_1\neq h_2 \Rightarrow f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) = 1 \tag{58}$$
This can equivalently be written as:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, h_1\neq h_2: f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) = 1 \tag{59}$$
Proof. We can derive eq. (59) from eq. (58):
$$\begin{aligned}
p_\Theta\big(\forall h_1,h_2\in\mathcal{H}: h_1\neq h_2 \Rightarrow f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) &= p_\Theta\Big(\bigcap_{h_1,h_2\in\mathcal{H}} \big(h_1\neq h_2 \Rightarrow f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big)\Big) & (60a)\\
&= p_\Theta\Big(\bigcap_{h_1,h_2\in\mathcal{H}} \big((h_1\neq h_2)\wedge f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) \vee (h_1 = h_2)\Big) & (60b)\\
&= p_\Theta\Big(\bigcap_{h_1,h_2\in\mathcal{H}, h_1\neq h_2} \big((h_1\neq h_2)\wedge f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) \;\cap\; \underbrace{\bigcap_{h_1,h_2\in\mathcal{H}, h_1=h_2} (h_1 = h_2)}_{\text{always true}}\Big) & (60c)\\
&= p_\Theta\Big(\bigcap_{h_1,h_2\in\mathcal{H}, h_1\neq h_2} \underbrace{(h_1\neq h_2)}_{\text{always true}} \wedge f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\Big) & (60d)\\
&= p_\Theta\Big(\bigcap_{h_1,h_2\in\mathcal{H}, h_1\neq h_2} f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\Big) & (60e)\\
&= p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, h_1\neq h_2: f^{\ell}_N(h_1)\neq f^{\ell}_N(h_2)\big) & (60f)
\end{aligned}$$
∎

Lemma 11. If we have a countable set $Z$ of almost sure events $z$, then their intersection is also almost sure. Formally:
$$\forall z\in Z: p(z) = 1 \;\Longrightarrow\; p(\forall z\in Z: z) = 1 \tag{61}$$

Proof. First, observe that
$$p\Big(\bigcap_{z\in Z} z\Big) = 1 - p\Big(\Big(\bigcap_{z\in Z} z\Big)^{c}\Big) = 1 - p\Big(\bigcup_{z\in Z} z^{c}\Big) \tag{62}$$
where $z^{c}$ is the complement of an event $z$. Since $p(z) = 1$ for all $z\in Z$, it follows that $p(z^{c}) = 0$ for all $z\in Z$. By the countable subadditivity of probability measures:
$$p\Big(\bigcup_{z\in Z} z^{c}\Big) \le \sum_{z\in Z} p(z^{c}) = \sum_{z\in Z} 0 = 0 \tag{63}$$
Therefore,
$$p(\forall z\in Z: z) = p\Big(\bigcap_{z\in Z} z\Big) = 1 - 0 = 1 \tag{64}$$
∎

G.2 Proof of Lemma 7

In this section, we prove Lemma 7, which states that the embedding layer is almost surely injective on countably infinite inputs.

Lemma 7. Assume we have an embedding layer with weights randomly, independently initialised from a continuous distribution (riicd.), and denote the set of random variables over the weights as $\Theta$. Then, for any fixed countably infinite input set $\mathcal{H}$ (of token-index sequences), this layer is injective almost surely:
$$p_\Theta\big(\forall h_1,h_2\in\mathcal{H}, t\in T, [h_1]_t\neq[h_2]_t: [e_{h_1}]_t \neq [e_{h_2}]_t\big) = 1 \tag{55}$$

Proof. We can apply Lemma 11 three times (on $h_1$, $h_2$ and $t$) to show that eq. (55) is equivalent to the following: for any $h_1,h_2\in\mathcal{H}$ and $t\in T$ for which $[h_1]_t\neq[h_2]_t$, it holds that:
$$p_\Theta\big([e_{h_1}]_t \neq [e_{h_2}]_t\big) = 1 \tag{65}$$
We further note that, by the definition of an embedding block (Submodule 2):
$$p_\Theta\big([e_{h_1}]_t \neq [e_{h_2}]_t\big) = 1 \;\Leftrightarrow\; p_\Theta\big(e_{[h_1]_t} \neq e_{[h_2]_t}\big) = 1 \tag{66}$$
We can thus apply the law of total probability by defining $\Theta'$ as all the random variables $\Theta$ except the one for the first element of $e_{[h_1]_t}$, i.e., except $[e_{[h_1]_t}]_1$,^15 and $e'_{[h_1]_t}$ as the embedding of $[h_1]_t$ without the first element:
$$\int p_{\Theta\setminus\Theta'}\big(e_{[h_1]_t}\neq e_{[h_2]_t} \mid e'_{[h_1]_t}, e_{[h_2]_t}\big)\, p_{\Theta'}\big(e'_{[h_1]_t}\cup e_{[h_2]_t}\big)\, d\big(e'_{[h_1]_t}\cup e_{[h_2]_t}\big) = 1 \tag{67}$$
It therefore suffices to show that, for any $e'_{[h_1]_t}$ and $e_{[h_2]_t}$:
$$p_{\Theta\setminus\Theta'}\big(e_{[h_1]_t}\neq e_{[h_2]_t} \mid e'_{[h_1]_t}, e_{[h_2]_t}\big) = 1 \tag{68}$$
This holds trivially when any embedding dimension other than the first of $e_{[h_1]_t}$ and $e_{[h_2]_t}$ differs. When all dimensions except the first are equal, we apply:
$$p_{\Theta\setminus\Theta'}\big([e_{[h_1]_t}]_1 \neq [e_{[h_2]_t}]_1 \mid e'_{[h_1]_t}, e_{[h_2]_t}\big) = 1 \;\Leftrightarrow\; p_{\Theta\setminus\Theta'}\big([e_{[h_1]_t}]_1 = [e_{[h_2]_t}]_1 \mid e'_{[h_1]_t}, e_{[h_2]_t}\big) = 0 \tag{69}$$
The right-hand side $[e_{[h_2]_t}]_1$ is a constant, while the left-hand side $[e_{[h_1]_t}]_1$ is a random variable over a continuous region; this event has measure 0, resulting in probability 0. ∎

^15 By this we refer to the first element of the embedding vector of $[h_1]_t$.
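The measure-zero argument above is easy to sanity-check empirically. The following sketch (our illustration, not from the paper's released code) draws a Gaussian embedding matrix and verifies that no two distinct token embeddings collide in any single dimension, which is what per-dimension injectivity predicts for continuous draws.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10_000, 64
E = rng.normal(size=(vocab, dim))        # riicd. embedding matrix

# For continuous weight distributions, a per-dimension collision
# E[i, d] == E[j, d] with i != j is a measure-zero event.
idx = rng.integers(0, vocab, size=(100_000, 2))
i, j = idx[:, 0], idx[:, 1]
mask = i != j
collision = (E[i[mask]] == E[j[mask]]).any()
print("any single-dimension collision:", collision)   # expected: False
```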
G.3 Proof of Lemma 8

In this section, we prove Lemma 8, which shows that the sub-block consisting of an MLP, a residual connection, and layer norm is almost surely injective on countably infinite inputs.

Lemma 8. Assume a sub-block consisting of an MLP with a residual connection and layer norm (i.e., $h + \mathrm{mlp}(\mathrm{LN}(h))$) with riicd. weights. Then, for any fixed countably infinite input set $H$, this layer is injective almost surely:

$$p_\Theta\big(\forall h_1, h_2 \in H,\, t \in T,\, [h_1]_t \neq [h_2]_t : [h_1 + \mathrm{mlp}(\mathrm{LN}(h_1))]_t \neq [h_2 + \mathrm{mlp}(\mathrm{LN}(h_2))]_t\big) = 1 \tag{56}$$

Proof. For notational convenience, let $m_i = \mathrm{mlp}(\mathrm{LN}(h_i))$. Given Lemma 11, it suffices to prove that for any $h_1, h_2 \in H$ and $t \in T$ where $[h_1]_t \neq [h_2]_t$, we have:

$$p_\Theta\big([h_1 + m_1]_t \neq [h_2 + m_2]_t\big) = 1 \tag{70}$$

Without loss of generality, fix one such $h_1, h_2 \in H$ and $t \in T$. We can decompose this probability as:

$$\begin{aligned}
p_\Theta\big([h_1 + m_1]_t \neq [h_2 + m_2]_t\big)
&= p_\Theta\big([h_1]_t + [m_1]_t \neq [h_2]_t + [m_2]_t\big)\\
&= p_\Theta\big([m_1]_t \neq [m_2]_t\big)\, p_\Theta\big([h_1]_t + [m_1]_t \neq [h_2]_t + [m_2]_t \mid [m_1]_t \neq [m_2]_t\big)\\
&\quad + p_\Theta\big([m_1]_t = [m_2]_t\big)\, \underbrace{p_\Theta\big([h_1]_t + [m_1]_t \neq [h_2]_t + [m_2]_t \mid [m_1]_t = [m_2]_t\big)}_{=1,\ \text{since } [h_1]_t \neq [h_2]_t} && \text{(71a)}\\
&= p_\Theta\big([m_1]_t \neq [m_2]_t\big)\, p_\Theta\big([h_1]_t + [m_1]_t \neq [h_2]_t + [m_2]_t \mid [m_1]_t \neq [m_2]_t\big)\\
&\quad + p_\Theta\big([m_1]_t = [m_2]_t\big) && \text{(71b)}
\end{aligned}$$

Since $p_\Theta([m_1]_t \neq [m_2]_t) + p_\Theta([m_1]_t = [m_2]_t)$ is trivially 1, it suffices to show that:

$$p_\Theta\big([h_1]_t + [m_1]_t \neq [h_2]_t + [m_2]_t \mid [m_1]_t \neq [m_2]_t\big) = 1 \tag{72}$$

We now unfold the last layer of the MLP as $m_i = W_L m'_i + b_L$, where $m'_i = \sigma\big(f^{:L-1}_{N_\mathrm{MLP}}(\mathrm{LN}(h_i))\big)$. We can rewrite eq. (72) as:

$$\begin{aligned}
&p_\Theta\big([h_1]_t + [W_L m'_1 + b_L]_t \neq [h_2]_t + [W_L m'_2 + b_L]_t \mid [m_1]_t \neq [m_2]_t\big)\\
&\quad= p_\Theta\big([h_1]_t + W_L [m'_1]_t \neq [h_2]_t + W_L [m'_2]_t \mid [m_1]_t \neq [m_2]_t\big) && \text{(73a)}\\
&\quad\overset{(1)}{=} p_\Theta\big([h_1]_t + W_L [m'_1]_t \neq [h_2]_t + W_L [m'_2]_t \mid [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}\big) && \text{(73b)}\\
&\quad\overset{(2)}{\geq} p_\Theta\big([h_1]_{[t,1]} + [W_L m'_1]_{[t,1]} \neq [h_2]_{[t,1]} + [W_L m'_2]_{[t,1]} \mid [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}\big) && \text{(73c)}\\
&\quad= p_\Theta\Big([h_1]_{[t,1]} + \sum_{j=1}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} \neq [h_2]_{[t,1]} + \sum_{j=1}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} \;\Big|\; [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}\Big) && \text{(73d)}\\
&\quad\overset{(3)}{=} \int_{\Theta \setminus \Theta'} p_{\Theta'}\Big([h_1]_{[t,1]} + \sum_{j=1}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} \neq [h_2]_{[t,1]} + \sum_{j=1}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} \;\Big|\; [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}, \theta\Big)\\
&\qquad\qquad \cdot\, p_{\Theta \setminus \Theta'}\big(\theta \mid [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}\big)\, \mathrm{d}\theta && \text{(73e)}
\end{aligned}$$

Equality (1) holds because $[m_1]_t \neq [m_2]_t$ implies there exists some index $i$ such that $[m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}$ (here, $[m'_1]_{[t,i]}$ denotes two-dimensional indexing: the $i$-th element of the representation of the $t$-th token). Inequality (2) holds because if the inequality is satisfied for a single component of the vector, it must also be satisfied for the entire vector. In (3), we define $\Theta'$ as the random variable responsible for the value of $[W_L]_{[1,i]}$ and $\theta$ as a realisation of the random variables $\Theta \setminus \Theta'$. Therefore, to prove eq. (72), it suffices to show:

$$p_{\Theta'}\Big([h_1]_{[t,1]} + \sum_{j=1}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} \neq [h_2]_{[t,1]} + \sum_{j=1}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} \;\Big|\; [m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}, \theta\Big) = 1 \tag{74}$$
For brevity, we omit the conditions in the following probabilities, as they remain unchanged from the previous equation:

$$\begin{aligned}
&p_{\Theta'}\Big([h_1]_{[t,1]} + \sum_{j=1}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} \neq [h_2]_{[t,1]} + \sum_{j=1}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} \;\Big|\; \dots\Big) = 1 && \text{(75a)}\\
\iff\; &p_{\Theta'}\Big([h_1]_{[t,1]} + \sum_{j=1}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} = [h_2]_{[t,1]} + \sum_{j=1}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} \;\Big|\; \dots\Big) = 0 && \text{(75b)}\\
\iff\; &p_{\Theta'}\Big([W_L]_{[1,i]} [m'_1]_{[t,i]} - [W_L]_{[1,i]} [m'_2]_{[t,i]} = [h_2]_{[t,1]} - [h_1]_{[t,1]} + \sum_{\substack{j=1\\ j \neq i}}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} - \sum_{\substack{j=1\\ j \neq i}}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]} \;\Big|\; \dots\Big) = 0 && \text{(75c)}\\
\iff\; &p_{\Theta'}\Big([W_L]_{[1,i]} = \frac{1}{[m'_1]_{[t,i]} - [m'_2]_{[t,i]}}\Big([h_2]_{[t,1]} - [h_1]_{[t,1]} + \sum_{\substack{j=1\\ j \neq i}}^{|[h_1]_t|} [W_L]_{[1,j]} [m'_2]_{[t,j]} - \sum_{\substack{j=1\\ j \neq i}}^{|[h_2]_t|} [W_L]_{[1,j]} [m'_1]_{[t,j]}\Big) \;\Big|\; \dots\Big) = 0 && \text{(75d)}
\end{aligned}$$

The last step follows from the condition $[m'_1]_{[t,i]} \neq [m'_2]_{[t,i]}$, which ensures the denominator is non-zero. Eq. (75d) holds because the right-hand side is a constant (its elements are fixed given the conditions of the probability), while the left-hand side is a random variable drawn from a continuous distribution (since the weights are riicd.). The probability that this equality holds is therefore zero, as the event has measure zero. ∎

G.4 Proof of Lemma 9

In this section, we prove Lemma 9, which establishes that the self-attention sub-block (consisting of attention, a residual connection, and layer normalisation) is almost surely injective on countably infinite inputs. The proof structure parallels that of Lemma 8, so we highlight the key differences and necessary adaptations without repeating the full derivation.

Lemma 9. Assume a sub-block consisting of self-attention with a residual connection and layer norm (i.e., $h + \mathrm{attn}(\mathrm{LN}(h))$) with riicd. weights. Then, for any fixed countably infinite input set $H$, this layer is injective almost surely:

$$p_\Theta\big(\forall h_1, h_2 \in H,\, t \in T,\, [h_1]_t \neq [h_2]_t : [h_1 + \mathrm{attn}(\mathrm{LN}(h_1))]_t \neq [h_2 + \mathrm{attn}(\mathrm{LN}(h_2))]_t\big) = 1 \tag{57}$$

Proof. We follow a proof strategy analogous to that of Lemma 8 in App. G.3. Following the same steps up to eq. (72), it suffices to show that, for any $h_1, h_2 \in H$ and $t \in T$ where $[h_1]_t \neq [h_2]_t$:

$$p_\Theta\big([h_1]_t + [\mathrm{attn}(\mathrm{LN}(h_1))]_t \neq [h_2]_t + [\mathrm{attn}(\mathrm{LN}(h_2))]_t \mid [\mathrm{attn}(\mathrm{LN}(h_1))]_t \neq [\mathrm{attn}(\mathrm{LN}(h_2))]_t\big) = 1 \tag{76}$$

By the definition of an attention block (Submodule 1), we can write this as:

$$p_\Theta\big([h_1]_t + (W_O)^\top [h'_1]_t \neq [h_2]_t + (W_O)^\top [h'_2]_t \mid [\mathrm{attn}(\mathrm{LN}(h_1))]_t \neq [\mathrm{attn}(\mathrm{LN}(h_2))]_t\big) \tag{77}$$

where $h'_1$ and $h'_2$ are the hidden states after concatenation in the self-attention mechanism (see eq. (12)). The remainder of the proof follows the same approach as the proof of Lemma 8 in App. G.3, starting from eq. (73a). ∎

|         | All Pairs | Same Output | Not Same Output | Same Variables | Not Same Variables |
|---------|-----------|-------------|-----------------|----------------|--------------------|
| Input   | 8.5e−2 ± 1.1e−2 | 8.5e−2 ± 1.1e−2 | 1.7e−1 ± 1.9e−2 | 8.5e−2 ± 1.1e−2 | 1.6e−1 ± 1.9e−2 |
| Layer 1 | 5.7e−4 ± 4.5e−4 | 5.7e−4 ± 4.5e−4 | 2.4e−2 ± 6.2e−3 | 5.7e−4 ± 4.5e−4 | 1.1e−2 ± 4.3e−3 |
| Layer 2 | 4.5e−4 ± 3.8e−4 | 4.5e−4 ± 3.8e−4 | 3.2e−2 ± 1.2e−2 | 4.5e−4 ± 3.8e−4 | 1.2e−2 ± 6.7e−3 |
| Layer 3 | 3.3e−4 ± 2.8e−4 | 3.3e−4 ± 2.8e−4 | 8.9e−2 ± 4.3e−2 | 3.3e−4 ± 2.8e−4 | 1.5e−2 ± 1.0e−2 |

Table 1: An approximation of the minimal Euclidean distance between hidden states of the trained MLP in the hierarchical equality task, using 1,280,000 samples. The minimal Euclidean distance is computed from all samples to a randomly selected subset of 10,000 samples. We compute it over 10 different random seeds and report the mean and standard deviation (the number after ±).

H MLP Injectivity in Hierarchical Equality Task

We see in Fig. 2 that the IIA remains low for the identity of first argument algorithm on a fully trained model, even when using a φ_nonlin alignment map (based on a RevNet). A reasonable hypothesis for why is that the fully trained model does not fulfil some assumption required by our proof of Theorem 1 (any algorithm is an input-restricted distributed abstraction of any model), given in §4.
In this section, we present follow-up experiments investigating the reason for this disagreement between our empirical results on the identity of first argument algorithm and the theoretical result of Theorem 1.

Let us first note that the proof of Theorem 1 is an existence proof: it shows there exists a function φ which satisfies the conditions for a DNN to be abstracted by an algorithm. It says nothing, however, about this function being learnable in practice. Furthermore, our experiments measure IIA on an unseen test set, which requires φ_nonlin not only to fit a training set but also to generalise to new data. Following our proof of Theorem 1, we therefore explore IIA on the training set instead. On the full training set (with 1,280,000 samples), we still do not obtain an IIA above 0.55. If we repeat the experiment with only 1,000 training samples, however, φ_nonlin achieves an IIA of over 0.99 on the training set. It is thus likely that the RevNet used when defining φ_nonlin does not have enough capacity to fit the overly complex function our proof describes.

To further analyse why the capacity of the used RevNet is not sufficient, we analyse the injectivity of the evaluated MLP by investigating its hidden representations. We first evaluate 1,280,000 randomly sampled inputs and their hidden states, checking whether they are all unique. In these 1,280,000 samples (repeating the experiment with 10 different random seeds), no collisions were found, implying the evaluated MLP is (at least close to) injective.

We next examine the supposition that the model finds it more difficult to distinguish between hidden states that share the same values for the variables in both equality relations than between those that do not. To this end, we compute the minimal Euclidean distance between hidden states, from the entire set of 1,280,000 samples to a randomly selected subset of 10,000 samples. Specifically, we measure the minimal pairwise Euclidean distance among: (i) all sample pairs, (ii) sample pairs sharing the same output, (iii) sample pairs sharing the same values for the variables in both equality relations, (iv) sample pairs that do not share the same output, and (v) sample pairs that do not share the same values for both equality variables. The results are presented in Table 1. We observe that the minimal Euclidean distances are smaller for pairs sharing the same output or the same equality-variable values than for pairs that do not. This suggests that, although injectivity is preserved, a RevNet will likely find it more challenging to separate hidden states that share variable values. A sketch of this distance computation follows.
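This is our reconstruction of the procedure described above, not the paper's code; the random `hidden_states` stand in for the MLP's actual hidden representations, and the sizes are scaled down for illustration.

```python
import numpy as np

def min_pairwise_distance(states: np.ndarray, subset: np.ndarray) -> float:
    """Smallest Euclidean distance from any row of `states` to any row of
    `subset`, ignoring exact self-matches (a point paired with itself)."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2ab + ||b||^2.
    sq = (
        (states**2).sum(axis=1, keepdims=True)
        - 2.0 * states @ subset.T
        + (subset**2).sum(axis=1)
    )
    dist = np.sqrt(np.maximum(sq, 0.0))
    dist[dist == 0.0] = np.inf  # drop self-pairs
    return float(dist.min())

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(10_000, 16))  # stand-in for 1,280,000 states
idx = rng.choice(len(hidden_states), size=1_000, replace=False)
print(min_pairwise_distance(hidden_states, hidden_states[idx]))
```

In the actual experiment, the same computation would be restricted to the pair categories (i) to (v) by masking rows according to shared outputs or shared equality-variable values.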
I Additional Experiment Details

In this section, we present additional details about our hierarchical equality task (App. I.1) and indirect object identification (App. I.2) experiments. We also present details and results on the distributive law task (App. I.3).

I.1 Hierarchical Equality Task

Task 1 (from Geiger et al., 2024b). The hierarchical equality task is defined as follows. Let $x = x_1 \circ x_2 \circ x_3 \circ x_4$ be a 16-dimensional vector, where each $x_i \in \mathbb{R}^4$ for $i \in \{1, 2, 3, 4\}$ and $\circ$ denotes vector concatenation. The input space is $X = [-0.5, 0.5]^{16}$, and the output space is $Y = \{\text{false}, \text{true}\}$. The task function is:

$$T(x) = \big((x_1 == x_2) == (x_3 == x_4)\big), \tag{78}$$

where the equality $(x_i == x_j)$ holds if and only if $x_i$ and $x_j$ are equal as vectors in $\mathbb{R}^4$.

I.1.1 Algorithms

We define the following three candidate algorithms in detail.

Alg 1. The both equality relations alg. to solve Task 1 has $\eta_\text{inner} = \{\eta_{x_1==x_2}, \eta_{x_3==x_4}\}$ and:

$$\begin{aligned}
f_A^{\eta_{x_1==x_2}}\big(v_{\mathrm{par}_A(\eta_{x_1==x_2})}\big) &= (v_{\eta_{x_1}} == v_{\eta_{x_2}})\\
f_A^{\eta_{x_3==x_4}}\big(v_{\mathrm{par}_A(\eta_{x_3==x_4})}\big) &= (v_{\eta_{x_3}} == v_{\eta_{x_4}})\\
f_A^{\eta_y}\big(v_{\mathrm{par}_A(\eta_y)}\big) &= (v_{\eta_{x_1==x_2}} == v_{\eta_{x_3==x_4}})
\end{aligned}$$

Alg 2. The left equality relation alg. to solve Task 1 has $\eta_\text{inner} = \{\eta_{x_1==x_2}\}$ and:

$$\begin{aligned}
f_A^{\eta_{x_1==x_2}}\big(v_{\mathrm{par}_A(\eta_{x_1==x_2})}\big) &= (v_{\eta_{x_1}} == v_{\eta_{x_2}})\\
f_A^{\eta_y}\big(v_{\mathrm{par}_A(\eta_y)}\big) &= \big(v_{\eta_{x_1==x_2}} == (v_{\eta_{x_3}} == v_{\eta_{x_4}})\big)
\end{aligned}$$

Alg 3. The identity of first argument alg. to solve Task 1 has $\eta_\text{inner} = \{\eta_{x_1}\}$ and:

$$\begin{aligned}
f_A^{\eta_{x_1}}\big(\mathrm{par}_A(\eta_{x_1})\big) &= x_1\\
f_A^{\eta_y}\big(\mathrm{par}_A(\eta_y)\big) &= \big((v_{\eta_{x_1}} == x_2) == (x_3 == x_4)\big)
\end{aligned}$$

I.1.2 Training Details

For the hierarchical equality task, we use a 3-layer MLP with |ψ_1| = |ψ_2| = |ψ_3| = 16. The model is trained using the Adam optimiser with learning rate 0.001 and cross-entropy loss. We use a batch size of 1024 and train on 1,048,576 samples, with 10,000 samples each for evaluation and testing. Training runs for a maximum of 20 epochs, with early stopping after 3 epochs of no improvement. For the training-progression experiments, we use the same configuration but limit training to 2 epochs.

When training the alignment maps φ, we use a batch size of 6400 and train for up to 50 epochs with early stopping after 5 epochs of no improvement (using a threshold of 0.001 for the required change, compared to 0 for MLP training). We use the Adam optimiser with learning rate 0.001 and cross-entropy loss.

To generate the datasets for DAS: for Alg. 1, we intervene with probability 1/3 on η_{x_1==x_2}, 1/3 on η_{x_3==x_4}, and 1/3 on both variables. The samples for the base and source inputs are generated such that (x_1 == x_2) and (x_3 == x_4) each hold 50% of the time. For Alg. 2 and Alg. 3, we intervene on η_{x_1==x_2} and η_{x_1} for all samples, respectively. For each algorithm, we sample 1,280,000 interventions for training, 10,000 for evaluation, and 10,000 for testing.
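To fix ideas, here is a schematic of the interchange-intervention accuracy (IIA) evaluation loop for a single aligned node. This is our sketch, not the paper's implementation: the DNN, head, and alignment map below are toy stand-ins, chosen so that the toy DNN genuinely implements the toy algorithm and the loop reports an IIA of 1.0.

```python
import numpy as np

# Schematic IIA evaluation for one aligned node. The toy DNN is the identity
# followed by a sum; the algorithm's node a is the sum of the first two
# input coordinates, so the DNN implements the algorithm exactly.
rng = np.random.default_rng(0)
d, psi = 4, slice(0, 2)  # hidden size; coordinates assigned to the node

def layer(x):        # stand-in DNN layer at the intervention site
    return x

def head(h):         # stand-in DNN from the layer to the output
    return h.sum() > 0

def phi(h):          # stand-in alignment map (identity here);
    return h         # phi_nonlin would be an invertible RevNet instead

def phi_inv(z):
    return z

def algo_counterfactual(x_base, x_src):
    # High-level algorithm: node a = x_1 + x_2, node b = x_3 + x_4,
    # output a + b > 0. Under the intervention, a takes its source value.
    return x_src[:2].sum() + x_base[2:].sum() > 0

hits, n = 0, 10_000
for _ in range(n):
    x_base, x_src = rng.normal(size=d), rng.normal(size=d)
    z = phi(layer(x_base)).copy()
    z[psi] = phi(layer(x_src))[psi]  # interchange intervention in phi-space
    hits += head(phi_inv(z)) == algo_counterfactual(x_base, x_src)
print("IIA:", hits / n)  # 1.0 for this faithful toy abstraction
```

DAS then amounts to keeping this loop fixed while optimising the parameters of φ to maximise the agreement it measures.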
I.1.3 Additional Results

We present results for the three candidate algorithms of the hierarchical equality task, analysing the effect of hidden size d_rn and intervention size |ψ^φ_η| across all MLP layers. For the both equality relations algorithm, Figs. 8a, 9a and 9b demonstrate that the hidden-size experiment aligns with previously reported trends, while also showing how alignment maps φ of increasing complexity perform across training epochs, layers, and intervention sizes. For the left equality relation algorithm, as shown in Figs. 8b, 9c and 9d, we observe similar patterns: increasing hidden size and intervention size improves performance, and alignment is generally more successful in later layers during early training. For the identity of first argument algorithm, Figs. 8c, 9e and 9f reveal that, interestingly, some alignment is achieved (especially in layer 3) during the first half of training, but this effect diminishes in the second half. Overall, these results demonstrate that the hidden-size experiment is consistent with the findings reported in the main paper. They also show that it is easier, in untrained models, to find an alignment map for later layers, and that transient alignment can occur in specific layers and algorithms during the initial stages of training.

Figure 8 (plots omitted; panels (a) both equality relations, (b) left equality relation, (c) identity of first argument; accuracy against d_rn for each of the three MLP layers): Mean IIA over 5 seeds using φ_nonlin (L_rn = 1) on the trained DNN. Performance improves with larger hidden dimension d_rn and intervention size |ψ^φ_η|. Each subplot corresponds to one of the three candidate algorithms for the hierarchical equality task, showing how the model's representational capacity influences performance.

I.2 Indirect Object Identification Task

Task 2. The Indirect Object Identification (IOI) task involves predicting the indirect object in sentences with a specific structure. Each input x ∈ X consists of a text in which a subject (S) and an indirect object (IO) are introduced, followed by the S giving something to the IO. For example:

"Friends Juana and Kristi found a mango at the bar. Kristi gave it to" ⇒ "Juana"

Here, "Juana" and "Kristi" are introduced, with "Kristi" (S) appearing again before giving something to "Juana" (IO). The output set Y consists of the first tokens of the two names:

$$Y = \{\mathrm{first\_token}(S),\, \mathrm{first\_token}(IO)\} \tag{79}$$

I.2.1 Algorithm

For this task, we evaluate the ABAB-ABBA algorithm. Denoting the two names in the story as A and B, this algorithm determines whether the sentence follows an ABAB pattern (e.g., "Friends Juana and Kristi found a mango at the bar. Juana gave it to Kristi") or an ABBA pattern (e.g., "Friends Juana and Kristi found a mango at the bar. Kristi gave it to Juana"). If the pattern is ABAB (where B is the indirect object IO), the algorithm predicts the first token of B. Conversely, for an ABBA pattern, it predicts the first token of A. In our experiments, we intervene on whether an input follows the ABAB pattern or not.
Figure 9 (plots omitted; panels (a)/(b) max/mean IIA for both equality relations, (c)/(d) max/mean IIA for left equality relation, (e)/(f) max/mean IIA for identity of first argument; accuracy against DNN training steps, comparing the DNN, φ_lin, and φ_nonlin with (L_rn, d_rn) ∈ {(1, 2^4), (5, 2^4), (10, 2^4), (10, 2^7)}): IIA over 5 seeds for each combination of MLP layer (rows) and intervention size (columns) during training progression for the tested algorithms.

Alg 4. The ABAB-ABBA algorithm for the IOI task has one inner node $\eta_\text{inner} = \{\eta_1\}$ and is defined as follows:

$$f_A^{\eta_1}\big(\mathrm{par}_A(\eta_1)\big) = \mathrm{check\_is\_abab\_pattern}(v_{\eta_x}) \tag{80a}$$

$$f_A^{\eta_y}\big(\mathrm{par}_A(\eta_y)\big) = \begin{cases} \mathrm{first\_token}\big(\mathrm{get\_name\_b}(v_{\eta_x})\big) & \text{if } v_{\eta_1} = \text{true}\\ \mathrm{first\_token}\big(\mathrm{get\_name\_a}(v_{\eta_x})\big) & \text{if } v_{\eta_1} = \text{false} \end{cases} \tag{80b}$$

Here, get_name_a(x) extracts the first name (denoted A, e.g., Juana in our example) and get_name_b(x) extracts the second name (denoted B, e.g., Kristi in our example) from the input sentence x. The function check_is_abab_pattern(x) returns true if the sentence follows an "ABAB" structure (e.g., "A and B ... A gave to B", meaning B is the IO) and false if it follows an "ABBA" structure (e.g., "A and B ... B gave to A", meaning A is the IO). first_token(Name) returns the first token of the specified name. The output y is the first token of the indirect object.
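A minimal executable rendering of Alg. 4 (ours; the helpers below are simplified stand-ins for get_name_a, get_name_b, check_is_abab_pattern, and first_token, and they assume the sentence's final clause starts with the giver's name):

```python
def first_token(name: str) -> str:
    # Stand-in tokeniser: the first whitespace-separated token of the name.
    return name.split()[0]

def abab_abba(sentence: str, name_a: str, name_b: str) -> str:
    """Sketch of the ABAB-ABBA algorithm. name_a / name_b are the first and
    second names introduced in the story (stand-ins for get_name_a and
    get_name_b); the startswith check stands in for check_is_abab_pattern."""
    giver_clause = sentence.rsplit(".", 1)[-1].strip()  # "Kristi gave it to"
    is_abab = giver_clause.startswith(name_a)  # giver A means B is the IO
    return first_token(name_b) if is_abab else first_token(name_a)

s = "Friends Juana and Kristi found a mango at the bar. Kristi gave it to"
print(abab_abba(s, "Juana", "Kristi"))  # ABBA pattern, so predict "Juana"
```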
I.2.2 Training Details

We use models from the Pythia suite (Biderman et al., 2023) to evaluate the IIA performance of the different φ on the IOI task. Specifically, we employ the Pythia 31M, 70M, 160M, and 410M parameter models. We also examine the different training checkpoints provided with these models to analyse how IIA evolves during training. To assess robustness, we replicate a subset of experiments using alternative Pythia model seeds from van der Wal et al. (2025).

We train all alignment maps on 2 epochs of 10^6 interventions based on data from Muhia (2022), with a batch size of 256 and a learning rate of 10^−4. For all experiments, we set the intervention size |ψ^φ_η| to half of the DNN's hidden dimension. For the smaller models (31M, 70M), we train using float64 precision and a learning rate of 10^−3, as these adjustments proved crucial for convergence. We also observed quite severe grokking behaviour, where models had low IIA for a long time before jumping quickly to high IIA values at a certain point of training (see Fig. 10; wandb run).

Figure 10 (plot omitted): Cross-entropy loss (y-axis) during training (x-axis) of the alignment map for the randomly initialised Pythia-31m. The loss plateaus between 4k and 6k steps, and suddenly drops after 6k steps.

I.2.3 Additional Results

Robustness across random seeds. In Fig. 11, we examine how our main results from §6 generalise across multiple training seeds of the Pythia model. The key trends hold consistently across all 5 seeds: we can find perfect alignments using φ_nonlin in most cases. However, we observe two notable exceptions. For one seed, the DNN fails to learn the IOI task even after full training; for another, we cannot find an alignment under our current setup, even with complex alignment maps. (These failures occur in different seeds: seed 3 shows poor IIA despite learning the task, while seed 4 fails to learn the IOI task.) All other seeds achieve perfect alignment under φ_nonlin. We hypothesise that the alignment failure is primarily due to suboptimal training of the alignment map; due to computational constraints, we did not perform the extensive hyperparameter tuning that might have achieved convergence.

Figure 11 (plot omitted): IIA of alignment between the ABAB-ABBA algorithm and the Pythia-410m model across multiple seeds (seeds 1 to 5 from van der Wal et al., 2025), with interventions at layer 12. We evaluate the IIA of both φ_lin and φ_nonlin (with d_h = 64, K = 1) on randomly initialised (Init.) and fully trained (Full) DNNs.

Generalisation across distinct name sets. In the main paper, we split the dataset from Muhia (2022) by ensuring that no two sentences appear in both the training and evaluation sets. However, this splitting strategy does not guarantee that the names themselves are distinct between training and evaluation. In Fig. 12, we examine the results when using completely disjoint sets of names for training and evaluation. The results differ substantially: for the randomly initialised DNN, we cannot find an alignment even with complex alignment maps. This suggests that IIA on the randomly initialised DNN may depend critically on overlap between the specific entities encountered during training and evaluation. For the fully trained DNN, we observe perfect alignment using φ_nonlin and reasonably high alignment using φ_lin.

Figure 12 (plot omitted): IIA of alignment between the ABAB-ABBA algorithm and the Pythia-410m model using a different set of names for training and evaluation, with interventions at layer 12. We evaluate the IIA of both φ_lin and φ_nonlin (with d_h = 64, K = 1) on randomly initialised and fully trained (Full) DNNs.
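Throughout these experiments, φ_nonlin is parameterised as a RevNet. For reference, the following is a minimal additive coupling block in the usual RevNet/NICE style (our simplification under standard assumptions, not the paper's exact architecture): invertibility holds by construction, so φ_nonlin never discards information, and a full map stacks several such blocks.

```python
import numpy as np

class CouplingBlock:
    """One additive coupling block: z = [a, b + f(a)] for h = [a, b].
    Invertible by construction, regardless of the coupling MLP f."""

    def __init__(self, d: int, d_rn: int, rng: np.random.Generator):
        # d_rn is the width of the coupling MLP (an assumption about how
        # the hyperparameter maps onto this sketch).
        self.w1 = rng.normal(size=(d // 2, d_rn)) / np.sqrt(d // 2)
        self.w2 = rng.normal(size=(d_rn, d // 2)) / np.sqrt(d_rn)

    def _f(self, a: np.ndarray) -> np.ndarray:
        return np.tanh(a @ self.w1) @ self.w2  # small coupling MLP

    def forward(self, h: np.ndarray) -> np.ndarray:
        a, b = np.split(h, 2, axis=-1)
        return np.concatenate([a, b + self._f(a)], axis=-1)

    def inverse(self, z: np.ndarray) -> np.ndarray:
        a, c = np.split(z, 2, axis=-1)
        return np.concatenate([a, c - self._f(a)], axis=-1)

rng = np.random.default_rng(0)
block = CouplingBlock(d=64, d_rn=16, rng=rng)
h = rng.normal(size=64)
assert np.allclose(block.inverse(block.forward(h)), h)  # exact inverse
```

In a stacked map, successive blocks typically swap which half is transformed, so that every coordinate is eventually updated while the whole composition remains invertible.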
I.3 Distributive Law Task

We now study a task similar to the hierarchical equality task (App. I.1), based on the distributive law of and (∧) and or (∨).

Task 3. The distributive law task is defined as follows. Let $x = x_1 \circ x_2 \circ x_3 \circ x_4 \circ x_5 \circ x_6$ be a 24-dimensional vector, where each $x_i \in [-0.5, 0.5]^4$ for $i \in \{1, \dots, 6\}$ and $\circ$ denotes vector concatenation. The input space is $X = [-0.5, 0.5]^{24}$, and the output space is $Y = \{\text{false}, \text{true}\}$. The task function is:

$$T(x) = \big((x_1 == x_2) \wedge (x_3 == x_4)\big) \vee \big((x_3 == x_4) \wedge (x_5 == x_6)\big), \tag{81}$$

where the equality $(x_i == x_j)$ holds if and only if $x_i$ and $x_j$ are equal as vectors in $\mathbb{R}^4$.
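For reference, a direct transcription of the task function in eq. (81) (ours; equality is exact equality of the 4-dimensional sub-vectors):

```python
import numpy as np

def distributive_law_task(x: np.ndarray) -> bool:
    """Eq. (81): ((x1 == x2) and (x3 == x4)) or ((x3 == x4) and (x5 == x6)),
    where == is exact equality of the 4-dimensional sub-vectors of the
    24-dimensional input."""
    x1, x2, x3, x4, x5, x6 = x.reshape(6, 4)
    eq = lambda a, b: bool(np.array_equal(a, b))
    return (eq(x1, x2) and eq(x3, x4)) or (eq(x3, x4) and eq(x5, x6))

# Example: x1 == x2 and x3 == x4 hold, so the first conjunct fires.
v = np.full(4, 0.25)
x = np.concatenate([v, v, -v, -v, v, np.zeros(4)])
print(distributive_law_task(x))  # True
```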
I.3.1 Algorithms

We define the following two candidate algorithms.

Alg 5. The And-Or-And alg. to solve Task 3 has $\eta_\text{inner} = \{\eta_{(x_1==x_2)\wedge(x_3==x_4)}, \eta_{(x_3==x_4)\wedge(x_5==x_6)}\}$ and is defined as follows:

$$\begin{aligned}
f_A^{\eta_{(x_1==x_2)\wedge(x_3==x_4)}}\big(v_{\mathrm{par}_A(\eta_{(x_1==x_2)\wedge(x_3==x_4)})}\big) &= (v_{\eta_{x_1}} == v_{\eta_{x_2}}) \wedge (v_{\eta_{x_3}} == v_{\eta_{x_4}})\\
f_A^{\eta_{(x_3==x_4)\wedge(x_5==x_6)}}\big(v_{\mathrm{par}_A(\eta_{(x_3==x_4)\wedge(x_5==x_6)})}\big) &= (v_{\eta_{x_3}} == v_{\eta_{x_4}}) \wedge (v_{\eta_{x_5}} == v_{\eta_{x_6}})\\
f_A^{\eta_y}\big(v_{\mathrm{par}_A(\eta_y)}\big) &= v_{\eta_{(x_1==x_2)\wedge(x_3==x_4)}} \vee v_{\eta_{(x_3==x_4)\wedge(x_5==x_6)}}
\end{aligned}$$

Alg 6. The And-Or alg. to solve Task 3 has $\eta_\text{inner} = \{\eta_{x_3==x_4}, \eta_{(x_1==x_2)\vee(x_5==x_6)}\}$ and is defined as follows:

$$\begin{aligned}
f_A^{\eta_{x_3==x_4}}\big(v_{\mathrm{par}_A(\eta_{x_3==x_4})}\big) &= (v_{\eta_{x_3}} == v_{\eta_{x_4}})\\
f_A^{\eta_{(x_1==x_2)\vee(x_5==x_6)}}\big(v_{\mathrm{par}_A(\eta_{(x_1==x_2)\vee(x_5==x_6)})}\big) &= (v_{\eta_{x_1}} == v_{\eta_{x_2}}) \vee (v_{\eta_{x_5}} == v_{\eta_{x_6}})\\
f_A^{\eta_y}\big(v_{\mathrm{par}_A(\eta_y)}\big) &= v_{\eta_{x_3==x_4}} \wedge v_{\eta_{(x_1==x_2)\vee(x_5==x_6)}}
\end{aligned}$$

I.3.2 Training Details

For the distributive law task, we use a 3-layer MLP (see App. E.1) with an input dimensionality of 24, hidden layers of dimensionality |ψ_1| = |ψ_2| = |ψ_3| = 24, and an output dimensionality of 2. The model is trained using the Adam optimiser with a learning rate of 0.001 and cross-entropy loss, with a batch size of 1024. The datasets are generated by randomly sampling input vectors $x = x_1 \circ \cdots \circ x_6$ such that the target label $\bar{y} = T(x)$ is true 50% of the time. We sample 1,048,576 samples for training, 10,000 for evaluation, and 10,000 for testing. Training runs for a maximum of 20 epochs with early stopping after 3 epochs of no improvement.

For training φ, we use a batch size of 6400 and train for up to 50 epochs with early stopping after 5 epochs of no improvement (using a threshold of 0.001 for the required change). We use the Adam optimiser with learning rate 0.001 and cross-entropy loss. To generate the intervened datasets: for Alg. 5, we intervene with probability 1/3 on η_{(x_1==x_2)∧(x_3==x_4)}, 1/3 on η_{(x_3==x_4)∧(x_5==x_6)}, and 1/3 on both variables; for Alg. 6, we intervene with probability 1/3 on η_{x_3==x_4}, 1/3 on η_{(x_1==x_2)∨(x_5==x_6)}, and 1/3 on both variables. For both algorithms, the samples for the base and source inputs are generated such that the output of the intervention changes compared to the base input 50% of the time. For each algorithm, we sample 1,280,000 interventions for training, 10,000 for evaluation, and 10,000 for testing.

I.3.3 Results

In this section, we discuss the results on the distributive law task using the And-Or-And and And-Or algorithms. Our findings corroborate the results presented in the main paper. As shown in Fig. 13, using linear and identity alignment maps reveals distinct dynamics: the And-Or algorithm achieves higher IIA using φ_lin, particularly in later layers, where the IIA of φ_lin on the And-Or-And algorithm approaches 0.5. However, these dynamics vanish completely when using a more complex alignment map like φ_nonlin, with which we achieve almost perfect IIA everywhere.

Figure 13 (plots omitted; panels And-Or and And-Or-And, IIA per layer L1 to L3 and intervention size): IIA in the distributive law task for causal abstractions trained with different alignment maps φ (identity, linear, and non-linear), shown for both analysed algorithms. The bars represent the max IIA across 10 runs with different random seeds. The black lines represent mean IIA with 95% confidence intervals. |ψ^φ_η| denotes the intervention size per node. Without interventions, all DNNs reach 100% accuracy. The used φ_nonlin has L_rn = 10 and d_rn = 24.

Figs. 14a and 14c present the evaluated IIA throughout model training. These training-progression plots show that randomly initialised models often achieve IIA above 0.8 with non-linear alignment maps, supporting our insight that, when the notion of causal abstraction is equipped with φ_nonlin, it may identify algorithms which are not necessarily implemented by the underlying model. In Figs. 14b and 14d, we plot the mean IIA over 5 seeds instead of the maximum IIA.

Figure 14 (plots omitted; panels (a)/(b) max/mean IIA for And-Or, (c)/(d) max/mean IIA for And-Or-And; accuracy against DNN training steps, comparing the DNN, φ_lin, and φ_nonlin with (L_rn, d_rn) ∈ {(1, 24), (5, 24), (10, 24), (10, 2^7)}): IIA over 5 seeds for each combination of MLP layer (rows) and intervention size (columns) during training progression for the evaluated algorithms.

The hidden-size experiments (Figs. 15a and 15b) show that even RevNets with a small d_rn of 4 achieve near-perfect IIA for And-Or-And, while And-Or never reaches perfect IIA in the second layer, regardless of d_rn. The training-progression plots suggest a possible explanation: IIA for And-Or-And initially increases in the last two layers but then decreases, while RevNets maintain near-perfect IIA. This may indicate that And-Or-And is first implemented with simple encodings detectable by linear φs, before evolving into non-linear encodings that only RevNets can detect. The fact that And-Or never achieves high IIA in later layers further suggests it may not be a true abstraction of the model's behaviour, though we note this remains a hypothesis requiring further investigation.

And-Or-And Training. In this section, we analyse a DNN trained specifically to rely on the And-Or-And algorithm (and, consequently, to encode the values of its hidden nodes). We do so with the method from Geiger et al. (2022), training the DNN to encode And-Or-And's hidden nodes' values in its second layer, with an intervention size of 12. This method is similar to how we train φ (see App. I.3.2), but φ is fixed to the identity function and the DNN itself is trained; further, the training dataset is composed of 1/4 non-intervened samples, 1/4 samples with interventions on η_{(x_1==x_2)∧(x_3==x_4)}, 1/4 on η_{(x_3==x_4)∧(x_5==x_6)}, and 1/4 on both variables. We then evaluate whether this DNN abstracts both the And-Or-And and And-Or algorithms using different φ (as before, after freezing the DNN). The IIA performance of these φ is presented in Fig. 16.
We can see here that, when using identity and linear alignment maps φ, the IIA scores suggest that the And-Or-And algorithm is implemented nearly perfectly given the second layer, where we reach only around 0.75 IIA for the And-Or algorithm. However, these differences vanish almost completely when using φ_nonlin as our alignment map.

And-Or Training. In this section, we report an experiment similar to the above, but we train our DNN to rely on the And-Or algorithm instead. These results are shown in Fig. 17. In this figure, we again see that, when using identity and linear alignment maps φ, the IIA performance suggests that the And-Or algorithm is implemented nearly perfectly given the second layer, where we reach only around 0.65 IIA for the And-Or-And algorithm. These differences, however, vanish when using φ_nonlin as the alignment map, which leads to perfect IIA scores with either algorithm.

J Computational Resources

The experiments on MLPs were executed on CPU (10 machines with an i7-4770 or newer) over 3 weeks, as we noticed that DAS on small MLPs runs faster on CPU than on GPU. The experiments on the Pythia models were executed on a single A100 GPU with 80GB of memory, using approximately 30 GPU hours including hyperparameter tuning.

Figure 15 (plots omitted; panels (a) And-Or and (b) And-Or-And; accuracy against d_rn for each of the three MLP layers): Mean IIA over 5 seeds using φ_nonlin (L_rn = 1) on the trained DNN. Performance improves with larger hidden dimension d_rn and intervention size |ψ^φ_η|. Each subplot corresponds to one of the two candidate algorithms for the distributive law task, showing how φ's representational capacity influences performance.

Figure 16 (plots omitted; panels And-Or and And-Or-And, IIA per layer L1 to L3): IIA in the distributive law task for causal abstractions trained with different alignment maps φ (identity, linear, and non-linear) and a DNN trained to use the And-Or-And algorithm. The figure shows results when evaluating whether the DNN encodes either of the analysed algorithms for this task.
The bars represent the max IIA across 10 runs with different random seeds. The black lines represent mean IIA with 95% confidence intervals. |ψ^φ_η| denotes the intervention size per node. All DNNs reach >99.9% accuracy after training. The used φ_nonlin has L_rn = 10 and d_rn = 16.

Figure 17 (plots omitted; panels And-Or and And-Or-And, IIA per layer L1 to L3): IIA in the distributive law task for causal abstractions trained with different alignment maps φ (identity, linear, and non-linear) and a DNN trained to use the And-Or algorithm. The figure shows IIA results when evaluating whether the DNN encodes either of the analysed algorithms for this task. The bars represent the max IIA across 10 runs with different random seeds. The black lines represent mean IIA with 95% confidence intervals. |ψ^φ_η| denotes the intervention size per node. Without interventions, all DNNs reach >99.9% accuracy. The used φ_nonlin has L_rn = 10 and d_rn = 16.