Paper deep dive
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
Models: Fully-connected feed-forward neural networks (small-scale, 8 input/24 intermediate/2 output neurons)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 8:15:33 PM
Summary
This paper provides a formal theoretical foundation for mechanistic interpretability using causal abstraction. It generalizes the theory of causal abstraction to include arbitrary mechanism transformations (interventionals), formalizes core concepts like polysemantic neurons and the linear representation hypothesis, and unifies various interpretability methods (e.g., patching, causal scrubbing, sparse autoencoders) under a single mathematical framework.
Entities (6)
Relation Signals (3)
Causal Abstraction → provides foundation for → Mechanistic Interpretability
confidence 100% · Causal abstraction provides a theoretical foundation for mechanistic interpretability
Causal Abstraction → unifies → Mechanistic Interpretability Methods
confidence 95% · unifying a variety of mechanistic interpretability methods in the common language of causal abstraction
Interventionals → generalizes → Mechanism Replacement
confidence 90% · generalizing the theory of causal abstraction from mechanism replacement... to arbitrary mechanism transformation
Cypher Suggestions (2)
Map the relationship between theory and field · confidence 95% · unvalidated
MATCH (t:Theory)-[r:PROVIDES_FOUNDATION_FOR]->(f:Field) RETURN t.name, r.relation, f.name
Find all interpretability methods unified by Causal Abstraction · confidence 90% · unvalidated
MATCH (t:Theory {name: 'Causal Abstraction'})-[:UNIFIES]->(m:Method) RETURN m.name
Abstract: Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.
Tags
Links
- Source: https://arxiv.org/abs/2301.04709
- Canonical: https://arxiv.org/abs/2301.04709
Full Text
174,363 characters extracted from source content.
Journal of Machine Learning Research 26 (2025) 1-63. Submitted 1/23; Revised 12/24; Published 5/25

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Atticus Geiger∗♢, Duligur Ibeling♠, Amir Zur♢, Maheep Chaudhary♢, Sonakshi Chauhan♢, Jing Huang♠, Aryaman Arora♠, Zhengxuan Wu♠, Noah Goodman♠, Christopher Potts♠, Thomas Icard∗♠

♢ Pr(Ai)²R Group; ♠ Stanford University
∗ Corresponding authors: atticusg@gmail.com; icard@stanford.edu

Editor: Jin Tian

Abstract

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.

Keywords: Mechanistic Interpretability, Causality, Abstraction, Explainable AI

©2025 Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard. License: CC BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0058.html.
arXiv:2301.04709v4 [cs.AI] 8 May 2025

Contents

1 Introduction
2 Causality and Abstraction
   2.1 Deterministic Causal Models with Implicit Graphical Structure
   2.2 Intervention Algebras
   2.3 Exact Transformation with Interventionals
      2.3.1 Bijective Translation
      2.3.2 Constructive Causal Abstraction
      2.3.3 Decomposing Alignments Between Models
   2.4 Approximate Transformation
   2.5 Interchange Interventions
   2.6 Example: Causal Abstraction in Mechanistic Interpretability
   2.7 Example: Causal Abstraction with Cycles and Infinite Variables
3 A Common Language for Mechanistic Interpretability
   3.1 Polysemantic Neurons, the Linear Representation Hypothesis, and Modular Features via Intervention Algebras
   3.2 Graded Faithfulness via Approximate Abstraction
   3.3 Behavioral Evaluations as Abstraction by a Two-Variable Chain
      3.3.1 LIME: Behavioral Fidelity as Approximate Abstraction
      3.3.2 Single Source Interchange Interventions from Integrated Gradients
      3.3.3 Estimating the Causal Effect of Real-World Concepts
   3.4 Patching Activations with Interchange Interventions
      3.4.1 Causal Mediation as Abstraction by a Three-Variable Chain
      3.4.2 Path Patching as Recursive Interchange Interventions
   3.5 Ablation as Abstraction by a Three-Variable Collider
      3.5.1 Concept Erasure
      3.5.2 Sub-Circuit Analysis
      3.5.3 Causal Scrubbing
   3.6 Modular Feature Learning as Bijective Transformation
      3.6.1 Unsupervised Methods
      3.6.2 Aligning Low-Level Features with High-Level Variables
      3.6.3 Supervised Methods
   3.7 Activation Steering as Causal Abstraction
   3.8 Training AI Models to be Interpretable
4 Conclusion

Notation

| Symbol | Meaning | Location |
| --- | --- | --- |
| Domain(f) | the domain of a function f | |
| v ↦ f(v) | the function f | |
| 1[φ] | an indicator function that outputs 1 if φ is true, 0 otherwise | |
| V | a complete set of variables | Definition 2 |
| Val_X | a function mapping variables X ⊆ V to values | Definition 2 |
| Σ | a signature (variables V and values Val) | Definition 2 |
| v | a total setting v ∈ Val_V | Definition 3 |
| X, x | a variable X ∈ V and a value x ∈ Val_X | Definition 3 |
| **X**, **x** | a set of variables X ⊆ V and a partial setting x ∈ Val_X | Definition 3 |
| Proj_X(y) | the restriction of a partial setting y to variables X ⊆ Y ⊆ V | Definition 4 |
| Proj_Y⁻¹(x) | the set of partial settings y where Proj_X(y) = x | Definition 4 |
| {F_X}_{X∈V} | mechanisms for the variables in V | Definition 5 |
| ≺ | an ordering of V by causal dependency as induced via {F_X}_{X∈V} | Definition 5 |
| M | a causal model (a signature Σ and mechanisms {F_X}_{X∈V}) | Definition 5 |
| Solve(M) | the set of total settings that are solutions to a model M | Definition 5 |
| i ∈ Hard | a hard intervention that fixes the variables I ⊆ V | Definition 9 |
| {I_X}_{X∈X} ∈ Soft | a soft intervention that replaces the mechanisms of X ⊆ V | Definition 10 |
| {𝓘_X}_{X∈X} ∈ Func | an interventional that edits the mechanisms of X ⊆ V | Definition 11 |
| Func_V | the set of interventionals that edit all mechanisms | Definition 11 |
| Φ | the interventionals in Func_V equivalent to hard interventions | Definition 11 |
| α | a sequence of elements | Definition 11 |
| Ψ | a set of interventionals | Definition 11 |
| τ | a map from total settings of M to total settings of M* | Definition 25 |
| ω | a map from interventionals in M to interventionals of M* | Definition 25 |
| H, L | high-level and low-level causal models | Definition 33 |
| Π_{X_H} | a partition cell of V_L assigned to the variable X_H ∈ V_H | Definition 31 |
| π_{X_H} | a function mapping from values of Π_{X_H} to values of X_H | Definition 31 |

1 Introduction

We take the fundamental aim of explainable artificial intelligence to be explaining why a model makes the predictions it does. For many purposes the paradigm of explanation is causal explanation (Woodward, 2003; Pearl, 2019), elucidating counterfactual difference-making details of the mechanisms underlying model behavior. However, not just any causal explanation will be apt. There is an obvious sense in which we already know all the low-level causal facts about a deep learning model. After all, we can account for every aspect of model behavior in terms of real-valued vectors, activation functions, and weight tensors. The problem, of course, is that these low-level explanations are typically not transparent to humans—they fail to instill an understanding of the high-level principles underlying model behavior (Lipton, 2018; Creel, 2020) that can guide human action (Karimi et al., 2021, 2023). In many contexts it can be quite straightforward to devise simple algorithms for tasks that operate on human-intelligible concepts. The crucial question is, under what conditions does such a transparent algorithm constitute a faithful interpretation (Jacovi and Goldberg, 2020) of the known, but opaque, low-level details of a black box model? This is the motivating question for the explainable AI subfield known as interpretability.
The question takes on particular significance for mechanistic interpretability,¹ which, in contrast to behavioral interpretability, is precisely aimed at reverse engineering the internals of a black box model in terms of a transparent algorithm. Mechanistic interpretability research is quite analogous to the problem that cognitive scientists face in understanding how the human mind works. At one extreme, we can try to understand minds at a very low level, e.g., of biochemical processes in the brain. At the other extreme, we can focus just on the input-output facts of the system, roughly speaking, on 'observable behavior.' Analogously, for a deep learning model, we can focus either on low-level features (weight tensors, activation functions, etc.), or on what input-output function is computed. In both cases, however, it can be illuminating to investigate the mediating processes and mechanisms that transform input to output at a slightly higher level of abstraction. This is what Marr (1982) famously called the algorithmic level of analysis. To the extent that these algorithmic-level hypotheses are transparent to the scientist, we may have a useful elucidation of the agent's inner workings. However, it is crucial that mechanistic interpretability methods avoid telling 'just-so' stories that are completely divorced from the internal workings of the model. To clarify what this means exactly, we need a common language for explicating and comparing methodologies, and for precisifying core concepts. We submit that the theory of causal abstraction provides this common language.

In some ways, modern deep learning models are like the weather or an economy: they involve large numbers of densely connected 'microvariables' with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables ('macrovariables'). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analyzing a system at multiple levels of detail (Chalupka et al., 2017; Rubenstein et al., 2017; Beckers and Halpern, 2019; Beckers et al., 2019; Rischel and Weichwald, 2021; Massidda et al., 2023, 2024). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyze weather patterns (Chalupka et al., 2016), human brains (Dubois et al., 2020a,b), physical systems (Kekic et al., 2023), batteries (Zennaro et al., 2023a), epidemics (Dyer et al., 2023), and deep learning models (Chalupka et al., 2015; Geiger et al., 2021; Hu and Tian, 2022; Geiger et al., 2023; Wu et al., 2023). Macrovariables will not always correspond to sets of microvariables.

¹ Saphra and Wiegreffe (2024) argue that the term 'mechanistic interpretability' often signals not a single research program, but rather a cultural identity within the AI alignment community (Olah et al., 2020; Elhage et al., 2021; Wang et al., 2023; Nanda et al., 2023a). However, they also note that there is a natural 'narrow' technical reading of the term, where the concern is specifically with causal analyses of models' internal mechanisms (Vig et al., 2020; Geiger et al., 2020, 2021, 2023; Finlayson et al., 2021; Meng et al., 2022; Stolfo et al., 2023; Todd et al., 2024; Prakash et al., 2024; Mueller et al., 2024). We embrace that causality is core to mechanistic interpretability.
Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high-level concepts are thought to be represented by modular 'features' distributed across individual neural activations (Harradon et al., 2018; Olah et al., 2020; Huang et al., 2024). For example, the linear subspaces of activation space learned from distributed alignment search (Geiger et al., 2023) and the output dimensions of sparse autoencoders (Bricken et al., 2023; Cunningham et al., 2023) are features that are distributed across overlapping sets of neural activations.

Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new ('translated') variable space. We elucidate this situation, generalizing earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).

Our second contribution is to show how causal abstraction provides a solid theoretical foundation for the field of mechanistic interpretability. We leverage our general presentation to provide flexible, yet mathematically precise, definitions for the core mechanistic interpretability concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness.
Furthermore, we unify a wide range of existing interpretability methodologies in a common language, including activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and activation steering. We also connect the behavioral interpretability methods of LIME, integrated gradients, and causal effect estimation to the language of causal abstraction.

We are optimistic about productive interplay between theoretical work on causal abstraction and applied work on mechanistic interpretability. In stark contrast to weather, brains, or economies, we can measure and manipulate the microvariables of deep learning models with perfect precision and accuracy, and thus empirical claims about their structure can be held to the highest standard of rigorous falsification through experimentation.

2 Causality and Abstraction

This section presents a general theory of causal abstraction. Although we build on much existing work in the recent literature, our presentation is in some ways more general, and in other ways less so. Due to our focus on (deterministic) neural network models, we do not incorporate probability into the picture. At the same time, because the operations employed in the study of modern machine learned systems go beyond 'hard' and 'soft' interventions that replace model mechanisms (see Def. 9 and 10 below), we define a very general kind of intervention, the interventional, which is a functional mapping from old mechanisms to new mechanisms (see Def. 11). In order to impose structure on this unconstrained class of model transformations, we establish some new results on classes of interventionals that form what we will call intervention algebras (see especially Theorems 20, 21).
Next, we explore key relations that can hold between causal models. We begin with exact transformations (Rubenstein et al., 2017), which characterize when the mechanisms of one causal model are realized by the mechanisms of another ('causal consistency'). We generalize exact transformation from hard interventions to interventionals that form intervention algebras (see Def. 25). A bijective translation (see Def. 28) is an exact transformation that retains all the details in the original model, staying at the same level of granularity. On the other hand, a constructive causal abstraction (see Def. 33) is a 'lossy' exact transformation that merges microvariables into macrovariables, while maintaining a precise and accurate description of the original model. Also, we (1) decompose constructive causal abstraction into three operations, namely marginalization, variable merge, and value merge (see Prop. 40), and (2) provide a framework for approximate transformations (see Def. 41).

Lastly, we define a family of interchange intervention operations, which are central to understanding mechanistic interpretability through the lens of causal abstraction. We begin with simple interchange interventions (see Def. 44), where a causal model with input and output variables has certain variables fixed to values they would have under different input conditions. We extend these to recursive interchange interventions (see Def. 45), which allow variables to be fixed based on the results of previous interchange interventions. Crucially, we also define distributed interchange interventions that target variables distributed across multiple causal variables and involve a bijective translation to and from a transformed variable space (see Def. 46). We conclude by explicating how to construct an alignment for interchange intervention analysis and how to use interchange intervention accuracy to quantify approximate abstractions.
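The flavor of a simple interchange intervention can be sketched in a few lines of Python. The three-variable model below and all of its names are illustrative inventions, not objects from the paper: an intermediate variable is fixed to the value it would take under a different ('source') input while the model processes a base input.

```python
# Minimal sketch of a simple interchange intervention: run the model
# on a source input, record an intermediate variable's value, then fix
# that variable to the recorded value on a different (base) input.
# The toy acyclic model In -> Mid -> Out is purely illustrative.

def run(x_in, fix_mid=None):
    """Compute the toy model, optionally fixing the Mid variable."""
    mid = x_in > 5 if fix_mid is None else fix_mid   # mechanism F_Mid
    out = "high" if mid else "low"                   # mechanism F_Out
    return mid, out

def interchange(base_in, source_in):
    # Value Mid would take under the source input...
    source_mid, _ = run(source_in)
    # ...patched into a run on the base input.
    return run(base_in, fix_mid=source_mid)

print(interchange(base_in=2, source_in=9))   # Mid inherited from source
```

Here the patched run on `base_in=2` produces the output the source input would have caused, which is the counterfactual signal interchange interventions are designed to expose.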
2.1 Deterministic Causal Models with Implicit Graphical Structure

We start with some basic notation.

Remark 1 (Notation throughout the paper) Capital letters (e.g., X) are used for variables and lower case letters (e.g., x) are used for values. Bold faced letters (e.g., X or x) are used for sets of variables and sets of values. When a variable (or set of variables) and a value (or set of values) have the same letter, the values correspond to the variables (e.g., x ∈ Val_X or x ∈ Val_X).

Definition 2 (Signature) We use V to denote a fixed set of variables, each X ∈ V coming with a non-empty range Val_X of possible values. Together Σ = (V, Val) are called a signature.

Definition 3 (Partial and Total Settings) We assume Val_X ∩ Val_Y = ∅ whenever X ≠ Y, meaning no two variables can take on the same value.² This assumption allows representing the values of a set of variables X ⊆ V as a set x of values, with exactly one value x ∈ Val_X in x for each X ∈ X. When we need to be specific about the choice of variable X, we denote this element of Val_X as x. We thus have x ⊆ ⋃_{X∈X} Val_X, and we refer to the values x ∈ Val_X as partial settings. In the special case where v ∈ Val_V, we call v a total setting.

Another useful construct in this connection is the projection of a partial setting:

Definition 4 (Projection) Given a partial setting y for a set of variables Y ⊇ X, we define Proj_X(y) to be the restriction of y to the variables in X. Given a partial setting x, we define the inverse: Proj_Y⁻¹(x) = {y ∈ Val_Y : Proj_X(y) = x}.

Definition 5 A (deterministic) causal model is a pair M = (Σ, {F_X}_{X∈V}), such that Σ is a signature and {F_X}_{X∈V} is a set of mechanisms, with F_X : Val_V → Val_X assigning a value to X as a function of the values of all the variables, including X itself. We write F_X for {F_X}_{X∈X}. We will often break up the input argument to a mechanism into partial settings that form a total setting, e.g., F_X(y, z) for Y ∪ Z = V.
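As a concrete sketch of these definitions, a model over a finite signature can be encoded as a dict of mechanism functions, each taking a total setting, with solutions found by brute force. The particular model and encoding below are illustrative, not from the paper:

```python
from itertools import product

# Sketch of Definitions 2-5: a model is a signature (variables with
# finite value ranges) plus one mechanism per variable, each mechanism
# a function of a *total* setting. solve() brute-forces the solution
# set of the model's equations over the finite signature.

values = {"X": range(10), "Y": (True, False)}      # signature
mechanisms = {
    "X": lambda v: 0,              # input variable: constant mechanism
    "Y": lambda v: v["X"] > 5,     # F_Y depends only on X
}

def solve(values, mechanisms):
    """All total settings v with v[X] == F_X(v) for every variable X."""
    names = list(values)
    sols = []
    for combo in product(*(values[n] for n in names)):
        v = dict(zip(names, combo))
        if all(v[n] == mechanisms[n](v) for n in names):
            sols.append(v)
    return sols

print(solve(values, mechanisms))   # acyclic model: a single solution
```

Because this model is acyclic, exactly one total setting satisfies every equation, matching the singleton-solution behavior the paper describes for acyclic models.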
Remark 6 (Inducing Graphical Structure) Observe that our definition of causal model makes no explicit reference to a graphical structure defining a causal ordering on the variables. While the mechanism for a variable takes in total settings, it might be that the output of a mechanism depends only on a subset of values. This induces a causal ordering among the variables, such that Y ≺ X—or Y is a parent of X—just in case there is a setting z of the variables Z = V \ {Y}, and two settings y, y′ of Y such that F_X(z, y) ≠ F_X(z, y′) (see, e.g., Woodward 2003). The resulting order ≺ captures a notion of direct causation, which can be extended to indirect causation by taking its transitive closure ≺*. Throughout the paper, we will define mechanisms to take in partial settings of parent variables, though technically they take in total settings and depend only on the parent variables. When ≺* is irreflexive, we say the causal model is acyclic. Most of our examples of causal models will have this property. However, it is often also possible to give causal interpretations of cyclic models (see, e.g., Bongers et al. 2021). Indeed, the abstraction operations to be introduced generally create cycles among variables, even from initially acyclic models (see Rubenstein et al. 2017, §5.3 for an example). In Section 2.7, we provide an example where we abstract a causal model representing the bubble sort algorithm into a cyclic model where any sorted list is a solution satisfying the equations.

Remark 7 (Acyclic Model Notation) Our example in Section 2.6 will involve causal abstraction between two finite, acyclic causal models. We will call variables that depend on no other variable input variables (X_In), and variables on which no other variables depend output variables (X_Out). The remaining variables are intermediate variables.
We can intervene on input variables X_In to 'prime' the model with a particular input.³ As such, the constant functions for input variables in our examples will be overwritten. As M can also be interpreted simply as a set of equations, we can define the set of solutions, which may be empty.

² To allow the same value to occur multiple times, we can simply take any causal model where variables share values, and then 'tag' the shared values with variable names to make them unique.

Definition 8 (Solution Sets) Given M = (V, {F_X}_{X∈V}), the set of solutions, called Solve(M), is the set of all v ∈ Val_V such that all the equations Proj_X(v) = F_X(v) are satisfied for each X ∈ V. When M is acyclic, there is a single solution and we use Solve(M) to refer interchangeably to a singleton set of solutions and its sole member, relying on context to disambiguate.

We give a general definition of intervention on a model (see, e.g., Spirtes et al. 2000; Woodward 2003; Pearl 2009). For the following, assume we have fixed a signature Σ.

Definition 9 (Intervention) Define a hard intervention to be a partial setting i ∈ Val_I for finite I ⊆ V. Given a model M with signature Σ, define M_i to be just like M, except that we replace F_X with the constant function v ↦ Proj_X(i) for each X ∈ I. Define Hard_X = Val_X to be the set of all hard interventions on X.

Definition 10 (Soft Intervention) We define a soft intervention on some finite set of variables X ⊆ V to be a family of functions {I_X}_{X∈X} where I_X : Val_V → Val_X for each X ∈ X. Given a model M with signature Σ, define M_I to be just like M, except that we replace each function F_X with the function I_X. Define Soft_X to be the set of all soft interventions on X.

Soft interventions generalize hard interventions. The hard intervention i is equivalent to a constant soft intervention I = {v ↦ x}_{x∈i}.
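Both kinds of intervention amount to swapping out mechanism functions; a sketch in the dict-of-mechanisms encoding (illustrative names, not the paper's notation) makes the distinction concrete:

```python
# Sketch of Definitions 9-10: interventions as mechanism replacement.
# A hard intervention fixes targeted variables to constants; a soft
# intervention swaps in arbitrary new mechanisms. A dict of mechanism
# functions stands in for the model M; the example is illustrative.

def hard_intervention(mechanisms, i):
    """M_i: replace F_X with the constant v -> i[X] for each X in i."""
    out = dict(mechanisms)
    for X, x in i.items():
        out[X] = (lambda v, x=x: x)   # constant function, ignores v
    return out

def soft_intervention(mechanisms, new_mechs):
    """M_I: replace F_X with I_X for each targeted variable X."""
    return {**mechanisms, **new_mechs}

F = {"X": lambda v: 0, "Y": lambda v: v["X"] > 5}
F_hard = hard_intervention(F, {"Y": True})                  # Y fixed
F_soft = soft_intervention(F, {"Y": lambda v: v["X"] <= 5})  # Y rewired

print(F_hard["Y"]({"X": 0}), F_soft["Y"]({"X": 0}))   # True True
```

The hard case ignores the rest of the setting entirely, while the soft case installs a new dependence on X, which is exactly the generalization the text describes.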
Definition 11 (Interventional) We define an interventional on some finite set of variables X ⊆ V to be a family of functions 𝓘 = {𝓘_X}_{X∈X} where 𝓘_X : Soft_X → Soft_X for each X ∈ X. We define M_𝓘 to be just like M, except that we replace each function F_X with the function 𝓘_X⟨F_X⟩. Define Func_X to be the set of interventionals on X.

Interventionals generalize soft interventions. The soft intervention I is a constant interventional—namely, the family of functions that, for each X ∈ X, sends any element of Soft_X to I_X. When only one variable is targeted, we say that the intervention(al) is atomic.

Remark 12 (On Terminology) What we are calling hard interventions are sometimes called structural interventions (Eberhardt and Scheines, 2007). Our soft interventions are essentially the same as the soft interventions studied, e.g., in Tian (2008); Bareinboim and Correa (2020), although as mentioned, we set aside probabilistic aspects of causal models in this paper. The main difference between soft interventions and interventionals is that the latter 'mechanism replacements' can depend on the previous mechanisms in the model. This type of dependence has also been studied, e.g., in the setting of so-called parametric interventions (Eberhardt and Scheines, 2007).

³ In some parts of the literature what we are calling input variables are designated as 'exogenous' variables.

Remark 13 (Interventionals as Unconstrained Model Transformations) The set Func_V contains the interventionals that target all variables V. This set of interventionals is quite general, containing every function that maps from and to causal models with the same signature.

Remark 14 (Composing Intervention(al)s) Interventionals containing families of constant functions to constant functions are exactly the set of hard interventions. Similarly, interventionals containing families of constant functions are exactly the set of soft interventions. As such, we treat all three types of interventions as interventionals on all variables, i.e., ⋃_{X⊆V} Hard_X ⊆ ⋃_{X⊆V} Soft_X ⊆ ⋃_{X⊆V} Func_X ⊆ Func_V. This allows us to understand the composition of interventions as function composition. We simplify notation by writing, e.g., x ◦ y for the composition of hard interventions (setting X to x and then Y to y).⁴

The unconstrained space of interventionals Func_V is unruly, and there is no guarantee that an interventional can be thought of as isolating a natural model component. We want to characterize spaces of interventionals that 'act like hard interventions' insofar as they possess a basic algebraic structure. We elaborate on this in the next section. The following is an example of a causal model with hard interventions, soft interventions, and interventionals defined on it.

Example 1 Define a signature Σ with variables V = {X, Y} with values Val_X = {0, ..., 9} and Val_Y = {True, False}. Define a model M with mechanisms F_X(v) = 0 (recall that input variables have no parents and are mapped to constants per Remark 7) and F_Y(v) = [Proj_X(v) > 5]. The graphical structure of M has one directed edge from X to Y because X ≺ Y. Per Remark 6, we could define the mechanism for Y as F_Y(x) = [x > 5], omitting the value that Y does not depend on, namely its own.

Define the hard intervention y = True ∈ Hard_Y, the soft intervention I = v ↦ [Proj_X(v) ≤ 5] ∈ Soft_Y, and the interventional 𝓘 = F_Y ↦ (v ↦ ¬F_Y(v)) ∈ Func_Y. The model M_y has mechanisms F_X(v) = 0 and F_Y(v) = True and a graphical structure with no edges. The models M_I, M_{𝓘◦I}, and M_𝓘 all have the mechanisms F_X(v) = 0 and F_Y(x) = [x ≤ 5] and the same graphical structure as M. The model M_{I◦𝓘} is identical to M.

Define the interventional 𝓙 = ⟨F_X, F_Y⟩ ↦ ⟨v ↦ 6 × 1[Proj_Y(v)], v ↦ ¬F_Y(v)⟩ ∈ Func_V. The model M_𝓙 has mechanisms F_X(y) = 6 × 1[y] and F_Y(x) = [x ≤ 5] with a cyclic graphical structure where X ≺ Y and Y ≺ X. This model has no solutions.
The model M_{𝓙◦𝓙} has mechanisms F_X(y) = 6 × 1[y] and F_Y(x) = [x > 5] with a cyclic graphical structure where X ≺ Y and Y ≺ X. The solutions to this model are {6, True} and {0, False}.

2.2 Intervention Algebras

We are interested in the subsets of Func_V that are well-behaved in the sense that they share an algebraic structure with hard interventions under the operation of function composition. The relevant algebraic structure is captured in the next definition.

Definition 15 Let Λ be a set and ⊕ be a binary operation on Λ. We define (Λ, ⊕) to be an intervention algebra if there exists a signature Σ = (V, Val) such that (Φ, ◦) ≃ (Λ, ⊕)—that is, these structures are isomorphic—where Φ is the set of all constant functionals mapping to constant functions (i.e., hard interventions) for signature Σ, and ◦ is function composition.

⁴ Note that we write composition in the opposite order from common notation for function composition, following the more intuitive order of intervention composition adopted, e.g., in Rubenstein et al. (2017).

As a matter of fact, intervention algebras can be given an intuitive characterization based on two key properties of hard interventions.

Definition 16 (Commutativity and Left-Annihilativity) Hard interventions x, y ∈ Φ under function composition have the following properties:

(a) If hard interventions target different variables, ◦ is commutative: if X ≠ Y then x ◦ y = y ◦ x;
(b) If hard interventions target the same variable, ◦ is left-annihilative: if X = Y, then x ◦ y = y.

Note equality signifies the compositions are the very same functions from models to models. These two properties highlight an important sense in which hard interventions are modular: when intervening on two different variables, the order does not matter; and when intervening on the same variable twice, the second undoes any effect of the first intervention.
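Both modularity properties can be exercised directly when hard interventions are encoded as functions on mechanism dicts; the encoding and the example model below are illustrative sketches:

```python
# Sketch of Definition 16: composing hard interventions as functions
# from models to models. Interventions on distinct variables commute;
# a second intervention on the same variable annihilates the first.

def do(i):
    """Hard intervention i as a function on mechanism dicts."""
    return lambda F: {**F, **{X: (lambda v, x=x: x) for X, x in i.items()}}

F = {"X": lambda v: 0, "Y": lambda v: v["X"] > 5}
v = {"X": 7, "Y": False}          # a fixed total setting to probe with

def evaluate(mechs):
    return {X: mechs[X](v) for X in mechs}

# (a) commutativity across distinct variables:
assert evaluate(do({"Y": True})(do({"X": 3})(F))) == \
       evaluate(do({"X": 3})(do({"Y": True})(F)))
# (b) left-annihilativity on the same variable: x then y equals y alone
assert evaluate(do({"X": 9})(do({"X": 3})(F))) == evaluate(do({"X": 9})(F))
print("both modularity properties hold on this example")
```

Note the composition order follows the paper's convention: applying `do({"X": 3})` and then `do({"X": 9})` leaves only the second intervention's effect.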
We can use commutativity and left-annihilativity to build an equivalence relation that we will show captures the fundamental algebraic structure of hard interventions.

Definition 17 Let A be any set with equivalence relation ∼, and define (A*, ·) to be the free algebra generated by elements of A under concatenation. Define ≈ to be the smallest congruence on A* extending the following:

{⟨x·y, y·x⟩ : x ≁ y} ∪ {⟨y·x, x⟩ : x ∼ y},   (1)

for all x, y ∈ A, where · is concatenation in A*.

As it turns out, ≈ can be obtained constructively as the result of two operations that define a normal form for sequences of atomic hard interventions.

Definition 18 (Normal form) Let A be a set equipped with equivalence relation ∼. Fix an order ⋖ on ∼-equivalence classes. Define Collapse : A* → A* to take a sequence and remove every element that has a ∼-equivalent element that occurs to its right. Define Sort : A* → A* to take a sequence and sort it according to ⋖. For any element α ∈ A*, we call Sort(Collapse(α)) the normal form of α. This normal form clearly exists and is unique.

Lemma 19 For α, β ∈ A*, we have α ≈ β iff Sort(Collapse(α)) = Sort(Collapse(β)), that is, iff α and β have the same normal form.

Proof Let us write α ≡ β when Sort(Collapse(α)) = Sort(Collapse(β)). First note that ≡ is a congruence extending the relation in (1), and hence ≈ ⊆ ≡. For the other direction, it suffices to observe that both Collapse(c) ≈ c and Sort(c) ≈ c, for any c ∈ A*. This follows from the fact that ≈ is a congruence extending (1). Then if α ≡ β, we have Sort(Collapse(α)) ≈ Sort(Collapse(β)) (by reflexivity of ≈), and moreover:

α ≈ Sort(Collapse(α)) ≈ Sort(Collapse(β)) ≈ β,

and hence ≡ ⊆ ≈. Consequently, α ≈ β iff α and β have the same normal form. ∎

The foregoing produces a representation theorem for intervention algebras.

Theorem 20 The quotient (A*/≈, ⊙) of (A*, ·) under ≈, with ⊙ defined so [α]≈ ⊙ [β]≈ = [α·β]≈, is an intervention algebra.
Proof Define a signature $\Sigma$ where the variables $\mathbf{V}$ are the $\sim$-equivalence classes of $A$ and the values of each variable are the members of the respective $\sim$-equivalence class. We need to show that $(A^*/{\approx},\odot)$ is isomorphic to $(\Phi,\circ)$, the set of all (finite) hard interventions with function composition.

Begin by defining the map $\iota^*: A^* \to \Phi$, where $\iota^*(x_1\cdot\,\dots\,\cdot x_n) = x_1\circ\dots\circ x_n$. First observe:

\[ \iota^*(\alpha) = \iota^*(\mathsf{Collapse}(\alpha)) = \iota^*(\mathsf{Sort}(\mathsf{Collapse}(\alpha))) \tag{2} \]

Eq. (2) follows from commutativity and left-annihilativity (Def. 16). Moreover, again by commutativity and left-annihilativity, if $X = Y$ but $X \notin \mathbf{Z}$, then $x\circ z\circ y = z\circ y$, which is precisely what justifies the first equality.

Finally, define $\iota: A^*/{\approx}\ \to \Phi$ so that $\iota([\alpha]_\approx) = \iota^*(\alpha)$. This is well defined by Eq. (2) and Lemma 19. It is surjective because $\iota^*$ is surjective. It is also injective: if $\mathsf{Sort}(\mathsf{Collapse}(\alpha)) \neq \mathsf{Sort}(\mathsf{Collapse}(\beta))$, then pick the $\lessdot$-least $\sim$-equivalence class of $A$, that is, the $\lessdot$-least variable $X$, such that $\mathsf{Sort}(\mathsf{Collapse}(\alpha))$ and $\mathsf{Sort}(\mathsf{Collapse}(\beta))$ disagree on $X$. Without loss of generality, we can assume $\iota^*(\mathsf{Sort}(\mathsf{Collapse}(\alpha)))$ assigns $X$ to some value $x$, but $\iota^*(\mathsf{Sort}(\mathsf{Collapse}(\beta)))$ does not assign $X$ to $x$ (either because it does not assign $X$ to any value or because it assigns $X$ to a different value). In any case, if $[\alpha]_\approx \neq [\beta]_\approx$, then $\iota([\alpha]_\approx) = \iota^*(\mathsf{Sort}(\mathsf{Collapse}(\alpha))) \neq \iota^*(\mathsf{Sort}(\mathsf{Collapse}(\beta))) = \iota([\beta]_\approx)$.

That $\iota$ is an isomorphism follows from the sequence of equalities below:

\[
\begin{aligned}
\iota([\alpha]_\approx\odot[\beta]_\approx) &= \iota([\alpha\cdot\beta]_\approx) && \text{by the definition of }\odot\\
&= \iota([\mathsf{Sort}(\mathsf{Collapse}(\alpha\cdot\beta))]_\approx) && \text{by Lemma 19}\\
&= \iota^*(\mathsf{Sort}(\mathsf{Collapse}(\alpha\cdot\beta))) && \text{by the definition of }\iota\\
&= \iota^*(\alpha\cdot\beta) && \text{by Equation (2)}\\
&= \iota^*(\alpha)\circ\iota^*(\beta) && \text{by the definition of }\iota^*\\
&= \iota^*(\mathsf{Sort}(\mathsf{Collapse}(\alpha)))\circ\iota^*(\mathsf{Sort}(\mathsf{Collapse}(\beta))) && \text{by Equation (2)}\\
&= \iota([\mathsf{Sort}(\mathsf{Collapse}(\alpha))]_\approx)\circ\iota([\mathsf{Sort}(\mathsf{Collapse}(\beta))]_\approx) && \text{by the definition of }\iota\\
&= \iota([\alpha]_\approx)\circ\iota([\beta]_\approx) && \text{by Lemma 19}
\end{aligned}
\]

This concludes the proof of Theorem 20.

Sets of atomic soft interventions also form intervention algebras:

Theorem 21 Suppose $\Psi_0$ is a set of atomic soft interventions with signature $\Sigma$, and let $\Psi$ be the closure of $\Psi_0$ under function composition.
Then $(\Psi,\circ)$ is an intervention algebra.

Proof Just as in the proof of Theorem 20, we can consider the free algebra generated by $A = \Psi_0$, quotiented under $\approx$, to obtain $(A^*/{\approx},\odot)$. The proof that this algebra is isomorphic to a set of hard interventions follows exactly as in the proof of Theorem 20, relying on the fact that soft interventions also satisfy the key properties (a) and (b) in Definition 16:

(i) If soft interventions target different variables, $\circ$ is commutative: if $X \neq Y$, then $I_X\circ I_Y = I_Y\circ I_X$;

(ii) If soft interventions target the same variable, $\circ$ is left-annihilative: if $X = Y$, then $I_X\circ I_Y = I_Y$.

Consequently, we know that there exists a signature $\Sigma^*$ such that hard interventions on $\Sigma^*$ are isomorphic to the soft interventions $\Psi$ with respect to function composition. Specifically, where $\Psi_0^X \subseteq \Psi_0$ is the set of atomic soft interventions that target $X$, the variables of $\Sigma^*$ are $\mathbf{V}^* = \{X^* : \Psi_0^X \neq \emptyset\}$ and, for each $X^*\in\mathbf{V}^*$, the values are $\mathsf{Val}_{X^*} = \Psi_0^X$.

Remark 22 While both hard and soft interventions give rise to intervention algebras, the more general class of interventionals satisfies weaker algebraic constraints. For instance, it is easy to see that left-annihilativity (see (b) and (ii) above) often fails. Consider a signature $\Sigma$ with a single variable $X$ that takes on binary values $\{0,1\}$. Define the interventional $I\langle F_X\rangle = x \mapsto 1 - F_X(x)$. Observe that $I\circ I \neq I$, because $I\circ I$ is the identity function, and so left-annihilativity fails.

While interventionals do not form intervention algebras in general, particular classes of interventionals can form intervention algebras.

Example 2 Consider a signature $\Sigma$ with variables $X_1,\dots,X_n$ that take on integer values.
Define $\Psi$ to contain interventionals that fix the $p$th digit of a number to be the digit $q$, where $\%$ is modulus and $//$ is integer division:

\[ I\langle F_{X_k}\rangle = v \mapsto F_{X_k}(v) - \big((F_{X_k}(v)\,//\,10^p)\ \%\ 10\big)\cdot 10^p + q\cdot 10^p \]

This class of interventionals is isomorphic to hard interventions on a signature $\Sigma^*$ with variables $\{Y^p_k : 1\le k\le n,\ p\in\{0,1,\dots\}\}$ that take on values $0,1,\dots,9$. The interventional fixing the $p$th digit of $X_k$ to be the digit $q$ corresponds to the hard intervention fixing $Y^p_k$ to the value $q$.

Remark 23 (Assignment mutations) As a side remark, it is worth observing that another setting where intervention algebras appear is in programming language semantics and the semantics of, e.g., first-order predicate logic. Let $D$ be some domain of values and $\mathsf{Var}$ a set of variables. An assignment is a function $g:\mathsf{Var}\to D$. Let $g^x_d$ be the mutation of $g$, defined so that $g^x_d(y) = g(y)$ when $y\neq x$, and $g^x_d(y) = d$ when $y = x$. Then a mutation $(\cdot)^x_d$ can be understood as a function from the set of assignments to the set of assignments. Where $M$ is the set of all mutations, it is then easy to show that the pair $(M,\circ)$ forms an intervention algebra. Furthermore, every intervention algebra can be obtained this way (up to isomorphism), for a suitable choice of $D$ and $\mathsf{Var}$.

Definition 24 (Ordering on Intervention Algebras) Let $(\Lambda,\oplus)$ be an intervention algebra. We define an ordering $\le$ on elements of $\Lambda$ as follows: $\lambda\le\lambda'$ iff $\lambda'\oplus\lambda = \lambda'$. So, in particular, if the intervention algebra contains hard interventions, i.e., if it is of the form $(\Phi,\circ)$, then this amounts to the ordering from Rubenstein et al. (2017): $x\le y$ iff $\mathbf{X}\subseteq\mathbf{Y}$ and $x = \mathsf{Proj}_{\mathbf{X}}(y)$, that is, iff $x\subseteq y$. Note that $\le$ is defined as in a semi-lattice, except that $\oplus$ (or $\circ$) is not commutative in general, so the order matters.
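The digit-fixing interventionals of Example 2 can be sketched directly in code; the helper names are illustrative, but the arithmetic is exactly the formula above, and the assertions check the two intervention-algebra properties on this class:

```python
# Sketch of Example 2: an interventional that fixes the p-th decimal digit
# of a mechanism's output to the digit q, using % (modulus) and // (integer
# division), exactly as in the displayed formula.

def fix_digit(p, q):
    def interventional(mech):
        def new_mech(v):
            x = mech(v)
            return x - ((x // 10**p) % 10) * 10**p + q * 10**p
        return new_mech
    return interventional

base = lambda v: 5280              # a constant low-level mechanism
m = fix_digit(1, 7)(base)          # fix the tens digit to 7
assert m(None) == 5270

# Left-annihilative on the same digit, commutative on different digits:
assert fix_digit(1, 3)(fix_digit(1, 7)(base))(None) == fix_digit(1, 3)(base)(None)
assert fix_digit(0, 9)(fix_digit(2, 1)(base))(None) == \
       fix_digit(2, 1)(fix_digit(0, 9)(base))(None)
```

This mirrors the isomorphism in Example 2: each `fix_digit(p, q)` behaves like a hard intervention setting the digit-valued variable $Y^p_k$ to $q$.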
2.3 Exact Transformation with Interventionals

Researchers have been interested in the question of when two models, potentially defined on different signatures, are compatible with one another in the sense that they could both accurately describe the same target causal phenomena. The next definition presents a deterministic variant of the notion of 'exact transformation' from Rubenstein et al. (2017) (see also Def. 3.5 in Beckers and Halpern 2019), generalized to interventionals. The other notions we study in this paper, namely bijective translation and constructive abstraction, are special cases of exact transformation.

Definition 25 Let $\mathcal{M},\mathcal{M}^*$ be causal models and let $(\Psi,\circ)$ and $(\Psi^*,\circ)$ be two intervention algebras where $\Psi$ and $\Psi^*$ are interventionals on $\mathcal{M}$ and $\mathcal{M}^*$, respectively. Furthermore, let $\tau:\mathsf{Val}_{\mathbf{V}}\to\mathsf{Val}_{\mathbf{V}^*}$ and $\omega:\Psi\to\Psi^*$ be two partial surjective functions where $\omega$ is $\le$-preserving; that is, $\omega(I)\le\omega(I')$ whenever $I\le I'$. Then $\mathcal{M}^*$ is an exact transformation of $\mathcal{M}$ under $(\tau,\omega)$ if for all $I\in\mathsf{Domain}(\omega)$, the following diagram commutes:

\[
\begin{array}{ccc}
I & \xrightarrow{\ \omega\ } & \omega(I)\\[2pt]
\big\downarrow & & \big\downarrow\\[2pt]
\mathsf{Solve}(\mathcal{M}_I) & \xrightarrow{\ \tau\ } & \mathsf{Solve}\big(\mathcal{M}^*_{\omega(I)}\big)
\end{array}
\]

That is to say, the interventional $\omega(I)$ on $\mathcal{M}^*$ results in the same total settings of $\mathbf{V}^*$ as the result of first determining a setting of $\mathbf{V}$ from $I$ and then applying the translation $\tau$ to obtain a setting of $\mathbf{V}^*$. In a single equation:

\[ \tau\big(\mathsf{Solve}(\mathcal{M}_I)\big) = \mathsf{Solve}\big(\mathcal{M}^*_{\omega(I)}\big). \tag{3} \]

Footnote 5: When evaluating whether $\tau(S) = T$, as in, e.g., (3), it is possible that not every element of $S$ is in the domain of $\tau$, since $\tau$ is partial. Simply map such points to some distinguished element $\bot$.

This definition captures the intuitive idea that $\mathcal{M}$ and $\mathcal{M}^*$ are consistent descriptions of the same causal situation.

Remark 26 Definition 25 is a variant of exact transformation in Rubenstein et al. (2017) with intervention algebras.
However, their definition of exact transformation includes an existential quantifier over $\omega$, stating that $\mathcal{M}^*$ is an exact transformation of $\mathcal{M}$ under $\tau$ if an $\omega$ exists that satisfies the commuting diagram. Our definition tells us when a particular pair $\tau$ and $\omega$ constitute an exact transformation. We believe this difference to be inessential.

Remark 27 The composition of exact transformations is an exact transformation. That is, if $(\tau_1,\omega_1)$ and $(\tau_2,\omega_2)$ are exact transformations, then so is $(\tau_1\circ\tau_2,\ \omega_1\circ\omega_2)$, when these compositions are all defined. (See Lemma 5 from Rubenstein et al. 2017.)

2.3.1 Bijective Translation

Exact transformations can be 'lossy' in the sense that $\mathcal{M}$ may involve a more detailed, or finer-grained, description than $\mathcal{M}^*$. For example, the model $\mathcal{M}$ could encode the causal process of computer hardware operating on bits of memory while the model $\mathcal{M}^*$ encodes the fully precise, but less detailed, process of the assembly code the hardware implements. In contrast, when $\tau$ is a bijective function, there is an important sense in which $\mathcal{M}$ and $\mathcal{M}^*$ are just two equivalent (and inter-translatable) descriptions of the same causal setup.

Definition 28 (Bijective Translation) Fix signatures $\Sigma$ and $\Sigma^*$. Let $\mathcal{M}$ be a causal model with signature $\Sigma$ and mechanisms $\mathcal{F}_{\mathbf{V}}$. Let $\tau:\mathsf{Val}_{\mathbf{V}}\to\mathsf{Val}_{\mathbf{V}^*}$ be a bijective map from total settings of $\Sigma$ to total settings of $\Sigma^*$. Define $\tau(\mathcal{M})$ to be the causal model with signature $\Sigma^*$ and mechanisms

\[ F_{X^*}(v^*) = \mathsf{Proj}_{X^*}\big(\tau(\mathcal{F}_{\mathbf{V}}(\tau^{-1}(v^*)))\big) \]

for each variable $X^*\in\mathbf{V}^*$. We say that $\tau(\mathcal{M})$ is the bijective translation of $\mathcal{M}$ under $\tau$.

Remark 29 (Bijective Translations Define a Canonical $\omega$) Let $\Phi^*$ be the intervention algebra formed by hard interventions on $\Sigma^*$. We will now construct an intervention algebra $\Psi$ consisting of interventionals on $\Sigma$ and define a function $\omega:\Psi\to\Phi^*$. For each $i^*\in\mathsf{Hard}_{\mathbf{I}^*}$ with $\mathbf{I}^*\subseteq\mathbf{V}^*$, define the interventional, using notation from Definition 11,

\[ I\langle\mathcal{F}_{\mathbf{V}}\rangle = v \mapsto \tau^{-1}\Big(\mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\tau(\mathcal{F}_{\mathbf{V}}(v))\big)\cup i^*\Big). \]

We add $I$ to $\Psi$ and define $\omega(I) = i^*$.
The interventional $I$ takes in a set of mechanisms $\mathcal{F}_{\mathbf{V}}$ for all of the variables and outputs a new set of mechanisms $I\langle\mathcal{F}_{\mathbf{V}}\rangle$ for all of the variables. To retrieve mechanisms for individual variables, a projection must be applied. By construction, $(\Psi,\circ)$ is isomorphic to $(\Phi^*,\circ)$ with the $\le$-order preserving (and reflecting) map $\omega$, so $(\Psi,\circ)$ is an intervention algebra.

Theorem 30 (Bijective Translations are Exact Transformations) The bijective translation $\mathcal{M}^* = \tau(\mathcal{M})$ is an exact transformation of $\mathcal{M}$ under $(\tau,\omega)$, relative to $\Psi$ and $\Phi^*$ as constructed above.

Proof Choose an arbitrary $I\in\Psi$ with corresponding hard intervention $i^*\in\Phi^*$ where $\omega(I) = i^*$. Let $\mathcal{F}^*$ be the mechanisms of $\mathcal{M}^*$ and $\mathcal{G}^*$ be the mechanisms of $\mathcal{M}^*_{i^*}$.

Fix an arbitrary solution $v\in\mathsf{Solve}(\mathcal{M}_I)$. The following string of equalities shows that $\tau(v)$ is in $\mathsf{Solve}(\mathcal{M}^*_{i^*})$ and therefore $\tau(\mathsf{Solve}(\mathcal{M}_I))\subseteq\mathsf{Solve}(\mathcal{M}^*_{i^*})$.

\[
\begin{aligned}
\tau(v) &= \tau(I\langle\mathcal{F}_{\mathbf{V}}\rangle(v)) && \text{by the definition of a solution}\\
&= \tau\Big(\tau^{-1}\Big(\mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\tau(\mathcal{F}_{\mathbf{V}}(v))\big)\cup i^*\Big)\Big) && \text{by the definition of }I\\
&= \mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\tau(\mathcal{F}_{\mathbf{V}}(v))\big)\cup i^* && \text{inverses cancel}\\
&= \mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\mathcal{F}^*_{\mathbf{V}^*}(\tau(v))\big)\cup i^* && \mathcal{M}^*\text{ is a bijective translation of }\mathcal{M}\text{ under }\tau\\
&= \mathcal{G}^*_{\mathbf{V}^*}(\tau(v)) && \text{by the definition of hard interventions}
\end{aligned}
\]

Fix an arbitrary solution $v^*\in\mathsf{Solve}(\mathcal{M}^*_{i^*})$. The following string of equalities shows that $\tau^{-1}(v^*)$ is in $\mathsf{Solve}(\mathcal{M}_I)$ and therefore $\tau(\mathsf{Solve}(\mathcal{M}_I))\supseteq\mathsf{Solve}(\mathcal{M}^*_{i^*})$.

\[
\begin{aligned}
\tau^{-1}(v^*) &= \tau^{-1}(\mathcal{G}^*_{\mathbf{V}^*}(v^*)) && \text{by the definition of a solution}\\
&= \tau^{-1}\Big(\mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\mathcal{F}^*_{\mathbf{V}^*}(v^*)\big)\cup i^*\Big) && \text{by the definition of hard interventions}\\
&= \tau^{-1}\Big(\mathsf{Proj}_{\mathbf{V}^*\setminus\mathbf{I}^*}\big(\tau(\mathcal{F}_{\mathbf{V}}(\tau^{-1}(v^*)))\big)\cup i^*\Big) && \mathcal{M}^*\text{ is a bijective translation of }\mathcal{M}\text{ under }\tau\\
&= I\langle\mathcal{F}_{\mathbf{V}}\rangle(\tau^{-1}(v^*)) && \text{by the definition of }I
\end{aligned}
\]

Thus, $\tau(\mathsf{Solve}(\mathcal{M}_I)) = \mathsf{Solve}(\mathcal{M}^*_{i^*})$ for an arbitrary $I$, and we can conclude that $\mathcal{M}^* = \tau(\mathcal{M})$ is an exact transformation of $\mathcal{M}$ under $(\tau,\omega)$.

Example 3 If the variables of a causal model form a vector space, then a natural bijective translation is a rotation of a vector. Consider the following causal model $\mathcal{M}$ that computes boolean conjunction.
[Causal graph: $X_1$ and $X_2$ feed into $Y_1$ and $Y_2$, which feed into $Z$.]

The variables $X_1$ and $X_2$ take on binary values from $\{0,1\}$ and have constant mechanisms mapping to 0. The variables $Y_1$ and $Y_2$ take on real-valued numbers and have mechanisms defined by a $20^\circ$ rotation matrix:

\[ \begin{bmatrix} F_{Y_1}(x_1,x_2) & F_{Y_2}(x_1,x_2) \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} \cos(20^\circ) & -\sin(20^\circ)\\ \sin(20^\circ) & \cos(20^\circ) \end{bmatrix} \]

The variable $Z$ takes on binary values from $\{0,1\}$ and has the mechanism

\[ F_Z(y_1,y_2) = \mathbf{1}\big[(\sin(20^\circ)+\cos(20^\circ))\,y_1 + (\cos(20^\circ)-\sin(20^\circ))\,y_2 = 2\big] \]

Note that $Z = X_1\wedge X_2$, since $F_Z$ un-rotates $Y_1$ and $Y_2$ before summing their values. The variables $Y_1$ and $Y_2$ perfectly encode the values of the variables $X_1$ and $X_2$ using a coordinate system with axes that are tilted by $20^\circ$. We can view the model $\mathcal{M}$ through this tilted coordinate system using a bijective translation. Define the function

\[ \tau\big(\begin{bmatrix} x_1 & x_2 & y_1 & y_2 & z \end{bmatrix}\big) = \begin{bmatrix} x_1 & x_2 & y_1 & y_2 & z \end{bmatrix} \begin{bmatrix} 1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&\cos(-20^\circ)&-\sin(-20^\circ)&0\\ 0&0&\sin(-20^\circ)&\cos(-20^\circ)&0\\ 0&0&0&0&1 \end{bmatrix} \]

and consider the model $\tau(\mathcal{M})$. This model will have altered mechanisms for $Y_1$, $Y_2$, and $Z$. In particular, these mechanisms are

\[ \begin{bmatrix} F_{Y_1}(x_1,x_2) & F_{Y_2}(x_1,x_2) \end{bmatrix} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix}1&0\\0&1\end{bmatrix} \qquad\text{and}\qquad F_Z(y_1,y_2) = \mathbf{1}[y_1+y_2=2]. \]

2.3.2 Constructive Causal Abstraction

Suppose we have a 'low-level model' $\mathcal{L} = (\Sigma_L,\mathcal{F}_L)$ built from 'low-level variables' $\mathbf{V}_L$ and a 'high-level model' $\mathcal{H} = (\Sigma_H,\mathcal{F}_H)$ built from 'high-level variables' $\mathbf{V}_H$. What structural conditions must be in place for $\mathcal{H}$ to be a high-level abstraction of the low-level model $\mathcal{L}$? At a minimum, this requires that the high-level interventions represent the low-level ones, in the sense of Def. 25; $\mathcal{H}$ should be an exact transformation of $\mathcal{L}$. What else must be the case? A prominent further intuition about abstraction is that it may involve associating specific high-level variables with clusters of low-level variables. That is, low-level variables are to be clustered together in 'macrovariables' that abstract away from low-level details.
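Example 3's rotated conjunction model can be checked numerically; a minimal sketch, assuming binary inputs, with illustrative function names:

```python
# Numerical check of Example 3: the model M stores (x1, x2) in a basis
# rotated by 20 degrees, and its bijective translation tau(M) is the
# ordinary axis-aligned AND model. Both agree with x1 AND x2.
import math

TH = math.radians(20)

def run_rotated(x1, x2):
    # Mechanisms of M: [y1 y2] = [x1 x2] times the rotation matrix.
    y1 = x1 * math.cos(TH) + x2 * math.sin(TH)
    y2 = -x1 * math.sin(TH) + x2 * math.cos(TH)
    # F_Z un-rotates before summing: the expression below equals x1 + x2.
    total = (math.sin(TH) + math.cos(TH)) * y1 + (math.cos(TH) - math.sin(TH)) * y2
    return int(abs(total - 2) < 1e-9)

def run_translated(x1, x2):
    # Mechanisms of tau(M): the identity code with F_Z = 1[y1 + y2 = 2].
    return int(x1 + x2 == 2)

for x1 in (0, 1):
    for x2 in (0, 1):
        assert run_rotated(x1, x2) == run_translated(x1, x2) == (x1 and x2)
```

The key identity is that $(\sin\theta+\cos\theta)y_1 + (\cos\theta-\sin\theta)y_2$ collapses algebraically to $x_1+x_2$, so the two descriptions are inter-translatable.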
To systematize this idea, we introduce alignment between a low-level and a high-level signature:

Definition 31 (Alignment) An alignment between signatures $\Sigma_L$ and $\Sigma_H$ is given by a pair $\langle\Pi,\pi\rangle$ of a partition $\Pi = \{\Pi_{X_H}\}_{X_H\in\mathbf{V}_H}\cup\{\Pi_\bot\}$ and a family $\pi = \{\pi_{X_H}\}_{X_H\in\mathbf{V}_H}$ of maps, such that:

1. The partition $\Pi$ of $\mathbf{V}_L$ consists of non-overlapping, non-empty cells $\Pi_{X_H}\subseteq\mathbf{V}_L$ for each $X_H\in\mathbf{V}_H$, in addition to a (possibly empty) cell $\Pi_\bot$;

2. There is a partial surjective map $\pi_{X_H}:\mathsf{Val}_{\Pi_{X_H}}\to\mathsf{Val}_{X_H}$ for each $X_H\in\mathbf{V}_H$.

In words, the set $\Pi_{X_H}$ consists of those low-level variables that are 'aligned' with the high-level variable $X_H$, and $\pi_{X_H}$ tells us how a given setting of the low-level cluster $\Pi_{X_H}$ corresponds to a setting of the high-level variable $X_H$. The remaining set $\Pi_\bot$ consists of those low-level variables that are 'forgotten', that is, not mapped to any high-level variable.

Remark 32 An alignment $\langle\Pi,\pi\rangle$ induces a unique partial function $\omega_\pi$ that maps from low-level hard interventions to high-level hard interventions. We only define $\omega_\pi$ on low-level interventions that target full partition cells, excluding $\Pi_\bot$. For $x_L\in\mathsf{Val}_{\Pi_{\mathbf{X}_H}}$ where $\mathbf{X}_H\subseteq\mathbf{V}_H$ and $\Pi_{\mathbf{X}_H} = \bigcup_{X\in\mathbf{X}_H}\Pi_X$, we define

\[ \omega_\pi(x_L) \overset{\text{def}}{=} \bigcup_{X_H\in\mathbf{X}_H}\pi_{X_H}\big(\mathsf{Proj}_{\Pi_{X_H}}(x_L)\big). \tag{4} \]

As a special case of Eq. (4), we obtain a unique partial function $\tau_\pi:\mathsf{Val}_{\mathbf{V}_L}\to\mathsf{Val}_{\mathbf{V}_H}$ that maps from low-level total settings to high-level total settings. To wit, for any $v_L\in\mathsf{Val}_{\mathbf{V}_L}$:

\[ \tau_\pi(v_L) = \bigcup_{X_H\in\mathbf{V}_H}\pi_{X_H}\big(\mathsf{Proj}_{\Pi_{X_H}}(v_L)\big). \tag{5} \]

Thus, the cell-wise maps $\pi_{X_H}$ canonically give us these partial functions $(\tau_\pi,\omega_\pi)$.

Definition 33 (Constructive Abstraction) We say that $\mathcal{H}$ is a constructive abstraction of $\mathcal{L}$ under an alignment $\langle\Pi,\pi\rangle$ iff $\mathcal{H}$ is an exact transformation of $\mathcal{L}$ under $(\tau_\pi,\omega_\pi)$. See Section 2.6 for an example of constructive causal abstraction.
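The construction of $\tau_\pi$ in Eq. (5), projecting a low-level setting onto each cell and applying the cell-wise map, can be sketched as follows; the dict-based encoding and the parity example are illustrative, not from the paper:

```python
# Sketch of Remark 32: building the total-setting translation tau_pi from
# cell-wise alignment maps pi. A setting is a dict from variables to values.

def make_tau(partition, pi):
    # partition: high-level variable -> list of low-level variables (its cell)
    # pi: high-level variable -> function from a cell's sub-setting to a value
    def tau(v_low):
        v_high = {}
        for x_high, cell in partition.items():
            proj = {x: v_low[x] for x in cell}      # Proj onto the cell
            v_high[x_high] = pi[x_high](proj)       # apply pi for that cell
        return v_high
    return tau

# Two low-level bits are merged into one high-level parity variable;
# a third low-level variable is in the 'forgotten' cell and is dropped.
partition = {"P": ["B1", "B2"]}
pi = {"P": lambda s: (s["B1"] + s["B2"]) % 2}
tau = make_tau(partition, pi)
assert tau({"B1": 1, "B2": 1, "B3": 7}) == {"P": 0}
```

Variables outside every named cell (here `B3`) play the role of $\Pi_\bot$: they simply never appear in the high-level setting.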
Remark 34 Though the idea was implicit in much earlier work (going back at least to Simon and Ando 1961 and Iwasaki and Simon 1994), Beckers and Halpern (2019) and Beckers et al. (2019) explicitly introduced the notion of a constructive abstraction in the setting of probabilistic causal models. As Examples 3.10 and 3.11 from Beckers and Halpern (2019) show, there are exact transformations that are not also constructive abstractions, even when we restrict attention to hard interventions.

Footnote 6: Beckers and Halpern (2019) require that each $\pi$ be total, while we allow partial functions for more flexibility.

Remark 35 In the field of program analysis, abstract interpretation is a framework that can be understood as a special case of constructive causal abstraction where models are acyclic and high-level variables are aligned with individual low-level variables rather than sets of low-level variables (Cousot and Cousot, 1977). The functions $\tau$ and $\tau^{-1}$ are the abstraction and concretization operators that form a Galois connection, and the commuting diagram summarized in Equation 3 guarantees that abstract transfer functions are consistent with concrete transfer functions.

2.3.3 Decomposing Alignments Between Models

Given the importance and prevalence of this relatively simple notion of abstraction, it is worth understanding the notion from different angles. Abstraction under the alignment $\langle\Pi,\pi\rangle$ can be decomposed via the following three fundamental operations on variables. Marginalization removes a set of variables. Variable merge collapses a partition of variables, i.e., each partition cell becomes a single variable. Value merge collapses a partition of values for each variable, i.e., each partition cell becomes a single value. The first and third operations relate closely to concepts identified in the philosophy of science literature as being critical to addressing the problem of variable choice (Kinney, 2019; Woodward, 2021).
First, marginalization removes a set of variables $\mathbf{X}$. As an alignment, the variables $\mathbf{X}$ are placed into the cell $\Pi_\bot$, while each other variable $Y\notin\mathbf{X}$ is left untouched; in a model that is an abstraction under this alignment, the parents and children of each marginalized variable are directly linked.

Definition 36 (Marginalization) Define the marginalization of $\mathbf{X}\subset\mathbf{V}$ to be an alignment from the signature $\Sigma = (\mathbf{V},\mathsf{Val})$ to the high-level signature $\Sigma^* = (\mathbf{V}\setminus\mathbf{X},\ \mathsf{Val})$. We set the partitions

\[ \Pi_Y = \{Y\}\ \text{for}\ Y\in\mathbf{V}\setminus\mathbf{X}, \qquad \Pi_\bot = \mathbf{X}, \]

while the functions are identity,

\[ \pi_Y = y\mapsto y\ \text{for}\ Y\in\mathbf{V}\setminus\mathbf{X}. \]

Marginalization is essentially a matter of ignoring a subset $\mathbf{X}$ of variables.

Next, variable merge collapses each cell of a partition into a single variable. Variables are merged according to a partition $\{\Pi_{X^*}\}_{X^*\in\mathbf{V}^*}$ with cells indexed by new variables $\mathbf{V}^*$. In a model that is an abstraction under a variable merge, these new variables depend on the parents of their partition and determine the children of their partition.

Definition 37 (Variable Merge) Let $\{\Pi_{X^*}\}_{X^*\in\mathbf{V}^*}$ be a partition of $\mathbf{V}$ indexed by new high-level variables $\mathbf{V}^*$ where $\mathsf{Val}^*_{X^*} = \mathsf{Val}_{\Pi_{X^*}}$ for each $X^*\in\mathbf{V}^*$. Then the variable merge of $\mathbf{V}$ into $\mathbf{V}^*$ is an alignment to the new signature $\Sigma^* = (\mathbf{V}^*,\mathsf{Val}^*)$ with partition $\Pi = \{\Pi_{X^*}\}_{X^*\in\mathbf{V}^*}\cup\{\Pi_\bot\}$ where $\Pi_\bot = \emptyset$ and functions $\pi_{X^*}(y) = y$ for each $X^*\in\mathbf{V}^*$.

Finally, value merge alters the value space of each variable, potentially collapsing values:

Definition 38 (Value Merge) Choose some family $\delta = \{\delta_X\}_{X\in\mathbf{V}}$ of partial surjective functions $\delta_X:\mathsf{Val}_X\to\mathsf{Val}^*_X$ mapping to new variable values. The value merge is an alignment to the new signature $\Sigma^* = (\mathbf{V},\mathsf{Val}^*)$ with partition cells $\Pi_X = \{X\}$ for $X\in\mathbf{V}$, $\Pi_\bot = \emptyset$, and functions $\pi_X = \delta_X$ for $X\in\mathbf{V}$.

The notion of value merge relates to an important concept in the philosophy of causation. The range of values $\mathsf{Val}_X$ for a variable $X$ can be more or less coarse-grained, and some levels of resolution seem to be better causal-explanatory targets.
For instance, to use a famous example from Yablo (1992), if a bird is trained to peck any target that is a shade of red, then it would be misleading, if not incorrect, to say that the appearance of crimson (a particular shade of red) causes the bird to peck. Roughly, the reason is that this suggests the wrong counterfactual contrasts: if the target were not crimson (but instead another shade of red, say, scarlet), the bird would still peck. Thus, for a given explanatory purpose, the level of grain in a model should guarantee that cited causes can be proportional to their effects (Yablo, 1992; Woodward, 2021).

The three operations above are notable not only for conceptual reasons, but also because they suffice to decompose any alignment, as we will now explain. Composition of alignments is defined, as expected, via composition of their maps. Formally:

Definition 39 Let $\langle\Pi,\pi\rangle$ be an alignment from signature $(\mathbf{V},\mathsf{Val})$ to signature $(\mathbf{V}',\mathsf{Val}')$ and $\langle\Pi',\pi'\rangle$ be an alignment from signature $(\mathbf{V}',\mathsf{Val}')$ to signature $(\mathbf{V}'',\mathsf{Val}'')$. We define the composition $\langle\Pi',\pi'\rangle\circ\langle\Pi,\pi\rangle$ as an alignment $\langle\Pi'',\pi''\rangle$ from signature $(\mathbf{V},\mathsf{Val})$ to signature $(\mathbf{V}'',\mathsf{Val}'')$ whose cells $\Pi''$ and maps $\pi''$ are given as follows:

\[ \Pi''_{X''} = \bigcup_{X'\in\Pi'_{X''}}\Pi_{X'}, \qquad \Pi''_\bot = \bigcup_{X'\in\Pi'_\bot\cup\{\bot\}}\Pi_{X'}, \]

\[ \pi''_{X''}(y) = \pi'_{X''}\Big(\bigcup_{X'\in\Pi'_{X''}}\pi_{X'}\big(\mathsf{Proj}_{\Pi_{X'}}(y)\big)\Big) \]

We then have the following:

Proposition 40 Let $\langle\Pi,\pi\rangle$ be an alignment. Then there is a marginalization, variable merge, and value merge whose composition is $\langle\Pi,\pi\rangle$.

Proof It is straightforward to verify that the composition of alignments corresponding to the following three operations gives $\langle\Pi,\pi\rangle$. First, marginalize the variables $\mathbf{X} = \Pi_\bot$. Then, variable merge with $\mathbf{V}^*$ as the index set for the partition $\Pi$. Lastly, value merge with $\delta_X = \pi_X$ for each $X\in\mathbf{V}^*$.

Here, we give an example of an abstraction under a composition of marginalization, variable merge, and value merge.
At each step we give a model that is a constructive abstraction of the previous model under the corresponding operation, with the final result being a high-level model that is a constructive abstraction of the initial model. For succinctness, we will henceforth use marginalization, variable merge, and value merge as operations on models themselves. Thus, rather than saying, e.g., "a constructive abstraction of the model under a value merge alignment," we simply say "a value merged model," and so on.

Example 4 Consider a causal model $\mathcal{M}$ that computes the maximum of two positive numbers.

[Causal graph: $X_1$ and $X_2$ feed into $Y_1$, $Y_2$, $Y_3$, which feed into $Z$.]

The variables $X_1$ and $X_2$ take on positive real values and have constant mechanisms mapping to 1. The variables $Y_1,Y_2,Y_3$ take on real-valued numbers and have the following mechanisms:

\[ \begin{bmatrix} F_{Y_1}(x_1,x_2) & F_{Y_2}(x_1,x_2) & F_{Y_3}(x_1,x_2) \end{bmatrix} = \mathrm{ReLU}\Big(\begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} 1 & -1 & 1\\ -1 & 1 & 1 \end{bmatrix}\Big) \]

The variable $Z$ takes on a real-number value and has the mechanism $F_Z(y_1,y_2,y_3) = \frac{1}{2}(y_1+y_2+y_3)$. Observe that $\mathsf{Proj}_Z(\mathsf{Solve}(\mathcal{M}_{x_1\circ x_2})) = \max(x_1,x_2)$; only one of $Y_1$ and $Y_2$ takes on a positive value, depending on whether $X_1$ or $X_2$ is greater.

[Diagram: the original model is transformed by marginalization (removing $Y_3$), then variable merge (collapsing $Y_1,Y_2$ into $Y^*$), then value merge (binarizing $Y^*$).]

The marginalization of $\mathcal{M}$ with $\Pi_\bot = \{Y_3\}$ removes the mechanism of $Y_3$ and changes the mechanism of $Z$ to $F_Z(x_1,x_2,y_1,y_2) = \frac{1}{2}(y_1+y_2+x_1+x_2)$.

The variable merge of the marginalized model with $\Pi_{Y^*} = \{Y_1,Y_2\}$ constructs a new variable $Y^*$ with mechanism

\[ F_{Y^*}(x_1,x_2) = \begin{bmatrix} F_{Y_1}(x_1,x_2) & F_{Y_2}(x_1,x_2) \end{bmatrix} = \mathrm{ReLU}\big(\begin{bmatrix} x_1-x_2 & x_2-x_1 \end{bmatrix}\big) \]

and alters the mechanism of $Z$ to $F_Z(x_1,x_2,(y^*_1,y^*_2)) = \frac{1}{2}(y^*_1+y^*_2+x_1+x_2)$.

Finally, we value merge the marginalized and variable merged model. Define $\delta_{Y^*}:\mathbb{R}^2\to\{0,1\}$ where $\delta_{Y^*}(y_1,y_2) = \mathbf{1}[y_1\ge y_2]$, which creates a binary partition over the values of $Y^*$.
In the case that $x_1\ge x_2$, the realized value of $Y^*$ is $(x_1-x_2,\ 0)$ and $\delta_{Y^*}$ outputs 1; in the other possible case, where $x_1 < x_2$, the realized value is $(0,\ x_2-x_1)$ and $\delta_{Y^*}$ outputs 0. After the value merge with $\delta$, the mechanism of $Y^*$ is $F_{Y^*}(x_1,x_2) = \mathbf{1}[x_1\ge x_2]$ and the mechanism of $Z$ is $F_Z(x_1,x_2,y^*) = y^*\cdot x_1 + (1-y^*)\cdot x_2$.

2.4 Approximate Transformation

Constructive causal abstraction and other exact transformations are all-or-nothing notions; the exact transformation relation either holds or it doesn't. This binary concept prevents us from having a graded notion of faithful interpretation, which is more useful in practice. We define a notion of approximate abstraction (Beckers et al., 2019; Rischel and Weichwald, 2021; Zennaro et al., 2023b) that can be flexibly adapted:

Definition 41 (Approximate Transformation) Consider causal models $\mathcal{M},\mathcal{M}^*$ and intervention algebras $(\Psi,\circ),(\Psi^*,\circ)$ with signatures $\Sigma$ and $\Sigma^*$, respectively. Furthermore, let $\tau:\mathsf{Val}_{\mathbf{V}}\to\mathsf{Val}_{\mathbf{V}^*}$ and $\omega:\Psi\to\Psi^*$ be surjective partial functions where $\omega$ is $\le$-order preserving. Finally, we also need:

1. a ('distance') function $\mathsf{Sim}$ that maps two total settings of $\mathcal{M}^*$ to a real number,

2. a probability distribution $P$ over $\mathsf{Domain}(\omega)$ used to describe which interventionals are expected,

3. a real-valued statistic $S$ for the random variable $\mathsf{Sim}\big(\tau(\mathsf{Solve}(\mathcal{M}_I)),\ \mathsf{Solve}(\mathcal{M}^*_{\omega(I)})\big)$ where $I\sim P$.

Taken together, we can construct a metric that quantifies the degree to which $\mathcal{M}^*$ is an approximate transformation of $\mathcal{M}$ in a single number:

\[ S_{I\sim P}\Big[\mathsf{Sim}\big(\tau(\mathsf{Solve}(\mathcal{M}_I)),\ \mathsf{Solve}(\mathcal{M}^*_{\omega(I)})\big)\Big] \tag{6} \]

This metric is a graded version of Equation 3 in the definition of exact transformation. If this number is above a particular cutoff $\eta$, we can say that $\mathcal{M}^*$ is an $\eta$-approximate abstraction of $\mathcal{M}$.

Remark 42 When $\mathcal{M}$ is a model with inputs and outputs, we might consider probability distributions that only give mass to interventionals that assign all input variables. This ensures that the default value for input variables is not taken into account (see Remark 7).
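The three ingredients of Definition 41 can be sketched concretely; the toy models below (an exact-division model and a high-level model that rounds down) and all function names are illustrative, not from the paper:

```python
# A minimal sketch of the approximate-transformation metric in Eq. (6):
# S is the expectation, Sim is a distance between translated low-level
# solutions and high-level solutions, P is uniform over interventions
# that set the input variable X.
from statistics import mean

def solve_low(x):            # low-level model: Y = x / 2 (exact)
    return {"X": x, "Y": x / 2}

def solve_high(x):           # high-level model: Y = x // 2 (rounded down)
    return {"X": x, "Y": x // 2}

def tau(v):                  # translate a low-level setting to high-level
    return {"X": v["X"], "Y": v["Y"]}

def sim(v_a, v_b):           # distance between two high-level settings
    return abs(v_a["Y"] - v_b["Y"])

domain = range(10)           # P: uniform over hard interventions on X
score = mean(sim(tau(solve_low(x)), solve_high(x)) for x in domain)
assert score == 0.25         # odd x contributes 0.5, even x contributes 0
```

A lower score means the high-level model is a closer approximate transformation; an exact transformation would score 0 under this choice of `sim`.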
Remark 43 In a paper on approximate causal abstraction, Beckers et al. (2019) proposed max-$\alpha$-approximate abstraction, which takes the maximum distance over low-level interventions.

Example 5 Let $\mathcal{M}$ be a causal model that computes the sum of two numbers, with addend variables $X_1$ and $X_2$ that take on values $\{0,1,2,\dots\}$ and a sum variable $Y$ that takes on values $\{0,1,2,\dots\}$. Let $\mathcal{M}^*$ be a causal model that only sums single digits, with addend variables $X^*_1$ and $X^*_2$ that take on values $\{0,1,\dots,9\}$ and a sum variable $Y^*$ that takes on values $\{0,1,\dots,18\}$. We can quantify the degree to which $\mathcal{M}^*$ approximates $\mathcal{M}$ where $P$ is uniform over hard interventions targeting exactly $X_1$ and $X_2$, $\mathsf{Sim}$ outputs the absolute difference between the values of $Y^*$, and $S$ is expected value. Let $(\Psi,\circ)$ and $(\Psi^*,\circ)$ be hard interventions and define the maps $\tau(x_1,x_2,y) = \langle x_1\%10,\ x_2\%10,\ y\%10\rangle$ and $\omega(z) = \{\mathsf{Proj}_Z(z)\%10 : Z\in\mathbf{Z}\}$. The degree to which $\mathcal{M}^*$ is an approximate transformation of $\mathcal{M}$ is

\[
\begin{aligned}
&S_{x_1,x_2\sim P}\Big[\mathsf{Sim}\big(\tau(\mathsf{Solve}(\mathcal{M}_{x_1,x_2})),\ \mathsf{Solve}(\mathcal{M}^*_{\omega(x_1,x_2)})\big)\Big]\\
&\quad= S_{x_1,x_2\sim P}\Big[\mathsf{Sim}\big((x_1\%10,\ x_2\%10,\ (x_1+x_2)\%10),\ (x_1\%10,\ x_2\%10,\ x_1\%10+x_2\%10)\big)\Big]\\
&\quad= \mathbb{E}_{x_1,x_2\sim P}\big[\,|(x_1+x_2)\%10 - (x_1\%10 + x_2\%10)|\,\big] = 4.5
\end{aligned}
\]

2.5 Interchange Interventions

An interchange intervention (Geiger et al., 2020, 2021) is an operation on a causal model with input and output variables (e.g., acyclic models; recall Remark 7). Specifically, the causal model is provided a 'base' input and an intervention is performed that fixes some variables to be the values they would have if different 'source' inputs were provided. Such interventions will be central to grounding mechanistic interpretability in causal abstraction.

Definition 44 (Interchange Interventions) Let $\mathcal{M}$ be a causal model with input variables $\mathbf{X}^{\mathsf{In}}\subseteq\mathbf{V}$. Furthermore, consider source inputs $s_1,\dots,s_k\in\mathsf{Val}_{\mathbf{X}^{\mathsf{In}}}$ and disjoint sets of target variables $\mathbf{X}_1,\dots,\mathbf{X}_k\subseteq\mathbf{V}$.
Define the hard intervention

\[ \mathsf{IntInv}(\mathcal{M},\langle s_1,\dots,s_k\rangle,\langle\mathbf{X}_1,\dots,\mathbf{X}_k\rangle) \overset{\text{def}}{=} \bigcup_{1\le j\le k}\mathsf{Proj}_{\mathbf{X}_j}\big(\mathsf{Solve}(\mathcal{M}_{s_j})\big). \]

We take interchange interventions as a base case and generalize to recursive interchange interventions, where variables are fixed to be the value they would have if a recursive interchange intervention (that itself may be defined in terms of other interchange interventions) were performed.

Definition 45 (Recursive Interchange Interventions) Let $\mathcal{M}$ be a causal model with input variables $\mathbf{X}^{\mathsf{In}}\subseteq\mathbf{V}$. Define recursive interchange interventions of depth 0 to simply be interchange interventions. Given $s_1,\dots,s_k\in\mathsf{Val}_{\mathbf{X}^{\mathsf{In}}}$, disjoint sets of target variables $\mathbf{X}_1,\dots,\mathbf{X}_k\subseteq\mathbf{V}$, and recursive interchange interventions $i_1,\dots,i_k$ of depth $m$, we define the recursive interchange interventions of depth $m+1$ to be

\[ \mathsf{RecIntInv}_{m+1}(\mathcal{M},\langle s_1,\dots,s_k\rangle,\langle\mathbf{X}_1,\dots,\mathbf{X}_k\rangle,\langle i_1,\dots,i_k\rangle) \overset{\text{def}}{=} \bigcup_{1\le j\le k}\mathsf{Proj}_{\mathbf{X}_j}\big(\mathsf{Solve}(\mathcal{M}_{s_j\circ i_j})\big). \]

Geiger et al. (2023) generalize interchange interventions to distributed interchange interventions that target variables distributed across multiple causal variables. We define this operation using an interventional that applies a bijective translation, performs an interchange intervention in the new variable space, and then applies the inverse translation to get back to the original variable space.

Definition 46 (Distributed Interchange Interventions) Let $\mathcal{M}$ be a causal model with input variables $\mathbf{X}^{\mathsf{In}}\subseteq\mathbf{V}$ and let $\tau:\mathsf{Val}_{\mathbf{V}}\to\mathsf{Val}_{\mathbf{V}^*}$ be a bijective translation that preserves inputs. For source inputs $s_1,\dots,s_k\in\mathsf{Val}_{\mathbf{X}^{\mathsf{In}}}$ and disjoint target variables $\mathbf{X}^*_1,\dots,\mathbf{X}^*_k\subseteq\mathbf{V}^*$, we define $\mathsf{DistIntInv}(\mathcal{M},\tau,\langle s_1,\dots,s_k\rangle,\langle\mathbf{X}^*_1,\dots,\mathbf{X}^*_k\rangle)$ to be an interventional on all variables $\mathbf{V}$ that replaces each mechanism for $X$ with the function

\[ v \mapsto \mathsf{Proj}_X\Big(\tau^{-1}\Big(\mathsf{Solve}\big(\tau(\mathcal{M})_{\mathsf{IntInv}(\tau(\mathcal{M}),\langle s_1,\dots,s_k\rangle,\langle\mathbf{X}^*_1,\dots,\mathbf{X}^*_k\rangle)}\big)\Big)\Big). \]
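Definition 44 can be illustrated on a tiny acyclic model in the spirit of the hierarchical equality algorithm from Section 2.6; the encoding and names below are illustrative:

```python
# Sketch of an interchange intervention (Definition 44) on a toy model:
# Y1 = 1[X1 == X2], Y2 = 1[X3 == X4], Z = 1[Y1 == Y2]. The intervention
# fixes target variables to the values they take under a source input.

def solve(inputs, intervention=None):
    v = dict(inputs)
    iv = intervention or {}
    v["Y1"] = iv.get("Y1", v["X1"] == v["X2"])
    v["Y2"] = iv.get("Y2", v["X3"] == v["X4"])
    v["Z"] = iv.get("Z", v["Y1"] == v["Y2"])
    return v

def interchange(source, targets):
    # IntInv: project the solution under the source input onto the targets.
    sol = solve(source)
    return {t: sol[t] for t in targets}

base = {"X1": "a", "X2": "a", "X3": "b", "X4": "c"}
source = {"X1": "a", "X2": "b", "X3": "c", "X4": "c"}
iv = interchange(source, ["Y1"])        # Y1 takes its source value: False

assert solve(base)["Z"] is False        # base alone: True vs False pairs
assert solve(base, iv)["Z"] is True     # after interchange: False vs False
```

The base input leaves the pairs mismatched ($Y_1 = \mathsf{True}$, $Y_2 = \mathsf{False}$); interchanging $Y_1$ with its source value flips the output, which is exactly the kind of counterfactual behavior these interventions probe.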
When conducting a causal abstraction analysis using interchange interventions, a partition $\{\Pi_X\}_{X\in\mathbf{V}_H\cup\{\bot\}}$ defined over all high-level variables and a mapping $\{\pi_X\}_{X\in\mathbf{X}^{\mathsf{In}}_H}$ defined over high-level input variables are together sufficient to fully determine an alignment $\langle\Pi,\pi\rangle$.

Remark 47 (Constructing an Alignment for Interchange Intervention Analysis) Consider a low-level causal model $\mathcal{L}$ and a high-level causal model $\mathcal{H}$ with input variables $\mathbf{X}^{\mathsf{In}}_L\subseteq\mathbf{V}_L$ and $\mathbf{X}^{\mathsf{In}}_H\subseteq\mathbf{V}_H$, respectively. Suppose we have a partially constructed alignment $(\{\Pi_X\}_{X\in\mathbf{V}_H\cup\{\bot\}},\ \{\pi_X\}_{X\in\mathbf{X}^{\mathsf{In}}_H})$ with

\[ X\in\mathbf{X}^{\mathsf{In}}_H \Leftrightarrow \Pi_X\subseteq\mathbf{X}^{\mathsf{In}}_L \qquad\text{and}\qquad X\in\mathbf{V}_H\setminus\mathbf{X}^{\mathsf{In}}_H \Leftrightarrow \Pi_X\subseteq\mathbf{V}_L\setminus\mathbf{X}^{\mathsf{In}}_L. \]

We can induce the remaining alignment functions from the partitions, the input alignment functions, and the two causal models. For $Y_H\in\mathbf{V}_H\setminus\mathbf{X}^{\mathsf{In}}_H$ and $z_L\in\mathsf{Val}_{\Pi_{Y_H}}$, if there exists $x_L\in\mathsf{Val}_{\mathbf{X}^{\mathsf{In}}_L}$ such that $z_L = \mathsf{Proj}_{\Pi_{Y_H}}(\mathsf{Solve}(\mathcal{L}_{x_L}))$, then, with $x_H$ the high-level input setting obtained by applying the input alignment functions to $x_L$, we define the alignment functions

\[ \pi_{Y_H}(z_L) = \mathsf{Proj}_{Y_H}\big(\mathsf{Solve}(\mathcal{H}_{x_H})\big), \]

and otherwise we leave $\pi_{Y_H}$ undefined for $z_L$. In words, map the low-level partial settings realized for a given input to the corresponding high-level values realized for the same input. This is a subtle point, but, in general, this construction is not well defined, because two different inputs can produce the same $z_L$ while producing different values of $\mathsf{Proj}_{Y_H}(\mathsf{Solve}(\mathcal{H}_{x_H}))$. If this were to happen, the causal abstraction relationship simply wouldn't hold for the alignment.

Once an alignment $\langle\Pi,\pi\rangle$ is constructed, aligned interventions must be performed to experimentally verify that the alignment is a witness to the high-level model being an abstraction of the low-level model. Observe that $\pi$ will only be defined for values of intermediate partition cells that are realized when some input is provided to the low-level model. This greatly constrains the space of low-level interventions on intermediate partitions that will correspond with high-level interventions.
Specifically, we are only able to interpret low-level interchange interventions as high-level interchange interventions.

Footnote 7: I.e., $\mathbf{X}^{\mathsf{In}}\subseteq\mathbf{V}^*$ and for all $x\in\mathsf{Val}_{\mathbf{X}^{\mathsf{In}}}$ and $y\in\mathsf{Val}_{\mathbf{V}\setminus\mathbf{X}^{\mathsf{In}}}$ there exists a $y^*\in\mathsf{Val}_{\mathbf{V}^*\setminus\mathbf{X}^{\mathsf{In}}}$ such that $\tau(x\cup y) = x\cup y^*$.

[Figure 1: A tree-structured algorithm that perfectly solves the hierarchical equality task with a compositional solution. (a) The algorithm: $Y_1 = \mathsf{Id}_1(x_1,x_2)$, $Y_2 = \mathsf{Id}_1(x_3,x_4)$, $Z = \mathsf{Id}_2(y_1,y_2)$, with truth tables for $\mathsf{Id}_1$ and $\mathsf{Id}_2$. (b) The total setting determined by the empty intervention. (c) The total setting determined by the intervention fixing $X_3$, $X_4$, and $Y_2$ to be $\triangle$, $\square$, and $\mathsf{True}$.]

Geiger et al. (2022b) propose interchange intervention accuracy, which is simply the proportion of interchange interventions on which the low-level and high-level causal models have the same input-output behavior (see Section 2.6 for an example).

Definition 48 (Interchange Intervention Accuracy) Consider a low-level causal model $\mathcal{L}$ aligned to a high-level causal model $\mathcal{H}$, an alignment $(\Pi,\pi)$ as constructed in Remark 47, and a probability distribution $P$ over the domain of $\omega_\pi$. We define the interchange intervention accuracy as follows:

\[ \mathsf{IIA}(\mathcal{H},\mathcal{L},(\Pi,\pi)) = \mathbb{E}_{i\sim P}\Big[\mathbf{1}\big[\mathsf{Proj}_{\mathbf{X}^{\mathsf{Out}}_H}\big(\tau_\pi(\mathsf{Solve}(\mathcal{L}_i))\big) = \mathsf{Proj}_{\mathbf{X}^{\mathsf{Out}}_H}\big(\mathsf{Solve}(\mathcal{H}_{\omega_\pi(i)})\big)\big]\Big]. \]

Interchange intervention accuracy is equivalent to input-output accuracy if we further restrict $\omega$ to be defined only on interchange interventions where the base and sources are all the same single input. This is a special case of approximate causal abstraction (Section 2.4) where $\mathsf{Sim}(v_L,v_H) = \mathbf{1}[\mathsf{Proj}_{\mathbf{X}^{\mathsf{Out}}_H}(\tau_\pi(v_L)) = \mathsf{Proj}_{\mathbf{X}^{\mathsf{Out}}_H}(v_H)]$ and $S$ is expected value. This analysis can be extended to the case of distributed interchange interventions by simply applying a bijective translation to the low-level model before constructing the alignment to the high-level model.
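The counting structure of Definition 48 can be sketched on a deliberately simple pair of models where the alignment is perfect, so the accuracy comes out to 1.0; all names and the two-bit encoding are illustrative:

```python
# Sketch of interchange intervention accuracy (Definition 48): enumerate
# aligned interchange interventions, compare low-level outputs (translated
# through tau_pi) with high-level outputs, and average the agreement.
import itertools

def low_output(base, cell_iv=None):
    # Low-level model: two bits (B1, B2); the cell {B1, B2} aligns with S.
    cell = base if cell_iv is None else cell_iv
    return cell[0] + cell[1]            # low-level output = sum of the cell

def high_output(s, s_iv=None):
    # High-level model: one variable S whose value is passed to the output.
    return s if s_iv is None else s_iv

def pi_S(cell):
    # Cell-wise alignment map: a cell setting maps to its sum.
    return cell[0] + cell[1]

hits = total = 0
for base in itertools.product([0, 1], repeat=2):
    for src in itertools.product([0, 1], repeat=2):
        out_low = low_output(base, cell_iv=src)          # interchange on the cell
        out_high = high_output(pi_S(base), s_iv=pi_S(src))  # aligned intervention
        hits += out_low == out_high
        total += 1

iia = hits / total
assert iia == 1.0
```

With a genuinely imperfect alignment, some (base, source) pairs would disagree and `iia` would fall below 1, which is exactly the graded faithfulness the metric is designed to expose.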
2.6 Example: Causal Abstraction in Mechanistic Interpretability

With the theory laid out, we can now present an example of causal abstraction from the field of mechanistic interpretability. We begin by defining two basic examples of causal models that demonstrate a potential to model a diverse array of computational processes; the first causal model represents a tree-structured algorithm and the second a fully connected feed-forward neural network. Both the network and the algorithm solve the same 'hierarchical equality' task.

A basic equality task is to determine whether a pair of objects is identical. A hierarchical equality task is to determine whether a pair of pairs of objects have identical relations. The input to the hierarchical task is two pairs of objects, and the output is True if both pairs are equal or both pairs are unequal, and False otherwise. For illustrative purposes, we define the domain of objects to consist of a triangle, square, and pentagon. For example, the input (D, D, △, □) is assigned the output False, and the inputs (D, D, △, △) and (D, □, △, □) are both labeled True.

We chose hierarchical equality for two reasons. First, there is an obvious tree-structured symbolic algorithm that solves the task: compute whether the first pair is equal, compute whether the second pair is equal, then compute whether those two outputs are equal. We will encode this algorithm as a causal model. Second, equality reasoning is ubiquitous and has served as a case study for broader questions about the representations underlying relational reasoning in biological organisms (Marcus et al., 1999; Alhama and Zuidema, 2019; Geiger et al., 2022a). We provide a companion Jupyter notebook that walks through this example.
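The task and the informally described tree-structured algorithm can be written down in a few lines. The sketch below is ours (not the companion notebook's code): the three mechanisms Y1, Y2, Z may be overridden by interventions, and the asserted labels are the examples given in the text (with "D" the pentagon, "T" the triangle, and "S" the square).

```python
# Illustrative sketch (ours, not the paper's companion notebook): the
# hierarchical equality task solved by a tree-structured causal model
# whose mechanisms Y1, Y2, Z can be overridden by interventions.

def solve(x1, x2, x3, x4, interventions=None):
    """Return the total setting [x1, x2, x3, x4, y1, y2, z]."""
    iv = interventions or {}
    y1 = iv.get("Y1", x1 == x2)   # is the first pair equal?
    y2 = iv.get("Y2", x3 == x4)   # is the second pair equal?
    z = iv.get("Z", y1 == y2)     # do the two pairs have the same relation?
    return [x1, x2, x3, x4, y1, y2, z]

# Task labels from the text: D = pentagon, T = triangle, S = square.
assert solve("D", "D", "T", "S")[-1] is False
assert solve("D", "D", "T", "T")[-1] is True
assert solve("D", "S", "T", "S")[-1] is True
```

Interventions let us pose counterfactual questions to the same model, e.g., fixing the intermediate variable Y1 regardless of what the inputs would dictate.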
A Tree-Structured Algorithm for Hierarchical Equality  We define a tree-structured algorithm $H$ consisting of four 'input' variables $X^{In}_H = \{X_1, X_2, X_3, X_4\}$, each with possible values $Val_{X_j} = \{D, \triangle, \square\}$, two 'intermediate' variables $Y_1, Y_2$ with values $Val_{Y_j} = \{True, False\}$, and one 'output' variable $X^{Out}_H = \{Z\}$ with values $Val_Z = \{True, False\}$. The acyclic causal graph is depicted in Figure 1a, where each $F_{X_i}$ (with no arguments) is a constant function to D (which will be overwritten, per Remark 7), and $F_{Y_1}$, $F_{Y_2}$, $F_Z$ each compute equality over their respective domains, e.g., $F_{Y_1}(x_1, x_2) = \mathbf{1}[x_1 = x_2]$.

A total setting can be captured by a vector $[x_1, x_2, x_3, x_4, y_1, y_2, z]$ of values for each of the variables. The default total setting that results from no intervention is $[D, D, D, D, True, True, True]$ (see Figure 1b). We can also ask what would have occurred had we intervened to fix $X_3$, $X_4$, and $Y_1$ to be $\triangle$, $\square$, and False, for example. The result is $[D, D, \triangle, \square, False, False, True]$ (see Figure 1c).

A Handcrafted Fully Connected Neural Network for Hierarchical Equality  Define a neural network $L$ consisting of eight 'input' neurons $X^{In}_L = \{N_1, \ldots, N_8\}$, twenty-four 'intermediate' neurons $H_{(i,j)}$ for $1 \le i \le 3$ and $1 \le j \le 8$, and two 'output' neurons $X^{Out}_L = \{O_{True}, O_{False}\}$. The values for each of these variables are the real numbers. We depict the causal graph in Figure 2. Let $N, H_1, H_2, H_3$ be the sets of variables for the first four layers, respectively. We define $F_{N_k}$ (with no arguments) to be a constant function to 0, for $1 \le k \le 8$. The intermediate and output neurons are determined by the network weights $W_1, W_2 \in \mathbb{R}^{8 \times 8}$ and $W_3 \in \mathbb{R}^{8 \times 2}$. For $1 \le j \le 8$, we define

$F_{H_{(1,j)}}(n) = ReLU((n W_1)_j)$   $F_{H_{(2,j)}}(h_1) = ReLU((h_1 W_2)_j)$
$F_{O_{True}}(h_2) = ReLU((h_2 W_3)_1)$   $F_{O_{False}}(h_2) = ReLU((h_2 W_3)_2)$

The four shapes that are the input for the hierarchical equality task are represented in $n_D, n_\square, n_\triangle \in \mathbb{R}^2$ by a pair of neurons with randomized activation values.
The network outputs True if the value of the output logit $O_{True}$ is larger than the value of $O_{False}$, and False otherwise. We can simulate the network operating on the input $(\square, D, \square, \triangle)$ by performing an intervention setting $(N_1, N_2)$ and $(N_5, N_6)$ to $n_\square$, $(N_3, N_4)$ to $n_D$, and $(N_7, N_8)$ to $n_\triangle$. In Figure 2, we define the weights of the network $L$, which have been handcrafted to implement the tree-structured algorithm $H$.

[Figure 2: A fully connected feed-forward neural network that labels inputs for the hierarchical equality task. The weights of the network are handcrafted to implement the tree-structured solution to the task, with input embeddings $n_D = [0.012, -0.301]$, $n_\square = [-0.812, 0.456]$, $n_\triangle = [0.682, 0.333]$ and weight matrices $W_1$, $W_2$, $W_3$ satisfying $W_3\,ReLU(W_2\,ReLU(W_1 [a, b, c, d])) = [\,||a-b| - |c-d|| - (1-\varepsilon)|a-b| - (1-\varepsilon)|c-d|,\ 0\,]$.]

[Figure 3: The result of an aligned interchange intervention on the low-level fully connected neural network and the high-level tree-structured algorithm under the alignment in Figure 2.]
Observe the equivalent counterfactual behavior across the two levels.

[Figure 4: An illustration of a fully connected neural network being transformed into a tree-structured algorithm by (1) marginalizing away neurons aligned with no high-level variable, (2) merging sets of variables aligned with high-level variables, and (3) merging the continuous values of neural activity into the symbolic values of the algorithm.]

An Alignment Between the Algorithm and the Neural Network  The network $L$ was explicitly constructed to be abstracted by the algorithm $H$ under the alignment written formally below and depicted visually in Figure 2.

$\Pi_Z = \{O_{True}, O_{False}\}$
$\Pi_{X_k} = \{N_{2k-1}, N_{2k}\}$
$\Pi_{Y_1} = \{H_{(1,j)} : 1 \le j \le 4\}$
$\Pi_{Y_2} = \{H_{(1,j)} : 5 \le j \le 8\}$
$\Pi_\bot = V_L \setminus (\Pi_Z \cup \Pi_{Y_1} \cup \Pi_{Y_2} \cup \Pi_{X_1} \cup \Pi_{X_2} \cup \Pi_{X_3} \cup \Pi_{X_4})$

$\pi_Z(o_{True}, o_{False}) = True$ if $o_{True} > o_{False}$, and $False$ otherwise

$\pi_{X_k}(n_{2k-1}, n_{2k}) = \square$ if $(n_{2k-1}, n_{2k}) = n_\square$; $D$ if $(n_{2k-1}, n_{2k}) = n_D$; $\triangle$ if $(n_{2k-1}, n_{2k}) = n_\triangle$; and undefined otherwise

We follow Remark 47 to define $\pi_{Y_1}$ and $\pi_{Y_2}$ on all interchange interventions. For each input $n \in \{n_D, n_\square, n_\triangle\}^4$, let $x = \pi_{X^{In}_H}(n)$ and $\{h_{(1,j)} : 1 \le j \le 8\} = Proj_{H_1}(Solve(L_n))$. Then define

$\pi_{Y_1}(h_{(1,1)}, h_{(1,2)}, h_{(1,3)}, h_{(1,4)}) = Proj_{Y_1}(Solve(H_x))$
$\pi_{Y_2}(h_{(1,5)}, h_{(1,6)}, h_{(1,7)}, h_{(1,8)}) = Proj_{Y_2}(Solve(H_x))$

Otherwise, leave $\pi_{Y_1}$ and $\pi_{Y_2}$ undefined. Consider an intervention $i$ in the domain of $\omega_\pi$. We have a fixed alignment for the input and output neurons, where $i$ can have output values from the real numbers and input values from $\{n_D, n_\square, n_\triangle\}^4$.
The intermediate neurons are assigned high-level alignment by stipulation; $i$ can only set intermediate variables to values that are realized on some input intervention $L_n$ for $n \in \{n_D, n_\square, n_\triangle\}^4$ (i.e., interchange interventions). Constructive abstraction will hold only if these stipulative alignments to intermediate variables do not violate the causal laws of $L$.

The Algorithm Abstracts the Neural Network  Following Remark 47, the domain of $\omega_\pi$ is restricted to $3^4$ input interventions, $(3^4)^2$ single-source hard interchange interventions for high-level interventions fixing either $Y_1$ or $Y_2$, and $(3^4)^3$ double-source hard interchange interventions for high-level interventions fixing both $Y_1$ and $Y_2$. This low-level neural network was handcrafted to be abstracted by the high-level algorithm under the alignment $\langle \Pi, \pi \rangle$. This means that for all $i \in Domain(\omega_\pi)$ we have

$\tau_\pi(Solve(L_i)) = Solve(H_{\omega_\pi(i)})$  (7)

In the companion Jupyter notebook, we provide code that verifies this is indeed the case. In Figure 3, we depict an aligned interchange intervention performed on $L$ and $H$ with the base input $(D, D, \triangle, \square)$ and a single source input $(\square, D, \triangle, \triangle)$. The central insight is that the network and algorithm have the same counterfactual behavior.

Decomposing the Alignment  Our decomposition of the alignment object in Section 2.3.2 provides a new lens through which to view this result. The network $L$ can be transformed into the algorithm $H$ through a marginalization, a variable merge, and a value merge. We visually depict the algorithm $H$ being constructed from the network $L$ in Figure 4.

A Fully Connected Neural Network Trained on Hierarchical Equality  Instead of handcrafting weights, we can also train the neural network $L$ on the hierarchical equality task. Looking at the network weights provides no insight into whether or not it implements the algorithm $H$. A core result of Geiger et al.
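The aligned interchange intervention from Figure 3 can be sketched at the algorithm level. This is an illustrative reimplementation (ours, not the notebook's code): the value of Y2 computed on the source input is patched into a run on the base input, and the counterfactual output flips from False to True.

```python
# Illustrative sketch (ours) of the Figure 3 interchange intervention at
# the algorithm level. D = pentagon, T = triangle, S = square.

def solve(x, interventions=None):
    """Run the tree-structured algorithm, allowing mechanism overrides."""
    iv = interventions or {}
    y1 = iv.get("Y1", x[0] == x[1])
    y2 = iv.get("Y2", x[2] == x[3])
    return iv.get("Z", y1 == y2)

base = ("D", "D", "T", "S")     # base input (D, D, triangle, square)
source = ("S", "D", "T", "T")   # source input (square, D, triangle, triangle)

# Interchange: compute Y2 on the source input, patch it into the base run.
y2_source = source[2] == source[3]
patched = solve(base, {"Y2": y2_source})

assert solve(base) is False   # base output
assert patched is True        # counterfactual output after the patch
```

Because the handcrafted network is abstracted by this algorithm, patching the aligned neurons $\Pi_{Y_2}$ in the network with their source-input activations produces the same flip in the network's output.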
(2023) demonstrates that it is possible to learn a bijective translation $\tau$ of the neural model $L$ such that the algorithm $H$ is a constructive abstraction of the transformed model $\tau(L)$. This bijective translation takes the form of an orthogonal matrix that rotates a hidden vector into a new coordinate system. The method is Distributed Alignment Search, which is covered in Section 3.6.3, and the result is replicated in our companion Jupyter notebook.

2.7 Example: Causal Abstraction with Cycles and Infinite Variables

Causal abstraction is a highly expressive, general-purpose framework. However, our example in Section 2.6 involved only finite and acyclic models. To demonstrate the framework's expressive capacity, we will define a causal model with infinitely many variables with infinite value ranges that implements the bubble sort algorithm on lists of arbitrary length. Then, we show this acyclic model can be abstracted into a cyclic process with an equilibrium state.

A Causal Model for Bubble Sort  Bubble sort is an iterative algorithm. On each iteration, the first two members of the sequence are compared and swapped if the left element is larger than the right element; then the second and third members of the resulting list are compared and possibly swapped, and so on until the end of the list is reached. This process is repeated until no more swaps are needed.

Define the causal model $M$ to have the following (countably) infinite variables and values:

$V = \{X^j_i, Y^j_i, Z^j_i : i, j \in \{1, 2, 3, 4, 5, \ldots\}\}$
$Val_{X^j_i} = Val_{Y^j_i} = \{1, 2, 3, 4, 5, \ldots\} \cup \{\bot\}$
$Val_{Z^j_i} = \{True, False, \bot\}$

The causal structure of $M$ is depicted in Figure 5a. The $\bot$ value will indicate that a variable is not being used in a computation, much like a blank square on a Turing
machine.

[Figure 5: Abstractions of the bubble sort causal model. (a) A causal model that represents the bubble sort algorithm. (b) The causal model from Figure 5a with the variables $Y^j_i$ and $Z^j_i$ marginalized. (c) The causal model from Figure 5b with the variables $X^2_i, X^3_i, \ldots$ merged for all $i > 0$; the values of these new variables contain the full history of the algorithm, e.g., the value of $X^*_1$ is a sequence containing the first element in the list after each bubbling iteration. (d) The causal model from Figure 5c with the values of each variable $X^*_j$ merged for all $j > 0$, e.g., the value of $X^*_1$ is the first element in the sorted list.]

The (countably) infinite sequence of variables $X^1_1, X^1_2, \ldots$ contains the unsorted input sequence, where an input sequence of length $k$ is represented by setting $X^1_1, \ldots, X^1_k$ to encode the sequence and doing nothing to the infinitely many remaining input variables $X^1_j$ for $j > k$. For a given row $j$ of variables, the variables $Z^j_i$ store the truth-valued output of the comparison of two elements, the variables $Y^j_i$ contain the values being 'bubbled up' through the sequence, and the variables $X^j_i$ are partially sorted lists resulting from $j - 1$ passes through the algorithm. When there are rows $j$ and $j + 1$ such that $X^j_i$ and $X^{j+1}_i$ take on the same value for all $i$, the output of the computation is the sorted sequence found in both of these rows.

We define the mechanisms as follows. The input variables $X^1_i$ have constant functions to $\bot$. The variable $Z^1_1$ is $\bot$ if either of $X^1_1$ or $X^1_2$ is $\bot$, True if the value of $X^1_1$ is greater than $X^1_2$, and False otherwise. The variable $Y^1_1$ is $\bot$ if $Z^1_1$ is $\bot$, $X^1_1$ if $Z^1_1$ is True, and $X^1_2$ if $Z^1_1$ is False.
The remaining intermediate variables can be uniformly defined:

$F_{X^j_i}(y^{j-1}_{i-1}, z^{j-1}_i, x^{j-1}_{i+1}) = x^{j-1}_{i+1}$ if $z^{j-1}_i = True$; $y^{j-1}_{i-1}$ if $z^{j-1}_i = False$; $\bot$ if $z^{j-1}_i = \bot$

$F_{Y^j_i}(y^j_{i-1}, z^j_i, x^j_{i+1}) = x^j_{i+1}$ if $z^j_i = True$; $y^j_{i-1}$ if $z^j_i = False$; $\bot$ if $z^j_i = \bot$

$F_{Z^j_i}(y^j_i, x^j_{i+1}) = [y^j_i < x^j_{i+1}]$ if $y^j_i \ne \bot$ and $x^j_{i+1} \ne \bot$; $\bot$ otherwise

This causal model is countably infinite, supporting both sequences of arbitrary length and an arbitrary number of sorting iterations.

Abstracting Bubble Sort  Suppose we are not concerned with how, exactly, each iterative pass of the bubble sort algorithm is implemented. Then we can marginalize away the variables $Z = \{Z^j_i\}$ and $Y = \{Y^j_i\}$ and reason about the resulting model instead (Figure 5b). Define the mechanisms of this model recursively with base case $F_{X^j_1}(x^{j-1}_1, x^{j-1}_2) = Min(x^{j-1}_1, x^{j-1}_2)$ for $j > 1$ and recursive case

$F_{X^j_i}(x^{j-1}_1, x^{j-1}_2, \ldots, x^{j-1}_{i+1}) = Min(x^{j-1}_{i+1}, Max(x^{j-1}_i, F_{X^j_{i-1}}(x^{j-1}_1, x^{j-1}_2, \ldots, x^{j-1}_i)))$

Suppose instead that our only concern is whether the input sequence is sorted. We can further abstract the causal model using variable merge with the partition $\Pi_{X^*_i} = \{X^j_i : j \in \{2, 3, \ldots\}\}$ for each $i \in \{1, 2, \ldots\}$. The result is a model (Figure 5c) where each variable $X^*_i$ takes on the value of an infinite sequence. There are causal connections to and from $X^*_i$ and $X^*_j$ for any $i \ne j$, because the infinite sequences stored in each variable must jointly be a valid run of the bubble sort algorithm. This is a cyclic causal process with an equilibrium point. Next, we can value-merge with a family of functions $\delta$ where the input variable functions $\delta_{X^1_i}$ are identity functions and the other functions $\delta_{X^*_i}$ output the constant value to which an eventually-constant infinite sequence converges. The mechanisms for the resulting model (Figure 5d) simply map unsorted input sequences to sorted output sequences.
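The row-by-row process the model encodes can be sketched directly. The code below is ours and deliberately operational rather than a transcription of the mechanism equations: each row holds the list after one more bubbling pass, and the model reaches its output when two consecutive rows agree, mirroring the equilibrium condition described above.

```python
# Illustrative sketch (ours, not the paper's formal mechanisms): rows of
# the bubble sort causal model, computed pass by pass until a fixed point.

def bubble_pass(row):
    """One bubbling pass: compare adjacent elements, swap if out of order."""
    row = list(row)
    for i in range(len(row) - 1):
        if row[i] > row[i + 1]:
            row[i], row[i + 1] = row[i + 1], row[i]
    return row

def bubble_rows(inputs):
    """Rows X^1, X^2, ... until two consecutive rows take the same value
    for all positions (the sorted output)."""
    rows = [list(inputs)]
    while True:
        nxt = bubble_pass(rows[-1])
        rows.append(nxt)
        if nxt == rows[-2]:
            return rows

rows = bubble_rows([3, 1, 4, 1, 5, 9, 2, 6])
```

The final two rows coincide and hold the sorted sequence; the full list `rows` corresponds to the "full history" values of the merged variables $X^*_i$ in Figure 5c.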
3 A Common Language for Mechanistic Interpretability

The central claim of this paper is that causal abstraction provides a theoretical foundation for mechanistic interpretability. Equipped with a general theory of causal abstraction, we will provide mathematically precise definitions for a handful of core mechanistic interpretability concepts and show that a wide range of methods can be viewed as special cases of causal abstraction analysis. A tabular summary of this section can be found in Table 1.

3.1 Polysemantic Neurons, the Linear Representation Hypothesis, and Modular Features via Intervention Algebras

A vexed question when analyzing black box AI is how to decompose a deep learning system into constituent parts. Should the units of analysis be real-valued activations, directions in the activation space of vectors, or entire model components? Localizing an abstract concept to a component of a black box AI would be much easier if neurons were a sufficient unit of analysis. However, it has long been known that artificial (and biological) neural networks have polysemantic neurons that participate in the representation of multiple high-level concepts (Smolensky, 1986; Rumelhart et al., 1986; McClelland et al., 1986; Thorpe, 1989). Therefore, individual neural activations are insufficient as units of analysis in interpretability, a fact that has been recognized in the recent literature (Harradon et al., 2018; Cammarata et al., 2020; Olah et al., 2020; Goh et al., 2021; Elhage et al., 2021; Bolukbasi et al., 2021; Geiger et al., 2023; Gurnee et al., 2023; Huang et al., 2023a). Perhaps the simplest case of polysemantic neurons is one where some rotation can be applied to the neural activations such that the dimensions in the new coordinate system are monosemantic (Smolensky, 1986; Elhage et al., 2021; Scherlis et al., 2022; Geiger et al., 2023).
Indeed, the linear representation hypothesis (Mikolov et al., 2013; Elhage et al., 2022b; Nanda et al., 2023b; Park et al., 2023; Jiang et al., 2024) states that linear representations will be sufficient for analyzing the complex non-linear building blocks of deep learning models. We are concerned that this is too restrictive. The ideal theoretical framework won't bake in an assumption like the linear representation hypothesis, but rather support any and all decompositions of a deep learning system into modular features that each have separate mechanisms from one another. We should have the flexibility to choose the units of analysis, free of restrictive assumptions that may rule out meaningful structures. Whether a particular decomposition of a deep learning system into modular features is useful for mechanistic interpretability should be understood as an empirical hypothesis that can be falsified through experimentation.

Our theory of causal abstraction supports a flexible, yet precise conception of modular features via intervention algebras (Section 2.2). An intervention algebra formalizes the notion of a set of separable components with distinct mechanisms, satisfying the fundamental algebraic properties of commutativity and left-annihilativity (see (a) and (b) in Definition 16). Individual activations, orthogonal directions in vector space, and model components (e.g., attention heads) are all separable components with distinct mechanisms in this sense. A bijective translation (Section 2.3.1) gives access to such features while preserving the overall mechanistic structure of the model. We propose to define modular features as any set of variables that form an intervention algebra accessed by a bijective translation.
Table 1: Interpretability Methods

Behavioral Methods (Section 3.3)
• Feature attribution (Zeiler and Fergus, 2014; Ribeiro et al., 2016; Lundberg and Lee, 2017)
• Integrated gradients (Sundararajan et al., 2017)
• Effects of real-world concepts on models (Goyal et al., 2019; Feder et al., 2021; Abraham et al., 2022; Wu et al., 2022a)

Patching Activations with Interchange Interventions (Section 3.4)
• Interchange interventions (Geiger et al., 2020; Vig et al., 2020; Geiger et al., 2021; Li et al., 2021; Chan et al., 2022b; Wang et al., 2023; Lieberum et al., 2023; Huang et al., 2023a; Hase et al., 2023; Cunningham et al., 2023; Davies et al., 2023; Tigges et al., 2023; Feng and Steinhardt, 2024; Ghandeharioun et al., 2024; Todd et al., 2024)
• Path patching (Goldowsky-Dill et al., 2023; Wang et al., 2023; Hanna et al., 2023; Prakash et al., 2024)
• Causal mediation analysis (Vig et al., 2020; Finlayson et al., 2021; Meng et al., 2022; Stolfo et al., 2023; Mueller et al., 2024)

Ablation-Based Analysis (Section 3.5)
• Concept erasure (Ravfogel et al., 2020, 2022, 2023b,a; Elazar et al., 2020; Lovering and Pavlick, 2022; Belrose et al., 2023)
• Sub-circuit analysis (Michel et al., 2019; Sanh and Rush, 2021; Csordás et al., 2021; Cammarata et al., 2020; Olsson et al., 2022; Chan et al., 2022b; Lepori et al., 2023b,a; Wang et al., 2023; Conmy et al., 2023; Nanda et al., 2023b)
• Causal scrubbing (Chan et al., 2022b)

Modular Feature Learning (Section 3.6)
• Principal Component Analysis (Bolukbasi et al., 2016; Chormai et al., 2022; Marks and Tegmark, 2023; Tigges et al., 2023)
• Sparse autoencoders (Bricken et al., 2023; Cunningham et al., 2023; Huben et al., 2024; Marks et al., 2024)
• Differential Binary Masking (De Cao et al., 2020, 2021; Csordás et al., 2021; Davies et al., 2023; Prakash et al., 2024; Huang et al., 2024)
• Probing (Peters et al., 2018; Tenney et al., 2019; Hupkes et al., 2018)
• Difference of means (Tigges et al.,
2023; Marks and Tegmark, 2023)
• Distributed Alignment Search (Geiger et al., 2023; Wu et al., 2023; Tigges et al., 2023; Arora et al., 2024; Huang et al., 2024; Minder et al., 2024; Feng et al., 2024; Rodriguez et al., 2024; Grant et al., 2025)

Activation Steering (Section 3.7)
(Giulianelli et al., 2018; Bau et al., 2019; Soulos et al., 2020; Besserve et al., 2020; Subramani et al., 2022; Turner et al., 2023; Zou et al., 2023; Vogel, 2024; Li et al., 2024; Wu et al., 2024a,b)

Training Models to be Interpretable (Section 3.8)
(Geiger et al., 2022b; Wu et al., 2022b,a; Elhage et al., 2022a; Hewitt et al., 2023; Yüksekgönül et al., 2023; Chauhan et al., 2023; Huang et al., 2023b; Tamkin et al., 2024; Gómez and Cinà, 2024; Zur et al., 2024; Liu et al., 2024)

If the linear representation hypothesis is correct, then rotation matrices should be sufficient bijective translations for mechanistic interpretability. If not, there will be cases where non-linear bijective translations will be needed to discover modular features that are not linearly accessible, e.g., the 'onion' representations found by Csordás et al. (2024) in simple recurrent neural networks. Our conception of modular features enables us to remain agnostic to the exact units of analysis that will prove essential.

3.2 Graded Faithfulness via Approximate Abstraction

Informally, faithfulness has been defined as the degree to which an explanation accurately represents the 'true reasoning process behind a model's behavior' (Wiegreffe and Pinter, 2019; Jacovi and Goldberg, 2020; Lyu et al., 2022; Chan et al., 2022a). Crucially, faithfulness should be a graded notion (Jacovi and Goldberg, 2020), but precisely which metric of faithfulness is correct will depend on the situation.
It could be that, for safety reasons, there are some domains of inputs for which we need a perfectly faithful interpretation of a black box AI, while for others it matters less. Ideally, we can defer the exact details to be filled in based on the use case. This would allow us to provide a variety of graded faithfulness metrics that facilitate apples-to-apples comparisons between existing (and future) mechanistic interpretability methods.

Approximate transformation (Section 2.4) provides the needed flexible notion of graded faithfulness. The similarity metric between high-level and low-level states, the probability distribution over evaluated interventions, and the summary statistic used to aggregate individual similarity scores are all points of variation that enable our notion of approximate transformation to be adapted to a given situation. Interchange intervention accuracy (Geiger et al., 2022b, 2023; Wu et al., 2023), probability or logit difference (Meng et al., 2022; Chan et al., 2022b; Wang et al., 2023; Zhang and Nanda, 2024), and KL-divergence can all be understood via approximate transformation.

3.3 Behavioral Evaluations as Abstraction by Two-Variable Chains

The behavior of an AI model is simply the function from inputs to outputs that the model implements. Behavior is trivial to characterize in causal terms; any input–output behavior can be represented by a model with input variables directly connected to output variables.

3.3.1 LIME: Behavioral Fidelity as Approximate Abstraction

Feature attribution methods ascribe scores to input features that capture the 'impact' of a feature on model behavior. Gradient-based feature attribution methods (Zeiler and Fergus, 2014; Springenberg et al., 2014; Shrikumar et al., 2016; Binder et al., 2016; Lundberg and Lee, 2017; Kim et al., 2018; Narendra et al., 2018; Lundberg et al., 2019; Schrouff et al., 2022) measure causal properties when they satisfy some basic axioms (Sundararajan et al., 2017).
In particular, Geiger et al. (2021) provide a natural causal interpretation of the integrated gradients method, and Chattopadhyay et al. (2019) argue for a direct measurement of a feature's individual causal effect.

Among the most popular feature attribution methods is LIME (Ribeiro et al., 2016), which learns an interpretable model that locally approximates an uninterpretable model. LIME defines an explanation to be faithful to the degree that the interpretable model agrees with local input–output behavior. While not conceived as a causal explanation method, when we interpret priming a model with an input as an intervention, it becomes obvious that two models having the same local input–output behavior is a matter of causality. Crucially, however, the interpretable model lacks any connection to the internal causal dynamics of the uninterpretable model. In fact, it is presented as a benefit that LIME is a model-agnostic method that provides the same explanations for models with identical behaviors but different internal structures. Without further grounding in causal abstraction, methods like LIME do not tell us anything meaningful about the abstract causal structure between input and output.

Definition 49 (LIME Fidelity of Interpretable Model) Let $L$ and $H$ be models with identical input and output spaces. Let $Distance(\cdot, \cdot)$ compute some measure of distance between outputs. Given an input $x \in Val_{X^{In}_H}$, let $\Delta_x \subseteq Val_{X^{In}_H}$ be a finite neighborhood of inputs close to $x$. The LIME fidelity of using $H$ to interpret $L$ on the input $x$ is given as:

$LIME(H, L, \Delta_x) = \frac{1}{|\Delta_x|} \sum_{x' \in \Delta_x} Distance(Proj_{X^{Out}_L}(Solve(L_{x'})), Proj_{X^{Out}_H}(Solve(H_{x'})))$

The uninterpretable model $L$ is an AI model with fully connected causal structure running from the input $X^{In}_L$ through hidden variables $H_{11}, \ldots, H_{dl}$ to the output $X^{Out}_L$.
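Definition 49 can be sketched in a few lines. The models and neighborhood below are hypothetical (ours, not from LIME or the paper): a quadratic 'black box' and the linear model tangent to it at x = 1, compared over a small neighborhood with absolute difference as the distance.

```python
# Illustrative sketch of Definition 49 on toy models (ours).

def L(x):
    return x ** 2          # 'uninterpretable' black-box behavior

def H(x):
    return 2 * x - 1       # tangent-line approximation of L at x = 1

def lime_fidelity(H, L, neighborhood, distance=lambda a, b: abs(a - b)):
    """Average output distance between L and H over a local neighborhood
    of inputs (each input is treated as an intervention)."""
    return sum(distance(L(x), H(x)) for x in neighborhood) / len(neighborhood)

fidelity = lime_fidelity(H, L, [0.9, 1.0, 1.1])  # small near x = 1
```

As the neighborhood widens, the linear model's fidelity degrades, which is exactly the local-versus-global trade-off LIME's fidelity metric quantifies.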
The interpretable model $H$ will often also have rich internal structure, such as a decision tree model, which one could naturally interpret as causal, for instance with the input $X^{In}_H$ feeding decision variables $X_1, \ldots, X_4$ that determine the output $X^{Out}_H$. However, LIME only seeks to find a correspondence between the input–output behaviors of the interpretable and uninterpretable models. Therefore, representing both $L$ and $H$ as causal models connecting inputs to outputs is sufficient to describe the fidelity measure in LIME. To shape approximate transformation to mirror the LIME fidelity metric, define:

1. The similarity between a low-level and high-level total state to be $Sim(v_L, v_H) = Distance(Proj_{X^{Out}_L}(v_L), Proj_{X^{Out}_H}(v_H))$.

2. The probability distribution $P$ to assign equal probability mass to input interventions in $\Delta_x$ and zero mass to all other interventions.

3. The statistic $S$ to compute the expected value of a random variable.

The LIME fidelity metric is the approximate transformation metric (Def. 41) with $\tau$ and $\omega$ as identity functions:

$LIME(H, L, \Delta_x) = S_{I \sim P}[Sim(\tau(Solve(L_I)), Solve(H_{\omega(I)}))]$

Both models are thus viewed as two-variable chains $X^{In}_L \to X^{Out}_L$ and $X^{In}_H \to X^{Out}_H$.

3.3.2 Single-Source Interchange Interventions from Integrated Gradients

Integrated gradients (Sundararajan et al., 2017) computes the impact of neurons on model predictions. Following Geiger et al. (2021), we can easily translate the original integrated gradients equation into our causal model formalism.
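Before the formal definition, the integrated gradients computation and its completeness property can be sketched numerically. The function, gradient, and step count below are hypothetical choices of ours: a toy two-neuron 'output projection' f(y) = y₁y₂, with the path integral approximated by a midpoint Riemann sum.

```python
# Illustrative numerical sketch of integrated gradients (ours, not the
# paper's formalism): attribution over a two-neuron hidden vector y,
# with a check of the completeness axiom.

def f(y):
    return y[0] * y[1]          # toy output as a function of the hidden vector

def grad_f(y):
    return [y[1], y[0]]         # analytic gradient of f

def integrated_gradients(y, y_base, steps=100):
    """IG_i(y, y') = (y_i - y'_i) times the integral of df/dy_i along the
    straight line from the baseline y' to y (midpoint Riemann sum)."""
    ig = [0.0, 0.0]
    for k in range(steps):
        d = (k + 0.5) / steps   # midpoint of the k-th subinterval
        point = [d * y[i] + (1 - d) * y_base[i] for i in range(2)]
        g = grad_f(point)
        for i in range(2):
            ig[i] += (y[i] - y_base[i]) * g[i] / steps
    return ig

y, y_base = [3.0, 4.0], [1.0, 1.0]
ig = integrated_gradients(y, y_base)
# Completeness: the attributions sum to f(y) - f(y').
assert abs(sum(ig) - (f(y) - f(y_base))) < 1e-6
```

The baseline y_base is ordinarily the zero vector; the observation developed below is that setting it instead to the hidden value realized on a source input turns this computation into an interchange intervention.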
Definition 50 (Integrated Gradients) Given a neural network as a causal model $M$, we define the integrated gradient value of the $i$th neuron of a hidden vector $Y$ when the network is provided $x$ as an input as follows, where $y'$ is the so-called baseline value of $Y$:

$IG_i(y, y') = (y_i - y'_i) \cdot \int_{\delta=0}^{1} \frac{\partial\, Proj_{X^{Out}}(M_{x \cup (\delta y + (1-\delta) y')})}{\partial y_i}\, d\delta$

The completeness axiom of the integrated gradients method is formulated as follows:

$\sum_{i=1}^{|Y|} IG_i(y, y') = Proj_{X^{Out}}(M_{x \cup y}) - Proj_{X^{Out}}(M_{x \cup y'})$

Integrated gradients was not initially conceived as a method for the causal analysis of neural networks. Therefore, it is perhaps surprising that integrated gradients can be used to compute interchange interventions. This hinges on a strategic use of the 'baseline' value of integrated gradients; typically, the baseline value is set to be the zero vector, but here we set it to an interchange intervention.

Remark 51 (Integrated Gradients Can Compute Interventions) The following is an immediate consequence of the completeness axiom:

$Proj_{X^{Out}}(M_{x \cup y'}) = Proj_{X^{Out}}(M_x) - \sum_{i=1}^{|Y|} IG_i(Proj_Y(M_x), y')$

In principle, we could perform causal abstraction analysis using the integrated gradients method and taking

$y' = IntInv(M, \langle x' \rangle, \langle Y \rangle)$

However, computing integrals is an inefficient way to compute interchange interventions.

3.3.3 Estimating the Causal Effect of Real-World Concepts

The ultimate downstream goal of explainable AI is to provide explanations with intuitive concepts (Goyal et al., 2019; Feder et al., 2021; Elazar et al., 2022; Abraham et al., 2022) that are easily understood by human decision makers and guide their actions (Karimi et al., 2021, 2023; Beckers, 2022).
These concepts can be abstract and mathematical, such as truth-valued propositional content, natural numbers, or real-valued quantities like height or weight; they can also be grounded and concrete, such as the breed of a dog, the education level of a job applicant, or the pitch of a singer's voice. A basic question is how to estimate the effect of real-world concepts on the behavior of AI models.

The explainable AI benchmark CEBaB (Abraham et al., 2022) evaluates methods on their ability to estimate the causal effects of the quality of food, service, ambiance, and noise in a real-world dining experience on the prediction of a sentiment classifier, given a restaurant review as input data. Using CEBaB as an illustrative example, we represent the real-world data generating process and the neural network with a single causal model $M_{CEBaB}$.⁸ The real-world concepts $C_{service}$, $C_{noise}$, $C_{food}$, and $C_{ambiance}$ can take on three values $+$, $-$, and Unknown; the input data $X^{In}$ takes on the value of a restaurant review text; the prediction output $X^{Out}$ takes on the value of a five-star rating; and the hidden vectors $H_{ij}$ can take on real-number values. The concept variables feed the review text $X^{In}$, which the network processes through its hidden vectors $H_{11}, \ldots, H_{dl}$ to produce $X^{Out}$.

If we are interested in the causal effect of food quality on model output, then we can marginalize away every variable other than the real-world concept $C_{food}$ and the neural network output $X^{Out}$ to get a causal model with two variables. This marginalized causal model is a high-level abstraction of $M_{CEBaB}$ that contains a single causal mechanism describing how food quality in a dining experience affects the neural network output.

3.4 Patching Activations with Interchange Interventions

There is a diverse body of mechanistic interpretability literature in which interchange interventions (see Section 2.5) are used to analyze neural networks (see Table 1).
However, the terminology used within this literature is often inconsistent and can lead to confusion regarding the precise techniques being employed. What is called an 'activation patch' is typically equivalent to an interchange intervention on a neural network, but the term is sometimes used to describe a variety of other intervention techniques. Wang et al. (2023) use 'activation patching' to mean (recursive) interchange interventions, while Conmy et al. (2023), Zhang and Nanda (2024), and Heimersheim and Nanda (2024) include ablation interventions under this heading (see Section 3.5), and Ghandeharioun et al. (2024) include arbitrary transformations that are more akin to activation steering (see Section 3.7). We propose to use 'activation patch' to refer broadly to interventions on hidden vectors in neural networks, while using 'interchange intervention' to pick out a specific type of intervention on causal models, which may be neural networks in some cases.

8. The models in Abraham et al. (2022) are probabilistic models, but we simplify to the deterministic case.

3.4.1 Causal Mediation as Abstraction by a Three-Variable Chain

Vig et al. (2020), Finlayson et al. (2021), Meng et al. (2022), and Stolfo et al. (2023) apply the popular causal inference framework of mediation analysis (Imai et al., 2010; Hicks and Tingley, 2011) to understand how internal model components of neural networks mediate the causal effect of inputs on outputs. It is straightforward to show that mediation analysis is a special case of causal abstraction analysis. Mediation analysis is compatible with both ablation interventions (see Section 3.5) and interchange interventions. In this section, we present mediation analysis with interchange interventions. Suppose that changing the value of variables $X$ from $x$ to $x'$ has an effect on a second set of variables $Y$.
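The mediation quantities formalized next (total, direct, and indirect effects) can be previewed on a tiny structural model. The model below is ours and purely illustrative: X affects Y both directly and through a mediator Z, and each effect is computed by the corresponding intervention.

```python
# Illustrative sketch (ours, not from the paper): total, direct, and
# indirect effects in a tiny linear structural model
#   z = 2*x ;  y = 3*z + x   (x affects y directly and through mediator z).

def solve_y(x, z_intervention=None):
    z = 2 * x if z_intervention is None else z_intervention
    return 3 * z + x

def total_effect(x, x_new):
    return solve_y(x_new) - solve_y(x)

def direct_effect(x, x_new):
    # Change x but hold the mediator at its value under the base input.
    return solve_y(x_new, z_intervention=2 * x) - solve_y(x)

def indirect_effect(x, x_new):
    # Keep x but give the mediator its value under the changed input.
    return solve_y(x, z_intervention=2 * x_new) - solve_y(x)

te = total_effect(0, 1)      # 7
de = direct_effect(0, 1)     # 1
ie = indirect_effect(0, 1)   # 6
# In this linear model the effects decompose additively.
assert te == de + ie == 7
```

In nonlinear models the additive decomposition need not hold; the complete-mediation case of interest below is the one where the indirect effect accounts for the entire total effect.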
Causal mediation analysis determines how this causal effect is mediated by a third set of intermediate variables $\mathbf{Z}$. The fundamental notions involved in mediation are total, direct, and indirect effects, which can be defined with interchange interventions.

Definition 52 (Total, Direct, and Indirect Effects) Consider a causal model $\mathcal{M}$ with disjoint sets of variables $\mathbf{X}, \mathbf{Y}, \mathbf{Z} \subset \mathbf{V}$ such that addition and subtraction are well-defined on values of $\mathbf{Y}$. The total causal effect of changing the values of $\mathbf{X}$ from $x$ to $x'$ on $\mathbf{Y}$ is
$$\mathrm{TotalEffect}(\mathcal{M}, x, x', \mathbf{Y}) = \mathrm{Proj}_{\mathbf{Y}}(\mathrm{Solve}(\mathcal{M}_{x'})) - \mathrm{Proj}_{\mathbf{Y}}(\mathrm{Solve}(\mathcal{M}_{x}))$$
The direct causal effect of changing $\mathbf{X}$ from $x$ to $x'$ on $\mathbf{Y}$ around mediator $\mathbf{Z}$ is
$$\mathrm{DirectEffect}(\mathcal{M}, x, x', \mathbf{Y}, \mathbf{Z}) = \mathrm{Proj}_{\mathbf{Y}}\big(\mathrm{Solve}(\mathcal{M}_{x' \cup \mathrm{IntInv}(\mathcal{M}, \langle x\rangle, \langle \mathbf{Z}\rangle)})\big) - \mathrm{Proj}_{\mathbf{Y}}(\mathrm{Solve}(\mathcal{M}_{x}))$$
The indirect causal effect of changing $\mathbf{X}$ from $x$ to $x'$ on $\mathbf{Y}$ through mediator $\mathbf{Z}$ is
$$\mathrm{IndirectEffect}(\mathcal{M}, x, x', \mathbf{Y}, \mathbf{Z}) = \mathrm{Proj}_{\mathbf{Y}}\big(\mathrm{Solve}(\mathcal{M}_{x \cup \mathrm{IntInv}(\mathcal{M}, \langle x'\rangle, \langle \mathbf{Z}\rangle)})\big) - \mathrm{Proj}_{\mathbf{Y}}(\mathrm{Solve}(\mathcal{M}_{x}))$$

This method has been applied to the analysis of neural networks to characterize how the causal effect of inputs on outputs is mediated by (parts of) hidden vectors, with the goal of identifying complete mediators. This is equivalent to a simple causal abstraction analysis.

Remark 53 Consider a neural network $\mathcal{L}$ with inputs $\mathbf{X}_{\mathrm{In}}$, outputs $\mathbf{X}_{\mathrm{Out}}$, and hidden vector $\mathbf{H}$. Define $\Pi_X = \mathbf{X}_{\mathrm{In}}$, $\Pi_Y = \mathbf{X}_{\mathrm{Out}}$, $\Pi_Z = \mathbf{H}$, and $\Pi_{\perp} = \mathbf{V} \setminus (\mathbf{X}_{\mathrm{In}} \cup \mathbf{H} \cup \mathbf{X}_{\mathrm{Out}})$. Apply variable merge and marginalization to $\mathcal{L}$ with $\Pi$ to obtain a high-level model $\mathcal{H}$ that is an abstraction of $\mathcal{L}$ with $\tau$ and $\omega$ as identity functions. The following are equivalent:

1.
The hidden vector $\mathbf{H}$ completely mediates the causal effect of inputs on outputs:
$$\mathrm{IndirectEffect}(\mathcal{L}, x, x', \mathbf{X}_{\mathrm{Out}}, \mathbf{H}) = \mathrm{TotalEffect}(\mathcal{L}, x, x', \mathbf{X}_{\mathrm{Out}})$$

2. $\mathcal{H}$ has a chain structure in which $Y$ is not a child of $X$:

[Figure: the three-variable chain $X \to Z \to Y$.]

3.4.2 Path Patching as Recursive Interchange Interventions

Path patching (Wang et al., 2023; Goldowsky-Dill et al., 2023; Hanna et al., 2023; Zhang and Nanda, 2024; Prakash et al., 2024) is a type of interchange intervention analysis that targets the connections between variables rather than the variables themselves. To perform a path patch, we use a recursive interchange intervention on a model $\mathcal{M}$ processing a base input $b$ that simulates 'sender' variables $\mathbf{H}$ taking on intervened values from a source input $s$, restricting the effect of this intervention to receiver variables $\mathbf{R}$ while freezing variables $\mathbf{F}$. Each receiver variable takes on a value determined by the $s$ input for the sender variables $\mathbf{H}$ while fixing $\mathbf{F}$ to the values determined by the $b$ input. For each receiver variable $R \in \mathbf{R}$, define an interchange intervention:
$$i_R = \mathrm{IntInv}(\mathcal{M}, \langle s, b\rangle, \langle \mathbf{H}, \mathbf{F} \setminus \{R\}\rangle)$$
The path patch is the recursive interchange intervention resulting from the receiver variables $R_1, \ldots, R_k = \mathbf{R}$ taking on the values determined by each of these basic interchange interventions:
$$j = \mathrm{RecIntInv}(\mathcal{M}, \langle b, \ldots, b\rangle, \langle R_1, \ldots, R_k\rangle, \langle i_{R_1}, \ldots, i_{R_k}\rangle)$$
The intervened model $\mathcal{M}_{b \cup j}$ has a patched path according to the definition of Wang et al. (2023). Simpler path patching experiments will not freeze any variables, meaning $\mathbf{F} = \emptyset$. We show a visualization below, where $\mathbf{H} = \{H_{11}\}$ is the sender neuron, $\mathbf{R} = \{H_{32}\}$ is the receiver neuron, and $\mathbf{F} = \{H_{22}\}$ is a neuron we would like to keep frozen (meaning it has no effect in computing the value of the patched receiver node).
[Figure: path patching visualization on a $3 \times 3$ network, showing the unpatched run on the base input $b$, the run on the source input $s$, the intermediate run computing the patched value $h^*_{32}$ of the receiver, and the final run on $b$ in which only the receiver $H_{32}$ takes its patched value.]

3.5 Ablation as Abstraction by a Three-Variable Collider

Neuroscientific lesion studies involve damaging a region of the brain in order to determine its function; if the lesion results in a behavioral deficit, then the brain region is assumed to be involved in the production of that behavior. In mechanistic interpretability, such interventions are known as ablations. Common ablations include replacing neural hidden vectors with zero activations (Cammarata et al., 2020; Olsson et al., 2022; Geva et al., 2023) or with mean activations over a set of input data (Wang et al., 2023), adding random noise to the activations (causal tracing; Meng et al. 2022, 2023), and replacing activations with the values from a different input (resample ablations; Chan et al. 2022b). To capture ablation studies as a special case of causal abstraction analysis, we only need a high-level model with an input variable, an output variable, and a binary-valued variable aligned with the variables targeted for ablation.

3.5.1 Concept Erasure

Concept erasure is a common application of ablations in which a hidden vector $\mathbf{H}$ in a neural network $\mathcal{L}$ is ablated in order to remove information about a particular concept $C$ (Ravfogel et al., 2020, 2022, 2023b,a; Elazar et al., 2020; Lovering and Pavlick, 2022; Olsson et al., 2022; Belrose et al., 2023). To quantify the success of a concept erasure experiment, each concept $C$ is associated with some degraded behavioral capability encoded as a partial function $A_C : \mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}} \to \mathrm{Val}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}}$ (e.g., ablating the concept 'dog' would be associated with the behavior of inaccurately captioning images with dogs in them).
If performing an ablation on $\mathbf{H}$ to erase the concept $C$ leads $\mathcal{L}$ to have the degraded behavior $A_C$ without changing other behaviors, then the ablation was successful. We can model ablation on $\mathcal{L}$ as abstraction by a three-variable causal model. Define a high-level signature to be an input variable $X$ taking on values from $\mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}}$, an output variable $Y$ taking on values from $\mathrm{Val}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}}$, and a binary variable $Z$ that indicates whether the concept $C$ has been erased. The mechanism for $X$ assigns an arbitrary default input, the mechanism for $Z$ assigns 0, and the mechanism for $Y$ produces the degraded behavior if $Z$ is 1 and mimics $\mathcal{L}$ otherwise:
$$F_Y(x, z) = \begin{cases} A_C(x) & z = 1 \text{ and } x \in \mathrm{Domain}(A_C) \\ \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}}(\mathrm{Solve}(\mathcal{L}_x)) & \text{otherwise} \end{cases}$$
The map $\tau$ from low-level settings to high-level settings simply sets $Z$ to be 0 exactly when the low-level input determines the value of the model component $\mathbf{H}$:
$$\tau(v) = \begin{cases} \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 0, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & \mathrm{Proj}_{\mathbf{H}}(v) = \mathrm{Proj}_{\mathbf{H}}\big(\mathrm{Solve}(\mathcal{L}_{\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v)})\big) \\ \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 1, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & \text{otherwise} \end{cases}$$
The function $\omega$ is defined on low-level input interventions and the interventional $I$ that is an ablation on $\mathbf{H}$ (e.g., setting activations to zero or a mean value, or projecting activations onto a linear subspace whose complement is thought to encode the concept $C$). Low-level input interventions are mapped by $\omega$ to identical high-level input interventions, while $I$ is mapped by $\omega$ to the high-level intervention setting $Z$ to 1. The high-level causal model $\mathcal{H}$ is an exact transformation of the low-level neural model $\mathcal{L}$ under $(\tau, \omega)$ exactly when the ablation removing the concept $C$ results in the degraded behavior defined by $A_C$. We show a visualization below, where $\mathbf{H} = \{H_{12}, H_{22}\}$.

[Figure: the high-level collider $X \to Y \leftarrow Z$ aligned with the low-level network $\mathbf{X}_{\mathrm{In}}^{\mathcal{L}} \to H_{11}, \ldots, H_{33} \to \mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}$, with $Z$ aligned to the ablated components $H_{12}$ and $H_{22}$.]

Observe that the high-level model $\mathcal{H}$ does not have a variable encoding the concept $C$ and the values it might take on. Ablation studies attempt to determine whether a concept is used by a model; they do not characterize how that concept is used.
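One common erasure-style ablation mentioned above is projecting activations onto the orthogonal complement of a concept subspace. A minimal sketch, assuming a single hypothetical concept direction `w` (a placeholder, not a learned erasure method such as LEACE):

```python
import numpy as np


def erase_direction(h, w):
    """Remove the component of activation h along concept direction w."""
    w_hat = w / np.linalg.norm(w)
    return h - (h @ w_hat) * w_hat


h = np.array([3.0, 1.0, 2.0])   # a hidden activation
w = np.array([1.0, 0.0, 0.0])   # hypothetical concept direction
h_erased = erase_direction(h, w)
# h_erased has no component along w, so a linear readout along w returns 0.
```

In the abstraction above, this projection is the interventional $I$ that $\omega$ maps to the high-level intervention setting $Z$ to 1.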
3.5.2 Sub-Circuit Analysis

Sub-circuit analysis (Michel et al., 2019; Sanh and Rush, 2021; Csordás et al., 2021; Cammarata et al., 2020; Olsson et al., 2022; Chan et al., 2022b; Lepori et al., 2023b,a; Wang et al., 2023; Conmy et al., 2023) aims to identify a minimal circuit $\mathbf{C} \subseteq \mathbf{V} \times \mathbf{V}$ between components in a model $\mathcal{L}$ that is sufficient to perform a particular behavior, which we represent with a partial function $B : \mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}} \to \mathrm{Val}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}}$. This claim is cashed out in terms of ablations; specifically, the behavior $B$ should remain intact when all connections outside the circuit, $\overline{\mathbf{C}} = (\mathbf{V} \times \mathbf{V}) \setminus \mathbf{C}$, are ablated. Define $\mathbf{H} = \{H : \exists G\, (G, H) \in \overline{\mathbf{C}}\}$ as the model components with incoming severed connections, $\mathbf{G} = \{G : \exists H\, (G, H) \in \overline{\mathbf{C}}\}$ as the model components with outgoing severed connections, and let $g$ be the ablation values. We can model sub-circuit analysis on $\mathcal{L}$ as abstraction by a three-variable causal model. Define a high-level signature to be an input variable $X$ taking on values from $\mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}}$, an output variable $Y$ taking on values from $\mathrm{Val}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}} \cup \{\perp\}$, and a binary variable $Z$ that indicates whether the connections outside the circuit have been severed.
The mechanism for $X$ assigns an arbitrary default input, the mechanism for $Z$ assigns 0, and the mechanism for $Y$ mimics the behavior of $\mathcal{L}$ when $Z$ is 0 and preserves only the behavior $B$ when $Z$ is 1:
$$F_Y(x, z) = \begin{cases} \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}^{\mathcal{L}}}(\mathrm{Solve}(\mathcal{L}_x)) & z = 0 \\ B(x) & z = 1 \text{ and } x \in \mathrm{Domain}(B) \\ \perp & z = 1 \text{ and } x \notin \mathrm{Domain}(B) \end{cases}$$
The map $\tau$ from low-level settings to high-level settings simply sets $Z$ to be 0 exactly when the low-level input determines the values of the model components $\mathbf{H}$:
$$\tau(v) = \begin{cases} \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 0, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & \mathrm{Proj}_{\mathbf{H}}(v) = \mathrm{Proj}_{\mathbf{H}}\big(\mathrm{Solve}(\mathcal{L}_{\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v)})\big) \\ \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 1, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & \text{otherwise} \end{cases}$$
The function $\omega$ is defined on low-level input interventions in $\mathrm{Domain}(B)$ and the interventional $I$ that fixes the severed connections to an ablated value and leaves the connections in $\mathbf{C}$ untouched. For each $H \in \mathbf{H}$, the new mechanism computes $F_H$ on the original values of the unsevered parents and the ablation values $g$ for the severed parents:
$$I(\langle F_H \rangle)_H = v \mapsto F_H\Big(\mathrm{Proj}_{\{G : (G,H) \notin \overline{\mathbf{C}}\}}(v) \cup \bigcup_{G \in \mathbf{G} : (G,H) \in \overline{\mathbf{C}}} \mathrm{Proj}_G(g)\Big)$$
Low-level input interventions are mapped by $\omega$ to identical high-level input interventions, and $I$ is mapped by $\omega$ to the high-level intervention setting $Z$ to 1. The high-level causal model $\mathcal{H}$ is an exact transformation of the low-level neural model $\mathcal{L}$ under $(\tau, \omega)$ exactly when the sub-circuit $\mathbf{C}$ preserves the behavior $B$. We show a visualization below, where the severed connections are
$$\big(\mathbf{X}_{\mathrm{In}} \times \{H_{31}\}\big) \cup \{(H_{31}, H_{12}), (H_{31}, H_{22}), (H_{31}, H_{32}), (H_{22}, H_{23}), (H_{12}, H_{23})\}$$

[Figure: the high-level collider $X \to Y \leftarrow Z$ aligned with the low-level network, with the severed connections removed from the graph.]

3.5.3 Causal Scrubbing

Causal scrubbing (Chan et al., 2022b) is an ablation method that proposes to determine whether a circuit $\mathbf{C}$ is sufficient for a behavior $B$. It does not fit into our general paradigm of sub-circuit analysis for two reasons. First, the minimal circuit $\mathbf{C}$ is determined by a high-level causal model $\mathcal{H}$ with identical input and output spaces as $\mathcal{L}$ and a surjective partial function $\delta : \mathbf{V}_{\mathcal{L}} \to \mathbf{V}_{\mathcal{H}}$ assigning each low-level variable a high-level variable. Specifically, the low-level minimal circuit is the high-level causal graph pulled back into the low level: $\mathbf{C} = \{(G, H) : \delta(G) \prec \delta(H)\}$.
Second, the connections in the minimal circuit $\mathbf{C}$ are intervened upon in addition to the connections in $\overline{\mathbf{C}}$. This means that every single connection in the network is being intervened upon. Given a base input $b$, causal scrubbing recursively intervenes on every connection in the network. The connections in $\overline{\mathbf{C}}$ are replaced using randomly sampled source inputs, and the connections $(G, H) \in \mathbf{C}$ are replaced using randomly sampled source inputs that set $\delta(H)$ to the same value in $\mathcal{H}$. Chan et al. (2022b) call such interchange interventions, in which the base and source inputs agree on a high-level variable, resampling ablations. The exact intervention value for each targeted connection is determined by a recursive interchange intervention performed by calling the algorithm $\mathrm{Scrub}(b, \mathbf{X}_{\mathrm{Out}}^{\mathcal{L}})$ defined below.

Algorithm 1: $\mathrm{Scrub}(b, \mathbf{H})$
1:  $h \leftarrow \emptyset$
2:  for $H \in \mathbf{H}$ do
3:    if $H \in \mathbf{X}_{\mathrm{In}}^{\mathcal{L}}$ then
4:      $h \leftarrow h \cup \mathrm{Proj}_H(b)$
5:      continue
6:    $g \leftarrow \emptyset$
7:    for $G \in \{G : (G, H) \notin \mathbf{C}\}$ do
8:      $s \sim \mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}}$
9:      $g \leftarrow g \cup \mathrm{Scrub}(s, \{G\})$
10:   if $H \in \mathrm{Domain}(\delta)$ then
11:     $s \sim \{s \in \mathrm{Val}_{\mathbf{X}_{\mathrm{In}}^{\mathcal{L}}} : \mathrm{Proj}_{\delta(H)}(\mathrm{Solve}(\mathcal{H}_s)) = \mathrm{Proj}_{\delta(H)}(\mathrm{Solve}(\mathcal{H}_b))\}$
12:     $g \leftarrow g \cup \mathrm{Scrub}(s, \{G : (G, H) \in \mathbf{C}\})$
13:   $h \leftarrow h \cup \mathrm{Proj}_H(\mathrm{Solve}(\mathcal{L}_g))$
14: return $h$

While causal scrubbing makes use of a high-level model $\mathcal{H}$, the only use of this model in the algorithm Scrub is to sample source inputs $s$ that assign the same value to a variable as a base input $b$. No interventions are performed on the high-level model, and so no correspondences between high-level and low-level interventions are established. This means that we can still model causal scrubbing as abstraction by the same three-variable causal model we defined in Section 3.5.2, which we will name $\mathcal{H}^*$.
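The recursive Scrub procedure can be sketched on a finite toy model. The data structures below (parent lists, a finite input space, the `DELTA` alignment map) are our own scaffolding, and the high-level model is reduced to input parity purely for illustration:

```python
import random

random.seed(0)

# Toy low-level model: X -> {H1, H2} -> Y.
INPUTS = ["X"]
PARENTS = {"H1": ["X"], "H2": ["X"], "Y": ["H1", "H2"]}
MECH = {
    "H1": lambda v: v["X"] % 2,          # computes input parity
    "H2": lambda v: v["X"] + 1,          # an off-circuit nuisance path
    "Y":  lambda v: (v["H1"], v["H2"]),
}
INPUT_SPACE = [0, 1, 2, 3]
CIRCUIT = {("X", "H1"), ("H1", "Y")}     # edges claimed to carry the behavior
DELTA = {"H1": "P", "Y": "O"}            # alignment with high-level variables


def high_value(hvar, x):
    # Toy high-level model: both P and its output O equal the input's parity.
    return x % 2


def scrub(b, variables):
    """Recursively compute scrubbed values for `variables` on base input b."""
    setting = {}
    for H in variables:
        if H in INPUTS:
            setting[H] = b
            continue
        parent_vals = {}
        for G in PARENTS[H]:
            if (G, H) not in CIRCUIT:                # off-circuit edge:
                s = random.choice(INPUT_SPACE)       # resample freely
                parent_vals.update(scrub(s, [G]))
        if H in DELTA:                               # on-circuit edges: resample
            pool = [s for s in INPUT_SPACE           # preserving the aligned
                    if high_value(DELTA[H], s)       # high-level value
                    == high_value(DELTA[H], b)]
            s = random.choice(pool)
            on_circuit = [G for G in PARENTS[H] if (G, H) in CIRCUIT]
            parent_vals.update(scrub(s, on_circuit))
        setting[H] = MECH[H](parent_vals)
    return setting


scrubbed = scrub(2, ["Y"])["Y"]
# The parity component is recomputed only from inputs with the parity of b=2,
# so scrubbed[0] is always 0, while the off-circuit component varies freely.
```

As in the discussion above, the high-level model is consulted only to sample sources that agree with the base input on the aligned variable; no intervention is performed on it.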
The map $\tau$ from low-level settings to high-level settings simply sets $Z$ to be 0 exactly when the low-level input determines the value of the total setting:
$$\tau(v) = \begin{cases} \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 0, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & v = \mathrm{Solve}(\mathcal{L}_{\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v)}) \\ \big(\mathrm{Proj}_{\mathbf{X}_{\mathrm{In}}}(v), 1, \mathrm{Proj}_{\mathbf{X}_{\mathrm{Out}}}(v)\big) & \text{otherwise} \end{cases}$$
The function $\omega$ is defined to map low-level input interventions to identical high-level input interventions and to map any interventional $I$ resulting from a call to Scrub to the high-level intervention setting $Z$ to 1. Chan et al. (2022b) propose to measure the faithfulness of $\mathcal{H}$ by appeal to the proportion of performance maintained by $\mathcal{L}$ under interventionals determined by Scrub. This is equivalent to the approximate transformation metric for the high-level causal model $\mathcal{H}^*$ and low-level causal model $\mathcal{L}$ under $(\tau, \omega)$ (Definition 41), with $P$ as a random distribution over inputs and interventionals from Scrub, $\mathrm{Sim}$ as a function outputting 1 only if the low-level and high-level outputs are equal, and $S$ as expected value.

3.6 Modular Feature Learning as Bijective Transformation

A core task of mechanistic interpretability is disentangling a vector of activations into a set of modular features that correspond to human-intelligible concepts. We construe modular feature learning as constructing a bijective translation (Def. 28). Some methods use a high-level causal model as a source of supervision in order to construct modular features that localize the concepts encoded in the high-level intermediate variables. Other methods are entirely unsupervised and produce modular features that must be further analyzed to determine the concepts they might encode. Formalizing modular feature learning as bijective transformation provides a unified framework to evaluate commonly used mechanistic interpretability methods through distributed interchange interventions (see Huang et al. 2024 for a mechanistic interpretability benchmark based in this framework).
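The bijective-translation view can be sketched concretely: rotate the hidden vector into a feature basis, intervene on a feature coordinate using its value on a source input, and rotate back. The rotation `R` below is an arbitrary illustration, not a learned featurizer:

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # orthogonal featurizer


def dist_interchange(h_base, h_source, feature_idx):
    """Swap one feature coordinate from source into base, in feature space."""
    f_base, f_source = R @ h_base, R @ h_source
    f_base[feature_idx] = f_source[feature_idx]    # intervene on the feature
    return R.T @ f_base                            # translate back


h_b = np.array([1.0, 0.0])
h_s = np.array([2.0, 3.0])
h_patched = dist_interchange(h_b, h_s, feature_idx=0)
# In feature space, coordinate 0 of h_patched matches the source activation
# while coordinate 1 keeps its base value.
```

This is the distributed interchange intervention primitive that the methods below instantiate with different choices of featurizer.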
3.6.1 Unsupervised Methods

Principal Component Analysis. Principal Component Analysis (PCA) is a technique that represents high-dimensional data in a lower-dimensional space while maximally preserving information in the original data. PCA has been used to identify subspaces related to human-interpretable concepts, e.g., gender (Bolukbasi et al., 2016), sentiment (Tigges et al., 2023), truthfulness (Marks and Tegmark, 2023), and visual concepts involved in object classification (Chormai et al., 2022). The modular features produced by PCA are simply the principal components. Given a model $\mathcal{M}$ and $k$ inputs, to featurize an $n$-dimensional hidden vector $\mathbf{H}$ we first collect the activations $\mathrm{Proj}_{\mathbf{H}}(\mathrm{Solve}(\mathcal{M}_{x_{\mathrm{In}}}))$ for each input $x_{\mathrm{In}}$ into a $k \times n$ matrix. Then we use PCA to compute an $n \times n$ matrix $P$ whose rows are principal components. The bijective translation $\tau : \mathbf{V} \to \mathbf{V}$ is defined as
$$\tau(v) = \mathrm{Proj}_{\mathbf{V}}(v) \cup P^{\top} \mathrm{Proj}_{\mathbf{H}}(v)$$

Sparse Autoencoders. Sparse autoencoders are tools for learning to translate an $n$-dimensional hidden vector $\mathbf{H}$ into a sparse, $k$-dimensional feature space with $k \gg n$ (Bricken et al., 2023; Cunningham et al., 2023; Huben et al., 2024; Marks et al., 2024). However, instead of learning a single invertible function that translates activations into a new feature space, sparse autoencoders separately learn an encoder $f_{\mathrm{enc}}$ and decoder $f_{\mathrm{dec}}$, each typically parameterized by a single-layer feed-forward network. These two functions are optimized to reconstruct the activations $\mathrm{Val}_{\mathbf{H}}$ while creating a sparse feature space, with a hyperparameter $\lambda$ balancing the two terms:
$$\ell = \sum_{x_{\mathrm{In}} \in \mathbf{X}_{\mathrm{In}}} \Big( \big\| f_{\mathrm{dec}}\big(f_{\mathrm{enc}}(\mathrm{Proj}_{\mathbf{H}}(\mathrm{Solve}(\mathcal{M}_{x_{\mathrm{In}}})))\big) - \mathrm{Proj}_{\mathbf{H}}(\mathrm{Solve}(\mathcal{M}_{x_{\mathrm{In}}})) \big\|^2 + \lambda \big\| f_{\mathrm{enc}}(\mathrm{Proj}_{\mathbf{H}}(\mathrm{Solve}(\mathcal{M}_{x_{\mathrm{In}}}))) \big\|_1 \Big)$$
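The objective above can be sketched with an untrained single-layer encoder and decoder; the weights below are random placeholders, not trained sparse-autoencoder parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, lam = 4, 16, 0.1                        # hidden dim, feature dim (k >> n)
W_enc, b_enc = rng.normal(size=(k, n)), np.zeros(k)
W_dec, b_dec = rng.normal(size=(n, k)), np.zeros(n)


def encode(h):
    return np.maximum(0.0, W_enc @ h + b_enc)     # ReLU feature activations


def decode(f):
    return W_dec @ f + b_dec


def sae_loss(activations, lam):
    """Squared reconstruction error plus lambda-weighted L1 sparsity term."""
    total = 0.0
    for h in activations:
        f = encode(h)
        total += np.sum((decode(f) - h) ** 2) + lam * np.sum(np.abs(f))
    return total


acts = [rng.normal(size=n) for _ in range(8)]
loss = sae_loss(acts, lam)
```

In practice both terms are minimized jointly by gradient descent over a large corpus of activations.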
Given a sparse autoencoder that perfectly reconstructs the activations, i.e., one with a reconstruction loss of zero, we can view the encoder and decoder as the bijective translation
$$\tau(v) = \mathrm{Proj}_{\mathbf{V}}(v) \cup f_{\mathrm{enc}}(\mathrm{Proj}_{\mathbf{H}}(v)); \qquad \tau^{-1}(v) = \mathrm{Proj}_{\mathbf{V}}(v) \cup f_{\mathrm{dec}}(\mathrm{Proj}_{\mathbf{H}}(v))$$
However, in practice, reconstruction loss is never zero, and sparse autoencoders are approximate transformations of the underlying model.

3.6.2 Aligning Low-Level Features with High-Level Variables

Once a space of features has been learned, there is still the task of aligning features with high-level causal variables. Supervised modular feature learning techniques learn a feature space with an explicit alignment to high-level causal variables already in mind. However, in unsupervised modular feature learning, an additional method is needed to align features with high-level causal variables.

Sparse Feature Selection. A simple baseline method for aligning features with a high-level variable is to train a linear probe with a regularization term to select the features most correlated with the high-level variable (Huang et al., 2024).

Differential Binary Mask. Differential binary masking selects a subset of features for a high-level variable by optimizing a binary mask with a training objective defined using interventions (De Cao et al., 2020; Csordás et al., 2021; De Cao et al., 2021; Davies et al., 2023; Prakash et al., 2024; Huang et al., 2024). Differential binary masking has often been used to select individual neurons that play a particular causal role, but it can just as easily be used to select features, as long as the bijective translation is differentiable.

3.6.3 Supervised Methods

Probes. Probing is the technique of using a supervised or unsupervised model to determine whether a concept is present in a hidden vector of a separate model.
Probes are a popular tool for analyzing deep learning models, especially pretrained language models (Hupkes et al., 2018; Conneau et al., 2018; Peters et al., 2018; Tenney et al., 2019; Clark et al., 2019). Although probes are quite simple, they raise subtle methodological issues, and our theoretical understanding of probes has greatly improved since their recent introduction into the field. From an information-theoretic point of view, we can observe that using arbitrarily powerful probes is equivalent to measuring the mutual information between the concept and the hidden vector (Hewitt and Liang, 2019; Pimentel et al., 2020). If we restrict the class of probing models based on their complexity, we can measure how usable the information is (Xu et al., 2020; Hewitt et al., 2021). Regardless of what probe models are used, successfully probing a hidden vector does not guarantee that it plays a causal role in model behavior (Ravichander et al., 2020; Elazar et al., 2020; Geiger et al., 2020, 2021). However, a linear probe with weights $W$ trained to predict the value of a concept from a hidden vector $\mathbf{H}$ can be understood as learning a feature space for activations that can be analyzed with causal abstraction. Let $r_1, \ldots, r_k$ be a set of orthonormal vectors that span the rowspace of $W$ and $u_{k+1}, \ldots, u_n$ be a set of orthonormal vectors that span the nullspace of $W$. The bijective transformation is
$$\tau(v) = \mathrm{Proj}_{\mathbf{V}}(v) \cup [r_1 \ldots r_k\, u_{k+1} \ldots u_n]\, \mathrm{Proj}_{\mathbf{H}}(v)$$
Since the probe is trained to capture information related to the concept $C$, the rowspace of the projection matrix $W$ (i.e., the first $k$ dimensions of the new feature space) might localize the concept $C$. If we have a high-level model that has a variable for the concept $C$, that variable should be aligned with the first $k$ features.
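Such a basis can be constructed from the probe weights via SVD. A sketch, where the probe matrix `W` is a hypothetical one-concept example:

```python
import numpy as np


def probe_feature_basis(W):
    """Orthonormal basis: rowspace of W first (r_1..r_k), then nullspace."""
    _, S, Vt = np.linalg.svd(W)
    k = int(np.sum(S > 1e-10))               # rank = rowspace dimension
    return np.vstack([Vt[:k], Vt[k:]])       # full orthonormal basis of R^n


W = np.array([[1.0, 0.0, 1.0]])              # hypothetical 1-concept probe
B = probe_feature_basis(W)
features = B @ np.array([2.0, 5.0, 0.0])     # translate an activation
# features[0] is the coordinate along the probe's rowspace, where the concept
# may be localized; the remaining coordinates lie in the nullspace of W.
```

The rows of `Vt` from the SVD already split into rowspace and nullspace bases, so the translation matrix is orthogonal by construction.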
Distributed Alignment Search. Distributed Alignment Search (DAS) finds linear subspaces of an $n$-dimensional hidden vector $\mathbf{H}$ in model $\mathcal{L}$ that align with high-level variables $X_1, \ldots, X_k$ in model $\mathcal{H}$ by optimizing an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ with a loss objective defined using distributed interchange interventions (Section 2.5). This method has been used to analyze causal mechanisms in a variety of deep learning models (Geiger et al., 2023; Wu et al., 2023; Tigges et al., 2023; Arora et al., 2024; Huang et al., 2024; Minder et al., 2024). The bijective translation $\tau : \mathbf{V} \to \mathbf{V}$ is defined as
$$\tau(v) = \mathrm{Proj}_{\mathbf{V}}(v) \cup Q^{\top} \mathrm{Proj}_{\mathbf{H}}(v)$$
DAS optimizes $Q$ so that $Y_1, \ldots, Y_k$, disjoint subspaces of $\mathbf{H}$, are abstracted by high-level variables $X_1, \ldots, X_k$ using the loss
$$\ell = \sum_{b, s_1, \ldots, s_k \in \mathbf{X}_{\mathrm{In}}^{\mathcal{L}}} \mathrm{CE}\Big( \mathcal{L}_{b \cup \mathrm{DistIntInv}(\mathcal{L}, \tau, \langle s_1, \ldots, s_k\rangle, \langle Y_1, \ldots, Y_k\rangle)},\; \mathcal{H}_{\omega(b) \cup \mathrm{IntInv}(\mathcal{H}, \langle \omega(s_1), \ldots, \omega(s_k)\rangle, \langle X_1, \ldots, X_k\rangle)} \Big)$$
where $\omega$ maps low-level inputs to high-level inputs and $\mathrm{CE}$ is cross-entropy loss.

3.7 Activation Steering as Causal Abstraction

Causal explanation and manipulation are intrinsically linked (Woodward, 2003); if we understand which components in a deep learning model store high-level causal variables, we will be able to control the behavior of the deep learning model via interventions on those components. Controlling model behavior via interventions was initially studied on recursive neural networks and generative adversarial networks (Giulianelli et al., 2018; Bau et al., 2019; Soulos et al., 2020; Besserve et al., 2020). Recently, various works have focused on steering large language model generation through interventions. For instance, researchers have demonstrated that adding fixed steering vectors to the residual stream of transformer models can control the model's generation without training (Subramani et al., 2022; Turner et al., 2023; Zou et al., 2023; Vogel, 2024; Li et al., 2024; Wu et al., 2024a).
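A minimal sketch of steering on a toy two-layer network, assuming the steering vector is simply added to a hidden activation at inference time; the weights and steering vector are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))


def forward(x, steer=None):
    h = np.tanh(W1 @ x)          # hidden activation
    if steer is not None:
        h = h + steer            # add the fixed steering vector
    return W2 @ h


x = np.array([0.5, -1.0, 2.0])
steer_vec = np.array([1.0, 0.0, 0.0, 0.0])   # illustrative steering vector
base_out = forward(x)
steered_out = forward(x, steer=steer_vec)
# Because the output layer is linear here, the output shift is exactly
# W2 @ steer_vec, independent of the input.
```

Note that, unlike an interchange intervention, the steered activation need not be realizable by any input, which is precisely the point taken up below.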
Additionally, parameter-efficient fine-tuning methods such as adapter tuning (Houlsby et al., 2019) can be viewed as interventions on model parameters. Between these two types of methods is representation fine-tuning (Wu et al., 2024b), where low-rank adapters are attached to a small number of hidden vectors in order to steer model behavior. While a successful interchange intervention analysis implies an ability to control the low-level model, the reverse does not hold. Activation steering has the power to bring the hidden vectors of a network off the distribution induced by the input data, potentially targeting hidden vectors that have the same value for every possible model input. An interchange intervention on such a vector will never have any impact on the model behavior. However, activation steering can still be represented in the framework of causal abstraction. For hidden vectors that are useful knobs to steer model generations in a specific direction, we can simply define the map $\omega$ from low-level interventionals to high-level interventionals on steering interventions rather than interchange interventions. The crucial point is that a causal abstraction analysis that does not define $\omega$ on interchange interventions will fail to uncover how the network reasons. It may nonetheless uncover how to control the network's reasoning process.

3.8 Training AI Models to be Interpretable

Our treatment of interpretability has been largely through a scientific lens; an AI model being uninterpretable simply makes it an interesting object of study. However, when cracking open the black box is understood as a normative goal that could have real positive societal impacts, a natural question is whether we can make our job any easier by creating models that are inherently more interpretable. This is an active area of research.
General-purpose approaches attempt to design training procedures that produce more interpretable representations, such as using a SoLU function as a non-linearity (Elhage et al., 2022a), collapsing a real-valued vector space into a discrete-valued space (Tamkin et al., 2024), learning a contextually weighted combination of vectors for different word meanings (Hewitt et al., 2023), or replacing MLPs with a new kind of network (Liu et al., 2024). More targeted approaches construct architectural bottlenecks (Koh et al., 2020; Yüksekgönül et al., 2023; Chauhan et al., 2023) or use interchange intervention based loss terms (Geiger et al., 2022b; Wu et al., 2022b,a; Huang et al., 2023b; Zur et al., 2024) in order to force a concept to be mediated by a particular vector or feature. While these are active avenues of exploration, there are, to date, no state-of-the-art models that have architectures or losses designed around interpretability.

4 Conclusion

We submit that causal abstraction provides a theoretical foundation for mechanistic interpretability that clarifies core concepts and lays useful groundwork for future development of methods that investigate algorithmic hypotheses about the internal reasoning of AI models.

Acknowledgements

Thank you to the reviewers, Nora Belrose, and Frederik Hytting Jørgensen for their feedback on earlier drafts of this paper. In particular, we would like to thank Sander Beckers for deep and thoughtful engagement as a reviewer, which improved the quality of this work over the course of the review process. This research is supported by a grant from Open Philanthropy.

References

Eldar David Abraham, Karel D'Oosterlinck, Amir Feder, Yair Ori Gat, Atticus Geiger, Christopher Potts, Roi Reichart, and Zhengxuan Wu. CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. arXiv:2205.14140, 2022. URL https://arxiv.org/abs/2205.14140.

Raquel G Alhama and Willem Zuidema.
A review of computational models of basic rule learning: The neural-symbolic debate and beyond. Psychonomic Bulletin & Review, 26(4):1174–1194, 2019.

Aryaman Arora, Dan Jurafsky, and Christopher Potts. CausalGym: Benchmarking causal interpretability methods on linguistic tasks. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14638–14663, Bangkok, Thailand, 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.acl-long.785.

Elias Bareinboim and Juan D. Correa. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. Visualizing and understanding GANs. In Deep Generative Models for Highly Structured Data, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJgON8ItOV.

Sander Beckers. Causal explanations and XAI. In Bernhard Schölkopf, Caroline Uhler, and Kun Zhang, editors, Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pages 90–109. PMLR, 11–13 Apr 2022. URL https://proceedings.mlr.press/v177/beckers22a.html.

Sander Beckers and Joseph Halpern. Abstracting causal models. In AAAI Conference on Artificial Intelligence, 2019.

Sander Beckers, Frederick Eberhardt, and Joseph Y. Halpern. Approximate causal abstractions. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 2019.

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form.
In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10–16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/d066d21c619d0a78c5b557fa3291a8f4-Abstract-Conference.html.

Michel Besserve, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. Counterfactuals uncover the modular structure of deep generative models. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SJxDDpEKvH.

Alexander Binder, Grégoire Montavon, Sebastian Bach, Klaus-Robert Müller, and Wojciech Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. CoRR, abs/1604.00825, 2016. URL http://arxiv.org/abs/1604.00825.

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.

Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda B. Viégas, and Martin Wattenberg. An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021. URL https://arxiv.org/abs/2104.07143.

Stephan Bongers, Patrick Forré, Jonas Peters, and Joris M. Mooij. Foundations of structural causal models with cycles and latent variables. The Annals of Statistics, 49(5):2885–2915, 2021.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 2020. doi: 10.23915/distill.00024. https://distill.pub/2020/circuits.

Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt. Visual causal feature learning. In Marina Meila and Tom Heskes, editors, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12–16, 2015, Amsterdam, The Netherlands, pages 181–190. AUAI Press, 2015. URL http://auai.org/uai2015/proceedings/papers/109.pdf.

Krzysztof Chalupka, Tobias Bischoff, Frederick Eberhardt, and Pietro Perona. Unsupervised discovery of El Niño using causal feature learning on microlevel climate data. In Alexander T. Ihler and Dominik Janzing, editors, Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 2016, June 25–29, 2016, New York City, NY, USA. AUAI Press, 2016. URL http://auai.org/uai2016/proceedings/papers/11.pdf.

Krzysztof Chalupka, Frederick Eberhardt, and Pietro Perona. Causal feature learning: an overview. Behaviormetrika, 44:137–164, 2017.

Chun Sik Chan, Huanqi Kong, and Guanqing Liang. A comparative study of faithfulness metrics for model interpretability methods.
In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022, pages 5029–5038. Association for Computational Linguistics, 2022a. doi: 10.18653/v1/2022.acl-long.345. URL https://doi.org/10.18653/v1/2022.acl-long.345.

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022b. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.

Aditya Chattopadhyay, Piyushi Manupriya, Anirban Sarkar, and Vineeth N Balasubramanian. Neural network attributions: A causal perspective. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 981–990, 2019.

Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, and Krishnamurthy Dvijotham. Interactive concept bottleneck models. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7–14, 2023, pages 5948–5955. AAAI Press, 2023. doi: 10.1609/AAAI.V37I5.25736. URL https://doi.org/10.1609/aaai.v37i5.25736.

Pattarawat Chormai, Jan Herrmann, Klaus-Robert Müller, and Grégoire Montavon. Disentangled explanations of neural network predictions by finding relevant subspaces, 2022.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention.
In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, August 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W19-4828.

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Abstract-Conference.html.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1198. URL https://aclanthology.org/P18-1198.

P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Conference Record of the Fourth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 238–252, Los Angeles, California, 1977. ACM Press, New York, NY.

Kathleen A. Creel. Transparency in complex computational systems. Philosophy of Science, 87:568–589, 2020.

Róbert Csordás, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
OpenReview.net, 2021. URL https://openreview.net/forum?id=7uVcpu-gMD.

Róbert Csordás, Christopher Potts, Christopher D Manning, and Atticus Geiger. Recurrent neural networks learn to store and generate sequences using non-linear representations. In The 7th BlackboxNLP Workshop, 2024. URL https://openreview.net/forum?id=NUQeYgg8x4.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. CoRR, abs/2309.08600, 2023. doi: 10.48550/ARXIV.2309.08600. URL https://doi.org/10.48550/arXiv.2309.08600.

Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, and David Bau. Discovering variable binding circuitry with desiderata. CoRR, abs/2307.03637, 2023. doi: 10.48550/ARXIV.2307.03637. URL https://doi.org/10.48550/arXiv.2307.03637.

Nicola De Cao, Michael Sejr Schlichtkrull, Wilker Aziz, and Ivan Titov. How do decisions emerge across layers in neural models? Interpretation with differentiable masking. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3243–3255, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.262. URL https://aclanthology.org/2020.emnlp-main.262.

Nicola De Cao, Leon Schmid, Dieuwke Hupkes, and Ivan Titov. Sparse interventions in language models with differentiable masking. arXiv:2112.06837, 2021. URL https://arxiv.org/abs/2112.06837.

Julien Dubois, Frederick Eberhardt, Lynn K. Paul, and Ralph Adolphs. Personality beyond taxonomy. Nature Human Behaviour, 4(11):1110–1117, 2020a.

Julien Dubois, Hiroyuki Oya, Julian Michael Tyszka, Matthew A. Howard, Frederick Eberhardt, and Ralph Adolphs. Causal mapping of emotion networks in the human brain: Framework and initial findings. Neuropsychologia, 145, 2020b.
Joel Dyer, Nicholas Bishop, Yorgos Felekis, Fabio Massimo Zennaro, Anisoara Calinescu, Theodoros Damoulas, and Michael J. Wooldridge. Interventionally consistent surrogates for agent-based simulators. CoRR, abs/2312.11158, 2023. doi: 10.48550/ARXIV.2312.11158. URL https://doi.org/10.48550/arXiv.2312.11158.

Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. In Proceedings of the 2020 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, November 2020. doi: 10.18653/v1/W18-5426.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model's 'factual' predictions. CoRR, abs/2207.14251, 2022. doi: 10.48550/arXiv.2207.14251. URL https://doi.org/10.48550/arXiv.2207.14251.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax linear units. Transformer Circuits Thread, 2022a. https://transformer-circuits.pub/2022/solu/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022b.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. CausaLM: Causal Model Explanation Through Counterfactual Language Models. Computational Linguistics, pages 1–54, 05 2021. ISSN 0891-2017. doi: 10.1162/coli_a_00404. URL https://doi.org/10.1162/coli_a_00404.

Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=zb3b6oKO77.

Jiahai Feng, Stuart Russell, and Jacob Steinhardt. Monitoring latent world states in language models with propositional probes. CoRR, abs/2406.19501, 2024. doi: 10.48550/ARXIV.2406.19501. URL https://doi.org/10.48550/arXiv.2406.19501.

Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models.
In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1828–1843, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.144. URL https://aclanthology.org/2021.acl-long.144.

Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163–173, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.blackboxnlp-1.16. URL https://www.aclweb.org/anthology/2020.blackboxnlp-1.16.

Atticus Geiger, Hanson Lu, Thomas F Icard, and Christopher Potts. Causal abstractions of neural networks. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=RmuXDtjDhG.

Atticus Geiger, Alexandra Carstensen, Michael C. Frank, and Christopher Potts. Relational reasoning and generalization using nonsymbolic neural networks. Psychological Review, 2022a. doi: 10.1037/rev0000371.

Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338. PMLR, 17–23 Jul 2022b. URL https://proceedings.mlr.press/v162/geiger22a.html.

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman.
Finding alignments between interpretable causal variables and distributed neural representations. Ms., Stanford University, 2023. URL https://arxiv.org/abs/2303.02536.

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=F1G7y94K02.

Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models. CoRR, abs/2401.06102, 2024. doi: 10.48550/ARXIV.2401.06102. URL https://doi.org/10.48550/arXiv.2401.06102.

Mario Giulianelli, Jack Harding, Florian Mohnert, Dieuwke Hupkes, and Willem H. Zuidema. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, pages 240–248. Association for Computational Linguistics, 2018. doi: 10.18653/v1/w18-5426. URL https://doi.org/10.18653/v1/w18-5426.

Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching, 2023.

Ana Esponera Gómez and Giovanni Cinà. Interchange intervention training applied to post-meal glucose prediction for type 1 diabetes mellitus patients. In 9th Causal Inference Workshop at UAI 2024, 2024. URL https://openreview.net/forum?id=6sRLazdA1l.
Yash Goyal, Uri Shalit, and Been Kim. Explaining classifiers with causal concept effect (CaCE). CoRR, abs/1907.07165, 2019. URL http://arxiv.org/abs/1907.07165.

Satchel Grant, Noah D. Goodman, and James L. McClelland. Emergent symbol-like number variables in artificial neural networks, 2025. URL https://arxiv.org/abs/2501.06141.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. In Transactions on Machine Learning Research (TMLR), 2023. URL https://doi.org/10.48550/arXiv.2305.01610.

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/efbba7719c5172d175240f24be11280-Abstract-Conference.html.

Michael Harradon, Jeff Druce, and Brian E. Ruttenberg. Causal learning and explanation of deep neural networks via autoencoded activations. CoRR, abs/1802.00541, 2018. URL http://arxiv.org/abs/1802.00541.

Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs.
knowledge editing in language models. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/3927bbdcf0e8d1fa8a23c26f358a281-Abstract-Conference.html.

Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching, 2024.

John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1275. URL https://www.aclweb.org/anthology/D19-1275.

John Hewitt, Kawin Ethayarajh, Percy Liang, and Christopher D. Manning. Conditional probing: measuring usable information beyond a baseline. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 1626–1639. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.122. URL https://doi.org/10.18653/v1/2021.emnlp-main.122.

John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. Backpack language models. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9103–9125. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.506.
URL https://doi.org/10.18653/v1/2023.acl-long.506.

Raymond Hicks and Dustin Tingley. Causal mediation analysis. The Stata Journal, 11(4):605–619, 2011. doi: 10.1177/1536867X1201100407. URL https://doi.org/10.1177/1536867X1201100407.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.

Yaojie Hu and Jin Tian. Neuron dependency graphs: A causal abstraction of neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9020–9040. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hu22b.html.

Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rigorously assessing natural language explanations of neurons. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 317–331, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.24. URL https://aclanthology.org/2023.blackboxnlp-1.24.

Jing Huang, Zhengxuan Wu, Kyle Mahowald, and Christopher Potts. Inducing character-level structure in subword-based language models with type-level interchange intervention training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12163–12180, Toronto, Canada, July 2023b.
Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.770. URL https://aclanthology.org/2023.findings-acl.770.

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. RAVEL: Evaluating interpretability methods on disentangling language model representations, 2024.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.

Dieuwke Hupkes, Sara Veldhoen, and Willem H. Zuidema. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. J. Artif. Intell. Res., 61:907–926, 2018. doi: 10.1613/jair.1.11196. URL https://doi.org/10.1613/jair.1.11196.

Kosuke Imai, Luke Keele, and Dustin Tingley. A general approach to causal mediation analysis. Psychological Methods, 15(4):309–334, Dec 2010. doi: 10.1037/a0020761.

Yumi Iwasaki and Herbert A. Simon. Causality and model abstraction. Artificial Intelligence, 67(1):143–194, 1994.

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4198–4205. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.386. URL https://doi.org/10.18653/v1/2020.acl-main.386.

Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models. CoRR, abs/2403.03867, 2024. doi: 10.48550/ARXIV.2403.03867. URL https://doi.org/10.48550/arXiv.2403.03867.
Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse: From counterfactual explanations to interventions. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, page 353–362, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445899. URL https://doi.org/10.1145/3442188.3445899.

Amir-Hossein Karimi, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. A survey of algorithmic recourse: Contrastive explanations and consequential recommendations. ACM Comput. Surv., 55(5):95:1–95:29, 2023. doi: 10.1145/3527848. URL https://doi.org/10.1145/3527848.

Armin Kekić, Bernhard Schölkopf, and Michel Besserve. Targeted reduction of causal models. CoRR, abs/2311.18639, 2023. doi: 10.48550/ARXIV.2311.18639. URL https://doi.org/10.48550/arXiv.2311.18639.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viégas, and Rory Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV), 2018.

David Kinney. On the explanatory depth and pragmatic value of coarse-grained, probabilistic, causal explanations. Philosophy of Science, 86:145–167, 2019.

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5338–5348. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/koh20a.html.

Michael A. Lepori, Ellie Pavlick, and Thomas Serre. NeuroSurgeon: A toolkit for subnetwork analysis. CoRR, abs/2309.00244, 2023a. doi: 10.48550/ARXIV.2309.00244. URL https://doi.org/10.48550/arXiv.2309.00244.

Michael A. Lepori, Thomas Serre, and Ellie Pavlick.
Break it down: Evidence for structural compositionality in neural networks. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/85069585133c4c168c865e65d72e9775-Abstract-Conference.html.

Belinda Z. Li, Maxwell I. Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1813–1827. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.143. URL https://doi.org/10.18653/v1/2021.acl-long.143.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html.

Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. CoRR, abs/2307.09458, 2023. doi: 10.48550/ARXIV.2307.09458. URL https://doi.org/10.48550/arXiv.2307.09458.

Zachary C. Lipton. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018. doi: 10.1145/3233231. URL https://doi.org/10.1145/3233231.

Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, and Max Tegmark.
KAN: Kolmogorov-Arnold networks. CoRR, abs/2404.19756, 2024. doi: 10.48550/ARXIV.2404.19756. URL https://doi.org/10.48550/arXiv.2404.19756.

Charles Lovering and Ellie Pavlick. Unit testing for concepts in neural networks. CoRR, abs/2208.10244, 2022. doi: 10.48550/arXiv.2208.10244. URL https://doi.org/10.48550/arXiv.2208.10244.

Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.

Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles, 2019.

Qing Lyu, Marianna Apidianaki, and Chris Callison-Burch. Towards faithful model explanation in NLP: A survey. CoRR, abs/2209.11326, 2022. doi: 10.48550/arXiv.2209.11326. URL https://doi.org/10.48550/arXiv.2209.11326.

Gary F Marcus, Sugumaran Vijayan, S Bandi Rao, and Peter M Vishton. Rule learning by seven-month-old infants. Science, 283(5398):77–80, 1999.

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023.

Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024.

David Marr. Vision. W.H. Freeman and Company, 1982.

Riccardo Massidda, Atticus Geiger, Thomas Icard, and Davide Bacciu. Causal abstraction with soft interventions.
In Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing, editors, Conference on Causal Learning and Reasoning, CLeaR 2023, Amazon Development Center, Tübingen, Germany, April 11-14, 2023, volume 213 of Proceedings of Machine Learning Research, pages 68–87. PMLR, 2023. URL https://proceedings.mlr.press/v213/massidda23a.html.

Riccardo Massidda, Sara Magliacane, and Davide Bacciu. Learning causal abstractions of linear structural causal models. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024. URL https://openreview.net/forum?id=XlFqI9TMhf.

J. L. McClelland, D. E. Rumelhart, and PDP Research Group, editors. Parallel Distributed Processing. Volume 2: Psychological and Biological Models. MIT Press, Cambridge, MA, 1986.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT, 2022. URL https://arxiv.org/abs/2202.05262.

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=MkbcAHIYgyS.

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one?, 2019.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090.

Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, and Ryan Cotterell. Controllable context sensitivity and the knob behind it, 2024.
URL https://arxiv.org/abs/2411.07404.

Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, and Yonatan Belinkov. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability, 2024. URL https://arxiv.org/abs/2408.01416.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/pdf?id=9XFSbDPmdW.

Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2023, Singapore, December 7, 2023, pages 16–30. Association for Computational Linguistics, 2023b. doi: 10.18653/V1/2023.BLACKBOXNLP-1.2. URL https://doi.org/10.18653/v1/2023.blackboxnlp-1.2.

Tanmayee Narendra, Anush Sankaran, Deepak Vijaykeerthy, and Senthil Mani. Explaining deep learning models using causal inference. CoRR, abs/1811.04376, 2018. URL http://arxiv.org/abs/1811.04376.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. CoRR, abs/2311.03658, 2023. doi: 10.48550/ARXIV.2311.03658. URL https://doi.org/10.48550/arXiv.2311.03658.

Judea Pearl. Causality. Cambridge University Press, 2009.

Judea Pearl. The limitations of opaque learning machines. Possible Minds: Twenty-Five Ways of Looking at AI, pages 13–19, 2019.

Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. Dissecting contextual word embeddings: Architecture and representation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1499–1509. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-1179. URL https://doi.org/10.18653/v1/d18-1179.

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. Information-theoretic probing for linguistic structure. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4609–4622. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.420.
URL https://doi.org/10.18653/v1/2020.acl-main.420.

Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, and David Bau. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8sKcAWOf2D.

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7237–7256. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.647. URL https://doi.org/10.18653/v1/2020.acl-main.647.

Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan Cotterell. Linear adversarial concept erasure, 2022.

Shauli Ravfogel, Yoav Goldberg, and Ryan Cotterell. Log-linear guardedness and its implications. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9413–9431. Association for Computational Linguistics, 2023a. doi: 10.18653/V1/2023.ACL-LONG.523. URL https://doi.org/10.18653/v1/2023.acl-long.523.

Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, and Ryan Cotterell. Kernelized concept erasure, 2023b.

Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. Probing the probing paradigm: Does probing accuracy entail task relevance?, 2020.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939778. URL https://doi.org/10.1145/2939672.2939778.
Eigil F. Rischel and Sebastian Weichwald. Compositional abstraction error and a category of causal models. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
Juan Diego Rodriguez, Aaron Mueller, and Kanishka Misra. Characterizing the role of similarity in the property inferences of language models. CoRR, abs/2410.22590, 2024. doi: 10.48550/ARXIV.2410.22590. URL https://doi.org/10.48550/arXiv.2410.22590.
Paul K. Rubenstein, Sebastian Weichwald, Stephan Bongers, Joris M. Mooij, Dominik Janzing, Moritz Grosse-Wentrup, and Bernhard Schölkopf. Causal consistency of structural equation models. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
D. E. Rumelhart, J. L. McClelland, and PDP Research Group, editors. Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.
Victor Sanh and Alexander M. Rush. Low-complexity probing via finding subnetworks. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 960–966. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.naacl-main.74. URL https://doi.org/10.18653/v1/2021.naacl-main.74.
Naomi Saphra and Sarah Wiegreffe. Mechanistic? CoRR, abs/2410.09087, 2024. doi: 10.48550/ARXIV.2410.09087. URL https://doi.org/10.48550/arXiv.2410.09087.
Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, and Buck Shlegeris.
Polysemanticity and capacity in neural networks. CoRR, abs/2210.01892, 2022. doi: 10.48550/ARXIV.2210.01892. URL https://doi.org/10.48550/arXiv.2210.01892.
Jessica Schrouff, Sebastien Baur, Shaobo Hou, Diana Mincu, Eric Loreaux, Ralph Blanes, James Wexler, Alan Karthikesalingam, and Been Kim. Best of both worlds: local and global explanations with human-understandable concepts, 2022.
Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: Learning important features through propagating activation differences. CoRR, abs/1605.01713, 2016. URL http://arxiv.org/abs/1605.01713.
Herbert A. Simon and Albert Ando. Aggregation of variables in dynamic systems. Econometrica, 29(2):111–138, 1961. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/1909285.
Paul Smolensky. Neural and conceptual interpretation of PDP models. In James L. McClelland, David E. Rumelhart, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Psychological and Biological Models, volume 2, pages 390–431. MIT Press, 1986.
Paul Soulos, R. Thomas McCoy, Tal Linzen, and Paul Smolensky. Discovering the compositional structure of vector representations with role learning networks. In Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupala, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad, editors, Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2020, Online, November 2020, pages 238–254. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.blackboxnlp-1.23. URL https://doi.org/10.18653/v1/2020.blackboxnlp-1.23.
Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.
Jost Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller.
Striving for simplicity: The all convolutional net. CoRR, December 2014.
Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 7035–7052. Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.emnlp-main.435.
Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models. arXiv:2205.05124, 2022. URL https://arxiv.org/abs/2205.05124.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org, 2017.
Alex Tamkin, Mohammad Taufeeque, and Noah Goodman. Codebook features: Sparse and discrete interpretability for neural networks, 2024. URL https://openreview.net/forum?id=LfhG5znxzR.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1452.
Simon Thorpe. Local vs. distributed coding. Intellectica, 8(2):3–40, 1989. doi: 10.3406/intel.1989.873. URL https://www.persee.fr/doc/intel_0769-4113_1989_num_8_2_873.
Jin Tian. Identifying dynamic sequential plans. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, 2008.
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models, 2023.
Eric Todd, Millicent L.
Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=AwyxtyMwaG.
Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023.
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias, 2020.
Theia Vogel. repeng, 2024. URL https://github.com/vgel/repeng/.
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 11–20. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1002. URL https://doi.org/10.18653/v1/D19-1002.
James Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, 2003.
James Woodward. Explanatory autonomy: the role of proportionality, stability, and conditional irrelevance. Synthese, 198:237–265, 2021.
Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in fine-tuning via representation editing. arXiv:2402.15179, 2024a.
URL https://arxiv.org/abs/2402.15179.
Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, and Christopher Potts. Causal Proxy Models for concept-based model explanations. arXiv:2209.14279, 2022a. URL https://arxiv.org/abs/2209.14279.
Zhengxuan Wu, Atticus Geiger, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah D. Goodman. Causal distillation for language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle, United States, July 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.318. URL https://aclanthology.org/2022.naacl-main.318.
Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, and Noah Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=nRfClnMhVX.
Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models. CoRR, abs/2404.03592, 2024b. doi: 10.48550/ARXIV.2404.03592. URL https://doi.org/10.48550/arXiv.2404.03592.
Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=r1eBeyHFDH.
Stephen Yablo. Mental causation. Philosophical Review, 101:245–280, 1992.
Mert Yüksekgönül, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
URL https://openreview.net/forum?id=nA5AZ8CEyow.
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1.
Fabio Massimo Zennaro, Máté Drávucz, Geanina Apachitei, Widanalage Dhammika Widanage, and Theodoros Damoulas. Jointly learning consistent causal abstractions over multiple interventional distributions. In Mihaela van der Schaar, Cheng Zhang, and Dominik Janzing, editors, Conference on Causal Learning and Reasoning, CLeaR 2023, Amazon Development Center, Tübingen, Germany, April 11-14, 2023, volume 213 of Proceedings of Machine Learning Research, pages 88–121. PMLR, 2023a. URL https://proceedings.mlr.press/v213/zennaro23a.html.
Fabio Massimo Zennaro, Paolo Turrini, and Theodoros Damoulas. Quantifying consistency and information loss for causal abstraction learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, pages 5750–5757. ijcai.org, 2023b. doi: 10.24963/IJCAI.2023/638. URL https://doi.org/10.24963/ijcai.2023/638.
Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405, 2023. URL https://arxiv.org/abs/2310.01405.
Amir Zur, Elisa Kreiss, Karel D'Oosterlinck, Christopher Potts, and Atticus Geiger. Updating CLIP to prefer descriptions over captions.
In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20178–20187, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.1125. URL https://aclanthology.org/2024.emnlp-main.1125.