Paper deep dive
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina Păsăreanu, Somesh Jha
Models: Custom 2-layer decoder-only Transformer (128-dim embeddings, 1 attention head layer 1, 4 attention heads layer 2, 512 hidden MLP)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 6:00:27 PM
Summary
The paper introduces an axiomatic framework for validating mechanistic interpretations of neural networks, drawing inspiration from abstract interpretation in program analysis. It defines four axioms—Prefix Equivalence, Component Equivalence, Prefix Replaceability, and Component Replaceability—to formally characterize how well a human-interpretable model (h) approximates the semantics of a neural network (t). The authors demonstrate the framework's utility by applying it to a Transformer-based 2-SAT solver and a modular addition model.
Entities (5)
Relation Signals (3)
Transformer → implements → 2-SAT Solver
confidence 98% · new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem
Axiomatic Approach → inspiredby → Abstract Interpretation
confidence 95% · Inspired by the notion of abstract interpretation from the program analysis literature
Axiomatic Approach → validates → Mechanistic Interpretability
confidence 95% · We give a set of axioms that formally characterize a mechanistic interpretation
Cypher Suggestions (2)
Find all models analyzed using the axiomatic approach · confidence 90% · unvalidated
MATCH (m:Model)-[:ANALYZED_BY]->(a:Methodology {name: 'Axiomatic Approach'}) RETURN m.nameIdentify relationships between methodologies and fields of study · confidence 85% · unvalidated
MATCH (m:Methodology)-[r]->(f:FieldOfStudy) RETURN m.name, type(r), f.name
Abstract
Abstract:Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem.
Tags
Links
Full Text
239,818 characters extracted from source content.
Expand or collapse full text
Validating Mechanistic Interpretations: An Axiomatic Approach Nils Palumbo Ravi Mangal Zifan Wang Saranya Vijayakumar Corina Păsăreanu Somesh Jha Abstract Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT problem. mechanistic interpretability, 2-SAT solver, axiomatic approach 1 Introduction Neural networks are notoriously opaque. Mechanistic interpretability seeks to reverse-engineer the computation described by a neural network in a human-interpretable form. Typically, the resulting mechanistic interpretation takes the form of a circuit (i.e., program) operating over features (i.e., symbols). The features refer to human-interpretable properties of the input (such as edges, textures, or shapes for vision models and part-of-speech tags, named entities, or semantic relationships for language models) that the model represents internally in its representations spaces whereas the circuit refers to a chain of human-interpretable operations, encoded by the model in its layers, such that each operation processes and transforms the features (Millière & Buckner, 2024). While mechanistic interpretability has been used to analyze aspects of vision as well as language models, it has found most success in analyzing small models trained to solve specific algorithmic tasks such as modular addition (Nanda et al., 2023a). Although such small models are not necessarily of practical use, analyzing them has been fruitful for developing the techniques needed to mechanistically interpret. Despite the broad interest in mechanistic interpretability, we observe that the notion of a mechanistic interpretation has so far been primarily defined in an ad-hoc manner. In this work, we propose a collection of axioms that characterize a mechanistic interpretation.111In the same sense that group axioms characterize the notion of an abstract group in abstract algebra. Our view on mechanistic interpretability is inspired by the notion of abstract interpretation from the program analysis literature (Cousot & Cousot, 1977, 1992) that seeks to approximate the semantics of programs in order to make their analysis computationally feasible. Similarly, our axioms are meant to capture the property that the mechanistic interpretation should capture the semantics of the model. Further, our axioms also capture the intuition that such an approximation should respect the compositional structure of the neural network—not only should the input-output behavior of the mechanistic interpretation be as close as possible to the original model, but every step in the reverse-engineered algorithm should be as close as possible to the computation performed by the internal components of the model (i.e., neurons and layers). Our axioms clarify the responsibilities of an analyst; in addition to providing a mechanistic interpretation, an analyst also needs to present evidence to support the fact that the provided interpretation is valid, i.e., satisfies these axioms. Note that our axioms present a general framework for validating mechanistic interpretations. Closest to our work are recent papers by Geiger et al. (2021) and Chan et al. (2022) that seek to formalize the notion of mechanistic interpretability from a causal perspective. We believe that these formulations are complementary to our axioms. We demonstrate the applicability of our axioms for validating mechanistic interpretations via two case studies222Our implementation is available at https://github.com/nilspalumbo/axiomatic-validation.—the existing, well-known analysis by Nanda et al. (2023a) of a model with a Transformer-based architecture (Vaswani et al., 2017) trained to solve modular addition, and a new analysis of a small Transformer-based model trained by us to solve a Boolean satisfiability problem. For the new case study, we choose Boolean satisfiability as a challenging yet well-understood problem that serves to highlight new techniques and our axiomatic evaluation criteria for mechanistic interpretability. Further, there are many other problems in computer science that reduce to it, for example, graph reachability is reducible to 2-SAT. To keep the analysis tractable, we focus on the 2-SAT problem with a fixed number of clauses and variables; the problem can be solved in polynomial time (while the more general 3-SAT problem is NP-complete). To apply our axioms, we first need to extract a mechanistic interpretation of the model under analysis. Although such an interpretation is already known for the modular arithmetic model, we present a new analysis for the 2-SAT model that is guided by our axioms. Through our analysis, we are able to reverse engineer the algorithm learned by the 2-SAT model—the model parses the formula into a clause-level representation in its initial layers and then uses the later layers to evaluate the formula satisfiability via enumeration of different possible valuations of the Boolean variables. Since we only consider 2-SAT formulas with ten clauses and five variables, such an enumerative approach is computationally feasible and our model has the capacity to encode it. We use novel variants of attention pattern analysis to interpret the first Transformer block of the model. For the second (i.e., final) Transformer block, attention pattern analysis does not suffice and we use automated learning of functions (decision trees) to derive an interpretation. Finally, for both the case studies, we present evidence that the mechanistic interpretations extracted for these models indeed satisfy our axioms. 2 Related Work Work on interpreting neural networks can be divided between techniques that find input features with the largest effect on model behavior (Input Interpretability) such as gradient-based (Simonyan et al., 2013; Sundararajan et al., 2017; Leino et al., 2018; Smilkov et al., 2017) and activation-based (Olah et al., 2017; Petsiuk et al., 2018; Fong & Vedaldi, 2017; Selvaraju et al., 2019; Wang et al., 2020) attributions, and those that interpret the internal reasoning performed by a model (Internal Interpretability). We focus on the latter. Mechanistic interpretability typically refers to analyses that seek to understand the model behavior completely, i.e., they seek to reverse engineer, in a human-understandable, and hence simplified form, the full algorithm that the neural network learns. Such analyses have tended to focus on analyzing toy models trained for algorithmic tasks such as modular addition (Nanda et al., 2023a; Zhong et al., 2023), finding greatest common divisors (Charton, 2023), n-digit integer addition (Quirke & Barez, 2024), and finite group operations (Chughtai et al., 2023). Mechanistic interpretability has also been used to refer to analyses that only seek to understand specific aspects of the model behavior. Such analyses, often referred to as circuit analyses, isolate circuits, i.e., subgraphs of the neural network computational graph, that are responsible for the behavior of interest (Cammarata et al., 2020; Olsson et al., 2022; Wang et al., 2023; Lepori et al., 2024; Wu et al., 2023b; Lieberum et al., 2023; Conmy et al., 2023). The circuits are validated by measuring the causal effect of ablating the circuit using techniques such as activation patching (Vig et al., 2020; Geiger et al., 2021; Heimersheim & Nanda, 2024) on a metric that measures the relevant model behavior. In either case, the standards for judging the validity of the mechanistic interpretations produced by such analyses are ad-hoc and do not take the compositional structure of the neural network into account. We axiomatize the standards for making such judgment in this paper. Related to circuit analysis and the evaluation of mechanistic interpretations is emerging work on causal abstraction analysis (Geiger et al., 2021, 2024; Wu et al., 2023a, b; Chan et al., 2022) that extracts causal models to explain certain aspects of neural network behavior and uses techniques similar to activation patching for validating the causal abstractions. The surveys by Millière & Buckner (2024), Mueller et al. (2024), and Räuker et al. (2023) are excellent resources for a broader and more detailed description of techniques for interpreting the internals of neural networks and the architectural components which form the interpreted units. Appendix A has a further discussion on internal interpretability techniques such as probing and attention pattern analysis. 3 Mechanistic Interpretation Axioms Neural networks can be seen as programs in a purely functional language, which we denote λTsubscript _Tλitalic_T, with basic operations corresponding to the commonly used neural network layers such as EmbedEmbedE m b e d, UnembedUnembedU n e m b e d, LinLinL i n (i.e., linear), ReLUReLUR e L U, Self-Attention-Self-AttentionS e l f - A t t e n t i o n, ConvolutionConvolutionC o n v o l u t i o n, ResidualResidualR e s i d u a l, and so on. Since λTsubscript _Tλitalic_T is purely functional, all these operations are side-effect free—they simply transform inputs into outputs. The syntax of λTsubscript _Tλitalic_T is as follows (x∈Varx∈ Varx ∈ V a r): t::=Embed(x)∣Unembed(x)∣Lin(x)∣ReLU(x)∣Self-Attention(x)∣Convolution(x)∣Residual(x,t)∣…∣t∘t:absentassigndelimited-∣delimited-∣missing-subexpressionmissing-subexpressionmissing-subexpression-delimited-∣missing-subexpressionmissing-subexpressionmissing-subexpressiondelimited-∣…missing-subexpression array[] @l@\;l@\;l@\,l@t&::=&Embed(x) Unembed(x) Lin(x) ReLU(x)% \\ &&Self-Attention(x) Convolution(x) \\ &&Residual(x,t) … t t arraystart_ARRAY start_ROW start_CELL t end_CELL start_CELL : := end_CELL start_CELL E m b e d ( x ) ∣ U n e m b e d ( x ) ∣ L i n ( x ) ∣ R e L U ( x ) ∣ end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL S e l f - A t t e n t i o n ( x ) ∣ C o n v o l u t i o n ( x ) ∣ end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL R e s i d u a l ( x , t ) ∣ … ∣ t ∘ t end_CELL start_CELL end_CELL end_ROW end_ARRAY The decomposition of a model t∈λTsubscriptt∈ _Tt ∈ λitalic_T is a list of programs in λTsubscript _Tλitalic_T such that the composition of these programs is syntactically equivalent to t. For example, consider the program t:=Lin∘ReLU∘Lin(x)assignt~:=~Lin ReLU Lin(x)t := L i n ∘ R e L U ∘ L i n ( x ). One possible decomposition of t is the list [Lin,ReLU,Lin][Lin,ReLU,Lin][ L i n , R e L U , L i n ]. Given a decomposition d, we use d[i]delimited-[]d[i]d [ i ] to refer to the ithsuperscriptthi^thith component of d, d[:i]d[:i]d [ : i ] as a shorthand for d[i−1]∘d[i−2]∘…∘d[1]delimited-[]1delimited-[]2…delimited-[]1d[i-1]~ ~d[i-2]~ … ~d[1]d [ i - 1 ] ∘ d [ i - 2 ] ∘ … ∘ d [ 1 ] and refer to it as a prefix of t of length i−11i-1i - 1 (with respect to d), and d[i:]d[i:]d [ i : ] as the shorthand for d[len(d)]∘d[len(d)−1]∘…∘d[i]delimited-[]delimited-[]1…delimited-[]d[len(d)]~ ~d[len(d)-1]~ … ~d[i]d [ l e n ( d ) ] ∘ d [ l e n ( d ) - 1 ] ∘ … ∘ d [ i ] and referred to as a suffix of t. Mechanistic Interpretation. The mechanistic interpretation of a neural network t∈λTsubscriptt∈ _Tt ∈ λitalic_T is a program in a different, purely functional language λHsubscript _Hλitalic_H, that is human interpretable. The basic constructs of λHsubscript _Hλitalic_H will tend to be more abstract than the ones supported by λTsubscript _Tλitalic_T; λHsubscript _Hλitalic_H might also be less expressive than λTsubscript _Tλitalic_T to aid human-interpretability. While the specific design of λHsubscript _Hλitalic_H and the question of how to judge whether a program is human-interpretable is domain-specific and subjective (and therefore difficult to formalize), we focus here on the question of how to judge whether a program h∈λHℎsubscripth∈ _Hh ∈ λitalic_H is a valid mechanistic interpretation of the neural network t∈λTsubscriptt∈ _Tt ∈ λitalic_T under analysis. At the very least, the input-output behavior of hℎh should be similar to t. Ideally, the input-output behaviors of the two programs should coincide on every input but this requirement is much too strong in practice. Assuming that the inputs are drawn from a distribution DD, we formalize the weaker requirement that the outputs of the two programs should be equal with a high probability as an axiom. For these equality conditions to be practically applicable, hℎh must operate over discrete values. Also, the output type of t must be discrete, in particular, we assume that an argmaxargmaxargmaxargmax or top−ktopktop-ktop - k operator has been applied to the logits. We additionally require that the behaviors of hℎh and t are equivalent at the level of individual components (obtained via suitably decomposing these programs) and that replacing a component of t with the corresponding component of hℎh has a negligible effect on the neural network’s output. Definition 3.1 (Mechanistic Interpretation). Given a model t∈λTsubscriptt∈ _Tt ∈ λitalic_T of type X→Y→X→ YX → Y and a decomposition dtsubscriptd_tditalic_t of t, an ϵitalic-ϵεϵ-accurate mechanistic interpretation of t, with respect to decomposition dtsubscriptd_tditalic_t and a distribution DD over X, is a program h∈λHℎsubscripth∈ _Hh ∈ λitalic_H with a decomposition dhsubscriptℎd_hditalic_h such that len(dt)=len(dh)subscriptsubscriptℎlen(d_t)=len(d_h)l e n ( ditalic_t ) = l e n ( ditalic_h ) and Axioms 1, 2, 3, and 4 hold. Note that the definition is parametric in decomposition dtsubscriptd_tditalic_t of t as well as the input distribution DD. In practice, we will often consider layer or block-wise decompositions of t. Abstraction and Concretization. The first four axioms use the notion of α and γ functions which we describe here. The intuition is that a neural network t and a corresponding mechanistic interpretation hℎh operate on different types of data representations. While t operates over real-valued vectors, to aid interpretability, hℎh is intended to operate over human-interpretable features or symbols. Accordingly, α (or abstraction) functions map the real-valued activations (which we also refer to concrete representations) computed by t to the corresponding features or symbols (which we refer to abstract representations) that hℎh operates over. Given a decomposition dtsubscriptd_tditalic_t of t, we use αisubscript _iαitalic_i to refer to the α function mapping concrete representations computed by dt[:i+1]d_t[:i+1]ditalic_t [ : i + 1 ], i.e., the prefix of length i. The α functions are similar to probes (Alain & Bengio, 2017) for extracting features from a model’s internal activations. The γ (or concretization) functions map in the opposite direction—they map abstract features or symbols operated on by hℎh to corresponding real-valued representations of those features in t’s representation space. The α and γ functions in our axioms are directly inspired by the abstraction and concretization operations from the abstract interpretation literature used to map values between the original semantics and the abstract semantics of a program (Cousot & Cousot, 1977, 1992). We note that the α and γ functions need to be individually instantiated every mechanistic interpretation. Axioms. We define the following axioms for a particular choice of ϵitalic-ϵεϵ; in our analysis we say that an interpretation satisfies an axiom with ϵitalic-ϵεϵ when the statement of that axiom holds. Axiom 1 bounds the probability that the abstract and the concrete representations computed by the same-length prefix of hℎh and t, respectively, do not coincide (after applying α functions to map between the representations). Put differently, the mechanistic interpretation prefixes do not introduce too much error. Axiom 1 (ϵitalic-ϵεϵ-Prefix Equivalence). ∀i∈[len(d)].for-alldelimited-[]∀ i∈[len(d)].∀ i ∈ [ l e n ( d ) ] . Prx∼[αi∘dt[:i+1](x)=dh[:i+1]∘α0(x)]≥1−ϵ x Pr[ _i d_t[:i+1](x)=d_% h[:i+1] _0(x)]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 1 ] ∘ α0 ( x ) ] ≥ 1 - ϵ In contrast, Axiom 2 requires that none of the individual components of the mechanistic interpretation hℎh introduce too much error. Axiom 2 does not imply Axiom 1 since the errors introduced by each component of the mechanistic interpretation can compound in the worst case (Dziri et al., 2023). Axiom 2 (ϵitalic-ϵεϵ-Component Equivalence). ∀i∈[len(d)].for-alldelimited-[]∀ i∈[len(d)].∀ i ∈ [ l e n ( d ) ] . Prx∼[αi∘dt[:i+1](x)=dh[i]∘αi−1∘dt[:i](x)]≥1−ϵ x Pr[ _i d_t[:i+1](x)=d_% h[i] _i-1 d_t[:i](x)]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ i ] ∘ αitalic_i - 1 ∘ ditalic_t [ : i ] ( x ) ] ≥ 1 - ϵ Axioms 3 and 4 are similar except that they consider equivalence of the output. Axiom 3 requires that replacing the prefixes of t with the corresponding prefixes of hℎh (after applying appropriate α and γ functions to map between the representations) has limited effect on t’s output. Axiom 4 requires the same when individual components of t are replaced by corresponding components of hℎh. Axiom 3 (ϵitalic-ϵεϵ-Prefix Replaceability). ∀i∈[len(d)].for-alldelimited-[]∀ i∈[len(d)].∀ i ∈ [ l e n ( d ) ] . Prx∼[t(x)=dt[i+1:]∘γi∘dh[:i+1]∘α0(x)]≥1−ϵ x Pr[t(x)=d_t[i+1:] _i% d_h[:i+1] _0(x)]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ t ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ : i + 1 ] ∘ α0 ( x ) ] ≥ 1 - ϵ Axiom 4 (ϵitalic-ϵεϵ-Component Replaceability). ∀i∈[len(d)].for-alldelimited-[]∀ i∈[len(d)].∀ i ∈ [ l e n ( d ) ] . Prx∼[t(x)=dt[i+1:]∘γi∘dh[i]∘αi−1∘dt[:i](x)]≥1−ϵ x Pr[t(x)=d_t[i+1:] _i% d_h[i] _i-1 d_t[:i](x)]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ t ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ i ] ∘ αitalic_i - 1 ∘ ditalic_t [ : i ] ( x ) ] ≥ 1 - ϵ We propose two additional axioms, namely, Axioms 5 and 6, that are more informal in nature and we do not consider them for the analysis presented in this paper. These axioms are presented as a goal for future work and described in Appendix B. Checking the Axioms. The proposed axioms can be checked statistically. Given a test dataset (which need not be labeled), we can estimate the error rate ϵitalic-ϵεϵ for each axiom by computing the proportion of test inputs which violate the equality condition specified by the axiom. The number of violations is distributed binomially—the parameter p of this binomial distribution is equal to ϵitalic-ϵεϵ. Hence, any method for deriving confidence intervals for the parameter of a binomial distribution can be used to derive confidence intervals for our estimate of ϵitalic-ϵεϵ; we use the Clopper-Pearson method (Clopper & Pearson, 1934) in our experiments. Validating the axioms is thus cheap and feasible in practice. These axioms represent a minimal set that we believe are necessary albeit may not be sufficient in all cases for characterizing a mechanistic interpretation. In future work, we plan to address this by considering analysis of models trained for other problems. Relationship to Existing Approaches. The evaluation of Nanda et al. (2023a)’s seminal paper on mechanistically interpreting the complete behavior exemplifies typical approaches to evaluating a mechanistic interpretation (Zhong et al., 2023; Quirke & Barez, 2024; Chughtai et al., 2023). The first three pieces of evidence (presented in Sections 4.1, 4.2, and 4.3 in their paper) resemble our Axiom 2 that compares each individual component of the mechanistic interpretation with the corresponding component of the model. However, as there does not exist a direct analogue of abstraction, this analysis is primarily observational. The evidence they present in Section 4.4 of their paper is similar to our Axiom 4 that checks the effect on the output of the model when individual model components are replaced by the corresponding components from the mechanistic interpretation. Notably, they do not present the evidence required by our Axioms 1 and 3. This amounts to not considering the compositional structure of the model and, therefore, the possible compounding of error introduced by each component of the mechanistic interpretation; we discuss this further at the end of this section. Causal abstraction (Geiger et al., 2021, 2022, 2024, 2025) and causal scrubbing (Chan et al., 2022) are key existing attempts to formalize the notion of a mechanistic interpretation. Both of these formulate the abstract and concrete models as directed acyclic graphs (DAG) with a map that relates the two DAGs. While it may seem that, compared to our approach, the DAG formulation is more generic since it admits parallel composition of components in addition to sequential composition, this is not fundamental, in particular, Appendix B.1 discusses how to represent arbitrary computational graphs in our framework. The map used by causal abstraction to relate the two DAGs corresponds to our abstraction function α; however, it does not have any analog of a concretization function γ. Geiger et al. (2025) note that the inverse of abstraction may be used as a concretization operator; however, this may not be feasible in practice, as this choice necessitates the use of set semantics for evaluation of the concrete model when the abstraction operator fails to be invertible. Under the causal abstraction framework, a valid interpretation is defined by comparing the effects of interventions on the concrete model with corresponding interventions on the abstract model; in contrast, our notion of a valid interpretation may be viewed as one which is invariant up to a class of interventions, characterized by our axioms, which interleave the concrete and abstract models. While causal abstraction is a very general framework for the validation of mechanistic interpretations, in practice, it uses the specific metric of interchange intervention accuracy (Geiger et al., 2022) to validate interpretations and this metric fails to directly evaluate the equivalence of internal representations. On the other hand, the map defined by causal scrubbing can be seen as an form of concretization, but it does not define an abstraction function. While both these approaches compare the outputs of the concrete and abstract models similar to our Axioms 3 and 4, they tend to omit direct evaluation of the equivalence of internal representations, as in our Axioms 1 and 2. This is potentially problematic since there is no direct validation that the intermediate steps of the mechanistic interpretation correspond to the intermediate steps of the neural network (Scheurer et al., 2023). Further, as causal abstraction and scrubbing lack concretization and abstraction operators respectively, even Axioms 3 and 4 cannot be expressed directly in these frameworks. Compositionality. While it may appear that our axioms for compositional evaluation (Axioms 1 and 3) follow from their componentwise counterparts (Axioms 2 and 4 respectively), this is not the case; for example, while bounds of the form of Axiom 1 can be derived from Axiom 2, these hold only with far weaker ϵitalic-ϵεϵ; when the number of components is sufficiently high, the derived bounds entirely cease to be meaningful. More details are given in Appendix E; Appendix F includes empirical evidence demonstrating the pitfalls of omitting the compositional axioms. 4 Case Study: 2-SAT 4.1 Data and Model Details The 2-SAT Problem. Given a Boolean formula, i.e., a conjunction of disjunctive clauses over exactly two literals, the 2-SAT problem is to determine whether there is an assignment to the formula’s variables that makes the formula evaluate to True; the formula is said to be satisfiable, or SAT. For example, the formula (x0∨x1)∧(x1∨¬x2)subscript0subscript1subscript1subscript2(x_0 x_1) (x_1 x_2)( x0 ∨ x1 ) ∧ ( x1 ∨ ¬ x2 ) is satisfiable, with a satisfying assignment x0=True,x1=False,x2=Falseformulae-sequencesubscript0Trueformulae-sequencesubscript1Falsesubscript2Falsex_0=True,x_1=False,x_2=Falsex0 = True , x1 = False , x2 = False (written also as x0,¬x1,¬x2subscript0subscript1subscript2x_0, x_1, x_2x0 , ¬ x1 , ¬ x2). If a formula has no satisfying assignment, it is said to be unsatisfiable, or UNSAT. Dataset. We construct a dataset of randomly generated 2-SAT formulas with ten clauses and up to five variables, eliminating syntactic duplicates. We use a solver (Z3 (De Moura & Bjørner, 2008)) to check satisfiability for each formula. We built a balanced dataset of 106superscript10610^6106 SAT and 106superscript10610^6106 UNSAT instances; we used 60% (also balanced) for training, and the rest for testing. For model analysis, we construct a separate dataset with 105superscript10510^5105 SAT and 105superscript10510^5105 UNSAT instances split as before in to train and test sets. Each formula is represented as a string. For example, (x0∨x1)∧(x1∨¬x2)subscript0subscript1subscript1subscript2(x_0 x_1) (x_1 x_2)( x0 ∨ x1 ) ∧ ( x1 ∨ ¬ x2 ) is represented as “(x0x1)(x1¬x2):s:subscript0subscript1subscript1subscript2(x_0x_1)(x_1 x_2):s( x0 x1 ) ( x1 ¬ x2 ) : s” where s indicates that the formula is SAT (u would indicate UNSAT). We tokenize these formulas by considering each xisubscriptx_ixitalic_i, its negation ¬xisubscript x_i¬ xitalic_i, and each of the symbols in the set (,),:,s,u\(,),:,s,u\ ( , ) , : , s , u as a separate token. The colon token (‘:’) is the final token in the input and we consider the next token predicted by the model as the output; hence, we refer to this token as the readout token and refer to its position as the readout position. Model Architecture and Training. We use a two-layer ReLU decoder-only Transformer with token embeddings of dimension d=128128d=128d = 128, learned positional embeddings, one attention head of dimension d in the first layer, four attention heads of dimension d/4=32432d/4=32d / 4 = 32 in the second, and n=512512n=512n = 512 hidden units in the MLP (multi-layer perceptron). We use full-batch gradient descent with the AdamW optimizer, setting the learning rate γ=0.0010.001γ=0.001γ = 0.001 and weight decay parameter λ=11λ=1λ = 1. We perform extensive epochs of training (1000 epochs), recognizing the combinatorial nature of SAT problems and the need for thorough exploration of the solution space. We obtained a model with 99.76% accuracy on test data. All our experiments were run on an NVIDIA A100 GPU. 4.2 Mechanistic Interpretability Analysis The network can be naturally decomposed into its blocks and we describe the analysis of each in the following two sections. We note that our axioms help guide our analysis, particularly for the second block. As our goal is to understand the algorithm used by the model to determine satisfiability, we refrain from analyzing behavior which does not affect the final prediction. Thus, we focus on the model’s prediction at the readout position. Figure 5 in Appendix D describes our decomposition of the model into the concrete components dt[i]subscriptdelimited-[]d_t[i]ditalic_t [ i ]. 4.2.1 First Block as Parser Figure 1: Attention scores for all heads on the first sample of the test set, the formula (¬x0¬x1)(x1¬x4)(x1x2)(x0x3)(¬x2¬x3)(x2¬x4)(¬x0¬x3)(x0x2)(x1¬x2)(x1x4)subscript0subscript1subscript1subscript4subscript1subscript2subscript0subscript3subscript2subscript3subscript2subscript4subscript0subscript3subscript0subscript2subscript1subscript2subscript1subscript4( x_0 x_1)(x_1 x_4)(x_1x_2)(x_0x_3)( x_2 x% _3)(x_2 x_4)( x_0 x_3)(x_0x_2)(x_1 x_2)(x_1% x_4)( ¬ x0 ¬ x1 ) ( x1 ¬ x4 ) ( x1 x2 ) ( x0 x3 ) ( ¬ x2 ¬ x3 ) ( x2 ¬ x4 ) ( ¬ x0 ¬ x3 ) ( x0 x2 ) ( x1 ¬ x2 ) ( x1 x4 ). The x-axis represents the source (key) positions and the y-axis represents the destination (query) positions for the attention mechanism. To reverse-engineer the first block’s behavior, we start by examining the attention patterns, i.e., the attention scores (by default, post-softmax unless stated otherwise) calculated by both blocks on given input formulas. Figure 1 shows the attention scores on a formula from our test set. We clearly see patterns emerge—most tokens in the first block attend heavily to the first token, while tokens in position 4i+2424i+24 i + 2, 0≤i<100100≤ i<100 ≤ i < 10, that we refer to as the second literal positions, primarily attend to themselves and the previous token, i.e., to the two literals in each clause.333Formulas are of the form, (x0x1)(x1¬x2):s:subscript0subscript1subscript1subscript2(x_0x_1)(x_1 x_2):s( x0 x1 ) ( x1 ¬ x2 ) : s, so token positions 4i+1414i+14 i + 1 and 4i+2424i+24 i + 2 correspond to the literals of the ithsuperscriptthi^thith clause. Attention from the readout token is nearly uniform. This suggests that the first block parses each clause, storing the parsed clause information in the token representations output by the first block in the second literal positions. In the second block, the attentions heavily focus on these second literal positions, suggesting that the model uses the parsed clause information contained in these representations to classify. This allows us to restrict our analysis of the first block to the second literal positions and the readout position. To further validate this decision, we compute the average attention score from each head on each token across the test set (see Figure 6 in Appendix D.1). The results show strong periodicity with period 4, following the pattern seen in Figure 1. We perform additional distributional and worst-case analysis of attention patterns (described in Appendix D.1) and these analyses confirm the parsing hypothesis for the first block. Interpretation. Based on our analysis of the attention patterns for the first block, we hypothesize that it is parsing the input sequence of tokens into a list of clauses (Listing LABEL:fig:layer1_code in Appendix D.1). The behavior of attention is enough to form this hypothesis, as the sparse attention patterns show that information primarily flows from tokens in a clause to the token in the second literal position in the same clause; hence we can view the action of the first block as computing a representation for each clause in the input. In this sense, analyzing the MLP in the first block is not necessary to identify the semantic function of the first block. We show in Sec 4.2.3 that we can successfully validate this hypothesis using our axioms. As we’l see in the analysis of the second block in Sec 4.2.2, we cannot always count on self-explanatory attention patterns; for the second block, we will be forced to analyze the much harder-to-interpret MLP. 4.2.2 Second Block as Evaluator From the first block’s analysis, we know that the inputs to the second block can be described, in an abstract sense, as the list of clauses in the formula. As the model’s output is a label SAT or UNSAT, the interpretation of the second block is a function that checks satisfiability given a list of clauses. This means that the core logic of the model is encoded primarily in the second block. Identifying the Key Pathway. We observe that the unembedding vector for UNSAT is nearly the negative of that of SAT (in particular, ‖WUSAT+WUUNSAT‖<10−5normsuperscriptsubscriptSATsuperscriptsubscriptUNSATsuperscript105\|W_U^SAT+W_U^UNSAT\|<10^-5∥ Witalic_USAT + Witalic_UUNSAT ∥ < 10- 5 and their cosine similarity is -1 up to floating point precision), and hence, their logits are likewise negatives of each other. Also, for all formulas in the dataset, either the SAT or the UNSAT logit output at the readout position is much larger than the logits for all other tokens. Hence, the effect of components on classification can be fully described by their effects on the SAT logit. Furthermore, we observe that the effect of the attention mechanism on the SAT logit is consistently suppressive and varies little between samples. Hence, the core pathway necessary to identify SAT flows through the MLP at the readout position. Unlike the first block, we cannot identify an interpretation directly from the behavior of attention. See Appendix D.2 for more details. Figure 2: Output coefficients of hidden neurons computed via the composition of the output layer of the MLP and the unembedding matrix projected to the SAT logit. Figure 3: Average activation of neuron 10 from SAT formulas, UNSAT formulas, and formulas satisfiable with particular assignments to the variables. Sparsity in the MLP Hidden Layer. We begin our analysis of the MLP by noting that, somewhat surprisingly, it exhibits sparsity in the hidden neurons. We conclude this by first observing that the second layer of the MLP and the unembedding operation are both linear, so their composition is again linear. From this perspective, then, the MLP’s effect on the classification is fully captured by a weighted sum of the post-activation outputs of the hidden neurons. We refer to these weights as neuron output coefficients for the SAT logit. This is the sense in which we observe sparsity; to see this clearly, see Figure 2. There are only 35 neurons which have a non-negligible absolute coefficient (>10−6absentsuperscript106>10^-6> 10- 6); one of these has a significantly lower coefficient than the others along with consistently low activation values on the samples in our analysis training set resulting in a negligible effect on the SAT logit. Hence, we restrict our analysis to the remaining 34 neurons. Evaluation in the Hidden Neurons. Note furthermore that all the non-negligible coefficients are positive; hence, the effect of each of the relevant neurons is purely to promote SAT. This suggests that the neurons may each recognize some subset of SAT formulas instead of recognizing characteristics of UNSAT formulas. Figure 3 shows that the typical activation of one such neuron on SAT formulas is highly dependent on the assignment to the variables which satisfies the formula. This suggests that the MLP may implement a very simple algorithm: that of exhaustive evaluation. In particular, if we treat the formula ϕitalic-ϕφϕ as a Boolean function fϕsubscriptitalic-ϕf_φfitalic_ϕ, ϕitalic-ϕφϕ is SAT if and only if fϕsubscriptitalic-ϕf_φfitalic_ϕ is not the constant False function. Hence, we can identify SAT formulas by a brute-force technique which computes the truth table of the corresponding Boolean functions and checks whether any entry is True. We’l see that this is indeed the case; hence, we will refer to these neurons as evaluating neurons. However, the behavior of the neurons is somewhat more complex than the most natural form, in which each neuron specializes in identifying formulas satisfiable with one of the 32 possible assignments. In particular, as seen in Figure 3, some neurons are affected by multiple assignments to the variables. The notation ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ] means that the corresponding Boolean function fϕsubscriptitalic-ϕf_φfitalic_ϕ evaluates to True when the variables are assigned the shown truth values. For instance, ϕ[TTFFT]italic-ϕdelimited-[]φ[TTFFT]ϕ [ T T F F T ] indicates that formula ϕitalic-ϕφϕ is satisfiable with x0,x1,x4subscript0subscript1subscript4x_0,x_1,x_4x0 , x1 , x4 set to True and x2,x3subscript2subscript3x_2,x_3x2 , x3 set to False. A natural hypothesis is then that each of the evaluating neurons evaluates ϕitalic-ϕφϕ on some set of assignments, with some overlap between neurons; we refer to such interpretations of neuron activation behavior as disjunction-only since they take the form h::=ϕ[…]|h∨h::=φ[...]~|~h h : := ϕ [ … ] | h ∨ h. However, we observe that the actual behavior is less intuitive: in some cases, satisfiability with an assignment can decrease activation on some evaluating neurons. Hence, we also consider more general Boolean expressions in terms of the ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ]’s as neuron interpretations. We find that our axiomatic analysis provides a clear way to quantify the trade-off between interpretability and fidelity of the different interpretations. As decision trees can represent arbitrary Boolean functions, we use standard decision tree learning to learn the more general interpretations of neurons; specifically, we learn classifiers trained to predict whether a neuron’s activation is above or below a threshold (0.5 in our experiments) given all the features ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ] for a formula ϕitalic-ϕφϕ. Deriving the simpler disjunction-only interpretation is easier; we do so by identifying the assignments a such that the empirical mean, calculated over our analysis training set, of a neuron’s post-activation value on formulas satisfied by a is above the threshold. For instance, in Figure 3, these are the assignments that cause the post-activation score to be above the threshold. Finally, we observe that the output coefficients of the evaluating neurons for the SAT logit are consistently high (> 2.9 in all cases), which means that a high activation on any individual evaluating neuron forces prediction of SAT; see Appendix D.2 for more details. In this sense, the action of MLP’s output layer and the unembedding layer is conceptually close to logical OR operation. We now have enough information to describe our interpretation of the second block. Interpretation. Based on our analysis, we decompose the interpretation of the second block into two components. We hypothesize that the second block up to the MLP’s hidden layer evaluates the input formula with different assignments and outputs the results of these evaluations as a list of Booleans. The neuron interpretations describe the precise combination of assignments considered for computing each Boolean output (see Tables 2 and 3 of Appendix G for two different neuron interpretations). We evaluated the completeness of the neuron interpretations, i.e., their ability to correctly evaluate all 2-SAT formulas, using Z3 (De Moura & Bjørner, 2008). The rest of the model, namely, the MLP’s output layer and the unembedding layer, performs a logical OR over the Booleans output by the previous component and outputs True if the formula is SAT. Listing LABEL:fig:layer2_code and LABEL:fig:abstract_model in Appendix D.2 describe the mechanistic interpretation of the second block and the entire model, respectively, using Python syntax. 4.2.3 Checking the Axioms First Block. Since the input to the neural model is a sequence of token IDs, α0subscript0 _0α0 here is the function that translates token IDs into tokens. To derive α1subscript1 _1α1 and γ1subscript1 _1γ1, we first compute a canonical representation for each possible clause ψ. Drawing from the observation that the output of attention on the second token of each clause is largely a function of the two tokens in the clause alone (ignoring all other past tokens), we compute the representation output by block 1 for clause ψ in each position i by masking attention such that it is only applied to the left and right literals in the clause. The canonical representation for ψ is the average of the representations derived for each position i. α1:[Tensor]→[(Literal,Literal)]:subscript1→delimited-[]Tensordelimited-[]LiteralLiteral _1:[ Tensor]→[( Literal, Literal)]α1 : [ Tensor ] → [ ( Literal , Literal ) ], where Literal is the type consisting of the xisubscriptx_ixitalic_i and their negations, maps the list of representations output by the first block to a list of clauses by comparing (via cosine similarity) the representations in the second-variable positions 4i+2424i+24 i + 2 with the canonical clause representations and returning the clauses with the most-similar representations. γ1:[(Literal,Literal)]→[Tensor]:subscript1→delimited-[]LiteralLiteraldelimited-[]Tensor _1:[( Literal, Literal)]→[ Tensor]γ1 : [ ( Literal , Literal ) ] → [ Tensor ] maps a list of clauses to a list of canonical clause representations. For positions other than 4i+2424i+24 i + 2, γ1subscript1 _1γ1 returns the mean representation in that position across the training set. For position 4i+2424i+24 i + 2, it returns the canonical representation for the clause in position i. See Listing LABEL:lst:alphas_1 and Listing LABEL:lst:gammas_1 of Appendix D for pseudocode for α1subscript1 _1α1 and γ1subscript1 _1γ1, respectively. Prefix Equivalence. For each clause position i, α1subscript1 _1α1 outputs a clause ϕ:=l∨rassignitalic-ϕφ:=l rϕ := l ∨ r, where l and r are literals, and, as such, we consider the clauses r∨lr lr ∨ l and l∨rl rl ∨ r equal. With this notion of equality, we obtain an ϵitalic-ϵεϵ444We always report ϵitalic-ϵεϵ in terms of the upper bound of a one-sided 95% Clopper-Pearson confidence interval. of approximately 0.0000374 (corresponding to perfect matching on the test set) between the abstract states output by α1∘dt[1]subscript1subscriptdelimited-[]1 _1 d_t[1]α1 ∘ ditalic_t [ 1 ] and dh[1]∘α0subscriptℎdelimited-[]1subscript0d_h[1] _0ditalic_h [ 1 ] ∘ α0. When we enforce order consistency, we obtain a much worse ϵitalic-ϵεϵ of 0.849; however, as the abstract state is order independent and the second block’s behavior does not vary significantly with the order of variables in a clause, an order-dependent notion of equality is not necessary here. Component Equivalence. As this is the first component, component equivalence and prefix equivalence are identical, hence the results are the same as above. Prefix Replaceability. We obtain ϵ≈0.0418italic-ϵ0.0418ε≈ 0.0418ϵ ≈ 0.0418, i.e. substituting the first component of the model with γ1∘dh[1]∘α0subscript1subscriptℎdelimited-[]1subscript0 _1 d_h[1] _0γ1 ∘ ditalic_h [ 1 ] ∘ α0 affects the final output 4.2% of the time. In particular, this quantifies the effect of replacing the actual first-block representation of ‘:’ with its mean value across the dataset and the effect of ignoring attention to previous clauses when computing the canonical clause representations. Component Replaceability. As this is the first component, component and prefix replaceability are identical, hence the results are the same as above. Hidden Layer of Second Block. The output of the second concrete component is a pair consisting of the residual from attention and the post-activation outputs of the hidden neurons; α2:(Tensor,Tensor)→[bool]:subscript2→TensorTensordelimited-[]bool _2:( Tensor, Tensor)→[ bool]α2 : ( Tensor , Tensor ) → [ bool ] maps these to a Boolean vector representing whether or not sufficiently high activation (> 0.5) has occurred at each of the 34 evaluating neurons. γ2:[bool]→(Tensor,Tensor):subscript2→delimited-[]boolTensorTensor _2:[ bool]→( Tensor, Tensor)γ2 : [ bool ] → ( Tensor , Tensor ) simply returns a constant value for the residual (the mean on the analysis training set); the MLP’s hidden neurons have zero activation except for the evaluating neurons with True values in the corresponding position in the abstract state vector, which we assign an arbitrary large activation (2 in our experiments). Note that this is higher than the threshold for α2subscript2 _2α2; we observe that this amplification is necessary for a reliable interpretation; see below. See Listing LABEL:lst:alphas_2 and Listing LABEL:lst:gammas_2 of Appendix D for pseudocode for α2subscript2 _2α2 and γ2subscript2 _2γ2, respectively. Prefix Equivalence. We obtain ϵ≈0.182italic-ϵ0.182ε≈ 0.182ϵ ≈ 0.182 when using the decision tree neuron interpretations, compared with approximately 0.309 for the disjunction-only interpretations in which our neuron models activate on formulas satisfiable with any of a set of assignments. This means that the disjunction-only interpretations are much worse at predicting the intermediate state of the model. This is the only place where this simplifying choice has a cost with respect to the validity of the mechanistic interpretation; aside from component equivalence below, the disjunction-only interpretations differ little from the decision trees in all other respects. This illustrates the need to check the axiomatic properties at component level for the model under analysis. Component Equivalence. We again obtain ϵ≈0.182italic-ϵ0.182ε≈ 0.182ϵ ≈ 0.182 for the decision tree interpretations and 0.309 for the disjunction-only interpretations. The similarity in the ϵitalic-ϵεϵ values for prefix and component equivalence is because the interpretation of the first component perfectly matches the behavior of neural network, up to literal order, on the analysis test set. Prefix Replaceability. In both cases, substituting the first two concrete components with the first two abstract components has minimal effect on the model’s output behavior: specifically, we get ϵ≈0.0128italic-ϵ0.0128ε≈ 0.0128ϵ ≈ 0.0128 for the decision tree interpretations, and ϵ≈0.00290italic-ϵ0.00290ε≈ 0.00290ϵ ≈ 0.00290 for the disjunction-only interpretations. If we drop the amplification step discussed above for γ2subscript2 _2γ2, we significantly affect the model’s predictions: we get an ϵitalic-ϵεϵ of approximately 0.249 (i.e. over 24% of samples are affected) for the decision tree interpretations and approximately 0.135 for the disjunction-only interpretations. Component Replaceability. As above, we obtain ϵ≈0.0128italic-ϵ0.0128ε≈ 0.0128ϵ ≈ 0.0128 for the decision tree interpretations, and ϵ≈0.00290italic-ϵ0.00290ε≈ 0.00290ϵ ≈ 0.00290 for the disjunction-only interpretations. Output Layer of Second Block. For our analysis, we consider the model to be composed with a function which returns True if and only if the top logit at the readout position is SAT. The output type of this composed model and the mechanistic interpretation are identical; hence α3subscript3 _3α3 and γ3subscript3 _3γ3 are both the identity function. Prefix Equivalence. Since prefix equivalence incorporates the mechanistic interpretation of the previous component, we need to consider both the decision tree and the disjunction-only version of the previous component. We obtain very close matching in both cases, with ϵ≈0.0128italic-ϵ0.0128ε≈ 0.0128ϵ ≈ 0.0128 for the decision tree interpretations and ≈0.00290absent0.00290≈ 0.00290≈ 0.00290 for the disjunction-only interpretations. Component Equivalence. We obtain a very close match with ϵ≈0.00433italic-ϵ0.00433ε≈ 0.00433ϵ ≈ 0.00433. Prefix Replaceability. As this is the final component of the network and α3subscript3 _3α3 and γ3subscript3 _3γ3 are the identity, this is the same as prefix equivalence. Component Replaceability. Same as component equivalence for the reason above. 5 Case Study: Modular Addition Nanda et al. (2023a) train a model with a single Transformer block to perform modular addition. The model is given input of the form ab=absenta\ b\ =a b = and returns a+bmodPmoduloa+b Pa + b mod P at the ‘=’ token where P is fixed to be 113. They claim the following mechanistic interpretation of the concrete model (see Listing LABEL:lst:grokking_code in Appendix H for the pseudocode for the abstract components): 1. Encoding of inputs: The embedding matrix represents a and b as sin(wka)subscript (w_ka)sin ( witalic_k a ), sin(wkb)subscript (w_kb)sin ( witalic_k b ), cos(wka)subscript (w_ka)cos ( witalic_k a ), and cos(wkb)subscript (w_kb)cos ( witalic_k b ) for key frequencies wk=(2kπ)/Psubscript2w_k=(2kπ)/Pwitalic_k = ( 2 k π ) / P for k∈14,35,41,42,521435414252k∈\14,35,41,42,52\k ∈ 14 , 35 , 41 , 42 , 52 . 2. Computation of sum-of-angles identities: The attention and MLP input layer compute cos(wk(a+b))subscript (w_k(a+b))cos ( witalic_k ( a + b ) ) and sin(wk(a+b))subscript (w_k(a+b))sin ( witalic_k ( a + b ) ) using trigonometric identities. 3. Difference-of-angles identities and argmax: The MLP output layer and the unembedding matrix computes cos(wk(a+b−c))subscript (w_k(a+b-c))cos ( witalic_k ( a + b - c ) ) for each key frequency and for each c∈ℤPsubscriptℤc _Pc ∈ blackboard_ZP, adds the computed cosines, and then selects the c maximizing the sum, which occurs at c∗=a+bmodPsuperscriptmoduloc^*=a+b Pc∗ = a + b mod P. As all abstract components but the last one output continuous values, we must discretize the output of each abstract component for Axioms 1 and 2 to be meaningful. For the first component, we apply a simple discretization of rounding up to three decimal points; for the second component, we define an equivalence class which treats abstract states as equivalent when their downstream effects on both the abstract and concrete models are identical. Discretizations of this form are quite general: an equivalence class up to the next discrete-valued intermediate value is expressible for any abstract model with a discrete-valued output. Axioms 3 and 4 ensure that any discretization does not impact end-to-end behavior. The α and γ functions for each component are linear maps learned from data. See Appendix H for more details. We apply our approach to this model and confirm that the Nanda et al. (2023a)’s mechanistic interpretation satisfies our axioms, noting that, in their paper, while Nanda et al. (2023a) provide evidence similar to our Axioms 2 and 4 in support of the validity of their interpretation, they do not provide any evidence as required by our Axioms 1 and 3. In particular, with the above discretization we obtain a very strong ϵitalic-ϵεϵ of 0.000335, which corresponds to perfect matching. However, it is important to note that results with simple discretizations indicate that the model’s behavior is not fully captured. In particular, we obtain an ϵitalic-ϵεϵ of 1 from Axioms 1 and 2 for the second component when, instead, we discretize by rounding to a single decimal place. There are important tradeoffs to the application of this type of discretization: while it enables the analyst to eliminate variation which has no impact on the downstream behavior of the model, it also limits the ability of Axioms 1 and 2 to validate the internal behavior. 6 Conclusion We presented a set of axioms that characterize a mechanistic interpretation and enable judgment of the interpretation’s validity with respect to the neural network under analysis. Using the axioms as a guide, we analyzed a Transformer-based model trained to solve the 2-SAT problem. We applied our axioms to validate the mechanistic interpretation of this 2-SAT model and a model trained to solve the modular addition task. Our axioms provide an automated and quantitative way of evaluating the quality of a mechanistic interpretation (via the ϵitalic-ϵεϵ values). Not only can the ϵitalic-ϵεϵ’s serve as useful progress measures (in the sense of Nanda et al. (2023a)) to understand the training dynamics of models but can also help in the development of techniques for automated mechanistic interpretability analyses by serving as a useful evaluation metric. Impact Statement We believe that work on mechanistic interpretability in particular and on internal interpretability of neural networks in general is essential for increasing trust in neural networks and mitigating their risks. In this context, the work presented in this paper is of a foundational nature and advances the interpretability research agenda by clarifying the notion of a mechanistic interpretation. Therefore, we do not foresee a direct path from our work to negative societal impacts. Acknowledgments Nils Palumbo and Somesh Jha are partially supported by DARPA under agreement number 885000, NSF CCF-FMiTF-1836978 and ONR N00014-21-1-2492. References Abnar & Zuidema (2020) Abnar, S. and Zuidema, W. Quantifying attention flow in transformers. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, p. 4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. URL https://aclanthology.org/2020.acl-main.385. Alain & Bengio (2017) Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes, 2017. URL https://openreview.net/forum?id=ryF7rTqgl. Allen-Zhu & Li (2023) Allen-Zhu, Z. and Li, Y. Physics of language models: Part 1, context-free grammar. arXiv preprint arXiv:2305.13673, 2023. Bastings & Filippova (2020) Bastings, J. and Filippova, K. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, p. 149–155, 2020. Belinkov (2022) Belinkov, Y. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7. Belrose et al. (2023) Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023. Bibal et al. (2022) Bibal, A., Cardon, R., Alfter, D., Wilkens, R., Wang, X., François, T., and Watrin, P. Is attention explanation? an introduction to the debate. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 3889–3900, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.269. URL https://aclanthology.org/2022.acl-long.269. Burns et al. (2023) Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs. Cammarata et al. (2020) Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B., and Lim, S. K. Thread: Circuits. Distill, 2020. doi: 10.23915/distill.00024. https://distill.pub/2020/circuits. Chan et al. (2022) Chan, L., Garriga-Alonso, A., Goldwosky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. Charton (2023) Charton, F. Can transformers learn the greatest common divisor? arXiv preprint arXiv:2308.15594, 2023. Chefer et al. (2021) Chefer, H., Gur, S., and Wolf, L. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 782–791, June 2021. Chughtai et al. (2023) Chughtai, B., Chan, L., and Nanda, N. A toy model of universality: reverse engineering how networks learn group operations. ICML’23. JMLR.org, 2023. Clopper & Pearson (1934) Clopper, C. J. and Pearson, E. S. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934. ISSN 00063444. URL http://w.jstor.org/stable/2331986. Conmy et al. (2023) Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=89ia77nZ8u. Cousot & Cousot (1977) Cousot, P. and Cousot, R. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, p. 238–252, 1977. Cousot & Cousot (1992) Cousot, P. and Cousot, R. Abstract interpretation frameworks. Journal of logic and computation, 2(4):511–547, 1992. Crabbé & van der Schaar (2022) Crabbé, J. and van der Schaar, M. Concept activation regions: A generalized framework for concept-based explanations. Advances in Neural Information Processing Systems, 35:2590–2607, 2022. De Moura & Bjørner (2008) De Moura, L. and Bjørner, N. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, p. 337–340. Springer, 2008. Dziri et al. (2023) Dziri, N., Lu, X., Sclar, M., Li, X. L., Jiang, L., Lin, B. Y., Welleck, S., West, P., Bhagavatula, C., Bras, R. L., Hwang, J. D., Sanyal, S., Ren, X., Ettinger, A., Harchaoui, Z., and Choi, Y. Faith and fate: Limits of transformers on compositionality. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Fkckkr3ya8. Ebrahimi et al. (2020) Ebrahimi, J., Gelda, D., and Zhang, W. How can self-attention networks recognize dyck-n languages? In Findings of the Association for Computational Linguistics: EMNLP 2020, p. 4301–4306, 2020. Elhage et al. (2021) Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html. Fong & Vedaldi (2017) Fong, R. C. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. 2017 IEEE International Conference on Computer Vision (ICCV), p. 3449–3457, 2017. Geiger et al. (2021) Geiger, A., Lu, H., Icard, T., and Potts, C. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021. Geiger et al. (2022) Geiger, A., Wu, Z., Lu, H., Rozner, J., Kreiss, E., Icard, T., Goodman, N., and Potts, C. Inducing causal structure for interpretable neural networks. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, p. 7324–7338. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/geiger22a.html. Geiger et al. (2024) Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. Finding alignments between interpretable causal variables and distributed neural representations. In Locatello, F. and Didelez, V. (eds.), Proceedings of the Third Conference on Causal Learning and Reasoning, volume 236 of Proceedings of Machine Learning Research, p. 160–187. PMLR, 01–03 Apr 2024. URL https://proceedings.mlr.press/v236/geiger24a.html. Geiger et al. (2025) Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T. Causal abstraction: A theoretical foundation for mechanistic interpretability, 2025. URL https://arxiv.org/abs/2301.04709. Gurnee et al. (2023) Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=JYs1R9IMJr. Hao et al. (2021) Hao, Y., Dong, L., Wei, F., and Xu, K. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, p. 12963–12971, 2021. Heimersheim & Nanda (2024) Heimersheim, S. and Nanda, N. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255, 2024. Hewitt & Liang (2019) Hewitt, J. and Liang, P. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), p. 2733–2743, 2019. Huben et al. (2024) Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK. Jain & Wallace (2019) Jain, S. and Wallace, B. C. Attention is not Explanation. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), p. 3543–3556, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1357. URL https://aclanthology.org/N19-1357. Kim et al. (2018) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, p. 2668–2677. PMLR, 2018. Leino et al. (2018) Leino, K., Sen, S., Datta, A., Fredrikson, M., and Li, L. Influence-directed explanations for deep convolutional networks. In 2018 IEEE International Test Conference (ITC), p. 1–8. IEEE, 2018. Lepori et al. (2024) Lepori, M. A., Serre, T., and Pavlick, E. Uncovering intermediate variables in transformers using circuit probing. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=gUNeyiLNxr. Li et al. (2023) Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=DeG07_TcZvT. Lieberum et al. (2023) Lieberum, T., Rahtz, M., Kramár, J., Irving, G., Shah, R., and Mikulik, V. Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla. arXiv preprint arXiv:2307.09458, 2023. Lindner et al. (2023) Lindner, D., Kramar, J., Farquhar, S., Rahtz, M., McGrath, T., and Mikulik, V. Tracr: Compiled transformers as a laboratory for interpretability. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, p. 37876–37899. Curran Associates, Inc., 2023. Liu et al. (2023) Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=De4FYqjFueZ. Lu et al. (2021) Lu, K., Wang, Z., Mardziel, P., and Datta, A. Influence patterns for explaining information flow in bert. Advances in Neural Information Processing Systems, 34:4461–4474, 2021. Millière & Buckner (2024) Millière, R. and Buckner, C. A philosophical introduction to language models - part i: The way forward, 2024. Mueller et al. (2024) Mueller, A., Brinkmann, J., Li, M., Marks, S., Pal, K., Prakash, N., Rager, C., Sankaranarayanan, A., Sharma, A. S., Sun, J., Todd, E., Bau, D., and Belinkov, Y. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability, 2024. URL https://arxiv.org/abs/2408.01416. Nanda et al. (2023a) Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=9XFSbDPmdW. Nanda et al. (2023b) Nanda, N., Lee, A., and Wattenberg, M. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, p. 16–30, 2023b. Olah et al. (2017) Olah, C., Schubert, L., and Mordvintsev, A. Feature visualization. Distill, 2017. URL https://distill.pub/2017/feature-visualization/. Olsson et al. (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Petsiuk et al. (2018) Petsiuk, V., Das, A., and Saenko, K. Rise: Randomized input sampling for explanation of black-box models. In BMVC, 2018. Pruthi et al. (2020) Pruthi, D., Gupta, M., Dhingra, B., Neubig, G., and Lipton, Z. C. Learning to deceive with attention-based explanations. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, p. 4782–4793, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.432. URL https://aclanthology.org/2020.acl-main.432. Qiang et al. (2022) Qiang, Y., Pan, D., Li, C., Li, X., Jang, R., and Zhu, D. Attcat: Explaining transformers via attentive class activation tokens. Advances in Neural Information Processing Systems, 35:5052–5064, 2022. Quirke & Barez (2024) Quirke, P. and Barez, F. Understanding addition in transformers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rIx1YXVWZb. Räuker et al. (2023) Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), p. 464–483. IEEE, 2023. Scheurer et al. (2023) Scheurer, J., Phil3, tony, Thibodeau, J., and Lindner, D. Practical pitfalls of causal scrubbing. AI Alignment Forum, 2023. Selvaraju et al. (2019) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359, Oct 2019. ISSN 1573-1405. doi: 10.1007/s11263-019-01228-7. URL http://dx.doi.org/10.1007/s11263-019-01228-7. Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2013. Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise, 2017. Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017. URL https://api.semanticscholar.org/CorpusID:16747630. Tenney et al. (2019) Tenney, I., Das, D., and Pavlick, E. BERT rediscovers the classical NLP pipeline. In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, p. 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclanthology.org/P19-1452. Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Vig & Belinkov (2019) Vig, J. and Belinkov, Y. Analyzing the structure of attention in a transformer language model. In Linzen, T., Chrupała, G., Belinkov, Y., and Hupkes, D. (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, p. 63–76, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4808. URL https://aclanthology.org/W19-4808. Vig et al. (2020) Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. Causal mediation analysis for interpreting neural nlp: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020. Wang et al. (2020) Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S., Mardziel, P., and Hu, X. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, p. 24–25, 2020. Wang et al. (2023) Wang, K. R., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul. Wang et al. (2024) Wang, X., Mao, S., Deng, S., Yao, Y., Shen, Y., Liang, L., Gu, J., Chen, H., and Zhang, N. Editing conceptual knowledge for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, p. 706–724, 2024. Weiss et al. (2021) Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. In International Conference on Machine Learning, p. 11080–11090. PMLR, 2021. Wiegreffe & Pinter (2019) Wiegreffe, S. and Pinter, Y. Attention is not not explanation. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), p. 11–20, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1002. URL https://aclanthology.org/D19-1002. Wu et al. (2023a) Wu, Z., D’Oosterlinck, K., Geiger, A., Zur, A., and Potts, C. Causal proxy models for concept-based model explanations. In International conference on machine learning, p. 37313–37334. PMLR, 2023a. Wu et al. (2023b) Wu, Z., Geiger, A., Icard, T., Potts, C., and Goodman, N. Interpretability at scale: Identifying causal mechanisms in alpaca. Advances in neural information processing systems, 36:78205–78226, 2023b. Yeh et al. (2020) Yeh, C.-K., Kim, B., Arik, S., Li, C.-L., Pfister, T., and Ravikumar, P. On completeness-aware concept-based explanations in deep neural networks. Advances in neural information processing systems, 33:20554–20565, 2020. Yeh et al. (2021) Yeh, C.-K., Kim, B., and Ravikumar, P. Human-centered concept explanations for neural networks. In Neuro-Symbolic Artificial Intelligence: The State of the Art, p. 337–352. IOS Press, 2021. Zhang et al. (2023) Zhang, Y., Backurs, A., Bubeck, S., Eldan, R., Gunasekar, S., and Wagner, T. Unveiling transformers with LEGO: A synthetic reasoning task, 2023. URL https://openreview.net/forum?id=1jDN-RfQfrb. Zhong et al. (2023) Zhong, Z., Liu, Z., Tegmark, M., and Andreas, J. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=S5wmbQc1We. Zou et al. (2023) Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to ai transparency, 2023. Appendix A More Related Work Probing. Probing involves training a separate supervised classifier to predict human-understandable features of the data from the model’s internal activations (Alain & Bengio, 2017; Zou et al., 2023) and has found wide applications for understanding the representations learned by language models (Tenney et al., 2019; Li et al., 2023; Nanda et al., 2023b; Burns et al., 2023; Belrose et al., 2023; Huben et al., 2024; Gurnee et al., 2023; Wang et al., 2024). Even though a probe might be successful in predicting features from internal activations, it need not imply a causal relationship between these features and the model output (Belinkov, 2022). It has also been questioned whether the model activations genuinely represent the feature of interest or the probe itself learns these features by picking up on spurious correlations (Hewitt & Liang, 2019). However, recent works have demonstrated that for Transformer-based language models probes can be used to steer the model’s output behavior by manipulating the identified internal representations (Zou et al., 2023; Li et al., 2023). Related to probing is the work on concept-based reasoning (Kim et al., 2018; Crabbé & van der Schaar, 2022; Yeh et al., 2021, 2020) that extracts representations of high-level concepts as geometric shapes in the neural network representation space (i.e., the activation spaces of inner layers of neural networks) and quantifies the impact of these high-level concepts on model outputs via attribution-based techniques. Attention Pattern Analysis. There is a large body of work that attempts to understand the behavior of Transformer-based models by analyzing the attention patterns computed by the self-attention layer (Allen-Zhu & Li, 2023; Zhang et al., 2023; Ebrahimi et al., 2020; Vig & Belinkov, 2019; Hao et al., 2021; Qiang et al., 2022; Lu et al., 2021; Abnar & Zuidema, 2020; Chefer et al., 2021; Liu et al., 2023). However, there is an active ongoing debate about the validity of attention patterns as a tool for interpreting model behavior (Jain & Wallace, 2019; Pruthi et al., 2020; Bibal et al., 2022; Bastings & Filippova, 2020; Wiegreffe & Pinter, 2019). Appendix B Axioms of Mechanistic Interpretation Axioms 5 and 6 are stated more informally. Unlike the earlier axioms requiring the observed behaviors of the mechanistic interpretation and the neural network to coincide, these axioms are concerned with the compilability of the mechanistic interpretation into a neural network. Both axioms assume the existence of a semantics-preserving compiler from λHsubscript _Hλitalic_H to λTsubscript _Tλitalic_T (denoted by compilerλTλHsubscriptsuperscriptsubscriptsubscriptcompiler _H_ _Tc o m p i l e ritalic_λitalic_Hitalic_λ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT). While Axiom 5 requires the compiled version of hℎh to have the same structure and parameters as t (i.e., syntactic equivalence), Axiom 6 requires the compiled version to only have the same structure as t. The requirements imposed by these last two axioms are hard to establish in practice, and we do not consider them for the analysis presented in this paper. We present these axioms as a goal for future work. Recent work on compilers from the RASP language (Weiss et al., 2021) to Transformer models is a promising step in this direction (Lindner et al., 2023). Axiom 5 (Strong Mechanistic Derivability). hℎh is mechanistically derived from t if there exists a semantics-preserving compilation compilerλTλH(h)subscriptsuperscriptsubscriptsubscriptℎcompiler _H_ _T(h)c o m p i l e ritalic_λitalic_Hitalic_λ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( h ) of hℎh in λTsubscript _Tλitalic_T such that compilerλTλH(h)subscriptsuperscriptsubscriptsubscriptℎcompiler _H_ _T(h)c o m p i l e ritalic_λitalic_Hitalic_λ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( h ) and t are syntactically equal. Axiom 6 (Weak Mechanistic Derivability). hℎh is mechanistically derivable from t if there exists a semantics-preserving compilation compilerλTλH(h)subscriptsuperscriptsubscriptsubscriptℎcompiler _H_ _T(h)c o m p i l e ritalic_λitalic_Hitalic_λ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( h ) of hℎh in λTsubscript _Tλitalic_T such that compilerλTλH(h)subscriptsuperscriptsubscriptsubscriptℎcompiler _H_ _T(h)c o m p i l e ritalic_λitalic_Hitalic_λ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( h ) has the same architecture (with the same number of parameters) as t. B.1 Extension to Circuits While we focus on end-to-end analyses, our axioms are compatible with the analysis of subgraphs of a larger model. B.1.1 Axioms for Arbitrary Computational Graphs We can extend our axioms to operate directly on the graphs G and G′; we present a generalization of Axioms 1 through 4 below. Let G=(V,E)G=(V,E)G = ( V , E ) be the concrete model expressed as a computational graph and let G′=(V′,E′)superscript′superscript′G =(V ,E )G′ = ( V′ , E′ ) be the computational graph, isomorphic to G, representing the abstract model. Let the isomorphism between G and G′ be Π Π and call the input and output vertices of G, inini n and outouto u t, respectively; while this can be easily extended to multiple input and output vertices, we present the single input, single output case for simplicity. Let execute(G)(x)execute(G)(x)e x e c u t e ( G ) ( x ) return a mapping val:V→Val:→val:V→ Valv a l : V → V a l with the values of all intermediate nodes where ValValV a l is the set of all possible values (i.e., vectors of reals) that the nodes can take. For a (partial) assignment to the vertices assign:V′→Val:→superscript′assign:V → Vala s s i g n : V′ → V a l with in⊆V′⊆Vsuperscript′\in\ V V i n ⊆ V′ ⊆ V, propagate(G)(assign)propagate(G)(assign)p r o p a g a t e ( G ) ( a s s i g n ) returns a mapping of the same type val:V→Val:→val:V→ Valv a l : V → V a l which is the result of execution of G where the provided intermediate assignments override the standard results of execution for those vertices. In this way, we can represent arbitrary interventions on G and G′. We define an abstraction operator αvsubscript _vαitalic_v for each concrete node v∈Vv∈ Vv ∈ V and a concretization operator γv′subscript′ _v γitalic_v′ for each abstract node v′∈V′superscript′v ∈ V v′ ∈ V′. α and γ are the corresponding functions which apply the corresponding mappings to each vertex in the graph. Finally, define select(v)(val)=val(v)select(v)(val)=val(v)s e l e c t ( v ) ( v a l ) = v a l ( v ) for val:V→Val:→val:V→ Valv a l : V → V a l and v∈Vv∈ Vv ∈ V; we overload and let select(V′)(val)(v′)=val(v′)superscript′superscript′select(V )(val)(v )=val(v )s e l e c t ( V′ ) ( v a l ) ( v′ ) = v a l ( v′ ) for V′⊆Vsuperscript′V V′ ⊆ V and v′∈V′superscript′v ∈ V v′ ∈ V′. We can then define extensions of Axioms 1 to 4 below: Axiom 7 (ϵitalic-ϵεϵ-Prefix Equivalence). ∀v∈V.for-all∀ v∈ V.∀ v ∈ V . Prx∼[αv∘select(v)∘execute(G)(x)=select(Π(v))∘execute(G′)∘αin(x)]≥1−ϵsimilar-todelimited-[]subscriptΠsuperscript′subscript1italic-ϵ x Pr[ _v select(v) execute% (G)(x)=select( (v)) execute(G ) _in(x)]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_v ∘ s e l e c t ( v ) ∘ e x e c u t e ( G ) ( x ) = s e l e c t ( Π ( v ) ) ∘ e x e c u t e ( G′ ) ∘ αitalic_i n ( x ) ] ≥ 1 - ϵ Axiom 8 (ϵitalic-ϵεϵ-Component Equivalence). ∀v∈V.for-all∀ v∈ V.∀ v ∈ V . Prx∼[αv∘select(v)∘execute(G)(x)=select(Π(v))∘propagate(G′)∘select(Π(p)∪Π(in))∘α∘execute(G)(x)]≥1−ϵsimilar-todelimited-[]subscriptabsentΠsuperscript′Π1italic-ϵ x Pr [ array[]l _v% select(v) execute(G)(x)=\\ select( (v)) propagate(G ) select( (p)∪\ (in)\)% α execute(G)(x) array ]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL αitalic_v ∘ s e l e c t ( v ) ∘ e x e c u t e ( G ) ( x ) = end_CELL end_ROW start_ROW start_CELL s e l e c t ( Π ( v ) ) ∘ p r o p a g a t e ( G′ ) ∘ s e l e c t ( Π ( p ) ∪ Π ( i n ) ) ∘ α ∘ e x e c u t e ( G ) ( x ) end_CELL end_ROW end_ARRAY ] ≥ 1 - ϵ where p=predecessors(v)p=predecessors(v)p = p r e d e c e s s o r s ( v ). Axiom 9 (ϵitalic-ϵεϵ-Prefix Replaceability). ∀v∈V.for-all∀ v∈ V.∀ v ∈ V . Prx∼[select(out)∘execute(G)(x)=select(out)∘propagate(G)∘select(in,v)∘γ∘execute(G′)∘αin(x)]≥1−ϵsimilar-todelimited-[]absentsuperscript′subscript1italic-ϵ x Pr [ array[]lselect(out% ) execute(G)(x)=\\ select(out) propagate(G) select(\in,v\) γ execute(G^% ) _in(x) array ]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL s e l e c t ( o u t ) ∘ e x e c u t e ( G ) ( x ) = end_CELL end_ROW start_ROW start_CELL s e l e c t ( o u t ) ∘ p r o p a g a t e ( G ) ∘ s e l e c t ( i n , v ) ∘ γ ∘ e x e c u t e ( G′ ) ∘ αitalic_i n ( x ) end_CELL end_ROW end_ARRAY ] ≥ 1 - ϵ Axiom 10 (ϵitalic-ϵεϵ-Component Replaceability). ∀v∈V.for-all∀ v∈ V.∀ v ∈ V . Prx∼[select(out)∘execute(G)(x)=(select(out)∘propagate(G)∘select(in,v)∘γ∘propagate(G′)∘select(Π(p)∪Π(in))∘α∘execute(G)(x))]≥1−ϵ x Pr [select(out) execute(G)(% x)= ( array[]lselect(out) propagate(G) select(\in,v\)% \\ γ propagate(G ) select( (p)∪\ (in)\) \\ α execute(G)(x) array ) ]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ s e l e c t ( o u t ) ∘ e x e c u t e ( G ) ( x ) = ( start_ARRAY start_ROW start_CELL s e l e c t ( o u t ) ∘ p r o p a g a t e ( G ) ∘ s e l e c t ( i n , v ) ∘ end_CELL end_ROW start_ROW start_CELL γ ∘ p r o p a g a t e ( G′ ) ∘ s e l e c t ( Π ( p ) ∪ Π ( i n ) ) ∘ end_CELL end_ROW start_ROW start_CELL α ∘ e x e c u t e ( G ) ( x ) end_CELL end_ROW end_ARRAY ) ] ≥ 1 - ϵ where p=predecessors(v)p=predecessors(v)p = p r e d e c e s s o r s ( v ). Note that our Axioms 1 to 4 are equivalent to these formulations, specialized to linear computational graphs. In linear computational graphs, all dependencies on ancestor nodes are mediated by a node’s predecessor; hence, there is a limited need to consider parallel interventions, which involve simultaneously performing independent interventions on multiple nodes of the graph. In the case of prefix equivalence and prefix replaceability, any parallel interventions on linear graphs are equivalent to the intervention on the final node in the sequence and hence have no additional effect. While parallel extensions of component equivalence and component replaceability would strengthen evaluation in the linear setting, considering parallel interventions is particularly important in the nonlinear case. Consider the following scenario: there are two sibling nodes in the graph that encode redundant computations and the concrete states output by both these nodes contain information needed by the downstream circuit. Also, let us say that the corresponding abstract model fails to faithfully capture the computation performed by either of these nodes. However, this failure may not be captured by the axioms without parallel interventions since the redundant copy of the computation in the sibling concrete node masks any intervention applied to the other concrete node. One potential solution is to merge such redundant sibling nodes in the analyzed graph G, however, this prevents checking equivalence of the nodes independently. A more general solution is to extend Axioms 7 through 10 to accommodate arbitrary parallel interventions. We present a preliminary formulation of an axiom for parallel intervention below; a final formulation is left to future work. In particular, this axiom cannot be efficiently evaluated in its current form. We first extend executeexecutee x e c u t e and propagatepropagatep r o p a g a t e to accommodate execution of arbitrary mixtures of the graphs G and G′. These behave as before, except that any concrete inputs to an abstract operation are abstracted prior to execution, and, similarly, any abstract inputs to a concrete operation are concretized prior to evaluation. Let interleave(G,G′,V′)superscript′interleave(G,G ,V )i n t e r l e a v e ( G , G′ , V′ ) be a graph identical to G except that the operations for any nodes v∈V′v∈ V v ∈ V′ are replaced with the operations for the corresponding nodes Π(v)Π (v)Π ( v ) in G′. Let conditional_abstract(V′)_superscript′conditional\_abstract(V )c o n d i t i o n a l _ a b s t r a c t ( V′ ) be a mapping on the values of the nodes v∈Vv∈ Vv ∈ V identical to αΠ(v)subscriptΠ _ (v)αroman_Π ( v ) for v∉V′v ∈ V v ∉ V′ and the identity if v∈V′v∈ V v ∈ V′. We can now express an axiom which generalizes Axioms 7, 8, and 10 to parallel interventions: Axiom 11 (ϵitalic-ϵεϵ-Equivalence). ∀V′⊆V,∀v∈Vformulae-sequencefor-allsuperscript′for-all∀ V V,∀ v∈ V∀ V′ ⊆ V , ∀ v ∈ V. Prx∼[select(v)∘conditional_abstract(V′)∘execute(G)(x)=select(v)∘conditional_abstract(V′)∘execute(interleave(G,G′,V′))(x)]≥1−ϵsimilar-todelimited-[]_superscript′absent_superscript′superscript′1italic-ϵ x Pr [ array[]lselect(v)% conditional\_abstract(V ) execute(G)(x)=\\ select(v) conditional\_abstract(V ) execute(interleave(G,G^% ,V ))(x) array ]≥ 1- _UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL s e l e c t ( v ) ∘ c o n d i t i o n a l _ a b s t r a c t ( V′ ) ∘ e x e c u t e ( G ) ( x ) = end_CELL end_ROW start_ROW start_CELL s e l e c t ( v ) ∘ c o n d i t i o n a l _ a b s t r a c t ( V′ ) ∘ e x e c u t e ( i n t e r l e a v e ( G , G′ , V′ ) ) ( x ) end_CELL end_ROW end_ARRAY ] ≥ 1 - ϵ The condition of Axiom 11 can be made even stronger by the treeification (Chan et al., 2022) of the graphs G and G′ prior to evaluation. This allows, for each node v∈Vv∈ Vv ∈ V, independent selection of the set of ancestors whose operations are replaced with their abstract equivalents. With treeification, Axiom 11 can express prefix replaceability as well, leading it to generalize Axioms 7 through 10. B.1.2 Linearizing Circuits We can directly validate arbitrary circuits with the standard linear-graph axioms. In particular, we will describe how, for any computational graphs G=(V,E)G=(V,E)G = ( V , E ) and G′=(V′,E′)superscript′superscript′G =(V ,E )G′ = ( V′ , E′ ), which represent the concrete model t and the abstract model hℎh, respectively, we can construct a linearized computational graph. It is enough to show the construction for G. We will do so by propagating partial assignments to the vertices v∈Vv∈ Vv ∈ V through a topologically sorted sequence of operations which compute the values of each node v as a function of the values of the predecessors. First, we extend λTsubscript _Tλitalic_T to include multivariate functions f(x1,x2,…,xn)subscript1subscript2…subscriptf(x_1,x_2,…,x_n)f ( x1 , x2 , … , xitalic_n ) for xi∈Varsubscriptx_i∈ Varxitalic_i ∈ V a r. Suppose that op:V→λT:→subscriptop:V→ _To p : V → λitalic_T returns the operation op(v)op(v)o p ( v ) computed by node each node v, and let predecessors(v)predecessors(v)p r e d e c e s s o r s ( v ) return the predecessors of node v in the argument order of op(v)op(v)o p ( v ). For simplicity, we assume that G has exactly one input node inini n and exactly one output node outouto u t. Let o:[|V|]→V:→delimited-[]o:[|V|]→ Vo : [ | V | ] → V be a topological ordering of G. Let idx:V→[|V|]:→delimited-[]idx:V→[|V|]i d x : V → [ | V | ], the inverse of o, return the position of each node v in the topological ordering o. We again extend λTsubscript _Tλitalic_T to include the function execute_op(f,p,v)(env)_execute\_op(f,p,v)(env)e x e c u t e _ o p ( f , p , v ) ( e n v ) which, given an m-ary function f, a node v∈Vv∈ Vv ∈ V, a partial assignment to the vertices env:V→Val∪⟂:→perpendicular-toenv:V→ Val∪\ \e n v : V → V a l ∪ ⟂ and a sequence of input nodes p:[m]→V:→delimited-[]p:[m]→ Vp : [ m ] → V such that env(p(i))∈Valenv(p(i))∈ Vale n v ( p ( i ) ) ∈ V a l for each 1≤i≤m11≤ i≤ m1 ≤ i ≤ m, returns an updated environment env′env e n v′ of the same type, identical to envenve n v except that env′(v)=f(env(p(1)),…,env(p(m)))superscript′1…env (v)=f(env(p(1)),…,env(p(m)))e n v′ ( v ) = f ( e n v ( p ( 1 ) ) , … , e n v ( p ( m ) ) ). Finally, for val∈Valval∈ Valv a l ∈ V a l, let input_env(val)_input\_env(val)i n p u t _ e n v ( v a l ) be a partial assignment to the vertices input_env(val):V→Val∪⟂:_→perpendicular-toinput\_env(val):V→ Val∪\ \i n p u t _ e n v ( v a l ) : V → V a l ∪ ⟂ which is ⟂perpendicular-to ⟂ everywhere but inini n, where we have input_env(val)(in)=val_input\_env(val)(in)=vali n p u t _ e n v ( v a l ) ( i n ) = v a l. Noting that op(1)1op(1)o p ( 1 ) must always be inini n, we can define a decomposition of t as follows: dt=subscriptabsent d_t=ditalic_t = (select(out)∘execute_op(op(o(|V|)),predecessors(o(|V|)),o(|V|)))_ \, (select(out) execute\_op(op(o(|V|)),predecessors(o(|V% |)),o(|V|)) )( s e l e c t ( o u t ) ∘ e x e c u t e _ o p ( o p ( o ( | V | ) ) , p r e d e c e s s o r s ( o ( | V | ) ) , o ( | V | ) ) ) ∘…absent… \, …∘ … ∘execute_op(op(o(2)),predecessors(o(2)),o(2))absent_222 \, execute\_op(op(o(2)),predecessors(o(2)),o(2))∘ e x e c u t e _ o p ( o p ( o ( 2 ) ) , p r e d e c e s s o r s ( o ( 2 ) ) , o ( 2 ) ) ∘input_env.absent_ \, input\_env.∘ i n p u t _ e n v . Appendix C Remark on Extensional Equivalence x0subscript0x_0x0(0)x1subscript1x_1x1(1)x0subscript0x_0x0(2)x1subscript1x_1x1(3)1x01subscript01 x_0divide start_ARG 1 end_ARG start_ARG x0 end_ARG(4)1x11subscript11 x_1divide start_ARG 1 end_ARG start_ARG x1 end_ARG(5)1x0<1x11subscript01subscript11 x_0<1 x_1divide start_ARG 1 end_ARG start_ARG x0 end_ARG < divide start_ARG 1 end_ARG start_ARG x1 end_ARG(6)(a)x0subscript0x_0x0(0)x1subscript1x_1x1(1)x0subscript0x_0x0(2)x1subscript1x_1x1(3)x0subscript0x_0x0(4)x1subscript1x_1x1(5)x0>x1subscript0subscript1x_0>x_1x0 > x1(6)(b) Figure 4: Extensionally equivalent mechanistic interpretations considered by Scheurer et al. (2023). Model (a) is the concrete model and the ground truth interpretation. Model (b) is the abstract model. Scheurer et al. (2023) presents an example illustrating the extensional equivalence problem: in particular, they demonstrate that causal scrubbing cannot distinguish between two mechanistic interpretations with equivalent input-output behavior but which differ in internal behavior. See Figure 4 for the two models. We make copies of the x0subscript0x_0x0 and x1subscript1x_1x1 nodes in model (b) to ensure that the two models are isomorphic. As observed by Scheurer et al. (2023), the two are indistinguishable to causal scrubbing, noting that we assume that x0subscript0x_0x0 and x1subscript1x_1x1 are both positive. This is a necessary consequence of the resampling technique used by causal scrubbing to derive its interventions. In particular, causal scrubbing replaces the values of concrete nodes with randomly selected values which are equivalent up to the abstract model. Clearly, the concrete model is invariant with respect to resampling ablations derived from interpretation (a), which is identical to the concrete model. But, likewise, there is no effect from the corresponding intervention derived from interpretation (b). For instance, resampling concrete values of node (4), 1x01subscript01 x_0divide start_ARG 1 end_ARG start_ARG x0 end_ARG, up to equivalence in the abstract state of x0subscript0x_0x0 can have no effect. Hence, both interpretations are evaluated as perfect by causal scrubbing. For the same reason, these interpretations cannot be distinguished by interchange intervention accuracy (Geiger et al., 2022): interchange intervention accuracy replaces the values of intermediate nodes with those derived with different sets of inputs and evaluates the rate at which these interventions have an identical effects on the output of the abstract model on and the abstracted output of the concrete model. If we restrict the set of allowable α and γ functions, for example to linear mappings alone, Axioms 1 through 4 can distinguish between the two hypotheses. In particular, as no more than two points on the curve (x,1/x)1(x,1/x)( x , 1 / x ) can be collinear, Axiom 1 cannot hold with ϵ=0italic-ϵ0ε=0ϵ = 0 on any distribution with more than four points in its support when we restrict α and γ to be linear. In the general case, with unrestricted α and γ, the interpretations are again indistinguishable. However to obtain this result, we must conduct meaningful computation in the abstraction and concretization functions, e.g. α4(x)=γ4(x)=1xsubscript4subscript41 _4(x)= _4(x)=1 xα4 ( x ) = γ4 ( x ) = divide start_ARG 1 end_ARG start_ARG x end_ARG; while not directly evaluated by the axioms, any analyst with a bias towards keeping complexity in the interpretation would prefer the ground-truth interpretation. Appendix D More Details on Mechanistic Interpretability Analysis of the 2-SAT Model Figure 5: Breakdown of the concrete model into concrete components, corresponding to the components dt[i]subscriptdelimited-[]d_t[i]ditalic_t [ i ]. The decomposition of the original Transformer 2-SAT solver is shown in Figure 5. The abstraction operators from our analysis are shown in Listings LABEL:lst:alphas_1 and LABEL:lst:alphas_2 and the concretization operators are shown in Listings LABEL:lst:gammas_1 and LABEL:lst:gammas_2. ⬇ 1alpha_0 = identity 2 3def clause_representation(literal_l, literal_r): 4 # Generate the input tensor for the formula with 10 copies of the clause 5 inputs = to_toks([Or(literal_l, literal_r)] * 10) 6 7 embeds = model.embed(inputs) 8 attn_out = embed + model.blocks[0].attention( 9 embeds, 10 mask=gen_attention_mask([4*i+2: clause_positions_and_parens(i)]) 11 ) 12 block_1_out = attn_out + model.blocks[0].mlp(attn_out) 13 14 # We derive the canonical representation by averaging across second literal positions 15 return mean(block_1_out[4*clause_idx+2] for clause_idx in range(10)) 16 17clause_representations = 18 Or(l,r): clause_representation(l, r) 19 for l in literals 20 for r in literals 21 22 23def alpha_1(block_1_output): 24 second_literal_outputs = [block_1_outputs[4*i+2] for i in range(10)] 25 # Calculate cosine similarities between the actual representations output at 26 # the second variable positions and the canonical clause 27 # representations, select the clauses which best match 28 return [argmax_cosine_sims(out, clause_representations) for out in second_literal_outputs] Listing 1: Abstraction operators for the first block in Python pseudocode. ⬇ 1def alpha_2(residual_and_mlp_output, threshold=0.5): 2 mlp_output = residual_and_mlp_output[1] 3 return [mlp_output[i] >= threshold for i in evaluating_neurons] 4 5alpha_3 = identity Listing 2: Abstraction operators for the second block in Python pseudocode. ⬇ 1gamma_0 = identity 2 3# Mean output from the first transformer block, by position, on the training set 4mean_block_1_out = mean(model.intermediate_outputs("block_1", x) for x in train) 5 6# Constant on everything but second token positions 7def gamma_1(clauses): 8 output = mean_block_1_out.copy() 9 # Substitute the corresponding canonical clause representation 10 for i, clause in enumerate(clauses): 11 output[4*i+2] = clause_representations[clause] 12 return output Listing 3: Concretization operators for the first block in Python pseudocode. ⬇ 1# Mean residual from attention for the readout token on the training set 2mean_attn_out = mean(model.intermediate_outputs("attention_block_2", x)[-1] for x in train) 3 4# Output a constant residual term, and map predicted activations to 5# a large constant value 6def gamma_2(activation_model_outputs, high_activation=2): 7 mlp_out_concretization = torch.zeros(mlp_width) 8 for i, model_out in zip(evaluating_neurons, activation_model_outputs): 9 if model_out: 10 mlp_out_concretization[i] = high_activation 11 12 return mean_attn_out, mlp_out_concretization 13 14gamma_3 = identity Listing 4: Concretization operators for the second block in Python pseudocode. D.1 First Block as Parser Figure 6: Average attention scores, grouped by source token position, for all heads, calculated over the test set. In the first block, we further average across destination token positions restricted to positions 4i+2424i+24 i + 2, where 0≤i<100100≤ i<100 ≤ i < 10 is the clause index. For the second block, we only consider the readout token as the destination. Decomposed Attention Analysis. The parsing behavior in the first block cannot be fully described by token-to-token attention, and hence, we further decompose the standard QK-decomposition of Elhage et al. (2021) to account for positional factors as well. In a Transformer, each token’s initial embedding is the sum of a positional embedding and a token embedding; hence, rather than viewing attention as from destination embeddings to source embeddings, we can equivalently express the pre-softmax scores as the sum of four sets of preferences—token-to-token attention, token-to-position attention, position-to-token attention, and position-to-position attention. In particular we can decompose the first block’s self-attention mechanism as follows.555We use the QK circuit notation of Elhage et al. (2021) Given token tsrcsubscriptsrct_srctsrc in position psrcsubscriptsrcp_srcpsrc to token tdstsubscriptdstt_dsttdst in position tdstsubscriptdstt_dsttdst, the pre-softmax score is (tsrcTWET+psrcTWPOST)WKTWQ(WEtdst+WPOSpdst)superscriptsubscriptsrcsuperscriptsubscriptsuperscriptsubscriptsrcsuperscriptsubscriptsuperscriptsubscriptsubscriptsubscriptsubscriptdstsubscriptsubscriptdst \, (t_src^TW_E^T+p_src^T% W_POS^T )W_K^TW_Q (W_Et_dst+W_POS% p_dst )( tsrcitalic_T Witalic_Eitalic_T + psrcitalic_T Witalic_P O Sitalic_T ) Witalic_Kitalic_T Witalic_Q ( Witalic_E tdst + Witalic_P O S pdst ) = == (tsrcTWET+psrcTWPOST)WQK(WEtdst+WPOSpdst)superscriptsubscriptsrcsuperscriptsubscriptsuperscriptsubscriptsrcsuperscriptsubscriptsubscriptsubscriptsubscriptdstsubscriptsubscriptdst \, (t_src^TW_E^T+p_src^T% W_POS^T )W_QK (W_Et_dst+W_POSp_ % dst )( tsrcitalic_T Witalic_Eitalic_T + psrcitalic_T Witalic_P O Sitalic_T ) Witalic_Q K ( Witalic_E tdst + Witalic_P O S pdst ) = == tsrcTWETWQKWEtdst+tsrcTWETWQKWPOStsrc+superscriptsubscriptsrcsuperscriptsubscriptsubscriptsubscriptsubscriptdstlimit-fromsuperscriptsubscriptsrcsuperscriptsubscriptsubscriptsubscriptsubscriptsrc \,t_src^TW_E^TW_QKW_Et_dst% +t_src^TW_E^TW_QKW_POSt_src~+tsrcitalic_T Witalic_Eitalic_T Witalic_Q K Witalic_E tdst + tsrcitalic_T Witalic_Eitalic_T Witalic_Q K Witalic_P O S tsrc + psrcTWPOSTWQKWEtdst+psrcTWPOSTWQKWPOSpdstsuperscriptsubscriptsrcsuperscriptsubscriptsubscriptsubscriptsubscriptdstsuperscriptsubscriptsrcsuperscriptsubscriptsubscriptsubscriptsubscriptdst \,p_src^TW_POS^TW_QKW_Et_% dst+p_src^TW_POS^TW_QKW_POSp_dstpsrcitalic_T Witalic_P O Sitalic_T Witalic_Q K Witalic_E tdst + psrcitalic_T Witalic_P O Sitalic_T Witalic_Q K Witalic_P O S pdst = == tsrcTWQKtok→toktdst+tsrcTWQKtok→pospsrc+psrcTWQKpos→toktdst+psrcTWQKpos→pospdst,superscriptsubscriptsrcsuperscriptsubscript→toktoksubscriptdstsuperscriptsubscriptsrcsuperscriptsubscript→tokpossubscriptsrcsuperscriptsubscriptsrcsuperscriptsubscript→postoksubscriptdstsuperscriptsubscriptsrcsuperscriptsubscript→pospossubscriptdst \,t_src^TW_QK^tok % t_dst+t_src^TW_QK^tok→% posp_src+p_src^TW_QK^pos% t_dst+p_src^TW_QK^% pos p_dst,tsrcitalic_T Witalic_Q Ktok → tok tdst + tsrcitalic_T Witalic_Q Ktok → pos psrc + psrcitalic_T Witalic_Q Kpos → tok tdst + psrcitalic_T Witalic_Q Kpos → pos pdst , where WEsubscriptW_EWitalic_E and WPOSsubscriptW_POSWitalic_P O S are the token and positional embedding matrices, WQKsubscriptW_QKWitalic_Q K is the QK matrix, tsrcsubscriptsrct_srctsrc and tdstsubscriptdstt_dsttdst are the source and destination tokens, while psrcsubscriptsrcp_srcpsrc and pdstsubscriptdstp_dstpdst are source and destination positions. Our observations in Figures 1 and 6 suggest that the tokens relevant to the final classification decision, are in position 4i+2424i+24 i + 2 for 0≤i<100100≤ i<100 ≤ i < 10, which, given the structure of the inputs to the model, are the second tokens of each clause of the form xisubscriptx_ixitalic_i or ¬xisubscript x_i¬ xitalic_i for 0≤i<5050≤ i<50 ≤ i < 5. Figure 7: Decomposed QK circuit of the first block’s attention mechanism; the subset where the destination token is the second variable of a clause is shown. If we restrict our view to destination tokens at this position, we can express the effects of the four components of the first block’s QK circuit as shown in Figure 7. With the somewhat odd exception of the first clause, and, in particular, a first clause with second literal x0subscript0x_0x0 or ¬x0subscript0 x_0¬ x0, attention strongly deprioritizes the parentheses but has weak preferences for the source token otherwise. Outside of destination tokens x0subscript0x_0x0 and ¬x0subscript0 x_0¬ x0, preferences for source position are fairly consistent, and hence, the core effect is a combination of position-to-position preferences (see the bottom right of Figure 7) and a preference against punctuation (the parentheses tokens ‘(’ and ‘)’). This can be seen in the token-to-position preferences in the bottom left, with an exception in the case of token 2, which may behave differently as all tokens to the left belong to the first clause, limiting the importance of learning more specific attention patterns. A general preference against punctuation can be seen in the token-to-position preferences (top left) as well. For the first several clauses, positional preferences encode a preference for the positions where parentheses occur, open parentheses in particular; however, the preference against parentheses shown in the token-to-position attention scores (corresponding to WQKtok→possuperscriptsubscript→tokposW_QK^tok Witalic_Q Ktok → pos) counteract that effect to result in a net effect of attention to the first literal of the clause. Distributional Attention Analysis. Figure 8: Softmax of expected attention scores by position for the second token of each clause, showing that for each second token position, attention on literal positions in the corresponding clause are expected to dominate, consistent with our interpretation of the first block’s attention mechanism as a parser. To illustrate how these four components of the QK circuit work together on typical formulas, we can use the values of the four QK matrices to compute the attention scores in expectation given a uniformly distributed choice of literals. To compute the attention scores in expectation, we only consider destination positions p (p=4i+242p=4i+2p = 4 i + 2 for some i). Call the token in that position, t, and the full embedding passed to attention, e. To compute the expected pre-softmax attention score on position p′≤psuperscript′p ≤ p′ ≤ p (defining t′ and e′ similarly), we first check if p′ contains punctuation. If p′≡0(mod4)superscript′annotated0pmod4p ≡ 0 4p′ ≡ 0 start_MODIFIER ( mod start_ARG 4 end_ARG ) end_MODIFIER, t′ is ‘(’ and if p′≡3(mod4)superscript′annotated3pmod4p ≡ 3 4p′ ≡ 3 start_MODIFIER ( mod start_ARG 4 end_ARG ) end_MODIFIER, t′ is ‘)’. For such t and t′ we can use our decomposition into the four sets of preferences to calculate an expected score: t,t′e′TWQKe=subscriptsuperscript′subscriptabsent *E_t,t e TW_QKe=blackboard_Et , t′ e′ T Witalic_Q K e = t′,t[t′TWQKtok→tokt+t′TWQKtok→posp+p′TWQKpos→tokt+p′TWQKpos→posp]subscriptsuperscript′superscriptsubscript→toktoksuperscript′subscript→tokpossuperscript′subscript→postoksuperscript′subscript→pospos \, *E_t ,t [t TW_% QK^tok t+t TW_QK^tok% p+p TW_QK^pos % t+p TW_QK^pos p ]blackboard_Et′ , t [ t′ T Witalic_Q Ktok → tok t + t′ T Witalic_Q Ktok → pos p + p′ T Witalic_Q Kpos → tok t + p′ T Witalic_Q Kpos → pos p ] = == t′TWQKtok→tokt+t′TWQKtok→posp+tp′TWQKpos→tokt+p′TWQKpos→pospsubscriptsuperscript′subscript→toktoksuperscript′subscript→tokpossubscriptsuperscript′subscript→postoksuperscript′subscript→pospos \, *E_tt TW_QK^tok% t+t TW_QK^tok % p+ *E_tp TW_QK^pos% t+p TW_QK^pos % pblackboard_Et t′ T Witalic_Q Ktok → tok t + t′ T Witalic_Q Ktok → pos p + blackboard_Et p′ T Witalic_Q Kpos → tok t + p′ T Witalic_Q Kpos → pos p = == 1|lit|∑t∈litWQKtok→tokt′,t+t′TWQKtok→posp+1subscriptsubscriptsuperscriptsubscript→toktoksuperscript′limit-fromsuperscript′subscript→tokpos \,1 |lit| _t∈ litW_QK^tok→% tok_t ,t+t TW_QK^tok→% posp~+divide start_ARG 1 end_ARG start_ARG | l i t | end_ARG ∑t ∈ l i t Witalic_Q Ktok → tokt′ , t + t′ T Witalic_Q Ktok → pos p + 1|lit|∑t∈litWQKpos→tokp′,t+p′TWQKpos→posp1subscriptsubscriptsuperscriptsubscript→postoksuperscript′superscriptsubscript→pospos \,1 |lit| _t∈ litW_QK^pos→% tok_p ,t+p TW_QK^pos→% pospdivide start_ARG 1 end_ARG start_ARG | l i t | end_ARG ∑t ∈ l i t Witalic_Q Kpos → tokp′ , t + p′ T Witalic_Q Kpos → pos p where we overload notation and use t, t′ as tokens, indices, and the corresponding one-hot representations, and similarly for the positions p and p′ and where litlitl i t is the set of literals. If p′ does not contain punctuation, then we know that t′ is a variable (xisubscriptx_ixitalic_i or ¬xisubscript x_i¬ xitalic_i for some i), and similarly for t. Using that, we can use our decomposition into the four sets of preferences to calculate an expected score: t,t′e′TWQKe=subscriptsuperscript′subscriptabsent *E_t,t e TW_QKe=blackboard_Et , t′ e′ T Witalic_Q K e = t′,t[t′TWQKtok→tokt+t′TWQKtok→posp+p′TWQKpos→tokt+p′TWQKpos→posp]subscriptsuperscript′superscriptsubscript→toktoksuperscript′subscript→tokpossuperscript′subscript→postoksuperscript′subscript→pospos \, *E_t ,t [t TW_% QK^tok t+t TW_QK^tok% p+p TW_QK^pos % t+p TW_QK^pos p ]blackboard_Et′ , t [ t′ T Witalic_Q Ktok → tok t + t′ T Witalic_Q Ktok → pos p + p′ T Witalic_Q Kpos → tok t + p′ T Witalic_Q Kpos → pos p ] = == t′,t′TWQKtok→tokt+t′t′TWQKtok→posp+tp′TWQKpos→tokt+p′TWQKpos→pospsubscriptsuperscript′superscriptsubscript→toktoksubscriptsuperscript′superscriptsubscript→tokpossubscriptsuperscript′subscript→postoksuperscript′subscript→pospos \, *E_t ,tt TW_QK^% tok t+ *E_t t^% TW_QK^tok p+ *% E_tp TW_QK^pos t+p T% W_QK^pos pblackboard_Et′ , t t′ T Witalic_Q Ktok → tok t + blackboard_Et′ T Witalic_Q Ktok → pos p + blackboard_Et p′ T Witalic_Q Kpos → tok t + p′ T Witalic_Q Kpos → pos p = == 1|lit|2∑t,t′∈litWQKtok→tokt′,t+1|lit|∑t′∈litWQKtok→post′,p+1superscript2subscriptsuperscript′subscriptsuperscriptsubscript→toktoksuperscript′limit-from1subscriptsuperscript′subscriptsuperscriptsubscript→tokpossuperscript′ \,1 |lit|^2 _t,t ∈ litW_QK^tok% _t ,t+1 |lit| _t ∈ lit% W_QK^tok _t ,p+divide start_ARG 1 end_ARG start_ARG | l i t |2 end_ARG ∑t , t′ ∈ l i t Witalic_Q Ktok → tokt′ , t + divide start_ARG 1 end_ARG start_ARG | l i t | end_ARG ∑t′ ∈ l i t Witalic_Q Ktok → post′ , p + 1|lit|∑t∈litWQKpos→tokp′,t+p′TWQKpos→posp1subscriptsubscriptsuperscriptsubscript→postoksuperscript′superscriptsubscript→pospos \,1 |lit| _t∈ litW_QK^pos→% tok_p ,t+p TW_QK^pos→% pospdivide start_ARG 1 end_ARG start_ARG | l i t | end_ARG ∑t ∈ l i t Witalic_Q Kpos → tokp′ , t + p′ T Witalic_Q Kpos → pos p After taking the softmax of the resulting values, the result is shown in Figure 8. Worst-case Attention Analysis. We can also show that the parsing behavior occurs by a worst-case analysis of the attention scores, i.e. what is the minimum weight from attention to the first token of the clause and to the clause as a whole? This allows us to dispense with any distributional assumptions, and to validate that the inconsistent preferences for particular literals (i.e. the differences in preferences when the destination token is x0subscript0x_0x0). Given the definition of softmax, we can compute the minimum weight on the first token of clause i by the second token of clause i as follows: minϕscore4i+1,4i+2softmax(ϕ)subscriptitalic-ϕsubscriptsuperscriptscoresoftmax4142italic-ϕ _φscore *softmax_4i+1,4i+2(φ)minitalic_ϕ scoresoftmax4 i + 1 , 4 i + 2 ( ϕ ) = == mint∈tok4i+2minϕ∣ϕ4i+2=tescore4i+1,4i+2(ϕ)∑j≤4i+2escorej,4i+2(ϕ)subscriptsubscript42subscriptconditional-setitalic-ϕsubscriptitalic-ϕ42superscriptsubscriptscore4142italic-ϕsubscript42superscriptsubscriptscore42italic-ϕ _t∈ tok_4i+2 _\φ _4i+2=t\e^% score_4i+1,4i+2(φ) _j≤ 4i+2e^score_j,4i+2(φ)minitalic_t ∈ t o k start_POSTSUBSCRIPT 4 i + 2 end_POSTSUBSCRIPT min ϕ ∣ ϕ start_POSTSUBSCRIPT 4 i + 2 = t end_POSTSUBSCRIPT divide start_ARG escore4 i + 1 , 4 i + 2 ( ϕ ) end_ARG start_ARG ∑j ≤ 4 i + 2 escoreitalic_j , 4 i + 2 ( ϕ ) end_ARG ≤ ≤ mint∈tok4i+2eminϕ∣ϕ4i+2=tscore4i+1,4i+2(ϕ)eminϕ∣ϕ4i+2=tscore4i+1,4i+2(ϕ)+∑j≤4i∨j=4i+2emaxϕ∣ϕ4i+2=tscorej,4i+2(ϕ)subscriptsubscript42superscriptsubscriptconditional-setitalic-ϕsubscriptitalic-ϕ42subscriptscore4142italic-ϕsuperscriptsubscriptconditional-setitalic-ϕsubscriptitalic-ϕ42subscriptscore4142italic-ϕsubscript442superscriptsubscriptconditional-setitalic-ϕsubscriptitalic-ϕ42subscriptscore42italic-ϕ _t∈ tok_4i+2e _\φ _4i+2=t\% score_4i+1,4i+2(φ) e _\φ _4i+2=t\score% _4i+1,4i+2(φ)+ _j≤ 4i j=4i+2e _\φ _4i+2=t% \score_j,4i+2(φ)minitalic_t ∈ t o k start_POSTSUBSCRIPT 4 i + 2 end_POSTSUBSCRIPT divide start_ARG eroman_min ϕ ∣ ϕ start_POSTSUBSCRIPT 4 i + 2 = t end_POSTSUBSCRIPT score4 i + 1 , 4 i + 2 ( ϕ ) end_ARG start_ARG eroman_min ϕ ∣ ϕ start_POSTSUBSCRIPT 4 i + 2 = t end_POSTSUBSCRIPT score4 i + 1 , 4 i + 2 ( ϕ ) + ∑j ≤ 4 i ∨ j = 4 i + 2 eroman_max ϕ ∣ ϕ start_POSTSUBSCRIPT 4 i + 2 = t end_POSTSUBSCRIPT scorej , 4 i + 2 ( ϕ ) end_ARG where tokisubscripttok_it o kitalic_i is the set of all possible tokens in position i and where scores,d(ϕ)subscriptscoreitalic-ϕscore_s,d(φ)scores , d ( ϕ ) and scores,dsoftmax(ϕ)subscriptsuperscriptscoresoftmaxitalic-ϕscore *softmax_s,d(φ)scoresoftmaxitalic_s , d ( ϕ ) refer to the pre- and post-softmax attention scores with destination token in position d and source token in position s and where the input formula is ϕitalic-ϕφϕ. For the full clause, the approach is similar, except we add the term eminϕ∣ϕ4i+2=tscore4i+2,4i+2(ϕ)superscriptsubscriptconditional-setitalic-ϕsubscriptitalic-ϕ42subscriptscore4242italic-ϕe _\φ _4i+2=t\score_4i+2,4i+2(φ)eroman_min ϕ ∣ ϕ start_POSTSUBSCRIPT 4 i + 2 = t end_POSTSUBSCRIPT score4 i + 2 , 4 i + 2 ( ϕ ) to the numerator. Now, we can derive the minimal and maximal scores for and s and d using our decomposition of the QK circuit: minϕ∣ϕd=tscores,d(ϕ)=WQKpos→poss,d+mint′∈toksWQKtok→post′,d+WQKpos→toks,t+mint′∈toksWQKtok→tokt′,tsubscriptconditional-setitalic-ϕsubscriptitalic-ϕsubscriptscoreitalic-ϕsubscriptsuperscriptsubscript→pospossubscriptsuperscript′subscriptsubscriptsuperscriptsubscript→tokpossuperscript′subscriptsuperscriptsubscript→postoksubscriptsuperscript′subscriptsubscriptsuperscriptsubscript→toktoksuperscript′ _\φ _d=t\score_s,d(φ)=W_QK^pos% _s,d+ _t ∈ tok_sW_QK^tok% _t ,d+W_QK^pos→% tok_s,t+ _t ∈ tok_sW_QK^tok→ % tok_t ,tmin ϕ ∣ ϕ start_POSTSUBSCRIPT d = t end_POSTSUBSCRIPT scores , d ( ϕ ) = Witalic_Q Kpos → poss , d + minitalic_t′ ∈ t o k start_POSTSUBSCRIPT s end_POSTSUBSCRIPT Witalic_Q Ktok → post′ , d + Witalic_Q Kpos → toks , t + minitalic_t′ ∈ t o k start_POSTSUBSCRIPT s end_POSTSUBSCRIPT Witalic_Q Ktok → tokt′ , t and similarly for the maximum (noting that if s=ds=ds = d, we must have t′=tsuperscript′t =t′ = t as well). Figure 9: Minimum post-softmax attention weights for each clause. The results are shown in Figure 9, and demonstrate that regardless of the choice of formula, the majority of attention will be placed on the clause as a whole, except in the case of the first clause, in which case any additional attention is to the first ‘(’ token, which contains no useful information. Figure 10: Pre-softmax attention scores by the ’:’ token to each token, by position and token type. Now, we can perform similar analysis on the attention of the ‘:’ token. We observe in Figure 1 that the attention of the ‘:’ token in the first block is near uniform. To see why this occurs, see Figure 10, which fully describes the first-block attention of ‘:’. These are derived from the four QK matrices using the known position c=4040c=40c = 40 of the token. The attention scores are consistently very small and have very little variation. As above, we can use this set of preferences to calculate a lower bound on the attention paid by the ‘:’ token to any individual token: mini,ϕscorei,40softmax(ϕ)≤emini,tscorei,t(:)emini,tscorei,t(:)+40emaxi,tscorei,t(:)subscriptitalic-ϕsubscriptsuperscriptscoresoftmax40italic-ϕsuperscriptsubscriptsubscriptsuperscriptscore:superscriptsubscriptsubscriptsuperscriptscore:40superscriptsubscriptsubscriptsuperscriptscore: _i,φscore *softmax_i,40(φ)≤e _% i,tscore^(:)_i,t e _i,tscore^(:)_i,t+40e^% _i,tscore^(:)_i,tminitalic_i , ϕ scoresoftmaxitalic_i , 40 ( ϕ ) ≤ divide start_ARG eroman_minitalic_i , t score start_POSTSUPERSCRIPT ( : )i , t end_POSTSUPERSCRIPT end_ARG start_ARG eroman_minitalic_i , t score start_POSTSUPERSCRIPT ( : )i , t end_POSTSUPERSCRIPT + 40 eroman_maxitalic_i , t score start_POSTSUPERSCRIPT ( : )i , t end_POSTSUPERSCRIPT end_ARG where scorei,t(:)subscriptsuperscriptscore:score^(:)_i,tscore( : )i , t refers to the “:” token’s pre-softmax attention to token t in position i. We derive that the minimum attention paid to any token by ‘:’ is approximately 0.0209, and we can similarly show that the maximum attention to any token is ≈0.0285absent0.0285≈ 0.0285≈ 0.0285; hence, the “:” token’s attention can never vary far from the ≈0.0244absent0.0244≈ 0.0244≈ 0.0244 paid by uniform attention. Interpretation. ⬇ 1def parse_clauses(formula: list[tok]) -> list[tuple[lit, lit]]: 2 return [(formula[4*i+1], formula[4*i+2]) for i in range(10)] Listing 5: First component’s mechanistic interpretation in Python syntax. Listing LABEL:fig:layer1_code shows our extracted mechanistic interpretation of the first component as Python code. D.2 Second Block as Evaluator Figure 11: Second-block abstract attention preferences. Abstract Attention Analysis. We showed in Section 4.2.1 that the behavior of the first block is accurately captured by an interpretation which, after the application of γ1subscript1 _1γ1, outputs canonical representations for each clause at the second literal positions and is constant at all other positions. Prefix replaceability, in particular, enforces this property directly. Furthermore, we must only study attention with the readout token as the destination to explain the output behavior; hence, we can characterize the behavior of attention in the second block solely by the preferences of each head in the readout token position to the canonical clause representations in the second literal positions, dramatically reducing dimensionality. In this way, the key behavior of attention in the second block is fully described by Figure 11. Each of the four charts shows the preferences on each clause by the corresponding head; the columns correspond to the first literal of the clause and the rows to the second. We see that the behavior of attention is nearly identical for a clause and its equivalent mirror image (i.e. the attention score on (l∨r)(l r)( l ∨ r ) is approximately the same as that on (r∨l)(r l)( r ∨ l )); this is as expected given that, logically, l∨r=r∨l r=r l ∨ r = r ∨ l. We can also see that each head prefers clauses containing certain literals (for instance, head 0 prefers negated literals as well as x0subscript0x_0x0 and x3subscript3x_3x3); each clause receives a high score from some head (so no clauses are overlooked by the attention mechanism). While we might read far more into these patterns, this is a mistake in this case. In this instance, the behavior of attention is, in fact, not relevant to the underlying algorithm and serves only to obscure the underlying behavior. As discussed in Section 4.2.2, we can much better understand attention in concert with the first layer of the MLP; collectively, these components encode the observed evaluating behavior, as discussed in more detail below. Identifying the Key Pathway. We observe that the unembedding vector for UNSAT is nearly exactly the negative of the unembedding vector for SAT (‖WUSAT+WUUNSAT‖≈6.2×10−6normsuperscriptsubscriptSATsuperscriptsubscriptUNSAT6.2superscript106\|W_U^SAT+W_U^UNSAT\|≈ 6.2× 10^-6∥ Witalic_USAT + Witalic_UUNSAT ∥ ≈ 6.2 × 10- 6); hence, the SAT logit and the UNSAT logit are almost exactly negatives of each other; furthermore, the model’s output token is always SAT or UNSAT. Across all formulas in the analysis dataset, the minimum value of the logit associated with the prediction (either SAT or UNSAT) is slightly over 0, while while the maximum logit on any other token across the dataset is below -9. Hence, we can consider the behavior of the SAT logit exclusively. Figure 12: Effect of the residual connection from the second block’s attention mechanism on the SAT logit. x-axis shows the effect on the SAT logit and y-axis is the number of instances in the analysis test dataset. Figure 13: Average total effect on the SAT logit across all relevant neurons from SAT, UNSAT, and formulas satisfiable with particular assignments to the variables. Next, we show that the key pathway that determines classification passes through the second block’s MLP. For an input formula, the effect of the second block’s attention mechanism through the residual connection on the SAT logit simply the dot product of the post-attention embedding in the readout token position with WUSATsuperscriptsubscriptSATW_U^SATWitalic_USAT; as we can see in Figure 13, this value is fairly consistent across the dataset and always negative; so a positive SAT logit and hence a SAT classification must depend on the action of the MLP. Evaluation in the Hidden Neurons. As discussed in Section 4.2.2, only 34 hidden neurons have a significant effect on classification; hence, we focus on the behavior of these neurons. As observed in Figure 13, the effect of the residual stream from the second block attention on the SAT logit reduces the SAT logit by less than 6 units for all formulas in the analysis test dataset. Furthermore, as observed in Section 4.2.2, the model always predicts SAT or UNSAT, and the SAT and UNSAT logits are tied (in particular, they are negatives). Hence, the model will predict SAT when the collective effect of these 34 neurons is to increase SAT by > 6 units. We have observed that the expected behavior of these neurons is strongly dependent on which assignments to the variables satisfy the formula ϕitalic-ϕφϕ. A natural question to ask is whether, whether these neurons will collectively increase the SAT logit sufficiently to output SAT on SAT formulas regardless of which assignment to the variables satisfies the formulas. Calculating this in the same way as we do for an individual neuron in Figure 3, we obtain Figure 13, which shows that, indeed, the collective effect of the evaluating neurons (recall that we refer to the 34 relevant neurons as evaluating neurons) correctly identifies SAT formulas regardless of how the formulas are satisfiable. It may seem odd that the average effect of SAT is significantly below the minimum of the effects for formulas satisfiable with any given assignment, as any SAT formula belongs to one such category; this is an artifact of the fact that no neuron has a large effect for all SAT formulas, and that each specializes in a subset. Now, each of these neurons has a large positive (> 2.9) output coefficient (recall that the output coefficient of a neuron is the weight in the weighted sum expression constructed by composing the output layer of the MLP and the unembedding matrix projected to the SAT logit), so an activation above 2 units on any of the evaluating neurons is enough to force SAT, ignoring all other neurons; hence, the model outputs SAT if any evaluating neuron activates sufficiently strongly. In this sense, the effect of the final output components (the MLP’s output layer, the unembedding matrix, and the residual from attention) can be viewed as an OR operation. However, in reality, individual neurons do not consistently reach activations of ≥2absent2≥ 2≥ 2 (for example, see Figure 3, in which the average output value for formulas satisfiable with assignment TFFFF is approximately 1.5), and multiple neurons recognize a given SAT formula. Hence, moderate activation may be necessary across several neurons to predict SAT in some cases; this may explain the redundancy in evaluation behavior between neurons seen in Appendix G. For simplicity in our interpretation (in particular, to allow replacing a numerical weighted sum with Boolean operations) we handle this by using varying thresholds between the corresponding α and γ functions for the MLP’s hidden layer (in particular, γ2subscript2 _2γ2 outputs an activation of 2 for neurons whose interpretation predicts high activation, while 0.5 is considered a high activation in the case of α2subscript2 _2α2); as noted in the paragraph titled Hidden Layer of Second Block in Section 4.2.3, if γ2subscript2 _2γ2 and α2subscript2 _2α2 both use the lower threshold, replacing the neural network components with our interpretation no longer has a minimal effect on classification. Interpreting Neurons via Decision Trees. We observe that each assignment to the variables has some neuron for which formulas satisfiable with that assignment result in a significant increase to the SAT logit on average, as we’d expect if the model was implementing the natural exhaustive enumeration algorithm for satisfiability checking. In particular, every assignment a has some neuron which increases the SAT logit by at least 2.9 units on average on formulas where a is a satisfying assignment; however, the average-case analysis hides the complexity of the behavior of these neurons. We can see this by noting that for every relevant neuron that on average demonstrates high activation for formulas satisfiable with some assignment a, we are able to find some formula satisfiable with a which fails to result in a sufficient activation for the neuron. Figure 14: F1 scores of decision tree interpretations classifying high (> 0.5) activation of MLP hidden neurons. We’l focus on the general decision tree classifiers discussed in Section 4.2.2 here. We train decision tree classifiers on the thresholded activations of each evaluating neuron on the analysis training set; this allows us to learn an arbitrary Boolean function in the features ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ]. We observe that these decision trees do indeed reliably predict activation. In particular, even very limited-size decision trees are reasonably strong predictors of the behavior of the neurons; we limit our decision trees to four leaf nodes in our experiments. Figures 15, 16, and 17 illustrate the effect on the decision trees as we lift this constraint. Figure 14 shows that more complex decision trees are not significantly better predictors of the behavior of the neurons. Figure 15: Decision tree interpretation for unit 29 of the second block MLP’s hidden layer, maximum leaf nodes: two. Figure 16: Decision tree interpretation for unit 29 of the second block MLP’s hidden layer, maximum leaf nodes: four. Figure 17: Decision tree interpretation for unit 29 of the second block MLP’s hidden layer, unlimited leaf node count. Implementation of Decision Trees by the Model. To see how the model implements the decision trees, we’l consider the effect of the composition of the second block MLP’s hidden layer with the attention mechanism on the canonical representations of the clauses, using the abstract attention approach discussed earlier. As the pre-softmax attention score by each head on the second literal of a clause is a fixed value for each clause (recall that γ1subscript1 _1γ1 leaves the representation of the readout token constant), we can express the effect of attention precisely in terms of the number of occurrences of each clause, as we show next. For head hℎh and clause l∨rl rl ∨ r, let the pre-softmax attention score paid by hℎh to the clause be wl∨rhsuperscriptsubscriptℎw_l r^hwitalic_l ∨ ritalic_h. Moreover, the output of the OV circuit is likewise a fixed vector for each clause, call it vl∨rhsuperscriptsubscriptℎv_l r^hvitalic_l ∨ ritalic_h. Say that our formula contains a list of n clauses (n=1010n=10n = 10 for all our experiments) where the ithsuperscriptthi^thith clause is li∨risubscriptsubscriptl_i r_ilitalic_i ∨ ritalic_i. Then, the attention paid to the ithsuperscriptthi^thith clause by head hℎh is: softmax([wlj∨rjh]j∈[1..n]])i=ewli∨rih∑j=1newlj∨rjh *softmax([w_l_j r_j^h]_j∈[1..n]])_i=e^w_l_% i r_i^h _j=1^ne^w_l_j r_j^hsoftmax ( [ witalic_l start_POSTSUBSCRIPT j ∨ ritalic_j end_POSTSUBSCRIPTh ]j ∈ [ 1 . . n ] ] )i = divide start_ARG eitalic_witalic_l start_POSTSUBSCRIPT i ∨ ritalic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT end_ARG start_ARG ∑j = 1n eitalic_witalic_l start_POSTSUBSCRIPT j ∨ ritalic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT end_ARG and hence the output of the attention mechanism (call it o) at the readout token is, by rearranging terms, e:+∑h=1m∑i=1nsoftmax([wlj∨rjh]j∈[1..n]])ivli∨rih=e:+∑h=1m∑l∈lit,r∈litewl∨rhcountl∨r(ϕ)vl∨rh∑l∈lit,r∈litewl∨rhcountl∨r(ϕ)e_:+ _h=1^m _i=1^n *softmax([w_l_j r_j% ^h]_j∈[1..n]])_iv_l_i r_i^h=e_:+ _h=1^m _l% ∈ lit,r∈ lite^w_l r^hcount_l r(φ)v_l r^h % _l∈ lit,r∈ lite^w_l r^hcount_l r(φ)e: + ∑h = 1m ∑i = 1n softmax ( [ witalic_l start_POSTSUBSCRIPT j ∨ ritalic_j end_POSTSUBSCRIPTh ]j ∈ [ 1 . . n ] ] )i vitalic_l start_POSTSUBSCRIPT i ∨ ritalic_i end_POSTSUBSCRIPTh = e: + ∑h = 1m divide start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) vitalic_l ∨ ritalic_h end_ARG start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) end_ARG where countl∨r(ϕ)subscriptitalic-ϕcount_l r(φ)c o u n titalic_l ∨ r ( ϕ ) is the number of occurrences of l∨rl rl ∨ r in the formula ϕitalic-ϕφϕ and where m is the number of attention heads and where e:subscript:e_:e: is the fixed representation of the readout token ‘:’. Now, consider the action of the MLP’s hidden layer on this value: in particular, consider the effect on a specific neuron n, where wnsubscriptw_nwitalic_n is the corresponding column of the MLP’s input weight matrix and bnsubscriptb_nbitalic_n is the corresponding bias value. The pre-activation score for n will then be: bn+wnTosubscriptsuperscriptsubscript b_n+w_n^Tobitalic_n + witalic_nitalic_T o =bn+wnT(e:+∑h=1m∑l∈lit,r∈litewl∨rhcountl∨r(ϕ)vl∨rh∑l∈lit,r∈litewl∨rhcountl∨r(ϕ))absentsubscriptsuperscriptsubscriptsubscript:superscriptsubscriptℎ1subscriptformulae-sequencesuperscriptsuperscriptsubscriptℎsubscriptitalic-ϕsuperscriptsubscriptℎsubscriptformulae-sequencesuperscriptsuperscriptsubscriptℎsubscriptitalic-ϕ =b_n+w_n^T (e_:+ _h=1^m _l∈ lit,r∈ lit% e^w_l r^hcount_l r(φ)v_l r^h _l∈ lit,r% ∈ lite^w_l r^hcount_l r(φ) )= bitalic_n + witalic_nitalic_T ( e: + ∑h = 1m divide start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) vitalic_l ∨ ritalic_h end_ARG start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) end_ARG ) =bn+wnTe:+∑h=1m∑l∈lit,r∈litewl∨rhcountl∨r(ϕ)wnTvl∨rh∑l∈lit,r∈litewl∨rhcountl∨r(ϕ)absentsubscriptsuperscriptsubscriptsubscript:superscriptsubscriptℎ1subscriptformulae-sequencesuperscriptsuperscriptsubscriptℎsubscriptitalic-ϕsuperscriptsubscriptsuperscriptsubscriptℎsubscriptformulae-sequencesuperscriptsuperscriptsubscriptℎsubscriptitalic-ϕ =b_n+w_n^Te_:+ _h=1^m _l∈ lit,r∈ lite^w% _l r^hcount_l r(φ)w_n^Tv_l r^h _l∈ lit% ,r∈ lite^w_l r^hcount_l r(φ)= bitalic_n + witalic_nitalic_T e: + ∑h = 1m divide start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) witalic_nitalic_T vitalic_l ∨ ritalic_h end_ARG start_ARG ∑l ∈ l i t , r ∈ l i t eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT c o u n titalic_l ∨ r ( ϕ ) end_ARG =cn+∑h=1m∑l∈lit,r∈litcl∨rh,ncountl∨r(ϕ)∑l∈lit,r∈litdl∨rhcountl∨r(ϕ)absentsubscriptsuperscriptsubscriptℎ1subscriptformulae-sequencesuperscriptsubscriptℎsubscriptitalic-ϕsubscriptformulae-sequencesuperscriptsubscriptℎsubscriptitalic-ϕ =c_n+ _h=1^m _l∈ lit,r∈ litc_l r^h,n% count_l r(φ) _l∈ lit,r∈ litd_l r^hcount_l r% (φ)= citalic_n + ∑h = 1m divide start_ARG ∑l ∈ l i t , r ∈ l i t citalic_l ∨ ritalic_h , n c o u n titalic_l ∨ r ( ϕ ) end_ARG start_ARG ∑l ∈ l i t , r ∈ l i t ditalic_l ∨ ritalic_h c o u n titalic_l ∨ r ( ϕ ) end_ARG where we define the following constants: cn=bn+wnTe:subscriptsubscriptsuperscriptsubscriptsubscript:c_n=b_n+w_n^Te_:citalic_n = bitalic_n + witalic_nitalic_T e:, dl∨rh=ewl∨rhsuperscriptsubscriptℎsuperscriptsuperscriptsubscriptℎd_l r^h=e^w_l r^hditalic_l ∨ ritalic_h = eitalic_witalic_l ∨ r start_POSTSUPERSCRIPT h end_POSTSUPERSCRIPT, cl∨rh,n=dl∨rhwnTvl∨rhsuperscriptsubscriptℎsuperscriptsubscriptℎsuperscriptsubscriptsuperscriptsubscriptℎc_l r^h,n=d_l r^hw_n^Tv_l r^hcitalic_l ∨ ritalic_h , n = ditalic_l ∨ ritalic_h witalic_nitalic_T vitalic_l ∨ ritalic_h. Note that the numerator and denominator of the fraction describing the contribution of head hℎh are both linear functions of the number of occurrences of each clause in the formula. We’l briefly describe how ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] and ¬ϕ[a]italic-ϕdelimited-[] φ[a]¬ ϕ [ a ] can be implemented with a single neuron for an arbitrary assignment a to the variables in this formulation (in particular, we can ensure activation if and only if the condition that ϕitalic-ϕφϕ is satisfiable by a or not satisfiable by a holds), and then analyze the behavior of a neuron with an interpretation of this form. As the formula ϕitalic-ϕφϕ is in conjunctive normal form, ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] is true if and only if none of the clauses l∨rl rl ∨ r in ϕitalic-ϕφϕ evaluate to false given the assignment a to the variables (in particular, this holds if neither l nor r holds given a). We can then implement ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] in neuron n if we let dl∨rhsuperscriptsubscriptℎd_l r^hditalic_l ∨ ritalic_h be constant, cl∨rh,n=0superscriptsubscriptℎ0c_l r^h,n=0citalic_l ∨ ritalic_h , n = 0 for l∨rl rl ∨ r which are true given assignment a, and let cl∨rh,n=−∞superscriptsubscriptℎc_l r^h,n=-∞citalic_l ∨ ritalic_h , n = - ∞ for l∨rl rl ∨ r which are false given a, and then let cnsubscriptc_ncitalic_n be a large positive number. Then, if ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ], countl∨r(ϕ)=0subscriptitalic-ϕ0count_l r(φ)=0c o u n titalic_l ∨ r ( ϕ ) = 0 for l∨rl rl ∨ r unsatisfiable given a, so the output is cnsubscriptc_ncitalic_n. Otherwise, if ¬ϕ[a]italic-ϕdelimited-[] φ[a]¬ ϕ [ a ], countl∨r(ϕ)>0subscriptitalic-ϕ0count_l r(φ)>0c o u n titalic_l ∨ r ( ϕ ) > 0 and so the output is −∞-∞- ∞. After applying ReLU, the activation of unit n is a large positive number when ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] and zero otherwise. Similarly, we can implement ¬ϕ[a]italic-ϕdelimited-[] φ[a]¬ ϕ [ a ] by assigning cnsubscriptc_ncitalic_n to a negative number and letting cl∨rh,n=∞superscriptsubscriptℎc_l r^h,n=∞citalic_l ∨ ritalic_h , n = ∞ for l∨rl rl ∨ r which are false given a. Now, suppose that we want to implement some DNF expression e in the ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ] without negation of any ϕ[…]italic-ϕdelimited-[]…φ[...]ϕ [ … ]. Say that e contains k conjuncts eisubscripte_ieitalic_i for 1≤i≤k11≤ i≤ k1 ≤ i ≤ k and that the attention mechanism has at least k heads. We can extend the ideas in the implementation of ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] to one for e as follows: let cn=ksubscriptc_n=kcitalic_n = k; now, consider head hℎh with corresponding conjunct eisubscripte_ieitalic_i as in the ϕ[a]italic-ϕdelimited-[]φ[a]ϕ [ a ] case, let cl∨rh,n=0superscriptsubscriptℎ0c_l r^h,n=0citalic_l ∨ ritalic_h , n = 0 for l∨rl rl ∨ r which are not true for every assignment a appearing in eisubscripte_ieitalic_i. As in that case, ϕitalic-ϕφϕ satisfies eisubscripte_ieitalic_i if and only if no clause violates the constraint (following from the assumption that eisubscripte_ieitalic_i is a conjunction of ϕ[aj]italic-ϕdelimited-[]subscriptφ[a_j]ϕ [ aitalic_j ] which appear without negation, as, for an eisubscripte_ieitalic_i of this form to be satisfied, each clause in ϕitalic-ϕφϕ must be true for each assignment ajsubscripta_jaitalic_j). Hence, if we let cl∨rh,n=−1superscriptsubscriptℎ1c_l r^h,n=-1citalic_l ∨ ritalic_h , n = - 1 for l∨rl rl ∨ r which violate eisubscripte_ieitalic_i, let dl∨rh=1superscriptsubscriptℎ1d_l r^h=1ditalic_l ∨ ritalic_h = 1 for such l∨rl rl ∨ r and dl∨rhsuperscriptsubscriptℎd_l r^hditalic_l ∨ ritalic_h some negligible value for l∨rl rl ∨ r which do not violate the condition, the contribution of head hℎh to the expression is 0 if eisubscripte_ieitalic_i is satisfied and −11-1- 1 otherwise (as the numerator will be −v-v- v and the denominator will be v+ϵitalic-ϵv+ + ϵ where v is the number of clauses violating the condition). Hence, if all conjuncts eisubscripte_ieitalic_i fail to be satisfied, the output will be 0, else, some eisubscripte_ieitalic_i must be satisfied, so the output will be at least k−(k−1)=111k-(k-1)=1k - ( k - 1 ) = 1, and, hence, the neuron will activate. Interpretation. ⬇ 1def evaluate_satisfiability( 2 clauses: list[tuple[lit, lit]], 3 neuron_interpretations: list[Callable[[list[tuple[lit, lit]]], bool]] 4) -> list[bool]: 5 return [ 6 neuron_interp(clauses) 7 for neuron_interp in neuron_interpretations] 8 9def predict_satisfiability(satisfiabilities: list[bool]) -> bool: 10 return any(satisfiabilities) Listing 6: Second block’s mechanistic interpretation. First, it applies a series of functions, namely, the neuron interpretations, to the parsed clauses from the first block (evaluate_satisfiability) and then, it applies an OR operation (predict_satisfiability). ⬇ 1def abstract_model(formula: list[tok]) -> bool: 2 return predict_satisfiability( 3 evaluate_satisfiability( 4 parse_clauses(formula))) Listing 7: Mechanistic interpretation of full model; see Listings LABEL:fig:layer1_code and LABEL:fig:layer2_code for the component interpretations. Listing LABEL:fig:layer2_code shows our derived mechanistic interpretation for the second block. Listing LABEL:fig:abstract_model shows the mechanistic interpretation for the entire model. Appendix E Importance of Prefix Axioms: Theoretical Evidence While it may appear that each componentwise axiom implies the corresponding prefix axiom, for instance, that Axiom 2 implies Axiom 1, this is not the case. While it is possible to derive a bound, the resulting bound is extremely weak, and, for Axioms 3 and 4, we cannot derive even such a weak bound without additional assumptions. E.1 Equivalence Axioms Suppose that Axiom 2, component equivalence, holds with a fixed ϵ0subscriptitalic-ϵ0 _0ϵ0. We will show that prefix equivalence (Axiom 1) holds at component i with an ϵitalic-ϵεϵ of iϵ0subscriptitalic-ϵ0i _0i ϵ0; if the number of components l is at least ⌈1ϵ0⌉1subscriptitalic-ϵ0 1 _0 ⌈ divide start_ARG 1 end_ARG start_ARG ϵ0 end_ARG ⌉ the bound becomes useless. Let PEisubscriptPE_iP Eitalic_i be the event that x∼Dsimilar-tox Dx ∼ D satisfies the prefix equivalence condition for component i, i.e. that αi∘dt[:i+1](x)=dh[:i+1]∘α0(x) _i d_t[:i+1](x)=d_h[:i+1] _0(x)αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 1 ] ∘ α0 ( x ) and and let CEisubscriptCE_iC Eitalic_i be the event that x satisfies the component equivalence condition for component i, i.e. that αi∘dt[:i+1](x)=dh[i]∘αi−1∘dt[:i](x) _i d_t[:i+1](x)=d_h[i] _i-1 d_t[:i](x)αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ i ] ∘ αitalic_i - 1 ∘ ditalic_t [ : i ] ( x ). Call Pr[PEi]delimited-[]subscriptPr[PE_i]P r [ P Eitalic_i ] as pisubscriptp_ipitalic_i. Now, pi+1subscript1 p_i+1pitalic_i + 1 =Prx∼[αi+1∘dt[:i+2](x)=dh[:i+2]∘α0(x)] = x Pr[ _i+1 d_t[:i+2](x)% =d_h[:i+2] _0(x)]= start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_i + 1 ∘ ditalic_t [ : i + 2 ] ( x ) = ditalic_h [ : i + 2 ] ∘ α0 ( x ) ] ≥Prx∼[αi+1∘dt[:i+2](x)=dh[i+1]∘αi∘dt[:i+1](x)∧dh[i+1]∘αi∘dt[:i+1](x)=dh[:i+2]∘α0(x)] ≥ x Pr[ _i+1 d_t[:i+2]% (x)=d_h[i+1] _i d_t[:i+1](x) d_h[i+1] _% i d_t[:i+1](x)=d_h[:i+2] _0(x)]≥ start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_i + 1 ∘ ditalic_t [ : i + 2 ] ( x ) = ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 2 ] ∘ α0 ( x ) ] ≥Prx∼[αi+1∘dt[:i+2](x)=dh[i+1]∘αi∘dt[:i+1](x)∧αi∘dt[:i+1](x)=dh[:i+1]∘α0(x)] ≥ x Pr[ _i+1 d_t[:i+2]% (x)=d_h[i+1] _i d_t[:i+1](x) _i d_t[:i% +1](x)=d_h[:i+1] _0(x)]≥ start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ αitalic_i + 1 ∘ ditalic_t [ : i + 2 ] ( x ) = ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 1 ] ∘ α0 ( x ) ] =Pr[CEi+1∧PEi],absentdelimited-[]subscript1subscript =Pr[CE_i+1 PE_i],= P r [ C Eitalic_i + 1 ∧ P Eitalic_i ] , noting that αi∘dt[:i+1](x)=dh[:i+1]∘α0(x) _i d_t[:i+1](x)=d_h[:i+1] _0(x)αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 1 ] ∘ α0 ( x ) implies dh[i+1]∘αi∘dt[:i+1](x)=dh[:i+2]∘α0(x)d_h[i+1] _i d_t[:i+1](x)=d_h[:i+2] _0(x)ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) = ditalic_h [ : i + 2 ] ∘ α0 ( x ) by the definition of the dh[:]subscriptℎdelimited-[]:d_h[:]ditalic_h [ : ] notation. Now, we have that Pr[CEi+1∧PEi]delimited-[]subscript1subscript Pr[CE_i+1 PE_i]P r [ C Eitalic_i + 1 ∧ P Eitalic_i ] =1−Pr[¬(CEi+1∧PEi)]absent1delimited-[]subscript1subscript =1-Pr[ (CE_i+1 PE_i)]= 1 - P r [ ¬ ( C Eitalic_i + 1 ∧ P Eitalic_i ) ] =1−Pr[¬CEi+1∨¬PEi]absent1delimited-[]subscript1subscript =1-Pr[ CE_i+1 PE_i]= 1 - P r [ ¬ C Eitalic_i + 1 ∨ ¬ P Eitalic_i ] ≥1−(Pr[¬CEi+1]+Pr[¬PEi]).absent1delimited-[]subscript1delimited-[]subscript ≥ 1-(Pr[ CE_i+1]+Pr[ PE_i]).≥ 1 - ( P r [ ¬ C Eitalic_i + 1 ] + P r [ ¬ P Eitalic_i ] ) . Hence, as Pr[CEi+1]≥1−ϵ0delimited-[]subscript11subscriptitalic-ϵ0Pr[CE_i+1]≥ 1- _0P r [ C Eitalic_i + 1 ] ≥ 1 - ϵ0 by component equivalence, we have pi+1≥1−(ϵ0+(1−pi))subscript11subscriptitalic-ϵ01subscriptp_i+1≥ 1-( _0+(1-p_i))pitalic_i + 1 ≥ 1 - ( ϵ0 + ( 1 - pitalic_i ) ) and so pi+1≥pi−ϵ0subscript1subscriptsubscriptitalic-ϵ0p_i+1≥ p_i- _0pitalic_i + 1 ≥ pitalic_i - ϵ0. We have p1≥1−ϵ0subscript11subscriptitalic-ϵ0p_1≥ 1- _0p1 ≥ 1 - ϵ0 as prefix equivalence and component equivalence are identical for the first component, and so pi≥1−iϵ0subscript1subscriptitalic-ϵ0p_i≥ 1-i _0pitalic_i ≥ 1 - i ϵ0, and we are done. If we assume that CEi+1subscript1CE_i+1C Eitalic_i + 1 and PEisubscriptPE_iP Eitalic_i are independent, we could derive a stronger bound: Pr[CEi+1∧PEi]=Pr[CEi+1]Pr[PEi]≥(1−ϵ0)pidelimited-[]subscript1subscriptdelimited-[]subscript1delimited-[]subscript1subscriptitalic-ϵ0subscriptPr[CE_i+1 PE_i]=Pr[CE_i+1]Pr[PE_i]≥(1- _0)p_iP r [ C Eitalic_i + 1 ∧ P Eitalic_i ] = P r [ C Eitalic_i + 1 ] P r [ P Eitalic_i ] ≥ ( 1 - ϵ0 ) pitalic_i from which we can derive pi≥(1−ϵ0)isubscriptsuperscript1subscriptitalic-ϵ0p_i≥(1- _0)^ipitalic_i ≥ ( 1 - ϵ0 )i. However, even this stronger bound grows very weak as depth increases, illustrating the need to evaluate a mechanistic interpretation compositionally. E.2 Replaceability Axioms We can only derive a bound in the case that γi=αi−1subscriptsuperscriptsubscript1 _i= _i^-1γitalic_i = αitalic_i- 1 and that all concrete components are invertible. Assume component replaceability, Axiom 4 is satisfied. Similarly to Appendix E.1 let the PRisubscriptPR_iP Ritalic_i and CRisubscriptCR_iC Ritalic_i be the events that x∼Dsimilar-tox Dx ∼ D satisfies the prefix replaceability and component replaceability conditions for component i, i.e. that t(x)=dt[i+1:]∘γi∘dh[:i+1]∘α0(x)t(x)=d_t[i+1:] _i d_h[:i+1] _0(x)t ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ : i + 1 ] ∘ α0 ( x ) and that t(x)=dt[i+1:]∘γi∘dh[i]∘αi−1∘dt[:i](x)t(x)=d_t[i+1:] _i d_h[i] _i-1 d_t[:i](x)t ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ i ] ∘ αitalic_i - 1 ∘ ditalic_t [ : i ] ( x ), respectively. By the assumption that all concrete components are invertible, PRisubscriptPR_iP Ritalic_i iff dt[:i+1]=γi∘dh[:i+1]∘α0(x)d_t[:i+1]= _i d_h[:i+1] _0(x)ditalic_t [ : i + 1 ] = γitalic_i ∘ ditalic_h [ : i + 1 ] ∘ α0 ( x ) and CRisubscriptCR_iC Ritalic_i iff dt[:i+1]=γi∘dh[i]∘αi−1∘dt[:i](x)d_t[:i+1]= _i d_h[i] _i-1 d_t[:i](x)ditalic_t [ : i + 1 ] = γitalic_i ∘ ditalic_h [ i ] ∘ αitalic_i - 1 ∘ ditalic_t [ : i ] ( x ). Call Pr[PRi]delimited-[]subscriptPr[PR_i]P r [ P Ritalic_i ] pisubscriptp_ipitalic_i. Now, pi+1subscript1 p_i+1pitalic_i + 1 =Prx∼[t(x)=dt[i+2:]∘γi∘dh[:i+2]∘α0(x)] = x Pr[t(x)=d_t[i+2:] _i% d_h[:i+2] _0(x)]= start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ t ( x ) = ditalic_t [ i + 2 : ] ∘ γitalic_i ∘ ditalic_h [ : i + 2 ] ∘ α0 ( x ) ] =Prx∼[dt[:i+2]=γi+1∘dh[:i+2]∘α0(x)] = x Pr[d_t[:i+2]= _i+1 d_% h[:i+2] _0(x)]= start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ ditalic_t [ : i + 2 ] = γitalic_i + 1 ∘ ditalic_h [ : i + 2 ] ∘ α0 ( x ) ] ≥Prx∼[dt[:i+2]=γi+1∘dh[i+1]∘αi∘dt[:i+1](x)∧γi+1∘dh[:i+2]∘α0(x)=γi+1∘dh[i+1]∘αi∘dt[:i+1](x)] ≥ x Pr [ array[]ld_t[% :i+2]= _i+1 d_h[i+1] _i d_t[:i+1](x) \\ _i+1 d_h[:i+2] _0(x)= _i+1 d_h[i+1]% _i d_t[:i+1](x) array ]≥ start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL ditalic_t [ : i + 2 ] = γitalic_i + 1 ∘ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ end_CELL end_ROW start_ROW start_CELL γitalic_i + 1 ∘ ditalic_h [ : i + 2 ] ∘ α0 ( x ) = γitalic_i + 1 ∘ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) end_CELL end_ROW end_ARRAY ] ≥Prx∼[dt[:i+2]=γi+1∘dh[i+1]∘αi∘dt[:i+1](x)∧dh[:i+1]∘α0(x)=αi∘dt[:i+1](x)] ≥ x Pr [ array[]ld_t[% :i+2]= _i+1 d_h[i+1] _i d_t[:i+1](x) \\ d_h[:i+1] _0(x)= _i d_t[:i+1](x) array ]≥ start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL ditalic_t [ : i + 2 ] = γitalic_i + 1 ∘ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ end_CELL end_ROW start_ROW start_CELL ditalic_h [ : i + 1 ] ∘ α0 ( x ) = αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) end_CELL end_ROW end_ARRAY ] =Prx∼[dt[i+2:]∘dt[:i+2]=dt[i+2:]∘γi+1∘dh[i+1]∘αi∘dt[:i+1](x)∧dt[i+1:]∘γi∘dh[:i+1]∘α0(x)=dt[i+1:]∘γi∘αi∘dt[:i+1](x)] = x Pr [ array[]ld_t[i+2% :] d_t[:i+2]=d_t[i+2:] _i+1 d_h[i+1] _i% d_t[:i+1](x) \\ d_t[i+1:] _i d_h[:i+1] _0(x)=d_t[i+1:] % _i _i d_t[:i+1](x) array ]= start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL ditalic_t [ i + 2 : ] ∘ ditalic_t [ : i + 2 ] = ditalic_t [ i + 2 : ] ∘ γitalic_i + 1 ∘ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ end_CELL end_ROW start_ROW start_CELL ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ : i + 1 ] ∘ α0 ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) end_CELL end_ROW end_ARRAY ] =Prx∼[t(x)=dt[i+2:]∘γi+1∘dh[i+1]∘αi∘dt[:i+1](x)∧t(x)=dt[i+1:]∘γi∘dh[:i+1]∘α0(x)] = x Pr [ array[]lt(x)=d_t% [i+2:] _i+1 d_h[i+1] _i d_t[:i+1](x)% \\ t(x)=d_t[i+1:] _i d_h[:i+1] _0(x) array ]= start_UNDERACCENT x ∼ D end_UNDERACCENT start_ARG P r end_ARG [ start_ARRAY start_ROW start_CELL t ( x ) = ditalic_t [ i + 2 : ] ∘ γitalic_i + 1 ∘ ditalic_h [ i + 1 ] ∘ αitalic_i ∘ ditalic_t [ : i + 1 ] ( x ) ∧ end_CELL end_ROW start_ROW start_CELL t ( x ) = ditalic_t [ i + 1 : ] ∘ γitalic_i ∘ ditalic_h [ : i + 1 ] ∘ α0 ( x ) end_CELL end_ROW end_ARRAY ] =Pr[CRi+1∧PRi]absentdelimited-[]subscript1subscript =Pr[CR_i+1 PR_i]= P r [ C Ritalic_i + 1 ∧ P Ritalic_i ] by the assumption that all concrete components are invertible and that γi=αi−1subscriptsuperscriptsubscript1 _i= _i^-1γitalic_i = αitalic_i- 1. The proof proceeds as before. In particular, even with strong conditions such as these, we derive a bound which grows very weak as depth increases, and hence, we observe maintaining Axioms 1 and 3 in addition to Axioms 2 and 4 is essential for a reliable evaluation. See Appendix F for empirical evidence for this fact. Appendix F Importance of Prefix Axioms: Empirical Evidence We can demonstrate the importance of the prefix axioms by simulating errors in the interpretation of the 2-SAT model; specifically, every clause output by the first component of the abstract model is replaced by a randomly sampled clause 1% of the time (the slight difference in e.g. the epsilons for component and prefix equivalence for the first component are due to independent sampling); results are shown in Table 1. The evaluation of the componentwise axioms are identical to the original results except on the first component, as these axioms do not take error propagation into account. The prefix axioms, on the other hand, show the extent to which errors in the earlier components result in downstream errors. Taking error propagation into account is particularly important when we consider deeper models. While we can derive worst-case compositional bounds from the componentwise results, this may not actually describe model behavior; in particular, as the number of components increases, the derived bound rapidly becomes meaningless (see Appendix E for more details). For example, in our case, the aggregation operation in the final component enables prefix equivalence to improve with depth; hence, restricting ourselves to the pure worst-case analysis derived from component-only analysis would suggest that our interpretation is far less reliable end-to-end than is actually the case. Table 1: Comparison of results of experiments with synthetic error injected into the first component of the abstract model to original experimental results. Axiom Noised first abstract component Original abstract model Component 1 Component 2 Component 3 Component 1 Component 2 Component 3 Axiom 1: Prefix Equivalence 0.0955 0.212 0.0315 0.0000374 0.182 0.0128 Axiom 2: Component Equivalence 0.0942 0.182 0.00433 0.0000374 0.182 0.00433 Axiom 3: Prefix Replaceability 0.0583 0.0312 0.0318 0.0418 0.0128 0.0128 Axiom 4: Component Replaceability 0.0581 0.0128 0.00433 0.0418 0.0128 0.00433 Appendix G Neuron Interpretations for the 2-SAT Model Tables 2 and 3 show the decision tree based and disjunction-only neuron interpretations, respectively. Table 2: General neuron interpretations derived by decision tree learning on neuron activations. A neuron activates if the corresponding condition in the second column holds. Neuron Interpretation (General Case) 10 ϕ[TFFFF]italic-ϕdelimited-[]φ[TFFFF]ϕ [ T F F F F ] 29 ϕ[FTTFF]∧¬ϕ[FFFFT]∧¬ϕ[TTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTTFF] φ[FFFFT] φ[T]ϕ [ F T T F F ] ∧ ¬ ϕ [ F F F F T ] ∧ ¬ ϕ [ T T T T T ] 36 ϕ[TTFFF]∧¬ϕ[FTFTF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTFFF] φ[FTFTF]ϕ [ T T F F F ] ∧ ¬ ϕ [ F T F T F ] 41 ϕ[FFFFF]italic-ϕdelimited-[]φ[F]ϕ [ F F F F F ] 48 (¬ϕ[FTTFF]∧ϕ[FTFFF])∨ϕ[FTTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]( φ[FTTFF] φ[FTFFF]) φ[FTTFF]( ¬ ϕ [ F T T F F ] ∧ ϕ [ F T F F F ] ) ∨ ϕ [ F T T F F ] 55 ϕ[TTTFF]italic-ϕdelimited-[]φ[TTTFF]ϕ [ T T T F F ] 77 ϕ[FFTTT]italic-ϕdelimited-[]φ[FFTTT]ϕ [ F F T T T ] 81 ϕ[FTTTF]italic-ϕdelimited-[]φ[FTTTF]ϕ [ F T T T F ] 96 ϕ[TFTTF]∧¬ϕ[FFTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFTTF] φ[FFTFF]ϕ [ T F T T F ] ∧ ¬ ϕ [ F F T F F ] 150 ϕ[FTFTF]∧¬ϕ[TTFFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTFTF] φ[TTFFF]ϕ [ F T F T F ] ∧ ¬ ϕ [ T T F F F ] 185 ϕ[FFFTT]italic-ϕdelimited-[]φ[FFFTT]ϕ [ F F F T T ] 189 ϕ[FTTTT]∧¬ϕ[TTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTTTT] φ[TTTFT]ϕ [ F T T T T ] ∧ ¬ ϕ [ T T T F T ] 195 ϕ[TFFTF]∧¬ϕ[TTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFFTF] φ[T]ϕ [ T F F T F ] ∧ ¬ ϕ [ T T T T T ] 201 ϕ[TTFFT]∧¬ϕ[TFTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTFFT] φ[TFTFF]ϕ [ T T F F T ] ∧ ¬ ϕ [ T F T F F ] 224 ϕ[FFFFT]italic-ϕdelimited-[]φ[FFFFT]ϕ [ F F F F T ] 231 ϕ[FTFTT]∧¬ϕ[FFTTF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTFTT] φ[FFTTF]ϕ [ F T F T T ] ∧ ¬ ϕ [ F F T T F ] 245 ϕ[FTFFT]italic-ϕdelimited-[]φ[FTFFT]ϕ [ F T F F T ] 261 ϕ[TTFTT]italic-ϕdelimited-[]φ[TTFTT]ϕ [ T T F T T ] 291 ϕ[TFTTT]italic-ϕdelimited-[]φ[TFTTT]ϕ [ T F T T T ] 304 ϕ[FFTTF]∧¬ϕ[TFTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FFTTF] φ[TFTFF]ϕ [ F F T T F ] ∧ ¬ ϕ [ T F T F F ] 317 ϕ[TFFFT]italic-ϕdelimited-[]φ[TFFFT]ϕ [ T F F F T ] 326 ϕ[TTTTF]italic-ϕdelimited-[]φ[TTTTF]ϕ [ T T T T F ] 334 ϕ[FTTTF]∧¬ϕ[FFFTT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTTTF] φ[FFFTT]ϕ [ F T T T F ] ∧ ¬ ϕ [ F F F T T ] 374 ϕ[TTFTT]∧¬ϕ[TFTTF]∧¬ϕ[FFFFF]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTFTT] φ[TFTTF] φ[F]ϕ [ T T F T T ] ∧ ¬ ϕ [ T F T T F ] ∧ ¬ ϕ [ F F F F F ] 380 ϕ[TFFTT]italic-ϕdelimited-[]φ[TFFTT]ϕ [ T F F T T ] 411 ϕ[TTFFT]italic-ϕdelimited-[]φ[TTFFT]ϕ [ T T F F T ] 416 ϕ[TTTFF]italic-ϕdelimited-[]φ[TTTFF]ϕ [ T T T F F ] 435 (ϕ[TTFTF]∧¬ϕ[FTFFF])∨(ϕ[TTFTF]∧ϕ[FTFFF]∧ϕ[TFFTF])italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[](φ[TTFTF] φ[FTFFF]) (φ[TTFTF] φ[FTFFF] φ[% TFFTF])( ϕ [ T T F T F ] ∧ ¬ ϕ [ F T F F F ] ) ∨ ( ϕ [ T T F T F ] ∧ ϕ [ F T F F F ] ∧ ϕ [ T F F T F ] ) 450 ϕ[FFTFT]∧¬ϕ[FTFFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FFTFT] φ[FTFFF]ϕ [ F F T F T ] ∧ ¬ ϕ [ F T F F F ] 482 ϕ[TTTTT]∧¬ϕ[FTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[T] φ[FTTFT]ϕ [ T T T T T ] ∧ ¬ ϕ [ F T T F T ] 490 (ϕ[FTTFT]∧¬ϕ[TTTTT])∨(ϕ[FTTFT]∧ϕ[TTTTT]∧¬ϕ[TTTFT])italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[](φ[FTTFT] φ[T]) (φ[FTTFT] φ[T] φ% [TTTFT])( ϕ [ F T T F T ] ∧ ¬ ϕ [ T T T T T ] ) ∨ ( ϕ [ F T T F T ] ∧ ϕ [ T T T T T ] ∧ ¬ ϕ [ T T T F T ] ) 492 ϕ[TFTFT]italic-ϕdelimited-[]φ[TFTFT]ϕ [ T F T F T ] 495 ϕ[FFFTF]italic-ϕdelimited-[]φ[FFFTF]ϕ [ F F F T F ] 499 (ϕ[FFTFF]∧¬ϕ[TFTTF])∨(ϕ[FFTFF]∧ϕ[TFTTF]∧ϕ[FFTTF])italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[](φ[FFTFF] φ[TFTTF]) (φ[FFTFF] φ[TFTTF] φ[% FFTTF])( ϕ [ F F T F F ] ∧ ¬ ϕ [ T F T T F ] ) ∨ ( ϕ [ F F T F F ] ∧ ϕ [ T F T T F ] ∧ ϕ [ F F T T F ] ) Table 3: Disjunction-only neuron interpretations derived by finding satisfying assignments that are correlated with a high average neuron activation value over the analysis training set. A neuron activates if the corresponding condition in the second column holds. Neuron Interpretation (Disjunction-Only) 10 ϕ[TFFFF]∨ϕ[TFTFF]∨ϕ[TTFFF]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFFFF] φ[TFTFF] φ[TTFFF]ϕ [ T F F F F ] ∨ ϕ [ T F T F F ] ∨ ϕ [ T T F F F ] 29 ϕ[FTTFF]italic-ϕdelimited-[]φ[FTTFF]ϕ [ F T T F F ] 36 ϕ[TTFFF]italic-ϕdelimited-[]φ[TTFFF]ϕ [ T T F F F ] 41 ϕ[FFFFF]∨ϕ[FTFFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[F] φ[FTFFF]ϕ [ F F F F F ] ∨ ϕ [ F T F F F ] 48 ϕ[FTFFF]∨ϕ[FTTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTFFF] φ[FTTFF]ϕ [ F T F F F ] ∨ ϕ [ F T T F F ] 55 ϕ[TTTFF]∨ϕ[TTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTTFF] φ[TTTFT]ϕ [ T T T F F ] ∨ ϕ [ T T T F T ] 77 ϕ[FFTTF]∨ϕ[FFTTT]∨ϕ[FTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FFTTF] φ[FFTTT] φ[FTTTT]ϕ [ F F T T F ] ∨ ϕ [ F F T T T ] ∨ ϕ [ F T T T T ] 81 ϕ[FTTTF]∨ϕ[FTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTTTF] φ[FTTTT]ϕ [ F T T T F ] ∨ ϕ [ F T T T T ] 96 ϕ[TFTTF]italic-ϕdelimited-[]φ[TFTTF]ϕ [ T F T T F ] 150 ϕ[FTFTF]italic-ϕdelimited-[]φ[FTFTF]ϕ [ F T F T F ] 185 ϕ[FFFTT]italic-ϕdelimited-[]φ[FFFTT]ϕ [ F F F T T ] 189 ϕ[FTTTT]italic-ϕdelimited-[]φ[FTTTT]ϕ [ F T T T T ] 195 ϕ[TFFTF]∨ϕ[TFTTF]∨ϕ[TTFTF]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFFTF] φ[TFTTF] φ[TTFTF]ϕ [ T F F T F ] ∨ ϕ [ T F T T F ] ∨ ϕ [ T T F T F ] 201 ϕ[TTFFT]italic-ϕdelimited-[]φ[TTFFT]ϕ [ T T F F T ] 224 ϕ[FFFFT]italic-ϕdelimited-[]φ[FFFFT]ϕ [ F F F F T ] 231 ϕ[FTFTT]∨ϕ[FTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTFTT] φ[FTTTT]ϕ [ F T F T T ] ∨ ϕ [ F T T T T ] 245 ϕ[FTFFF]∨ϕ[FTFFT]∨ϕ[FTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FTFFF] φ[FTFFT] φ[FTTFT]ϕ [ F T F F F ] ∨ ϕ [ F T F F T ] ∨ ϕ [ F T T F T ] 261 ϕ[TTFTT]italic-ϕdelimited-[]φ[TTFTT]ϕ [ T T F T T ] 291 ϕ[TFTTF]∨ϕ[TFTTT]∨ϕ[TTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFTTF] φ[TFTTT] φ[T]ϕ [ T F T T F ] ∨ ϕ [ T F T T T ] ∨ ϕ [ T T T T T ] 304 ϕ[FFTTF]italic-ϕdelimited-[]φ[FFTTF]ϕ [ F F T T F ] 317 ϕ[TFFFT]italic-ϕdelimited-[]φ[TFFFT]ϕ [ T F F F T ] 326 ϕ[TTFTF]∨ϕ[TTTTF]∨ϕ[TTTTT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTFTF] φ[TTTTF] φ[T]ϕ [ T T F T F ] ∨ ϕ [ T T T T F ] ∨ ϕ [ T T T T T ] 334 ϕ[FTTTF]italic-ϕdelimited-[]φ[FTTTF]ϕ [ F T T T F ] 374 ϕ[TTFTT]italic-ϕdelimited-[]φ[TTFTT]ϕ [ T T F T T ] 380 ϕ[TFFTT]italic-ϕdelimited-[]φ[TFFTT]ϕ [ T F F T T ] 411 ϕ[TTFFT]∨ϕ[TTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TTFFT] φ[TTTFT]ϕ [ T T F F T ] ∨ ϕ [ T T T F T ] 416 ϕ[TFTFF]∨ϕ[TTTFF]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFTFF] φ[TTTFF]ϕ [ T F T F F ] ∨ ϕ [ T T T F F ] 435 ϕ[TTFTF]italic-ϕdelimited-[]φ[TTFTF]ϕ [ T T F T F ] 450 ϕ[FFTFF]∨ϕ[FFTFT]∨ϕ[FTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[FFTFF] φ[FFTFT] φ[FTTFT]ϕ [ F F T F F ] ∨ ϕ [ F F T F T ] ∨ ϕ [ F T T F T ] 482 ϕ[TTTTT]italic-ϕdelimited-[]φ[T]ϕ [ T T T T T ] 490 ϕ[FTTFT]italic-ϕdelimited-[]φ[FTTFT]ϕ [ F T T F T ] 492 ϕ[TFTFF]∨ϕ[TFTFT]∨ϕ[TTTFT]italic-ϕdelimited-[]italic-ϕdelimited-[]italic-ϕdelimited-[]φ[TFTFF] φ[TFTFT] φ[TTTFT]ϕ [ T F T F F ] ∨ ϕ [ T F T F T ] ∨ ϕ [ T T T F T ] 495 ϕ[FFFTF]italic-ϕdelimited-[]φ[FFFTF]ϕ [ F F F T F ] 499 ϕ[FFTFF]italic-ϕdelimited-[]φ[FFTFF]ϕ [ F F T F F ] Appendix H More Details on Mechanistic Interpretability Analysis of the Modular Addition Model Pseudocode for the abstract model is shown in Listing LABEL:lst:grokking_code. The concrete model is likewise broken into three corresponding components. The embedding matrix corresponds to encoding_of_inputs, the attention mechanism and the input layer of the MLP correspond to sum_of_angles, and the output layer of the MLP and the unembedding matrix correspond to difference_of_angles_argmax. The abstraction and concretization operators are linear maps learned using the continuous-valued abstract values (e.g. prior to rounding) and the corresponding concrete intermediate representations on the training set, using the original train/test split of Nanda et al. (2023a). Note that, in Nanda et al. (2023a)’s paper, the third component is further decomposed into two pieces; specifically, the difference-of-angles identities and the summation (along with the argmax) steps are considered separate. The difference-of-angles identity is computed using the MLP output layer and the unembedding matrix while the summation step is computed via the unembedding matrix. As there is no canonical factorization of the unembedding matrix, there exists no natural decomposition of the concrete model into these two components—we analyze the composition of these two components instead. ⬇ 1modulus = 113 2key_freqs = [14, 35, 41, 42, 52] 3 4cos_sin = tuple[list[float], list[float]] 5 6def encoding_of_inputs( 7 a: int, 8 b: int, 9) -> tuple[cos_sin, cos_sin]: 10 cos_a = [round(cos(2 * f * pi * a / modulus)) for f in key_freqs] 11 cos_b = [round(cos(2 * f * pi * b / modulus)) for f in key_freqs] 12 sin_a = [round(sin(2 * f * pi * a / modulus)) for f in key_freqs] 13 sin_b = [round(sin(2 * f * pi * b / modulus)) for f in key_freqs] 14 15 return (cos_a, sin_a), (cos_b, sin_b) 16 17def sum_of_angles( 18 components: tuple[cos_sin, cos_sin], 19) -> EquivalenceClass: 20 (cos_a, sin_a), (cos_b, sin_b) = components 21 cos_ab = [ 22 ca * cb - sa * sb 23 for ca, sa, cb, sb in zip(cos_a, sin_a, cos_b, sin_b) 24 ] 25 sin_ab = [ 26 sa * cb + ca * sb 27 for ca, sa, cb, sb in zip(cos_a, sin_a, cos_b, sin_b) 28 ] 29 30 return EquivalenceClass(cos_ab, sin_ab) 31 32def difference_of_angles_argmax(angle_sums_class: EquivalenceClass) -> int: 33 cos_ab, sin_ab = extract_angle_sums(angle_sums_class) 34 cos_ab_minus_c = [ 35 [ 36 cab * cos(2 * f * pi * c / modulus) + sab * sin(2 * f * pi * c / modulus) 37 for cab, sab, f in zip(cos_ab, sin_ab, key_freqs) 38 ] 39 for c in range(modulus) 40 ] 41 42 return argmax(map(sum, cos_ab_minus_c)) 43 44def modular_addition(a: int, b: int) -> int: 45 return difference_of_angles_argmax( 46 sum_of_angles( 47 encoding_of_inputs(a, b) 48 ) 49 ) Listing 8: Mechanistic interpretation of the modular arithmetic model As noted in Section 5, the discretization operators are necessary for compatibility with Axioms 1 and 2. Without rounding on the first component, we obtain epsilons of 1 for Axioms 1 and 2; epsilons for Axioms 3 and 4 are unaffected by rounding. For the second component, we apply the equivalence class formulation described in Section 5; however, we strengthen the equivalence criterion by ensuring that equal samples affect neither concrete nor abstract model behavior. This was chosen to avoid the need to sample from the equivalence class on concretization or to select a canonical representative for each class; doing so would have been needed to correctly evaluate Axioms 3 and 4 without such a guarantee. When sampling in this way is feasible, an equivalence relation up to the abstract only model is sufficient. Pseudocode is given in Listing LABEL:lst:equivalence. ⬇ 1def equivalent(a: cos_sin, b: cos_sin) -> bool: 2 abstract_eq = abstract_components[3](a) == abstract_components[3](b) 3 concrete_eq = concrete_components[3](gammas[2](a)) == concrete_components[3](gammas[2](b)) 4 return abstract_eq and concrete_eq Listing 9: Equivalence relation for second component of mechanistic interpretation of the modular arithmetic model As for the first component, we obtain epsilons of 1 for Axioms 1 and 2 without applying the equivalence class mapping; by construction, the epsilons for Axioms 3 and 4 are unaffected by the equivalence class operation. Note that simple discretization operators do not suffice: we once again obtain an ϵitalic-ϵεϵ of 1 for Axioms 1 and 2 when rounding to a single decimal place. While strong results with the discretization described above show that the abstract and concrete representations of the output of the second component are equivalent up to their downstream impact on both the concrete and abstract models, results with simpler discretizations indicate that the the concrete representation does not represent the abstract state exactly. Appendix I A Sketch of Application to Circuits: Indirect Object Identification As described in Appendix B.1, our axioms may be extended to evaluate circuits-type interpretations, or may be applied directly by linearization of the computational graph of the circuit. To illustrate the process, we’l describe how the circuit for Indirect Object Identification (IOI) (Wang et al., 2023) may be evaluated. The IOI circuit consists of five broad categories of heads: 1. Previous token heads, which copy information from the prior token 2. Duplicate token heads, which identify whether there exists any prior duplicate of the current token 3. Induction heads, which serve the same function as duplicate token heads, mediated via the previous token heads 4. S-inhibition heads, which output a signal suppressing attention by name mover heads to duplicated names 5. Name mover heads, which copy names except those suppressed by the S-inhibition heads for output Figure 18: The IOI circuit, reproduced from Wang et al. (2023). Note that each of these attention heads computes a well-defined interpretable function which can be represented in our framework. Figure 18 shows the structure of the circuit. We’l now illustrate the process by which the circuit may be analyzed via the linearization technique from Appendix B.1.2 or the graph axioms from Appendix B.1.1. The circuit in Figure 18 cannot be directly evaluated, as it must be made isomorphic to a decomposition of the concrete model. To do so, we will first expand the circuit in Figure 18 to a head-by-head graph. Next, for each block with an interpreted head, we construct dummy nodes in the interpretation which correspond to uninterpreted heads; as the interpretation claims that these are uninvolved in the IOI task, these nodes compute the identity function, a constant function or some other no-op which is independent of the IOI task. Now, we must account for computations performed in uninterpreted blocks and in MLPs. We can express the computation performed by a block of a transformer as a simple computational graph: each interpreted head, and well as the dummy node corresponding to the uninterpreted heads, takes the incoming residual stream as an input and returns an update to the residual stream. These updates are added to the residual stream and processed by the remaining components, such as the MLP and any normalization, which produce the final residual output by the block. For simplicity, we combine any uninterpreted layers into this operation. These nodes, abstractly, update the program state from the potentially redundant input information. In this way, we can construct isomorphic concrete and abstract models from any circuit of this form. Figure 19: Concrete and abstract models derived from the IOI circuit. We derive the concrete and abstract models shown in Figure 19. Note that structuring the circuit in this way allows us to clearly define and evaluate a hypothesis on how redundant copies of a given abstract concept are combined downstream. For example, the model might retain independent duplicate copies of the information, one copy per head, retain only the most recently-computed value, or aggregate redundant copies with some weighting. At this point, we can apply either the linearization technique of Appendix B.1.2 or the generalized axioms of Appendix B.1.1 to evaluate the hypotheses. We do not fully evaluate IOI with our axioms here as Wang et al. (2023) leaves key components of the interpretation unspecified. In particular, the authors do not clearly state how the model combines the information produced by redundant heads performing the same function, and the authors do not conclusively state what the duplicate-token suppressing information output by the S-inhibition heads represents. For example, that information may be represented as the suppressed token itself, via either relative or absolute positions, or a combination of token and position information. A complete analysis with our axioms would enable the analyst to derive clear evidence for the correct hypotheses.