Paper deep dive
Optimal Ablation for Interpretability
Maximilian Li, Lucas Janson
Models: GPT-2-XL, GPT-2-small
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 92%
Last extracted: 3/12/2026, 7:21:15 PM
Summary
The paper introduces 'Optimal Ablation' (OA), a novel method for quantifying the importance of machine learning model components by replacing them with a constant that minimizes the expected loss of the ablated model. OA is shown to be a canonical total ablation method that outperforms existing techniques like zero, mean, and resample ablation in downstream tasks such as circuit discovery, factual recall localization, and latent prediction.
Entities (5)
Relation Signals (4)
Optimal Ablation → isa → Total Ablation
confidence 95% · Like zero, mean, and resample ablation, optimal ablation is a total ablation method
Optimal Ablation → improves → Circuit discovery
confidence 90% · using OA for circuit discovery produces smaller and lower-loss circuits than previous ablation methods.
Optimal Ablation → improves → Factual Recall
confidence 90% · OA better identifies important components compared to prior work.
Optimal Ablation → improves → Latent Prediction
confidence 90% · We propose an OA-based prediction map and show that it has better predictive power and causal faithfulness than previous methods.
Cypher Suggestions (2)
Find all tasks improved by Optimal Ablation · confidence 90% · unvalidated
MATCH (m:Method {name: 'Optimal Ablation'})-[:IMPROVES]->(t:Task) RETURN t.name
List all ablation methods mentioned · confidence 85% · unvalidated
MATCH (m:Method) WHERE m.name CONTAINS 'Ablation' RETURN m.name
Abstract
Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.
Tags
Links
- Source: https://arxiv.org/abs/2409.09951
- Canonical: https://arxiv.org/abs/2409.09951
Full Text
313,223 characters extracted from source content.
Optimal ablation for interpretability

Maximilian Li (Harvard University) & Lucas Janson (Harvard University)

Abstract

Interpretability studies often involve tracing the flow of information through machine learning models to identify specific model components that perform relevant computations for tasks of interest. Prior work quantifies the importance of a model component on a particular task by measuring the impact of performing ablation on that component, or simulating model inference with the component disabled. We propose a new method, optimal ablation (OA), and show that OA-based component importance has theoretical and empirical advantages over measuring importance via other ablation methods. We also show that OA-based component importance can benefit several downstream interpretability tasks, including circuit discovery, localization of factual recall, and latent prediction.

1 Introduction

Interpretability work in machine learning (ML) seeks to develop tools that make models more intelligible to humans in order to better monitor model behavior and predict failure modes. Early work in interpretability sought to identify relationships between model outputs and input features (Ribeiro et al., 2016; Covert et al., 2022), but with only black-box query access to observe inputs and outputs, it can be difficult to evaluate a model's internal logic. Hence, recent interpretability work often seeks to take advantage of access to an ML model's intermediate computations to gain insights about its decision-making, focusing on deciphering internal units like neurons, weights, and activations (Räuker et al., 2022).
In addition to finding associations between latent representations and semantic concepts (Bau et al., 2017; Mu and Andreas, 2021; Burns et al., 2022; Li et al., 2023; Gurnee and Tegmark, 2024), interpretability studies aim to investigate how intermediate results are used in later computation and identify specific model components that extract relevant information or perform necessary computation to produce low loss on particular inputs. A key instrumental goal in interpretability is quantifying the importance of a particular model component for prediction. Studies often measure a component's importance by performing ablation on that component and comparing model performance with and without the component ablated. Ablating a component typically entails replacing its value with a counterfactual value during inference, sometimes referred to as "activation patching." However, the details vary greatly and there is a lack of consensus on best practices (Heimersheim and Nanda, 2024). For example, Meng et al. (2022) adds Gaussian noise to ablated components' values, while Geva et al. (2023) replaces these values with zeros, and Ghorbani and Zou (2020) replaces them with their means over the training distribution. In this paper, we present optimal ablation (OA), a new method that sets a component's value to a constant that minimizes the expected loss of the ablated model. In Section 2, we introduce OA and show that it is, in a certain sense, a canonical choice of ablation method for measuring component importance. We then show that using OA produces meaningful improvements for several common downstream applications of measuring component importance. In Section 3, we apply OA to algorithmic circuit discovery (Conmy et al., 2023), or the identification of a sparse subset of components sufficient for performance on a subset of the training distribution.
We demonstrate that OA-based performance is a reasonable metric for evaluating circuits and that using OA for circuit discovery produces smaller and lower-loss circuits than previous ablation methods. In deploying OA to this application, we also propose a new search algorithm for identifying sparse circuits that achieve low loss according to any performance metric. In Section 4, we use OA to locate relevant components for factual recall (Meng et al., 2022) and show that OA better identifies important components compared to prior work. In Section 5, we apply OA to latent prediction (Belrose et al., 2023a), or the elicitation of output predictions using intermediate activations. We propose an OA-based prediction map and show that it has better predictive power and causal faithfulness than previous methods.

2 Optimal ablation

2.1 Motivation

Let M represent a model that is trained to minimize E_{X,Y} L(M(X), Y) for a given loss function L and a distribution of random input-label pairs (X, Y). A common theme in interpretability work is quantifying the importance of some model component A for inference. For example, A could represent a single neuron, a direction in activation space, a token embedding, an attention head, or an entire transformer layer; further examples of A are discussed in Section 3 and Appendix C.2. Let A(x) represent the value of A when the model is evaluated on input x. To identify domain specialization among model components, studies often measure the importance of A for model performance on a particular "subtask" D, an interpretable human-curated distribution of input-label pairs that captures a general aspect of model behavior. We write (X, Y) ∼ D to indicate sampling input-label pairs or X ∼ D to indicate sampling only inputs from the subtask distribution.
Although some works quantify component importance via gradients (Leino et al., 2018; Dhamdhere et al., 2018), such an approach is inherently local (even when aggregated over many inputs) and as such can fail to accurately represent the overall importance of A in highly nonlinear models. Instead, most interpretability studies use ablation to quantify the importance of A by studying the gap in performance between the full model M and a modified version M∖A with A ablated:

Δ(M, A) := P(M∖A) − P(M),    (1)

where P is a selected metric for model performance. In the context of measuring importance, we argue that the construction of M∖A is motivated by the following question: What is the best performance on subtask D the model M could have achieved without component A? To formalize this question, we split its meaning into four elements.

I. "Performance on subtask D": The relevant performance metric P is the expected loss on the subtask with respect to the full model's predictions: P(M̃) = E_{X∼D} L(M̃(X), M(X)) (note that E aggregates over any randomness in M̃).¹ We call Δ defined using this choice of P the ablation loss gap.

[Footnote 1: A common alternative choice is measuring performance in terms of proximity to the correct labels rather than the original model predictions, i.e. P̃(M̃) = E_{(X,Y)∼D} L(M̃(X), Y). See Appendix C.4 for discussion.]

II. "Model M could have achieved": Since the goal of measuring component importance is to interpret the model M, the ablated model M∖A should be constructed solely by changing the value of A, holding fixed all other parts of M. We write M_A(x, a) to indicate computing M on input x while setting the value of A to a instead of A(x) (see Appendix C.2 for details).

III. "Without component A": The ablated model M∖A(x) should use a value for A that is devoid of information about the input x. Elements II and III motivate the following definition:

Definition 2.1 (Total ablation). A total ablation method satisfies M∖A(X) = M_A(X, Ā) for some random value Ā, where Ā ⊥⊥ X. (Conversely, for a partial ablation method, the replacement value can depend on X.)

IV. "Best" performance: To measure the importance of A, we wish to understand how much model performance degrades as a result of ablating A. If two constructions of the ablated model M∖A both perform total ablation on A but one performs worse than the other, the former's underperformance cannot be entirely attributed to ablating A, since the latter also totally ablates A and yet does not degrade performance to the same extent. Thus, the relevant M∖A for measuring importance should incur the minimum Δ among total ablation methods.

To make element IV more concrete, for an ablation method satisfying element II and a given x, replacing A(x) with a can degrade the ablated model's performance via both deletion and spoofing:

1. Deletion. The original value A(x) could carry informational content specific to x that serves some mechanistic function in downstream computation and helps the model arrive at its prediction M(x).
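To make the ablation loss gap concrete, here is a minimal sketch on a toy two-layer network standing in for M, with the hidden activation playing the role of component A. All names are illustrative (not the paper's code), and the KL-based performance metric means P(M) = 0, so Δ is just the expected KL divergence from the full model's predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def model(x, ablate_value=None):
    """Tiny 2-layer net; the hidden activation plays the role of component A."""
    a = np.maximum(x @ W1, 0.0)            # A(x): the component's value
    if ablate_value is not None:           # total ablation: A(x) -> constant
        a = np.broadcast_to(ablate_value, a.shape)
    return softmax(a @ W2)

def ablation_loss_gap(X, ablate_value):
    """Delta(M, A) = P(M\\A) - P(M), with P(Mtilde) = E_X KL(M(X) || Mtilde(X)).
    P(M) = 0 because the KL of the full model from itself vanishes."""
    p_full = model(X)
    p_abl = model(X, ablate_value)
    kl = np.sum(p_full * (np.log(p_full) - np.log(p_abl)), axis=-1)
    return float(kl.mean())

X = rng.normal(size=(16, 4))
gap_zero = ablation_loss_gap(X, np.zeros(8))                          # zero ablation
gap_mean = ablation_loss_gap(X, np.maximum(X @ W1, 0.0).mean(axis=0)) # mean ablation
assert gap_zero > 0.0 and gap_mean >= 0.0  # KL gaps are nonnegative
```

Different choices of the constant yield different total ablation methods; the next sections compare them.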
Replacing A(x) with some other value a could delete this information about x, hindering the model's ability to compute the original M(x).

2. Spoofing. The replacement value a could "spoof" the downstream computation by inserting information about the input that either: (a) causes the model to treat x like a different input x′;² or (b) causes the model to treat x in a way that it never treated any training input, if the new information is inconsistent with information about x derived from retained components. In the latter case, the confluence of conflicting information could cause later activations to become incoherent, degrading performance because these abnormal activations were not observed during training and thus not necessarily regulated to lead to reasonable predictions.

[Footnote 2: See Appendix D.1 for a brief example of why "treating x like x′" goes beyond deletion.]

To measure importance, we seek to isolate the contribution of effect 1 to performance degradation. While total ablation methods all capture a maximal deletion effect since the replacement value does not depend on X, measuring the "best" performance minimizes the additional contribution of potential spoofing effects.

2.2 Prior work

Component importance is strongly related to variable importance (Sobol, 1993; Homma and Saltelli, 1996; Breiman, 2001; Ishwaran, 2007; Gromping, 2009), a longstanding area of research in statistics and ML. The vast body of work in this area is too extensive to review here, and the recent surge of research interest in interpreting internal model components has raised new and unique challenges relating to the values of internal components often being deterministically related. Thus, we focus on recent work applying ablation methods to internal components in this section, and defer broader discussion to Appendix B.
Ablation methods previously applied to internal components can be separated into subtask-agnostic methods, which can be applied out-of-the-box to any subtask, and subtask-specific methods, which only work on subtasks for which inputs satisfy a designated structure, and even then require human ingenuity to adapt to each new subtask.

Subtask-agnostic ablation methods include zero ablation (Baan et al., 2019; Lakretz et al., 2019; Bau et al., 2020; Olsson et al., 2022; Geva et al., 2023; Cunningham et al., 2023; Merullo et al., 2024; Gurnee et al., 2024), which replaces A(x) with zero, i.e. M∖A(x) = M_A(x, 0); mean ablation (Ghorbani and Zou, 2020; McDougall et al., 2023; Tigges et al., 2023; Gould et al., 2023; Li et al., 2024; Marks et al., 2024), which replaces A(x) with its mean, i.e. M∖A(x) = M_A(x, E_{X′∼D}[A(X′)]); and (marginal) resample ablation (Chan et al., 2022; Lieberum et al., 2023; McGrath et al., 2023; Belrose et al., 2023a; Rushing and Nanda, 2024), which replaces A(x) with A(X′) for an independent copy X′ ∼ D of the input, i.e. M∖A(x) = M_A(x, A(X′)). While zero, mean, and resample ablation are total ablation methods, adding Gaussian noise to A(x) (Meng et al., 2022) is a subtask-agnostic partial ablation method. These methods are all applicable to any subtask.

On the other hand, subtask-specific ablation methods rely on particular details of a chosen subtask. Singh et al. (2024) replaces A(x) with interpretable values, e.g. setting an attention pattern to copy from the previous token, while Goldowsky-Dill et al. (2023) replaces A(x) with A(x*) for an interpretable reference input x*. Hanna et al. (2023) employs counterfactual ablation (CF), a partial ablation method that replaces A(x) with A(π(x)), where π is a map that sends each input x to a "neutral" (potentially random) input π(x) that preserves most aspects of x but removes information relevant to the subtask, i.e. M∖A(x) = M_A(x, A(π(x))). Wang et al. (2022) also considers a counterfactual distribution of inputs for counterfactual mean ablation, which replaces A(x) with its mean over the distribution of counterfactuals, i.e. M∖A(x) = M_A(x, E_{X′∼D, π}[A(π(X′))]). Subtask-specific methods can be useful, but it is usually unclear how to generalize them beyond the subtask originally selected. CF is the most popular among these methods, leveraged by a range of manual (Vig et al., 2020; Merullo et al., 2023; Stolfo et al., 2023; Tigges et al., 2023; Hendel et al., 2023; Heimersheim and Janiak, 2023; Todd et al., 2024; Marks et al., 2024) and algorithmic (Conmy et al., 2023; Syed et al., 2023) studies and recommended by meta-studies (Zhang and Nanda, 2024; Heimersheim and Nanda, 2024). For text data, the effectiveness of CF relies heavily on token parallelism between x and π(x), which typically share exact tokens at all but a few sequence positions.
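The three subtask-agnostic total ablation methods amount to different rules for choosing the replacement value. A minimal sketch on a batch of cached component values (illustrative names; resampling is approximated by shuffling the batch, a common practical stand-in for drawing an independent X′):

```python
import numpy as np

rng = np.random.default_rng(0)
# Cached values A(x) of one component over a batch of 32 subtask inputs.
acts = rng.normal(size=(32, 8))

def zero_ablation(acts):
    """Replace A(x) with 0 for every input."""
    return np.zeros_like(acts)

def mean_ablation(acts):
    """Replace A(x) with the constant E_{X'~D}[A(X')] for every input."""
    return np.broadcast_to(acts.mean(axis=0), acts.shape)

def resample_ablation(acts, rng):
    """Replace A(x) with A(X') for an input X' drawn independently of x
    (approximated here by shuffling the batch)."""
    return acts[rng.permutation(acts.shape[0])]

# All three are total ablations: the replacement value carries no
# information about the particular input x it is substituted into.
assert np.allclose(mean_ablation(acts).std(axis=0), 0.0)
assert np.allclose(np.sort(resample_ablation(acts, rng), axis=None),
                   np.sort(acts, axis=None))
```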
Though studies have thus far focused on toy subtasks for which suitable mappings π are relatively easily constructed, it may be difficult or impossible to select well-suited input pairs for certain subtasks (see Appendix F.3 for a few simple examples), especially more general model behaviors. Even for subtasks that admit such a mapping, how π(x) is engineered to withhold subtask-relevant information differs from subtask to subtask, and the construction of π for each particular subtask is a subjective process that requires human ingenuity. Finally, CF is only a partial ablation method; since A(π(x)) depends on x, it may give away information about x that is useful for performance on D.

2.3 Definition and properties of optimal ablation

We present optimal ablation (OA), our proposed approach to simulating component removal.

Definition 2.2 (Optimal ablation). To ablate A, we replace A(x) with an "optimal" constant a*:

M(opt)∖A(x) := M_A(x, a*),  a* := argmin_a E_{X∼D} L(M_A(X, a), M(X)).    (2)

We define Δ_opt by plugging M(opt)∖A(x) into Equation (1). Like zero, mean,³ and resample ablation, optimal ablation is a total ablation method satisfying Definition 2.1. But among all total ablation methods, optimal ablation is optimal in the sense that it yields the lowest Δ.

[Footnote 3: See Appendix D.2 for an interesting connection of OA to mean ablation.]

Proposition 2.3. Let Δ(M, A) be the ablation loss gap for some component A measured with any total ablation method. Then, Δ_opt(M, A) ≤ Δ(M, A).

Proof. Consider a total ablation method that defines M∖A(X) by replacing A(X) with a random value Ā independent of X (per Definition 2.1), and let Δ(M, A) be the measured ablation loss gap. By the tower property,

Δ(M, A) = E_{X∼D, Ā} L(M_A(X, Ā), M(X)) = E_Ā[ E[ L(M_A(X, Ā), M(X)) | Ā ] ].

Since Ā ⊥⊥ X,

E[ L(M_A(X, Ā), M(X)) | Ā = a ] = E_{X∼D} L(M_A(X, a), M(X)) =: g(a).

For all a,

Δ_opt(M, A) = E_{X∼D} L(M_A(X, a*), M(X)) ≤ E_{X∼D} L(M_A(X, a), M(X)) = g(a),

so Δ_opt(M, A) ≤ E_Ā g(Ā) = Δ(M, A). ∎    (3)

Optimal ablation thus provides the unique answer to our motivating question in Section 2.1, since it produces the best performance among all total ablation methods, including zero, mean, and resample ablation. Intuitively, OA minimizes the contribution of spoofing (effect 2 from Section 2.1) to Δ by setting ablated components to constants a* that are maximally consistent with information from other components, e.g. by conveying a lack of information about x or by hedging against a wide range of possible x rather than strongly associating with a particular input other than the original x. OA does not entirely eliminate spoofing, since it may be the case that every possible value of A conveys at least weak information to the model. However, the excess ablation gap Δ − Δ_opt for Δ measured with ablation methods that replace A(x) with a (random) value is entirely caused by spoofing, since replacing A(x) with the constant a* achieves lower loss without giving away any more information about x. In practice, Δ − Δ_opt for prior ablation methods is typically very large compared to Δ_opt for both single components (see Table 1) and circuits (see Section 3.2) on prototypical language subtasks. This disparity indicates that effect 2 dominates the Δ measurements for previous ablation methods, making them poor estimators for effect 1 compared to OA. Subtask-specific methods often try to generate consistent interventions by conditioning on features of the input to avoid replacing A(x) with values that could confuse the model. For CF, choosing π(x) to share many tokens with x mitigates the contribution of effect 2b to Δ_CF, which is the main reason the technique is so widely employed. Thus, among previous measures of component importance, Δ_CF, when it can be well-constructed, may be the best quantification of effect 1. To demonstrate this intuitive relation between OA and CF as techniques that aim to isolate effect 1, we perform a case study in Section 2.4 that shows that, among other ablation methods, OA produces the measurements most similar to CF.
However, not only is OA more general than subtask-specific methods like CF, but Δ_opt may still be a better estimator for effect 1 than Δ_CF even when CF is well-defined. In Section 3, we show that for circuits, Δ_opt is much lower than Δ_CF despite reflecting a weakly stronger deletion effect, indicating that effect 2 contributes less to Δ_opt than it does to Δ_CF, and thus Δ_opt is a more accurate reflection of components' informational importance.

Computation of a*. Though it is impossible to derive a* in closed form, we find that in practice, mini-batch stochastic gradient descent (SGD) performs well at finding constants â that greatly reduce Δ compared to heuristic point estimates like zero and the mean. We generally adopt the approach of initializing each â to the subtask mean E_{X∼D}[A(X)] and performing SGD to minimize Δ.

2.4 Comparison of single-component ablation results on IOI

The Indirect Object Identification (IOI) subtask (Wang et al., 2022) consists of prompts like "When Mary and John went to the store, Mary gave the apple to ___," which GPT-2-small (Radford et al., 2019) completes with the correct indirect object noun ("John"). We use IOI as a case study because it is discussed extensively in interpretability work (Merullo et al., 2023; Makelov et al., 2023; Wu et al., 2024; Lan et al., 2024; Zhang and Nanda, 2024). To implement CF, for each prompt x, Wang et al. (2022) constructs a random π(x) in which the names are replaced with random distinct names. We evaluate Δ for attention heads and MLP blocks using zero, mean, resample, counterfactual, counterfactual mean, and optimal ablation.
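The "Computation of a*" procedure above can be illustrated on a toy model: initialize the constant at the subtask mean, then descend the expected loss. This is a hedged sketch under a KL loss with illustrative names, not the paper's implementation; because the ablated value is a constant, the expected-KL gradient with respect to a has the simple closed form (softmax(a W2) − mean of the full model's predictions) W2ᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(512, 4))
H = np.maximum(X @ W1, 0.0)        # A(X): values of the component being ablated
P = softmax(H @ W2)                # full-model predictions M(X)

def gap(a):
    """Delta under KL loss: E_X KL(M(X) || M_A(X, a)). Since a is a constant,
    the ablated prediction q = softmax(a @ W2) is the same for every X."""
    q = softmax(a @ W2)
    return float(np.mean(np.sum(P * (np.log(P) - np.log(q)), axis=-1)))

# Initialize a-hat at the subtask mean E[A(X)] and run gradient descent.
a_hat = H.mean(axis=0).copy()
gap_mean = gap(a_hat)              # Delta under mean ablation
p_bar = P.mean(axis=0)
for _ in range(2000):
    a_hat -= 0.1 * (softmax(a_hat @ W2) - p_bar) @ W2.T
gap_opt = gap(a_hat)               # Delta under (approximately) optimal ablation
assert 0.0 <= gap_opt <= gap_mean  # Proposition 2.3: OA lower-bounds total ablations
```

In this toy setting the mean is a reasonable starting point but not the minimizer, matching the paper's observation that SGD-found constants can greatly reduce Δ relative to zero or the mean.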
In Table 1, we show that among attention heads and MLP blocks, Δ_opt accounts for only 11.1% of Δ_zero, 33.0% of Δ_mean, and 17.7% of Δ_resample for the median component. Furthermore, among these Δ measurements, Δ_opt has the highest rank correlation (0.907) with Δ_CF. Full results are shown in Appendix E.3.

Table 1: Comparison of ablation loss gap Δ on IOI

                               Zero    Mean    Resample  CF-Mean  Optimal  CF
Rank correlation with CF       0.590   0.825   0.828     0.833    0.907    1
Median ratio of Δ_opt to Δ     11.1%   33.0%   17.7%     31.7%    100%     88.9%

3 Application: circuit discovery

Circuit discovery is the selection of a sparse subnetwork of M that is sufficient for the recovery of model performance on an algorithmic subtask D. To define what constitutes a "sparse subnetwork," we write M as a computational graph with vertices G and edges E. An edge e_k := (A_j, A_i, z) ∈ E indicates that A_j(x) is taken as the z-th input to A_i in the computation represented by the graph. To ablate edge e_k, we replace the z-th input to A_i, which is equal to A_j(x) during normal inference, with some value a. We compute M^Ẽ(X), which represents modified inference with edges E ∖ Ẽ ablated, by applying this intervention for each ablated edge (see Appendices C.1 and C.2 for more details).
Circuit discovery aims to select a subset of edges Ẽ* ⊆ E such that

Ẽ* = argmin_{Ẽ⊆E} [ E_{X∼D} L(M^Ẽ(X), M(X)) + R(Ẽ) ] = argmin_{Ẽ⊆E} [ Δ(M, E ∖ Ẽ) + R(Ẽ) ]    (4)

for a regularization term R that measures the sparsity level (further discussed in Appendix F.4). Additionally, when implementing OA, though we could use a different constant for each edge, we instead define a single constant a_j* for each vertex A_j, so that if multiple out-edges from A_j are ablated, the same value is passed to each of its children (further discussed in Appendix F.2).

3.1 Methods

We compare Δ(M, E ∖ Ẽ) measured with mean ablation, resample ablation, OA, and CF as metrics for circuit discovery. We consider the manual circuit for each subtask and circuits optimized on each Δ metric using several search algorithms. ACDC (Conmy et al., 2023) identifies circuits by iteratively considering edge ablations.
They start by proposing Ẽ = E, then iterate over edges e_k, ablating e_k and updating Ẽ = Ẽ ∖ {e_k} if the marginal impact on loss, Δ(M, (E ∖ Ẽ) ∪ {e_k}) − Δ(M, E ∖ (Ẽ ∪ {e_k})), is below a tolerance threshold λ. Edge Attribution Patching (EAP) (Syed et al., 2023) selects Ẽ to contain the edges e_k that have the largest gradient approximation of their single-edge ablation loss gap Δ(M, {e_k}).

HardConcrete Gradient Sampling (HCGS) is an adaptation of a pruning technique from Louizos et al. (2018) to circuit discovery. Rather than considering only total ablation of an edge e_k = (A_j, A_i, z), we can consider a continuous mask of coefficients α = (α_k) and partially ablate e_k by replacing the z-th input to A_i with a linear combination of the original value A_j(x) and ablated value a_j, i.e. α_k A_j(x) + (1 − α_k) a_j. Now, α_k = 0 designates total ablation (replacing with a_j), while α_k = 1 designates total retention (keeping A_j(x)). We use M^α(x) to represent the model with edges partially ablated according to α.
Some previous work (Liu et al.,, 2017; Huang and Wang,, 2018) optimizes directly on the mask coefficients α→ αover→ start_ARG α end_ARG, but to avoid getting stuck in local minima on αk∈(0,1)subscript01 _k∈(0,1)αitalic_k ∈ ( 0 , 1 ), Louizos et al., (2018) samples αksubscript _kαitalic_k from a HardConcrete distribution parameterized by location θksubscript _kθitalic_k and temperature βksubscript _kβitalic_k for each edge, and performs SGD with respect to the distributional parameters. In effect, we update the parameters based on gradients evaluated at randomly sampled values of α→ αover→ start_ARG α end_ARG rather than gradients evaluated at any exact α→ αover→ start_ARG α end_ARG. Cao et al., (2021) applies this technique to find circuits that consist of a subset of model weights. Conmy et al., (2023) applies this technique to vertices in a computational graph. Unlike previous work, we apply this technique to edges rather than vertices. Uniform Gradient Sampling (UGS) is our proposed method for algorithmic circuit discovery. Similar to HCGS, we consider ablation coefficients α→ αover→ start_ARG α end_ARG and update parameters based on gradients evaluated at sampled values of α→ αover→ start_ARG α end_ARG. We keep track of a parameter θ~ksubscript~ θ_kover~ start_ARG θ end_ARGk for each edge, where θk=(1+exp(−θ~k))−1subscriptsuperscript1subscript~1 _k=(1+ (- θ_k))^-1θitalic_k = ( 1 + exp ( - over~ start_ARG θ end_ARGk ) )- 1 indicates an estimated probability of ek∈E~∗subscriptsuperscript~e_k∈ E^*eitalic_k ∈ over~ start_ARG E end_ARG∗. Using w(θk)=θk(1−θk)subscriptsubscript1subscriptw( _k)= _k(1- _k)w ( θitalic_k ) = θitalic_k ( 1 - θitalic_k ) to determine sampling frequency (further discussed in Appendix F.8), we let αk∼Unif(0,1)similar-tosubscriptUnif01 _k Unif(0,1)αitalic_k ∼ Unif ( 0 , 1 ) with probability (w.p.) w(θk)subscriptw( _k)w ( θitalic_k ), αk=1subscript1 _k=1αitalic_k = 1 w.p. 
$\theta_k - \frac{1}{2}w(\theta_k)$, and $\alpha_k = 0$ w.p. $1 - \theta_k - \frac{1}{2}w(\theta_k)$. For a batch of $b$ inputs $X^{(1)}, \dots, X^{(b)}$, let $\vec{\alpha}^{(j)}$ denote the sampled coefficients corresponding to $X^{(j)}$, and let $N_k = \sum_{j=1}^{b} \mathbb{1}(\alpha_k^{(j)} \in (0,1))$. We construct a loss function $\mathcal{L}_{\mathrm{UGS}}$ whose gradient satisfies
$$\nabla_{\theta_k} \mathcal{L}_{\mathrm{UGS}} = \nabla_{\theta_k} \mathcal{R}(\vec{\theta}) + N_k^{-1} \sum_{j=1}^{b} \mathbb{1}\big(\alpha_k^{(j)} \in (0,1)\big) \cdot \nabla_{\alpha_k^{(j)}} \mathcal{L}\big(\mathcal{M}^{\vec{\alpha}^{(j)}}(X^{(j)}),\, \mathcal{M}(X^{(j)})\big) \tag{5}$$
and perform SGD on the $\tilde{\theta}_k$, where $\mathcal{R}(\vec{\theta})$ is a continuous relaxation of $\mathcal{R}(\tilde{E})$ from Equation (4). In Appendix F.5, we motivate UGS as an estimator for sampling over Bernoulli edge coefficients.
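The two stochastic masking schemes above can be sketched as samplers. This is a hedged illustration: the HardConcrete sampler follows the standard stretched-and-clipped form from Louizos et al. (2018) (the stretch limits and temperature are that paper's usual defaults, not necessarily the values used here), and the UGS sampler draws $\alpha_k$ from the three-branch mixture defined in the text with $w(\theta) = \theta(1-\theta)$.

```python
import numpy as np

# Sketch of the two sampling schemes. HardConcrete: a stretched, clipped
# binary Concrete sample with location log_theta and temperature beta
# (gamma, zeta are the standard stretch limits from Louizos et al., 2018).
# UGS: alpha_k ~ Unif(0,1) w.p. w(theta), alpha_k = 1 w.p. theta - w/2,
# and alpha_k = 0 otherwise.

def hard_concrete_sample(log_theta, beta=2/3, gamma=-0.1, zeta=1.1, rng=None):
    rng = np.random.default_rng(rng)
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_theta))
    s = 1.0 / (1.0 + np.exp(-((np.log(u) - np.log(1 - u)) + log_theta) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)  # point mass at 0 and 1

def ugs_sample(theta, rng):
    w = theta * (1.0 - theta)          # probability of the uniform branch
    r = rng.uniform()
    if r < w:
        return rng.uniform()           # alpha_k ~ Unif(0, 1)
    if r < w + theta - w / 2.0:
        return 1.0                     # total retention
    return 0.0                         # total ablation

rng = np.random.default_rng(0)
hc = hard_concrete_sample(np.zeros(10_000), rng=rng)
ugs = np.array([ugs_sample(0.5, rng) for _ in range(20_000)])
frac_one = float(np.mean(ugs == 1.0))  # expected 0.375 when theta = 0.5
```

Both samplers place nonzero probability mass exactly at 0 and 1, which is what lets gradient-based mask training converge to a discrete circuit.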
Optimizing circuits on $\Delta_{\mathrm{opt}}$. ACDC and EAP are not compatible with optimization on $\Delta_{\mathrm{opt}}$, since the optimal $\vec{a}^*$ values depend on the selected circuit and it is intractable to optimize $\hat{a}$ for every candidate circuit. For our circuit evaluations on $\Delta_{\mathrm{opt}}$, we compare to ACDC- and EAP-generated circuits optimized on $\Delta_{\mathrm{CF}}$. On the other hand, HCGS and UGS allow us to perform SGD to optimize the ablation constants $\hat{a}$ concurrently with the sampling parameters.

3.2 Experiments

We study GPT-2-small performance on the IOI (Wang et al., 2022) subtask described in Section 2.4 and the Greater-Than (Hanna et al., 2023) subtask, which involves completing prompts such as "The conflict started in 1812 and ended in 18__" with digits greater than the first year in the context. We select these settings because their exposition in manual studies is particularly thorough and because prior work (Conmy et al., 2023; Syed et al., 2023) uses them to benchmark algorithmic circuit discovery. We compare the algorithms in Section 3.1 trained to minimize $\Delta$ on the IOI and Greater-Than subtasks when the edges $E \setminus \tilde{E}$ are ablated with mean ablation, resample ablation, OA, and CF. For IOI, the mapping $\pi$ for CF is defined in Section 2.4. For Greater-Than, we follow the practice from Hanna et al. (2023) of selecting counterfactuals $\pi(x)$ by changing the first year in the prompt $x$ to end in "01" so that all numerical completions are equally valid (see Appendix F.3).

Figure 1: Left: circuit discovery Pareto frontier for the IOI subtask with counterfactual ablation. Right: comparison of ablation methods for circuit discovery on IOI (X indicates the manual circuit evaluated under each ablation method). $\Delta$ is measured in KL divergence.
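As noted in the Figure 1 caption, the loss gap $\Delta$ is measured in KL divergence between the full and ablated models' output distributions. A minimal sketch with toy next-token distributions (the distributions themselves are illustrative):

```python
import numpy as np

# Toy computation of the KL-divergence form of the loss gap Delta:
# KL(p_full || p_ablated) over next-token distributions.

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with a small epsilon for safety."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p_full = np.array([0.7, 0.2, 0.1])      # toy full-model distribution
p_ablated = np.array([0.6, 0.25, 0.15]) # toy ablated-model distribution
delta = kl_divergence(p_full, p_ablated)
```

A perfect circuit leaves the output distribution unchanged, giving $\Delta = 0$; larger $\Delta$ means the ablated edges carried more task-relevant computation.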
UGS achieves Pareto dominance on the $\Delta$-$|\tilde{E}|$ tradeoff over the other methods on both subtasks for each ablation method, identifying circuits that achieve lower $\Delta$ at any given $|\tilde{E}|$ and vice versa. Results for IOI circuits optimized on $\Delta_{\mathrm{CF}}$ are shown in Figure 1 (left). On IOI, UGS finds a circuit with 385 edges that achieves $\Delta_{\mathrm{CF}} = 0.220$. This circuit has 52% fewer edges than the smallest ACDC-identified circuit with comparable $\Delta_{\mathrm{CF}}$, and 48% lower $\Delta_{\mathrm{CF}}$ than the best-performing ACDC-identified circuit with a comparable edge count. Similar improvements to the Pareto frontier, shown in Appendix F.10, occur for mean, resample, and optimal ablation. UGS also yields Pareto improvements for Greater-Than circuits under each ablation method; see Appendix F.11. Applying OA to circuit discovery reveals that certain sparse circuits can account for model performance on these subtasks to a much greater extent than previously known. We visualize the $\Delta$ achieved under each ablation method by UGS-identified circuits in Figure 1 (right). Using OA to ablate excluded components, we find circuits that recover much lower $\Delta$ at any given circuit size than any circuit whose excluded components are ablated with any other method. For example, on IOI at a circuit size of 1,000 edges, ablating excluded components with OA admits circuits with 32% lower $\Delta$ than CF, 62% lower than mean ablation, and 88% lower than resample ablation, and the improvement is even larger at smaller circuit sizes. For Greater-Than (results shown in Appendix F.11), OA again admits circuits with by far the lowest $\Delta$ among the four ablation methods.
Thus, OA paints a more accurate and compelling picture of how much small subsets of the model can explain behavior on these subtasks. Unlike other ablation methods, OA indicates that the manual circuits are approximately optimal for their size. Holding $|\tilde{E}|$ fixed, the Pareto-optimal $\Delta_{\mathrm{opt}}$ is 29% below the $\Delta_{\mathrm{opt}}$ of the manual circuit on IOI and 42% below the $\Delta_{\mathrm{opt}}$ of the manual circuit on Greater-Than. However, for the other ablation methods, optimized circuits with fewer edges than the manual circuit achieve 84-85% lower $\Delta$ than the manual circuit on IOI, and 70-84% lower $\Delta$ on Greater-Than. Since the manual circuits are selected using a thorough mechanistic understanding of the model for each subtask and thus arguably capture the important components, this finding supports the notion that $\Delta$ measured with previous methods can be artificially high due to spoofing by ablated components, and therefore that $\Delta_{\mathrm{opt}}$ is a superior evaluation metric for circuits. These results show that $\Delta_{\mathrm{opt}}$ is useful for evaluating and discovering circuits and provide evidence that OA better quantifies the removal of important mechanisms than previous ablation methods.

4 Application: factual recall

Transformers can store and retrieve a large corpus of factual associations. One goal in interpretability is localizing factual recall, or identifying components that store specific facts. To this end, Meng et al. (2022) proposes causal tracing, which involves removing important information about the prompt $x$ and evaluating which components can recover the original $\mathcal{M}(x)$. To isolate components responsible for an association between a subject (e.g.
"Eiffel Tower") and an attribute ("located in Paris"), they select a prompt $x$ ("The Eiffel Tower is located in the city of ___") that elicits from $\mathcal{M}$ a correctly memorized response $y$ ("Paris"). They produce a corrupted input $\xi_{\mathrm{GN}}(x)$ by adding a Gaussian noise (GN) term $Z \sim N(0, 9\Sigma)$ to all token embeddings that encode the subject, where $\Sigma$ is a diagonal matrix and $\Sigma_{ii}$ represents the variance of the $i$th neuron among token embeddings sampled from the training distribution. Let $[\mathcal{M}(x)]_y$ denote the probability assigned by $\mathcal{M}(x)$ to label $y$. Since $\xi_{\mathrm{GN}}$ partially ablates information about the subject, $[\mathcal{M}(\xi_{\mathrm{GN}}(x))]_y$ is typically much smaller than $[\mathcal{M}(x)]_y$. For each component $\mathcal{A}$, they estimate its contribution to the recall of $y$ with the following "average indirect effect" (AIE), representing the proportion of probability on the correct $y$ recovered by replacing $\mathcal{A}(\xi(X))$ with $\mathcal{A}(X)$, averaged over $(X, Y) \sim \mathcal{D}$, where $\xi = \xi_{\mathrm{GN}}$:[4]

[4] Unlike Meng et al. (2022), we clip $[\mathcal{M}(X)]_Y - [\mathcal{M}_{\mathcal{A}}(\xi(X), \mathcal{A}(X))]_Y$ to be non-negative, so we do not give $\mathcal{A}$ additional credit for increasing the probability mass of the true label past that given by the full model. We also report AIE as the proportion of probability recovered relative to the full model rather than in percentage points.
$$\mathrm{AIE}(\mathcal{A}) := \max\left(0,\ 1 - \frac{\mathbb{E}_{(X,Y)\sim\mathcal{D}}\, \max\big(0,\ [\mathcal{M}(X)]_Y - [\mathcal{M}_{\mathcal{A}}(\xi(X), \mathcal{A}(X))]_Y\big)}{\mathbb{E}_{(X,Y)\sim\mathcal{D}}\, [\mathcal{M}(X)]_Y - \mathbb{E}_{(X,Y)\sim\mathcal{D}}\, [\mathcal{M}(\xi(X))]_Y}\right) \tag{6}$$

where we declare $\mathrm{AIE}(\mathcal{A}) = 0$ if the denominator is non-positive, i.e. if ablating subject tokens actually helps identify the correct label (however, this never occurs).

Method. We perform causal tracing by removing the subject with optimal ablation (OA-tracing, or OAT) rather than with Gaussian noise (GNT). We define $\xi(x) = \xi_{\mathrm{OA}}(x, a_{\mathcal{A}})$ by replacing subject token embeddings with a constant $a_{\mathcal{A}}$ trained to minimize the numerator in Equation (6), which represents $\Delta$ with a carefully chosen loss function (see Appendix G.2).

Figure 2: Comparison of AIE with GNT and OAT. In the top figure, layer $\ell$ on the x-axis represents replacing a sliding window of 5 layers with $\ell$ as the median. Error bars indicate the sample estimate plus/minus two standard errors (details given in Appendix G.4).

Experiments. We compare GNT and OAT for GPT-2-XL on a dataset of subject-attribute prompts from Meng et al. (2022) for which the model completes the correct answer via sampling with temperature 0. To increase the sample size, we augment the data with similarly constructed prompts from follow-up work on factual recall (Hernandez et al., 2022). We train OAT on 60% of the dataset and evaluate both methods on the other 40%.
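The Gaussian-noise corruption and the clipped AIE estimator can both be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the embedding shapes, subject mask, and per-example probabilities are toy stand-ins.

```python
import numpy as np

# Sketch of causal tracing ingredients: the corruption xi_GN (add
# Z ~ N(0, 9*Sigma) to subject-token embeddings) and the clipped AIE
# from Equation (6) over per-example correct-label probabilities.

def corrupt_subject(emb, subject_mask, sigma_diag, rng):
    """emb: (seq, d); subject_mask: (seq,) bool; sigma_diag: (d,) variances."""
    noisy = emb.copy()
    noise = rng.normal(0.0, np.sqrt(9.0 * sigma_diag), size=emb.shape)
    noisy[subject_mask] += noise[subject_mask]
    return noisy

def aie(p_full, p_corrupt, p_restore):
    """Clipped average indirect effect; inputs are per-example probabilities."""
    num = np.mean(np.maximum(0.0, p_full - p_restore))
    den = np.mean(p_full) - np.mean(p_corrupt)
    if den <= 0:           # corruption did not hurt: declare zero effect
        return 0.0
    return max(0.0, 1.0 - num / den)

rng = np.random.default_rng(0)
emb = np.zeros((5, 4))
mask = np.array([False, True, True, False, False])  # subject token positions
corrupted = corrupt_subject(emb, mask, np.ones(4), rng)

p_full = np.array([0.9, 0.8])
p_corrupt = np.array([0.1, 0.2])
full_restore = aie(p_full, p_corrupt, p_full)    # component recovers everything
no_restore = aie(p_full, p_corrupt, p_corrupt)   # component recovers nothing
```

A component whose restored run matches the full model scores 1, and one that restores nothing scores 0, matching the "proportion of probability recovered" reading of the AIE.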
On the test set, $\mathbb{E}[\mathcal{M}(X)]_Y = 30.6\%$, $\mathbb{E}[\mathcal{M}(\xi_{\mathrm{GN}}(X))]_Y = 12.3\%$, and $\mathbb{E}[\mathcal{M}(\xi_{\mathrm{OA}}(X))]_Y = 8.7\%$. We let $\mathcal{A}(X)$ represent an attention or MLP layer output at certain token position(s): namely, all subject token positions, only the last subject token position, or only the last token position in the entire sequence. Rather than considering only one layer at a time, Meng et al. (2022) lets $\mathcal{A}$ represent the outputs of a sliding window of several consecutive attention layers or MLP layers. Thus, in addition to replacing the output of a single layer (window size 1), we show results for replacing windows of sizes 5 and 9. OAT offers a more precise localization of relevant components than GNT. While GNT indicates a small positive AIE for most components, OAT shows that a few components have large contributions while most have little to no effect. For example, Figure 2 (top left) shows that the AIE for a window of 5 attention layers at the last token is as high as 42.6% for the window consisting of layers 30-34, while the AIE peaks at only 20.2% for GNT. On the other hand, for windows centered around layers 15-23, the average AIE for OAT is only 1.7%, indicating little effect for these potentially unimportant layers, compared to 7.0% for GNT. For sliding windows of 9 attention layers at subject token positions, GNT shows marginally positive AIE measurements across layers 0-30, but OAT specifically shows highly positive AIE for layers 0-5 and 25-30 (see Figure 15). Moreover, whereas Meng et al. (2022) focuses on sliding-window replacement because GNT effects from single-layer replacements are very small, OAT can sometimes identify information gain from just one layer.
For instance, at the last token position, OAT records AIEs above 8% for each of attention layers 30, 32, and 34 by themselves (see Figure 2, bottom left), much greater than the AIE of the other layers. This greater level of granularity opens up the possibility of selectively investigating combinations of layers rather than relying on the prior that adjacent layers work together.

5 Application: latent prediction

One practice in interpretability is eliciting predictions from latent representations. Let $\mathcal{M}$ have layers $0, \dots, N$ and let $\ell_i(X)$ be the residual stream activation at the last token position (LTP) after layer $i$. Logit attribution (Geva et al., 2022; Wang et al., 2022; Dar et al., 2023; Katz and Belinkov, 2023; Dao et al., 2023; Merullo et al., 2024; Halawi et al., 2024) is the practice of applying a transformer model's unembedding map to an activation to obtain a semantic interpretation of that activation. When applied to the LTP activation after layer $i$, this practice is equivalent to zero-ablating layers $i+1$ to $N$. However, the semantic meanings of LTP activations after layer $N$ can differ from those of LTP activations in earlier layers. As an alternative, tuned lens (Belrose et al., 2023a; Din et al., 2023) is a linear map $f_i(\ell_i) = W_i \ell_i + b_i$ that "translates" from $\ell_i(X)$ to a predicted $\hat{\ell}_N(X)$.
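The logit-attribution readout described above amounts to a single matrix product. A toy sketch with illustrative dimensions and random weights (a real model also applies its final layer norm before unembedding):

```python
import numpy as np

# Toy sketch of logit attribution: read a "latent prediction" out of an
# intermediate LTP residual-stream activation by applying the unembedding
# map W_U. Dimensions and weights are illustrative stand-ins.

rng = np.random.default_rng(0)
d_model, vocab = 8, 20
W_U = rng.normal(size=(d_model, vocab))   # hypothetical unembedding matrix
resid = rng.normal(size=d_model)          # LTP activation after layer i

logits = resid @ W_U                      # latent-prediction logits
latent_token = int(np.argmax(logits))     # most-promoted vocabulary item
```

Skipping layers $i+1$ to $N$ entirely, as here, is exactly the zero-ablation equivalence the text notes, and is what tuned lens and OCA lens aim to improve on.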
$\mathcal{M}_{\mathrm{TL}}(X)$ is defined by replacing $\ell_N(X)$ with $\hat{\ell}_N(X) := f_i(\ell_i(X))$ during inference, and training $W_i$ and $b_i$ to minimize $\mathcal{L}_{\mathrm{TL}} := \mathbb{E}_X\, \mathcal{L}(\mathcal{M}_{\mathrm{TL}}(X), \mathcal{M}(X))$. Tuned lens demonstrates when information is transferred to LTP: if replacing $\ell_N(X)$ with $\hat{\ell}_N(X)$ achieves low loss, then $\ell_i(X)$ contains sufficient context for computing $\mathcal{M}(X)$, so key information is transferred prior to layer $i$.

Method. We propose Optimal Constant Attention (OCA) lens. We define $\mathcal{M}_{\mathrm{OCA}}(X)$ by using OA to ablate attention layers $i+1$ to $N$: for each of these layers $k$, we replace its output at LTP with a constant $\hat{a}_k$. We train $\hat{a} = (\hat{a}_{i+1}, \dots, \hat{a}_N)$ to minimize $\mathcal{L}_{\mathrm{OCA}} := \mathbb{E}_X\, \mathcal{L}(\mathcal{M}_{\mathrm{OCA}}(X), \mathcal{M}(X))$. Similar to tuned lens, OCA lens reveals whether the LTP activation after layer $i$ contains sufficient context to compute $\mathcal{M}(X)$ by eliminating information transfer from previous token positions to LTP after layer $i$. While tuned lens is a linear map, OCA lens is a function that leverages the model's existing architecture (specifically, its MLP layers) to translate between LTP activations at different layers.
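The OCA-lens forward pass can be sketched schematically on a toy residual stream. This is an illustrative skeleton under simplifying assumptions (no layer norms, placeholder MLPs, made-up constants), not the paper's implementation:

```python
import numpy as np

# Schematic OCA-lens forward pass: after layer i, each later attention
# layer's contribution at the last token position is replaced by a learned
# constant a_k, while the MLP layers still run on the running residual.

def oca_forward(resid_i, attn_consts, mlps):
    """resid_i: LTP residual after layer i; one constant a_k per later layer."""
    resid = resid_i.copy()
    for a_k, mlp in zip(attn_consts, mlps):
        resid = resid + a_k           # OA: attention output -> constant a_k
        resid = resid + mlp(resid)    # MLP layers are kept intact
    return resid

zero_mlp = lambda x: np.zeros_like(x)   # trivial placeholder MLP
out = oca_forward(np.ones(4), [np.full(4, 0.5)] * 2, [zero_mlp] * 2)
```

Because only the constants $\hat{a}_k$ are trained while the MLPs are reused, the learnable parameter count scales with the number of layers times $d_{\mathrm{model}}$, as the text notes.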
OCA lens has far fewer learnable parameters than tuned lens: $O(N d_{\mathrm{model}}) < O(d_{\mathrm{model}}^2)$.

Experiments. We compare $\mathcal{L}_{\mathrm{OCA}}$ to $\mathcal{L}_{\mathrm{TL}}$ for various model sizes. As additional baselines, we also consider ablating later attention layers with mean or resample ablation rather than OA. Results are shown in Figure 3 (left) for GPT-2-XL and Figure 16 for other model sizes. OCA lens achieves significantly lower loss than tuned lens, indicating better extraction of predictive power from LTP activations. For example, the predictive loss of OCA lens drops below 0.01 around layer 35 of GPT-2-XL, but does not reach this point even at the last layer for tuned lens.

Figure 3: Left: prediction loss comparison between tuned lens and ablation-based alternatives. Middle, right: causal faithfulness metrics for tuned and OCA lens under basis-aligned projections.

Additionally, Belrose et al. (2023a) explain that one desideratum for latent prediction is causal faithfulness, i.e. $f_i$ should use $\ell_i(X)$ in the same way as $\mathcal{M}$. We can investigate causal faithfulness by intervening on $\ell_i(X)$ and evaluating the extent to which $\mathcal{M}_{\mathrm{TL}}(X)$ and $\mathcal{M}(X)$ move in parallel. If $\mathcal{M}_{\mathrm{TL}}(X)$ changes significantly but $\mathcal{M}(X)$ does not, for example, then $f_i$ could be extrapolating from spurious correlations, e.g. by inferring from directions that predict information transfer occurring in later layers. Consider a random intervention $\xi$ on $\ell_i(X)$ and let $\mathcal{M}_{\mathrm{TL}}(X; \xi)$ represent replacing $\ell_i(X)$ with $\xi(\ell_i(X))$ before applying $f_i$.
Similarly, let $\mathcal{M}(X; \xi)$ represent replacing $\ell_i(X)$ with $\xi(\ell_i(X))$ during inference. Belrose et al. (2023a) separate causal faithfulness into two measurable properties (both range from -1 to 1, and higher values reflect greater faithfulness):

1. Magnitude correlation: $\mathrm{corr}\big(\mathbb{E}[\mathcal{L}(\mathcal{M}_{\mathrm{TL}}(X;\xi), \mathcal{M}_{\mathrm{TL}}(X)) \mid \xi],\ \mathbb{E}[\mathcal{L}(\mathcal{M}(X;\xi), \mathcal{M}(X)) \mid \xi]\big)$.

2. Direction similarity: $\mathbb{E}\big[\langle \mathcal{M}_{\mathrm{TL}}(X;\xi) \ominus \mathcal{M}_{\mathrm{TL}}(X),\ \mathcal{M}(X;\xi) \ominus \mathcal{M}(X) \rangle\big]$, where $\ominus$ denotes subtraction in logit space and $\langle \cdot, \cdot \rangle$ denotes the Aitchison similarity between distributions.

We assess these properties for $\mathcal{M}_{\mathrm{TL}}$ and $\mathcal{M}_{\mathrm{OCA}}$ for a variety of interventions $\xi$. In Figure 3 (middle, right), we plot these properties for a modified version of the "causal basis projection" $\xi$ from Belrose et al. (2023a). While they train a basis iteratively, this approach is expensive and unstable; we instead extract an approximate basis for $\mathcal{M}_{\mathrm{TL}}$ by performing singular value decomposition (SVD) on $W_i \Sigma^{1/2}$, where $\Sigma$ is the covariance matrix of $\ell_i(X)$, and applying $\Sigma^{1/2}$ to the right singular vectors. For $\mathcal{M}_{\mathrm{OCA}}$, we extract this basis by training a linear map to approximate $f_i$ and using its weights as $W_i$.
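The direction-similarity property compares how two output distributions shift under an intervention. As a hedged sketch, we realize the logit-space subtraction and inner product via the centered log-ratio (clr) transform and a cosine; this clr-based form is one standard way to compute similarity in Aitchison geometry and may differ in detail from the paper's implementation.

```python
import numpy as np

# Sketch of direction similarity between two distribution shifts using a
# clr-based cosine (one standard realization of Aitchison similarity).

def clr(p, eps=1e-12):
    """Centered log-ratio transform of a probability vector."""
    logp = np.log(p + eps)
    return logp - logp.mean()

def direction_similarity(p_base, p_int, q_base, q_int):
    """Cosine of the clr-space shifts p_base -> p_int and q_base -> q_int."""
    d1 = clr(p_int) - clr(p_base)
    d2 = clr(q_int) - clr(q_base)
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
aligned = direction_similarity(p, q, p, q)   # identical shifts
opposed = direction_similarity(p, q, q, p)   # opposite shifts
```

Identical shifts score 1 and opposite shifts score -1, matching the stated range of the faithfulness metrics.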
For both lenses, we compute $\xi(a) = \mu + p(a - \mu)$, where $\mu = \mathbb{E}[\ell_i(X)]$ and $p$ represents projecting onto the orthogonal complement of $\mathrm{span}(\vec{v})$ for a uniformly sampled basis vector $\vec{v}$. We plot the magnitude correlation and direction similarity for $\mathcal{M}_{\mathrm{TL}}$ and $\mathcal{M}_{\mathrm{OCA}}$ with respect to $\mathcal{M}$ in Figure 3. We find that OCA lens measures significantly better on both causal faithfulness metrics across all layers, and we achieve similar results for other choices of interventions $\xi$ (see Appendix H.3). One downstream application of extracting latent predictions from intermediate-layer LTP activations is that they can sometimes be more accurate on text classification tasks than the model's output predictions, especially if the context contains false demonstrations, i.e. examples of incorrect task completions (Halawi et al., 2024). The proposed theory is that the model first computes the correct answer at LTP in early layers, and later layers then move contextual information to LTP that leads it to make adjustments that benefit next-token prediction, such as reporting an incorrect answer for consistency with false demonstrations. We compare the elicitation accuracy boost, or the best elicitation accuracy across layers minus the accuracy of the model output, for OCA lens and tuned lens for GPT-2-XL with 2,000 classification samples from each of 15 text classification datasets from Halawi et al. (2024), using their calibrated accuracy metric. We find that OCA lens increases this accuracy boost for prompts with true demonstrations on 12 of the 15 datasets and for prompts with false demonstrations on 11 of the 15 (see Figure 21).
In particular, for Wikipedia topic classification (DBPedia), OCA lens increases the elicitation accuracy boost from 2.9% to 18.0% with true demonstrations and from 19.2% to 28.8% with false demonstrations (see Figure 4, middle). Full results are reported in Appendix H.4.

Figure 4: Comparison of calibrated elicitation accuracy on selected datasets.

6 Future work

The applications of component importance presented in our work are not exhaustive. A variety of interpretability work either directly applies ablation-based importance or can be framed to use it as a potential tool. OA creates new opportunities to incorporate ablation into studies for which it may be impossible to obtain good results with previous ablation methods. For example, we can train probes derived from using OA with different loss functions (Li et al., 2023), or use an approach similar to OCA lens to decode activations other than the LTP residual stream activation. See Appendix D.3 for an extension of OA to evaluating the extent to which a component performs classification.

Acknowledgements

LJ was partially supported by DMS-2045981 and DMS-2134157.

References

Baan, J., ter Hoeve, M., van der Wees, M., Schuth, A., and de Rijke, M. (2019). Understanding multi-head attention in abstractive summarization.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS One, 10(7):e0130140.
Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., and Müller, K.-R. (2009). How to explain individual classification decisions.
Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. (2017). Network dissection: Quantifying interpretability of deep visual representations.
Bau, D., Zhu, J.-Y., Strobelt, H., Lapedriza, A., Zhou, B., and Torralba, A. (2020).
Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078.
Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Biderman, S., and Steinhardt, J. (2023a). Eliciting latent predictions from transformers with the tuned lens.
Belrose, N., Schneider-Joseph, D., Ravfogel, S., Cotterell, R., Raff, E., and Biderman, S. (2023b). Leace: Perfect linear concept erasure in closed form.
Bhaskar, A., Wettig, A., Friedman, D., and Chen, D. (2024). Finding transformer circuits with edge pruning.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Burns, C., Ye, H., Klein, D., and Steinhardt, J. (2022). Discovering latent knowledge in language models without supervision.
Cao, S., Sanh, V., and Rush, A. M. (2021). Low-complexity probing via finding subnetworks. arXiv preprint arXiv:2104.03514.
Chan, L., Garriga-Alonso, A., Goldowsky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B., and Thomas, N. (2022). Causal scrubbing: A method for rigorously testing interpretability hypotheses. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
Chang, C.-H., Creager, E., Goldenberg, A., and Duvenaud, D. (2019). Explaining image classifiers by counterfactual generation.
Chen, H., Feng, S., Ganhotra, J., Wan, H., Gunasekara, C., Joshi, S., and Ji, Y. (2021). Explaining neural network predictions on sentence pairs via learning word-group masks.
Chen, J., Song, L., Wainwright, M. J., and Jordan, M. I. (2018). Learning to explain: An information-theoretic perspective on model interpretation.
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023).
Towards automated circuit discovery for mechanistic interpretability.
Covert, I., Lundberg, S., and Lee, S.-I. (2020). Understanding global feature contributions with additive importance measures.
Covert, I. C., Lundberg, S., and Lee, S.-I. (2022). Explaining by removing: A unified framework for model explanation. J. Mach. Learn. Res., 22(1).
Cunningham, H., Ewart, A., Smith, L. R., Huben, R., and Sharkey, L. (2023). Sparse autoencoders find highly interpretable model directions.
Dabkowski, P. and Gal, Y. (2017). Real time image saliency for black box classifiers. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Dao, J., Lau, Y.-T., Rager, C., and Janiak, J. (2023). An adversarial example for direct logit attribution: Memory management in gelu-4l.
Dar, G., Geva, M., Gupta, A., and Berant, J. (2023). Analyzing transformers in embedding space.
Datta, A., Sen, S., and Zick, Y. (2016). Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617.
De Cao, N., Schlichtkrull, M., Aziz, W., and Titov, I. (2021). How do decisions emerge across layers in neural models? Interpretation with differentiable masking.
Dhamdhere, K., Sundararajan, M., and Yan, Q. (2018). How important is a neuron?
Din, A. Y., Karidi, T., Choshen, L., and Geva, M. (2023). Jump to conclusions: Short-cutting transformers with linear transformations.
Fisher, A., Rudin, C., and Dominici, F. (2019).
All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously.
Fong, R., Patrick, M., and Vedaldi, A. (2019). Understanding deep networks via extremal perturbations and smooth masks.
Fong, R. C. and Vedaldi, A. (2017). Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE.
Geva, M., Bastings, J., Filippova, K., and Globerson, A. (2023). Dissecting recall of factual associations in auto-regressive language models.
Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. (2022). Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.
Ghorbani, A. and Zou, J. (2020). Neuron shapley: Discovering the responsible neurons.
Goldowsky-Dill, N., MacLeod, C., Sato, L., and Arora, A. (2023). Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969.
Gould, R., Ho, E., and Conmy, A. (2023). Mechanistically interpreting time in gpt-2 small.
Grömping, U. (2007). Estimators of relative importance in linear regression based on variance decomposition. The American Statistician, 61(2):139–147.
Grömping, U. (2009). Variable importance assessment in regression: Linear regression versus random forest. The American Statistician, 63(4):308–319.
Guan, C., Wang, X., Zhang, Q., Chen, R., He, D., and Xie, X. (2019). Towards a deep and unified understanding of deep neural models in NLP. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2454–2463. PMLR.
Gurnee, W., Horsley, T., Guo, Z. C., Kheirkhah, T.
R., Sun, Q., Hathaway, W., Nanda, N., and Bertsimas, D. (2024). Universal neurons in gpt2 language models.
Gurnee, W. and Tegmark, M. (2024). Language models represent space and time.
Halawi, D., Denain, J.-S., and Steinhardt, J. (2024). Overthinking the truth: Understanding how language models process false demonstrations.
Hanna, M., Liu, O., and Variengien, A. (2023). How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model.
Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. (2023). Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models.
Hase, P., Xie, H., and Bansal, M. (2021). The out-of-distribution problem in explainability and search methods for feature importance explanations.
Heimersheim, S. and Janiak, J. (2023). A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
Heimersheim, S. and Nanda, N. (2024). How to use and interpret activation patching.
Hendel, R., Geva, M., and Globerson, A. (2023). In-context learning creates task vectors.
Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., and Andreas, J. (2022). Natural language descriptions of deep visual features.
Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., and Bau, D. (2024). Linearity of relation decoding in transformer language models.
Homma, T. and Saltelli, A. (1996). Importance measures in global sensitivity analysis of nonlinear models. Reliability Engineering & System Safety, 52(1):1–17.
Hooker et al., (2019) Hooker, S., Erhan, D., Kindermans, P.-J., and Kim, B. (2019). A benchmark for interpretability methods in deep neural networks. Huang and Wang, (2018) Huang, Z. and Wang, N. (2018). Data-driven sparse structure selection for deep neural networks. Ishwaran, (2007) Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1(none). Janzing et al., (2019) Janzing, D., Minorics, L., and Blöbaum, P. (2019). Feature relevance quantification in explainable ai: A causal problem. Katz and Belinkov, (2023) Katz, S. and Belinkov, Y. (2023). Visit: Visualizing and interpreting the semantic information flow of transformers. Kim et al., (2020) Kim, S., Yi, J., Kim, E., and Yoon, S. (2020). Interpretation of nlp models through input marginalization. Lakretz et al., (2019) Lakretz, Y., Kruszewski, G., Desbordes, T., Hupkes, D., Dehaene, S., and Baroni, M. (2019). The emergence of number and syntax units in lstm language models. Lan et al., (2024) Lan, M., Torr, P., and Barez, F. (2024). Towards interpretable sequence continuation: Analyzing shared circuits in large language models. Leino et al., (2018) Leino, K., Sen, S., Datta, A., Fredrikson, M., and Li, L. (2018). Influence-directed explanations for deep convolutional networks. Li and Mahadevan, (2017) Li, C. and Mahadevan, S. (2017). Sensitivity Analysis of a Bayesian Network. ASCE-ASME J Risk and Uncert in Engrg Sys Part B Mech Engrg, 4(1). 011003. Li et al., (2017) Li, J., Monroe, W., and Jurafsky, D. (2017). Understanding neural networks through representation erasure. Li et al., (2023) Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. (2023). Emergent world representations: Exploring a sequence model trained on a synthetic task. Li et al., (2024) Li, M., Davies, X., and Nadeau, M. (2024). Circuit breaking: Removing model behaviors with targeted ablation. 
Lieberum et al., (2023) Lieberum, T., Rahtz, M., Kramár, J., Nanda, N., Irving, G., Shah, R., and Mikulik, V. (2023). Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla. Liu et al., (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. (2017). Learning efficient convolutional networks through network slimming. Louizos et al., (2018) Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0subscript0l_0l0 regularization. Lundberg et al., (2020) Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). From local explanations to global understanding with explainable ai for trees. Nature machine intelligence, 2(1):56–67. Lundberg and Lee, (2017) Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Makelov et al., (2023) Makelov, A., Lange, G., and Nanda, N. (2023). Is this the subspace you are looking for? an interpretability illusion for subspace activation patching. Marks et al., (2024) Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. Marks and Tegmark, (2023) Marks, S. and Tegmark, M. (2023). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. Mase et al., (2024) Mase, M., Owen, A. B., and Seiler, B. B. (2024). Variable importance without impossible data. Annual Review of Statistics and Its Application, 11(Volume 11, 2024):153–178. McDougall et al., (2023) McDougall, C., Conmy, A., Rushing, C., McGrath, T., and Nanda, N. (2023). 
Copy suppression: Comprehensively understanding an attention head. McGrath et al., (2023) McGrath, T., Rahtz, M., Kramar, J., Mikulik, V., and Legg, S. (2023). The hydra effect: Emergent self-repair in language model computations. Meng et al., (2022) Meng, K., Bau, D., Andonian, A., and Belinkov, Y. (2022). Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372. Merullo et al., (2023) Merullo, J., Eickhoff, C., and Pavlick, E. (2023). Circuit component reuse across tasks in transformer language models. Merullo et al., (2024) Merullo, J., Eickhoff, C., and Pavlick, E. (2024). Language models implement simple word2vec-style vector arithmetic. Montavon et al., (2017) Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. (2017). Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222. Mu and Andreas, (2021) Mu, J. and Andreas, J. (2021). Compositional explanations of neurons. Nanda, (2023) Nanda, N. (2023). Attribution patching: Activation patching at industrial scale. Nathans et al., (2012) Nathans, L., Oswald, F., and Nimon, K. (2012). Interpreting multiple linear regression: A guidebook of variable importance. Practical Assessment, Research and Evaluation, 17(9):1–19. Olsson et al., (2022) Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. (2022). In-context learning and induction heads. Petsiuk et al., (2018) Petsiuk, V., Das, A., and Saenko, K. (2018). Rise: Randomized input sampling for explanation of black-box models. Rabitz, (1989) Rabitz, H. (1989). Systems analysis at the molecular scale. Science, 246(4927):221–226. 
Radford et al., (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. Ribeiro et al., (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "why should i trust you?": Explaining the predictions of any classifier. Robnik-Sikonja and Kononenko, (2008) Robnik-Sikonja, M. and Kononenko, I. (2008). Explaining classifications for individual instances. Knowledge and Data Engineering, IEEE Transactions on, 20:589 – 600. Rushing and Nanda, (2024) Rushing, C. and Nanda, N. (2024). Explorations of self-repair in language models. Räuker et al., (2022) Räuker, T., Ho, A., Casper, S., and Hadfield-Menell, D. (2022). Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. Schlichtkrull et al., (2022) Schlichtkrull, M. S., De Cao, N., and Titov, I. (2022). Interpreting graph neural networks for nlp with differentiable edge masking. Schulz et al., (2020) Schulz, K., Sixt, L., Tombari, F., and Landgraf, T. (2020). Restricting the flow: Information bottlenecks for attribution. Schwab and Karlen, (2019) Schwab, P. and Karlen, W. (2019). Cxplain: Causal explanations for model interpretation under uncertainty. Shah et al., (2024) Shah, H., Ilyas, A., and Madry, A. (2024). Decomposing and editing predictions by modeling model computation. Simonyan et al., (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. Singh et al., (2024) Singh, A. K., Moskovitz, T., Hill, F., Chan, S. C. Y., and Saxe, A. M. (2024). What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation. Slack et al., (2020) Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. (2020). Fooling lime and shap: Adversarial attacks on post hoc explanation methods. 
Smilkov et al., (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. Sobol, (1993) Sobol, I. (1993). Sensitivity estimates for nonlinear mathematical models. Computational Mathematics and Mathematical Physics, 1(4):407–413. Srivastava et al., (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958. Stolfo et al., (2023) Stolfo, A., Belinkov, Y., and Sachan, M. (2023). A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. Strobl et al., (2008) Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC bioinformatics, 9:1–11. Strumbelj and Kononenko, (2010) Strumbelj, E. and Kononenko, I. (2010). An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11(1):1–18. Strumbelj et al., (2009) Strumbelj, E., Kononenko, I., and Sikonja, M. R. (2009). Explaining instance classifications with interactions of subsets of feature values. Data and Knowledge Engineering, 68(10):886–904. Sundararajan et al., (2017) Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. Syed et al., (2023) Syed, A., Rager, C., and Conmy, A. (2023). Attribution patching outperforms automated circuit discovery. Tigges et al., (2023) Tigges, C., Hollinsworth, O. J., Geiger, A., and Nanda, N. (2023). Linear representations of sentiment in large language models. Todd et al., (2024) Todd, E., Li, M. L., Sharma, A. S., Mueller, A., Wallace, B. C., and Bau, D. (2024). Function vectors in large language models. Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). 
Attention is all you need. Vig et al., (2020) Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. (2020). Investigating gender bias in language models using causal mediation analysis. Advances in neural information processing systems, 33:12388–12401. Wang et al., (2022) Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2022). Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. Wei et al., (2020) Wei, C., Kakade, S., and Ma, T. (2020). The implicit and explicit regularization effects of dropout. Williamson and Feng, (2020) Williamson, B. D. and Feng, J. (2020). Efficient nonparametric statistical inference on population feature importance using shapley values. Wu et al., (2024) Wu, Z., Geiger, A., Huang, J., Arora, A., Icard, T., Potts, C., and Goodman, N. D. (2024). A reply to makelov et al. (2023)’s "interpretability illusion" arguments. Ye et al., (2018) Ye, J., Lu, X., Lin, Z., and Wang, J. Z. (2018). Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. Yoon et al., (2018) Yoon, J., Jordon, J., and Van der Schaar, M. (2018). Invase: Instance-wise variable selection using neural networks. In International conference on learning representations. Zeiler and Fergus, (2013) Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. Zhang and Nanda, (2024) Zhang, F. and Nanda, N. (2024). Towards best practices of activation patching in language models: Metrics and methods. Zhang and Janson, (2020) Zhang, L. and Janson, L. (2020). Floodgate: Inference for model-free variable importance. arXiv preprint arXiv:2007.01283. Zhou et al., (2015) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors emerge in deep scene cnns. Zhuang et al., (2020) Zhuang, T., Zhang, Z., Huang, Y., Zeng, X., Shuang, K., and Li, X. (2020). 
Neuron-level structured pruning using polarization regularizer. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 9865–9877. Curran Associates, Inc.
Zintgraf, L. M., Cohen, T. S., Adel, T., and Welling, M. (2017). Visualizing deep neural network decisions: Prediction difference analysis.

Appendix A Limitations

As is the case for previous work, we do not provide an entirely precise definition of the "importance" of a component. The importance of a component can be generally described as an aggregation of causal effects that summarizes the component's contribution to model performance. Among the many ways to aggregate causal effects, there may not be a mathematically rigorous way to show that one measure of importance produces the correct or canonical aggregation. However, component importance is useful for a wide variety of applications in interpretability, so aside from showing that our approach to component importance better captures relevant considerations in a conceptual sense, we focus on the utility that it provides to some of these applications.

As noted in Section 2.3, optimal ablation does not entirely eliminate the contribution of information spoofing to Δ. For example, if A typically conveys strong and exact information about the input, then there may not exist any value of a that hedges between a range of inputs. Though OA produces circuits that achieve lower loss at a given level of edge sparsity, it may elicit mechanisms that were not previously used for some subtask, especially if there are multiple computational paths that could lead to the same conclusion. However, if the subtask of interest is sufficiently complex, it seems unlikely that a model would have many "dormant" mechanisms that can be repurposed to perform the subtask, because this redundancy wastes computational capacity.
For factual recall, it remains to be seen whether localization is helpful for applications that are further downstream, such as producing surgical model edits (Hase et al., 2023; Shah et al., 2024).

Appendix B Additional related work

As mentioned in Section 2.2, component importance is strongly related to variable importance, which quantifies the importance of a model input X_i (also known as a feature or covariate in the variable importance literature).

Variable importance. Much of the variable importance literature concerns "oracle" prediction, which roughly considers how much X_i contributes to the performance of the best possible predictor of Y given the set of covariates (X_1, …, X_n), and frames the importance of X_i as a property of the joint distribution of (X_1, …, X_n, Y) rather than a property of any particular model used for prediction. Most work in this area analyzes some parametric class of estimators, like linear models (Grömping, 2007; Nathans et al., 2012) or Bayesian networks (Li and Mahadevan, 2017). Later work generalizes parametric variable importance to arbitrary model classes, e.g. by training an ensemble of models that only have access to subsets of the covariates (Strumbelj et al., 2009; Fisher et al., 2019). Recent work has also studied non-parametric variable importance, in which we attempt to lower-bound the best performance of any arbitrary estimator (Williamson and Feng, 2020; Zhang and Janson, 2020). On the other hand, our motivation is to interpret the behavior of one specific model ℳ (Fisher et al., 2019; Hooker et al., 2019), not to analyze the theoretical relationship between model inputs and outputs.
Rather than estimating how well any function of X_i can predict Y, we wish to estimate how much the particular function ℳ uses an input feature X_i to predict Y. Previous work on this "algorithmic" variant of variable importance has taken two main approaches.

Local function approximations. One way to quantify how much X_i contributes to model performance is to aggregate local function approximations, which approximate the model around a particular input. Common tools for local approximation include the gradient of ℳ at a given x (Rabitz, 1989; Baehrens et al., 2009; Simonyan et al., 2014; Leino et al., 2018; Nanda, 2023), or a linear function that well-approximates ℳ(x + ε) for a chosen noise term ε (Ribeiro et al., 2016; Smilkov et al., 2017). Since these tools often yield straightforward estimates of the local importance of X_i for the input x, one approach to quantifying the global importance of X_i is to aggregate the importance estimates given by these local approximations, for example by using a first-degree Taylor approximation around a reference input x_0 (Bach et al., 2015; Montavon et al., 2017), or by integrating gradients along a straight-line path from x_0 to any studied input x (Sundararajan et al., 2017; Dhamdhere et al., 2018). This approach to measuring variable importance works just as well for internal components as it does for inputs. However, local function approximations can fail to capture the overall loss landscape, especially in the common setting where ℳ has unbounded gradients, and can often be manipulated to produce arbitrary feature importance values (Slack et al., 2020; Hase et al., 2021).

Ablation-based measures. The second main approach considers the ablation of feature X_i.
In this approach, the feature X_i is ablated by replacing it with a different random variable X̄_i that captures less information about the original feature value. We then compare the model performance when X_i is replaced with X̄_i to the original model performance, per the definition of Δ in Section 2.1 (where the ablated component A is feature X_i of the model input). Many of the current methods used for ablating internal model components as described in Section 2.2 were first introduced in feature importance work. Zero ablation (Dabkowski and Gal, 2017; Li et al., 2017; Petsiuk et al., 2018; Schwab and Karlen, 2019), mean ablation (Zeiler and Fergus, 2013; Zhou et al., 2015), and Gaussian noise injection (Fong and Vedaldi, 2017; Fong et al., 2019; Guan et al., 2019; Schulz et al., 2020) are all used to remove input features, such as the pixels of an image or the tokens of a text input, to assess their importance. Resample ablation is also common in feature importance work; an early variant samples X̄_i from a uniform distribution (Sobol, 1993; Homma and Saltelli, 1996; Strumbelj and Kononenko, 2010), while later work generally performs resample ablation on features by resampling them from their marginal distribution (Breiman, 2001; Robnik-Sikonja and Kononenko, 2008; Datta et al., 2016; Lundberg and Lee, 2017; Janzing et al., 2019; Covert et al., 2020; Kim et al., 2020).
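To make the contrast between these ablation methods concrete, here is a minimal numpy sketch of estimating the performance gap Δ for a single input feature under zero, mean, and resample ablation, plus an optimal-ablation variant that searches a coarse grid of candidate constants. The function name, the grid search, and the toy setup are our own illustration, not the paper's implementation:

```python
import numpy as np

def ablation_gap(model, loss, X, Y, i, method="mean", rng=None):
    """Estimate Delta(M, X_i): the increase in expected loss when
    feature i of the input is replaced by an ablated value."""
    rng = rng if rng is not None else np.random.default_rng(0)
    base = loss(model(X), Y).mean()
    X_abl = X.copy()
    if method == "zero":
        X_abl[:, i] = 0.0
    elif method == "mean":
        X_abl[:, i] = X[:, i].mean()
    elif method == "resample":
        # draw replacement values from the marginal distribution of X_i
        X_abl[:, i] = rng.permutation(X[:, i])
    elif method == "optimal":
        # optimal ablation: the constant minimizing the ablated loss,
        # approximated here by scanning a grid of sample quantiles
        best, best_loss = None, np.inf
        for a in np.quantile(X[:, i], np.linspace(0, 1, 21)):
            X_abl[:, i] = a
            cur = loss(model(X_abl), Y).mean()
            if cur < best_loss:
                best, best_loss = a, cur
        X_abl[:, i] = best
    return loss(model(X_abl), Y).mean() - base
```

By construction, the optimal-ablation gap is no larger than the gap for any constant on the grid, which is the sense in which OA lower-bounds constant-replacement ablations.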
Measuring feature importance via these ablation methods suffers from a well-documented "out-of-distribution" problem (Ishwaran, 2007; Fong and Vedaldi, 2017; Hooker et al., 2019; Hase et al., 2021; Mase et al., 2024): since setting X_i to zero or its mean, resampling X_i from its marginal distribution, or adding Gaussian noise to X_i could result in an input that was never observed during training, the measured feature importance values could potentially be determined by model behavior on impossible and/or nonsensical inputs. One way to mitigate this out-of-distribution problem is to replace feature X_i with a random variable X̄_i sampled from its conditional distribution (Strobl et al., 2008; Lundberg et al., 2020), i.e. X̄_i ∼ X_i | X_{-i} with X̄_i ⟂ X_i | X_{-i}, where X_{-i} denotes the other features X_1, …, X_{i-1}, X_{i+1}, …, X_n. Since the conditional distribution is often intractable to sample from, previous work employs a range of approximation techniques. For example, Zintgraf et al. (2017) sample an ablated pixel from its conditional distribution given an ℓ × ℓ patch of its proximate pixels instead of conditioning on the entire image, and Chang et al. (2019) use a generative model to simulate the conditional distribution.
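For intuition, exact conditional resampling is feasible when the features are discrete with low cardinality: resample X_i only among rows whose other features X_{-i} match exactly. The sketch below is our own toy illustration; real-valued features require approximations like those cited above:

```python
import numpy as np

def conditional_resample(X, i, rng):
    """Resample column i of X from its conditional distribution given
    the remaining columns, by drawing from rows whose other features
    match exactly. Only feasible for low-cardinality discrete data."""
    X = np.asarray(X)
    X_bar = X.copy()
    others = np.delete(X, i, axis=1)
    for r in range(len(X)):
        # rows whose X_{-i} equals row r's X_{-i}
        match = np.all(others == others[r], axis=1)
        X_bar[r, i] = rng.choice(X[match, i])
    return X_bar
```

When X_i is a deterministic function of X_{-i}, conditional resampling leaves the data unchanged, whereas marginal resampling would produce feature combinations the model has never seen.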
However, in a setting where the relevant features X_i represent internal model components rather than inputs to the model, it often does not make sense to discuss an "out-of-distribution" problem, because the component values (X_1, …, X_n) = (A_1(X), …, A_n(X)) are usually near-deterministically related to each other. For example, for any neuron, it is typically the case that its value can be almost deterministically recovered from the values of other neurons in the same layer. Thus, nearly any intervention on an internal component A_i brings the model "out-of-distribution," in the sense that the model observes the vector (A_1(X), …, A_n(X)) with A_i(X) replaced by a value a of near-zero probability density. Our dichotomy of deletion and spoofing in Section 2.1 is more precise than the typical discussion of the out-of-distribution problem in its description of the distortions to importance values that we wish to avoid. On one hand, our analysis is more lenient than the blanket requirement that the vector of all internal component values (A_1(x), …, A_n(x)) be in-distribution, in the sense that not all interventions that bring (A_1(x), …, A_n(x)) out of distribution constitute spoofing; for example, replacing A_i(x) with a does not have a spoofing-related contribution to Δ if A_i(x) and a are equivalent for the sake of downstream computation.
On the other hand, our analysis is more stringent in the sense that effect 2 from Section 2.1 is recognized as a form of spoofing that can occur even when interventions are in-distribution (see Appendix D.1 for more details).

Using dropout to eliminate spoofing. One way to eliminate spoofing when intervening on A(X) is to train ℳ to accept neutral constant values that indicate that component A has stopped functioning, and then replace A with these built-in neutral values to assess the importance of A. Variations of this technique are common in feature importance (Strumbelj et al., 2009; Chen et al., 2018; Yoon et al., 2018; Hooker et al., 2019). For internal components, we could train neural networks with dropout (Srivastava et al., 2014; Wei et al., 2020), and then use zero ablation to assess the importance of A. Since the downstream computation is trained to recognize A(X) = 0 as an indication that A carries no information, as opposed to strong information associated with an input other than the original X, Δ_zero(ℳ, A) becomes an accurate assessment of deletion (effect 1 from Section 2.1). However, re-training with neutral values does not necessarily assist in analyzing a particular model ℳ, since re-training ℳ will in general change ℳ itself. Furthermore, training with dropout incentivizes ℳ to lower Δ_zero(ℳ, A) for any component A, since part of the loss function involves minimizing loss with a random subset of components ablated. As a result, we expect to observe more redundant computation shared between many model components, since a random subset of them could be ablated during training.
This redundancy inherently tends to make ℳ less modular and harder to analyze; for example, we should expect a broad variety of components to perform relevant computation for any input, even if an accurate prediction could be computed with just a few components, so it becomes difficult to localize model behaviors. Since interpretability involves decomposing model computation into smaller pieces and identifying specialization among model components, models trained with dropout may be less interpretable. To summarize, while the Δ_zero measurements are more "accurate" when ℳ is trained with dropout, they may become less "useful" for interpretability. On the other hand, OA makes Δ measurements a more accurate reflection of deletion effects without training ℳ in a way that distorts the magnitude of these effects.

Aggregation mechanisms for ablation methods. On top of a selected ablation method, some work uses Shapley values to aggregate performance gap measurements for sets of features (Strumbelj et al., 2009; Strumbelj and Kononenko, 2010; Datta et al., 2016; Lundberg and Lee, 2017; Janzing et al., 2019; Lundberg et al., 2020; Covert et al., 2020). This line of work measures the importance of X_i by estimating a weighted average of the performance gap Δ(ℳ, S) over all subsets S ⊂ {X_1, …, X_n}, rather than considering only Δ(ℳ, X_i). This aggregation mechanism is applied after choosing an ablation method via which to measure each Δ, and is just as compatible with OA as with any other ablation method.

Sparse pruning and masking. Finally, in the literature on sparse pruning and masking, prior work sometimes performs an operation that is procedurally similar to optimal ablation by adding a bias term to removed features or activations after setting weights to zero.
In some structured pruning work, it is typical to introduce scaled batch normalization layers Ã(X) = γ·A(X) + β, with γ ∈ [0, 1], to the output of each computational block A, and to regularize γ toward zero to select weights to prune (Liu et al., 2017; Ye et al., 2018; Zhuang et al., 2020). When γ reaches 0, the output of A is set to the constant β, which is trained to minimize the loss of the pruned model. However, the motivation of this reparameterization is not to measure component importance, and optimal ablation can be applied to more general model components (e.g. computational edges in any graph representation). Similar to pruning, sparse masking work searches for a mask over input tokens such that, for any input, most tokens are zeroed out while model performance is retained (Li et al., 2017; De Cao et al., 2021; Chen et al., 2021; Schlichtkrull et al., 2022). In particular, De Cao et al. (2021) replace masked tokens in an input X = (X_1, …, X_n) with a learned bias b(X). While this operation may seem similar to optimal ablation, a fundamental difference is that the bias b(X) is different for each input sequence X and is trained to equalize embeddings at different token positions within a single X, rather than assuming the same constant value for all values of X. Thus, for each X, b(X) contains specific information about the masked tokens in X, so unlike OA, this technique does not perform total ablation on the masked tokens. A follow-up work (Schlichtkrull et al., 2022) trains a common b for a dataset of inputs X. However, Schlichtkrull et al. (2022) use an auxiliary linear model φ_i(A_1(X), …, A_n(X)) to predict whether a component A_i(X) should be masked.
Since φ_i explicitly depends on the values of the masked components A_i(X), the model output remains dependent on information contained in A_i(X), and total ablation is not achieved. Moreover, the auxiliary model φ provides the model with additional computation to distill information about the input, rather than strictly reducing the computational complexity of the original model as OA does. The use of an auxiliary model is a requisite feature of their method and cannot be decoupled from the masking technique: without using φ(X) to predict the masked components, computing masks would require a separate optimization procedure for each input, which makes it computationally intractable to optimize a single b over an entire distribution of inputs.

Appendix C Additional preliminaries

C.1 Models as computational graphs

We can write an ML model ℳ as a connected directed acyclic graph. The graph's source vertices represent the model's (typically vector-valued) input, its sink vertices represent the model's output, and intermediate vertices represent units of computation. For the sake of simplicity, assume ℳ has a single input and a single output. Each intermediate vertex A_i represents a computational block that takes in the values of previous vertices evaluated on x and itself produces an output A_i(x) that is taken as input to later vertices. There is a directed edge from vertex A_j to vertex A_i if A_j(x) is taken as input to A_i. Let ℳ be represented by the computational graph (G, E), where G is the set of vertices and E is the set of edges.
Let A_{0:n} be a tuple representing G in topologically sorted order (A_0 represents the model input, while A_n represents the model output). For a particular vertex A_i, let G⃗^i = (G^i_1, …, G^i_k) be the tuple of vertices (duplicates allowed) whose outputs A_i takes as immediate inputs. As we will see, we will sometimes require multiple edges between a pair of vertices. Rather than the standard edge notation e = (A_j, A_i) ∈ E for simple graphs, we adopt the notation e = (A_j, A_i, z) ∈ E to indicate that G^i_z = A_j, i.e. that A_j(x) is taken as the z-th input to A_i. Model inference is performed by evaluating the vertices in topologically sorted order. We perform inference on an input x by setting A_0(x) = x and then iteratively evaluating A_i(x) = A_i(G⃗^i(x)) for i ∈ {1, …, n}. By the time we evaluate some vertex A_i, we have already computed the values G^i_z(x) for each of its inputs, because they precede A_i in the topological sort. Finally, we determine that ℳ(x) = A_n(x).
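The inference procedure just described translates directly into code. The dict-based graph encoding below is our own toy representation (vertex names "A0", "A1", … and a `parents` map listed in topological order), not notation from the paper:

```python
def evaluate(blocks, parents, x):
    """Run inference on a model written as a computational graph.

    `blocks` maps each vertex name to its function A_i; `parents`
    maps each vertex to the tuple of vertices whose outputs it takes
    as immediate inputs, listed in topologically sorted order.
    """
    values = {"A0": x}  # A_0(x) = x is the model input
    for v, ps in parents.items():  # topological order
        values[v] = blocks[v](*(values[p] for p in ps))
    return values  # M(x) is the value of the last (output) vertex
```

For instance, with A_1(x) = x + 1, A_2(x) = 2x, and A_3 summing its two inputs, evaluating at x = 3 yields A_3 = 4 + 6 = 10.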
We will alternate between the notation A_i(G⃗^i), which explicitly writes A_i as a function of its immediate inputs, and the notation A_i(x), which indicates that the output of A_i is a function of x. We also sometimes use A_i(x) as a standalone quantity apart from evaluating ℳ(x), and observe that this quantity is a function of x computed by evaluating A_j(x) in order for j ∈ {1, …, i}. The graph notation for any ML model is not unique: for any model, there are many equivalent graphs that faithfully represent its computation. In particular, a computational graph can represent a model at varying levels of detail. At one extreme, intermediate vertices can designate individual additions, multiplications, and nonlinearities; such a graph would have at least as many vertices as model parameters. Fortunately, most model architectures have self-contained computational blocks, which allows them to be represented by graphs that convey a significantly higher level of abstraction. For example, in convolutional networks, intermediate vertices can represent convolutional filters and pooling layers, while in transformer models, the natural high-level computational units are attention heads and multi-layer perceptron (MLP) modules.

C.2 Activation patching

Activation patching is the practice of evaluating ℳ(x) while performing the intervention of setting some component A_i to a counterfactual value a instead of A_i(x) during inference.
We use the notation $\mathcal{M}_{A_i}(x, a)$ extensively in the paper to indicate this practice, and here we give a more precise definition in terms of $\mathcal{M}$ as a computational graph.

Definition C.1 (Vertex activation patching). To compute $\mathcal{M}_{A_i}(x, a)$, compute $A_0(x), \ldots, A_{i-1}(x)$ in the normal fashion and set $A_i(x) = a$. Then compute each vertex $A_{i+1}(x), \ldots, A_n(x)$ in order, computing each vertex $A_j$ as a function of its immediate inputs, i.e. $A_j(x) = A_j(G^j_1(x), \ldots, G^j_k(x))$. Finally, return $\mathcal{M}_{A_i}(x, a) = A_n(x)$.

During this modified forward pass, a vertex $A_j$ that takes $A_i$ as its $z$th immediate input, i.e. for which $(A_i, A_j, z) \in E$, takes $a$ as its $z$th input instead of the normal value $A_i(x)$. If some later vertex takes $A_j$ as input, it receives this modified version of $A_j(x)$, and so on, so the intervention on $A_i$ may have an effect that carries through the graph computation and eventually makes $\mathcal{M}_{A_i}(x, a)$ different from $\mathcal{M}(x)$.
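Definition C.1's modified forward pass can be sketched on a toy graph (a minimal illustration with invented vertex functions, not the paper's implementation):

```python
def run_graph(vertices, parents, x, patch=None):
    """Evaluate a computational graph in topologically sorted order.

    vertices[i]: function of its parents' outputs (vertices[0] is the input).
    parents[i]: indices of the immediate inputs G^i (duplicates allowed).
    patch: optional (i, a) pair that sets A_i(x) = a during inference.
    """
    vals = [None] * len(vertices)
    vals[0] = x                        # A_0(x) = x
    for i in range(1, len(vertices)):
        if patch is not None and patch[0] == i:
            vals[i] = patch[1]         # override A_i(x) with the constant a
        else:
            vals[i] = vertices[i](*[vals[j] for j in parents[i]])
    return vals[-1]                    # M(x) = A_n(x)

# toy graph: A_1 = x + 1, A_2 = 2x, A_3 = A_1 * A_2 (the output)
vertices = [None, lambda x: x + 1, lambda x: 2 * x, lambda u, v: u * v]
parents = [[], [0], [0], [1, 2]]

print(run_graph(vertices, parents, 3))                # M(3) = 4 * 6 = 24
print(run_graph(vertices, parents, 3, patch=(1, 0)))  # M_{A_1}(3, 0) = 0 * 6 = 0
```

Downstream vertices consume the patched value automatically because they read from `vals`, mirroring how the intervention propagates through the graph.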
In Section 3, we discuss extending this practice to edges $e$.

Definition C.2 (Edge activation patching). To compute $\mathcal{M}_e(x, a)$, where $e = (A_j, A_i, z)$, compute $A_0(x), \ldots, A_{i-1}(x)$ in the normal fashion. Set $A_i(x) = A_i(G^i_1(x), \ldots, G^i_{z-1}(x), a, G^i_{z+1}(x), \ldots, G^i_k(x))$, i.e. set its $z$th input to $a$ instead of $A_j(x)$. Then compute each vertex $A_{i+1}(x), \ldots, A_n(x)$ in order as a function of its immediate inputs. Finally, return $\mathcal{M}_e(x, a) = A_n(x)$.

As mentioned in the main text, activation patching on a particular edge $e = (A_j, A_i, z)$ is more surgical than activation patching on its parent vertex $A_j$. Patching $A_j$ would replace $A_j(x)$ with $a$ as an input to all of its child vertices, whereas patching only $e$ modifies $A_j(x)$ only as an input to $A_i$. Notice that patching the vertex $A_j$ is equivalent to patching every edge $e = (A_j, A_i, z)$ that emanates from $A_j$ in the graph.

C.3 Transformer architecture

The transformer architecture (Vaswani et al., 2017) may be familiar to most readers.
However, since our experiments involve interventions during model inference at varying levels of granularity, we include a summary of the transformer computation, which we later reference to specify exactly how we edit the computation.

Transformers $\mathcal{M}$ take in token sequences $x_{1:s}$ of length $s$, which are prepended with a constant padding token $x_0$. Let $x = (x_j)_{j=0}^{s}$. The model simultaneously computes, for each token position $j$, a predicted probability distribution $(\hat{\mathbb{P}}(x_{j+1} \mid x_{0:j}))_{j=0}^{s}$ for the $(j+1)$th token given the first $j$ tokens. We use $\mathcal{M}(x)$ to refer to the predicted probability distribution over the $(s+1)$th token. We sometimes abuse notation and write $\mathbb{P}(\mathcal{M}(x) = y)$ to indicate $\hat{\mathbb{P}}(y \mid x)$, i.e. the probability placed on prediction $y$ by the distribution $\mathcal{M}(x)$. Let $\tilde{X}$ be a random input sequence and let $S$ be a token position sampled uniformly from $\{1, \ldots, s\}$. $\mathcal{M}$ is trained to minimize $-\mathbb{E}_{X,S} \log \mathbb{P}(\mathcal{M}(\tilde{X}_{0:S-1}) = \tilde{X}_S)$.
However, we generally refer to input-label pairs $(X, Y) = (\tilde{X}_{0:S-1}, \tilde{X}_S)$, so that the loss function is instead written
$$\mathbb{E}_{X,Y}\, \mathcal{L}(\mathcal{M}(X), Y) = \mathbb{E}_{X,Y}\left[-\log \mathbb{P}(\mathcal{M}(X) = Y)\right].$$

To evaluate $\mathcal{M}$, each token $x_j$ is mapped to an embedding $\mathrm{Resid}^{(0)}_j(x) = t(x_j) + p(j)$ of dimension $d_{\mathrm{model}}$, where $t(x_j)$ is a token embedding of token $x_j$ and $p(j)$ is a position embedding representing position $j$ in the sequence. Over the course of inference, $\mathcal{M}$ keeps track of a "residual stream" representation $\mathrm{Resid}^{(i)}_j$ at each token position $j$, a vector of dimension $d_{\mathrm{model}}$, which it updates by iterating over its $n_{\mathrm{layers}}$ layers, adding each layer's contribution to the previous representation:
$$\mathrm{MResid}^{(i)}_j(x) = \mathrm{Resid}^{(i-1)}_j(x) + \sum_{k=1}^{n_{\mathrm{heads}}} \mathrm{Attn}^{(i,k)}_j(x) \qquad (7)$$
$$\mathrm{Resid}^{(i)}_j(x) = \mathrm{MResid}^{(i)}_j(x) + \mathrm{MLP}^{(i)}_j(x). \qquad (8)$$

Attention heads $\mathrm{Attn}^{(i,k)}$ transfer information between token positions.
Let $\mathrm{LN}$ (layer-norm) be the function that takes a matrix $Z$ of size $m \times n$ and outputs a matrix of the same size such that each row of $\mathrm{LN}(Z)$ is equal to the corresponding row of $Z$ normalized by its $L_2$ norm: $(\mathrm{LN}(Z))_j = \frac{Z_j}{\|Z_j\|_2}$. Let $R = \mathrm{LN}(\mathrm{Resid}^{(i-1)}(x))$. Attention heads are computed as follows:
$$\mathrm{AttnScore}^{(i,k)}(x) = \mathrm{softmax}\Big( \triangle \cdot \underbrace{\big(R W^{(i,k)}_Q + b^{(i,k)}_Q\big)}_{Q_{(i,k)}} \underbrace{\big(R W^{(i,k)}_K + b^{(i,k)}_K\big)^T}_{K_{(i,k)}^T} \Big)$$
$$\mathrm{Attn}^{(i,k)}(x) = \mathrm{AttnScore}^{(i,k)}(x)\, \underbrace{\big(R W^{(i,k)}_V + b^{(i,k)}_V\big)}_{V_{(i,k)}}\, W^{(i,k)}_O + b^{(i,k)}_O. \qquad (9)$$

The $W^{(i,k)}$s and $b^{(i,k)}$s are weights.
$W^{(i,k)}_Q$, $W^{(i,k)}_K$, and $W^{(i,k)}_V$ have size $d_{\mathrm{model}} \times d_{\mathrm{head}}$, and $W^{(i,k)}_O$ has size $d_{\mathrm{head}} \times d_{\mathrm{model}}$; $b^{(i,k)}_Q$, $b^{(i,k)}_K$, and $b^{(i,k)}_V$ have dimension $d_{\mathrm{head}}$, while $b^{(i,k)}_O$ has dimension $d_{\mathrm{model}}$. Biases are added to each row. $\triangle$ is a lower-triangular matrix of 1s, $\cdot$ represents the elementwise product, and the softmax is performed row-wise. Multiplying by $\triangle$ ensures that $\mathrm{Attn}^{(i,k)}_j(x)$ depends only on $\mathrm{Resid}^{(i-1)}_{0:j}$, so information can only be propagated forward: the prediction of token $j+1$ can only depend on tokens $0$ through $j$.

MLP layers are computed token-wise; the same map is applied at each token position $j$. Let $R = \mathrm{LN}(\mathrm{MResid}^{(i)}(x))$, and let $R_j$ be the $j$th row of $R$. MLPs are computed as follows:
$$\mathrm{MLP}^{(i)}_j(x) = \mathrm{ReLU}\big(R_j W^{(i)}_{\mathrm{in}} + b^{(i)}_{\mathrm{in}}\big) W^{(i)}_{\mathrm{out}} + b^{(i)}_{\mathrm{out}}.$$
The $W^{(i)}$s and $b^{(i)}$s are weights.
$W^{(i)}_{\mathrm{in}}$ has shape $d_{\mathrm{model}} \times d_{\mathrm{mlp}}$ and $W^{(i)}_{\mathrm{out}}$ has shape $d_{\mathrm{mlp}} \times d_{\mathrm{model}}$; $b^{(i)}_{\mathrm{in}}$ has dimension $d_{\mathrm{mlp}}$ and $b^{(i)}_{\mathrm{out}}$ has dimension $d_{\mathrm{model}}$.

Finally, the output probability distribution is determined by applying a final transformation to the residual stream representation after the last layer:
$$\mathrm{Out}(x) := \mathrm{softmax}\big(\mathrm{LN}\big(\mathrm{Resid}^{(n_{\mathrm{layers}})}(x)\big) W_{\mathrm{unembed}}\big)$$
$W_{\mathrm{unembed}}$ is a learnable weight matrix of size $d_{\mathrm{model}} \times d_{\mathrm{vocab}}$, and the softmax is performed row-wise. $\mathrm{Out}(x)$ is a matrix of size $(s+1) \times d_{\mathrm{vocab}}$, where $d_{\mathrm{vocab}}$ is the number of tokens in the vocabulary; each row $\mathrm{Out}_j(x)$ is a discrete probability distribution over $d_{\mathrm{vocab}}$ values that predicts the $(j+1)$th token given the first $j$ tokens. $\mathcal{M}(x)$ is the prediction for the $(s+1)$th-token continuation of $x$ given the entire sequence, i.e. $\mathcal{M}(x) = \mathrm{Out}_s(x)$.
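The per-layer computation in Equations (7)-(9) can be sketched in NumPy. The tiny dimensions and random weights below are illustrative only, and the causal mask is implemented with the standard $-\infty$ trick rather than the elementwise product with $\triangle$ written above:

```python
import numpy as np

rng = np.random.default_rng(0)
s1, d_model, d_head, d_mlp, n_heads = 5, 8, 2, 16, 3  # (s+1) positions

def LN(Z):
    # row-wise L2 normalization, as in the simplified layer-norm above
    return Z / np.linalg.norm(Z, axis=-1, keepdims=True)

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_head(resid, WQ, WK, WV, WO, bQ, bK, bV, bO):
    R = LN(resid)
    Q, K, V = R @ WQ + bQ, R @ WK + bK, R @ WV + bV
    tri = np.tril(np.ones((s1, s1)))  # lower-triangular causal structure
    scores = softmax(np.where(tri > 0, Q @ K.T, -np.inf))
    return scores @ V @ WO + bO       # back to d_model dimensions

def mlp(mresid, Win, b_in, Wout, b_out):
    R = LN(mresid)
    return np.maximum(R @ Win + b_in, 0) @ Wout + b_out  # ReLU MLP

resid = rng.normal(size=(s1, d_model))
mresid = resid.copy()
for _ in range(n_heads):              # Equation (7): add each head's output
    W = [rng.normal(size=sh)
         for sh in [(d_model, d_head)] * 3 + [(d_head, d_model)]]
    b = [np.zeros(d_head)] * 3 + [np.zeros(d_model)]
    mresid = mresid + attn_head(resid, *W, *b)
Win = rng.normal(size=(d_model, d_mlp))
Wout = rng.normal(size=(d_mlp, d_model))
# Equation (8): the MLP reads the mid-layer residual stream
resid_next = mresid + mlp(mresid, Win, np.zeros(d_mlp), Wout, np.zeros(d_model))
print(resid_next.shape)  # residual stream keeps its (s+1, d_model) shape
```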
C.4 KL-divergence loss function

The performance metric $P(\tilde{\mathcal{M}}) = \mathbb{E}_{X \sim D}\, \mathcal{L}(\tilde{\mathcal{M}}(X), \mathcal{M}(X))$ selected in the paper frames performance in terms of proximity to the original model's predictions, and thus the corresponding ablation loss gap $\Delta(\mathcal{M}, A)$ measures the importance of component $A$ for the model to arrive at predictions close to its original ones. A common alternative, $\tilde{P}(\tilde{\mathcal{M}}) = \mathbb{E}_{(X,Y) \sim D}\, \mathcal{L}(\tilde{\mathcal{M}}(X), Y)$, frames performance in terms of proximity to the true labels, so the corresponding ablation loss gap $\tilde{\Delta}(\mathcal{M}, A)$ measures the importance of component $A$ for the model to perform a subtask at a level comparable to the original model. As an example, consider a model $\mathcal{M}$ that computes an approximately optimal solution $\tilde{\mathcal{M}}(X)$ and then adds noise in a way that changes its predictions but neither improves nor worsens $\mathbb{E}_{X,Y}\, \mathcal{L}(\mathcal{M}(X), Y)$. Presenting $\tilde{\mathcal{M}}$ alone is a satisfactory interpretation of the behavior of $\mathcal{M}$ under $\tilde{P}$ but not under $P$. A major advantage of $P$ is that it is much more sample-efficient to evaluate for language tasks, especially if the label distribution has high entropy. Let $(X, Y)$ denote a random input-label pair.
Recall that a language model is trained to minimize
$$\mathbb{E}_{X,Y}\, \mathcal{L}(\mathcal{M}(X), Y) = \mathbb{E}_{X,Y}\left[-\log \mathbb{P}(\mathcal{M}(X) = Y)\right] = c + \mathbb{E}_X\, D_{\mathrm{KL}}(\rho(X) \,\|\, \mathcal{M}(X)),$$
where $c$ is a constant and $\rho(X)$ represents the true probability distribution of $Y \mid X$. For each $X$, we are unable to observe $\rho(X)$; in fact, we are usually only able to obtain a single sample from $Y \sim \rho(X)$. On the other hand, $\mathcal{M}(X)$ may be a sufficient estimate for $\rho(X)$, and it provides many more bits of information about $\rho$ than the single sample $Y \sim \rho$. Even if our desired performance metric were $\tilde{P}$, rather than estimating $\mathbb{E}_X[\mathbb{E}_Y[\mathcal{L}(\tilde{\mathcal{M}}(X), Y) \mid X]]$ from individual samples $(X, Y)$, it may often be more sample-efficient to approximate $\mathbb{E}_Y[\mathcal{L}(\tilde{\mathcal{M}}(X), Y) \mid X]$ analytically for a particular $X$ by assuming that the full model well-approximates the true distribution $\rho$, i.e. that $\mathcal{M}(X) \approx \rho(X)$ (in the sense that $D_{\mathrm{KL}}(\rho(X) \,\|\, \mathcal{M}(X)) \approx 0$), which implies
$$\mathbb{E}_X\, D_{\mathrm{KL}}(\rho(X) \,\|\, \tilde{\mathcal{M}}(X)) \approx \mathbb{E}_X\, D_{\mathrm{KL}}(\mathcal{M}(X) \,\|\, \tilde{\mathcal{M}}(X)), \qquad (10)$$
and so we still evaluate $\tilde{\mathcal{M}}$ using $P$ as the performance metric.
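The sample-efficiency point can be illustrated numerically: with a high-entropy label distribution, single sampled labels give a noisy estimate of the conditional loss, whereas a full distribution playing the role of $\rho(X)$ yields it in closed form. All distributions below are synthetic placeholders, not model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000                               # vocabulary size (illustrative)
rho = rng.dirichlet(np.ones(V))        # stand-in for the true distribution rho(X)
m_tilde = rng.dirichlet(np.ones(V))    # stand-in for the ablated model's prediction

# analytic conditional cross-entropy E_Y[-log P(M~(X) = Y) | X] under rho
exact = -(rho * np.log(m_tilde)).sum()

# Monte-Carlo estimates of the same quantity from n single-label samples
for n in [10, 1000]:
    ys = rng.choice(V, size=n, p=rho)
    est = -np.log(m_tilde[ys]).mean()
    print(n, round(abs(est - exact), 3))  # the error shrinks only as 1/sqrt(n)
```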
In practice, Equation (10) may be an unreasonable assumption, and the two criteria may yield very different interpretability results. We cannot estimate $\mathbb{E}_{X,Y}\, \mathcal{L}(\mathcal{M}(X), Y)$'s distance from the optimum because it is impossible to obtain an estimate of the ground-truth entropy of next-token prediction: for long sequences, we typically never observe the same sequence more than once. However, we can deduce a lower bound $\mathbb{E}_{X,Y}\, \mathcal{L}(\mathcal{M}(X), Y) \ge 1$ for a model like GPT-2, because larger models reduce cross-entropy by more than this amount compared to GPT-2. Note that a better approximation of $\mathbb{E}_{X,Y}\, \mathcal{L}(\tilde{\mathcal{M}}(X), Y)$ could be obtained from the probability distribution of a larger language model $\mathcal{M}^*$, and future work may wish to explore this direction. However, there are several other reasons to prefer $P$ over $\tilde{P}$ with labels from a larger model. First, the use of KL-divergence to the original model $\mathcal{M}$ is consistent with previous work. Second, in the real-world scenario of performing interpretability on the largest frontier model, we will not have access to a better $\mathcal{M}^*$. Most importantly, one concern with circuit discovery for subtasks $(X, Y) \sim D$ is that it may be possible to adversarially select $\tilde{\mathcal{M}}$ such that
$$\mathbb{E}_{(X,Y) \sim D}\, \mathcal{L}(\tilde{\mathcal{M}}(X), Y) \le \mathbb{E}_{(X,Y) \sim D}\, \mathcal{L}(\mathcal{M}(X), Y), \qquad (11)$$
which can occur if $\mathcal{M}$ sacrifices some performance on $(X, Y) \sim D$ for better performance on other regions of the input distribution.
Selecting only the components of $\mathcal{M}$ that maximize performance on $D$ may therefore ignore important mechanisms that shape its predictions on $D$ as a result of this tradeoff. On the other hand, evaluating circuits with $P$ allows such mitigating mechanisms to be included in the selected circuit, since we must select $\tilde{\mathcal{M}}$ in a way that imitates the behavior of $\mathcal{M}$ itself on the subtask $D$. Using this metric, a subnetwork can never achieve lower loss than $\mathcal{M}$, since $\mathcal{L}(\tilde{\mathcal{M}}(X), \mathcal{M}(X)) \ge \mathcal{L}(\mathcal{M}(X), \mathcal{M}(X))$.

Appendix D Commentary

D.1 Understanding the difference between deletion and treating $x$ like $x'$

Colloquially, "deletion" means the model has lost the information it would use to distinguish between inputs $x$ and $x'$. One might expect that if the model were able to rationally handle this lack of information, it would produce an output that hedges between the labels corresponding to inputs $x$ and $x'$. On the other hand, subclass 2 of "spoofing" means the model was given information in component $A$ that is compatible with $x'$ and not $x$, leading the model to output something close to what it would have produced on input $x'$. To illustrate the difference between deletion and insertion, consider the following example. Assume a classifier $\mathcal{M}$ has two possible labels and two possible inputs, $x$ and $x'$, and the model depends entirely on component $A$ to determine the correct label. Let $\mathcal{M}$ output a probability vector, and suppose $\mathcal{L}$ is KL-divergence. Let $\mathcal{M}(x) = (1, 0)$ and $\mathcal{M}(x') = (0, 1)$.
If we remove the information given by $A$, we should expect the model to output $(0.5, 0.5)$, giving $\mathcal{L} = \log 2$; but if we instead intervene by inserting $A(x')$ into an inference pass on $x$ or vice versa, then the model places probability 1 on the incorrect label, and the loss is infinite, since the model assesses that the input is $x'$ when the true input is $x$.

D.2 OA as an extension of mean ablation for nonlinear functions

Let $A(X)$ be a vector-valued model component. As noted in Lundberg and Lee (2017), one motivation for mean ablation is that $\mathbb{E}[A(X)]$ is, under certain assumptions, a reasonable point estimate for $A(X)$. For instance, if the relevant loss function is the squared distance between our point estimate $a$ and the realized value of $A(X)$, then $\mathbb{E}[A(X)] = \arg\min_a \mathbb{E}_X \|a - A(X)\|_2^2$. Indeed, the mean is also the best point estimate of $A(X)$ if the relevant loss is the squared distance between $\mathcal{M}_A(X, a)$ and $\mathcal{M}(X) = \mathcal{M}_A(X, A(X))$ and the model $\mathcal{M}$ is linear in $A(X)$:
$$\mathbb{E}[A(X)] = \arg\min_a \mathbb{E}_X \|\mathcal{M}_A(X, a) - \mathcal{M}_A(X, A(X))\|_2^2 \qquad (12)$$
if $\mathcal{M}_A(X, a) = M(X)\, a + b(X)$ for a random matrix $M(X) \perp\!\!\!\perp A(X)$ and random bias $b(X)$.
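The contrast between the linear case and a nonlinear downstream map can be checked numerically with toy stand-ins (the exponential component distribution, the downstream maps, and the grid search over $a$ are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.exponential(size=10_000)   # sampled component values A(X), mean ~1

def loss(model, a):
    # empirical version of E_X || M_A(X, a) - M_A(X, A(X)) ||^2
    return np.mean((model(a) - model(A)) ** 2)

grid = np.linspace(0, 5, 501)

linear = lambda v: 3.0 * v + 1.0          # downstream map linear in A(X)
a_lin = grid[np.argmin([loss(linear, a) for a in grid])]

nonlinear = lambda v: np.log1p(5.0 * v)   # concave downstream map
a_non = grid[np.argmin([loss(nonlinear, a) for a in grid])]

# a_lin recovers the mean of A(X); a_non is pulled away from it
print(round(A.mean(), 2), round(a_lin, 2), round(a_non, 2))
```

Under the linear map the best constant coincides with $\mathbb{E}[A(X)]$, as Equation (12) states; under the concave map the minimizer moves substantially away from the mean.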
Thus, for model components $A(X)$ for which the downstream computation is roughly linear, $\mathbb{E}[A(X)]$ could be a reasonable point estimate, justifying mean ablation. This presumption of linearity also shows up in other interpretability work, including Hernandez et al. (2024), which uses a linear map to approximate the decoding of subject-attribute relations, and Belrose et al. (2023b), which considers erasing a concept $Z$ from a model's latent space in a "minimal" sense by transforming activations $A(X)$ with a map $g_Z$ that makes $g_Z(A(X))$ uncorrelated with $Z$ while minimizing the expected squared distance between $g_Z(A(X))$ and the original activations $A(X)$. However, in most settings, $\mathcal{M}_A(X, a)$ is highly nonlinear in $a$, and the mean $\mathbb{E}[A(X)]$ could be an arbitrarily poor point estimate for $A(X)$. Optimal ablation generalizes the idea of selecting the "best point estimate" for $A(X)$, as measured by replacing $A(X)$ with $a$ and evaluating model loss. In particular, the optimal ablation constant $a^*$ generalizes the property given in Equation (12) to arbitrary models $\mathcal{M}$ and loss functions $\mathcal{L}$:
$$a^* = \arg\min_a \mathbb{E}_X\, \mathcal{L}\big(\mathcal{M}_A(X, a), \mathcal{M}_A(X, A(X))\big).$$
(13)

D.3 Generalizing OA to constrained-form estimates of $A(X)$

Measuring $\Delta_{\mathrm{opt}}(\mathcal{M}, A)$ on a subtask $(X, Y) \sim D$ is, in a sense, a testing procedure for the hypothesis that "$A$ does not provide relevant information for model performance on subtask $D$." Verifying that $\Delta_{\mathrm{opt}} \approx 0$ validates this hypothesis, since a point estimate of $A(X)$ then performs as well as the realized value of $A(X)$ for the purpose of model inference. Optimal ablation can be generalized to test interpretability hypotheses beyond assertions that a computed quantity $A(X)$ is unimportant. In particular, we can test hypotheses about which specific properties of $A(X)$ are important. Suppose $A(X)$ is vector-valued, and consider the hypothesis "the only relevant information in $A(X)$ is stored in subspace $W$." We can test this hypothesis by replacing $A(X)$ with $P_W A(X) + a^*$, where $a^*$ is an optimal constant that lies in $W^\perp$ and $P_W$ is the projection matrix onto subspace $W$. While this example is simple and illustrates some of the flexibility of OA, it does not expand the space of what OA can express, in the sense that we could have simply considered $P_W A(X)$ and $P_{W^\perp} A(X)$ as separate vertices in the graph and used OA (or any other ablation method) on only the latter. However, a real gain in expressiveness materializes from generalizing null point estimates to estimates with constrained form.
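The subspace test described above, replacing $A(X)$ with $P_W A(X) + a^*$, can be sketched as follows (the basis, activations, and choice of constant are placeholders; a real $a^*$ would be optimized against model loss rather than set to a mean):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
W = Q[:, :k]                      # orthonormal basis for subspace W
P_W = W @ W.T                     # projection matrix onto W

A_X = rng.normal(size=(100, d))   # sampled activations A(X)

# constant in W-perp; here the mean of the discarded component stands in
# for the optimal a*, which would be trained to minimize ablation loss
a_star = (A_X @ (np.eye(d) - P_W)).mean(axis=0)

patched = A_X @ P_W + a_star      # P_W A(X) + a*, fed onward into the model

# the patched activations agree with the originals inside W...
assert np.allclose(patched @ W, A_X @ W)
# ...and are constant across inputs in W-perp
resid = patched @ (np.eye(d) - P_W)
assert np.allclose(resid, resid[0])
print("subspace-constrained patch OK")
```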
For example, consider the subspace hypothesis "every $A(X)$ can be adequately represented in subspace $W$." We can test this hypothesis by training an optimal activation $a^*(X)$ for each $X$ that lies in subspace $W$. Though activation training is expensive, we can train a function $a^*(X)$ that maps $X$ to values in $W$, and then estimate the error of this function by performing activation training on a few samples of $X$. Similarly, we can generalize $a^*$ to multiple point estimates to test the claim that $A(X)$ is the outcome of an internal classification problem, i.e. that the relevant information provided by $A(X)$ is the classification of $X$ among a few input classes. We can train optimal point estimates $(a_1^*, \ldots, a_k^*)$ such that
$$(a_1^*, \ldots, a_k^*) = \arg\min_{(a_1, \ldots, a_k)} \mathbb{E}_X \min_{j \in \{1, \ldots, k\}} \mathcal{L}\big(\mathcal{M}_A(X, a_j), Y\big),$$
calling the outer minimized quantity $\Delta_{k\text{-}\mathrm{opt}}$. If $\Delta_{k\text{-}\mathrm{opt}} \approx 0$, then every $A(X)$ can be represented by one of a small number of prototype quantities.

Appendix E Single-component loss on IOI

E.1 Transformer graph representation

We use a graph representation in which each vertex corresponds to an attention head ($\mathrm{Attn}^{(i,k)}(x)$), an MLP block ($\mathrm{MLP}^{(i)}(x)$), the model input ($\mathrm{Resid}^{(0)}(x)$), or the model output ($\mathrm{Out}(x)$).
We also allow vertices representing the $\mathrm{Resid}^{(i)}(x)$ and $\mathrm{MResid}^{(i)}(x)$ computations. Appendix C.3 defines the computation of each vertex. However, we slightly modify the definition of attention head vertices $\mathrm{Attn}^{(i,k)}$ to save memory and to ensure that the ablation constants $a^*$ for OA lie in the column space of attention head outputs. Recall from Equation (9) that attention heads produce output in a $d_{\mathrm{head}}$-dimensional vector space, which is then mapped linearly to $d_{\mathrm{model}}$-dimensional space by the weight matrix $W^{(i,k)}_O$:
$$\mathrm{Attn}^{(i,k)}(x) = \mathrm{AttnScore}^{(i,k)}(x)\big(R W^{(i,k)}_V + b^{(i,k)}_V\big) W^{(i,k)}_O + b^{(i,k)}_O$$
Thus, while $\mathrm{Attn}^{(i,k)}(x)$ is $d_{\mathrm{model}}$-dimensional, its distribution lies within a $d_{\mathrm{head}}$-dimensional subspace of the residual stream. If we used vertices $\mathrm{Attn}^{(i,k)}$, then our $d_{\mathrm{model}}$-dimensional $a^*$ for attention head $(i, k)$ could sometimes contribute to subspaces that attention head $(i, k)$ can never write to.
Instead, for an attention head vertex, we represent its output computation in $d_{\mathrm{head}}$-dimensional space:
$$\mathrm{ZAttn}^{(i,k)}(x) = \mathrm{AttnScore}^{(i,k)}(x)\big(R W^{(i,k)}_V + b^{(i,k)}_V\big)$$
and we ablate an attention head by replacing $\mathrm{ZAttn}^{(i,k)}(x)$ rather than $\mathrm{Attn}^{(i,k)}(x)$. This slight modification reduces our parameter count by a factor of $d_{\mathrm{model}} / d_{\mathrm{head}}$ when applying OA but does not affect the results for the other ablation types. We measure the single-component ablation loss gap $\Delta$ for the $\mathrm{ZAttn}^{(i,k)}$ and $\mathrm{MLP}^{(i)}$ vertices, $(n_{\mathrm{heads}} + 1) \cdot n_{\mathrm{layers}} = 156$ vertices in total for GPT-2.

E.2 Ablation details

We consider zero ablation, mean ablation, optimal ablation, counterfactual mean ablation, resample ablation, and counterfactual ablation. For a token position $j$, let $[A(X)]_j$ denote the representation of $A(X)$ at token position $j$.

Zero ablation: To zero-ablate $A$, we replace $[A(X)]_j$ with $0$ at each sequence position $j \ge 1$. We do not replace $[A(X)]_0$ because it is a constant that does not depend on $X$ (any result at token position 0 can only be a function of $\mathrm{Resid}^{(0)}_0$, which represents a padding token that is the same for every sequence).
Transformers may read from this beginning-of-string (BOS) token position in attention heads when no token in the sequence carries a particularly strong signal; since this token position does not distinguish between any inputs $X$ and is more appropriately viewed as a structural part of the architecture, we choose not to modify it.

Mean ablation: For each vertex $A$, we compute $\mathbb{E}_{(X,Y) \sim D}[A(X)]$ over 20,000 samples, conditional on token position. We let $\mu_j = \mathbb{E}_{(X,Y) \sim D}\big[[A(X)]_j\big]$. To mean-ablate $A$, we replace $[A(X)]_j$ with $\mu_j$ at each sequence position $j$. In the Greater-Than dataset, all prompts $X$ are the same length, but in the IOI dataset, some prompts $X$ are longer than others, reducing our sample size for later sequence positions. In particular, if $X^*$ is the longest prompt in the dataset, with $\ell$ tokens, then $\mu_\ell = [A(X^*)]_\ell$, so the mean value actually carries identifying information about the prompt. Since we want the mean value $\mu_j$ to be uninformative about the original prompt $X$, we instead let $m$ be the minimum length of any prompt, compute a modified mean $\mu_m = \mathbb{E}_{(X,Y) \sim D,\, S \sim \mathrm{Unif}\{1, \ldots, \ell\}}\big[[A(X)]_S \,\big|\, S \ge m\big]$ that pools all values of $A(X)$ at token positions from $m$ onward, and replace $[A(X)]_j$ with $\mu_m$ if $j \ge m$.
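The position-conditional means with pooling past the minimum prompt length $m$ can be sketched as follows (activations and prompt lengths are synthetic, and the special handling of position 0 is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
lengths = rng.integers(6, 12, size=200)            # variable prompt lengths
acts = [rng.normal(size=(L, d)) for L in lengths]  # [A(X)]_j for each prompt

m = int(lengths.min())                             # minimum prompt length

# per-position means mu_j for j < m (every prompt contributes at these j)
mu = np.stack([np.mean([a[j] for a in acts], axis=0) for j in range(m)])

# pooled mean mu_m over all positions j >= m across all prompts
tail = np.concatenate([a[m:] for a in acts if len(a) > m])
mu_m = tail.mean(axis=0)

def mean_ablate(a):
    """Replace [A(X)]_j with mu_j for j < m and with mu_m for j >= m."""
    out = np.empty_like(a)
    out[:m] = mu
    out[m:] = mu_m
    return out

print(mean_ablate(acts[0]).shape)  # same shape as the original activations
```

Pooling the tail positions into a single constant keeps the replacement uninformative about which (long) prompt produced the activation.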
Optimal ablation: Similar to mean ablation, we optimally ablate $A$ by replacing $[A(X)]_j$ with a constant $\hat{a}_j$ for each sequence position $j < m$ and with a constant $\hat{a}_m$ for each $j \ge m$, where $m$ is the minimum length of any prompt. We initialize $(\hat{a}_0, \hat{a}_1, \dots, \hat{a}_m) = (\mu_0, \mu_1, \dots, \mu_m)$ as defined for mean ablation and then optimize $(\hat{a}_1, \dots, \hat{a}_m)$ for each ablated component $A$ to minimize $\Delta(\mathcal{M}, A)$. Note that, as with zero ablation, we fix $\hat{a}_0 = \mu_0 = [A(X)]_0$ and do not optimize its value as an ablation constant, because $[A(X)]_0$ does not depend on $X$ and thus naturally conveys no information about the input.

Counterfactual mean ablation: Our implementation is the same as for mean ablation except that we compute means over $(X,Y)\sim\mathcal{D}'$ for the counterfactual distribution $\mathcal{D}'$, and $m$ is taken as the minimum prompt length in the counterfactual distribution.

Resample ablation: To perform modified inference on an input $X$, we first sample an independent copy $X' \perp X$. Let $X$ and $X'$ have lengths $s$ and $s'$ respectively.
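A sketch of the optimization step for the OA constant. The paper does not specify the optimizer here; plain finite-difference gradient descent stands in for backpropagation through the ablated model, and `ablated_loss` is a hypothetical callback:

```python
import numpy as np

def fit_optimal_constant(ablated_loss, a_init, lr=0.1, steps=500, eps=1e-4):
    """Learn the OA constant a-hat that minimizes the expected loss of the
    ablated model. `ablated_loss(a)` should run inference with the
    component's output replaced by `a` and return the mean loss; here,
    central finite differences stand in for true backprop."""
    a = np.asarray(a_init, dtype=float).copy()
    for _ in range(steps):
        g = np.zeros_like(a)
        for i in range(a.size):
            e = np.zeros_like(a)
            e.flat[i] = eps
            g.flat[i] = (ablated_loss(a + e) - ablated_loss(a - e)) / (2 * eps)
        a -= lr * g
    return a

# Toy check: with a quadratic stand-in "loss", the learned constant recovers
# the minimizer, mirroring how OA initializes at the mean and descends.
target = np.array([1.0, 2.0])
a_opt = fit_optimal_constant(lambda a: float(((a - target) ** 2).sum()),
                             a_init=np.zeros(2))
```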
If $s \le s'$, for an ablated component $A$, we replace $[A(X)]_j$ with $[A(X')]_j$ at each token position $j \in \{1, \dots, s\}$ (in other words, we only resample from the first $s$ tokens of $X'$). If $s > s'$, then we left-pad $X'$ with an additional $s - s'$ tokens to form a modified token sequence $\tilde{X}'$ that is the same length as $X$. We then replace ablated component values $A(X)$ with $A(\tilde{X}')$ at each sequence position. Before arriving at this implementation, we tried other choices, like resampling from the last $s$ tokens of $X'$ in the case that $s \le s'$, or right-padding $X'$ in the case that $s > s'$.

Counterfactual ablation: We choose a function $\pi$ (details discussed in the main text and further analyzed in Appendix F.3) that maps inputs $X$ to neutral counterfactual inputs $X'$. Typically, $X$ and $X'$ are the same length and have many tokens in common. For ablated components, we replace $[A(X)]_j$ with $[A(\pi(X))]_j$ at each token position.

E.3 Full results

Figure 5: Correlation of single-component ablation loss measurements on IOI. Lower triangle shows rank correlation and upper triangle shows log-log correlation across metrics.

Figure 5 plots the pairwise correlations of single-component ablation loss evaluated on the IOI dataset with a variety of ablation methods. Table 2 is an extended version of Table 1 in the main paper that provides a summary of these results.
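The length-matching logic above can be sketched as a pure index map (the function is illustrative; `None` marks positions covered by the left-padding of $X'$):

```python
def resample_index(j, s, s_prime):
    """Which position of X' supplies the activation that replaces [A(X)]_j.

    s: length of the ablated input X; s_prime: length of the sampled X'.
    If s <= s', resample from the first s tokens of X'; otherwise X' is
    left-padded by s - s' tokens, and positions of X falling in the
    padding return None."""
    if s <= s_prime:
        return j
    pad = s - s_prime
    return None if j < pad else j - pad
```

For instance, with $s = 5$ and $s' = 3$, the first two positions of $X$ line up with padding, and position 3 of $X$ maps to position 1 of $X'$.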
Table 2: Comparison of ablation loss gap $\Delta$ on IOI, extended

                                   Zero    Mean    Resample  CF-Mean  Optimal  CF
    Log-log correlation with CF    0.626   0.831   0.826     0.847    0.908    1
    Rank correlation with CF       0.590   0.825   0.828     0.833    0.907    1
    Mean Δ                         0.0584  0.0405  0.0559    0.0412   0.0035   0.0296
    Median ratio of Δ_opt to Δ     11.1%   33.0%   17.7%     31.7%    100%     88.9%

Appendix F Circuit discovery

F.1 Transformer graph representation

We use the residual rewrite graph representation favored by Wang et al. (2022), Goldowsky-Dill et al. (2023), and Conmy et al. (2023). Similarly to Appendix E.1, we define vertices that correspond to an attention head, an MLP block, the model input ($\mathrm{Resid}^{(0)}(x)$), or the model output ($\mathrm{Out}(x)$), but we eliminate the $\mathrm{Resid}^{(i)}(x)$ and $\mathrm{MResid}^{(i)}(x)$ vertices. We have $n_{\text{layers}}(n_{\text{heads}}+1) + 2$ vertices in total (158 for GPT-2). Notice from Appendix C.3 that
$$\mathrm{Resid}^{(\ell)}(x) = \mathrm{Resid}^{(0)}(x) + \sum_{i=1}^{\ell}\Bigl(\mathrm{MLP}^{(i)}(x) + \sum_{k=1}^{n_{\text{heads}}} \mathrm{Attn}^{(i,k)}(x)\Bigr)$$
so rather than assuming that attention heads, MLP blocks, and the model output take $\mathrm{Resid}^{(i)}(x)$ as input, we can assume that they take the output of each previous block as a separate input to the computation.
In particular, we can write
$$\mathrm{MLP}^{(i)}(x) = \mathrm{MLP}^{(i)}\bigl(\mathrm{Resid}^{(0)}(x),\ \mathrm{MLP}^{(1)}(x), \dots, \mathrm{MLP}^{(i-1)}(x),\ \mathrm{Attn}^{(1,1)}(x), \dots, \mathrm{Attn}^{(i,n_{\text{heads}})}(x)\bigr) \quad (14\text{–}15)$$
in which the $\mathrm{MLP}^{(i)}$ vertex has $i(n_{\text{heads}}+1)+1$ incoming edges from previous vertices. Similarly, we can write
$$\mathrm{Out}(x) = \mathrm{Out}\bigl(\mathrm{Resid}^{(0)}(x),\ \mathrm{MLP}^{(1)}(x), \dots, \mathrm{MLP}^{(n_{\text{layers}})}(x),\ \mathrm{Attn}^{(1,1)}(x), \dots, \mathrm{Attn}^{(n_{\text{layers}}, n_{\text{heads}})}(x)\bigr) \quad (16\text{–}17)$$
so the $\mathrm{Out}$ vertex has $n_{\text{layers}}(n_{\text{heads}}+1)+1$ incoming edges, one from each previous vertex in the graph. Finally, notice that attention heads $\mathrm{Attn}^{(i,k)}$ take $\mathrm{Resid}^{(i-1)}$ as input in three different locations, once in each of the query, key, and value subcircuits, so we can write attention heads as taking three copies of each previous vertex's output, which can be ablated individually.
$$\begin{aligned}
\mathrm{Attn}^{(i,k)}(x) = \mathrm{Attn}^{(i,k)}\bigl(&\mathrm{Resid}^{(0),Q}(x),\ \mathrm{MLP}^{(1),Q}(x), \dots, \mathrm{MLP}^{(i-1),Q}(x),\\
&\mathrm{Attn}^{(1,1),Q}(x), \dots, \mathrm{Attn}^{(i-1,n_{\text{heads}}),Q}(x),\\
&\mathrm{Resid}^{(0),K}(x),\ \mathrm{MLP}^{(1),K}(x), \dots, \mathrm{MLP}^{(i-1),K}(x),\\
&\mathrm{Attn}^{(1,1),K}(x), \dots, \mathrm{Attn}^{(i-1,n_{\text{heads}}),K}(x),\\
&\mathrm{Resid}^{(0),V}(x),\ \mathrm{MLP}^{(1),V}(x), \dots, \mathrm{MLP}^{(i-1),V}(x),\\
&\mathrm{Attn}^{(1,1),V}(x), \dots, \mathrm{Attn}^{(i-1,n_{\text{heads}}),V}(x)\bigr)
\end{aligned} \quad (18\text{–}23)$$
This notation indicates that attention heads admit multiple incoming edges for each previous vertex, which is somewhat non-standard. Alternatively, rather than allowing multiple edges between pairs of vertices, Conmy et al. (2023) create a separate vertex for each of the query, key, and value subcircuits and consider each attention head output to take the outputs of these three circuits as input. However, edges between the subcircuits and attention head outputs are essentially placeholder edges that cannot be independently removed, since removing them is informationally equivalent to ablating the entire attention head.
Thus, our graph representation is more natural and provides a more realistic edge count when considering removing model components. Furthermore, we continue with the adjustment from Appendix E.1 of using $\mathrm{ZAttn}^{(i,k)}$ as computational vertices rather than $\mathrm{Attn}^{(i,k)}$ to conserve memory and reduce the parameter count of OA. We consider the linear map $\phi^{(i,k)}(Z) = Z W_O^{(i,k)} + b_O^{(i,k)}$ (so that $\mathrm{Attn}^{(i,k)}(x) = \phi^{(i,k)}(\mathrm{ZAttn}^{(i,k)}(x))$) and express all downstream vertices as taking $\mathrm{ZAttn}^{(i,k)}(x)$ as input rather than $\mathrm{Attn}^{(i,k)}(x)$, performing their computation by pre-composing with $\phi^{(i,k)}$.
For example, for an MLP vertex $\mathrm{MLP}^{(i)}$, if $m^{(i)}$ represents how $\mathrm{MLP}^{(i)}(x)$ is computed using $\mathrm{Attn}^{(i,k)}(x)$ values as inputs, then its computation taking $\mathrm{ZAttn}^{(i,k)}(x)$ as inputs is equal to
$$\mathrm{MLP}^{(i)}(x) = m^{(i)}\bigl(\mathrm{Resid}^{(0)}(x),\ \mathrm{MLP}^{(1)}(x), \dots, \mathrm{MLP}^{(i-1)}(x),\ \phi^{(1,1)}(\mathrm{ZAttn}^{(1,1)}(x)), \dots, \phi^{(i,n_{\text{heads}})}(\mathrm{ZAttn}^{(i,n_{\text{heads}})}(x))\bigr)$$
We replace all $\mathrm{Attn}^{(i,k)}$ vertices in the graph structure with the corresponding $\mathrm{ZAttn}^{(i,k)}$ vertex. In total, we have $n_{\text{heads}} \cdot n_{\text{layers}}$ $\mathrm{ZAttn}^{(i,k)}$ vertices, $n_{\text{layers}}$ $\mathrm{MLP}^{(i)}$ vertices, an input vertex ($\mathrm{Resid}^{(0)}$), and an output vertex ($\mathrm{Out}$). Let $V$ denote the set of vertices consisting of the input vertex and the $\mathrm{MLP}^{(i)}$ vertices.
There are $3 \cdot \frac{1}{2} n_{\text{layers}}(n_{\text{layers}}-1) \cdot n_{\text{heads}}^2$ edges between two attention heads, $2 \cdot (n_{\text{layers}}+1) \cdot n_{\text{layers}} \cdot n_{\text{heads}}$ edges between an attention head and a vertex in $V$, $\frac{1}{2} n_{\text{layers}}(n_{\text{layers}}+1)$ edges between two vertices in $V$, and $n_{\text{layers}}(n_{\text{heads}}+1)+1$ edges from any vertex to the output. For GPT-2, there are 28,512 edges between two attention heads, 3,744 edges between an attention head and a vertex in $V$, 78 edges between two vertices in $V$, and 157 edges from any vertex to the output, for a total of 32,491 edges.

F.2 Ablation details

For mean, resample, and counterfactual ablation, our implementation is the same as in Appendix E.2. For optimal ablation, we make an adjustment to the implementation to remove dependence on token positions and further reduce the parameter count. For each ablated component $A$, rather than training a different $a^*_j$ to replace $[A(X)]_j$ for each token position $j$, we train a single optimal constant $a^*$ that is the same shape as any particular $[A(X)]_j$. We initialize $a^*$ to $\mathbb{E}[[A(X)]_j \mid j > 9]$, the subtask mean excluding early token positions, since early positional embeddings may have idiosyncratic effects.
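The GPT-2 edge counts can be reproduced directly from the closed-form expressions (assuming GPT-2-small with 12 layers and 12 heads per layer):

```python
# Reproduce the GPT-2 edge counts from the closed-form expressions.
n_layers, n_heads = 12, 12

attn_attn = 3 * (n_layers * (n_layers - 1) // 2) * n_heads ** 2  # Q/K/V copies
attn_v = 2 * (n_layers + 1) * n_layers * n_heads                 # head <-> V
v_v = n_layers * (n_layers + 1) // 2                             # within V
to_out = n_layers * (n_heads + 1) + 1                            # into Out

total = attn_attn + attn_v + v_v + to_out
print(attn_attn, attn_v, v_v, to_out, total)  # 28512 3744 78 157 32491
```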
To ablate $A$ during inference on $X$, for all token positions $j > 0$, we replace $[A(X)]_j$ with $a^*$. As in Appendix E.2, we do not replace $[A(X)]_0$ because this value is a constant that does not depend on $X$. We take this conservative approach to demonstrate that OA can be implemented in a sequence-position-agnostic manner while still outperforming sequence-position-specific implementations of other ablation methods by a large margin. Several other ablation methods, like resample and counterfactual ablation, are inherently sequence-position specific, and we believe that compatibility with a sequence-position-agnostic implementation is a crucial advantage of OA over these methods. Sequence-position-agnostic ablation will be necessary for interpretability studies to move beyond synthetic data and analyze training-distribution samples in a systematic manner; we continue this discussion in the next section, Appendix F.3.

As noted in Section 3, we train a single constant $a^*_j$ for each vertex $A_j$. An alternative implementation is training a separate constant for each ablated edge. In theory, this approach is consistent with the spirit of OA: though ablated edges would transmit different values to downstream components that would normally receive the same value, none of the transmitted constants convey any information about the input. However, this approach would greatly increase the number of learnable parameters, and arguably may actually increase the computational capacity of the model. Additionally, training a single constant for each vertex has the appealing property that ablating all of the out-edges from a vertex is equivalent to ablating that vertex.
F.3 Normative comparison of ablation types

Thus far, circuit discovery on language models has focused on synthetic subtasks for which a mapping $\pi$ from studied inputs $x$ to counterfactual inputs $\pi(x)$ is easily constructed. Recall that a crucial criterion for selecting $\pi$ is to preserve as many tokens as possible between $x$ and $\pi(x)$. For Greater-Than, an example counterfactual pair is

    Token:      S                    Y1  Y2                Y1*
    x:     The [conflict] began in [18][89] and ended in [18]
    π(x):  The [conflict] began in [18][01] and ended in [18]

where the brackets [] are added to emphasize the two-token representation of the year. Similarly, for IOI, an example counterfactual pair is

    Token:          S1              IO                                 S2
    x:     Friends [Alice]   and   [Bob]   found a bone at the store. [Alice]   gave the bone to
    π(x):  Friends [Charlie] and   [David] found a bone at the store. [Charlie] gave the bone to

Since $x$ and $\pi(x)$ share the same token at all but a few positions (only the S1, S2, and IO positions for IOI and only the Y2 position for Greater-Than), we are able to isolate the effect of changing the specific token that conveys the information necessary for the subtask. CF thus allows us to study subtasks involving input-label pairs where the relevant information is given by only one or several tokens. However, replacing $A(x)$ with $A(x')$ typically incurs significantly higher loss if $x$ and $x'$ differ at many token positions, even if most tokens are unimportant in relation to the behavior we wish to study.
The model representation at a token position $j$ is likely to contain information specific to the tokens at position $j$ and at surrounding positions in the input $X$, so replacing $[A(x)]_j$ with $[A(x')]_j$ is likely to inject inconsistent information if $X_j$ and $X'_j$ (or pairs of tokens at corresponding proximate positions) are different tokens, causing $\Delta$ to be high as a result of spoofing (per Section 2.3). An illustration is the discrepancy in resample ablation loss between the Greater-Than and IOI subtasks. The Greater-Than dataset contains only a single prompt template, so any sampled $X$ and $X'$ differ only in the tokens that encode the subject and year. Here is an example of a sampled $(X, X')$ pair:

    Token:      S                    Y1  Y2                Y1*
    X:     The [conflict] began in [18][89] and ended in [18]
    X':    The [deal]     began in [15][47] and ended in [15]

On the other hand, the IOI dataset consists of multiple prompt templates that differ in sentence structure, so $X$ and $X'$ may differ at nearly all token positions, not just the S1, IO, and S2 positions shown above. For example:

    X:  Friends Alice and Bob found a bone at the store. Alice gave the bone to
    X': <> <> <> Then, Charlie and David had a long argument, and Charlie said to

where <> represents a padding token added to make the sequences the same length. As a result, resample ablation loss is relatively low for Greater-Than (see Figure 10) but relatively high for IOI (see Figure 8), indicating that token parallelism is an important requirement for CF to work well. While the synthetic IOI and Greater-Than datasets are specifically engineered so that we can modify a prompt $x$ at only a few token positions to obtain a neutral prompt $\pi(x)$, more general language behaviors may not be suited to this type of counterfactual analysis.
Here are a few examples of language subtasks for which it may not be possible to pair up $x$ and $\pi(x)$ with parallel tokens:

• A case study of the effect of modifiers, e.g. adjectives and adverbs, compared to a sentence with no modifier. Consider the following (degenerate) counterfactual pair, inspired by Marks and Tegmark (2023):

    x:  Paris is a city in the country of
    x': Paris is not a city in the country of

Since the presence of a modifier creates an extra token, patching between sequences with and without the modifier would cause the embedding at most positions in $X$ to be replaced with the embedding of the preceding input token ("city" with "a," "in" with "city," and so on).

• A case study comparing sentence order in situations where order matters, like giving directions. Patching in activations from a counterfactual prompt in which the order of two sentences is permuted introduces a new token at many positions.

    x:  Make a left turn, then walk forward one block. Your position is now
    x': Walk forward one block, then make a left turn. Your position is now

• A case study relating to how language models handle mis-tokenization, like processing prompts in which a word is misspelled or the model is required to spell out a word.

    x:  The correct spelling of the word umpire is
    x': The correct [sp][le][ling] of [te][h] word [up][mire] is

Additionally, as the field of interpretability moves forward, we believe that it must progress toward "total" interpretation of models' internal mechanisms. This level of interpretation requires reasoning about subtasks that are much more general than those studied so far and will require performing intervention-based analysis across a broad distribution of inputs.
For example, we may want to claim that certain components of a model $\mathcal{M}$ are unimportant for performing mathematical calculations; or that some components are not involved in ensuring grammatical correctness; or do not assist in making theory-of-mind assessments; etc. Additionally, we likely wish to assess component functions "in the wild" with filtered sampling from the model's training distribution, as opposed to engineering synthetic datasets. Under these circumstances the data will be much less suited to token parallelism between counterfactual prompts, so the adoption of a sequence-position-agnostic ablation method is likely critical. This quality of OA makes it a much better candidate than CF as an ablation method for scaling interpretability.

F.4 Sparsity metric

As stated in Equation (4), we wish to select a circuit $\tilde{E}$ that achieves low loss $\mathbb{E}_X\, \mathcal{L}_X(\mathcal{M}^{\tilde{E}}_{(\text{opt})}(X))$ and is a sparse subset of the model. Let $\mathcal{E}_A$ represent the set of edges connected to vertex $A$ in graph $G$, i.e. $\mathcal{E}_A = \{(A_j, A_i, z) \in E \mid A_j = A \vee A_i = A\}$. The selected circuit $\tilde{E}$ should ideally satisfy two types of sparsity:

1. Edge sparsity: $|\tilde{E}| \ll |E|$. The circuit should contain a small number of edges compared to the total number of edges in the model.

2. Vertex sparsity: $|\{A \mid |\mathcal{E}_A \cap \tilde{E}| > 0\}| \ll |G|$. The circuit should pass through a small number of vertices compared to the total number of vertices in the model.
There is a lack of guidance in prior work about whether smaller structures with more densely packed connections are more interpretable than larger structures with more thinly distributed connections. Indeed, one could argue that the larger structure is in fact easier to understand, since we do not need to dissect as many relationships to consider the function of any particular vertex within the circuit. While circuit discovery aims to localize model behaviors on specific subtasks, we contend that a central challenge in interpretability going forward could be stacking together many circuit analyses to form a sum-of-the-parts analysis of the model's overall structure. Considering circuit discovery as a tool for decomposing model computation into interpretable subtasks, holding the total number of edges equal, we may prefer each circuit to have a smaller number of vertices, to reduce the complexity of interactions between circuits rather than within circuits. As such, we set $\mathcal{R}(\tilde{E})$ to select for circuits with high levels of both edge and vertex sparsity:
$$\mathcal{R}(\tilde{E}) = \lambda |\tilde{E}| + \gamma\lambda \sum_{A \in G} \frac{1}{2}|\mathcal{E}_A| \tanh\Bigl(2\,\frac{|\mathcal{E}_A \cap \tilde{E}|}{|\mathcal{E}_A|}\Bigr) \quad (24)$$
where $\lambda, \gamma$ are constants.
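Equation (24) can be transcribed directly; the sketch below treats edges as undirected vertex pairs for simplicity, and the vertex names in the usage example are illustrative:

```python
import math
from collections import Counter

def sparsity_penalty(circuit, all_edges, lam, gamma):
    """R(E~) = lam*|E~| + gamma*lam*sum_A (|E_A|/2)*tanh(2|E_A ∩ E~|/|E_A|).

    circuit:   edges in the candidate circuit E~, as (vertex, vertex) pairs
    all_edges: every edge in the full graph E
    """
    degree = Counter()      # |E_A|: edges incident to each vertex A
    for u, v in all_edges:
        degree[u] += 1
        degree[v] += 1
    in_circuit = Counter()  # |E_A ∩ E~|: circuit edges incident to A
    for u, v in circuit:
        in_circuit[u] += 1
        in_circuit[v] += 1
    vertex_term = sum(0.5 * degree[a] * math.tanh(2 * in_circuit[a] / degree[a])
                      for a in degree)
    return lam * len(circuit) + gamma * lam * vertex_term
```

On a two-edge toy graph with one edge selected, the penalty decomposes into the edge term plus per-vertex tanh terms, which is easy to verify by hand.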
Similarly, the continuous relaxation $\mathcal{R}(\vec{\theta})$ for HCGS and UGS is derived by replacing $|\tilde{E}|$ with $\sum_{k=1}^{|E|} \theta_k$ and replacing $|\mathcal{E}_A \cap \tilde{E}|$ with $\varepsilon_A := \sum_{e_k \in \mathcal{E}_A} \theta_k$. Note that $\nabla_{\theta_k} \mathcal{R}(\vec{\theta}) = \lambda + \gamma\lambda \sum_{A \in G:\, e_k \in \mathcal{E}_A} \mathrm{sech}^2\bigl(2\,\varepsilon_A / |\mathcal{E}_A|\bigr)$. The first term, $\lambda$, is generally used to control the tradeoff between edge sparsity and circuit loss; a general interpretation is that we should include an edge $e_k \in \tilde{E}$ if its marginal contribution, $\Delta(\mathcal{M}, (E \cup \{e_k\}) \setminus \tilde{E}) - \Delta(\mathcal{M}, E \setminus (\{e_k\} \cup \tilde{E}))$, is greater than $\lambda$, expressing the same tradeoff as the discrete threshold $\lambda$ in ACDC. However, since ACDC is a less fine-grained optimization algorithm than UGS, the $\lambda$ required to achieve the same circuit size $|\tilde{E}|$ tends to be larger for ACDC. The second term expresses vertex sparsity, and its effect is to increase the regularization for edges attached to vertices that have few other edges included in the circuit.
Its effect is small when $\varepsilon_A \approx |\mathcal{E}_A|$, since $\mathrm{sech}^2(2) \approx 0$, so we do not apply additional regularization to edges attached to vertices with a high overall likelihood of being included in the selected circuit. However, its effect is significant when $\varepsilon_A / |\mathcal{E}_A| \approx 0$, pruning the remaining edges from a vertex whose edge probabilities, as represented by $\vec{\theta}$, are low on average. We use $\gamma$ to express the maximum influence of vertex regularization compared to edge regularization (since $\max_x \mathrm{sech}^2(x) = 1$), and generally select $\gamma = 0.5$, so the second term adds at most 50% more regularization.

F.5 Uniform Gradient Sampling: motivation

In circuit discovery, the number of possible circuits $\tilde{E} \subset E$ is exponential in $|E|$, and the circuit losses $\Delta(\mathcal{M}, E \setminus \tilde{E})$ for subsets $\tilde{E}$ are not required to be related. $\Delta$ is not even necessarily monotonic in $\tilde{E}$ for any ablation method considered, i.e. $\tilde{E} \subset \tilde{E}'$ does not imply $\Delta(\mathcal{M}, E \setminus \tilde{E}) \ge \Delta(\mathcal{M}, E \setminus \tilde{E}')$. In reality, we can hope that the optimal ground-truth circuit $\tilde{E}^*$ is clear-cut and $\Delta$ is relatively well-behaved. If so, we can try to relax the discrete optimization problem and find a solution with gradient descent.
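The two regimes of the vertex term's gradient factor can be checked numerically ($\mathrm{sech}$ is not in Python's standard math module, so it is written via $\cosh$):

```python
import math

def sech2(x):
    """sech^2(x) = 1 / cosh(x)^2, the gradient factor of the vertex term."""
    return 1.0 / math.cosh(x) ** 2

# Fully included vertex (eps_A ~= |E_A|): extra regularization nearly vanishes.
print(round(sech2(2.0), 3))  # ~0.07
# Mostly excluded vertex (eps_A ~= 0): extra regularization peaks,
# contributing the full gamma * lambda on top of the edge term.
print(sech2(0.0))            # 1.0
```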
As mentioned in Section 3, one way to produce a continuous relaxation is to consider partial ablation for each edge $e_k = (A_j, A_i, z)$, where we replace $A_j(x)$ as the $z$th input to $A_i(x)$ with $\alpha_k A_j(x) + (1 - \alpha_k)\hat{a}_j$ and use $L_1$ or $L_2$ regularization on the $\alpha_k$. However, this approach is likely to get stuck in local minima in which edge coefficients converge to an optimal magnitude instead of edges being completely ablated or retained. The continuous relaxation we prefer is to consider a vector of independent sampling probabilities for the inclusion of each edge; this way, we never consider $\alpha_k \in (0, 1)$ in our space of possible solutions. We can then perform optimization on the sampling probabilities so that the probability for each edge converges to 0 or 1.
The loss function we want to minimize with respect to the sampling probabilities $\vec{\theta}$ is
$$f(\vec{\theta}) := \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta}}\bigl[\mathcal{L}(\mathcal{M}^{\tilde{E}}(X), \mathcal{M}(X)) + \mathcal{R}(\tilde{E})\bigr] \quad (25)$$
To simplify the notation, we denote $\mathcal{L}_{\mathcal{R}}(X, \tilde{E}) = \mathcal{L}(\mathcal{M}^{\tilde{E}}(X), \mathcal{M}(X)) + \mathcal{R}(\tilde{E})$, so that $f(\vec{\theta}) = \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta}}[\mathcal{L}_{\mathcal{R}}(X, \tilde{E})]$. Now the gradient with respect to the sampling probability $\theta_k$ for each edge is simply a marginal ablation loss gap:
$$\frac{\partial f(\vec{\theta})}{\partial \theta_k} = \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta}}\bigl[\mathcal{L}_{\mathcal{R}}(X, \tilde{E} \cup \{e_k\}) - \mathcal{L}_{\mathcal{R}}(X, \tilde{E} \setminus \{e_k\})\bigr] =: \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta}}\, \Delta_{\mathcal{R}}(\mathcal{M}, e_k, \tilde{E}) \quad (26)$$
The problem is that $|E|$ is large and it is not tractable to estimate this quantity individually for all $k$. Our goal is to find a good sample estimator for this quantity simultaneously for all $k$.
One way to perform this simultaneous estimation is importance sampling, where we write
$$f(\vec{\theta}) = \mathbb{E}_{X,\,\tilde{E}\sim\vec{p}}\Bigl[\frac{\mathbb{P}_{E'\sim\vec{\theta}}(\tilde{E} = E')}{\mathbb{P}_{E'\sim\vec{p}}(\tilde{E} = E')} \cdot \mathcal{L}_{\mathcal{R}}(X, \tilde{E})\Bigr]$$
so that, when $\vec{p} = \vec{\theta}$,
$$\frac{\partial f(\vec{\theta})}{\partial \theta_k} = \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta}}\Bigl[\frac{\mathbb{1}(e_k \in \tilde{E})}{\theta_k}\,\mathcal{L}_{\mathcal{R}}(X, \tilde{E}) - \frac{\mathbb{1}(e_k \notin \tilde{E})}{1 - \theta_k}\,\mathcal{L}_{\mathcal{R}}(X, \tilde{E})\Bigr]. \quad (27)$$
Empirically, this method leads to poor estimates. Most of the variance in gradient updates to $\theta_k$ comes from sampling different subsets of edges among the $|E| - 1$ edges other than $e_k$, not from the effect of fixing $e_k \in \tilde{E}$ or $e_k \notin \tilde{E}$ for a particular edge.
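The unbiasedness of the Equation (27) estimator can be checked exactly on a tiny edge set by enumerating all circuits (the toy loss and function names are illustrative):

```python
from itertools import product

def grad_term(include, theta, loss_value, k):
    """One-sample gradient term for edge k in the style of Eq. (27):
    1(e_k in E~)/theta_k * L  -  1(e_k not in E~)/(1 - theta_k) * L."""
    if include[k]:
        return loss_value / theta[k]
    return -loss_value / (1.0 - theta[k])

def exact_expectation(theta, loss_fn, k):
    """Exact expectation of the estimator over all 2^|E| circuits,
    with edges included independently with probabilities theta
    (tractable only for toy |E|)."""
    total = 0.0
    for inc in product([0, 1], repeat=len(theta)):
        p = 1.0
        for t, i in zip(theta, inc):
            p *= t if i else 1.0 - t
        total += p * grad_term(inc, theta, loss_fn(inc), k)
    return total

# Toy loss L_R(E~) = |E~| gives f(theta) = sum(theta), so df/dtheta_k = 1;
# the enumerated expectation matches, confirming unbiasedness (up to
# float error), even though single samples of grad_term vary widely.
g = exact_expectation([0.3, 0.7], lambda inc: float(sum(inc)), k=0)
```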
Instead, the basis for UGS is an approximation of the marginal ablation loss gap for each edge, obtained by taking gradients with respect to sampled partial ablation coefficients $\vec{\alpha}$. We consider the extension of $\mathcal{M}^{\tilde{E}}$ to convex relaxations $\mathcal{M}^{\vec{\alpha}}$, where $\alpha_k$ represents the partial ablation coefficient for edge $e_k$ as alluded to above. Similarly, we consider $\mathcal{L}_{\mathcal{R}}(X,\vec{\alpha})$ in place of $\mathcal{L}_{\mathcal{R}}(X,\tilde{E})$. Define $\vec{\alpha}(\vec{U},\tilde{E},S)$, where $S\subset\{1,\dots,|E|\}$, such that
$$\alpha_k(\vec{U},\tilde{E},S) = \mathbb{1}(k\in S)\,U_k + \mathbb{1}(k\notin S)\,\mathbb{1}(e_k\in\tilde{E}).$$
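A minimal sketch of this masking rule (the names `U`, `in_circuit`, and `S` are our own, not the paper's):

```python
def alpha(U, in_circuit, S, k):
    """alpha_k = 1(k in S) * U_k + 1(k not in S) * 1(e_k in E~).

    U: dict mapping edge index -> Unif(0,1) draw (only needed for k in S);
    in_circuit: set of edge indices in E~; S: set of relaxed edge indices.
    """
    return U[k] if k in S else float(k in in_circuit)
```

Edges in $S$ receive their sampled partial coefficient, while all other edges are hard-set to 0 or 1 according to $\tilde{E}$.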
For any edge $e_k$, the marginal ablation loss gap inside the expectation in Equation (26) is equal to the expected gradient with respect to $\alpha_k\sim\mathrm{Unif}(0,1)$ when the other edges are sampled according to $\tilde{E}$:
$$\frac{\partial f(\vec{\theta})}{\partial\theta_k} = \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta},\,\vec{U}\sim\mathrm{Unif}(0,1)\text{ iid}}\left[\frac{\partial}{\partial U_k}\,\mathcal{L}_{\mathcal{R}}(X,\vec{\alpha}(\vec{U},\tilde{E},\{k\}))\right]. \tag{28}$$
In other words, we can estimate the effect of totally ablating $e_k$ for a given $(X,\tilde{E})$ by sampling a partial ablation coefficient $\alpha_k\sim\mathrm{Unif}(0,1)$ and taking the loss gradient with respect to $\alpha_k$. However, we run into the same problem of needing to estimate the effect individually for each edge $k$. UGS assumes that we can estimate this loss gradient for many edges simultaneously without introducing much bias. For any particular edge $e_k$, the interference caused by sampling $\alpha_j\sim\mathrm{Unif}(0,1)$ for all other edges $j\in S$, instead of setting them according to $\tilde{E}$, could be small if $S$ is small enough.
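Equation (28) rests on the identity $\mathbb{E}_{U\sim\mathrm{Unif}(0,1)}[g'(U)] = g(1)-g(0)$ (the fundamental theorem of calculus in expectation): the average gradient at a random partial-ablation coefficient recovers the total-ablation loss gap. A quick Monte Carlo check on an arbitrary smooth $g$ of our choosing:

```python
import random

# E_{U~Unif(0,1)}[g'(U)] = g(1) - g(0): the expected gradient at a random
# interpolation point equals the endpoint gap. g is an arbitrary smooth
# stand-in for the loss as a function of one partial ablation coefficient.
random.seed(0)

def g(a):
    return a ** 3 + 2 * a

def g_prime(a):
    return 3 * a ** 2 + 2

n = 200_000
mc = sum(g_prime(random.random()) for _ in range(n)) / n
total_gap = g(1.0) - g(0.0)   # = 3.0
# mc converges to total_gap
```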
The main approximation of UGS is that, if $S$ is sampled from a distribution $\mathcal{D}_S$,
$$\frac{\partial f(\vec{\theta})}{\partial\theta_k} \approx \mathbb{E}_{X,\,\tilde{E}\sim\vec{\theta},\,\vec{U}\sim\mathrm{Unif}(0,1)\text{ iid},\,S\sim\mathcal{D}_S}\left[\frac{\partial}{\partial U_k}\,\mathcal{L}_{\mathcal{R}}(X,\vec{\alpha}(\vec{U},\tilde{E},S))\,\middle|\,k\in S\right]. \tag{29}$$

F.6 Uniform Gradient Sampling: construction

We use $\theta_k$ to represent the sampling probability for each edge, and perform gradient descent by constructing a loss function whose gradient is a sample estimator of Equation (29). As noted in the main text, rather than using $\theta_k\in(0,1)$ as our parameters, we use $\tilde{\theta}_k\in(-\infty,\infty)$ and compute $\theta_k=\sigma(\tilde{\theta}_k)$. We initialize $\tilde{\theta}_k=1$ for all edges $e_k$. We avoid random initialization because it achieves worse results: the resulting circuits are suboptimally constrained to remain close to the random prior. We sample $S\subset\{1,\dots,|E|\}$ by independently sampling each $\mathbb{1}(k\in S)\sim\mathrm{Bern}(w(\theta_k))$, where a window function $w$ determines how often we sample $\alpha_k\sim\mathrm{Unif}(0,1)$.
Additionally, we require that
$$\mathbb{P}(e_k\in\tilde{E}\mid k\in S) = \mathbb{P}(e_k\notin\tilde{E}\mid k\in S) = \frac{1}{2} \tag{30}$$
so that sampling $\alpha_k\sim\mathrm{Unif}(0,1)$ takes away probability mass equally from $e_k\in\tilde{E}$ and $e_k\notin\tilde{E}$. We perform this adjustment because, for the purpose of estimating gradients $\nabla$ with respect to edges other than $e_k$, Equation (29) implicitly assumes that
$$\mathbb{E}[\nabla\mid k\in S] \approx p\cdot\mathbb{E}[\nabla\mid\alpha_k=1] + (1-p)\cdot\mathbb{E}[\nabla\mid\alpha_k=0], \qquad p=\mathbb{P}(e_k\in\tilde{E}\mid k\in S) \tag{31}$$
and without any priors about the functional form of $\nabla$, $p=\frac{1}{2}$ is a reasonable choice. We construct a loss function whose gradient is given by Equation (5), a sample estimator of Equation (29) with this construction of $S$. We construct the batch $X^{(1)},\dots,X^{(b)}$ by choosing $b/n_s$ unique samples of the input $X$ and repeating each input $n_s$ times in the batch. We generally use $n_s=12$ and $b=60$. We choose $w(\theta_k)=c\cdot\theta_k(1-\theta_k)$.
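Under this construction, $\alpha_k$ follows a three-point mixture: uniform with probability $w(\theta_k)$, with the endpoints 0 and 1 absorbing the remaining mass so that Equation (30) holds. The sketch below simulates the clamp-based sampling rule used later in Algorithm 1 (with the straight-through gradient machinery dropped, so the pivot is simply $\theta$) and checks the induced masses empirically; the mixture weights in the comments follow our reconstruction of the derivation:

```python
import random

# Sampling rule: alpha = clamp((theta - U)/W + 0.5, 0, 1), U ~ Unif(0,1),
# W = w(theta) = c * theta * (1 - theta).  Under our reconstruction this
# yields alpha = 1 w.p. theta - W/2, alpha = 0 w.p. 1 - theta - W/2, and
# alpha uniform on (0,1) w.p. W, so the uniform window removes equal mass
# from both endpoints, as Equation (30) requires.
random.seed(1)
theta = 0.6                       # illustrative inclusion probability
W = 1.0 * theta * (1 - theta)     # window function with c = 1

def sample_alpha():
    U = random.random()
    return min(max((theta - U) / W + 0.5, 0.0), 1.0)

n = 200_000
draws = [sample_alpha() for _ in range(n)]
p_one = sum(a == 1.0 for a in draws) / n    # endpoint mass at alpha = 1
p_zero = sum(a == 0.0 for a in draws) / n   # endpoint mass at alpha = 0
p_unif = 1.0 - p_one - p_zero               # interior (uniform) mass
```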
Note that we need $c\le 2$ in order for $\frac{1}{2}w(\theta_k)\le\min(\theta_k,1-\theta_k)$ to hold, as required by Equation (30); we choose $c=1$. We discuss this choice in Appendix F.8. The full algorithm pseudocode is given after describing a few additional details in Appendix F.7.

F.7 Additional circuit discovery details

For HCGS and UGS, we use learning rates between 0.01 and 0.15 for the sampling parameters.

Pruning dangling edges. For HCGS and UGS, after sampling $\vec{\alpha}(\vec{U},\tilde{E},S)$ for each input, we remove "dangling" edges $e_k$. A dangling vertex is a vertex $A$ for which there does not exist a path from the model input to $A$ along edges $e_j$ with $\alpha_j>0$, or for which there does not exist such a path from $A$ to the model output. A dangling edge is an edge connected to a dangling vertex. For a dangling edge $e_k$, we set $\alpha_k=0$ (i.e., the equivalent of removing $e_k$ from $\tilde{E}$ and $S$).

Discretization. For HCGS and UGS, after training the $\theta_k$, we select a final circuit consisting of all edges with $\theta_k>\tau$ for a threshold $\tau$. Generally, for UGS, all but a handful of the logits $\tilde{\theta}_k$ converge to highly negative or highly positive values, so the choice of $\tau$ does not have much impact, and we choose $\tau=0.5$. However, for HCGS, many edges have $\tilde{\theta}_k$ parameters around zero even after training for 10,000 batches. We again select $\tau=0.5$, since we observe that including edges with $\tilde{\theta}_k\in(-1,0.5)$ does not generally affect performance.
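The dangling-edge pruning step above can be sketched as two reachability passes over the positive-coefficient subgraph. The node names and dict-based representation are our own choices:

```python
from collections import defaultdict, deque

def prune_dangling(edges, alpha, source, sink):
    """Zero out edges touching a dangling vertex.

    edges: list of (u, v) pairs; alpha: dict mapping edge -> coefficient.
    A vertex is dangling if it is unreachable from `source` or cannot reach
    `sink` along edges with positive coefficient.
    """
    live = [e for e in edges if alpha[e] > 0]
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in live:
        fwd[u].append(v)
        bwd[v].append(u)

    def reach(start, adj):
        seen, queue = {start}, deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        return seen

    from_src = reach(source, fwd)   # vertices reachable from the input
    to_sink = reach(sink, bwd)      # vertices that can reach the output
    out = dict(alpha)
    for u, v in edges:
        if not ({u, v} <= from_src and {u, v} <= to_sink):
            out[(u, v)] = 0.0       # edge touches a dangling vertex
    return out
```

For example, an edge into a vertex with no path onward to the output is zeroed even if its own coefficient was positive.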
Optimizing constants for OA. For HCGS and UGS, we train the ablation constants $\hat{a}$ concurrently with the sampling parameters $\theta_k$. We use a learning rate of 0.002 for $\hat{a}$, lower than the learning rate used for the sampling parameters. Note that we only provide gradient updates to $\hat{a}_j$ for a vertex $A_j$ along edges $e_k$ for which $\alpha_k=0$. Updating $\hat{a}_j$ when $\alpha_k\neq 0$ can lead $\hat{a}$ to update toward a value that is optimal when taking a linear combination of $\hat{a}$ and $A_j(X)$, rather than a value that is optimal as a constant. See Figure 6.

Figure 6: Gradient updates on $\hat{a}$ can be biased when $\alpha_k\neq 0$.

Even though we obtain approximate constants $\hat{a}$ through the training process for HCGS and UGS, in order to level the playing field when comparing to ACDC and EAP, we do not use the constants found during training for circuit evaluation. Instead, for each circuit discovery algorithm, we evaluate circuits with OA by initializing the constants to subtask means and then training for the same number of batches (10,000) with the same settings and a learning rate of 0.002.
Algorithm 1 shows the full algorithm of UGS, including our exact loss function, for optimization with OA. For optimization with other ablation methods, we set the ablated values $\hat{a}$ according to the ablation method and do not perform gradient updates on $\hat{a}$. Note that we use the notation $\mathcal{M}_{\vec{\alpha}}(X,\hat{a})$ with OA to indicate running the circuit evaluation with ablation coefficients $\vec{\alpha}$ and replacing ablated components with $\hat{a}$, in line with our notation in Appendix C.2.

Algorithm 1 Uniform gradient sampling
Input: set of edges E, initial parameter array θ, initial constant array â
Output: a set of edges Ẽ ⊂ E that represents the circuit
Require: metric L, learning rates δ_θ and δ_a, final threshold τ, batch size b, sample count per input n_s, window function w

loop
    X ← [ ]; α ← [ ]; UnifCount ← [0 for k ∈ [length(θ)]]
    for j ∈ [b/n_s] do
        α[j] ← [ ]
        X[j] ← sample_input()
        for i ∈ [n_s] do
            α[j][i] ← [ ]
            for k ∈ [length(θ)] do
                U ← Unif(0,1)
                W ← w(θ[k]).detach_gradient()
                p ← W·θ[k] + (1−W)·θ[k].detach_gradient()
                α[j][i][k] ← ((p − U)/W + 0.5).clamp(0, 1)
                UnifCount[k] ← UnifCount[k] + 1(α[j][i][k] ∈ (0,1))
            α[j][i] ← prune_dangling_edges(α[j][i])
    L ← [ ]
    for j ∈ [b/n_s] do
        for i ∈ [n_s] do
            for k ∈ [length(θ)] do
                g ← b/UnifCount[k]
                α[j][i][k] ← g·α[j][i][k] + (1−g)·α[j][i][k].detach_gradient()
            b̂ ← (α[j][i] == 0)·â + (α[j][i] > 0)·â.detach_gradient(), where + and · are applied componentwise; this step is only used with OA.
            L.append(L(M_{α[j][i]}(X[j], b̂), M(X[j])))
    f ← (Σ L)/b
    θ ← θ − δ_θ·∇_θ f
    â ← â − δ_a·∇_â f; this step is only used with OA. Note that in practice, we use Adam to determine step sizes.
Ẽ ← ∅
for k ∈ [length(θ)] do
    if θ[k] > τ then Ẽ.add(E[k])
return Ẽ

F.8 Choosing a window size for UGS

We motivate the choice of our window function $w(\theta_k)=\theta_k(1-\theta_k)$. Let $f_k^* := \frac{\partial f(\vec{\theta})}{\partial\tilde{\theta}_k} = \frac{\partial f(\vec{\theta})}{\partial\theta_k}\cdot\frac{\partial\theta_k}{\partial\tilde{\theta}_k}$, and let $f_k$ be our sample estimate from approximation (29). Let $K\sim\mathrm{Unif}(\{1,\dots,|E|\})$. We may want to minimize the squared distance between our sample estimates and the true gradient values,
$$\varepsilon := \mathbb{E}(f_K^*-f_K)^2 = \mathbb{E}_K\,\mathrm{Var}(f_K\mid K) + \mathbb{E}_K\left(\mathbb{E}[f_K\mid K]-f_K^*\right)^2. \tag{32}$$
Let our sampling distribution $S\sim\mathcal{D}_S$ be defined by independent Bernoulli random variables $\mathbb{1}(e_k\in S)\sim\mathrm{Bern}(w_k)$.
Assume that we collect $b$ samples of $(\vec{U},\tilde{E},S)$ in a batch and use the samples $i$ for which $k\in S_i$ to estimate $f_k$. Let $T_k := \sum_{i=1}^{b}\mathbb{1}(k\in S_i)$, and let $f_k := \frac{1}{T_k}\sum_{i=1}^{b} f_k^{(i)}\,\mathbb{1}(k\in S_i)$, where $f_k^{(i)}$ is the single-sample estimate from sample $i$; write $f_k^{(1)}$ for the estimate from the first sample with $k\in S_i$. Assuming that $\alpha_k\sim\mathrm{Unif}(0,1)$ w.p. $w_k$, $\alpha_k=1$ w.p. $\theta_k-\frac{1}{2}w_k$, and $\alpha_k=0$ w.p. $1-\theta_k-\frac{1}{2}w_k$, as given by Equation (30), we have the loss derivative
$$|E|\,\frac{\partial\varepsilon}{\partial w_k} = \frac{\partial\,\mathrm{Var}(f_k)}{\partial w_k} + \sum_{\ell\neq k}\left(\frac{\partial\,\mathrm{Var}(f_\ell)}{\partial w_k} + 2\left(\mathbb{E}[f_\ell]-f_\ell^*\right)\frac{\partial\,\mathbb{E}[f_\ell]}{\partial w_k}\right) \tag{33}$$
This computation tells us that as we increase $w_k$, we can decrease the error $\varepsilon$ by lowering the variance of our estimate $f_k$, and we can increase the error by raising the variance of the estimates $f_\ell$ for other edges and by making the $f_\ell$ more biased. If we assume that the second term is roughly equal to a constant $c$ for all edges, then
$$|E|\,\frac{\partial\varepsilon}{\partial w_k} = \mathrm{Var}(f_k^{(1)})\left(\frac{\partial\theta_k}{\partial\tilde{\theta}_k}\right)^2\frac{\partial}{\partial w_k}\,\mathbb{E}\left[\frac{1}{T_k}\,\middle|\,T_k>0\right] + c \tag{34}$$
since $T_k=0$ for an edge simply means that it does not get a gradient update.
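Equation (34) involves the factor $(\partial\theta_k/\partial\tilde{\theta}_k)^2$; under the sigmoid reparameterization $\theta_k=\sigma(\tilde{\theta}_k)$, this derivative equals $\theta_k(1-\theta_k)$, which a finite-difference check confirms:

```python
import math

# Finite-difference check that d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)),
# the factor written as theta_k(1 - theta_k) in the window-size derivation.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.8                 # arbitrary test point
eps = 1e-6
fd = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # numerical derivative
analytic = sigmoid(x) * (1.0 - sigmoid(x))
# fd and analytic agree
```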
Note that $\frac{\partial\theta_k}{\partial\tilde{\theta}_k} = \theta_k(1-\theta_k)$, and $\frac{\partial}{\partial w_k}\,\mathbb{E}\left[\frac{1}{T_k}\right] \approx -b'\,w_k^{-2}$ for a constant $b'>0$ (distinct from the batch size $b$). Solving $\frac{\partial\varepsilon}{\partial w_k}=0$, $\varepsilon$ is minimized when $w_k \propto \theta_k(1-\theta_k)\sqrt{\mathrm{Var}(f_k^{(1)})}$, which motivates the definition of the window function $w(\theta_k)=\theta_k(1-\theta_k)$ in our main experiments. We try including the additional factor of $\sqrt{\mathrm{Var}(f_k^{(1)})}$, but our results do not improve. The lack of improvement could be explained by the fact that edges with higher-variance gradients may also have a higher $c$, representing a more disruptive effect on the gradient estimates of other edges when $\alpha_k\sim\mathrm{Unif}(0,1)$, which would push the optimal window sizes $w_k$ closer to equal.

F.9 Comparison of UGS and HCGS

HCGS and UGS both involve sampling gradients from a distribution over $\vec{\alpha}\in(0,1)^{|E|}$ and taking gradient steps on parameters $\tilde{\theta}_k$ that represent our confidence that $e_k\in\tilde{E}^*$, using an average of gradients with respect to the edge coefficients $\alpha_k$.
The original explanation provided by Louizos et al. (2018) for the convergence of HCGS to a satisfactory subset of weights minimizing a loss function similar to Equation (4) involves $L_0$ regularization. We believe a more compelling explanation for the performance of HCGS is that its sampling of gradients on a region of partial ablations, $\vec{\alpha}\in(0,1)^{|E|}$, serves as a rough approximation of Equation (28). Sampling $\alpha_k\sim\mathrm{Unif}(0,1)$ to obtain gradient information, rather than using a scaled conditional Concrete distribution, makes the simultaneous gradient-sampling estimator unbiased in the single-dimensional case, and is thus the choice that makes Equation (29) most resemble Equation (28).

F.10 Additional IOI circuit discovery results

This section displays additional circuit discovery results on the IOI subtask. In Figure 7, we show the tradeoff between $\Delta$ (y-axis) and $|\tilde{E}|$ (x-axis) for optimal ablation to compare different circuit discovery methods. In Figure 8, we show this tradeoff for mean ablation (left) and resample ablation (right).

Figure 7: Circuit discovery Pareto frontier for the IOI subtask with optimal ablation.

Figure 8: Circuit discovery Pareto frontier for IOI with mean ablation (left) and resample ablation (right).

F.11 Circuit discovery results for Greater-Than

This section displays circuit discovery results on the Greater-Than subtask. In Figure 9, we show the tradeoff between $\Delta$ (y-axis) and $|\tilde{E}|$ (x-axis) for optimal ablation (left) and counterfactual ablation (right). In Figure 10, we show this tradeoff for mean ablation (left) and resample ablation (right).
Finally, in Figure 11, we show the $\Delta$ achieved by circuits optimized using UGS on $\Delta$ with different ablation methods, analogous to Figure 1 (right) in the main text for the IOI subtask.

Figure 9: Circuit discovery Pareto frontier for the Greater-Than subtask with optimal ablation (left) and counterfactual ablation (right).

Figure 10: Circuit discovery Pareto frontier for Greater-Than with mean ablation (left) and resample ablation (right).

Figure 11: Comparison of different ablation methods for circuit discovery for Greater-Than.

F.12 Comparison to Edge Pruning

Edge Pruning (Bhaskar et al., 2024) is a concurrent work that uses HCGS for circuit discovery. We compare our implementation of HCGS against their custom implementation on their evaluation settings (IOI with counterfactual ablation and Greater-Than with resample ablation) and find that their additional implementation details do not cause Edge Pruning to outperform our HCGS implementation (see Figure 12). Thus, we include only our own implementation of HCGS as a baseline in our main figures.

Figure 12: Comparison of our methods to Edge Pruning on IOI (left) and Greater-Than (right).

F.13 Random circuits

One question is whether it is possible to extract circuits with OA that do not actually explain model behavior on the training distribution, by setting vertices to out-of-distribution values that maximally elicit a certain behavior. If the ablation constants $\hat{a}$ overparameterize the data, then performing OA could behave similarly to fine-tuning the model to perform the desired task. Intuitively, however, OA strictly decreases the amount of computation available to the model, since we only add constants to model components and do not allow additional transformations of internal representations that are not already present in the downstream computation.
To verify this stance, we compare the loss recovered by circuits discovered by UGS against random circuits, checking that OA indeed distinguishes subtask-performing mechanisms and does not provide enough degrees of freedom to elicit subtask behavior from unrelated model components. In Table 3, we compare the $\Delta(\mathcal{M},\tilde{E})$ achieved by random circuits $\tilde{E}$ to that achieved by circuits optimized with UGS for various ablation types. We construct $\tilde{E}$ by sampling each $\mathbb{1}(e_k\in\tilde{E})$ independently with some probability $p$ and pruning dangling edges as detailed in Appendix F.7. We accept $\tilde{E}$ if $|\tilde{E}|$ falls within an acceptable range, and we select $p$ to maximize the probability that it does. We set the range of $|\tilde{E}|$ to $[400,500]$ for IOI and $[200,300]$ for Greater-Than. Recall that to evaluate circuits with OA, we perform gradient descent on $\hat{a}$ to approximate the optimal constants. Since repeating this process is expensive, we truncate training after just 200 batches, far short of the 10,000 batches used for a full training run, for both the random circuits and the optimized circuits. However, we verify with a smaller sample size that the loss for the random circuits does not tend to decrease much with further training; in fact, for the optimized circuits, the loss typically drops by more than 50% after the first batch, which does not occur for the random circuits.
Table 3: Optimized circuits compared to random circuits for various ablation types

IOI | Mean | Resample | Optimal | Counterfactual
--- | --- | --- | --- | ---
Random circuit loss | 4.529 | 6.527 | 2.723 | 4.264
UGS circuit loss | 0.264 | 1.779 | 0.176 | 0.191
Std | 0.200 | 0.085 | 0.024 | 0.049
Z-score | -21.28 | -55.67 | -100.57 | -82.44

Greater-Than | Mean | Resample | Optimal | Counterfactual
--- | --- | --- | --- | ---
Random circuit loss | 1.010 | 2.109 | 0.900 | 1.785
UGS circuit loss | 0.033 | 0.056 | 0.029 | 0.021
Std | 0.020 | 0.039 | 0.011 | 0.027
Z-score | -49.29 | -52.89 | -80.81 | -64.76

While random circuits achieve lower loss under OA than under mean and resample ablation, the $\Delta_{\mathrm{opt}}$ measurements for random circuits do not approach the low values achieved by optimized circuits. Furthermore, the standard deviation of $\Delta$ for random circuits is lower on average for OA than for mean or resample ablation, and the OA losses for optimized circuits have the most significant Z-scores for both IOI and Greater-Than, though differences between Z-scores of such large magnitude are not necessarily meaningful. These results suggest that OA highlights specialized circuit components that already exist in the model rather than fabricating nonexistent mechanisms.

Appendix G Causal tracing

G.1 Transformer graph representation

Consider running a causal tracing experiment on vertex $A$. We represent the model with four vertices: $\mathrm{Subj}(X)$, representing the subject tokens of the input; $\mathrm{NonSubj}(X)$, representing the remaining input tokens; the component of concern, $A(X)=A(\mathrm{Subj}(X),\mathrm{NonSubj}(X))$; and the model output $\mathrm{Out}(X)=\mathrm{Out}(\mathrm{Subj}(X),\mathrm{NonSubj}(X),A(X))$.
In particular, if $A=\mathrm{MLP}^{(i)}$, then we compute $\mathrm{Out}(X)$ as a function of these three arguments via Equation (8), which takes $A(X)$ and $\mathrm{MResid}^{(i)}(X)$ as input: we take $\mathrm{MResid}^{(i)}(X)$ as a function of $\mathrm{Subj}(X)$ and $\mathrm{NonSubj}(X)$, and then compute $\mathrm{Out}(X)$ as a function of $\mathrm{Resid}^{(i)}(X)$. A similar construction is used for attention layers. Note that the AIE compares the performance of the model with the vertex $\mathrm{Subj}$ ablated (the denominator in Equation (6)) to the performance of the model with only the edge $(\mathrm{Subj},\mathrm{Out})$ ablated (the numerator in Equation (6)).

G.2 Relation of AIE to ablation loss gap

For consistency with Meng et al. (2022), we use a carefully selected loss function in the definition of $\Delta$ to represent proximity to the model's original predictions, rather than the typical KL-divergence loss. In particular, we choose
$$\mathcal{L}_{\mathrm{AIE}}(P,Q) := \max\left(0,\;\max Q - [P]_{\arg\max Q}\right), \tag{35}$$
where $P$ and $Q$ represent probability distributions over the model vocabulary.
Note that since the dataset is filtered so that $Y=\arg\max\mathcal{M}(X)$, replacing $\mathcal{L}$ with $\mathcal{L}_{\mathrm{AIE}}$ in $\Delta$, we get
$$\Delta = \mathbb{E}_{X\sim\mathcal{D}}\,\mathcal{L}_{\mathrm{AIE}}\left(\mathcal{M}_A(\xi(X),A(X)),\,\mathcal{M}(X)\right) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\max\left(0,\;[\mathcal{M}(X)]_Y - [\mathcal{M}_A(\xi(X),A(X))]_Y\right) \tag{36}$$
which is the numerator in Equation (6).

G.3 Additional results

Figure 13: Causal tracing probabilities for different token positions with window size 1 (patching a single component). Error bars indicate the sample estimate plus/minus two standard errors.

Figure 14: Causal tracing probabilities for different token positions with a sliding window of size 5. Error bars indicate the sample estimate plus/minus two standard errors.

Figure 15: Causal tracing probabilities for different token positions with a sliding window of size 9. Error bars indicate the sample estimate plus/minus two standard errors.

We show results for additional window sizes and token positions. In particular, we show results for intervening at all subject token positions, only the last subject token position, and only the last token position, for window sizes 1 (Figure 13), 5 (Figure 14), and 9 (Figure 15). In addition to providing a more precise localization of components' informational contributions, the results provide some evidence against one of the claims of Meng et al. (2022): the idea that the last subject token position is a uniquely important "early site" for processing information at MLP layers 10-20.
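A minimal sketch of $\mathcal{L}_{\mathrm{AIE}}$ as reconstructed above, for distributions represented as plain probability lists (the function name is our own):

```python
# L_AIE(P, Q): the drop in probability assigned to the original model's
# argmax token when moving from the original distribution Q to the ablated
# distribution P, floored at zero.
def l_aie(P, Q):
    y = max(range(len(Q)), key=lambda i: Q[i])   # argmax of original dist
    return max(0.0, max(Q) - P[y])
```

When the ablated model assigns the argmax token at least as much probability as the original, the loss is zero; otherwise it equals the probability drop.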
For sliding windows of size 5, GNT shows that intervening on MLPs at only the last subject token position achieves over half of the AIE of performing the same intervention at all subject token positions (33.8% vs. 19.3%, shown in Figure 14). However, the OAT results indicate that intervening at all subject tokens is much more effective (35.2% vs. 11.1%), suggesting that early subject token positions may be more important than previously thought.

G.4 Construction of standard errors

For input-label pairs $(X,Y)\sim\mathcal{D}$, let $W=\max\left(0,\,[\mathcal{M}(X)]_Y-[\mathcal{M}_A(\xi(X),A(X))]_Y\right)$ and $Z=[\mathcal{M}(X)]_Y-[\mathcal{M}(\xi(X))]_Y$, and let $\hat{W}_n$ and $\hat{Z}_n$ be their respective sample means with $n$ samples. Recall from Equation (6) that we want to estimate from samples the quantity
$$\mathrm{AIE}(A)=\max\left(0,\,1-\frac{\mathbb{E}W}{\mathbb{E}Z}\right)=:\max\left(0,\,1-\frac{\mu_W}{\mu_Z}\right).$$
By the central limit theorem,

$$\sqrt{n}\left(\begin{bmatrix} \hat{W}_n \\ \hat{Z}_n \end{bmatrix} - \begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix}\right) \xrightarrow{d} N(0, \Sigma) := N\left(0, \begin{bmatrix} \sigma_W^2 & \sigma_{WZ} \\ \sigma_{WZ} & \sigma_Z^2 \end{bmatrix}\right) \tag{37}$$

By the multivariate delta method, for $h\left(\begin{bmatrix} w \\ z \end{bmatrix}\right) = \frac{w}{z}$ and $v := \nabla h\left(\begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix}\right) = \begin{bmatrix} \frac{1}{\mu_Z} \\ -\frac{\mu_W}{\mu_Z^2} \end{bmatrix}$,

$$\sqrt{n}\left(\frac{\hat{W}_n}{\hat{Z}_n} - \frac{\mu_W}{\mu_Z}\right) = \sqrt{n}\left(h\left(\begin{bmatrix} \hat{W}_n \\ \hat{Z}_n \end{bmatrix}\right) - h\left(\begin{bmatrix} \mu_W \\ \mu_Z \end{bmatrix}\right)\right) \xrightarrow{d} N(0,\, v^T \Sigma v) \tag{38}$$

so the asymptotic variance is $\frac{\mu_W^2}{\mu_Z^2}\left(\frac{\sigma_W^2}{\mu_W^2} + \frac{\sigma_Z^2}{\mu_Z^2} - \frac{2\sigma_{WZ}}{\mu_W \mu_Z}\right)$, which we estimate via samples to obtain our standard errors.

Appendix H OCA lens

H.1 Transformer graph representation

We represent the model with $\mathrm{Resid}^{(i)}$, $\mathrm{MResid}^{(i)}$, $\mathrm{Attn}^{(i)}$, and $\mathrm{MLP}^{(i)}$ vertices for each layer $i$ and a vertex $\mathrm{Out}(x)$ representing the model output, where the relationships between the vertices are defined by the equations given in Appendix C.3. Applying OCA lens at layer $i$ entails ablating vertices $\mathrm{Attn}^{(i+1)}$ through $\mathrm{Attn}^{(N)}$ (where $N$ is the number of layers in the model).

H.2 Additional prediction accuracy results

Figure 16 shows results on prediction loss for GPT-2-small, GPT-2-medium, and GPT-2-large. For all models, we use a learning rate of 0.01 for tuned lens and 0.002 for OCA lens.
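The standard-error construction in G.4 can be sketched as follows. This is our own illustration with synthetic placeholder samples for $W$ and $Z$ (the function name and data are assumptions, not from the paper's code); it computes the plug-in estimate of the delta-method asymptotic variance for the ratio $\hat{W}_n / \hat{Z}_n$:

```python
import numpy as np

def ratio_delta_method_se(w, z):
    """Delta-method standard error for the sample ratio mean(w) / mean(z)."""
    n = len(w)
    mu_w, mu_z = w.mean(), z.mean()
    var_w, var_z = w.var(ddof=1), z.var(ddof=1)
    cov_wz = np.cov(w, z, ddof=1)[0, 1]
    # Asymptotic variance: (mu_W^2 / mu_Z^2) *
    #   (sigma_W^2/mu_W^2 + sigma_Z^2/mu_Z^2 - 2*sigma_WZ/(mu_W*mu_Z))
    avar = (mu_w**2 / mu_z**2) * (
        var_w / mu_w**2 + var_z / mu_z**2 - 2 * cov_wz / (mu_w * mu_z)
    )
    return np.sqrt(avar / n)

# Synthetic correlated samples standing in for per-example W and Z values.
rng = np.random.default_rng(0)
z = rng.uniform(0.5, 1.0, 10_000)
w = 0.3 * z + rng.normal(0.0, 0.05, 10_000)
se = ratio_delta_method_se(w, z)
```

Because $W$ and $Z$ are computed on the same examples, the covariance term $\sigma_{WZ}$ matters: ignoring it would typically overstate the standard error when the two quantities are positively correlated.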
Figure 16: Comparison of prediction loss between tuned lens and ablation-based alternatives.

H.3 Additional causal faithfulness results

The following figures show the causal faithfulness metrics under several kinds of perturbations. Let $\mu = \mathbb{E}[\ell_i(X)]$ and $\Sigma = \mathrm{Var}(\ell_i(X))$.

• Random perturbation: We sample $Z \sim N(0, \Sigma)$ and let $V = Z/\|Z\|$. We let $Z' \sim N(0, 1)$ with $Z' \perp\!\!\!\perp Z$, and define $\xi(a) = a + c \cdot Z' \cdot V$, where the constant $c$ is chosen such that $\mathbb{E}[\mathcal{L}(\mathcal{M}(X; \xi), \mathcal{M}(X))] \approx 0.2$. Results shown in Figure 17.

Figure 17: Causal faithfulness comparison under random perturbations.

• Basis-aligned perturbation: Same as random perturbation, except we choose a basis of $d_{\mathrm{model}}$ vectors as described in Section 5 and let $Z$ be a uniformly sampled basis element. Results shown in Figure 18.

Figure 18: Causal faithfulness comparison under basis-aligned perturbations.

• Random projection: We sample $Z \sim N(0, \Sigma)$ and let $V = Z/\|Z\|$. We define $\xi(a) = \mu + P(a - \mu)$, where $P$ is the projection onto the orthogonal complement of $V$. Results shown in Figure 19.

Figure 19: Causal faithfulness comparison under random projections.

• Basis-aligned projection: Described in the main text. Results shown in Figure 3.

• Basis-aligned resample ablation: We choose a basis as described in Section 5. We consider the subspace spanned by the 100 basis elements with the largest singular values, and define $\xi(a)$ by performing resample ablation on the projection of $a$ onto this subspace. Results shown in Figure 20.
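The random perturbation map above can be sketched as follows. This is a standalone illustration: the dimension, the placeholder covariance, and the scale `c` are all assumptions (in the paper, $c$ is calibrated so that the induced loss increase is approximately 0.2):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                           # stand-in for d_model
A = rng.normal(size=(d, d))
Sigma = A @ A.T                 # placeholder covariance of ell_i(X)
c = 0.5                         # assumed scale; calibrated in the paper

def xi(a):
    """Random perturbation xi(a) = a + c * Z' * V from Appendix H.3."""
    Z = rng.multivariate_normal(np.zeros(d), Sigma)
    V = Z / np.linalg.norm(Z)   # random unit direction, shaped by Sigma
    Zp = rng.normal()           # Z' ~ N(0, 1), drawn independently of Z
    return a + c * Zp * V

a = rng.normal(size=d)          # a stand-in latent vector
out = xi(a)
```

Sampling the direction $V$ from $N(0, \Sigma)$ rather than isotropically concentrates the perturbation in directions where the latent actually varies, while the independent scalar $Z'$ controls its magnitude.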
Figure 20: Causal faithfulness comparison under basis-aligned resample ablation.

We find that the improvement in causal faithfulness is consistent across all perturbation types studied.

H.4 Elicitation results on factual datasets

Figure 21: Comparison of elicitation accuracy boost between OCA lens and tuned lens.

We show additional results for elicitation on the text classification datasets. Figure 21 compares the elicitation accuracy boost of OCA lens against tuned lens, and Figure 22 shows comprehensive results for each individual dataset. For our experiments, we use 10 demonstrations and sample from the datasets without replacement to generate the demonstration examples. Note that we exclude SST2-AB, a toy dataset constructed by Halawi et al. (2024) that replaces SST2 sentiment labels with the letters "A" and "B": it was created only to show that elicitation accuracy does not improve when the expected answer is unrelated to the question (since the label is encoded in a non-intuitive manner, information from later layers is required to relate internal knowledge to the correct label).

Figure 22: Comparison of calibrated accuracy of elicited completions on 15 datasets from Halawi et al. (2024). Dotted lines indicate the accuracy of the model's output predictions for true demonstrations (black) and false demonstrations (red).

Appendix I Reproducibility

All code can be found at https://github.com/maxtli/optimalablation. All experiments were run on a single Nvidia A100 GPU with 80GB of VRAM. The cost of UGS is comparable to that of ACDC (about 1-2 hours to train). Training OCA lens until convergence takes about 3-5 hours, similar to the time required to train tuned lens.

Appendix J Impact statement

We believe that OA can lead to a more granular understanding of models' internal mechanisms.
A better understanding of model internals can help reduce risk from dangerous AI systems, though more work is required to scale interpretability techniques to larger models. Interpretability can also help us understand how to build better inductive biases into models, paving the way for future developments in architecture. On the other hand, advanced interpretability could be repurposed for nefarious applications, such as eliciting dangerous knowledge from models' latent space. However, we believe that better interpretability will also provide better clarity on how to mitigate these risks.