Paper deep dive

Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed, Can Rager, Arthur Conmy

Year: 2023 | Venue: BlackboxNLP 2024 | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 31

Models: GPT-2 Small

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 8:17:00 PM

Summary

The paper introduces Edge Attribution Patching (EAP), an efficient automated circuit discovery method for neural networks. EAP uses a linear approximation of activation patching to estimate edge importance in a computational graph, requiring only two forward passes and one backward pass. The authors demonstrate that EAP outperforms existing methods like ACDC in terms of AUC for circuit recovery while significantly reducing computational costs, and suggest a hybrid approach where EAP is used to prune the graph before applying ACDC.

Entities (5)

Activation Patching · method · 100%
Automated Circuit Discovery · method · 100%
Edge Attribution Patching · method · 100%
GPT-2 · model · 95%
Mechanistic Interpretability · field · 95%

Relation Signals (3)

Edge Attribution Patching → approximates → Activation Patching

confidence 98% · We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph.

Edge Attribution Patching → outperforms → Automated Circuit Discovery

confidence 95% · We provide evidence that Edge Attribution Patching (EAP) outperforms ACDC in identifying circuits while being substantially faster to compute.

Mechanistic Interpretability → includes → Automated Circuit Discovery

confidence 90% · Automated Circuit Discovery (ACDC; [10]) attempts to automate a large portion of the mechanistic interpretability workflow

Cypher Suggestions (2)

Identify the relationship between EAP and Activation Patching · confidence 95% · unvalidated

MATCH (e:Method {name: 'Edge Attribution Patching'})-[r]->(a:Method {name: 'Activation Patching'}) RETURN type(r)

Find all methods that outperform Automated Circuit Discovery · confidence 90% · unvalidated

MATCH (m1:Method)-[:OUTPERFORMS]->(m2:Method {name: 'Automated Circuit Discovery'}) RETURN m1.name

Abstract

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods.

Full Text

Attribution Patching Outperforms Automated Circuit Discovery

Aaquib Syed, University of Maryland, College Park, asyed04@umd.edu
Can Rager, Independent, canrager@gmail.com
Arthur Conmy, Independent, arthurconmy@gmail.com

Abstract

Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that averaged over all tasks our method has greater AUC from circuit recovery than other methods. (footnote 1)

1 Introduction

Mechanistic interpretability is a subfield of AI interpretability that focuses on attributing a model's behaviors to its components, thus reverse engineering the network [1]. This field aims to identify subnetworks (circuits) within the model which are responsible for solving specific tasks [2]. Prior attempts at finding circuits in language models have led to finding networks of attention heads and multi-layer perceptrons (MLPs) that partially or fully explain model behaviors at tasks such as indirect object identification, modular arithmetic, completion of docstrings, and predicting successive dates [3, 4, 5, 6]. However, almost all previous work has been limited to relatively small models, since manually applying mechanistic interpretability methods has not yet scaled to end-to-end circuits in larger models [7].
It may be important to scale interpretability to large models, as these are the neural networks most widely deployed and used by a wide range of people. Currently, we have little understanding of how these models work, and failure modes are not always found ahead of deployment. If successful, scaled interpretability could address a wide variety of concerns about the lack of transparency of language models [8], in addition to speculative risks about the alignment of machine learning systems [9].

Automated Circuit Discovery (ACDC; [10]) attempts to automate a large portion of the mechanistic interpretability workflow: the pruning of edges between attention heads and MLPs that do not affect the task being studied. ACDC begins with a computational graph and recursively calculates the importance of each edge in the graph for a specific task. In our work, we use edges to refer to activations inside models between two components (Section 2 describes this motivation further). ACDC's pruning algorithm applies activation patching. (Note that activation patching is not attribution patching. Both are defined in full in Section 3.3.) At a high level, activation patching edits a specific activation in a model forward pass and measures a model statistic (e.g. loss) under this intervention. Activation patching is inefficient for circuit discovery because getting each statistic about model activations requires another forward pass. Our work uses attribution patching to recover circuits more efficiently (Section 3.3).

Our main contributions are:

1. Introducing a method for using attribution patching on all computational graph edges for automated circuit discovery (Edge Attribution Patching, Section 3.3).
2. Benchmarking Edge Attribution Patching vs existing circuit discovery methods (Section 4).
3. Finding and explaining some limitations with Edge Attribution Patching (Section 5).

Footnote 1: Our code is available at https://github.com/Aaquib111/acdcpp

37th Conference on Neural Information Processing Systems (NeurIPS 2023) ATTRIB Workshop. arXiv:2310.10348v2 [cs.LG] 20 Nov 2023

2 Related Work

Automated Circuit Discovery refers to finding the important subgraph of models' computational graphs for performance on particular tasks [10]. Existing algorithms include efficient heuristics [11] and gradient-descent based methods [12, 13]. ACDC is related to pruning [14] and other compression techniques [15], but differs in two ways: the compressed networks are meant to reflect the circuits that the model uses to compute outputs on certain tasks, and the goal of ACDC is not to speed up forward passes (all techniques studied in this work slow forward passes).

Activation Patching is a technique for analyzing the role of individual components in a model. It involves targeted manipulations of activations during a forward pass (further explained in Section 3.1). Previous works applied this technique under various names, such as Interchange Interventions [16], Causal Mediation Analysis [8] and Causal Tracing [17]. We adopt the terminology used by Conmy et al. [10].

Transformer Circuits. Our work builds upon the framework for understanding transformers for interpretability introduced by Elhage et al. [18]. The important details include how they formulate forward passes of transformer models. Individual attention heads and MLPs (collectively called nodes) read and write information to a central communication channel, also called the residual stream. In these terms we can examine dependencies of nodes on the outputs of earlier nodes; for example, we can measure the effect of attention heads in layer 0 on the attention heads in layer 2. In the following, we view these dependencies as edges between nodes, building on existing work using this perspective [5, 6, 3].

3 Edge Attribution Patching

We present Edge Attribution Patching (EAP) as a technique to identify relevant model components for solving a specific task. In the following, we view language models as directed, acyclic graphs.
In these terms, we aim to find small subgraphs that retain good performance on narrow tasks. We determine the importance of a specific edge through targeted manipulation of activations during a forward pass. We compare two approaches, Attribution Patching and Activation Patching, in order to motivate EAP.

3.1 Activation Patching

Activation patching refers to replacing the activations from one model forward pass with the activations from a different forward pass. This method is typically applied to measure the counterfactual importance of model components, i.e. to measure a statistic $L(x)$ from model outputs under the activation patching, where $x$ is an input prompt. For example, $L$ often represents loss or logit difference [3]. Following existing work (Section 2), we study the effect of activation patching on specific model edges by setting these equal to activations from different forward passes. Concretely, suppose that an edge $E$ in the computational graph has activation $e_{\text{corr}}$ on some corrupted prompt. In this work, we use the change in metric under activation patching

$$\left| L(x_{\text{clean}} \mid \mathrm{do}(E = e_{\text{corr}})) - L(x_{\text{clean}}) \right| \tag{1}$$

to measure the impact of edge $E$. We use do-notation from causality [19] to emphasise that activation patching is a causal intervention.

[Figure 1: Edge Attribution Patching (EAP). (a) Attribution Patching (Section 3.3) approximates the difference in metric $L$ caused by corrupting edges. (b) Removing the least important edges.]

3.2 Attribution Patching

Activation patching slows ACDC since each measurement (like Equation (1)) requires another forward pass. Attribution patching [20] is a technique for estimating Equation (1) for many different edges $E$ using only two forward passes and one backward pass.
Attribution patching (footnote 2) linearly approximates the metric difference after corrupting a single edge in the computational graph (Figure 1) by expanding $L$ as a function of the edge activation as a Taylor series with terms up to the first order:

$$L(x_{\text{clean}} \mid \mathrm{do}(E = e_{\text{corr}})) \approx L(x_{\text{clean}}) + \underbrace{(e_{\text{corr}} - e_{\text{clean}})^{\top} \frac{\partial}{\partial e_{\text{clean}}} L(x_{\text{clean}} \mid \mathrm{do}(E = e_{\text{clean}}))}_{\text{Call this } \Delta_e L \text{, the attribution score.}} \tag{2}$$

A simple rearrangement implies that Equation (1) is approximately equal to

$$|\Delta_e L| \tag{3}$$

which we call the absolute attribution score for the rest of this paper. In this work we always compute this score across a set of $(x_{\text{clean}}, x_{\text{corr}})$ pairs and take the mean. In practice, all gradients needed to calculate the attribution scores come from intermediate terms computed in one ordinary backwards pass (footnote 3) in PyTorch [21], hence attribution patching is extremely efficient.

3.3 Edge Attribution Patching

We can use the insights from Section 3.2 to build an automated circuit discovery algorithm. This takes two steps: i) use Equation (2) to obtain absolute attribution scores for the importance of all edges in the computational graph and then ii) sort these scores and keep the top $k$ edges in a circuit. We use Edge Attribution Patching (EAP) to refer to this algorithm. In the rest of the work we report results for all $k$ values when we evaluate EAP (similar to HISP in [10]).

Note that one limitation of attribution patching is that it will not work when the gradient of the metric is the zero vector. Conmy et al. [10] recommended the use of KL divergence as a metric, which is i) equal to 0 when we run the model without ablations and ii) a non-negative metric. Therefore the zero point is a global minimum and hence all gradients are the zero vector at this point. In this work we use the 'task-specific metrics' (not KL divergence) from [10] and so avoid this issue.

Footnote 2: Attribution patching (like activation patching) also applies to nodes and other model internal components that aren't edges, but we only use edges in this work.
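To make the two steps of Section 3.3 concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. Everything in it is an illustrative assumption: the four scalar "edges", the `tanh` metric, the `WEIGHTS`, `CLEAN` and `CORRUPT` values, and the use of central finite differences in place of the single backward pass that the paper obtains from PyTorch.

```python
import math

# Hypothetical toy setup: four scalar "edges" feed one nonlinear metric.
# Real edges carry activation vectors; scalars keep the arithmetic visible.
WEIGHTS = [2.0, -1.0, 0.02, 0.5]
CLEAN = [1.0, 0.5, 1.0, -0.2]     # edge activations on the clean prompt
CORRUPT = [0.0, 0.4, -1.0, -0.2]  # edge activations on the corrupted prompt


def metric(edges):
    """Stand-in for the metric L (e.g. a logit difference)."""
    return math.tanh(sum(w * e for w, e in zip(WEIGHTS, edges)))


def activation_patching_effect(i):
    """Equation (1): patch edge i to its corrupted value and rerun the metric."""
    patched = list(CLEAN)
    patched[i] = CORRUPT[i]
    return abs(metric(patched) - metric(CLEAN))


def gradient_at_clean(eps=1e-6):
    """Central finite differences stand in for the single backward pass."""
    grads = []
    for i in range(len(CLEAN)):
        up, down = list(CLEAN), list(CLEAN)
        up[i] += eps
        down[i] -= eps
        grads.append((metric(up) - metric(down)) / (2 * eps))
    return grads


def attribution_scores():
    """Equation (2): (e_corr - e_clean) times dL/de, evaluated at the clean run."""
    return [(c - k) * g for c, k, g in zip(CORRUPT, CLEAN, gradient_at_clean())]


def top_k_edges(scores, k):
    """Keep the k edges with the largest absolute attribution score."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(ranked[:k])


scores = attribution_scores()
circuit = top_k_edges(scores, k=2)
```

In this toy, the edge whose clean and corrupted activations coincide receives a score of exactly zero under both methods, and the saturating metric makes the first-order score for edge 0 smaller in magnitude than the true patching effect. The direction of the error depends on the curvature of the metric, so this should not be read as a general property of EAP.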
Footnote 3: In Appendix F we show how only one backwards pass is required.

4 Results

4.1 Edge Attribution Patching vs Activation Patching vs ACDC

We compare Edge Attribution Patching (EAP) and ACDC on the Indirect Object Identification (IOI), Docstring, and Greater-Than tasks. For each of these tasks, previous studies identified a subgraph (circuit) relevant for solving the task. We use their results as a ground truth for benchmarking both methods. We also compare using ACDC with the task-specific metrics (e.g. logit difference) and KL Divergence (which was originally recommended). For the docstring task, we also include repeated activation patching as another point of reference for performance comparisons. We applied repeated activation patching by running the same circuit discovery method described in Section 3.3 but using Equation (1) rather than absolute attribution scores. Activation patching was not included in the other tasks as it was too computationally expensive to run on the GPT-2 Small model used by the IOI and Greater-Than tasks. Subnetworks found using EAP for all three tasks are shown in Appendix A.

[Figure 2: ROC Curves comparing EAP, ACDC with task metric, and ACDC with KL Divergence. The Docstring plot also compares to Activation Patching. Axes: False Positive Rate (x), True Positive Rate (y).
(a) IOI task: EAP Only, AUC = 0.904; ACDC w/ IOI Metric, AUC = 0.588; ACDC w/ KL Divergence Metric, AUC = 0.868.
(b) Greater-Than task: EAP Only, AUC = 0.896; ACDC w/ GreaterThan Metric, AUC = 0.458; ACDC w/ KL Divergence Metric, AUC = 0.849.
(c) Docstring task: EAP Only, AUC = 0.976; Activation Patching, AUC = 0.978; ACDC w/ Docstring Metric, AUC = 0.972; ACDC w/ KL Divergence Metric, AUC = 0.982.]
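For readers unfamiliar with how an ROC curve like those in Figure 2 can be derived from per-edge scores, here is one plausible sketch. It assumes the curve is traced by sweeping $k$ over the edges ranked by absolute score and comparing each top-$k$ set against a ground-truth circuit; the function names and the example data are hypothetical, and the paper's actual evaluation code may differ in its details.

```python
def roc_points(scores, in_circuit):
    """Sweep k from 0 to all edges, keeping the top-k edges by |score|.

    Assumes at least one edge is in the ground-truth circuit and at least
    one is not, so that both rates are well defined.
    """
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    positives = sum(1 for flag in in_circuit if flag)
    negatives = len(in_circuit) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]  # k = 0: nothing kept
    for i in order:
        if in_circuit[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))  # (FPR, TPR)
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores for six edges and a hypothetical ground-truth circuit.
example_scores = [-0.9, 0.4, 0.05, -0.3, 0.01, 0.0]
example_truth = [True, True, False, False, True, False]
example_auc = auc(roc_points(example_scores, example_truth))
```

Ranking by absolute value matters here because, as Section 4.2 notes, circuit edges can have strongly negative as well as strongly positive attribution scores.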
The ROC curves in Figure 2 suggest the performance of EAP is better than ACDC overall: it has the maximal AUC in Figures 2a and 2b, while ACDC used with the KL Divergence metric outperforms EAP in Figure 2c. ACDC outperformed the existing methods HISP and Subnetwork Probing [10]. We conclude EAP outperforms all previous methods for circuit discovery, since it is competitive with ACDC on recovering circuits while significantly reducing the computational demand: EAP only takes a constant number of forward and backward passes, while the number of forward passes required by ACDC scales exponentially with the number of nodes.

4.2 Validating EAP Attribution Scores

In this section, we look at the approximate metric change (attribution score) EAP assigns to each edge in the model. We aim to understand the relation between the attribution score and the function of the edge in the task being studied. First, we look at the distribution of scores for edges in the circuit compared to edges not in the circuit for each of the three tasks. Figure 3 shows the distribution of attribution scores for the IOI task. The distributions for the remaining tasks can be found in Appendix B. Qualitatively, attribution scores for edges in the circuit tend to be spread further from zero. Furthermore, there are only 6 edges outside of the interval $[-0.25, 0.25]$ that aren't part of the IOI circuit. We further explore the attribution scores for the IOI circuit's classes of heads in Appendix E.

5 Limitations

We introduced edge attribution patching as an approximation to activation patching. However, we found that edge attribution patching outperformed ACDC, a technique based on activation patching (Section 4).
In this section, we investigate whether attribution patching's success is due to extremely accurate approximations (in Section 5.1 we find that the answer is no), and whether there is any further use for ACDC (in Section 5.2 we find that the answer is yes). We use the docstring task as a case study due to the small model size used.

[Figure 3: Distribution of Attribution Scores for the IOI Task (Logit Diff). Histogram of Edge Scores (IOI Task); x-axis: Change in Logit Difference; y-axis: Log Count. In IOI: mean=-0.0080, std=0.1063. Not In IOI: mean=0.0000, std=0.0061.]

5.1 How faithful are Attribution Patching's approximations?

To study how faithful the approximation in Equation (2) is, we plot the attribution patching scores (Equation (2)) against the activation patching scores (Equation (1)) in Figure 4a. Surprisingly, we find a fairly weak correlation between activation and attribution patching scores ($R^2 = 0.27$). Further, the line of best fit has gradient 0.531, suggesting that attribution patching estimates the effect of activation patching as roughly twice as large as it really is.

Moreover, we can gain some sense for the discrepancy between activation and attribution patching by studying the continuous transition between clean ($e_{\text{clean}}$) and corrupted ($e_{\text{corr}}$) activations in Equation (1), i.e. studying the values $\left| L(x_{\text{clean}} \mid \mathrm{do}(E = \lambda e_{\text{corr}} + (1 - \lambda) e_{\text{clean}})) - L(x_{\text{clean}}) \right|$ for $0 \le \lambda \le 1$. We can compare this to the linear approximation of Attribution Patching, $\lambda \Delta_e L$. Figure 4b shows the result for one edge in the docstring circuit where the linear approximation to activation patching is not accurate.

[Figure 4(a): Docstring Logit Diff change when Ablating Edges. x-axis: EAP; y-axis: Activation Patching. Legend: Edge in circuit; Edge not in circuit; Line of Best Fit: y = 0.531x + -0.049, R^2: 0.27. Caption: Distribution of attribution scores for edges from activation patching and attribution patching. Circled: outlier EAP point studied in Figure 4b.]
[Figure 4(b): x-axis: Interpolation towards corruption; y-axis: Change in Docstring Logit Diff. Legend: Clean edge; Corrupted edge; Input to L1H4K; EAP linear approximation; Interpolated activation patching; EAP value. Caption: Visualizing the rightmost point in Figure 4a. Note that corrupting this edge (surprisingly) slightly increases the logit difference on the Docstring task (higher logit difference is better). However, EAP overestimates how large this increase is.]

Figure 4: Visualizing Edge Attribution Patching.

We find that interpolating towards the corrupted input creates a concave curve (Figure 4b) such that the linear approximation at $\lambda = 0$ overestimates the effect of activation patching this edge. In Appendix D we show that this also holds for the other outlier edges in the ellipse in Figure 4a.

5.2 Is there any further use for ACDC?

In Section 5.1 above, we found that EAP overestimates activation patching in cases where the interpolated metric change is concave. This suggests the potential to refine the result by running ACDC on the pruned subgraph returned by EAP. We ran EAP first, then ACDC on the resulting subgraph for the Docstring task, varying pruning thresholds for EAP and ACDC independently. Figure 5 compares the TPR and FPR for the combined methods with the ROC curve of EAP only. The combined methods show increased performance compared to EAP only.

[Figure 5: Comparing statistics of the combined EAP + ACDC methods with EAP only. The inset shows a zoom to the significant area of the statistics of the combined method.]

Finally, one further limitation of this research is that the metrics used for interpretability do not precisely capture meaningful human understanding. Recovering a subgraph that humans previously recovered is limited because i) we can't evaluate this metric for interpretability tasks that we don't yet understand and ii) human-found circuits are imperfect, increasing the noise in this measurement.
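The concavity argument of Section 5.1 can be illustrated with a small numerical sketch. The scalar edge, the `log1p` metric, and the clean and corrupted values below are assumptions chosen only because they give a concave interpolation curve; they are not taken from the docstring model.

```python
import math

# Hypothetical scalar edge and metric, chosen so the interpolation curve is concave.
E_CLEAN, E_CORR = 0.0, 4.0


def metric(e):
    """Stand-in for L as a function of a single edge activation."""
    return math.log1p(e)


def interpolated_effect(lam):
    """|L(do(E = lam*e_corr + (1-lam)*e_clean)) - L(clean)| for 0 <= lam <= 1."""
    e = lam * E_CORR + (1.0 - lam) * E_CLEAN
    return abs(metric(e) - metric(E_CLEAN))


def attribution_score(eps=1e-6):
    """First-order estimate (e_corr - e_clean) * dL/de at the clean activation."""
    grad = (metric(E_CLEAN + eps) - metric(E_CLEAN - eps)) / (2 * eps)
    return (E_CORR - E_CLEAN) * grad


def linear_estimate(lam):
    """Attribution patching's prediction along the interpolation path."""
    return lam * abs(attribution_score())


lams = [i / 10 for i in range(11)]
gaps = [linear_estimate(lam) - interpolated_effect(lam) for lam in lams]
```

The linear estimate matches the true effect at $\lambda = 0$ and then exceeds it for every $\lambda > 0$, which mirrors the qualitative shape described for Figure 4b. With a convex curve the error would have the opposite sign.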
6 Conclusion

We provide evidence that Edge Attribution Patching (EAP) outperforms ACDC in identifying circuits while being substantially faster to compute. This result is surprising, as EAP is an approximation for activation patching, the method applied by ACDC. However, running ACDC on the prepruned subnetwork found by EAP can improve the identification of relevant edges. Therefore, we suggest that future circuit discovery experiments run EAP first and then apply ACDC.

7 Author Contributions

Aaquib Syed and Can Rager proposed combining ACDC with attribution patching methods and implemented initial prototypes. Arthur Conmy advised working on attributing edges rather than nodes and Aaquib made the first findings that this outperformed Automatic Circuit Discovery. All authors worked on the paper's figures, experiments and code.

8 Acknowledgements

We would like to thank Callum McDougall for organising ARENA and providing a great introduction to interpretability, and Rusheb Shah and Lucy Farnik for collaboration on the ARENA hackathon prototype which this work is based on. We would like to thank Neel Nanda for a helpful discussion and János Kramár, Stephen Casper and Euan Ong for suggestions based on an earlier version of this work.

References

[1] Chris Olah. Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. 2022. URL: https://www.transformer-circuits.pub/2022/mech-interp-essay.
[2] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. "Zoom In: An Introduction to Circuits". In: Distill (2020). DOI: 10.23915/distill.00024.001.
[3] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small". In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=NpsVSN6o4ul.
[4] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. "Progress measures for grokking via mechanistic interpretability". In: The Eleventh International Conference on Learning Representations. 2023. URL: https://openreview.net/forum?id=9XFSbDPmdW.
[5] Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer. 2023. URL: https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
[6] Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. 2023. arXiv: 2305.00586 [cs.CL].
[7] Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik. Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. 2023. arXiv: 2307.09458 [cs.LG].
[8] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber. Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias. 2020. arXiv: 2004.12265 [cs.CL].
[9] Evan Hubinger. An overview of 11 proposals for building safe advanced AI. 2020. arXiv: 2012.07532 [cs.LG].
[10] Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. "Towards Automated Circuit Discovery for Mechanistic Interpretability". In: Thirty-seventh Conference on Neural Information Processing Systems. 2023. arXiv: 2304.14997 [cs.LG].
[11] Paul Michel, Omer Levy, and Graham Neubig. "Are Sixteen Heads Really Better than One?" In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Ed. by Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett. 2019, pp. 14014–14024. URL: https://proceedings.neurips.cc/paper/2019/hash/2c601ad9d2f9bc8b282670cdd54f69f-Abstract.html.
[12] Christos Louizos, Max Welling, and Diederik P. Kingma. "Learning Sparse Neural Networks through $L_0$ Regularization". In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL: https://openreview.net/forum?id=H1Y8hhg0b.
[13] Steven Cao, Victor Sanh, and Alexander Rush. "Low-Complexity Probing via Finding Subnetworks". In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2021, pp. 960–966. DOI: 10.18653/v1/2021.naacl-main.74.
[14] Davis W. Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John V. Guttag. "What is the State of Neural Network Pruning?" In: Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020. Ed. by Inderjit S. Dhillon, Dimitris S. Papailiopoulos, and Vivienne Sze. mlsys.org, 2020. URL: https://proceedings.mlsys.org/book/296.pdf.
[15] Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A Survey on Model Compression for Large Language Models. 2023. arXiv: 2308.07633 [cs.CL].
[16] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal Abstractions of Neural Networks. 2021. URL: https://arxiv.org/abs/2106.02997.
[17] Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. "Locating and editing factual associations in GPT". In: Advances in Neural Information Processing Systems. 2022.
[18] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. "A Mathematical Framework for Transformer Circuits". In: Transformer Circuits Thread (2021). URL: https://transformer-circuits.pub/2021/framework/index.html.
[19] Judea Pearl. "Causal diagrams for empirical research". In: Biometrika 82.4 (1995), pp. 669–688.
[20] Neel Nanda. Attribution Patching: Activation Patching At Industrial Scale. 2023. URL: https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.
[21] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
A EAP Subnetworks

[Figure 6: Resulting subnetworks after EAP at the given thresholds. (a) IOI Subnetwork, Threshold=0.077. (b) Docstring Subnetwork, Threshold=0.244. (c) Greaterthan Subnetwork, Threshold=0.009. Each panel is a graph running from embed to resid_post through attention-head nodes (e.g. a10.7) and MLP nodes (e.g. m10).]

B Distribution of EAP Attribution Scores

[Figure 7: Distribution of Attribution Scores for the Docstring and Greater-Than tasks.
(a) Histogram of Edge Scores (Docstring Task); x-axis: Change in Logit Difference; y-axis: Log Count. In Docstring: mean=-0.6777, std=0.8054. Not In Docstring: mean=0.0170, std=0.2127.
(b) Histogram of Edge Scores (Greaterthan Task); x-axis: Change in Probability Difference; y-axis: Log Count. In Greaterthan: mean=-0.0061, std=0.0237. Not In Greaterthan: mean=-0.0000, std=0.0005.]

C Further investigation into combining EAP with ACDC

[Figure 8: Youden's J statistic (maximum TPR minus FPR value) for combining EAP and ACDC methods on the docstring task. We applied ACDC to the pruned subgraph returned by EAP.]

D Further failures of attribution patching approximation

In Figure 9 we show further cases in the docstring task where attribution patching can be misleading. These cases all involve an edge that comes from the model's embeddings (positional and token).
Our interpretation is that weighted averages of embeddings are anomalous inputs to the model and cause the concave change in docstring logit diff, which doesn't occur when edges are between non-embedding model components.

[Figure 9: Visualizing Edge Attribution Patching in two further cases where the concave activation patching curve means the linear fit is poor. Panels: Input to L3H0K; Input to L3H6K. x-axis: Interpolation towards corruption; y-axis: Change in Docstring Logit Diff. Legend: Clean edge; Corrupted edge; EAP linear approximation; Interpolated activation patching; EAP value.]

E Edge Roles in IOI

We further explore the attribution scores for the IOI circuit. The IOI circuit is composed of different attention head classes such as Induction heads, S-Inhibition heads, etc. [3]. Figure 10 shows the distributions of scores stratified by the roles of the edges. The edge roles are defined according to the role of their origin node. While edge roles such as Previous Token, Duplicate Token, Induction, and S-Inhibition edges have attribution scores centered around zero, we see a bias in edge scores given to name mover and negative name mover edges. As the name mover edges are directly responsible for the model outputting the indirect object, the attribution scores are largely negative, since ablating these edges removes the model's ability to output the indirect object, lowering the logit difference. Similarly, the negative name movers have attribution scores that are largely positive, since ablating these edges improves the logit difference. This matches the intuitive function of the edges.
[Figure 10: Distribution of Attribution Scores for each Edge Role in the IOI Task. x-axis: Change in Logit Difference; y-axis: Log Count.
Previous Token Edge Scores: mean=0.0015, std=0.0566.
Duplicate Token Edge Scores: mean=0.0011, std=0.0441.
Induction Edge Scores: mean=-0.0094, std=0.0739.
S-inhibition Edge Scores: mean=-0.0167, std=0.0364.
Negative Name Mover Edge Scores: mean=0.2372, std=0.2807.
Name Mover Edge Scores: mean=-0.0713, std=0.2194.]

F Only one backwards pass is required for EAP

Note: it may be easier to understand our implementation (https://github.com/Aaquib111/acdcpp/blob/main/utils/prune_utils.py#L249) rather than read this explanation. Alternatively, this derivation uses essentially the same arguments as Nanda [20] (footnote 4), though with an updated codebase.

There are only two types of edges iterated over in ACDC: i) residual edges where the result is added at its endpoint, and ii) edges between the residual stream and the query, key and value calculations (footnote 5). Clearly for all edges like ii) we can compute the gradient terms in Equation (2) in one backwards pass. Interestingly, for all $\Delta_e L$ terms where $e$ is a type i) edge (i.e. added at the endpoint), we only need to calculate the gradient with respect to the endpoint of the edge! For example, suppose we're calculating the effect of L0H0 on L1H0Q. If we represent the input to L1H0Q as a node $V$ in the computational graph then

$$\frac{\partial}{\partial e_{\text{clean}}} L(x_{\text{clean}} \mid \mathrm{do}(E = e_{\text{clean}})) = \frac{\partial}{\partial v_{\text{clean}}} L(x_{\text{clean}} \mid \mathrm{do}(V = v_{\text{clean}})) \tag{3}$$

because $V$ is just the sum of all the edges entering $V$. This allows efficient calculation of all the $\Delta_e L$ values, since gradients with respect to nodes in computational graphs are calculated by default in backwards passes.
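The identity in Appendix F can be checked numerically on a toy graph. The sketch below is illustrative only: the three two-dimensional edges, the downstream function `metric_from_node`, and the use of finite differences in place of a real backward pass are all assumptions made for this example.

```python
# Hypothetical example: three edges are summed into the input V of one node.
CLEAN_EDGES = [[1.0, 0.5], [0.2, -0.3], [-0.4, 1.0]]
CORRUPT_EDGES = [[0.0, 0.5], [0.2, 0.7], [0.6, 1.0]]


def node_input(edges):
    """V is the elementwise sum of all edges entering the node."""
    return [sum(e[d] for e in edges) for d in range(len(edges[0]))]


def metric_from_node(v):
    """Stand-in for everything downstream of V, ending in the metric L."""
    return v[0] ** 2 + 3.0 * v[1]


def grad_wrt_node(v, eps=1e-6):
    """dL/dV by central finite differences (a backward pass would give this)."""
    grads = []
    for d in range(len(v)):
        up, down = list(v), list(v)
        up[d] += eps
        down[d] -= eps
        grads.append((metric_from_node(up) - metric_from_node(down)) / (2 * eps))
    return grads


def grad_wrt_edge(edges, idx, eps=1e-6):
    """dL/de for one incoming edge, perturbing only that edge."""
    grads = []
    for d in range(len(edges[idx])):
        up = [list(e) for e in edges]
        down = [list(e) for e in edges]
        up[idx][d] += eps
        down[idx][d] -= eps
        diff = metric_from_node(node_input(up)) - metric_from_node(node_input(down))
        grads.append(diff / (2 * eps))
    return grads


# One gradient at the endpoint V is reused for every incoming edge.
node_grad = grad_wrt_node(node_input(CLEAN_EDGES))
edge_scores = [
    sum((c - k) * g for c, k, g in zip(corr, clean, node_grad))
    for corr, clean in zip(CORRUPT_EDGES, CLEAN_EDGES)
]
```

Because every incoming edge enters $V$ additively, the per-edge gradients coincide with the gradient at $V$, so a single gradient vector serves all three attribution scores.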
Footnote 4: Specifically, this section: https://www.neelnanda.io/mechanistic-interpretability/attribution-patching#how-to-think-about-activation-patching=:~:text=axes%20of%20variation.-,Path%20patching,-The%20core%20intuition

Footnote 5: It may be worth looking at some ACDC outputs from [10]. See https://colab.research.google.com/github/ArthurConmy/Automatic-Circuit-Discovery/blob/main/notebooks/colabs/ACDC_Implementation_Demo.ipynb for an explanation of this design choice.