Paper deep dive
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang
Models: BLIP
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 7:26:36 PM
Summary
This paper introduces a causal tracing tool for the BLIP vision-language model, adapting causal mediation analysis (CMA) to investigate neural mechanisms in image-conditioned text generation. By patching intermediate states from clean to corrupted runs, the authors identify that causal relevance for VQA tasks is primarily concentrated in the final layers of both the question encoder and answer decoder.
Entities (5)
Relation Signals (3)
Causal Mediation Analysis → appliedto → BLIP
confidence 98% · In this work, we seek a deeper understanding of how BLIP performs VQA by adapting CMA to the vision-language setting.
BLIP → evaluatedon → COCO-QA
confidence 95% · We utilize this task as a simple testbed for causal tracing vision-language models. The dataset we use is COCO-QA
Question Encoder → partof → BLIP
confidence 95% · The BLIP-for-VQA architecture: embedding for an image patch is fed into the question encoder
Cypher Suggestions (2)
Find all datasets used to evaluate a specific model · confidence 90% · unvalidated
MATCH (m:Model {name: 'BLIP'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Identify components of a model · confidence 90% · unvalidated
MATCH (c:Component)-[:PART_OF]->(m:Model {name: 'BLIP'}) RETURN c.name
Abstract
Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models since adapting these tools to the vision-language domain requires considerable architectural changes. In this work, we adapt a unimodal causal tracing tool to BLIP to enable the study of the neural mechanisms underlying image-conditioned text generation. We demonstrate our approach on a visual question answering dataset, highlighting the causal relevance of later layer representations for all tokens. Furthermore, we release our BLIP causal tracing tool as open source to enable further experimentation in vision-language mechanistic interpretability by the community. Our code is available at this https URL.
Tags
Links
Full Text
24,657 characters extracted from source content.
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Vedant Palit* (IIT Kharagpur), Rohan Pandey* (Reworkd.ai), Aryaman Arora (Georgetown University), Paul Pu Liang (Carnegie Mellon University)

* Corresponding authors: vedantpalit@kgpian.iitkgp.ac.in, rohan@reworkd.ai

Abstract

Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models since adapting these tools to the vision-language domain requires considerable architectural changes. In this work, we adapt a unimodal causal tracing tool to BLIP to enable the study of the neural mechanisms underlying image-conditioned text generation. We demonstrate our approach on a visual question answering dataset, highlighting the causal relevance of later layer representations for all tokens. Furthermore, we release our BLIP causal tracing tool as open source to enable further experimentation in vision-language mechanistic interpretability by the community. Our code is available at this URL.

1. Introduction

Mechanistic interpretability [30] analyzes neural networks with the goal of reverse engineering the algorithms a network implicitly learns in its parameters. This allows for finer-grained control over a model's knowledge [27, 28, 17] and behavior [23]. In particular, causal mediation analysis (CMA) [34] is a popular mechanistic interpretability method that studies the effect of introducing a mediator on the outcome of a system. However, CMA has so far been implemented only for the unimodal language domain [27], limiting our understanding to this narrow class of models [5].

In recent years, multimodal models have rapidly grown in relevance as vision-language transformers have enabled strong performance on image-text retrieval, image captioning, and visual question answering (VQA) tasks [25]. Considering the powerful effects of visual stimulus on semantic representations in humans [21], it is important to understand how similar processes occur in vision-language models. Take as an example the vision-language transformer BLIP [22], which consists of an image encoder cross-attending with a text encoder, jointly conditioning a text decoder (Fig. 2). In this work, we seek a deeper understanding of how BLIP performs VQA by adapting CMA to the vision-language setting.

Figure 1: Causal intervention to measure a state's relevance. Above, an image of a cow is encoded, cross-attends with the question encoding, and results in the correct answer "brown". Below, the same image encoding is corrupted, cross-attends with the question encoding, and results in an incorrect answer. An intermediate state is patched from the clean to the corrupted run to observe the state's effect on the answer probabilities.
arXiv:2308.14179v1 [cs.CL] 27 Aug 2023

Figure 2: The BLIP-for-VQA [22] architecture: the embedding for each image patch is fed into the question encoder alongside question tokens to generate image-conditioned question embeddings through cross-attention, which are finally input to the answer decoder for answer generation.

2. Related Work

Pearl [34] introduces causal mediation analysis, measuring the change in a response variable following an intervention while taking into consideration the effects of intermediaries, or mediators. Vig et al. [40] apply this analysis to language models of the GPT-2 family to study how grammatical gender bias is mediated by the different components inside a model. They argue that probing representations [1, 14, 9] for information does not tell us [2, 37] whether the model actually uses this information, and that causal approaches to interpretability are therefore preferable.

Meanwhile, researchers in mechanistic interpretability have developed a variety of techniques to better understand neurons and mechanisms inside neural networks (particularly unimodal language models), building on earlier work on identifying circuits in vision models [31]. This includes applying linear algebra to understand interactions between modules inside the transformer architecture [11, 32], studying the training dynamics of transformer models, often on simple tasks [10, 29, 3, 7, 19], intervening on model-internal activations to identify causal relationships between model components [12, 41, 13, 42, 8, 15], and attempting to map neuron features to human-interpretable concepts [43, 16, 4].

Meng et al. [27] also base their causal intervention methods on these previous works, corrupting token embedding inputs to a language model (GPT-2 XL, GPT-NeoX) to measure the causal relevance of states for capturing factual knowledge. The corruption in the input is produced by introducing noise into a sentence's subject tokens. Following this, the models are observed in three different runs: a clean input run, a corrupted input run, and an intervention involving patching of the layer outputs from a clean run of the same sentence input into the corresponding layer outputs of a corrupted run. Our implementation follows this work most closely.

Table 1: BLIP performance on COCO-QA task categories.
  Task                      Accuracy
  Color Identification      80.23%
  Location Identification   26.30%
  Object Counting            3.27%

Figure 3: Two example images from COCO-QA and their accompanying questions. (a) COCOQA-ID458864: What is the color of the animal? (b) COCOQA-ID220218: What is the color of the character?

On the multimodal side of interpretability, there have been thorough analyses of vision-language transformers leveraging probing approaches [6, 36], though these face the same epistemic issues as those in the unimodal setting [2]. Another line of work explores unimodal interactions present in a multimodal model and proposes methods to understand the nature and degree of these interactions [39, 24]. Joshi et al. [18] present a comprehensive survey of interpretability in multimodal machine learning until early 2021. Finally, Kervadec et al. [20] present some interpretability experiments on transformers trained for VQA, specifically concerned with their reasoning ability.

3. Method

We adapt the causal intervention method from Meng et al. [27] to investigate visual question answering (COCO-QA) in a vision-language model (BLIP).
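For intuition, the following is a minimal sketch of the three-run recipe (clean run, corrupted run, patched run) from Meng et al. [27] that this adaptation follows. It is not the released tool: the function and variable names are ours, and it assumes a generic PyTorch module whose hooked layer returns a plain [batch, tokens, hidden] tensor (BLIP's actual layers may return tuples).

import torch

def run_with_patch(model, layer_name, token_idx, clean_inputs, corrupt_inputs):
    """Corrupted run in which one intermediate state (layer `layer_name`,
    token position `token_idx`) is overwritten with the corresponding state
    from the clean run. All names here are illustrative, not BLIP-specific."""
    layer = dict(model.named_modules())[layer_name]
    cache = {}

    # 1) Clean run: record the layer's output at the chosen token position.
    def save_hook(module, inputs, output):
        cache["state"] = output[:, token_idx, :].detach().clone()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(**clean_inputs)
    handle.remove()

    # 2) + 3) Corrupted run with the cached clean state patched back in;
    # returning a tensor from a forward hook replaces the layer's output.
    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, token_idx, :] = cache["state"]
        return patched

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(**corrupt_inputs)
    handle.remove()
    return patched_logits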
3.1. Causal Tracing for BLIP

As input, BLIP takes a pre-processed image and question tokens, returning a single-word answer as output. We corrupt the image embeddings before they are fed into the question encoder, resulting in an incorrect output. Following this, we try to 'make the answer correct again' by patching individual intermediate states (token embeddings at a layer) of a clean run into the corrupted run. The states that result in the greatest answer improvement are considered causally relevant.

Corruption and Patching. The second image embedding of the batch is corrupted by adding noise to all 577 patch embeddings of the image, resulting in a pair of clean and corrupted embeddings (E, E*). For each image, we sample a single instance of noise ε ∼ N(1, ν), where ν is an adjustable hyperparameter (the standard deviation of the noise), and multiply ε with the embedding for each patch. This corrupted image encoding is then passed into the question encoder alongside the question input tokens for patching. To perform the causal intervention, the output of each individual state (layer L, token T) of the E* question encoder run is overwritten with the corresponding state from the clean image embedding run E (see Fig. 1). Finally, we measure the resulting effect on the output logits. This intervention process is also replicated for the answer decoder block.

Metrics. Given the question text embedding Q and the image embedding pair (E, E*), we measure the effect of our causal intervention by comparing the correct answer's (A) probability between the corrupted run p(A | E*, Q) and the restored run (where we patch from the clean run into the corrupted run at layer L and token position T), normalized by the difference between the clean and corrupted run probabilities:

\Gamma_{L,T} = \frac{p(A \mid \mathrm{patch}_{L,T}(E, E^*), Q) - p(A \mid E^*, Q)}{p(A \mid E, Q) - p(A \mid E^*, Q)}    (1)

We expect Γ_{L,T} to be in the range [0, 1], where 0 represents no improvement over complete corruption and 1 represents perfect recovery of the original answer probability. We may then plot Γ_{L,T} for all (L, T) pairs to observe the causal relevance of each state on producing the correct answer. The darker shades of the heatmaps in Fig. 4 represent high causal relevance Γ_{L,T}. We can also compute an average probability difference as a function of the noise factor:

\Gamma(\nu) = \frac{1}{|L| \cdot |T|} \sum_{l \in L} \sum_{t \in T} \Gamma_{l,t}(E^* = \nu E)    (2)

We plot this function in Fig. 5, illustrating how the average difference in answer probabilities varies depending on the strength of the image embedding's corruption noise.
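To make the corruption step and Eq. (1) concrete, here is a small, hedged Python sketch. It is not taken from the released tool; shapes and names are assumptions (in particular, whether the "single instance of noise" is one scalar per image or one draw per patch is not fully specified above, so the per-image scalar reading is used).

import torch

def corrupt_image_embeddings(E, nu):
    """Corrupt clean patch embeddings E (assumed shape [577, hidden_dim] for
    BLIP) by multiplying every patch by one noise sample eps ~ N(1, nu).
    The per-image scalar reading of 'a single instance of noise' is assumed."""
    eps = torch.normal(mean=1.0, std=nu, size=(1,))
    return E * eps

def gamma(p_patched, p_corrupt, p_clean):
    """Eq. (1): the fraction of the clean-vs-corrupted gap in the correct
    answer's probability that is recovered by patching one state (L, T)."""
    return (p_patched - p_corrupt) / (p_clean - p_corrupt)

# Example: a state whose restoration recovers most of the lost probability.
# gamma(p_patched=0.55, p_corrupt=0.10, p_clean=0.60) -> 0.9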
3.2. COCO-QA Dataset

VQA is an open-ended answer generation task which requires the model to predict an answer given an image and an associated question. We utilize this task as a simple testbed for causal tracing in vision-language models. The dataset we use is COCO-QA [35], consisting of 123,287 images with 78,736 training and 38,948 test questions, sourced from MSCOCO [26]. The COCO-QA dataset contains one-word answers to questions belonging to four categories: object identification, object counting, colour identification, and location identification.

We divided the training subset of COCO-QA into three splits pertaining to three of these categories: colour identification, location identification, and counting. Following this division, BLIP's zero-shot performance was assessed on each split individually; the results are shown in Table 1. The accuracy percentages demonstrate that BLIP's pretrained VQA model performs best on the color identification task. Further analysis showed that BLIP tends to output the number of objects in an image using digits rather than natural language, which causes a low accuracy score against textual answers. Similarly, it also differs in answer structuring in the location identification task. Thus, we utilize the color identification split of COCO-QA for causal tracing, since we want to understand the mechanisms behind a behavior at which the model is highly performant.

4. Results

To understand the correlation between the amount of noise injected into the image embeddings and Γ(ν), we first plotted the effects of adjusting the noise factor ν in the range [0.1, 30], averaged over 200 samples from the dataset with 10 runs for each sample (see Fig. 5). We do not measure Γ when ν is 0, since we would be patching from clean runs into clean runs, so Γ(ν) = 1. A decaying curve is observed as the ν value increases from 0.1 to 30, with very little variation in Γ(ν) at extremely large values and negative values for a few values of ν. Keeping both curves in mind, we refrain from injecting so little noise that patching becomes trivial or so much noise that restoration becomes impossible, and hence choose ν = 5.

Figure 5: Effect of adjusting the noise factor ν on the answer probability difference Γ (averaged across all L, T patches) for different components of the BLIP model. (a) Question Encoder; (b) Answer Decoder.

The heatmaps in Fig. 4a and Fig. 4b demonstrate the causal effects in the question encoder and answer decoder for the two examples from the dataset shown in Fig. 3, averaged across 10 runs. Fig. 4c demonstrates the average effects across 200 samples from the COCO-QA dataset. The encoder and decoder layers are indexed from 0–11, and input question tokens are plotted vertically.

It is clear from the figures that in the question encoder, only the final layer (11), for all tokens, plays a significant role in affecting the output to a higher degree than any preceding layers or tokens. In the case of the answer decoder, the final layers (9 to 11) play the most apparent role in the final output of the model. These results show that BLIP does not benefit from restored access to the correct image embeddings until the final few layers. This may mean that the vision modality is not relevant to model computations until the final layer, i.e. vision and language are processed independently in the intermediate layers. On the other hand, it may also mean that the final layers override preceding layers, which may still be weakly causally relevant to the model output.

Figure 4: Probability Γ_{L,T} of the correct answer after performing causal interventions at specific layers on specific tokens in the question encoder (above) and answer decoder (below). (a) COCOQA-ID458864; (b) COCOQA-ID220218; (c) Average over 200 samples. Most of the causal relevance is concentrated in the final layers of both the encoder and decoder blocks.
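The sweep just described (Eq. (2) averaged over layers and tokens, for ν from 0.1 to 30, with 200 samples and 10 runs each) can be sketched roughly as follows. This is a structural illustration only: gamma_fn is a placeholder for a full causal-tracing pass, not a function from the released tool.

import random

def gamma_of_nu(gamma_fn, nus, n_samples=200, n_runs=10):
    """Eq. (2)-style sweep: for each noise level nu, average the restoration
    score over dataset samples and repeated noise draws. `gamma_fn` stands in
    for a tracing pass that returns the mean Gamma_{l,t} over all
    (layer, token) states for one sample at one noise level."""
    curve = {}
    for nu in nus:
        scores = [gamma_fn(sample, run, nu)
                  for sample in range(n_samples)
                  for run in range(n_runs)]
        curve[nu] = sum(scores) / len(scores)
    return curve

# Toy usage with a dummy stand-in, just to show the shape of the protocol
# (the paper uses 200 samples x 10 runs per nu; tiny numbers here):
dummy = lambda sample, run, nu: random.random()
print(gamma_of_nu(dummy, nus=[0.1, 1.0, 5.0, 30.0], n_samples=2, n_runs=2))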
5. Conclusion

We introduced the first causal tracing tool for a vision-language model and studied how model performance is localized in BLIP on a subset of the visual question answering task with the COCO-QA dataset. Previous work on interpretability of vision-language models has not focused on identifying causal mechanisms, so we hope that this work invigorates research in this area. Towards this end, we fully open source our code and will soon release a visualizer as well as adaptations to other vision-language models.

Many aspects of the causal tracing methodology are still not fully understood. For example, since the role of the noise factor ν is unclear, future work could study why different components of the model have different sensitivities to noise; for example, why is performance not monotonically reduced by increasing ν? Also, restoration of the clean image embedding at multiple points (instead of just one) may help us understand cross-module coordination within the model.

A bigger project is to identify larger mechanisms within vision-language models that can explain how the model performs specific tasks, as has been done in unimodal language models [29, 41]. This will help us understand how multimodal models work and let us verify whether they perform tasks as expected, e.g. whether they learn good algorithms or poor shortcuts on compositional understanding benchmarks like Winoground [38, 33]. Overall, much work remains in this line of research and we look forward to using causal intervention methods for disentangling the mechanisms learned by vision-language models.

References

[1] Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks, 2017.
[2] Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019.
[3] Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. arXiv preprint arXiv:2306.00802, 2023.
[4] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI, 2023.
[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[6] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI, pages 565–580. Springer, 2020.
[7] Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. arXiv preprint arXiv:2302.03025, 2023.
[8] Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023.
[9] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[10] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.
[11] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
[12] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. Curran Associates, Inc., 2021.
[13] Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. arXiv preprint arXiv:2301.04709, 2023.
[14] Mario Giulianelli, Jacqueline Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information, 2021.
[15] Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
[16] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
[17] Evan Hernandez, Belinda Z Li, and Jacob Andreas. Measuring and manipulating knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023.
[18] Gargi Joshi, Rahee Walambe, and Ketan Kotecha. A review on explainability in multimodal deep neural nets. IEEE Access, 9:59800–59821, 2021.
[19] Jeevesh Juneja, Rachit Bansal, Kyunghyun Cho, João Sedoc, and Naomi Saphra. Linear connectivity reveals generalization strategies. arXiv preprint arXiv:2205.12411, 2022.
[20] Corentin Kervadec, Theo Jaunet, Grigory Antipov, Moez Baccouche, Romain Vuillemot, and Christian Wolf. How transferable are reasoning patterns in VQA? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4207–4216, 2021.
[21] George Lakoff and Mark Johnson. Metaphors We Live By. University of Chicago Press, 2008.
[22] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[23] Maximilian Li, Xander Davies, and Max Nadeau. Circuit breaking: Removing model behaviors with targeted ablation. In Deployable Generative AI, 2023.
[24] Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, Zihao Deng, Xingbo Wang, Louis-Philippe Morency, and Ruslan Salakhutdinov. MultiViz: Towards visualizing and understanding multimodal models. In The Eleventh International Conference on Learning Representations, 2022.
[25] Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430, 2022.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
[27] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 17359–17372. Curran Associates, Inc., 2022.
[28] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.
[29] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023.
[30] Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022.
[31] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
[32] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
[33] Rohan Pandey. Semantic composition in visually grounded language models. arXiv preprint arXiv:2305.16328, 2023.
[34] Judea Pearl. Direct and indirect effects. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 373–392. 2022.
[35] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. Advances in Neural Information Processing Systems, 28, 2015.
[36] Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, and Benoit Favre. Are vision-language transformers learning multimodal representations? A probing perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11248–11257, 2022.
[37] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline, 2019.
[38] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5238–5248, 2022.
[39] Yao-Hung Hubert Tsai, Martin Q Ma, Muqiao Yang, Ruslan Salakhutdinov, and Louis-Philippe Morency. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2020, page 1823. NIH Public Access, 2020.
[40] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020.
[41] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 Small.
In The Eleventh International Conference on Learning Representations, 2023.
[42] Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah D. Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. arXiv preprint arXiv:2305.08809, 2023.
[43] Roland S Zimmermann, Thomas Klein, and Wieland Brendel. Scale alone does not improve mechanistic interpretability in vision models. arXiv preprint arXiv:2307.05471, 2023.