
Paper deep dive

LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models

Mor Geva, Avi Caciularu, Guy Dar, Paul Roit, Shoval Sadde, Micah Shlain, Bar Tamir, Yoav Goldberg

Year: 2022 · Venue: ACL 2022 System Demonstrations · Area: Mechanistic Interp. · Type: Tool · Embeddings: 36

Models: GPT-2, GPT-2 Large, GPT-2 Medium, GPT-2 XL

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 7:44:19 PM

Summary

LM-Debugger is an interactive tool designed for the inspection and intervention of transformer-based language models (LMs). It interprets the internal prediction construction process by analyzing feed-forward network (FFN) updates to token representations in the vocabulary space. The tool provides a prediction view for tracing model decisions and an exploration view for identifying and clustering FFN value vectors, enabling users to perform fine-grained debugging and control model behavior through targeted interventions.

Entities (4)

GPT2 · language-model · 100%
LM-Debugger · software-tool · 100%
Feed-Forward Network · model-component · 95%
Transformer · model-architecture · 95%

Relation Signals (3)

Feed-Forward Network partof Transformer

confidence 98% · FFN layers, one of the building blocks of transformers

LM-Debugger debugs GPT2

confidence 95% · We demonstrate the utility of LM-Debugger for single-prediction debugging, by inspecting the internal disambiguation process done by GPT2.

LM-Debugger interprets Feed-Forward Network

confidence 95% · LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers

Cypher Suggestions (2)

Find all models supported by the tool · confidence 90% · unvalidated

MATCH (t:Tool {name: 'LM-Debugger'})-[:DEBUGS]->(m:Model) RETURN m.name

Identify components of a specific model architecture · confidence 85% · unvalidated

MATCH (c:Component)-[:PART_OF]->(a:Architecture {name: 'Transformer'}) RETURN c.name

Abstract

The opaque nature and unexplained behavior of transformer-based language models (LMs) have spurred wide interest in interpreting their predictions. However, current interpretation methods mostly focus on probing models from the outside, executing behavioral tests, and analyzing salient input features, while the internal prediction construction process is largely not understood. In this work, we introduce LM-Debugger, an interactive debugger tool for transformer-based LMs, which provides a fine-grained interpretation of the model's internal prediction process, as well as a powerful framework for intervening in LM behavior. For its backbone, LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers in the vocabulary space. We demonstrate the utility of LM-Debugger for single-prediction debugging, by inspecting the internal disambiguation process done by GPT2. Moreover, we show how easily LM-Debugger allows the user to shift model behavior in a direction of their choice, by identifying a few vectors in the network and inducing effective interventions to the prediction process. We release LM-Debugger as an open-source tool and a demo over GPT2 models.

Tags

ai-safety (imported, 100%) · mechanistic-interp (suggested, 92%) · tool (suggested, 88%)

Links


Full Text

35,780 characters extracted from source content.


LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models

Mor Geva¹, Avi Caciularu²*, Guy Dar³, Paul Roit², Shoval Sadde¹, Micah Shlain¹, Bar Tamir⁴, Yoav Goldberg¹,²
¹Allen Institute for AI  ²Bar-Ilan University  ³Tel Aviv University  ⁴The Hebrew University of Jerusalem
morp@allenai.org

Abstract

The opaque nature and unexplained behavior of transformer-based language models (LMs) have spurred wide interest in interpreting their predictions. However, current interpretation methods mostly focus on probing models from the outside, executing behavioral tests, and analyzing salient input features, while the internal prediction construction process is largely not understood. In this work, we introduce LM-Debugger, an interactive debugger tool for transformer-based LMs, which provides a fine-grained interpretation of the model's internal prediction process, as well as a powerful framework for intervening in LM behavior. For its backbone, LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers in the vocabulary space. We demonstrate the utility of LM-Debugger for single-prediction debugging, by inspecting the internal disambiguation process done by GPT2. Moreover, we show how easily LM-Debugger allows the user to shift model behavior in a direction of their choice, by identifying a few vectors in the network and inducing effective interventions to the prediction process. We release LM-Debugger as an open-source tool and a demo over GPT2 models.

1 Introduction

Transformer-based language models (LMs) are the backbone of modern NLP models (Bommasani et al., 2021), but their internal prediction construction process is opaque. This is problematic for end-users who do not understand why the model makes specific predictions, as well as for developers who wish to debug or fix model behaviour.
Recent work (Elhage et al., 2021; Geva et al., 2022) suggested that the construction process of LM predictions can be viewed as a sequence of updates to the token representation. Specifically, Geva et al. (2022) showed that updates by the feed-forward network (FFN) layers, one of the building blocks of transformers (Vaswani et al., 2017), can be decomposed into weighted collections of sub-updates, each induced by a FFN parameter vector, that can be interpreted in the vocabulary space.

* Work done during an internship at AI2.

Figure 1: Illustration of the main capabilities of LM-Debugger. Our tool interprets dominant changes in the output distribution induced by the feed-forward layers across the network (self-attention layers are not shown), and enables configuring interventions for shifting the prediction in directions of the user's choice.

arXiv:2204.12130v2 [cs.CL] 12 Oct 2022

In this work, we take a step toward LM transparency by employing this interpretation approach to create LM-Debugger, a powerful tool for inspection and intervention in transformer LM predictions. LM-Debugger provides three main capabilities for single-prediction debugging and model analysis (illustrated in Figure 1). First, for a given input (e.g. "My wife is working as a"), it interprets the model's prediction at each layer in the network, and the major changes applied to it by FFN layers. This is done by projecting the token representation before and after the FFN update, as well as the major FFN sub-updates at any layer, to the output vocabulary. Second, it allows intervening in the prediction by changing the weights of specific sub-updates, e.g.
increasing (decreasing) a sub-update that promotes music-related (teaching-related) concepts, which results in a modified output. Last, for a given LM, LM-Debugger interprets all the FFN parameter vectors across the network and creates a search index over the tokens they promote. This allows an input-independent analysis of the concepts encoded by the model's FFN layers, and enables configuring general and effective interventions.

We demonstrate the utility of LM-Debugger for two general use-cases. In the context of prediction debugging, we use the fine-grained tracing of LM-Debugger to inspect the internal disambiguation process performed by the model. Furthermore, we demonstrate how our tool can be used to configure a few powerful interventions that effectively control different aspects of text generation. We release LM-Debugger as an open-source tool at https://github.com/mega002/lm-debugger and host a demo of GPT2 (Brown et al., 2020) at https://lm-debugger.apps.allenai.org.¹ This is to increase the transparency of transformer LMs and facilitate research in analyzing and controlling NLP models.

2 Underlying Interpretation Method

LM-Debugger establishes a framework for interpreting a token's representation and the updates applied to it at each layer in the network. This framework builds upon recent findings by Geva et al. (2022), who viewed the token representation as a changing distribution over the output vocabulary, and the output from each FFN layer as a collection of weighted sub-updates to that distribution, which are often interpretable to humans. We next elaborate on the findings we rely on in this work.

Consider a transformer LM with L layers and an embedding matrix E ∈ R^{|V|×d} of hidden dimension d, over a vocabulary V.
Let w = w_1, ..., w_t, s.t. ∀i = 1, ..., t: w_i ∈ V, be an input sequence of tokens. Then at each layer ℓ = 1, ..., L, the hidden representation x_i^ℓ of the i-th token is processed and updated by a FFN layer through a residual connection (He et al., 2016):²

    x̃_i^ℓ = x_i^ℓ + FFN^ℓ(x_i^ℓ),

¹ See a video at https://youtu.be/5D_GiJv7O-M
² Layer normalization is omitted (Geva et al., 2022).

where x_i^ℓ is the output from the preceding multi-head self-attention layer, and x̃_i^ℓ is the updated token representation (Vaswani et al., 2017). Geva et al. (2022) proposed an interpretation method for these updates in terms of the vocabulary, which we employ as the backbone of LM-Debugger and describe in detail next.

Token Representation as a Distribution Over the Output Vocabulary. The token representation before (x_i^ℓ) and after (x̃_i^ℓ) the FFN update at any layer ℓ is interpreted by projecting it to the vocabulary space and converting it to a distribution:

    p_i^ℓ = softmax(E x_i^ℓ) ;  p̃_i^ℓ = softmax(E x̃_i^ℓ)

The final model output is defined by y = p̃_i^L.

The FFN Output as a Weighted Collection of Sub-Updates. Each FFN layer is defined by two parameter matrices K^ℓ, V^ℓ ∈ R^{d_m × d}, where d_m is the intermediate hidden dimension, and a non-linearity function f (bias terms are omitted):

    FFN^ℓ(x^ℓ) = f(K^ℓ x^ℓ) V^ℓ    (1)

Geva et al. (2022) interpreted the FFN output by (a) decomposing it into sub-updates, each induced by a single FFN parameter vector, and (b) projecting each sub-update to the vocabulary space. Formally, Eq. 1 can be decomposed as:

    FFN^ℓ(x^ℓ) = Σ_{i=1}^{d_m} f(x^ℓ · k_i^ℓ) v_i^ℓ = Σ_{i=1}^{d_m} m_i^ℓ v_i^ℓ,

where k_i^ℓ is the i-th row of K^ℓ, v_i^ℓ is the i-th column of V^ℓ, and m_i^ℓ := f(x^ℓ · k_i^ℓ) is the activation coefficient of v_i^ℓ for the given input. Each term in this sum is interpreted as a sub-update to the output distribution, by inspecting the top-scoring tokens in its projection to the vocabulary, i.e. E v_i^ℓ.
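The decomposition and projection above can be sketched in a few lines of numpy. This is a toy stand-in with random weights, not the tool's code; in particular, the value vectors are stored as rows of V here so that the matrix products type-check (the paper's notation indexes columns of V^ℓ), and ReLU stands in for the model's actual non-linearity.

```python
# Toy sketch of the FFN sub-update decomposition and vocabulary projection.
# Random weights; shapes loosely follow the paper: K, V in R^{d_m x d},
# embedding E in R^{|V| x d}. Value vectors are rows of V here (assumption).
import numpy as np

rng = np.random.default_rng(0)
d, d_m, vocab = 8, 16, 50          # hidden dim, FFN inner dim, |V|
K = rng.normal(size=(d_m, d))      # FFN "key" matrix K^l
V = rng.normal(size=(d_m, d))      # FFN "value" matrix V^l
E = rng.normal(size=(vocab, d))    # output embedding matrix
x = rng.normal(size=d)             # token representation x_i^l

f = lambda z: np.maximum(z, 0.0)   # ReLU as a stand-in non-linearity

# FFN output (Eq. 1) and its decomposition into sub-updates m_i * v_i.
m = f(K @ x)                       # activation coefficients m_i^l
ffn_out = m @ V                    # equals sum_i m_i^l * v_i^l
assert np.allclose(ffn_out, sum(m[i] * V[i] for i in range(d_m)))

def to_dist(h):
    """Project a hidden state to the vocabulary: softmax(E h)."""
    logits = E @ h
    z = np.exp(logits - logits.max())
    return z / z.sum()

x_tilde = x + ffn_out              # residual update: x~ = x + FFN(x)
p_before, p_after = to_dist(x), to_dist(x_tilde)

# A single sub-update is read in vocabulary space via its projection E v_i.
top_tokens = np.argsort(E @ V[0])[::-1][:5]
```

The assertion checks the key identity: the full FFN output equals the weighted sum of its value vectors, which is what licenses interpreting each term separately.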
In the rest of the paper, we follow Geva et al. (2022) and refer to columns of V^ℓ as "value vectors" and to their weighted, input-dependent form as "sub-updates". Importantly, value vectors are static parameter vectors that are independent of the input sequence, while sub-updates are dynamic, as they are weighted by input-dependent coefficients. For a model with L layers and a hidden dimension d_m, there are L ∗ d_m static value vectors, which induce L ∗ d_m corresponding sub-updates when running an input through the model.

Figure 2: The prediction view of LM-Debugger, showing the prediction trace for a given input (main panel), allowing the user to configure interventions (lower panel) and interpret sub-updates to the output distribution (right panel).

3 LM-Debugger

LM-Debugger leverages both static and dynamic analysis of transformer FFN layers and the updates they induce to the output distribution for debugging and intervention in LM predictions. These capabilities are provided in two main views, which we describe next.

3.1 Prediction View

This view, shown in Figure 2, is designed for per-example debugging. It allows running inputs through the model to generate text in an auto-regressive manner, while tracing the dominant sub-updates in every layer and applying interventions.

Prediction Trace (Figure 2, main panel). The user enters an input for the model, for which a detailed trace of the prediction across the network is provided. For each layer, it shows the top tokens in the output distribution, before and after the FFN update, and the 10 most dominant FFN sub-updates. For every sub-update m_i^ℓ v_i^ℓ, we show an identifier L[ℓ]D[i] of its corresponding value vector and the coefficient for the given input (e.g. L17D4005 and 9.79).³ The top distribution tokens and sub-updates are sorted by the token probability/sub-update coefficient from left (highest) to right (lowest). A small arrow next to each sub-update allows setting an intervention on its corresponding value vector.
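The ranking and labeling of dominant sub-updates in the trace panel can be sketched as follows. The L[ℓ]D[i] identifier scheme is the paper's; the helper name and coefficient values are illustrative, not the tool's actual API.

```python
# Toy sketch: rank a layer's sub-updates by coefficient and label them
# with the paper's L[layer]D[dim] identifiers (zero-indexed).

def dominant_sub_updates(coeffs, layer, k=10):
    """Return the k largest coefficients as (identifier, value) pairs,
    sorted from highest to lowest, mirroring the trace panel."""
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i], reverse=True)
    return [(f"L{layer}D{i}", coeffs[i]) for i in order[:k]]

coeffs = [0.1, 9.79, 0.0, 3.2, 7.5]     # made-up activation coefficients
top = dominant_sub_updates(coeffs, layer=17, k=3)
# → [('L17D1', 9.79), ('L17D4', 7.5), ('L17D3', 3.2)]
```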
³ The layer and dimension in the identifier are zero-indexed.

Interventions (Figure 2, lower panel). Beyond tracing the output distribution, LM-Debugger also allows intervening in the prediction process by setting the coefficients of any value vectors in the network, thus inducing sub-updates of the user's choice. To set an intervention for a specific value vector, the user enters its identifier into the panel and chooses whether to "turn it on or off", that is, setting its coefficient to the value of the coefficient of the most dominant sub-update in that layer, or to zero, respectively. When running an input example, all interventions in the panel are effective for the entire generation process.

Value Vector Information (Figure 2, right panel). A natural question that arises is how to choose meaningful interventions. LM-Debugger provides two complementary approaches for this. A bottom-up approach is to observe the dominant sub-updates for specific examples and apply interventions on them. A sub-update can be interpreted by inspecting the top tokens in the projection of its corresponding value vector to the vocabulary (Geva et al., 2022). For convenience, we let the user assign names to value vectors. Another way to find meaningful interventions is a top-down approach of searching for value vectors that express concepts of the user's interest. We provide this capability in the exploration view of LM-Debugger, which is described next.

3.2 Exploration View

This view allows static exploration of value vectors, primarily for analyzing which concepts are encoded in the FFN layers and how concepts are spread over different layers, and for identifying groups of related value vectors.

Keyword Search (Figure 3). Value vectors are interpreted by the top tokens they promote. By considering these sets of tokens as textual documents, LM-Debugger allows searching for concepts encoded in value vectors across the layers.
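The "turn on / turn off" intervention semantics described above can be sketched directly: turning a value vector on sets its coefficient to the largest coefficient in its layer; turning it off zeroes it. The function name and dictionary format are illustrative assumptions, not the tool's API.

```python
# Toy sketch of the on/off intervention on a layer's activation coefficients.

def apply_interventions(coeffs, interventions):
    """coeffs: activation coefficients m_i for one layer.
    interventions: dict {index: 'on' | 'off'}.
    'on' copies the layer's peak coefficient; 'off' zeroes the entry."""
    peak = max(coeffs)                 # coefficient of the most dominant sub-update
    out = list(coeffs)
    for i, mode in interventions.items():
        out[i] = peak if mode == "on" else 0.0
    return out

coeffs = [0.2, 5.0, 1.3]
assert apply_interventions(coeffs, {2: "on"}) == [0.2, 5.0, 5.0]
assert apply_interventions(coeffs, {1: "off"}) == [0.2, 0.0, 1.3]
```

During generation, such a modification would be applied at every decoding step, matching the tool's behavior of keeping interventions effective for the entire generation process.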
This is enabled by a search index that LM-Debugger holds in the background, which stores the projections of all value vectors to the vocabulary and allows executing simple queries against them using the BM25 algorithm (Robertson et al., 1995).

Cluster Visualization (Figure 4). Suppose the user is interested in locating a specific concept in the network and has found a relevant value vector, either from debugging an example in the prediction view or via the keyword search. A natural next step is to find similar value vectors that promote related tokens. To this end, LM-Debugger provides a clustering of all value vectors in the network, which allows mapping any value vector to a cluster of similar vectors in the hidden space (Geva et al., 2022). The interface displays a random sample of vectors from the cluster, as well as an aggregation of their top tokens as a word cloud, showing the concepts promoted by the cluster.

4 Debugging LM Predictions by Tracing FFN Updates

In this section, we demonstrate the utility of LM-Debugger for interpreting model behaviour on a given example. As an instructive example, we consider the case of sense disambiguation. When generating text, LMs often need to perform sense disambiguation and decide on one plausible continuation. For example, the word "for" in the input "The book is for" has two plausible senses, purpose (e.g. "reading") and person (e.g. "him") (Karidi et al., 2021). We will now inspect the prediction by GPT2 (Brown et al., 2020) and track the internal sense disambiguation process for this example. To this end, we enter the input in the prediction view and click Trace, which provides a full trace of the prediction across layers.
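Such a trace can be sketched as a loop over layers that records the top tokens before and after each FFN update. This is a toy stand-in with random weights (self-attention omitted), meant only to show the shape of the trace the tool produces.

```python
# Hypothetical sketch of a layer-by-layer prediction trace: for each layer,
# record the top tokens of the vocabulary distribution before and after
# the residual FFN update. Random toy weights, not a real model.
import numpy as np

rng = np.random.default_rng(1)
L, d, d_m, vocab = 4, 8, 16, 20
E = rng.normal(size=(vocab, d))                    # output embedding
layers = [(rng.normal(size=(d_m, d)), rng.normal(size=(d_m, d)))
          for _ in range(L)]                       # (K^l, V^l) per layer

def top_tokens(h, k=4):
    """Token ids with the highest logits E @ h, highest first."""
    return np.argsort(E @ h)[::-1][:k].tolist()

trace, x = [], rng.normal(size=d)
for l, (K, V) in enumerate(layers):
    before = top_tokens(x)
    x = x + np.maximum(K @ x, 0.0) @ V             # x~ = x + FFN(x)
    trace.append({"layer": l, "before": before, "after": top_tokens(x)})
```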
Table 1 displays a part of this trace from selected layers, showing a gradual transition from the purpose to the person sense.

Layer | Sense | Before FFN update | After FFN update
4 | purpose | example, the, instance, purposes | example, the, instance, all
10 | purpose | the, sale, example, a | the, sale, a, example
15 | purpose/person | sale, the, anyone, use | sale, anyone, the, ages
20 | person | beginners, anyone, adults, sale | anyone, beginners, adults, readers

Table 1: Partial prediction trace of GPT2 for the input "This book is for", showing the internal disambiguation process from the purpose to the person sense across layers.

Until layer 11 (out of 24), the top tokens in the output distribution are mostly related to sale/example purposes. Starting from layer 12, the prediction slowly shifts to revolve around the audience of the book, e.g. anyone and ages, until layer 18, where sale is eliminated from the top position. In the last layers, tokens become more specific, e.g. beginners and adults.

To examine the major updates through which the prediction has formed, we can click on specific sub-updates in the trace to inspect the top-scoring tokens in their projections. We observe that in early layers, tokens are often related to the purpose sense (e.g. instance in L2D1855 and buyers in L12D659), in intermediate layers tokens are a mix of both senses (readers in L16D3026, preschool in L17D2454, and sale/free in L16D1662), and mostly the person sense in the last layers (users in L18D685, people in L20D3643, and those in L21D2007).

5 Configuring Effective Interventions for Controlled Text Generation

Beyond interpretability, LM-Debugger enables the user to intervene in LM predictions. We show this by finding value vectors that promote specific concepts and applying simple and effective interventions.

Controlling Occupation Prediction. Consider the input "My wife is working as a". When running it through GPT2, the final prediction from the last layer has the top tokens nurse, teacher, waitress.
We would like to intervene in the prediction in order to change its focus to occupations related to software engineering, which in general are less associated with women (De-Arteaga et al., 2019). To this end, we will use the exploration view of LM-Debugger to search for value vectors promoting software-related concepts.

Figure 3: Keyword search in the exploration view of LM-Debugger, which matches user queries against the tokens promoted by value vectors of the model.

Figure 4: Cluster visualization in the exploration view of LM-Debugger, which maps a given value vector to its cluster of similar value vectors in the network.

Searching the keywords "software", "developer", and "engineer" brings up two value vectors with coherent concepts: L10D3141 and L17D115 (Figure 3). Now, we add these value vectors to the intervention panel in the prediction view and run the example again. Our intervention, which involved only two (0.002%) vectors in the network, dramatically changed the prediction to software, programmer, consultant, developer, effectively shifting it in the direction we wanted. This demonstrates the power of LM-Debugger to change model behaviour and fix undesirable predictions.

Controlling the Sentiment of Generated Text. The previous example focused on next-token prediction. We now take this one step further and configure powerful and general interventions that influence various texts generated by the model. For our experimental setting, we attempt to control the sentiment in reviews generated by GPT2, for inputs taken from the Yelp dataset (Asghar, 2016). We choose our interventions independently of the inputs, in two easy steps. First, we use the keyword search (Figure 3) to identify "seed" value vectors that promote positive and negative adjectives/adverbs, using the queries "terrible, mediocre, boring" and "spacious, superb, delicious".
Then, we take one value vector for each polarity and, using the cluster visualization (Figure 4), expand it to a diverse set of vectors from its corresponding cluster that promote similar concepts. Overall, we select 5-6 value vectors for each polarity (details in Appendix A.1), to which we apply interventions.

Table 2 presents the texts generated by GPT2 (each limited to 10 tokens) for multiple inputs, with and without applying interventions.

Input | Intervention | Continuation
"Service in this place is" | - | a bit of a mess. I'm not sure
 | ↑Positive | a good place to make the right efforts to make
 | ↑Negative | a waste of a bunch of crap that is too
"I have been to this restaurant twice and" | - | both times I was disappointed. The first time I
 | ↑Positive | have been served excellent food and good service. The
 | ↑Negative | have been disappointed. The food is over processed and
"We went on a weeknight. Place was" | - | packed. We had to wait for the bus
 | ↑Positive | good, good food, good staff, good people
 | ↑Negative | too far for us to get lost. We were
"Went for breakfast on 6/16/14. We" | - | had a great time. We had a great time
 | ↑Positive | have a good team of people who are able to
 | ↑Negative | were too heavy for the wrong type of food that

Table 2: Continuations (limited to 10 tokens) generated by GPT2 for different inputs from the Yelp dataset, with and without interventions for "turning on" sub-updates for positive and negative sentiment.

Clearly, across all the examples, our intervention in the prediction successfully leads to the desired effect, turning the sentiment of the generated text positive or negative, according to the configured sub-updates.

6 Implementation Details

The prediction view is implemented as a React web application with a backend Flask server that runs an API for executing models from the Transformers library by HuggingFace (Wolf et al., 2020).
The exploration view is a Streamlit web application, which (a) sends user search queries to an Elasticsearch index holding the top tokens of all value vector projections, and (b) visualizes clusters of value vectors created with the scikit-learn package (Pedregosa et al., 2011). Our current implementation supports any GPT2 model from HuggingFace, and other auto-regressive models can be plugged in with only a few local modifications (e.g. translating the relevant layer names). More details and instructions for how to deploy and run LM-Debugger are provided at https://github.com/mega002/lm-debugger.

7 Related Work

Interpreting single predictions and the general behavior of LMs is a growing research area that has attracted immense attention in recent years (Belinkov et al., 2020; Choudhary et al., 2022). LM-Debugger is the first tool to interpret and intervene in the prediction construction process of transformer-based LMs based on FFN updates.

Existing interpretation and analysis frameworks mostly rely on methods for behavioral analysis (Ribeiro et al., 2020) that probe models with adversarial (Wallace et al., 2019b) or counterfactual examples (Tenney et al., 2020), input saliency methods that assign importance scores to input features (Wallace et al., 2019b; Tenney et al., 2020), and analysis of the attention layers (Hoover et al., 2020; Vig and Belinkov, 2019). More related to LM-Debugger, other tools analyze patterns in neuron activations (Rethmeier et al., 2020; Dalvi et al., 2019; Alammar, 2021). Unlike these methods, we focus on interpreting the model parameters and on intervening in their contribution to the model's prediction.

The functionality of LM-Debugger is most closely related to tools that trace hidden representations across layers. Similarly to LM-Debugger, Alammar (2021) and Nostalgebraist (2020) interpret the token representation in terms of the output vocabulary.
We take this one step further and interpret the FFN updates to the representation, allowing the user to observe not only the evolution of the representation but also the factors that induce changes in it. Our intervention in FFN sub-updates relates to recent methods for locating and editing knowledge in the FFN layers of LMs (Meng et al., 2022; Dai et al., 2022). Different from these methods, LM-Debugger aims to provide a comprehensive and fine-grained interpretation of the prediction construction process across the layers.

8 Conclusion

We introduce LM-Debugger, a debugger tool for transformer-based LMs, and the first tool to analyze the FFN updates to the token representations across layers. LM-Debugger provides a fine-grained interpretation of single predictions, as well as a powerful framework for intervention in LM predictions.

Ethical Statement

Our work aims to increase the transparency of transformer-based LMs. It is well known that such models often produce offensive, harmful language (Bender et al., 2021; McGuffie and Newhouse, 2020; Gehman et al., 2020; Wallace et al., 2019a), which might originate in toxic concepts encoded in their parameters (Geva et al., 2022). LM-Debugger, which traces and interprets LM predictions, could expose such toxic concepts and therefore should be used with caution.

LM-Debugger also provides a framework for modifying LM behavior in particular directions. While our intention is to provide developers with tools for fixing model errors, mitigating biases, and building trustworthy models, this capability also has the potential for abuse. In this context, it should be made clear that LM-Debugger does not modify the information encoded in LMs, but only changes the intensity with which this information is exposed in the model's predictions. At the same time, LM-Debugger lets the user observe the intensity of updates to the prediction, which could be used to identify suspicious interventions.
Nonetheless, because of these concerns, we stress that LMs should not be integrated into critical systems without caution and monitoring.

Acknowledgements

We thank the REVIZ team at the Allen Institute for AI, particularly Sam Skjonsberg and Sam Stuesser. This project has received funding from the Computer Science Scholarship granted by the Séphora Berrebi Foundation, the PBC fellowship for outstanding PhD candidates in Data Science, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).

References

J Alammar. 2021. Ecco: An open source library for the explainability of transformer language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 249-257, Online. Association for Computational Linguistics.

Nabiha Asghar. 2016. Yelp dataset challenge: Review rating prediction. arXiv preprint arXiv:1605.05362.

Yonatan Belinkov, Sebastian Gehrmann, and Ellie Pavlick. 2020. Interpretability and analysis in neural NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 1-5, Online. Association for Computational Linguistics.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT).

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S.
Chen, Kathleen Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir P. Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher R'e, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the opportunities and risks of foundation models. ArXiv, abs/2108.07258.
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of Neural Information Processing Systems (NeurIPS).

Shivani Choudhary, Niladri Chatterjee, and Subir Kumar Saha. 2022. Interpretation of black box NLP models: A survey. arXiv preprint arXiv:2203.17081.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493-8502, Dublin, Ireland. Association for Computational Linguistics.

Fahim Dalvi, Avery Nortonsmith, Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, and James Glass. 2019. NeuroX: A toolkit for analyzing individual neurons in neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):9851-9852.

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. 2019. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 120-128, New York, NY, USA. Association for Computing Machinery.
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356-3369, Online. Association for Computational Linguistics.

Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. 2020. exBERT: A visual analysis tool to explore learned representations in transformer models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 187-196, Online. Association for Computational Linguistics.

Taelin Karidi, Yichu Zhou, Nathan Schneider, Omri Abend, and Vivek Srikumar. 2021. Putting words in BERT's mouth: Navigating contextualized vector spaces with pseudowords. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10300-10313, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Kris McGuffie and Alex Newhouse. 2020.
The radical- ization risks of gpt-3 and advanced neural language models.arXiv preprint arXiv:2009.06807. Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual knowl- edge in gpt.arXiv preprint arXiv:2202.05262. Nostalgebraist. 2020. interpreting GPT: the logit lens. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gram- fort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python.the Journal of machine Learning research, 12:2825–2830. Nils Rethmeier, Vageesh Kumar Saxena, and Isabelle Augenstein. 2020. Tx-ray: Quantifying and explain- ing model-knowledge transfer in (un-)supervised nlp.InProceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124 ofProceedings of Machine Learning Research, pages 440–449. PMLR. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Be- havioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 4902– 4912, Online. Association for Computational Lin- guistics. Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, and Mike Gatford. 1995. et almbox. 1995. okapi at trec-3.Nist Special Publication Sp, 109:109. Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, and Ann Yuan. 2020. The language in- terpretability tool: Extensible, interactive visualiza- tions and analysis for NLP models. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstra- tions, pages 107–118, Online. Association for Com- putational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. 
Attention is all you need. InAdvances in Neural Information Pro- cessing Systems (NIPS), pages 5998–6008. Jesse Vig and Yonatan Belinkov. 2019.Analyzing the structure of attention in a transformer language model. InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy. As- sociation for Computational Linguistics. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gard- ner, and Sameer Singh. 2019a. Universal adversar- ial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 2153– 2162, Hong Kong, China. Association for Computa- tional Linguistics. Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subra- manian, Matt Gardner, and Sameer Singh. 2019b. AllenNLP interpret: A framework for explaining predictions of NLP models. InProceedings of the 2019 Conference on Empirical Methods in Natu- ral Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 7–12, Hong Kong, China. Association for Compu- tational Linguistics. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Remi Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Trans- formers: State-of-the-art natural language process- ing. InProceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Asso- ciation for Computational Linguistics. 
A Appendix

A.1 Details on Interventions to Control Generated Text Sentiment

Table 3 lists all the value vectors selected for our interventions described in §5, along with examples of top-scoring tokens in their projections. These vectors were found with the exploration view of LM-Debugger (§3.2), using both keyword search and clustering visualization. All the interventions were configured to "turn on" these vectors, namely, setting their coefficients to be maximal for the corresponding layer. This follows the observation by Geva et al. (2022) that FFN updates operate through a token-promotion mechanism (rather than elimination).

Sentiment | Value Vector | Example Top-scoring Tokens
--------- | ------------ | --------------------------
Positive  | L13D1763 | properly, appropriately, adequate, truthful, humane, fulfil, inclusive, timely, patiently, sustainable
Positive  | L13D2011 | clean, Proper, secure, flawless, safest, graceful, smooth, calmly
Positive  | L14D944  | peacefully, graceful, respectful, careful, generous, patiently, calm, tolerant, fair
Positive  | L15D74   | Excellence, superb, trustworthy, marvelous, terrific, awesome, Amazing
Positive  | L20D988  | successful, optimal, perfect, satisfactory, welcome, helpful, fulfilling, healthy
Negative  | L11D4    | outdated, inadequate, stale, lousy, dull, mediocre, boring, wasteful
Negative  | L14D2653 | trivial, dismiss, rigid, unsupported, only, prejud, obfusc, pretend, dispar, slander
Negative  | L16D974  | inappropriately, poorly, disrespect, unreliable, unhealthy, insecure, improperly, arrogance
Negative  | L17D3790 | inappropriate, improper, wrong, bad, harmful, unreasonable, defective, disturbance, errors
Negative  | L18D91   | confused, bizarre, unfairly, horrible, reckless, neglect, misplaced, strange, nasty, mistakenly
Negative  | L18D3981 | wrong, incorrect, insufficient, misleading, premature, improperly, unrealistic, outdated, unfair

Table 3: Value vectors used for controlling sentiment in generated text, which promote positive and negative adjectives/adverbs. For each vector, we show example top-scoring tokens from its projection to the vocabulary, as presented in the exploration view of LM-Debugger.
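The intervention above can be sketched in a few lines. Following Geva et al. (2022), an FFN layer's additive update to a token representation is a weighted sum of fixed "value vectors", update = Σᵢ mᵢ · vᵢ, where the coefficients mᵢ depend on the input. The sketch below is a hypothetical toy illustration (not LM-Debugger's actual code): the helper names `ffn_update` and `turn_on` and the toy numbers are our own, and real interventions operate on a model's FFN matrices rather than on hand-written lists.

```python
def ffn_update(coeffs, values):
    """FFN update as a weighted sum of value vectors.

    coeffs: list of floats m_i (one per value vector).
    values: list of value vectors v_i (each a list of floats).
    Returns sum_i m_i * v_i as a list of floats.
    """
    dim = len(values[0])
    out = [0.0] * dim
    for m, v in zip(coeffs, values):
        for j in range(dim):
            out[j] += m * v[j]
    return out

def turn_on(coeffs, k):
    """The paper's intervention: raise coefficient k to the layer maximum,
    so value vector k is maximally promoted (tokens are never eliminated)."""
    boosted = list(coeffs)
    boosted[k] = max(coeffs)
    return boosted

# Toy example: vector 2 starts nearly "off"; after the intervention its
# coefficient matches the largest one in the layer.
coeffs = [0.1, 0.7, 0.05]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(turn_on(coeffs, 2))                 # [0.1, 0.7, 0.7]
print(ffn_update(turn_on(coeffs, 2), values))
```

In LM-Debugger the same effect is achieved through the prediction-view UI, which sets the chosen vector's coefficient to the maximal coefficient observed in its layer.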