
Paper deep dive

Codebook Features: Sparse and Discrete Interpretability for Neural Networks

Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

Year: 2023 | Venue: ICLR 2024 | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 100

Models: 1-layer TinyStories GPT (21M), 24-layer Pythia-410M, 4-layer TokFSM Transformer (custom)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 7:48:31 PM

Summary

The paper introduces 'codebook features', a method to improve neural network interpretability by replacing continuous hidden states with sparse, discrete vector quantization bottlenecks. By finetuning models with these bottlenecks, the authors demonstrate that networks maintain performance while enabling causal control and disentangled representation of concepts, effectively addressing the superposition problem.

Entities (6)

Transformer · neural-network-architecture · 99%
Codebook Features · methodology · 98%
TinyStories · dataset · 95%
TokFSM · dataset · 95%
WikiText-103 · dataset · 95%
Superposition Problem · research-problem · 92%

Relation Signals (3)

Codebook Features appliedto Transformer

confidence 95% · While codebook features can be applied to any neural network, we primarily focus on Transformer networks

Transformer trainedon TokFSM

confidence 95% · We train 4-layer Transformers... on the TokFSM dataset.

Codebook Features solves Superposition Problem

confidence 90% · In this setting, our approach overcomes the superposition problem by assigning states to distinct codes

Cypher Suggestions (2)

Identify architectures that support codebook features · confidence 90% · unvalidated

MATCH (m:Methodology {name: 'Codebook Features'})-[:APPLIED_TO]->(a:Architecture) RETURN a.name

Find all datasets used to evaluate codebook features · confidence 85% · unvalidated

MATCH (m:Methodology {name: 'Codebook Features'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name

Abstract

Abstract: Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.

Tags

ai-safety (imported, 100%) | empirical (suggested, 88%) | interpretability (suggested, 80%) | mechanistic-interp (suggested, 92%)

Links


Full Text

99,202 characters extracted from source content.


CODEBOOK FEATURES: SPARSE AND DISCRETE INTERPRETABILITY FOR NEURAL NETWORKS

Alex Tamkin (Anthropic †), Mohammad Taufeeque (FAR AI), Noah D. Goodman (Stanford University)

ABSTRACT

Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability.
Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features (footnote 1).

1 INTRODUCTION

The strength of neural networks lies in their ability to learn emergent solutions that we could not program ourselves. Unfortunately, the learned programs inside neural networks are challenging to make sense of, in part because they differ from traditional software in important ways. Most strikingly, the state of a neural network program, including intermediate computations and features, is implemented in dense, continuous vectors inside of a network. As a result, many different pieces of information are commingled inside of these vectors, violating the software engineering principle of separation of concerns (Dijkstra & Dijkstra, 1982). Moreover, the continuous nature of these vectors means no feature is ever truly off inside of a network; instead, they are activated to varying degrees, vastly increasing the complexity of this state and the possible interactions within it.

A natural question is whether it is possible to recover some of the sparsity and discreteness properties of traditional software systems while preserving the expressivity and learnability of neural networks. To make progress here, we introduce a structural constraint into training that refactors a network to adhere more closely to these design principles. Specifically, we finetune a network with trainable vector quantization bottlenecks (Gray, 1984) at each layer, which are sparse and discrete. We refer to each vector in this bottleneck as a code and the entire library of codes as the codebook. See Figure 1 for a visual depiction of this motivation.

† Work performed while at Stanford University. Correspondence to atamkin@cs.stanford.edu.
Footnote 1: Author contributions listed in Appendix A.

arXiv:2310.17230v1 [cs.LG] 26 Oct 2023

Figure 1: Codebook features attempt to combine the expressivity of neural networks with the sparse, discrete state often found in traditional software.
The resulting codebooks learned through this process are a promising interface for understanding and controlling neural networks. For example, when we train a codebook language model on the outputs of a finite state machine, we find a precise mapping between activated codes in different layers of the model to the states of the state machine, overcoming the challenge of superposition (Elhage et al., 2022b). Furthermore, we demonstrate a causal role for these codes: changing which code is activated during the forward pass causes the network to behave as if it were in a different state. Additionally, we apply codebook features to transformer language models with up to 410M parameters, showing that despite this bottleneck, they can be trained with only modest accuracy degradation compared to the original model. We find codes that activate on a wide range of concepts, spanning punctuation, syntax, lexical semantics, and high-level topics. We then show how to use codebook features to control the topic of a model’s generations, providing a practical example of how to use our method to understand and control real language models.

2 METHOD

Codebook features aim to improve our understanding and control of neural networks by compressing their activation space with a sparse, discrete bottleneck. Specifically, we aim to learn a set of discrete states the network can occupy, of which very few are active during any single forward pass. As we will show later in the paper (Sections 3 and 4), this bottleneck encourages the network to store useful and disentangled concepts in each code. Even more importantly, we show that these interpretations enable us to make causal interventions on the network internals, producing the expected change in the network’s behavior. Crucially, codebooks are learned, not hand-specified, enabling them to capture behaviors potentially unknown by human researchers.
Concretely, codebook features are produced by replacing a hidden layer’s activations with a sparse combination of code vectors. Let $a \in \mathbb{R}^{N}$ be the activation vector of a given $N$-dimensional layer in a network. We have a codebook $\mathcal{C} = \{c_1, c_2, \ldots, c_C\} \in \mathbb{R}^{C \times N}$, where $C$ is the codebook size. To apply the codebook, we first compute the cosine similarities $\mathrm{sim}(a, c_i) = \frac{a \cdot c_i}{|a|\,|c_i|}$ between $a$ and each code vector $c_i$. We then replace $a$ with $\sum_{i \in S} c_i$, where $S$ contains the indices of the top $k$ most similar code vectors. In other words, we activate and sum the $k$ code vectors most similar to the original activation $a$. The value of $k$ controls the bottleneck’s sparsity; we aim to make $k$ as small as possible while achieving adequate performance. $k$ is a small fraction of $C$ in our experiments, typically less than 1%, and as a result, we find that codebooks are tight information bottlenecks, transmitting much less information than even 4-bit quantized activations (Appendix C).

While codebook features can be applied to any neural network, we primarily focus on Transformer networks, placing codebooks after either the network’s MLP blocks or attention heads. Figure 2 shows the precise location of the codebook for each type of sublayer. Note that this positioning of the codebooks preserves the integrity of the residual stream of the network, which is important for optimizing deep networks (He et al., 2016; Elhage et al., 2021).

Figure 2: Applying codebook features to transformers. Attention heads: We add one codebook (depicted by the colored rectangles) for each attention head. The codebook is inserted before the projection into the residual stream. Feedforward block: We insert the codebook after the feedforward block, before addition into the residual stream.

2.1 TRAINING WITH CODEBOOKS

To obtain codebook features, we add the codebook bottlenecks to existing pretrained models and finetune the model with the original training loss.
Thus, the network must learn to perform the task well while adjusting to the discrete codebook bottleneck. Using a pretrained model enables us to produce codebook features more cheaply than training a network from scratch. When finetuning, we use a linear combination of two losses:

Original training loss. In our work, we apply codebooks to Transformer-based causal language models and thus use the typical cross-entropy loss these models were trained with: $\mathcal{L}_{LM}(\theta) = -\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})$, where $\theta$ represents the model parameters, $x_i$ is the next token of input sequence $x_{<i}$, $p_\theta(x_i \mid x_{<i})$ is the model’s predicted probability of token $x_i$ given input $x_{<i}$, and $N$ is the length of the input sequence.

Reconstruction loss. Because we compute the similarity between activations and codebook features using the cosine similarity, which is invariant to magnitude, the code vectors can often grow in size throughout training, leading to instability. For this reason, we find it helpful to add an auxiliary loss to the codes: $\mathcal{L}_{MSE} = \mathrm{MSE}(\mathcal{C}(x), \text{stop-gradient}(x))$, where $x$ is the input to the codebook, $\mathcal{C}(x)$ is its output, and MSE is the mean squared error, to keep the distance between inputs and chosen codes small. The stop gradient means the gradient of this operation only passes through the codebook, not the input $x$, which we found was important to avoid damaging the network’s capabilities (footnote 2).

Final loss and optimization. The final loss is simply a combination of both losses above, $\mathcal{L} = \mathcal{L}_{LM} + \lambda \mathcal{L}_{MSE}$, where $\lambda$ is a tradeoff coefficient. We set $\lambda$ to 1 in this work. To optimize the codebooks despite the discrete choice of codes, we use the straight-through estimator: we propagate gradients to the codes that were chosen on each forward pass and pass no gradients to the remaining codes (Bengio et al., 2013; van den Oord et al., 2017). We use this strategy to successfully perform end-to-end training of networks up to 24 layers deep, with each layer having a codebook. We defer additional details to Appendix B.
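The codebook operation and the combined loss described above can be sketched in a few lines. This is a minimal, forward-only illustration in plain Python, not the authors' implementation: the toy codebook values, the placeholder language-modeling loss, and the helper names (`cosine_similarity`, `codebook_forward`, `mse`) are all assumptions made for the example, and the stop-gradient and straight-through estimator are omitted because they only matter under automatic differentiation.

```python
import math

def cosine_similarity(a, c):
    """Cosine similarity between activation a and code vector c."""
    dot = sum(x * y for x, y in zip(a, c))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_c = math.sqrt(sum(y * y for y in c))
    return dot / (norm_a * norm_c)

def codebook_forward(a, codebook, k):
    """Replace activation a with the sum of its k most similar codes."""
    sims = [cosine_similarity(a, c) for c in codebook]
    # Indices of the top-k codes, most similar first.
    top_k = sorted(range(len(codebook)), key=lambda i: sims[i], reverse=True)[:k]
    output = [sum(codebook[i][d] for i in top_k) for d in range(len(a))]
    return output, top_k

def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)

# Toy codebook with C = 4 codes of dimension N = 2 (illustrative values only).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]
activation = [0.9, 0.8]

output, active_codes = codebook_forward(activation, codebook, k=2)
reconstruction_loss = mse(output, activation)  # plays the role of L_MSE
lm_loss = 1.2  # placeholder standing in for L_LM; not a value from the paper
lam = 1.0      # lambda = 1, as stated in the text
total_loss = lm_loss + lam * reconstruction_loss
```

Because the selected codes are summed rather than averaged, the output can be larger in magnitude than the input, which is one way to see why a reconstruction term that keeps codes close to their inputs is useful.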
2.2 USING CODEBOOKS FOR UNDERSTANDING AND CONTROL

A trained codebook model enables a simple and intuitive way of controlling the network’s behavior. This method consists of two phases:

1) Generating hypotheses for the role of codes. Most codes are activated infrequently in the training dataset. We can gain an intuition for the functional role of each code in the network’s hidden state by retrieving many examples in the dataset where that code was activated. For example, if a code activates mainly around words like “candle,” “matches,” and “lighters,” we might hypothesize that the code is involved in representations of fire. The discrete on-or-off nature of codes makes this task more manageable than looking at continuous values like neuron activations, as past work has speculated that lower-activating neurons can “smuggle” important information across layers, even if many neurons appear interpretable (Elhage et al., 2022a). As we will show in the following sections, the codes we discover activate more often on a single interpretable feature, while neurons may activate on many unrelated features. Appendix F.1 discusses the advantages and tradeoffs of codebooks over neuron- and feature direction–based approaches in more detail.

Footnote 2: We performed preliminary experiments that only used the reconstruction loss (keeping the language model’s parameters fixed), similar to a VQ-VAE (van den Oord et al., 2017) at every layer. However, we achieved significantly worse performance. See Table 8 for more details.

Figure 3: Codebook features learn the hidden structure of an algorithmic sequence modeling task. The codebook transformer learns to detect the states of a finite state machine and assigns a code to each state. We can then manipulate these codes to cause the network to make predictions as if it were in a different state.
2) Steering the network by activating codes. After we have identified codes that reliably activate on the concept we are interested in, we can directly activate those codes to influence the network’s behavior. For example, if we identified several codes related to fire, we could activate those codes during generation to produce outputs about fire (e.g., as in Section 4.1). This intervention confirms that the codes have a causal role in the network’s behavior.

In the following sections, we apply this same two-step procedure across several different datasets, showing that we can successfully gain insight into the network and control its behavior in each case.

3 ALGORITHMIC SEQUENCE MODELING

The first setting we consider is an algorithmic sequence modeling dataset called TokFSM. The purpose of this dataset is to create a controlled setting exhibiting some of the complexities of language modeling, but where the latent features present in the sequence are known. This setting enables us to evaluate how well the model learns codes that activate on these distinct features. An overview of the section and our findings is shown in Figure 3. Below, we describe the dataset, and then (following Section 2.2) we first generate hypotheses for the role of codes, then show how one can predictably influence the network’s behavior by manipulating these codes.

The TokFSM Dataset. The TokFSM dataset is produced by first constructing a simplified finite state machine (FSM). Our FSM is defined by $(V, E)$, where $V = \{0, \ldots, N-1\}$ is a set of nodes and $E \subseteq V \times V$ indicates the set of valid transitions from one state to the next. In our setting, we choose $N = 100$ and give each node 10 randomly chosen outbound neighbors, each assigned an equal transition probability (0.1). Entries in the dataset are randomly sampled rollouts of the FSM up to 64 transitions. We tokenize the sequences at the digit level; this gives a sequence length of 128 for each input.
For example, if our sampled rollout is [18, 00, 39], we would tokenize it as [1, 8, 0, 0, 3, 9] for the neural network. Thus, the model must learn to detokenize the input into its constituent states, predict the next FSM state, and then retokenize the state to predict the next token.

Training and evaluating the codebook models. We train 4-layer Transformers with 4 attention heads and an embedding size of 128 based on the GPTNeoX architecture (Black et al., 2022) on the TokFSM dataset. We train several models with different numbers of codes and sparsity values $k$, with codebooks either at the network’s attention heads or both the attention heads and MLP layers (see Figure 2). In Table 1, we report the performance of the resulting models in terms of their language modeling loss, their next-token accuracy, and their ability to produce valid transitions of the FSM across a generated sequence. The $k = 1$ model with codebooks at only the attention layers achieves comparable performance across all metrics to the original model. At the same time, larger values of $k$ enable the model with codebooks at both attention and MLP blocks to attain comparable performance. It is striking that networks can perform so well despite this extreme bottleneck at every layer. We defer additional training details to Appendix D.1 and ablation studies to Table 8.

Codebook Type | Loss | LM Acc | State Acc
No Codebook | 1.179 | 46.36 | 96.77
Attn Only, C=2k | 1.18 | 46.33 | 96.39
† Attn+MLP, k=1, C=10k | 1.269 | 45.27 | 63.65
Attn+MLP, k=4, C=2k | 1.204 | 46.04 | 76.32
Attn+MLP, k=16, C=20k | 1.183 | 46.32 | 91.53
Attn+MLP, k=128, C=20k | 1.178 | 46.38 | 95.82

Table 1: Performance of original and codebook models on TokFSM. A $k = 1$ codebook model on only attention layers attains similar performance to the original model, while attention-and-MLP codebooks require a higher $k$ to match performance. † indicates the model we analyze in the rest of the section.
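The TokFSM construction described in this section (100 states, 10 equiprobable outbound neighbors per state, digit-level tokenization) can be sketched as follows. This is an illustrative reconstruction using only the Python standard library, not the authors' data pipeline. The function names are hypothetical, and two details are assumptions: each rollout is taken to contain 64 states so that two digits per state yields the stated 128 tokens, and neighbors are sampled without replacement with self-loops allowed.

```python
import random

def build_fsm(num_states=100, out_degree=10, seed=0):
    """Give every state `out_degree` distinct, randomly chosen outbound neighbors."""
    rng = random.Random(seed)
    return {s: rng.sample(range(num_states), out_degree) for s in range(num_states)}

def sample_rollout(fsm, num_states_in_rollout=64, seed=0):
    """Random walk over the FSM; each neighbor is chosen with equal probability."""
    rng = random.Random(seed)
    state = rng.randrange(len(fsm))
    rollout = [state]
    while len(rollout) < num_states_in_rollout:
        state = rng.choice(fsm[state])
        rollout.append(state)
    return rollout

def tokenize(rollout):
    """Digit-level tokenization: state 18 -> [1, 8], state 0 -> [0, 0]."""
    return [int(ch) for s in rollout for ch in f"{s:02d}"]

fsm = build_fsm()
rollout = sample_rollout(fsm)
tokens = tokenize(rollout)
```

With these assumptions, `tokenize([18, 0, 39])` reproduces the example from the text, `[1, 8, 0, 0, 3, 9]`, and every consecutive pair of states in `rollout` is a valid edge of the FSM, which is the property the State Acc column of Table 1 measures on generated sequences.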
[Figure 4 panels: (a) State code interventions; (b) State-plus-digit code interventions. X-axis conditions: None, L0 Attn, L1 MLP, All Attn, All MLP, All Attn, MLP. Y-axis: Normalized JS Div, from 0 to 1.]

Figure 4: Interventions on the state and state-plus-digit codes in a sequence. Changing just the MLP codes to codes associated with another state shifts the output distribution almost entirely to the target state. Changing codes in other layers has a much smaller effect. Normalized JS Div stands for the normalized Jensen-Shannon Divergence, where the initial difference (None) is normalized to 1.

3.1 GENERATING HYPOTHESES FOR THE ROLE OF CODES

After training these models, we examine the $k = 1$ attention and MLP codebook transformer following Section 2.2. Looking at activating tokens reveals a wide range of interesting-looking codes. We provide descriptions of these codes along with a table of examples in Table 6, and focus our analysis on two families of codes here: in the last three MLP layers (layers 1, 2, and 3), we identify state codes that reliably activate on the second token of a specific state (of which there are 100 possibilities), as well as state-plus-digit codes that activate on a specific digit when it follows a specific state (686 possibilities in our state machine). For example, code 2543 in MLP layer 2 activates on the 0 in the state 40 (e.g., 50-40-59). This finding is notable because there are only 128 neurons in a given MLP layer, far lower than the total number of these features. Thus, the codebooks must disentangle features represented in a distributed manner across different neurons inside the network. (Anecdotally, the top-activating tokens for the neurons in these layers do not appear to follow any consistent pattern.) We quantify this further with an experiment where we use state codes to classify states and compare them to the neuron with the highest precision at that state code’s recall level.
As shown in Figure 6a, codes have an average precision of 97.1%, far better than the average best neuron precision of 70.5%. These pieces of evidence indicate that codebooks can minimize the superposition problem in this setting. See Appendix D for additional details and experiments.

3.2 STEERING THE NETWORK BY ACTIVATING CODES

While these associations can provide hypotheses for code function, they do not provide evidence that codes causally influence the network’s behavior. For this, interventional studies are necessary (Spirtes et al., 2000; Pearl & Mackenzie, 2018; Geiger et al., 2020; 2021). The state and state-plus-digit codes presented in Section 3.1 suggest a natural causal experiment: set the activated code in a given codebook to the code corresponding to another state and see whether the next token distribution shifts accordingly (footnote 3). More specifically, let $\mathcal{C}^{(l)}(x_t)$ be the codebook at layer $l$ applied to input token $x_t$. As we consider a $k = 1$ model, $\mathcal{C}^{(l)}(x_t)$ returns a single code $c^{(l)}_t \in \mathbb{R}^{d}$. We replace

Footnote 3: This experiment is similar to what Geiger et al. (2020) call an interchange intervention, and more generally establishes a causal abstraction over the neural network (Geiger et al., 2021).

Figure 5: Steering a language model with topic codes. We identify several codes that activate on examples of a given topic (e.g., dragons). We then activate these codes at each generation step, producing generated text about that topic. See Table 10 for examples.

Table 2: Codebook models are still capable language models. Asterisks (*) denote the base model we apply the codebooks to, while daggers (†) indicate the codebook models we analyze in the rest of the paper. We trained the other models to provide additional comparisons (see Appendix E.3 for more details).
While we use a pretrained TinyStories model as our base model, we also report the loss/accuracy of a finetuned model for comparison to account for any subtle differences in how we process the finetuning data (e.g., padding).

(a) TinyStories 1-Layer Model
Language Model | Loss | Acc
* Pretrained | 1.82 | 56.22
Finetuned | 1.57 | 59.27
† Attn, k=8 | 1.66 | 57.91
MLP, k=100 | 1.57 | 59.47

(b) WikiText-103 410M 24-Layer Model
Language Model | Loss | Acc
* Finetuned (Wiki) | 2.41 | 50.52
Finetuned 160M (Wiki) | 2.72 | 46.75
† Attn, k=8 | 2.74 | 46.68
Attn, k=64 | 2.55 | 48.44
MLP, k=100 | 3.03 | 42.47
MLP, grouped 16×(k=64) | 2.57 | 48.46

this code with $\tilde{c}^{(l)}_t$, a code that activates when a different state is present. We then recompute the forward pass from that point and observe whether the network’s next token distribution resembles the next token distribution for the new state. In Figure 4a, we find that this is precisely the case: changing only the state codes in the MLP layers to a different state code shifts the next token distribution towards that other state, as measured by the Jensen-Shannon Divergence (JSD; Lin, 1991), averaged over 500 random state transitions. This effect is even more substantial for the state-plus-digit codes, where changing the codes in the MLP layers makes the next-state distribution almost identical to that of the new state (Figure 4b). These results provide strong evidence that these codes perform the expected causal role in the network. Note that applying a similar perturbation to just a single MLP layer or all the attention layers causes a much smaller drop in JSD, indicating that this information is mainly stored across several MLP layers.

4 LANGUAGE MODELING

Next, we apply codebook features to language models (LMs) trained on naturalistic text corpora. We demonstrate the generality and scalability of our approach by training two models of different sizes on two different datasets.
After describing the models we train and the training data, we follow the strategy described in Section 2.2 and identify hypotheses for the role of codes in the network. Then, we validate these hypotheses by steering the models through targeted activation of codes.

Trained models. We finetune a small, 1-layer, 21 million parameter model on the TinyStories dataset of children’s stories (Eldan & Li, 2023). We also finetune a larger, 24-layer 410M parameter model on the WikiText-103 dataset, consisting of high-quality English-language Wikipedia articles (Merity et al., 2016). See Appendix E for more training details.

[Figure 6 panels: (a) Finite-state machine dataset (TokFSM); (b) WikiText-103. Axes: Neuron Precision and Code Precision, each from 0 to 1.]

Figure 6: Codes are better classifiers of simple textual features than neurons. Y-axis: precision of a given code at classifying a regular expression. X-axis: precision of the best neuron in the network, with a threshold chosen to match the recall of the code. Red line: y = x.

Codebook models are still strong language models. Remarkably, despite the extreme bottleneck imposed by the codebook constraint, the codebook language models can still achieve strong language modeling performance. As shown in Table 2, codebook models can attain a loss and accuracy close to or better than the original models with the proper settings. In addition, the generations of the codebook models look comparable to the base models, as shown in Table 10. Finally, in Appendix E.4, we profile the inference speed of these codebook models, showing how sparsity and fast maximum inner product search (MIPS) algorithms enable codebooks to run much more efficiently than the naive implementation of two large matrix multiplications.

Generating hypotheses for the role of codes. We also explore the interpretability of codes by looking at examples that the code activates on.
In Table 11, we catalog codes that selectively activate on a wide range of linguistic phenomena, spanning orthography (e.g., names starting with “B”), word types (e.g., months of the year), events (e.g., instances of fighting), and overall topics (e.g., fire or football). Interestingly, codes for a particular linguistic phenomenon may not always activate on the words most relevant to that concept. For example, in our TinyStories model, we find that a code that activates on mentions of fighting and violence might trigger on the word “the” but not the adjacent word “quarrel”. We suspect this may be because the network can store pieces of information in nearby tokens and retrieve them when needed via attention.

Comparison to neuron-level interpretability. As in Section 3.1, we would like to compare the interpretability of the codebook to neuron-level interpretability. While natural language features are more complex than the states in Section 3, we conduct a preliminary experiment comparing both neuron- and code-based classifiers to regular expression-based classifiers. We first collect a set of codes that appear to have simple, interpretable activation patterns (e.g., “fires on years beginning with 2”). We then create heuristic regular expressions targeting those features (e.g., 2 ). Next, we compute the precision of the code classifier, using the regular expression as our source of truth. We then take the recall of our code classifier and search across all neurons, thresholding each at the same recall as the code and reporting the highest precision found. As Figure 6b demonstrates, codes are far better classifiers of these features than neurons on average, with over 30% higher average precision. We defer additional details and discussion to Appendix E.7.
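The matched-recall comparison described above can be sketched as follows. This is an illustrative version in plain Python, not the authors' evaluation code: the toy labels and activations are made up, the function names are hypothetical, and the thresholding rule (the highest threshold whose recall is at least the code's recall) is an assumption about how "the same recall" is resolved when no threshold matches exactly.

```python
def precision_recall(pred, truth):
    """Precision and recall of binary predictions against binary labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)
    n_true = sum(1 for t in truth if t)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall

def neuron_precision_at_recall(acts, truth, target_recall):
    """Lower the threshold until the neuron reaches the target recall,
    then report its precision at that threshold."""
    for thr in sorted(set(acts), reverse=True):
        pred = [a >= thr for a in acts]
        precision, recall = precision_recall(pred, truth)
        if recall >= target_recall:
            return precision
    return 0.0

# Toy data: truth marks the tokens matched by the regular expression.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
code_fires = [1, 1, 0, 0, 0, 0, 0, 0]
neurons = [
    [0.9, 0.2, 0.4, 0.7, 0.1, 0.0, 0.3, 0.1],
    [0.1, 0.5, 0.6, 0.9, 0.8, 0.0, 0.0, 0.0],
]

code_precision, code_recall = precision_recall(code_fires, truth)
best_neuron_precision = max(
    neuron_precision_at_recall(acts, truth, code_recall) for acts in neurons
)
```

In this toy case the code has perfect precision at recall 2/3, while the best neuron only reaches precision 2/3 at the same recall. Each such (neuron precision, code precision) pair would correspond to one point in a plot like Figure 6.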
4.1 STEERING THE NETWORK BY ACTIVATING TOPIC CODES

As in Section 3.2, we would like to validate that codes do not merely fire in a correlated way with different linguistic features but that they have a causal role in the network’s behavior. As an initial investigation of this goal, we study a subset of codes in the attention codebook model that appear to identify and control the topic discussed by a model. To identify potential topic codes, we use a simple heuristic and select only codes that activate on more than 50% of tokens in a given sequence (footnote 4).

Footnote 4: This heuristic is inspired by past work connecting activation patterns in frequency space to different linguistic phenomena (Tamkin et al., 2020).

Table 3: Activating topic codes causes the model to discuss those topics. Percentage of generations that mention the topic before and after setting one or all codes in each attention head to the topic code. Numbers in (parentheses) indicate the number of activated topic codes. This number is smaller for the all codes condition because only one topic code will be activated if multiple topic codes are located in the same attention head.

(a) Wikitext
Topic | Baseline Freq | Steered (one code) | Steered (all codes)
Video game | 2.5 | 55.0 (18) | 75.0 (4)
Football | 7.5 | 47.5 (18) | 95.0 (8)
Movie | 27.5 | 42.5 (12) | 90.0 (5)
Song | 20.0 | 32.5 (17) | 85.0 (11)

(b) TinyStories
Topic | Baseline Freq | Steered (one code)
Dragon | 2.5 | 65.0 (8)
Slide | 2.5 | 95.0 (12)
Friend | 42.5 | 75.0 (9)
Flower | 0.0 | 90.0 (8)
Fire | 2.5 | 100.0 (16)
Baby | 0.0 | 90.0 (15)
Princess | 40.0 | 87.5 (14)

Table 4: Example steered generations for TinyStories model. More examples in Table 13.

Code Concept | # codes | Example steered generation
Dragon | 8 | Once upon a time, there was a little girl named Lily. She was very excited to go outside and explore. She flew over the trees and saw a big, scary dragon. The dragon was very scary. [...]
Flower | 8 | Once upon a time, there was a little girl named Lily. She liked to pick flowers in the meadow. One day, she saw a big, green [...]
Fire | 16 | Once upon a time, there was a little boy named Timmy. Timmy loved his new toy. He always felt like a real fireman. [...]
Princess | 14 | Once upon a time, there was a little bird named Tweety. One day, the princess had a dream that she was invited to a big castle. She was very excited and said, “I want to be a princess and [...]

Of these, we manually filter by looking at the activating tokens of these codes and choose only those that appear to activate frequently on other examples related to that topic. To shift the output generations of the model, we then take an input prompt (e.g., the start-of-sequence token) and activate the topic codes in the model for every token of this prompt. Then, we sample from the model, activating the topic codes for each newly generated token. Unlike Section 3, our models here have $k > 1$. Thus, we explore two types of interventions: first, activating a single code in each codebook (replacing the code with the lowest similarity with the input) and second, replacing all activated codes in each codebook with $k$ copies of the topic code (footnote 5). We use the attention-only codebook with $k = 8$ in our experiments. See Figure 5 for a graphical depiction.

Remarkably, activating the topic codes causes the model to introduce the target topic into the sampled tokens in a largely natural way. We show several examples of this phenomenon in Tables 4, 13 and 14. Interestingly, even though the topic code is activated at every token, the topic itself is often only introduced many words later in the sequence, when it would be contextually appropriate. We quantify the success of this method by generating many steered sequences and classifying the generated examples into different categories with a simple word-based classifier. The results, presented in Table 3, demonstrate that the steered generations mention the topic far more often, with almost all generations successfully mentioning the topic when all codes in a codebook are replaced.
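The two interventions described above (swapping out the least similar active code, or replacing all active codes with copies of the topic code) can be illustrated on a single toy codebook. This is a sketch under stated assumptions, not the authors' steering code: the similarity scores, code vectors, and function names are invented for the example, only one topic code per codebook is handled, and the case where the topic code is already among the active codes is not treated.

```python
def top_k_indices(sims, k):
    """Indices of the k most similar codes, most similar first."""
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

def steer_one(active, topic_code):
    """Replace the least similar active code (last in the list) with the topic code."""
    return active[:-1] + [topic_code]

def steer_all(active, topic_code):
    """Replace every active code with a copy of the topic code."""
    return [topic_code] * len(active)

def codebook_output(codebook, indices):
    """Sum the selected code vectors, as in the unsteered codebook layer."""
    dim = len(codebook[0])
    return [sum(codebook[i][d] for i in indices) for d in range(dim)]

# Toy setup: 5 codes of dimension 2; code 4 is the hypothetical topic code.
codebook = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 3]]
sims = [0.1, 0.9, 0.4, 0.7, 0.2]  # made-up similarities to the current activation
topic_code = 4

active = top_k_indices(sims, k=3)
one_code = steer_one(active, topic_code)
all_codes = steer_all(active, topic_code)

out_one = codebook_output(codebook, one_code)
out_all = codebook_output(codebook, all_codes)
```

In the all-codes variant the layer output no longer depends on the input at all, which is consistent with Table 3 reporting a stronger effect for that condition on the Wikitext model.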
See Appendix E.8 for more details and additional generations. These interventions constitute meaningful evidence of how codebook features can enable the interpretation and control of real language models.

Footnote 5: If m > 1 codes map to the steering topic in a given codebook, we replace the m lowest-scoring codes in the first case and randomly select one code to replace all the codes in that codebook in the second case.

5 RELATED WORK

Mechanistic interpretability. Our work continues a long stream of work since the 1980s on understanding how neural networks operate, especially when individual neurons are uninterpretable (Servan-Schreiber et al., 1988; Elman, 1990) (footnote 6). Recent work has continued these investigations in modern computer vision models (Olah et al., 2018; 2020; Bau et al., 2020b) and language models (Elhage et al., 2021; Geva et al., 2021), with special focus on the problem of understanding superposition, when many features are distributed across a smaller number of neurons (Elhage et al., 2022b). Recent work has investigated whether sparse dictionary learning techniques can recover these features (Yun et al., 2021; Sharkey et al., 2022), including the concurrent work of Bricken et al. (2023) and Cunningham et al. (2023). Our work shares similar goals as the above works. Codebook features attempt to make it easier to identify concepts and algorithms inside of networks by refactoring their hidden states into a sparse and discrete form. We also show how codebooks can mitigate superposition by representing more features than there are neurons and that we can intervene on the codebooks to alter model behavior systematically.

Discrete structure in neural networks. Our work also connects to multiple streams of research on incorporating discrete structure into neural networks (Andreas et al., 2016; Mao et al., 2019). Most relevant is VQ-VAE (van den Oord et al., 2017), which trains an autoencoder with a vector quantized hidden state (Gray, 1984).
Our work also leverages vector quantization; however, unlike past work, we extend this method by using it as a sparse, discrete bottleneck that could be inserted between the layers of any neural network (and apply it to autoregressive language models), enabling better understanding and control of the network’s intermediate computation.

Inference-time steering of model internals. Finally, our work connects to recent research on steering models based on inference-time perturbations. For example, Merullo et al. (2023) and Turner et al. (2023) steer networks by adding vectors of different magnitudes to different layers in the network. Our work supports these aims by making it easier to localize behaviors inside the network (guided by activating tokens) and making it easier to perform the intervention by substituting codes (so the user does not have to try many different magnitudes of a given steering vector at each layer).

We include an extended discussion of related work, including the relative advantages of codebooks and dictionary learning methods, in Appendix F.

6 DISCUSSION AND FUTURE WORK

We present codebook features, a method for training neural networks with sparse and discrete hidden states. Codebook features enable unsupervised discovery of algorithmic and linguistic features inside language models, making progress on the superposition problem (Elhage et al., 2022b). We have shown how the sparse, discrete nature of codebook features reduces the complexity of a neural network’s hidden state, making it easier to search for specific features and control the network’s behavior with them.

Our work has limitations. First, we only study Transformer neural networks on one algorithmic dataset and two natural language datasets; we do not study transformers applied to visual data or other architectures, such as convolutional neural networks, leaving this for future work.
In addition, we only explore topic manipulation in language models; future work can explore the manipulation of other linguistic features in text, including sentiment, style, and logical flow.

Ultimately, our results suggest that codebooks are an appealing unit of analysis for neural networks and a promising foundation for the interpretability and control of more complex phenomena in models. Looking forward, the sparse, discrete nature of codebook features should aid in discovering circuits across layers, more sophisticated control of model behaviors, and making automated, larger-scale interpretability methods more tractable (footnote 7).

Footnote 6: As Elman (1990) phrases the problem: “A given node participates in representing multiple concepts. It is the activation pattern in its entirety that is meaningful. The activation of an individual node may be uninterpretable in isolation (i.e., it may not even refer to a feature or micro-feature).”

Footnote 7: See Appendix G for an extended discussion of applications and future work.

REPRODUCIBILITY STATEMENT

We release our codebase and trained models to enable others to easily build on our work. Additionally, Sections 2 to 4 and appendices B, D and E describe the specific experimental details and settings we used to carry out our experiments.

ACKNOWLEDGMENTS

We would like to thank Shyamal Buch, Adrià Garriga-Alonso, Atticus Geiger, Adam Gleave, Lev McKinney, Jesse Mu, Remy Ochei, and Zhengxuan Wu for helpful discussions and comments on drafts, and Hofvarpnir Studios for compute support. AT was supported by an Open Phil AI Fellowship.

REFERENCES

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 39–48, 2016.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. doi: 10.1162/tacl_a_00034. URL https://aclanthology.org/Q18-1034.

David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, and Antonio Torralba. Rewriting a deep generative model. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, p. 351–369. Springer, 2020a.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020b.

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, p. 2397–2430. PMLR, 2023.

Sid Black, Stella Rose Biderman, Eric Hallahan, Quentin G. Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Martin Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Benqi Wang, and Samuel Weinbach. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv preprint arXiv:2204.06745, 2022. URL https://api.semanticscholar.org/CorpusID:248177957.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258, 2021.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Shyamal Buch, Li Fei-Fei, and Noah D Goodman. Neural event semantics for grounded language understanding. Transactions of the Association for Computational Linguistics, 9:875–890, 2021.

Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.

Lawrence Chan, Adrià Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: A method for rigorously testing interpretability hypotheses. In Alignment Forum, 2022.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What Does BERT Look At? An Analysis of BERT’s Attention. arXiv preprint arXiv:1906.04341, 2019.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

Edsger W Dijkstra. On the role of scientific thought. Selected writings on computing: a personal perspective, p. 60–66, 1982.

David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.

Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing, 15(12):3736–3745, 2006.

Ronen Eldan and Yuanzhi Li. TinyStories: How Small Can Language Models Be and Still Speak Coherent English?, 2023.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021.

Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax Linear Units. Transformer Circuits Thread, 2022a. https://transformer-circuits.pub/2022/solu/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy Models of Superposition. Transformer Circuits Thread, 2022b.

Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Ruth Fong and Andrea Vedaldi. Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 8730–8738, 2018.

Dan Friedman, Alexander Wettig, and Danqi Chen. Learning Transformer Programs. arXiv preprint arXiv:2306.01128, 2023.

Atticus Geiger, Kyle Richardson, and Christopher Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. arXiv preprint arXiv:2004.14623, 2020.

Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.

Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. arXiv preprint arXiv:2303.02536, 2023.

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories, 2021.

Mario Giulianelli, Jacqueline Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. Under the Hood: Using Diagnostic Classifiers to Investigate and Improve how Language Models Track Agreement Information. arXiv preprint arXiv:1808.08079, 2018.

Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal Neurons in Artificial Neural Networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Stephen Jay Gould. The exaptive excellence of spandrels as a term and prototype. Proceedings of the National Academy of Sciences, 94(20):10750–10755, 1997.

Stephen Jay Gould and Richard C Lewontin. The Spandrels of San Marco and the Panglossian Paradigm: A Critique of the Adaptationist Programme. Conceptual Issues in Evolutionary Biology, 205:79, 1979.

Robert Gray. Vector quantization. IEEE Assp Magazine, 1(2):4–29, 1984.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 770–778, 2016.

Evan Hernandez, Belinda Z Li, and Jacob Andreas. Measuring and manipulating knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023.

John Hewitt, John Thickstun, Christopher D. Manning, and Percy Liang. Backpack Language Models, 2023.

Henrik Jacobsson. Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation, 17(6):1223–1263, 2005.

Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019.

Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.

Pentti Kanerva. Sparse distributed memory. MIT press, 1988.

Rohit Keshari, Richa Singh, and Mayank Vatsa. Guided Dropout. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4065–4072, Jul. 2019. doi: 10.1609/aaai.v33i01.33014065. URL https://ojs.aaai.org/index.php/AAAI/article/view/4302.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation, 2019.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, p. 2668–2677. PMLR, 2018.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

George Kingsley Zipf. Selected studies of the principle of relative frequency in language. Harvard university press, 1932.

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, p. 5338–5348. PMLR, 2020.

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. Advances in neural information processing systems, 19, 2006.

Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.

Ziming Liu, Eric Gan, and Max Tegmark. Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability, 2023.

Andreas Madsen, Siva Reddy, and Sarath Chandar. Post-hoc Interpretability for Neural NLP: A Survey. ACM Computing Surveys, 55(8):1–42, 2022.

Alireza Makhzani and Brendan J Frey. Winner-take-all autoencoders. Advances in neural information processing systems, 28, 2015.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584, 2019.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022a.

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022b.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models, 2016.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language Models Implement Simple Word2Vec-style Vector Arithmetic. arXiv preprint arXiv:2305.16130, 2023.

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.

Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33:17153–17163, 2020.

Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The Building Blocks of Interpretability. Distill, 2018. doi: 10.23915/distill.00010. https://distill.pub/2018/building-blocks.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23):3311–3325, 1997.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic books, 2018.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866, 2021.

Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno A Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural computation, 20(10):2526–2563, 2008.

David E Rumelhart, Geoffrey E Hinton, James L McClelland, et al. A general framework for parallel distributed processing. Parallel distributed processing: Explorations in the microstructure of cognition, 1(45-76):26, 1986.

David E Rumelhart, James L McClelland, PDP Research Group, et al. Parallel distributed processing. Foundations, 1, 1988.

Shibani Santurkar, Dimitris Tsipras, Mahalaxmi Elango, David Bau, Antonio Torralba, and Aleksander Madry. Editing a classifier by rewriting its prediction rules. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, p. 23359–23373. Curran Associates, Inc., 2021. URL https://proceedings.neurips.c/paper_files/paper/2021/file/c46489a2d5a9a9ecfc53b17610926d-Paper.pdf.

Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 3626–3633, 2013.

David Servan-Schreiber, Axel Cleeremans, and James McClelland. Learning sequential structure in simple recurrent networks. Advances in neural information processing systems, 1, 1988.

Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. In Alignment Forum, 2022. URL https://w.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj.

Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.

Alex Tamkin, Dan Jurafsky, and Noah Goodman. Language through a prism: A spectral approach for multiscale language representations. Advances in Neural Information Processing Systems, 33:5492–5504, 2020.

Simon Thorpe. Local vs. distributed coding. Intellectica, 8(2):3–40, 1989.

Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation Addition: Steering Language Models Without Optimization. arXiv preprint arXiv:2308.10248, 2023.

Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.

Eric Wong, Shibani Santurkar, and Aleksander Madry. Leveraging sparse linear layers for debuggable deep networks. In International Conference on Machine Learning, p. 11205–11216. PMLR, 2021.

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.

Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv preprint arXiv:2205.15480, 2022.

Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021.

Ting Zhang, Chao Du, and Jingdong Wang. Composite quantization for approximate nearest neighbor search. In International Conference on Machine Learning, p. 838–846. PMLR, 2014.

Jun Zhu and Eric P Xing. Sparse topical coding. arXiv preprint arXiv:1202.3778, 2012.

A AUTHOR CONTRIBUTIONS

AT served as the primary research contributor to the work. MT served as the primary engineering contributor. NDG provided feedback and advice throughout the project.

B GENERAL TRAINING AND OPTIMIZATION DETAILS

Here, we provide some additional training details relevant to all experiments.

Layer norm. We apply layer norm to the input activations of the codebooks, which we found improved accuracy and stability.

Optimizer hyperparameters. Unless otherwise specified, we use the Adam optimizer (Kingma & Ba, 2014) with learning rate 5e-4 and default values of $\beta_1 = 0.9$, $\beta_2 = 0.99$. For experiments using learning rate decay this refers to the peak learning rate; we spend 5% of training on a linear warmup to the max learning rate and the rest on a linear decay to 0.
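The warmup-then-decay schedule described above can be written as a small function of the step index. This is a minimal sketch rather than the authors' training code: the function name and arguments are hypothetical, and the exact handling of the boundary step (the peak is reached at the last warmup step) is an assumption.

```python
def learning_rate(step, total_steps, peak_lr=5e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr over the first warmup_frac of training,
    then linear decay to 0 at total_steps."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step <= warmup_steps:
        # Warmup phase: rises linearly from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay phase: falls linearly from the peak to 0 at the final step.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# With the 15k-step setting used for most experiments, warmup lasts 750 steps.
schedule = [learning_rate(s, 15_000) for s in (0, 375, 750, 7_875, 15_000)]
```

The five sampled steps correspond to the start of training, the midpoint of warmup, the peak, the midpoint of decay, and the final step.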
We did not find a benefit to using weight decay in our experiments. We also found no benefit to using k-means initialization of the codebooks.

Training hyperparameters. We train for 15k steps for most experiments. For the TinyStories datasets, we train for 100k steps. The sequence length for WikiText-103 is 1024, and for TinyStories it is 512. Depending on the model, we use a batch size of 64 to 256 and between 1 and 4 A100 GPUs. By default, codebooks have size $C$ = 10k unless otherwise specified.

C CODEBOOKS AS INFORMATION BOTTLENECKS

Codebooks are information bottlenecks: they limit the bits of information that can be transmitted from a given layer into the rest of the network. Intuitively, they force the network to represent its activations as a choice of $k$ distinct, unordered codes out of a vocabulary of size $C$. This fact enables us to compute the channel capacity, or number of bits the codebook can transmit each forward pass: $\left\lceil \log_2 \binom{C}{k} \right\rceil$.

In Table 5, we present the channel capacity of various codebooks of size 10,000 with values of $k \in \{1, 8, 100\}$. We also compare this with the channel capacity of a standard 16-bit activation with a hidden state of size 1024, as well as quantized 4-bit vectors. We observe that even the $k = 100$ case transmits far fewer bits than a 4-bit quantized 1024-dimensional vector.

Table 5: Comparison of information content for different information bottlenecks.

| Scenario | Bits Transmitted |
|---|---|
| 1024-dimensional 16-bit vector | 16384 |
| 1024-dimensional 4-bit vector | 4096 |
| 1 code from codebook of size 10,000 | 14 |
| 8 codes from codebook of size 10,000 | 91 |
| 100 codes from codebook of size 10,000 | 804 |

D FINITE STATE MACHINE EXPERIMENTS

This section presents additional details and experiments for the finite state machine (FSM) domain.

D.1 TOKFSM TRAINING HYPERPARAMETERS

We use a constant learning rate of 1e−3 with a batch size of 512 and train the models for 20,000 training steps.
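As a numerical check on the channel capacities reported in Table 5 of Appendix C, the formula $\lceil \log_2 \binom{C}{k} \rceil$ can be evaluated exactly with integer arithmetic. The sketch below uses hypothetical function names; it relies on the identity that, for a positive integer $n$, $\lceil \log_2 n \rceil$ equals the bit length of $n - 1$, which avoids floating-point rounding for very large binomial coefficients.

```python
import math


def codebook_capacity_bits(codebook_size, k):
    """ceil(log2(C choose k)): bits needed to name one unordered set of k codes."""
    n_choices = math.comb(codebook_size, k)
    # For a positive integer n, ceil(log2(n)) == (n - 1).bit_length().
    return (n_choices - 1).bit_length()


def dense_vector_bits(dimension, bits_per_element):
    """Bits carried by a dense vector with a fixed precision per element."""
    return dimension * bits_per_element


capacities = {k: codebook_capacity_bits(10_000, k) for k in (1, 8, 100)}
# capacities == {1: 14, 8: 91, 100: 804}, matching Table 5
```

Under these definitions, a 1024-dimensional vector carries 16384 bits at 16-bit precision and 4096 bits at 4-bit precision, so even the $k = 100$ codebook transmits roughly a fifth of the bits of the 4-bit vector.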
Note that the architecture used in Section 3 uses parallel attention and MLP blocks, following Black et al. (2022).

D.2 DEAD CODES

After training the models, we notice that many codes in the model do not activate at all on the eval set; we refer to these as dead codes, and the opposite as active codes (Yu et al., 2021). We report the number of active codes for each component of the k = 1 Attn+MLP codebook model in Table 7, computed over an evaluation set of 10240 samples of sequence length 128. While many codes end up dead, we find that starting training with fewer codes leads to worse accuracy than training with more codes than needed, suggesting some role for dead codes in the codebook optimization process.

D.3 ADDITIONAL OBSERVATIONS FROM ACTIVATING TOKENS

Although the causal intervention experiments in Section 4.1 are the strongest form of evidence we consider, we briefly overview a range of different types of codes we identify through qualitative observation:

• Codes in MLP layer 0 (the first MLP layer), which activate on each different token
• Codes in MLP layers 1, 2, and 3, which activate on bigrams corresponding to different states of the FSM (e.g., 42, 59, 29), only on the second digit of a state (state codes)
• Codes in MLP layers 1, 2, and 3, which activate on trigrams (e.g., 823, 182), only on the first digit of a state (state-plus-digit codes)
• In many cases, several different states (or state-plus-digits) activate the same code. In Appendix D.4, we show that these state groups have much more similar next-token distributions than average codes and provide potential interpretations for this phenomenon.
• Codes that activate on bigrams or trigrams, regardless of which digit they are present on
• Codes in several attention heads, which activate on states beginning with a specific digit (e.g., 51, 52, 53, ...)
• Codes that do not appear to fire on any discernible pattern.

From these points of anecdotal evidence, we make several broader observations:

1. The network learns codes that fire in association with useful high-level features of the input space, e.g., when a given FSM state is present.
2. Individual features are not necessarily isolated to a single point in the network; multiple places may represent the same piece of information, as Bau et al. (2020b) found in a computer vision context (footnote 8).
3. It is possible for the behavior of a given layer to be position dependent: the network can store different information in the same layer depending on the position in the sequence. For example, the same MLP layer may hold different information when the input token is the first digit vs. when it is the second digit of a state. Thus, absolute statements that certain layers or attention heads “store concept X” warrant caution, as this layer’s function could be contextually dependent.
4. Sometimes, the network forms representations that seem to admit a meaningful interpretation but do not immediately appear useful to the network. For example, it initially seems useless to have a code that activates based on states that share the same first digit (e.g., 51, 52, 53, ...) as these states are unrelated. It may be possible this code is used as part of a circuit to identify an FSM state in a future layer, or perhaps it is simply a vestigial or spandrel feature (Gould & Lewontin, 1979; Gould, 1997).

Footnote 8: We suspect it may be possible to detect these families of codes by computing co-occurrence statistics, but we leave this to future work.

Table 6: Example Code Activations for the TokFSM dataset. The bolded digits indicate the token positions that activated the given code. Hyphens (-) are added between each state for readability but are not presented to the model. MLP codes are written in the form layer.code-id, while attention codes are written in the form layer.head.code-id. More activations are available at https://huggingface.co/spaces/taufeeque/codebook-features.

| Code | Interpretation | Example Activations |
|---|---|---|
| MLP 0.2523 | 1 digit | 31-83-40-87-80-78-38-76-03-86-17-97-76-09-15; 10-57-62-43-92-31-83-82-23-65-94-33-23-49-41; 19-83-31-73-29-47-04-15-77-05-79-23-47-89-95 |
| MLP 1.2527 | 489 trigram (either pos.) | 86-04-89-80-17-03-40-74-24-09-93-35-59-61-49; 40-46-50-38-47-04-89-80-91-82-94-33-41-77-59; 18-94-55-55-48-24-68-48-90-43-97-50-74-77-59 |
| MLP 2.2543 | 40 bigram (2nd pos.) | 80-04-70-50-40-59-07-73-28-02-71-54-31-62-40; 74-05-13-72-95-66-52-31-98-20-88-40-59-22-19; 40-46-44-01-88-66-51-14-41-57-18-84-89-60-51 |
| Attn 1.2.3207 | Tokens after 44 bigram | 44-27-74-05-59-64-67-72-42-93-35-09-67-39-96; 44-27-74-05-22-65-98-75-83-20-00-60-80-57-94; 77-69-28-02-34-46-52-72-94-18-84-12-16-64-46 |
| Attn 2.0.3044 | Tokens on or after 59 | 74-05-59-64-67-72-42-93-35-09-67-39-96-07-96; 88-40-59-22-19-33-31-93-42-53-75-94-33-31-76; 87-14-40-59-24-72-86-04-30-04-81-56-01-17-30 |

D.4 ANALYSIS OF CODE PURITY IN THE FINITE-STATE-MACHINE MODELS

The TokFSM dataset from Section 3 was designed such that we know the exact number of features in the data, permitting us to understand how the representation of these features changes across the network. In Figure 8, we plot the fraction of codes that are pure at each layer, meaning they activate only on a single state (in the case of state codes) or state and first digit (in the case of state-plus-digit codes). We compute these statistics over all valid combinations of two- or three-digit starting sequences. We see very high levels of purity for both sets of codes. The high purity of the codes at the first layer demonstrates that codebook training has mostly resolved the superposition problem at the first layer. The code purity declines in higher layers as the model forms its prediction of the next token. Why is this? As Figure 9 demonstrates, when two different states activate the same code, they tend to have much more similar next-token distributions.
Specifically, the next-token distributions of trigram states that activate the same code (red bars) are much smaller than those of random pairs of trigram states (blue bars). This result suggests that states are merged when they share a similar next- token distribution. We speculate that codes merge later in the network as the network shifts from identifying the state to forming its prediction of the next token, as previous work has also speculated (Elhage et al., 2022a), Table 7:Number of active codes ink= 1attention + MLP codebook model trained on Tok- FSM. Each codebook has 10,000 codes; most of the codes in each codebook are not active by the end of training. LayerHead 0Head 1Head 2Head 3MLP 04045414911 12933676574601027 21482307111031499943 36902823151233247 D.5ABLATION EXPERIMENTS We perform several ablation studies to identify the importance of different elements of our training method. Specifically, we compare the next-token accuracies of several families of models, including the TinyStories one-layer model, the 4-layer TokFSM model, and the 24-layer wikitext model. For each model, we present the accuracies for 1) the attention codebook model presented in the paper, 18 5 1 25 10 25 100 25 1000 25 10k 25 100k 25 1μ 10μ 100μ 0.001 0.01 0.1 Rank Frequency Figure 7:Code activation frequencies appear to follow a power lawFrequency of code activations by rank from TinyStories 1-layer attention-only codebook model. The x-axis denotes the rank of the code in terms of frequency on a subset of the training set. We observe that most codes activate very rarely, while a long tail of codes activate very frequently. 0123 0 50 100 bigramtrigram Layer Pure Codes (%) Figure 8:Codebook training overcomes the superposition challenge in the first layer.We plot the fraction of codes which arepureat each layer, meaning they activate only on a single state (in the case of bigrams) or state + first digit (in the case of trigrams). 
We see very high levels of purity for both bigram and trigram models. Because the number of hidden states is 128, and there are 1000 trigram combinations for the model to learn, the network cannot allocate each state to a different neuron. The high purity of the codes demonstrates that codebook training has mostly resolved the superposition problem at the first layer. Code purity declines in higher layers as the model forms its prediction of the next token (see Figure 9). Experiment performed on the MLP codebooks of the k= 1Attn + MLP codebook TokFSM model over all 100 and 1000 possible combinations of the first two and three digits, respectively. 19 00.20.40.6 0 0.1 0.2 0.3 0.4 00.20.40.600.20.40.6 random trigram pairsmerged trigram pairs JS DivergenceJS DivergenceJS Divergence Probability ProbabilityProbability Layer 1Layer 2Layer 3 Figure 9:When two different states activate the same code, they tend to have much more similar next-token distributions.We find that the next-token distributions of trigram states that activate the same code (red bars) are much smaller than those of random pairs of trigram states (blue bars). This result suggests that states are merged when they share a similar next-token distribution. X-axis: Jenson-Shannon Divergence (JSD) between next-token distributions of different states. The JSD is a measure of the distance between probability distributions). 2) the same model but with a random initialization as opposed to the pretrained model, and 3) a codebook model where the model parameters were frozen and only the codebook parameters were trained, and 4) a model where only the codebook parameters were trained, and they were trained with only the autoencoding portion of the loss. The results of these experiments are presented in Table 8. Broadly, we find that all components are necessary for strong performance, although we do not exhaustively tune hyperparameters for each ablation. 
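The Jensen-Shannon divergence used as the x-axis of Figure 9 can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the base-2 logarithm, and the toy distributions are all assumptions made for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (log base 2 is our choice)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions for three states (illustrative numbers only).
state_a = [0.5, 0.5, 0.0, 0.0]
state_b = [0.5, 0.5, 0.0, 0.0]
state_c = [0.0, 0.0, 0.5, 0.5]

same = js_divergence(state_a, state_b)      # identical distributions
disjoint = js_divergence(state_a, state_c)  # non-overlapping support
```

Under this definition, identical distributions have divergence 0 and distributions with disjoint support reach the maximum (1 bit in base 2), so pairs of states merged into one code would be expected to sit near the low end of that range.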
Table 8: Ablation studies. Next-token accuracy (for TinyStories and WikiText-103) and next-state transition accuracies (for TokFSM) across various ablation studies. Legend: Attn CB: Codebook applied to the attention layers. Random Init: Codebooks applied to a randomly-initialized model instead of a pretrained model (then finetuned end-to-end as usual). Train Only CB: Train only the codebook layers with the original loss while keeping the base model frozen. Only AE Loss: Only apply the autoencoding loss to the codebooks; do not update the model parameters. Attn + MLP CB: Codebooks applied to the attention and MLP layers simultaneously.

Model | Attn CB | Random Init | Train Only CB | Only AE Loss
TinyStories-1L | 57.91 | 55.67 | 47.08 | 51.73
FSM-4L | 96.39 | 52.35 | 58.48 | 43.44
WikiText-103-24L | 46.16 | 38.53 | 31.22 | 28.35

E LANGUAGE MODEL EXPERIMENTS

E.1 1-LAYER TINYSTORIES MODEL

We train a small, 1-layer 21 million parameter transformer on the TinyStories dataset of children’s stories, constructed by prompting a language model (Eldan & Li, 2023). We train for 100k steps with a batch size of 96, with learning rate warmup of 5% and linear cooldown to 0. We start by loading the 21M pretrained model from the TinyStories paper (Eldan & Li, 2023). We train two models: one with codebooks affixed to each attention head, and one with codebooks affixed to both the attention heads and the MLP layers (Figure 2).

In Figure 7, we plot the distribution of code activation frequencies for the 1-layer TinyStories k = 1 Attn + MLP model. We find a very unequal distribution of use of the codebooks, with a small number of codes activated extremely frequently and many others activated hardly at all. This distribution is reminiscent of the Zipfian distribution known to characterize phenomena such as word frequency in natural language (Zipf, 1932).
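A rank-frequency curve of the kind plotted in Figure 7 can be computed from a stream of activated code ids. The sketch below is an assumption-laden illustration: the function name and the toy activation list are hypothetical, and it assumes one active code per position (k = 1).

```python
from collections import Counter

def rank_frequency(code_ids):
    """Return [(rank, relative_frequency)], with rank 1 being the most frequently activated code."""
    counts = Counter(code_ids)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, count / total) for rank, count in enumerate(ordered, start=1)]

# Hypothetical stream of activated code ids, one per token position (illustrative only).
activations = [7, 7, 7, 7, 3, 3, 9, 7, 3, 1]
curve = rank_frequency(activations)
```

Plotting `curve` on log-log axes is the usual way to check for the roughly linear trend that a power law would produce. Codes that never activate do not appear in the counter, so they would need to be added separately if the full codebook is to be shown.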
Table 9: Maximum inner product search algorithms can close much of the performance gap between codebook and traditional models. Performance comparison of models with different parameters. Computed on an A100 40GB GPU, with a batch size of 64 and over 100 batches.

(a) 70m Parameters
Model | Tok/s | ∆FAISS | ∆Base
Base | 57.5 | |
CB w/ FAISS | 37.4 | 34.2% | -34.9%
CB no FAISS | 27.9 | | -51.5%

(b) 410m Parameters
Model | Tok/s | ∆FAISS | ∆Base
Base | 14.8 | |
CB w/ FAISS | 7.2 | 56.2% | -51.5%
CB no FAISS | 4.6 | | -68.9%

E.2 24-LAYER WIKITEXT-103 MODEL

We also train a larger, 24-layer 410M parameter model on the WikiText-103 dataset, consisting of high-quality English-language Wikipedia articles. We finetune for 20,000 steps with a batch size of 24 and learning rate warmup and cooldown. For a pretrained model, we use the Pythia 410m parameter model, trained on the Pile dataset with deduplication (Biderman et al., 2023). The model has 16 attention heads, with a hidden size of 1024. We again train two variants of codebook models here, with codebooks on every attention head and codebooks on every MLP block.

E.3 COMPARING THE PERFORMANCE OF CODEBOOK AND BASE MODELS

Here, we provide more details on the models trained in Table 2. Most model names in the table are self-explanatory; for example, "MLP, k=100" indicates a model with codebooks on the MLP layers with a k of 100. The two exceptions are as follows:

Finetuned 160M (Wiki). The largest base language model we finetune is a 410M parameter 24-layer model from the Pythia series of models (Biderman et al., 2023), finetuned on the WikiText-103 dataset (Merity et al., 2016). To explore how much codebooks reduce the performance of language models, we also finetune the next smallest model in the series: a 160M parameter 16-layer model. As we see, the language modeling accuracy of the Attn k = 8 model is comparable to this smaller model, and the Attn k = 64 model falls squarely in between the 160M and 410M parameter models.
MLP, grouped 16 × (k=64). The MLP codebook layers broadly seem to attain lower performance than the attention layers. Moreover, we found diminishing returns to increasing the value of k for this layer. We observe that we can attain higher performance for these layers by splitting the MLP layer activations into several equal-sized chunks (16 in our case) and training a smaller codebook independently on each chunk, as in product quantization (Jegou et al., 2010). We refer to this method as "grouped codebooks."

All models except the grouped MLP codebook model are trained with the same hyperparameters. We found that the grouped MLP codebook model achieved 4-5% higher accuracy and trained more stably if we used a 10x higher learning rate on the codebook parameters than the default learning rate (which was used for the language model parameters). We suspect the combination of grouped codebooks and higher learning rates on the codebook parameters may be helpful when applying codebooks to higher-dimensional layers.

E.4 CODEBOOK MODELS STILL HAVE USABLE INFERENCE SPEED

The codebook modules at each attention head add parameters and computation to the model. While this results in higher latency, the resulting model is still usable for real-time inference. Moreover, inference can be sped up further through fast maximum inner product search (MIPS) algorithms, such as those implemented in FAISS, which are faster than computing the matrix multiplication explicitly (Johnson et al., 2019). In Table 9, we show that the codebook models show a significant decrease in the number of generated tokens per second (between 34% and 69% slowdown). However, this decrease is significantly smaller when FAISS is used. This loss of throughput may be acceptable in exchange for increased interpretability or control, and we expect further optimizations (e.g., approximate MIPS algorithms, custom kernels) to continue to close this gap.

Table 10: Example generations from language models. The prompts are highlighted in bold. While the factuality of the completions is unreliable for all models, all models generate largely grammatical text.

Language Model | TinyStories 1-Layer Model | WikiText-103 Model
Base | Once upon a time there was a little boy named Timmy. Timmy loved to play outside in the rain. He would jump in puddles and splash around. One day, Timmy saw a big puddle in the park. He jumped in it and got all wet. [...] | The war was fought against the Ottoman Empire and the Kingdom of Hungary. The Ottoman Turks, their king, and several of their princes were killed and many more captured, and the kingdom was divided among the Hungarian monarchs ; [...]
Codebooks (Attn) | Once upon a time, there was a little girl named Lily. She loved to play with her toys and her friends. One day, Lily’s mom told her that they were going to buy a new toy. Lily was very excited and asked, "Can I play with your toys, please?" [...] | The war was fought by France and the British Empire, and by the Axis powers. With the exception of the Italians and Americans, whose armies won the war against the Axis Powers, the victorious Allies suffered the most of the war, a terrible defeat on both fronts. [...]
Codebooks (MLP) | Once upon a time, there was a little boy named Timmy. Timmy loved to play with his toy cars and trucks. One day, Timmy’s mom took him to the store to buy a new toy. Timmy saw a big red truck and asked his mommy if they could get it, but she said they had to wait until they got to the store. | The war was fought between the United States and France. The French responded by launching an invasion of the Allied continent in June 1917 with the aim of defeating the Allied armies in northern France. [...]

E.5 EXAMPLE LANGUAGE MODEL GENERATIONS

We display example generations from both language models in Table 10.
E.6 ACTIVATING TOKENS

We present examples of activating tokens for both language models in Table 11.

E.7 ADDITIONAL NOTES ON NEURON-LEVEL INTERPRETABILITY EXPERIMENTS

We briefly note two caveats to this preliminary experiment. First, regular expressions are not perfect proxies for the features we care about (e.g., our regular expression for countries only includes some countries or ways of spelling each country). Thus, these precision scores likely underestimate each classifier’s true precision. Second, we note a potential bias in the experimental protocol due to developing the regular expressions for codes that admit a meaningful interpretation. This could result in a slight bias in favor of the code classifiers. However, we also exhaustively search over all 410 million neurons in the network to find the best performer, which mitigates this bias. The complete list of regexes we use is available in our codebase.

E.8 LANGUAGE MODEL STEERING EXPERIMENTS

We present additional language model steering results in Table 13.

Table 11: Example Code Activations for the TinyStories and the WikiText-103 dataset. The bolded word indicates the token positions that activated the given code. Note that the concept may be near but not directly at the activated token. MLP codes are written in the form layer.code-id, while attention codes are written in the form layer.head.code-id. At symbol (@) delimiters present in WikiText-103 data have been omitted for readability. More activations are available at https://huggingface.co/spaces/taufeeque/codebook-features.

(a) WikiText-103
Code | Interpretation | Example Activations
7.12.7884 | Months (after preposition) | at Toulon in August The ship began trials [...] and spent three weeks in September attached to 14 : 30 on 7 December. The division had the [...] a major attack until 8 December on August 31, a Utah [...] On September 1, 1987
4.15.6101 | Evaluative words | Initially , the New Zealand attack progressed well Superman from the main timeline is successfully teleported into only HWMs evaluated as "excellent" are used by NHC
1.9.295 | Names starting with 'B' | In one account from the Bahamas, a mating pair ascended while John and Roy Boulting noted that [...] Bockscar, sometimes called Bock’s Car, is the name of the United States Army Air Forces B-29 bomber
4.14.4742 | Years in 2000s | As of 2011, the International Shark Attack File lists In 2014, a study at the University of Amsterdam with Fabian Cancellara kicked off his 2010 campaign with an overall victory at the Tour of
9.3.3727 | Square Units | Atlanta encompasses 134.0 square miles (347.1 km2) it covered more than 55 square metres (590 sq ft) 6 percent or 101,593 square kilometres (39,225 sq mi) of [...]

(b) TinyStories
Code | Interpretation | Example Activations
0.2 | Fighting | The two cats started to quarrel loudly over the bone They ran around the house, fighting over the thread But then, they got into a fight over who got to play with the toy
0.3 | Negative emotions | He feels angry and scared. He tries to catch the boat, but it She started to feel nervous because she thought she wouldn’t be able to Lily and Tom felt fearful. They did not like storms.
0.6 | "You" dialogue | The dragon smiled and said, "You are too small. It’s not possible." The happy fish thanked her and said "You must be very persistent to complete this task. John smiled and said, "You won! You were really fast."
1.2 | Fire | The fire spread to the cans and bottles and made more explosions. The garage was full of smoke Lily knew that fire could be dangerous and she always remembered to be careful when playing with matches or lighters. Mom hugged them and said, "I know, but fire is not a toy. It can hurt you and the plants and animals.
5.3 | Discovered/found | Lily found a delicate flower in the garden and showed it to her sister. had discovered an amazing reef and helped a turtle in need. One day, Tom and Mia found a ball in the hut.
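The regular expressions listed in Table 12 act as proxies for whether a generation mentions a topic. One plausible way to turn such a pattern into a steering score is the fraction of generations with at least one match. The sketch below makes that assumption explicit: the function name, case-insensitive matching, and the sample texts are hypothetical, and the paper's exact scoring procedure may differ.

```python
import re

# Football pattern as listed in Table 12 (a); case-insensitive matching is our assumption.
FOOTBALL = re.compile(
    r"football|soccer|goal|stadium|fifa|player|trophy|league", re.IGNORECASE
)

def topic_rate(generations, pattern):
    """Fraction of generations containing at least one regex match."""
    if not generations:
        return 0.0
    hits = sum(1 for text in generations if pattern.search(text))
    return hits / len(generations)

# Hypothetical generations, for illustration only.
samples = [
    "The team moved to a new stadium in 1946.",
    "The war was fought on two fronts.",
    "He joined the League as a player.",
    "The novel was published in London.",
]
rate = topic_rate(samples, FOOTBALL)
```

Because the patterns match substrings, a term such as "goal" would also fire inside unrelated words, which is one reason regex-based scores are only a rough proxy, as noted in E.7.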
Table 12: Regular expressions used to measure topic steering for the text generated by the models.

(a) Wikitext
Topic: Regex
Football: football|soccer|goal|stadium|fifa|player|trophy|league
Movie: movie|tv|television|film|media
Video Game: game
Song: song|music|mtv

(b) TinyStories
Topic: Regex
Dragon: dragon
Slide: slide
Friend: friend
Tom & Sam: tom|sam
Flower: flower
Fire: fire
Baby: baby
Princess: prince|crown|king|castle

Table 13: Steering the TinyStories language model with topic codes. While the language model generations are often incoherent, the topics are largely introduced in grammatical ways.

Topic | Codes | Steered generations
Null (default generation) | - | Once upon a time, there was a little girl named Lily. She loved to play with her toys and eat yummy food. One day, she went to the store with her mommy to buy some apples. She saw a big basket and wanted to buy it. Her mommy said, "Let’s count the apples and buy the yummiest apples!" Lily was so happy and said, "Yes, please!"
Dragon | 8 | Once upon a time, there was a little girl named Lily. She was very excited to go outside and explore. She flew over the trees and saw a big, scary dragon. The dragon was very scary. But Lily knew that things were not real and she would hurt her.
Slide | 12 | Once upon a time, there was a big cake. He wanted to escape and explore the world, but he was scared. He thought and thought about how to use the slide. He asked his mom for some help.
Friend | 9 | Once upon a time, there was a little girl named Lily. She had a friend named Sue. Sue was a nosy cat. She liked to play with her brother Tom. One day, they went to the park and saw a big dog. The dog was not nice.
Flower | 8 | Once upon a time, there was a little girl named Lily. She liked to pick flowers in the meadow. One day, she saw a big, green caterpillar on a leaf. She wanted to take it home and sell it to someone else. As she picked the flower, it started to bloom and made it look pretty.
Fire | 16 | Once upon a time, there was a little boy named Timmy. Timmy loved his new toy. He always felt like a real fireman. One day, Timmy’s mom made him some hot soup and gave him some medicine to help his mommy feel better. Timmy was scared that the fire would be gone, but he didn’t feel happy.
Baby | 15 | Once upon a time, there was a little girl named Lily. She loved going to the gym with her mommy. One day, Lily’s mom asked her to help put the baby in the crib.
Princess | 14 | Once upon a time, there was a little bird named Tweety. One day, the princess had a dream that she was invited to a big castle. She was very excited and said, "I want to be a princess and ride the big, pretty castle!"

Table 14: Steering language model generations with topic codes. Full table available at link. At symbol (@) delimiters present in WikiText-103 data have been omitted for readability.

Topic | Codes | Original generations | Steered generations
Video game | 18 | The war was fought on two fronts. The war was initiated in 1914 between Austria-Hungary and Serbia, when the Entente Powers signed a treaty of friendship between the two countries. In October 1914, Tschichky was sent to defend the German Empire’ | The war was fought on both sides, and was only the second game to deal with one-on-one battles, following SimCity 2D Blade I. The game was released to critical acclaim, with praise particularly directed to the new console
Football | 18 | The war was fought on two fronts. The war was initiated in 1914 between Austria-Hungary and Serbia, when the Entente Powers signed a treaty of friendship between the two countries. In October 1914, Tschichky was sent to defend the German Empire’ | The war was fought in its first forty years. In the summer of 1946, the Cardinals of the All-America Football Conference (AAFC) were rapidly becoming the favorites for NFL Hall-of-Fame coach Jim Mora, who had
Movie | 12 | The novel was published in November 2009 by MacChinnacle, a London publishing house. The book’s publishers, Syco, published the book in the United Kingdom and the United States on 1 November 2009. The book received generally positive reviews from critics, who praised the | The novel was published in the United States and Canada. The film was directed by Joe Hahn and stars Steven Spielberg as Lucas, Neil Patrick Harris, and Jude Lawder as Lucas’s best friend, Jonathan Miller. The plot follows a character (Lucas
Song | 17 | The team won their first ever Grand Prix and the first since the 1990 season. The team finished in third place behind Williams and Ralf Schumacher, with the Ferraris of David Coulthard and Jarno Trulli finishing in the top three. | The team won the Grammy Awards for Best Gospel Album. = = Background = = In 2004, The Dream released their third studio album, The Beacon Street Collection, which produced the singles "HOV Lane" and "Wishing Machine

F EXTENDED DISCUSSION OF RELATED WORK

In this section, we review related work and attempt to describe in more detail the design decisions behind codebook features and how these lead to different tradeoffs compared to other approaches. We focus on several subareas most relevant to our current work, with a particular focus on dictionary learning methods, leaving more general overviews of interpretability research to prior surveys (Rogers et al., 2021; Bommasani et al., 2021; Madsen et al., 2022).

F.1 SPARSE CODING AND SPARSE DICTIONARY LEARNING

Sparse coding, also known as sparse dictionary learning, is a well-studied research area with applications in machine learning, neuroscience, and compressed sensing (Kanerva, 1988; Olshausen & Field, 1997; Lee et al., 2006; Candes et al., 2006; Donoho, 2006; Rozell et al., 2008). The typical objective in sparse coding is to learn a fixed set of vectors, known as atoms or dictionary elements; given this set of vectors, one should be able to represent a given input as a sparse linear combination of these vectors.
Sparse coding methods have been applied to various problems in machine learning, including in computer vision (Elad & Aharon, 2006) and natural language domains (Zhu & Xing, 2012; Arora et al., 2018).

Dictionary learning methods have recently seen renewed interest as an interpretability approach for neural networks (Yun et al., 2021; Wong et al., 2021). One reason for this is the superposition problem: to represent more feature directions than neurons, some neurons will be activated for multiple different features (Yun et al., 2021; Elhage et al., 2022b). For example, one family of approaches trains a wide autoencoder with a sparsity penalty. The width of the autoencoder is made greater than the size of the input activations (producing an overcomplete basis); by regularizing the activations of the autoencoder to be sparse, the dimensions of the autoencoder appear to correspond to more disentangled features (Yun et al., 2021; Sharkey et al., 2022; Bricken et al., 2023; Cunningham et al., 2023).

Codebook features share important similarities with dictionary learning approaches: for example, both approaches learn a codebook of elements larger than the number of input neurons and attempt to activate a small fraction of that basis on each forward pass. However, a significant conceptual difference between codebook features and dictionary learning is their implicit choice of how features are represented inside of neural networks:

F.1.1 FEATURES-AS-DIRECTIONS

Recent dictionary learning approaches typically start from an assumption we might call features-as-directions: features the network learns are represented as continuous vectors along a direction in activation space. This assumption is substantiated by prior work on interpretability (Kim et al., 2018; Olah et al., 2018), and has the benefit that the magnitude of the vector along that direction corresponds to the strength of the feature or the probability of the feature existing in the data.
However, the features-as-directions assumption also faces some challenges:

A direction can hold multiple features. First, a single direction can theoretically represent multiple distinct features. For example, the positive and negative magnitudes of a direction could each hold a different (mutually exclusive) feature, which could be extracted by outgoing weights of 1 and −1, respectively, in combination with a ReLU activation. More complex encodings of multiple features within a single direction are possible with bias terms and activation functions. For example, a network could detect whether a feature along direction x has low, medium, or high magnitude by computing softmax(x, 2x−1, 5x−7); the first dimension is greatest when x < 1, the second when 1 < x < 2, and the third when x > 2.

Continuous features can be challenging to interpret. Second, the continuous and graded nature of feature directions can make them challenging to interpret: does an increase in the magnitude of one feature mean the network is more confident the feature is present, or merely that the feature is stronger in the input? If an input activates a feature at magnitude 0.52, or more strongly than in 90% of inputs, does this mean the feature is present? The same factors also make it challenging to compare the strengths of different features without understanding how the network weights process each of them.

Smuggling of information. Another difference between codebook features and dictionary learning approaches is the contrast between soft and hard sparsity. Recent dictionary learning approaches train an L1-regularized autoencoder (Sharkey et al., 2022). This method causes the hidden activations of the autoencoder to have a small number of entries with a high magnitude but does not force the model to set the other features to be exactly zero.
Past work has suggested that important information can be "smuggled" via low-magnitude activations (Elhage et al., 2022a), making it challenging to be confident that the interpretable features found by a dictionary learning approach are fully capturing the information a network is detecting in the input.

F.1.2 FEATURES-AS-POINTS

In contrast, codebook features embody a view of features-as-points. For example, an activated code is simply a vector of fixed magnitude that is added to the output of the codebook layer. This design avoids many of the challenges in the previous subsection. For example, a single point can only hold one bit of information, indicating the presence or absence of some feature, avoiding the challenges of holding multiple features and graded interpretations. Similarly, because the weight of non-activated codes is zero, the network cannot smuggle information through them. However, there are several reasonable concerns one might have about features-as-points:

Multiple codes per feature. First, the network could hypothetically encode more complex features via complicated combinations of codes instead of assigning one feature to each code. For example, codes 1 and 2 together might represent happiness, while codes 1 and 3 together might represent cars. However, the simplicity of how the codes are chosen (by cosine similarity) makes it challenging to select codes with much complexity. Furthermore, similar concerns present themselves for continuous dictionary learning approaches where complex features are encoded via combinations of directions.

Multiple features per code. Second, the reverse failure mode might present itself: the model might still encode multiple features per code. Indeed, we have discussed certain cases where this is true, for example, in Sections 3 and 4.
While some of this may be improved by choosing a larger codebook size or enabling the number of active codes k to vary based on the input and position, it is unclear whether these approaches will solve the problem. Of course, as noted above, features-as-directions approaches may also suffer these failure modes.

Lack of gradedness. Third, one might worry that features-as-points cannot express the graded, continuous nature of many real-world features, such as sentiment. We share this concern; however, we note that there are mechanisms for expressing gradedness with discrete codes. For example, the network might choose to activate multiple codes in a given position or nearby positions or allocate different codes to different levels of the gradation. Furthermore, the strong language modeling performance of the codebook models suggests that the model can accomplish its task well despite this discrete constraint.

F.2 ADDITIONAL BENEFITS AND TRADEOFFS OF CODEBOOK FEATURES

We list two additional differences between codebook features and dictionary learning approaches:

Modification of the original network. Dictionary learning approaches are typically trained off of a frozen network. By contrast, in codebook features, the pretrained network is typically finetuned to achieve high performance on the task with the codebook bottleneck. This training means we are interpreting a new network rather than the original one. Furthermore, the performance of this network is often slightly lower than the pretrained network, which is another tradeoff.

Improved efficiency. Because codebook features use hard sparsity, only one large matrix multiplication is necessary (to compute similarity scores with each element of the codebook).
In contrast, a second large matrix multiplication may be needed by some sparse autoencoder approaches to do a full weighted sum over all C dictionary elements rather than over k ≪ C elements chosen from the codebook, though activations such as ReLU may mitigate this problem to some degree. Furthermore, as we show in Appendix E.4, hard sparsity enables us to use libraries such as FAISS to replace the first matrix multiplication as well, further increasing efficiency.

F.3 MECHANISTIC INTERPRETABILITY

Researchers have long attempted to extract concepts, rules, and algorithms from neural networks. For example, a line of work since the late 1980s attempted to extract rules and finite automata from neural networks, especially recurrent neural networks (RNNs) (Servan-Schreiber et al., 1988; Elman, 1990; see Jacobsson, 2005 for a review). A core challenge noted in these works is that neural networks use distributed representations (Rumelhart et al., 1986; 1988; Thorpe, 1989). This form of representation enables networks to represent more concepts than hidden units, at the expense of each unit no longer being interpretable (Elman, 1990). Thus, individual hidden units may not correspond to interpretable concepts, and a holistic analysis of the entire vector may be necessary to extract such structures (Servan-Schreiber et al., 1988; Elman, 1990; Jacobsson, 2005).
Recent work has attempted to revitalize this goal for today’s much more expressive networks, attempting to detect concepts (Alain & Bengio, 2016; Kim et al., 2018; Olah et al., 2018; Goh et al., 2021; Bau et al., 2020b) and algorithms (Giulianelli et al., 2018; Clark et al., 2019; Olah et al., 2020; Bau et al., 2020a; Geiger et al., 2021; Geva et al., 2021; Elhage et al., 2021; Olsson et al., 2022; Wang et al., 2022; Chan et al., 2022; Friedman et al., 2023) inside of models, with many works focusing specifically on the challenges of neurons that fire on multiple concepts (Fong & Vedaldi, 2018; Olah et al., 2020; Mu & Andreas, 2020; Elhage et al., 2022b; Geiger et al., 2023), sometimes termed superposition (Olah et al., 2020).

Our work shares similar goals with the above works. Codebook features attempt to make identifying concepts and algorithms more manageable inside networks by refactoring their internal representations into a sparse and discrete form that is easier to understand and manipulate. We also discover one instance in Section 3 where codebooks represent more features than there are neurons, circumventing the superposition problem.

F.4 INTRODUCING DISCRETE STRUCTURE INTO NEURAL NETWORKS

A range of works attempts to introduce discrete bottlenecks or structures into neural networks (Makhzani & Frey, 2015; Andreas et al., 2016; Keshari et al., 2019; Buch et al., 2021; Mao et al., 2019; Liu et al., 2023). Most saliently, vector quantization (VQ; Gray, 1984) is a classical technique in signal processing that was applied most prominently in machine learning through VQ-VAE (van den Oord et al., 2017) for use in autoencoder networks. By contrast, our method applies vector quantization to each hidden layer of any neural network (including autoregressive language models), enabling better understanding and control of the network’s intermediate computation.
Our grouped codebook method additionally employs product quantization (Jegou et al., 2010), an extension of vector quantization to multiple codebooks whose outputs are concatenated. Finally, our k > 1 models leverage ideas very similar to composite quantization (Zhang et al., 2014), where vectors from multiple codebooks are aggregated to represent the network; in our setting, it is the top-k vectors of the same codebook which are aggregated.

Another line of work introduces structured bottlenecks into training for interpretability and control. For example, concept bottlenecks (Koh et al., 2020) directly supervise an intermediate state of the network to align to a set of known features, while post-hoc concept bottlenecks (Yuksekgonul et al., 2022) enable transferring known features from another source (e.g., a multimodal model). In contrast to these methods, the concepts learned by the codebook are discovered emergently by the network as part of the training process. Another related work, Backpack Language Models (Hewitt et al., 2023), generates predictions by computing a set of weights over previous tokens; the next token is then predicted through a weighted sum of learned sense vectors associated with those tokens. By contrast, codebook features are applied to the hidden states of a neural network and facilitate better understanding and control of the network via a sparse, discrete representation.

F.5 EDITING OR STEERING NEURAL NETWORKS

Various methods attempt to control, edit, or steer the behavior of trained neural networks.
A natural approach is to finetune the network on labeled data (Sermanet et al., 2013), though this process can be time- and resource-intensive and may distort the model’s other capabilities. Prompting a model with natural language instructions (Brown et al., 2020) or control tokens (Keskar et al., 2019) is a lightweight steering method that overcomes some of these difficulties; however, not all models are promptable, and there may be instances where prompting is insufficient to ensure the model performs the desired behavior. In addition, a stream of work focusing on model editing makes targeted edits to concepts or decision rules inside of neural networks with a small number of examples (Bau et al., 2020a; Santurkar et al., 2021; Mitchell et al., 2021; Meng et al., 2022a;b).

Most related to our work, several recent works perform post-hoc steering of networks in ways that do not require per-edit optimization (Merullo et al., 2023; Hernandez et al., 2023; Turner et al., 2023) by adding vectors of different magnitudes to different layers in the network. Our work attempts to support the aims of such work by producing a sparse, discrete, hidden representation inside of networks. This representation makes it easier to localize behaviors inside the network (so that the user does not have to exhaustively perform interventions at every layer of the network to find the most effective intervention site) and makes it easier to perform the intervention by substituting codes (so the user does not have to try many different magnitudes of a given steering vector at each layer).

G EXTENDED DISCUSSION OF APPLICATIONS, SIGNIFICANCE, AND FUTURE DIRECTIONS

G.1 USES FOR CODEBOOK FEATURES

While we primarily explore codebook features on transformer language models, our method is modality agnostic and can be applied to neural networks trained on any combination of modalities.
We envision several different use cases for codebook features in such diverse contexts:

Identifying phenomena in complex data. Codebook features is an unsupervised method for discovering different latent features inside models. This method could be useful in situations where brainstorming novel kinds of features in data may be helpful for research. For example, codebook features could potentially help uncover new protein, genomic, or medical imaging data features by observing token activations and seeing what the examples all have in common.

Feature detection. In many applications, it is helpful to count the number of times a particular feature occurs or raise an alert when it does. While it may be more effective in many cases to collect a labeled dataset and train a classifier for a particular feature, codebook features are ready-made for this task and may enable faster iteration and experimentation.

Counterfactual explanations. One way of explaining a model’s decision is via a counterfactual: would the model’s decision change if this feature changed? While these counterfactuals often occur at the input level, codebooks enable counterfactual explanations at the hidden feature level.

Steering models. Finally, as explored in Sections 3.2 and 4.1, codebook features can be used to steer the complex generations of models. We anticipate the flexibility of this method to improve as codebook features are better understood.

G.2 WHAT THIS SAYS ABOUT TRANSFORMER COMPUTATION

As seen in Table 5, codebooks enforce a strong information bottleneck between layers. We find it surprising that neural networks can operate amidst such a strong information constraint; this suggests that the underlying computation happening inside these networks is or can be made sparse along a set of understandable features.
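
As a rough illustration of the scale of this bottleneck, the sketch below counts bits under simplifying assumptions. It treats a codebook layer's output as an unordered set of k active codes drawn from C codes, which can carry at most log2(C choose k) bits, and compares that with a dense float32 hidden state of dimension d. The function names and the values of C, k, and d are hypothetical and chosen only for illustration; they are not taken from Table 5.

```python
import math

def codebook_bits(num_codes: int, k: int) -> float:
    """Upper bound on the bits carried by an unordered set of k active codes."""
    return math.log2(math.comb(num_codes, k))

def dense_bits(hidden_dim: int, bits_per_value: int = 32) -> int:
    """Raw storage size of a dense hidden state (float32 by default)."""
    return hidden_dim * bits_per_value

# Illustrative values only (assumptions, not the paper's settings).
C, k, d = 10_000, 8, 1024

sparse = codebook_bits(C, k)
dense = dense_bits(d)
print(f"codebook: {sparse:.1f} bits, dense: {dense} bits")
```

With these assumed values the codebook representation carries under 100 bits per token per layer, against 32,768 bits of raw storage for the dense state. The dense figure is a storage size rather than a measure of the information the network actually uses, so the comparison is only indicative.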
G.3 FUTURE WORK

We see several exciting directions for future work:

Understanding circuits and weights. Past work has investigated circuits in vision models, where more complex features are built up out of smaller features (see Appendix F for a full overview). The sparse and discrete nature of codebooks may make it far easier to identify such circuits, including in language models, due to the smaller number of possible relationships between components across layers. The discrete nature of codebooks also makes it easier to compute which codes tend to fire together across layers without the added complexity of accounting for continuous-valued neurons or feature directions. Understanding the relationship between activations across a single layer may also enable a better understanding of the weights of that layer, as these determine the input-output relationship the layer must produce.

Understanding adversarial examples. In computer vision, adversarial examples are small perturbations added to images that cause the network to misclassify them; for example, misclassifying a cat as a dog (Goodfellow et al., 2014). Codebooks enable identifying which codes in the network shifted to produce that change in decision: for example, was a cat ear feature changed to a dog ear feature? The discrete nature of codebook activations may also enable better defenses against adversarial attacks.

Improving interpretability in larger models. While we found that single-layer codebook models produced codebooks where the majority of codes had a comprehensible interpretation, in larger models, there were many codes where this was not the case. Future work might consider training models with even larger codebooks to capture the greater number of features the models represent. Future work might also consider using co-occurrence statistics of code activations to investigate whether there are codes that routinely fire together and may represent a single feature in tandem.
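
One simple way such co-occurrence statistics could be computed is sketched below. It assumes the active top-k code IDs for each token at a single layer are already available as lists of integers, counts how often each pair of codes is active on the same token, and scores pairs with the Jaccard index. The helper names, the choice of Jaccard as the score, and the toy activations are assumptions made for illustration and are not part of the paper's method.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(active_codes_per_token):
    """Count single-code and unordered code-pair activations over tokens."""
    single = Counter()
    pair = Counter()
    for codes in active_codes_per_token:
        unique = sorted(set(codes))  # sorted so each pair is stored as (low, high)
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair

def jaccard(single, pair, a, b):
    """Fraction of tokens activating a or b on which both are active."""
    a, b = sorted((a, b))
    both = pair[(a, b)]
    either = single[a] + single[b] - both
    return both / either if either else 0.0

# Hypothetical top-k code IDs for five tokens at one layer (k = 3).
activations = [
    [12, 40, 7],
    [12, 40, 99],
    [12, 40, 3],
    [5, 99, 7],
    [12, 40, 7],
]

single, pair = cooccurrence_counts(activations)
print(jaccard(single, pair, 12, 40))  # 1.0: the two codes always fire together
print(jaccard(single, pair, 7, 99))   # 0.25: they share one of four tokens
```

Pairs with a score near 1 are candidates for codes that jointly represent a single feature, although a high score on its own does not establish that and would need to be checked against the tokens on which the codes fire.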
Understand shared representations across domains and modalities. Recent work has shown generalization across distributions: for example, multimodal models contain neurons that fire on concepts (e.g., spiderman) in both text and image form (Goh et al., 2021), and language models trained on multiple languages can generalize zero-shot from one language to another (Johnson et al., 2017). Codebooks may enable tracing exactly how and where these features are integrated across the network.