
Paper deep dive

Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Rhys Gould, Euan Ong, George Ogden, Arthur Conmy

Year: 2024 · Venue: ICLR 2024 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 75

Models: GPT-2, Llama-2, Pythia (31M-12B)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 8:01:15 PM

Summary

The paper introduces 'successor heads' in large language models (LLMs): attention heads responsible for incrementing tokens in ordinal sequences (e.g., 'Monday' to 'Tuesday'). Through mechanistic interpretability, the authors identify abstract 'mod-10' features in the outputs of the first MLP layer (MLP0) that encode numeric indices, and demonstrate that these representations are compositional and recur across different model architectures and scales (31M to 12B parameters).

Entities (6)

GPT-2 · language-model · 100%
LLaMA 2 · language-model · 100%
Mechanistic Interpretability · field-of-study · 100%
Pythia · language-model · 100%
Successor Heads · model-component · 100%
Mod-10 Features · feature · 95%

Relation Signals (3)

Pythia contains Successor Heads

confidence 100% · We analyze the behavior of successor heads in large language models (LLMs) ... such as GPT-2, Pythia, and Llama-2.

Mod-10 Features underlie Successor Heads

confidence 95% · We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs

Successor Heads implements Abstract Numeric Representations

confidence 90% · find that they implement abstract representations that are common to different architectures

Cypher Suggestions (2)

Identify features associated with successor heads · confidence 95% · unvalidated

MATCH (f:Feature)-[:UNDERLIE]->(h:ModelComponent {name: 'Successor Heads'}) RETURN f.name

Find all language models that contain successor heads · confidence 90% · unvalidated

MATCH (m:LanguageModel)-[:CONTAINS]->(h:ModelComponent {name: 'Successor Heads'}) RETURN m.name

Abstract

Abstract: In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

74,815 characters extracted from source content.


Preprint. Under review.

SUCCESSOR HEADS: RECURRING, INTERPRETABLE ATTENTION HEADS IN THE WILD

Rhys Gould¹, Euan Ong¹, George Ogden¹, Arthur Conmy²
¹University of Cambridge, ²Independent
Correspondence to rg664@cam.ac.uk

ABSTRACT

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment 'Monday' into 'Tuesday'. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has struggled to find recurring, mechanistically interpretable language model components beyond small toy models. Further, existing results have led to very little insight to explain the internals of larger models that are used in practice. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of 'mod-10' features¹ that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on models' training data, finding that successor heads are important for the model getting low loss on examples of succession in this dataset. Finally, we interpret some of the other tasks these polysemantic heads perform and discuss the implications of our findings.

1 INTRODUCTION

Figure 1: A successor head with OV matrix W_OV takes a numbered token such as 'Monday' in embedding space and maps it to its successor value in unembedding space, e.g. 'Tuesday'. The circuit is the simple composition of the embedding matrix, the first MLP block, a single attention head, and the unembedding matrix.

Mechanistic interpretability (Olah, 2022) is the process of reverse-engineering the algorithms that trained neural networks have learned. Recently, much attention has been paid to interpreting transformer-based large language models (LLMs), as these models have demonstrated impressive capabilities (OpenAI, 2023) but there is little understanding of how these models produce their outputs. Existing interpretability research includes comprehensive reverse-engineering efforts into toy models (Nanda et al., 2023) and small language models (Olsson et al., 2022; Wang et al., 2023), though few insights have been gained about how frontier LLMs function.

In mechanistic interpretability, universality (Olah et al., 2020; Li et al., 2016) is a hypothesis that there are common representations in neural networks. The universality hypothesis asserts that neural networks with different architectures and scales form common internal representations. Strong evidence for (or against) the universality hypothesis could significantly affect research priorities in interpretability.

¹ In this work, we use 'feature' to mean an interpretable (linear) direction in activation space, inspired by the second 'potential working definition' from Elhage et al. (2022).
If common representations form across different language models and tasks, then research on small or toy language models (Elhage et al., 2022; 2021) and narrow tasks (Wang et al., 2023; Heimersheim & Janiak, 2023; Hanna et al., 2023) may be the best way to gain insights into LLM capabilities. Conversely, if the representations used by language models do not generalize to different model scales and/or tasks, then developing methods that can be applied to larger language models and don't rely on lessons from small models generalizing (such as Wu et al. (2023), Bills et al. (2023), Conmy et al. (2023)) may be the most fruitful direction for interpretability.

In this work, we find an interpretable set of attention heads we call successor heads in models of many different scales and architectures. Successor heads are attention heads that perform incrementation in language models. The input to a successor head is the representation of a token in an ordinal sequence such as 'Monday', 'first', 'January', or 'one'. The output of a successor head assigns a higher likelihood to the incremented token, such as 'Tuesday', 'second', 'February', or 'two'. In our work, we find evidence for a weak form of universality (Chughtai et al. (2023); points 1 and 2) in finding successor heads across different models, as well as finding that numeric representations in language models are compositional (point 3), as:

1. Successor heads form across language models from the scale of 31 million parameters to at least as many as 12 billion parameters.
2. Successor heads form across models with different architectures, including Pythia, GPT-2 and Llama-2 (Touvron et al., 2023).
3. Language models use abstract numeric representations to encode the index of these tokens within their ordinal sequence; these representations exhibit compositional structure, such as mod-10 features.

Our contributions can be summarised as follows:

1. Introducing and interpreting successor heads (Sections 2-3)
(a) We introduce and explain successor heads, and show that they occur in language models across almost three orders of magnitude of model parameter count.
2. Finding abstract numeric representations in language models (Section 3)
(b) We isolate a common numeric subspace within embedding space that, for any given token (e.g. 'February'), encodes the index of that token within its ordinal sequence (e.g. months).
(c) We find evidence for interpretable, abstract features within successor heads' numeric inputs: an unsupervised decomposition of token representations yields a crucial set of features we call the mod-10 features f_0, ..., f_9. f_n is present in all tokens whose numerical index ≡ n (mod 10), e.g. f_2 is present in the model's representations of '2', '32', '172', 'February', 'second' and 'twelve'.
(d) We steer the semantics of successor heads' numeric inputs with vector arithmetic.
3. Showing that the succession mechanism is important in the wild (Section 4)
(e) Finally, we show that successor heads play an important role in incrementation-based tasks in natural language datasets – for instance, predicting the next number in a numbered list of items.

2 SUCCESSOR HEADS

LLMs are able to increment elements in an ordinal sequence. For instance, Pythia-1.4B will complete the prompt "If this is 1, the next is" with " 2", and the prompt "If this is January, the next is" with " February". Given this observation, we find evidence for attention heads within LLMs (which we refer to as successor heads) responsible for performing this type of incrementation.
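As a quick sanity check, the behavioral observation above is easy to reproduce with off-the-shelf tooling. The sketch below (ours, not the paper's evaluation code) greedily decodes one token from Pythia-1.4B via the HuggingFace transformers library; exact completions may vary across checkpoints and tokenizations.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")

prompt = "If this is January, the next is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]        # next-token logits
print(tok.decode(logits.argmax().item()))    # expected: ' February'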
To get evidence for successor heads we require three definitions: (i) the succession dataset of tasks involving abstract numeric representations, (ii) an effective OV circuit to measure how attention heads affect model outputs, and finally (iii) the successor score.

The succession dataset is the set of tokens across eight different tasks that can be incremented:²

Task | Tokens
Numbers | '1', '2', ..., '199', '200'
Number words | 'one', 'two', ..., 'nineteen', 'twenty'
Cardinal words | 'first', 'second', ..., 'tenth'
Days | 'Monday', 'Tuesday', ..., 'Sunday'
Day prefixes | 'Mon', 'Tue', ..., 'Sun'
Months | 'January', 'February', ..., 'December'
Month prefixes | 'Jan', 'Feb', ..., 'Dec'
Letters | 'A', 'B', ..., 'Z'

Table 1: Tokens in the succession dataset.

² The days and months tasks are special as the final tokens in these classes ('Sunday' and 'December') have cyclical successors ('Monday' and 'January'). We don't consider the end tokens of the other tasks to have cyclical successors.

We also include different forms of these tokens, as language model tokenizers often have several tokens corresponding to the same word (e.g. words with/without a space at the start being different tokens). Full details of our dataset can be found in our open-sourced experiments.³

³ Available soon at https://github.com/euanong/numeric-representations/blob/main/exp_numeric_representations/model.py#L19

Notation. For consistency with prior work we follow all of Elhage et al. (2021)'s notation choices, though the following definitions are sufficient and self-contained for this work regardless. Transformer language models use an embedding matrix W_E ∈ R^{d_model × n_vocab} to map tokens to vectors in the residual stream (the cumulative sum of embeddings, attention heads and MLPs). After additive application of attention and MLP layers, the unembedding matrix W_U ∈ R^{n_vocab × d_model} maps the final state of the residual stream to logits for all next token predictions. An OV matrix W_OV ∈ R^{d_model × d_model} maps the residual stream to the output of an attention head, assuming the attention head solely attended to that residual stream. Altogether, our diagram in Figure 1 shows one shallow path through a transformer model.

Effective OV circuit. We determine whether attention heads perform succession by studying their effective OV circuit, which measures the direct effect of input tokens when multiplied by an OV matrix W_OV, as in concurrent work (McDougall et al., 2023) which surveys the importance of MLP0. The (non-effective) OV circuit

W_U W_OV W_E    (1)

from Elhage et al. (2021) is the inspiration for our effective OV circuit

W_U W_OV MLP_0(W_E).    (2)

Intuitively, (2)'s columns represent input tokens to the head and the rows represent the logits on each possible output token.

Successor heads are then operationalized by considering an input token t from our succession dataset (e.g. t = 'Monday'). If the effective OV circuit column for input t has a larger output on the successor to t ('Tuesday') than on any other of the tokens in that task ('Monday' or 'Wednesday' or 'Thursday' or ...) then we consider the head to have performed succession in this case. Successor heads are then defined as the attention heads that pass this test for more than half of the tokens in the succession dataset. We call the proportion of succession dataset tokens on which an attention head performs succession the successor score. The successor scores across a range of models are displayed in Figure 2.

Figure 2: Plots of successor scores (proportion of tokens where succession occurs) for each model tested: the highest successor score observed across all attention heads for each model tested (left), and successor scores of the best successor heads in models (Pythia-1.4B, GPT-2 XL, Llama-2 7B) across different tasks (right).
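To make these definitions concrete, here is a minimal sketch of computing the effective OV circuit W_U W_OV MLP_0(W_E) and a head's successor score, assuming TransformerLens-style weight conventions (W_E: [n_vocab, d_model], W_U: [d_model, n_vocab]); the attribute names and the handling of layer norm are simplifying assumptions rather than the authors' implementation.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-1.4b")

def successor_score(layer, head, task_ids, succ):
    # task_ids: token ids of one ordinal task (e.g. ' Monday' ... ' Sunday')
    # succ: dict mapping each token id to the id of its successor
    emb = model.W_E[task_ids]                                  # [n, d_model]
    # MLP0(W_E): feed embeddings through block 0's MLP; ln2 approximates the
    # layer norm feeding MLP0 (a simplification, not the paper's exact code).
    eff = model.blocks[0].mlp(model.blocks[0].ln2(emb.unsqueeze(0)))[0]
    W_OV = model.blocks[layer].attn.W_V[head] @ model.blocks[layer].attn.W_O[head]
    logits = eff @ W_OV @ model.W_U                            # effective OV circuit
    within = logits[:, task_ids]                               # restrict to same-task tokens
    preds = [task_ids[j] for j in within.argmax(dim=-1).tolist()]
    return sum(p == succ[t] for p, t in zip(preds, task_ids)) / len(task_ids)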
3 DECOMPOSING NUMERIC REPRESENTATIONS

Having found behavioral evidence that successor heads exist across a range of models, we now perform a case study on the attention head (L12H0) with the maximal successor score in Pythia-1.4B. Indeed, not only do we find that the representations on which the successor head acts (i.e. the outputs of the MLP_0 layer) share a common numeric subspace (Section 3.1), that for any given token (e.g. 'February') encodes the index of that token within its ordinal sequence (e.g. months), but we also find mechanistic evidence for transferable arithmetic features within these representations.⁴ Specifically, we use sparse autoencoders to isolate abstract 'mod-10' features within the outputs of the MLP_0 layer (Section 3.2), provide further evidence for these features through linear probing and ablative experiments on individual neurons (Section 3.3), and use these features to steer model behavior across different successor tasks (Section 3.4). N.B.: Section 3.1 can be read independently from Sections 3.2-3.4, and Section 3.3 and Section 3.4 can be read in parallel.

⁴ Note that we also observe similar abstract numeric representations across other models too – see Appendix A.3.

3.1 ORDINAL SEQUENCES ARE REPRESENTED COMPOSITIONALLY

Let i_s denote the ith token in ordinal sequence s (such that e.g. 2_Month corresponds to the token 'February'), and let ⟦i_s⟧ = MLP_0(W_E(i_s)) denote the model's internal MLP_0-representation of token i_s (the output of MLP_0 in Figure 1). Given successor heads S = W_OV can increment tokens i_s from a range of ordinal sequences s (e.g. numbers, months, days of the week), one might hypothesise that the MLP_0-representations of such tokens have compositional structure – i.e. that information about a token's position i in its ordinal sequence is encoded independently from information about which ordinal sequence s it comes from. More precisely, we claim that we can decompose representations ⟦i_s⟧ into features v_i living in some 'index space' and v_s living in some 'domain space', such that ⟦i_s⟧ = v_i + v_s.

Method. To test this compositionality hypothesis, we wish to learn two linear maps – an index-space projection π_N: R^{d_model} → R^{d_model} and a domain-space projection π_D: R^{d_model} → R^{d_model} – such that, for all pairs of tokens i_s and j_t (with i_t a valid token),

⟦i_t⟧^ := π_N(⟦i_s⟧) + π_D(⟦j_t⟧) ≈ ⟦i_t⟧.

To do so, we enforce that π_N + π_D = I, and ensure predicted representations ⟦i_t⟧^ 'behave like' ground truth representations ⟦i_t⟧ for randomly sampled pairs of tokens i_s and j_t – in other words, that there is low L2-distance between ⟦i_t⟧^ and ⟦i_t⟧, that ⟦i_t⟧^ decodes to i_t (output-space decoding), and that S(⟦i_t⟧^) decodes to (i+1)_t (successor decoding). For full experimental details see Appendix L.
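A minimal sketch of this compositionality test follows, assuming the MLP_0-representations have already been collected into tensors; only the L2 term of the full objective (Appendix L) is shown, and the constraint π_N + π_D = I is enforced directly by parameterizing π_D = I − π_N.

import torch

d_model = 2048                     # Pythia-1.4B residual width
P_N = torch.nn.Linear(d_model, d_model, bias=False)   # index-space projection
opt = torch.optim.Adam(P_N.parameters(), lr=1e-3)

def step(rep_is, rep_jt, rep_it):
    # rep_is: [batch, d_model] representations of token i in sequence s, etc.
    pred = P_N(rep_is) + (rep_jt - P_N(rep_jt))        # π_D = I − π_N
    loss = (pred - rep_it).pow(2).sum(-1).mean()       # L2 distance to ground truth
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()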
Results. On our held-out dataset of token pairs, we obtained a top-1 output-space decoding accuracy of 1.00. To explore out-of-distribution performance, we also test whether π_N can project out the numeric component of Roman numerals (which weren't in the successor dataset), by taking Roman numerals i_Rom ∈ {'I', ..., 'XII'} and tokens 1_s from sequences s in the successor dataset, and testing whether π_D(1_s) + π_N(i_Rom) decodes to i_s. We present the top-1 predicted tokens under output-space decoding in Table 2: observe that we obtain perfect top-1 accuracy (apart from i ∈ {1, 5, 10}, which we can attribute to the Roman numerals I, V and X being single-letter and impossible to disambiguate from 9_Letter, 22_Letter and 24_Letter).

Table 2: A table presenting top-1 predicted tokens under output-space decoding from π_D(1_s) + π_N(i_Rom), with columns the Roman numerals 'I', ..., 'XII' and rows the sequence tokens '1', 'one', 'first', 'Monday', 'Mon', 'January', 'Jan' and 'A'. Green cells denote predictions which match their target exactly; red cells denote incorrect predictions. Dashed cells denote pairs of 1_s and i_Rom for which i_s is not a valid (single) token.

These results – in particular, our ability to project the numeric component out of tokens from unseen sequences and transfer indices across domains – suggest that there is a shared numeric subspace storing the index of a token within its ordinal sequence. Indeed, informal testing suggests that this numeric subspace may be interpretable even for tokens not part of an ordinal sequence: for instance, d(π_N(⟦' triangle'⟧) + π_D(1_Num)) yields 3_Num, and d(π_N(⟦' week'⟧) + π_D(1_Num)) yields 7_Num.

We note in Appendix M, however, that applying the successor head to these learned representations did not always preserve performance (i.e. a representation S(⟦i_t⟧^) did not always decode to (i+1)_t). This suggests our numeric projection π_N might be capturing slightly more than just the numeric subspace: specifically, there may be some components of domain-space which are ignored by output-space decoding, but which our successor head lifts into output-space.

3.2 FINDING MOD-10 FEATURES

To uncover more structure within these MLP_0-representations, we train a sparse autoencoder (SAE) (Ng, 2011; Cunningham et al., 2023; Bricken et al., 2023) on tokens t from a range of ordinal sequences.⁵ In short, SAEs find a sparse set of linear features that can reconstruct activations in neural networks. We apply SAEs to the MLP_0-representations of all tokens t and call this their SAE-decomposition. The ith SAE feature has activation (on input token t) α_i(t) ≥ 0.

⁵ We describe the full SAE training setup in Appendix A.

Given an ordinal sequence token t and a trained SAE, we define t's most important feature t* as the SAE feature that, when ablated from the reconstruction of t's MLP_0-representation, causes the biggest decrease in the probability of the successor of t (by calculating probabilities from the logits obtained by multiplying by W_U W_OV).
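The 'most important feature' definition translates directly into a feature-ablation loop. The sketch below assumes a trained SAE exposing encode/decode methods (hypothetical names) and precomputed W_OV and W_U matrices; it is illustrative rather than the authors' code.

import torch

def most_important_feature(rep, succ_id, sae, W_OV, W_U):
    acts = sae.encode(rep)                        # feature activations α_i(t) ≥ 0, [D]
    def succ_prob(a):
        logits = sae.decode(a) @ W_OV @ W_U       # read out through W_U · W_OV
        return logits.softmax(-1)[succ_id]
    base = succ_prob(acts)
    drops = {}
    for i in torch.nonzero(acts).flatten().tolist():
        ablated = acts.clone()
        ablated[i] = 0.0                          # ablate feature i from the reconstruction
        drops[i] = (base - succ_prob(ablated)).item()
    return max(drops, key=drops.get)              # feature whose ablation hurts most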
For numeric tokens t_n (e.g. t_13 = '13'), we find that their most important feature is usually shared by other numeric tokens t_m for which m ≡ n (mod 10) (e.g. t_3, t_23, ..., t_93). Indeed, for each of the 'mod-10 classes' {t_0, ..., t_90}, {t_1, ..., t_91}, ..., {t_9, ..., t_99}, the modal most important feature for tokens in that class is shared on average by 58.5% of tokens in that class. Moreover, we find that the most important feature of t_i typically only has a high activation α_i(t_j) in the SAE-decomposition of t_j if i ≡ j (mod 10), which we visualise as 'mod-10 bands' in Figure 3.

We also observe that these features are causally important for the successor head to perform incrementation: if we apply the successor head to the most important features t_i* of tokens t_i (i.e. compute logits as W_U W_OV t_i*), the resulting distribution places high weight on tokens t_j for j ≡ i + 1 (mod 10), which we visualise as 'mod-10 bands' in Figure 5. Furthermore, the weight placed on tokens t_j when t_i is a single-digit number is much larger than that when t_i is a double-digit number.

Given these observations, we hypothesise that the most important features of numeric tokens t_i might in some way encode the value of i mod 10. As such, we define the mod-10 features f_0, ..., f_9 as the modal most important features from mod-10 classes {t_0, ..., t_90}, {t_1, ..., t_91}, ..., {t_9, ..., t_99}, averaged over 100 SAE training runs. We verify that these mod-10 features are causally important for incrementation, by observing that the logit distribution obtained by applying the successor head to features f_i (i.e. W_U W_OV f_i) places high weight on tokens t_j for j ≡ i + 1 (mod 10) (which we visualise as 'mod-10 bands' in Figure 4).

3.3 TRANSFERABILITY OF MOD-10 FEATURES

Are the mod-10 features we found in Section 3.2 simply an artifact of the SAE technique? We provide evidence that these are natural, causally important features by using two independent methods to recover them: (1) linear probing, and (2) identifying MLP_0 neurons. We also demonstrate that these features transfer to other tasks in the succession dataset (Section 2).

(1) Linear probing. We train a linear probe P ∈ R^{10 × d_model} to predict the value of i mod 10 from the MLP_0-representations of numeric tokens t_i. We find that P_i, the ith row of our linear probe, has a high cosine similarity (on average 0.70764) to the corresponding mod-10 feature f_i obtained from the SAE. Surprisingly, the probe even generalizes to non-numeric tokens, correctly predicting the index value mod 10 for 94/102 examples from succession dataset tasks 2-8 (Section 2). Our full experimental setup is described in Appendix B.1. Additionally, we provide an analysis of linear probes for moduli other than 10 in Appendix P.
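A sketch of this mod-10 probe follows, using the hyperparameters reported in Appendix B.1; the data-loading side (collecting MLP_0-representations of '0'-'500') is assumed. After training, probe.weight[i] can be compared against the SAE feature f_i by cosine similarity, mirroring the 0.708 average reported above.

import torch

d_model = 2048
probe = torch.nn.Linear(d_model, 10)            # one row per residue class
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(reps, values):
    # reps: [batch, d_model] MLP0-representations of number tokens '0'..'500'
    # values: [batch] integer value of each token
    loss = loss_fn(probe(reps), values % 10)    # predict i mod 10
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()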
(2) MLP_0 neurons. We perform ablative experiments on individual MLP_0 neurons (activations immediately after the final ReLU/GELU) to find the most important neurons for incrementing numeric tokens t_i (measured by decrease in probability of the successor token t_{i+1}, as per the definition of most important feature). Observing the behavior of these neurons on tokens t_i reveals periodic spiking patterns in firing intensity (the neuron's activation value), with the most common period across the top 16 most important neurons being 10. Figure 6 presents the firing patterns of some of these neurons. Indeed, we also find that the neurons increase probability on successor tokens, by multiplying the neuron's corresponding direction with W_U W_OV in the same figures. Further technical details can be found in Appendix C. Our results on the interpretability of individual neurons may seem surprising in light of recent work suggesting that the individual neurons of language models may be inappropriate as the units of understanding (Elhage et al., 2022). However, our results do not contradict previous findings that understanding MLPs requires understanding distributed behaviors, since, for example, feature f_6 appears to be in superposition across at least two neurons (see Figure 6).

3.4 VECTOR ARITHMETIC WITH MOD-10 FEATURES

The generalization of our mod-10 linear probe to unseen numeric tasks suggests that token representations across different tasks might be compositional and share some common mod-10 structure. In this section, we test our understanding of this structure by performing vector arithmetic with our mod-10 features to manipulate the index of ordinal sequence tokens. For instance, just as Mikolov et al. (2013) found that word vectors satisfied vec('King') − vec('Man') + vec('Woman') = vec('Queen'), we expect

MLP_0(W_E('fifth')) − k·f_5 + k·f_7    (3)

to be causally used by the model in a similar way to MLP_0(W_E('seventh')), where k is a scaling factor (Appendix D). We use our successor head to test this hypothesis. Observe that, if (3) behaves like MLP_0(W_E('seventh')), applying the successor head to (3) (i.e. multiplying by W_U W_OV) should yield a distribution with more weight on 'eighth' than on any other token from the cardinal-word task in the succession dataset (Section 2). Indeed, this is correct as indicated by the circled checkmark ✓ in Figure 7. We can perform a similar experiment with all tokens in the succession dataset and with features other than f_7 added. The cases where the max logits are on the successor token are check-marked in Figure 7. We describe the experiment in more detail in Appendix D and we also display how logits are distributed across tokens for individual cells in Appendix O. We find that when the mod-10 addition feature is larger than the source value (modulo 10), vector arithmetic works on 53% (for months) and 89% (for digits 20-29) of cases.

Greater-than bias. The vector arithmetic experiments (Figure 7) work much worse when the mod-10 addition is smaller than the source token's ordinal sequence position mod 10 (e.g. experiments involving 'March' and adding f_1 or f_2 do not go well). This is because successor heads are biased towards values greater than the successor, compared to values less than the successor. This effect can be seen in Figure 8a on the tokens 'first', 'second', ..., 'tenth'. However, our mod-10 features do not exhibit a greater-than bias, as seen in Figure 8b. We survey these effects across all tasks in Appendix H. As a result, using the mod-10 features to shift logits towards tokens of a lower order than the input token fails, as the changes in logits are not significant compared to the large logits on higher-order tokens. In the case of numbers, this leads to the effect that, for example, W_U W_OV (MLP_0(W_E('35')) − k·f_5 + k·f_3) has high logits on '43', rather than '33' (this '+10' effect occurs for 2/3 of entries below the diagonal in the 20-29 numbers table of Figure 7).
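A sketch of the vector-arithmetic edit in equation (3) follows, using the scaling heuristic from Appendix D; f is an assumed [10, d_model] matrix of unit-norm mod-10 features, and the function generalizes (3) to arbitrary source and target residues.

import torch

def edit_and_decode(rep, n_src, n_tgt, f, W_OV, W_U, lam=1.0):
    # rep: MLP0-representation of the source token; lam follows Appendix D
    # (1.0 for most tasks, 2.0 for months).
    k = lam * torch.dot(rep, f[n_src])            # intensity heuristic k = λ(rep · f_n)
    edited = rep - k * f[n_src] + k * f[n_tgt]    # equation (3), generalised
    return (edited @ W_OV @ W_U).argmax().item()  # hoped: the token of order n_tgt + 1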
Figure 3: The activations of t_i's most important feature (y-axis) in the SAE-decomposition of t_j (x-axis), for t_i, t_j numeric tokens. Values averaged over 100 SAE training runs.

Figure 4: The logit value for t_j (x-axis) when unembedding the most important feature of t_i (y-axis), for t_i, t_j numeric tokens. Values averaged over 100 SAE training runs.

Figure 5: Logit distribution W_U W_OV f_i for each mod-10 feature f_i.

Figure 6: Some examples of neurons firing strongly in modulo-10 patterns, out of the top 16 most important MLP_0 neurons for successorship. Firing intensity (y-axis) corresponds to the activation value for the neuron. Panels: (a) 1st place neuron, firing for odd numbers, and slightly more for 1 mod 10; (b) 7th place neuron, a 0 mod 10 neuron; (c) 8th place neuron, a 6 mod 10 neuron; (d) 10th place neuron, a 4 mod 10 neuron; (e) 15th place neuron, a 5 mod 10 neuron; (f) 16th place neuron, another 6 mod 10 neuron.

Limitations: The absence of a strong greater-than bias in our mod-10 features suggests this feature-level description is missing some details – specifically, that successor heads must use other numeric information to produce the greater-than bias we observe. Additionally, while we see a good generalization of the mod-10 features across various tasks in the table in Figure 7, the mod-10 features are not able to steer the Days and Letters tasks from Section 2. We describe this in Appendix E.

4 SUCCESSOR HEADS IN THE WILD

In this section, we analyze the behaviour of successor heads within natural-language datasets, and observe that they aren't simply responsible for incrementation: indeed, we identify four distinct, interpretable categories of successor head behavior, highlighting successor heads as an example of an interpretably polysemantic attention head 'in the wild'.

In order to characterize the behavior of Pythia-1.4B's successor head on natural-language data, we randomly sample 128 length-512 contexts from The Pile, and for each prefix of each context, we assess whether the successor head is important for the model's ability to predict the correct next token. To measure importance, we use direct effect mean ablation, which involves patching the output of a head with its mean output over a chosen distribution (in this case, the same batch), and, at the very end of the model, subtracting this mean output and adding the original head output to the residual stream (other effects and ablation methods are explored in Appendix K).
Figure 7: Table displaying cases in which vector arithmetic such as (3) is successful for various ranges of tokens (rows: source tokens, e.g. '20'-'29', 'ten'-'twenty', 'January'-'September', 'first'-'ninth'; columns: target residues modulo 10). Other ranges of tokens give similar results, as displayed in Appendix N. Example: the circled case (3) has a source token of 'fifth' and a target residue of 7, and it is ticked because vector arithmetic leads to this token being successfully treated like 'seventh'.

Figure 8: The successor head OV circuit displays a clear bias against decrementation (Figure 8a), i.e. the logits on or above the main diagonal are less than the logits below the main diagonal. This bias isn't captured in the mod-10 features (Figure 8b). (a) Evaluating the effective OV circuit on the input and output tokens 'first', 'second', ..., 'tenth'. (b) Multiplying all mod-10 features f_i by W_U W_OV.

We evaluate prefixes using two different metrics for per-prompt successor head importance:

Winning cases. We identify prefixes where the head that most decreases the logit for the correct next token under direct effect mean ablation is the successor head, denoting them as winning cases.

Loss-reducing cases. We identify prefixes p where direct effect mean ablation of the successor head increases next-token prediction loss (by ΔL(p)), denoting them as loss-reducing cases.

4.1 INTERPRETABLE POLYSEMANTICITY IN SUCCESSOR HEADS

On analyzing prefixes where the successor head is particularly important for next-token prediction – i.e. loss-reducing and winning cases – we observe four main categories of behavior, which we operationalize as follows (denoting a top-n-attended token as a token at one of the top n positions to which the successor head attends most strongly):

Successorship behavior: the successor head pushes for the successor of a token in the context. We say this behavior occurs when one of the top-5-attended tokens t is in the successorship dataset, and the correct next token is the successor of t.

Acronym behavior: the successor head pushes for an acronym of words in the context. We say this behavior occurs when the correct next token is an acronym whose last letter corresponds to the first letter of the top-1-attended token. (For example, if the successor head attends most strongly to the token 'Defense', and the correct next token is 'OSD'.)

Copying behavior: the successor head pushes for a previous token in the context. We say this behavior occurs when the correct next token t has already occurred in the prompt, and token t is one of the top-5-attended tokens.

Greater-than behavior: the successor head pushes for a token greater than a previous token in the context. We say this behavior occurs when we do not observe successorship behavior, but when the correct next token is still part of an ordinal sequence and has greater order than some top-5-attended token (e.g. if the successor head attends to the token 'first' and the model predicts the token 'third').
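The direct-effect mean ablation metric above can be sketched as follows, under TransformerLens conventions (per-head outputs cached as "result" activations); the final layer norm is ignored for brevity, so this is an approximation of the paper's procedure rather than a faithful reimplementation. Winning and loss-reducing cases then fall out of comparing per-prompt losses computed from the two returned logit tensors, head by head.

import torch

def direct_effect_ablation(model, tokens, layer, head):
    model.set_use_attn_result(True)                # cache per-head outputs
    logits, cache = model.run_with_cache(tokens)
    head_out = cache["result", layer][:, :, head]  # [batch, pos, d_model]
    mean_out = head_out.mean(dim=(0, 1))           # mean output over the batch
    # Only the head's direct path to the logits is patched; indirect paths keep
    # their clean values, so the correction is applied at the unembedding alone.
    delta = (head_out - mean_out) @ model.W_U      # [batch, pos, d_vocab]
    return logits, logits - delta                  # clean vs. ablated logits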
We plot the proportions of each behavior observed across winning cases in Figure 9, and the fraction of total reduced loss over all contexts (ΔL) attributable to contexts of each behavior in Figure 11. We also illustrate a random sample of 5 winning cases in Figure 10, and of 5 loss-reducing cases in Figure 12.

Figure 9: Proportions of three dominant behaviors across winning cases: successorship 43.4%, acronym 23.8%, greater-than 18.9%, other 14%.

Figure 10: Random sample of 5 winning cases (prompt → completion); negligibly many winning cases were copying:
(...) [@B2] Hence, bonding to ceramic requires strict attention to detail for optimal clinical outcomes.[@B → 3
(...) designated as boxazomycin A and → B
(...) called Generalized Single Step Single Solve ( → GS
(...) More than → two-thirds
(...) where one or more access points ( → AP

Figure 11: Proportions of reduced loss (ΔL) attributable to prompts of each behaviour: successorship 38.2%, other 28.7%, copying 11.7%, acronym 10.8%, greater-than 10.5%.

Figure 12: The top 5 most loss-reducing examples (prompt → completion):
(...) low-dose administration of 14- and → 15
(...) in the second round, let's get it to the → third
(...) for 1) investigators to ask the missionaries, → 2
(...) You are using Microsoft Test Manager ( → MT
(...) known as the Medical Research Council Technology ( → MR

We observe that, while successorship is the predominant behavior across both winning and loss-reducing cases, acronym and greater-than behaviors also form a non-negligible fraction of successor head behavior. In other words, the successor head is an example of an attention head with interpretable polysemanticity.⁶ While polysemanticity has been observed in both vision models (Olah et al., 2020) and toy models trained to perform simple tasks (Elhage et al., 2022), to the best of our knowledge the presence of both successorship and acronym behavior in head L12H0 is the cleanest example of polysemantic behavior identified so far in an LLM, where we show two clear, distinct behaviors that one model component has. Finally, this finding is surprising given research into polysemanticity and superposition (Elhage et al., 2022; Bricken et al., 2023). The succession and acronym behaviors are different tasks that L12H0 completes, yet they are not independent tasks that occur in completely different contexts (this is because a token completion could involve both succession and an acronym, e.g. 'the First Limited Corporation, ' could be completed with ' Second' or ' FLC').

⁶ A component of a network is said to be interpretably polysemantic if it performs multiple distinct, interpretable functions.

Note that in this section, while we identified that successor heads are often used in tasks involving incrementation, we did not explicitly demonstrate that successor heads are necessary for incrementation. In Appendix J we describe an experiment that reveals that successor heads are necessary for a specific incrementation task (numbered listing).

5 RELATED WORK

Mechanistic interpretability research aims to reverse engineer trained neural networks analogously to how software binaries can be reverse-engineered (Olah, 2022). This research was largely developed in vision models (Bau et al., 2017; Olah et al., 2017), though most recent research has studied language models (Elhage et al., 2021; Olsson et al., 2022; Gurnee et al., 2023) and transformers (Nanda et al., 2023). Olah et al. (2020) introduces the universality hypothesis, and we use Chughtai et al. (2023)'s 'weak universality' notion in this work (Section 1).
Transformer circuits. More specifically, our work builds from the insights of Elhage et al. (2021)'s framework for understanding circuits in transformers, including how autoregressive transformers have a residual stream. Due to the residual stream, different paths from input to output can bypass as many attention heads and MLPs as necessary. This has further been explored in specific case studies (Wang et al., 2023; Goldowsky-Dill et al., 2023) and generalizes to backwards passes (Du et al., 2023). One related case study to our work is Hanna et al. (2023), which studies a Greater-Than circuit in GPT-2 Small, similar to how we indirectly found the Greater-Than operation in Section 3; Hanna et al. (2023) focus mainly on numbers, not other tasks. Our work is inspired by Olsson et al. (2022), who study induction heads and find that heads with similar attention patterns exist in larger models. In our work we provide an end-to-end explanation of generalizing language model components (Figure 1), though induction heads are related to in-context learning and a consistent phase change in training, which we didn't observe for successor heads (Appendix G).

LLMs and vector arithmetic. Mikolov et al. (2013)'s seminal work on word embedding arithmetic showed that latent language model representations had compositionality, e.g. vec('King') − vec('Man') + vec('Woman') approximated the embedding of 'Queen'. Recently Merullo et al. (2023) showed some extension of these arithmetic results to LLMs. L (2023) found that 'one is 1' and that similar heads in GPT-2 Small boost the successors of numbers, months and days, which we generalize to more architectures and to an end-to-end circuit (Figure 1). Lan & Barez (2023) also use an automated approach to study the overlap of these tasks. Finally, Subramani et al. (2022) and Turner et al. (2023) use residual stream additions to steer models. Our work differs in that it considers shallow targeted paths through networks, rather than deep hidden states in networks.

6 CONCLUSION

In this work, we discovered and interpreted a class of attention heads we call successor heads. We showed that these heads increment tokens in ordinal sequences like numbers, months, and days, and that the representations of the tokens are compositional and contain interpretable, abstract 'mod-10' features. We provided evidence that successor heads exhibit a weak form of universality, arising in models across different architectures and scales (from 31M to 12B parameters), and using similar underlying mod-10 features in all cases. Finally, we validated our understanding by demonstrating that a successor head reduced the loss on training data by predicting successor tokens.

Additional findings that stemmed from our work include:

1. Finding a 'greater-than bias', observable by a weights-level analysis, where a language model was much more likely to predict numeric answers larger than the values present in the prompt than answers smaller than them.
2. Surprisingly interpretable individual MLP_0 neurons on this narrow task.
3. A novel example of attention head polysemanticity (successor heads predicting acronyms).
Findings 1-3 motivate future work into how language models represent numeric concepts, particularly as 2 and 3 were surprising given evidence from existing work. Our work in finding a language model component that arises in models of many different scales (and uses abstract underlying numeric representations) may be a valuable contribution toward understanding the inner workings of frontier LLMs.

7 AUTHOR CONTRIBUTIONS

Rhys Gould first found successor heads in GPT-2 Small and Pythia-1.4B, and identified the mechanism from Figure 1 as well as the mod-10 features. Euan Ong developed Section 3.1, improved our understanding, and worked on writing across the paper. George Ogden found successor heads in larger models (e.g. Figure 2). Arthur Conmy led the project, framed the paper's contributions and suggested and implemented several experiments throughout the work.

8 ACKNOWLEDGEMENTS

We would like to thank Bilal Chughtai, Théo Wang and reviewers for comments on a draft of this work and Neel Nanda for a helpful discussion, as well as Lawrence Chan and Sebastian Farquhar for pieces of advice. Elizabeth Ho, Will Harpur-Davies and Andy Zhou worked on an early version of this work in GPT-2 Small with help from Théo Wang.

REFERENCES

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations, 2017.

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens, 2023.

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations, 2023. URL https://arxiv.org/abs/2302.03025.

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023.

Stanford CRFM. Mistral: A framework for transparent and accessible large-scale language model training. https://github.com/stanford-crfm/mistral, 2021.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023.
Kevin Du, Lucas Torroba Hennigen, Niklas Stoehr, Alexander Warstadt, and Ryan Cotterell. Generalizing backpropagation for gradient-based interpretability, 2023.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching, 2023.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023.

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023.

Stefan Heimersheim and Jett Janiak. A circuit for Python docstrings in a 4-layer attention-only transformer, 2023. URL https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.

Mikhail L. One is 1: Analyzing activations of numerical words vs digits, Jul 2023.

Michael Lan and Fazl Barez. Locating cross-task sequence continuation circuits in transformers, 2023.

Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations?, 2016.

Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. Copy suppression: Comprehensively understanding an attention head. arXiv preprint, 2023.

Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg. The hydra effect: Emergent self-repair in language model computations, 2023.

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple word2vec-style vector arithmetic, 2023.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality, 2013.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

Andrew Ng. Sparse autoencoder. Online Course Notes, Stanford University, 2011. URL https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf.

Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://transformer-circuits.pub/2022/mech-interp-essay, 2022.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001.
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

OpenAI. GPT-4 technical report, 2023.

Judea Pearl. Causality. Cambridge University Press, 2nd edition, 2009. ISBN 978-0-521-89560-6. doi: 10.1017/CBO9780511803161.

Lee Sharkey, Dan Braun, and Beren Millidge. [Interim research report] Taking features out of superposition with sparse autoencoders, 2022. URL https://www.lesswrong.com/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition.

Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models, 2022.

Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.

Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah D. Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca, 2023.

A SPARSE AUTO-ENCODERS

A.1 DEFINITION

We refer to a single-layer autoencoder with a sparsity regularization term in its loss as a sparse auto-encoder. For a dataset generated from a set of underlying vectors (each dataset example is a sparse linear combination of such vectors), it has been empirically observed (Sharkey et al., 2022; Cunningham et al., 2023) that sparse auto-encoders are capable of retrieving the underlying set of vectors. We hope to obtain a set of sparse, interpretable features from the SAEs that decompose some of the structure of MLP_0 space that we can use to analyze the way numeric operations are performed.
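A minimal sketch of the single-layer sparse autoencoder just described follows, with an L1 sparsity penalty; the dimensions and λ coefficient follow Appendix A.2, while the exact normalization details are assumptions.

import torch

class SparseAutoencoder(torch.nn.Module):
    def __init__(self, d_model=2048, n_features=512):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, n_features)
        self.dec = torch.nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        a = torch.relu(self.enc(x))        # feature activations a_i ≥ 0
        return self.dec(a), a              # reconstruction R(x) = Σ a_i v_i, plus a

def sae_loss(x, recon, acts, lam=0.3):
    # L2 reconstruction error plus L1 sparsity penalty on the activations
    return (x - recon).pow(2).sum(-1).mean() + lam * acts.abs().sum(-1).mean()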
A.2 TRAINING PROCESS FOR MOD-10 FEATURES

Training a sparse auto-encoder with D features and regularization coefficient λ on a dataset of tokens in MLP_0 space results in a map F: Tokens → (R^d × R^+)^D, with F(x) = ((v_1, a_1), ..., (v_D, a_D)), mapping a token to a set of feature and feature-activation pairs, with reconstruction R(x) = Σ_{i=1}^{D} a_i v_i. Each activation a_i is the ReLU of a linear transformation of the input to the SAE, represented by W_e in Figure 13. Note that we use SAEs in MLP_0 space, i.e. the reconstruction loss is at the middle stage of Figure 1, which we have illustrated in Figure 13.

We train the SAE using number tokens from 0 to 500, both with and without a space ('123' and ' 123'), alongside other tasks, such as number words, cardinal words, days, months, etc. 90% of these tokens go into the train set, and the remaining 10% to the validation set. Even with the other tasks, the dataset is dominated by numbers, but creating a more balanced dataset would give us less data to work with, and without enough data, the SAE fails to generalize to the validation set. Hence, we only concern ourselves with the features that the SAE learns for number tokens, and we then separately check whether these features generalize to the other tasks on the basis of logits, rather than SAE activations.

We used the hyperparameters D = 512 and λ = 0.3, with a batch size of 64, and trained for 100 epochs. To find these hyperparameters, we used the metric of mean max cosine similarity between two trained SAEs, as described in Sharkey et al. (2022) and Cunningham et al. (2023).

Figure 13: SAEs are trained on the activation after MLP_0 and before W_OV.

Figure 14: SAE plots for Pythia-2.8B analogous to Figure 3, Figure 4, and Figure 5: (a) SAE feature activations, (b) feature logit distribution, (c) logit distribution for each mod-10 feature.

A.3 UNIVERSALITY OF MOD-10 RESULTS

We also observe the mod-10 structure via SAEs across models other than Pythia-1.4B, without any finetuning of SAE parameters to these models. We reproduce the SAE figures seen in Section 3.2 for other models, in Figure 14 for Pythia-2.8B and in Figure 15 for celebrimbor-gpt2-medium-x81.

Figure 15: SAE plots for celebrimbor-gpt2-medium-x81 analogous to Figure 3, Figure 4, and Figure 5: (a) SAE feature activations, (b) feature logit distribution, (c) logit distribution for each mod-10 feature.
B TEST SET EVALUATION

B.1 LINEAR PROBING

We train a linear probe to predict the mod-10 value of tokens. Specifically, we train on number tokens from '0' to '500', both with and without a space, assigning 90% of tokens to a train set, and the remaining 10% to a validation set. We use a learning rate of 0.001, and a batch size of 32, for 100 epochs. We then evaluate on a dataset of unseen tasks, including number words (from 'one' to 'nineteen'), placements, Roman numerals, months, days, and any valid spaced and capitalized variants. Out of the total 102 such examples, 94/102 are correct, and the 8 failures are: ['January', 'December', 'Friday', 'Saturday', 'Sunday', ' V', ' X', ' XV']. The failures of 3 out of 7 days are consistent with our inability to interpret the day task well with our mod-10 features. Additionally, we see 'January' and 'December' as failure cases, which is also consistent with our finding that there does not seem to be a mod-10 feature that corresponds to either of them: f_1 behaves as 'November' rather than 'January', and f_2 as 'February'.

B.2 HELD-OUT TASK

In this work, we used all tokens in numeric sequences in the models' vocabularies, except Roman numerals, so we can use these as a test set, as in Appendix B.1 and Appendix L. We tested all the OpenAI GPT-2 models as well as the Pythia models with at least 1B parameters. We find that successor heads have variable performance on this held-out task, with many (including all Pythia models) achieving a high succession score. However, the original Llama-7B does not generalize well to this task.

Figure 16: Succession scores for the Roman numerals task only (proportion of tokens incremented vs. number of model parameters, for the GPT-2, Pythia and Llama families).

C MLP_0 NEURONS

In our MLP_0 neuron experiments, we do the following: for each T ∈ {'0', '1', ..., '99'}, we ablate each neuron from the final activation in MLP_0 (the final activation is just before the final linear layer of MLP_0), and store the probability attributed to the successor of T after passing the modified (due to ablation) MLP_0 output through the successor head. Averaging the correct probabilities across all 100 prompts then gives an averaged correct probability for each neuron after ablation. We then look at the intensities (neuron activation values) and logits across inputs of number tokens for neurons with the lowest correct probability after ablation, meaning they have the most impact on successorship when ablated. This gives us the plots seen in Figure 6 (a code sketch follows Figure 17 below).

Figure 17: Comparative analysis of output logits for day and letter tasks and presence of mod-10 features, complementing the main text analysis in Figures 8a and 8b and showing limitations of mod-10 features across certain tasks successor heads perform. Panels: (a) output logits for successor heads on day tokens (cf. Figure 8a); (b) presence of mod-10 features on day token predictions (cf. Figure 8b); (c) output logits for successor heads on letter tokens (cf. Figure 8a); (d) presence of mod-10 features on letter tokens (cf. Figure 8b).
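The Appendix C neuron search can be sketched as follows; the precomputed activation matrix and the bias-free readout through MLP_0's output projection are assumptions made for brevity.

import torch

def neuron_importance(acts, succ_ids, W_out, W_OV, W_U, top_k=16):
    # acts: [100, d_mlp] post-GELU MLP0 activations for tokens '0'..'99'
    # succ_ids: LongTensor [100] of each token's successor id
    # W_out: MLP0's output projection, [d_mlp, d_model] (bias omitted for brevity)
    mean_probs = []
    for n in range(acts.shape[1]):
        ablated = acts.clone()
        ablated[:, n] = 0.0                           # ablate neuron n
        logits = ablated @ W_out @ W_OV @ W_U         # [100, d_vocab]
        p = logits.softmax(-1)[torch.arange(len(succ_ids)), succ_ids]
        mean_probs.append(p.mean())
    # The most important neurons are those whose ablation leaves the successor
    # with the lowest average probability.
    return torch.stack(mean_probs).argsort()[:top_k]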
B.2 HELD-OUT TASK

In this work we used all tokens in numeric sequences in the models' vocabularies except Roman numerals, so we can use Roman numerals as a test set, as in Appendix B.1 and Appendix L. We tested all the OpenAI GPT-2 models as well as the Pythia models with at least 1B parameters. We find that successor heads have variable performance on this held-out task, with many (including all Pythia models) achieving a high succession score; however, the original Llama-7B does not generalize well to this task.

Figure 16: Succession scores for the Roman numerals task only (proportion of tokens incremented vs. number of model parameters, 1M to 10B+, across the GPT-2, Pythia, and Llama model families).

C MLP$_0$ NEURONS

In our MLP$_0$ neuron experiments we do the following: for each $T \in \{$'0', '1', ..., '99'$\}$, we ablate each neuron from the final activation in MLP$_0$ (the final activation is just before the final linear layer of MLP$_0$), and store the probability attributed to the successor of $T$ after passing the modified (due to ablation) MLP$_0$ output through the successor head. Averaging the correct probabilities across all 100 prompts then gives an averaged correct probability for each neuron after ablation. We then look at the intensities (neuron activation values) and logits across number-token inputs for the neurons with the lowest correct probability after ablation, i.e. those whose ablation has the greatest impact on successorship. This gives us the plots seen in Figure 6.

Figure 17: Comparative analysis of output logits and mod-10 feature presence for the day and letter tasks: (a) output logits for successor heads on day tokens (cf. Fig. 8a), (b) presence of mod-10 features on day-token predictions (cf. Fig. 8b), (c) output logits for successor heads on letter tokens (cf. Fig. 8a), (d) presence of mod-10 features on letter tokens (cf. Fig. 8b). This complements the main-text analysis in Figures 8a and 8b, illustrating the limitations of mod-10 features on certain tasks that successor heads perform.

D ARITHMETIC EXPERIMENTS

For a token $t$ (row of the arithmetic table) with order $n := \mathrm{ord}(t)$, and mod-10 feature $f_i$ (column of the arithmetic table), we consider how $x := \mathrm{MLP}_0(W_E(t)) + k(-f_n + f_i)$ attributes logits to tokens within the same task as $t$, with $k \in \mathbb{R}^+$ a scaling. We denote by a checkmark whether $x$ correctly attributes maximal logits to the successor token $t^+$ of $t$ (defined by the property $\mathrm{ord}(t^+) = n + 1$), as seen in Figure 7. Since the mod-10 features $f_i$ obtained from the SAE are normalized to unit norm, the scaling $k$ is necessary in order to modify the order of a numeric token. For example, though $f_1$ may be present in the MLP$_0$ embedding of ' eleven', we do not know the intensity of the feature (analogous to the SAE feature strengths $a_i(t)$) in that embedding. We use the heuristic scaling $k := \lambda\,(\mathrm{MLP}_0(W_E(t)) \cdot f_n) \in \mathbb{R}$ for some $\lambda \in \mathbb{R}^+$. The appropriate $\lambda$ should be large enough that the arithmetic described by $x$ changes the embedding's numeric order, yet not so large that it corrupts the task identity of the original embedding. We check this by multiplying $x$ by $W_U W_{OV}$ and observing whether the top token has both properties (that is, whether the numeric order of the token has been altered while the task identity of, say, a month is maintained). Checking this criterion over the range $\lambda \in \{0.0, 0.25, 0.5, 0.75, 1.0, 1.25, \dots, 3.0\}$, we find that $\lambda = 1$ satisfies it for all tasks other than months, which instead require $\lambda = 2$.
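In code, the edit is a few lines of tensor arithmetic. The following sketch assumes `mlp0_emb`, the stacked unit-norm features `f`, and the matrices `W_U`, `W_OV` are precomputed; all names are ours.

```python
import torch

def edit_order(mlp0_emb: torch.Tensor, f: torch.Tensor, n: int, i: int,
               lam: float = 1.0) -> torch.Tensor:
    """Swap the mod-10 feature of an MLP_0 embedding from f_n to f_i.

    mlp0_emb: (d,) the embedding x0 = MLP_0(W_E(t)) of a token with ord(t) = n
    f:        (10, d) unit-norm mod-10 features f_0..f_9
    """
    k = lam * (mlp0_emb @ f[n])          # heuristic intensity of f_n in x0
    return mlp0_emb + k * (f[i] - f[n])  # x = x0 + k(-f_n + f_i)

# Scoring the edit (commented usage): project through the successor head's OV
# circuit and unembed, then inspect the top token.
# logits = W_U @ (W_OV @ edit_order(x0, f, n, i, lam=1.0))
# checkmark = logits.argmax() == expected_successor_id
```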
E FAILURE CASES OF MOD-10 FEATURES

For the day and alphabet tasks, analogously to Figure 8, we look at the logits across each task and the mod-10 features. These are displayed in Figure 17, and demonstrate that our mod-10 features are not very interpretable in the context of days and the alphabet in terms of logits, with no clear diagonal of high logits.

F RESIDUAL CONNECTIONS ARE NOT IMPORTANT FOR SUCCESSION

To show that there is no relevant information in the residual stream, i.e. that the path $W_U\,\mathrm{MLP}_0(W_E)$ is not sufficient to predict successors, we perform an experiment using the Tuned Lens (Belrose et al., 2023), which approximates the optimal predictions after a given layer inside a transformer. For all tasks in the succession dataset (Section 2), we used the prompt formats (where | denotes a gap between tokens)

1. |Here| is| a| list|:| alpha| beta| gamma| and| here| is| another|:|<token1>|
2. |The|Monday|Tuesday|Wednesday| and| The|<token1>|

in order to measure how well models could predict the successor <token2> (e.g. ' February') given that the final token of the prompt was <token1> (e.g. ' January'); LLMs can indeed predict successors given these prompts. We then took GPT-2 Small and Pythia-1.4B's output after MLP$_0$ and used the Tuned Lens to obtain logits over output tokens.⁷ The resulting successor score was <1%, and the completions were commonly bigrams, such as <token1> = ' first' giving ' time' and <token1> = ' Sunday' giving ' morning'. This suggests that MLP$_0$ information alone is insufficient for incrementation and that the successor head is critical for succession.

⁷ Note that we did run GPT-2 Small's attention layer 0, to maximise the model's chances of being able to perform succession. Pythia-1.4B has parallel attention, so we just take the MLP$_0$ output in this case.

G TESTING SUCCESSOR SCORE OVER TRAINING STEPS

Another line of evidence that successor heads are an important model component for low training loss can be found by studying successor scores across training checkpoints. We study a Pythia model (Biderman et al., 2023) as well as a Stanford GPT-2 model (CRFM, 2021), as these models have public training checkpoints. The emergence of successor heads throughout training is displayed in Figure 18.

Figure 18: Best successor scores across successor heads throughout training checkpoints for (a) Pythia models (pythia-70m, pythia-1.4b, pythia-2.8b; checkpoints 0 to 140k) and (b) Stanford GPT-2 Medium-C.
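Such a sweep is possible because Pythia publishes its intermediate training checkpoints as HuggingFace revisions ("step1000", "step2000", ...). Below is a minimal sketch; the `successor_score` argument stands in for a user-supplied implementation of the succession metric from Section 2, since the paper does not specify this harness.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def sweep_checkpoints(successor_score, model_name="EleutherAI/pythia-70m",
                      steps=range(1000, 143_001, 13_000)):
    """Evaluate successor_score(model, tokenizer) -> float (e.g. the best
    successor score over all heads) across published Pythia checkpoints."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    scores = {}
    for step in steps:
        model = AutoModelForCausalLM.from_pretrained(
            model_name, revision=f"step{step}")
        scores[step] = successor_score(model, tokenizer)
        del model  # free memory before loading the next checkpoint
    return scores
```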
H DECREMENTATION BIAS ACROSS DIFFERENT TASKS

We show the strength of the decrementation bias in Figures 19 and 20.

Figure 19: Plots of logits across various numeric classes ((a) numbers, (b) number words, (c) placements, (d) months), analogous to Figure 8a. Source tokens are on the y-axis and output tokens on the x-axis.

Figure 20: Plots of mod-10 feature logits across various numeric classes ((a) numbers, (b) number words, (c) placements, (d) months), analogous to Figure 8b.

Figure 21: Successor scores (fraction of tokens weakly successed) for every head, by layer, in (a) Pythia-1.4B and (b) GPT-2 Large.

I ALL SUCCESSOR SCORES IN A MODEL

Figure 21 shows that for both Pythia-1.4B (the mainline model in the paper) and GPT-2 Large (a randomly selected model without a successor head from Figure 2, left), the heads with the highest successor score are sparse: in Pythia-1.4B, L12H0 has a successor score eight times greater than the next-highest successor score.

J CASE STUDY: NUMBERED LISTING

In Section 4 we demonstrate that when the successor head contributes usefully, the prompt often requires some kind of incrementation. Here we investigate whether the converse holds: are prompts requiring incrementation mostly solved by successor heads? Numbered listing is widespread across real datasets and requires incrementation. Additionally, blog post discussion⁸ suggests that even small LLMs are capable of this task in the case of incrementing citations. Examples of numbered-listing prompts can be seen in Figure 22. We collect 64 such prompts and check whether the successor head in Pythia-1.4B is the winning case (as in Section 4), and find that it is indeed the winning head across all 64 prompts. This provides some evidence that prompts requiring incrementation in real datasets are indeed mostly solved by successor heads.

⁸ https://www.lesswrong.com/posts/LkBmAGJgZX2tbwGKg/help-out-redwood-research-s-interpretability-team-by-finding

Figure 22: Some examples of numbered-listing prompts from the Pile dataset, each paired with the incremented answer token, e.g. '(...) (A) Colony formation and (' → 'B'; '[^2]: Conceived and designed the experiments (...)' → '[^3'; '6. Kirovsky Zavod Station – St. Petersburg, Russia (...) you can see a statue of Lenin here.' → ' 7'; '[9] Minutes, Criminal Law Revision Commission, January 28, 1972, 16. [' → ' 10'.

K DIFFERENT ABLATION METHODS

To analyze the effect of language model components in ablation experiments, it is important to distinguish the direct, indirect, and total effects of a component on model outputs (McGrath et al., 2023; Pearl, 2009), illustrated in Figure 23. For a given ablation method, measuring the direct effect of a head involves, at the end of the model, subtracting the head's output from the residual stream and adding the head's ablated output. In Section 4 we analyzed the direct effect under the ablation method of mean ablation. This appendix argues that the direct effect is the largest effect and that our results hold under different ablation methods.

Figure 23: The types of effect an attention head (A) could have on model output (O), possibly through mediating downstream model components (D): the basic graph, the direct effect, the indirect effect, and the total effect.

The indirect effect instead involves replacing a head's output with an ablated output and, at the very end of the model, subtracting this ablated output and adding back the head's original output; this effectively ablates only the downstream effects of the head. The total effect replaces the head's output with an ablated output outright, which ablates both the direct and indirect effects of the head.

We ablate head outputs using one of two techniques:

1. Mean ablation: replacing the output of a head with the average head output over a chosen distribution; we choose this distribution to be the current batch.
2. Resampling ablation: replacing the output of a head with the head's activation on a randomly sampled example from a dataset (e.g. the Pile); when using this method, we repeat the process a number of times and average the results.
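As a sketch of these two ablation techniques, the hooks below patch a single head's output in a TransformerLens-style model; the `hook_z` hook point follows TransformerLens conventions, but the surrounding harness is our own assumption. Measuring the direct and indirect effects additionally requires adding and subtracting head outputs at the final residual stream, which this sketch omits.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-1.4b")
LAYER, HEAD = 12, 0  # the Pythia-1.4B successor head L12H0

def mean_ablate_head(z, hook):
    # z: (batch, seq, n_heads, d_head). Mean ablation: replace one head's
    # output with its average over the current batch.
    z[:, :, HEAD] = z[:, :, HEAD].mean(dim=0, keepdim=True)
    return z

def resample_ablate_head(z, hook):
    # Resampling ablation (one draw): replace one head's output with its
    # activation on a randomly permuted example; the paper samples from a
    # dataset such as the Pile and averages over several draws.
    z[:, :, HEAD] = z[torch.randperm(z.shape[0])][:, :, HEAD]
    return z

# Total effect: run with the hook applied, e.g.
# loss = model.run_with_hooks(
#     tokens, return_type="loss",
#     fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", mean_ablate_head)])
```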
Rerunning the loss-reducing experiments described in Section 4.1 under these different methods gives:

1. Direct effect, resampling ablation: 33% for successorship, 11% for the acronym task, 10% for greater-than, and 11% for copying behaviour.
2. Indirect effect, mean ablation: 6.5% for successorship, 2.8% for the acronym task, 12% for greater-than, and 9.0% for copying behaviour.
3. Indirect effect, resampling ablation: 5.8% for successorship, 2.7% for the acronym task, 11% for greater-than, and 8% for copying behaviour.
4. Total effect, mean ablation: 27% for successorship, 8.2% for the acronym task, 10% for greater-than, and 8.4% for copying behaviour.
5. Total effect, resampling ablation: 30% for successorship, 6.5% for the acronym task, 11% for greater-than, and 6.2% for copying behaviour.

Result 1 demonstrates that using resampling ablation instead of mean ablation when analyzing the loss-reducing effect changes little in the results of Figure 11. Additionally, the indirect-effect results (2 and 3) provide some evidence that downstream effects of the successor head are not highly significant to successorship.

L TRAINING DETAILS FOR COMPOSITIONALITY EXPERIMENTS

Remark: obtaining a decoding function. Recall that we wish to learn two linear maps, an index-space projection $\pi_N : \mathbb{R}^{d_{\mathrm{model}}} \to \mathbb{R}^{d_{\mathrm{model}}}$ and a domain-space projection $\pi_D : \mathbb{R}^{d_{\mathrm{model}}} \to \mathbb{R}^{d_{\mathrm{model}}}$, such that, for all pairs of tokens $i_s$ and $j_t$ (with $i_t$ a valid token),

$$\widehat{[\![i_t]\!]} := \pi_N([\![i_s]\!]) + \pi_D([\![j_t]\!]) \approx [\![i_t]\!].$$

To evaluate the above identity, we must first learn a decoding function $d : \mathbb{R}^{d_{\mathrm{model}}} \to \mathbb{R}^{d_{\mathrm{vocab}}}$ such that $\arg\max_t d([\![i_s]\!])_t = i_s$. Given the informal observation that directly unembedding $[\![i_s]\!]$ yields next-token predictions for $i_s$, whereas unembedding $S([\![i_s]\!])$ yields $(i+1)_s$ (see Appendix F), we hypothesise that the unembedding matrix $W_U$ reads from some 'output space' $O$ while the embedding transform $[\![\cdot]\!]$ writes to some 'input space' $I$, and that the successor head reads from $I$ and writes to $O$. Indeed, when training an output-space projection $\pi_O : I \to O$ over tokens in the vocabulary such that $W_U(\pi_O([\![i_s]\!])) = i_s$, we obtain 97.4% top-1 accuracy on a set of 1000 held-out tokens, which both confirms the output-space hypothesis and gives us a decoding function $d(x) = W_U(\pi_O(x))$.

Method. With our decoding function in hand, we can train $\pi_N$ and $\pi_D$ to satisfy our identity. Specifically, we define $\pi_N$ and $\pi_D$ to be matrices such that $\pi_N + \pi_D = I$. For valid token pairs $i_s$ and $j_t$, we obtain predicted representations $\widehat{[\![i_t]\!]} = \pi_N([\![i_s]\!]) + \pi_D([\![j_t]\!])$, and minimise a combination of 'closeness metrics':

$$\big\|\widehat{[\![i_t]\!]} - [\![i_t]\!]\big\|_2 + \mathcal{L}\big(W_U(\pi_O(\widehat{[\![i_t]\!]})),\, i_t\big) + \mathcal{L}\big(W_U(S(\widehat{[\![i_t]\!]})),\, W_U(S([\![i_t]\!]))\big)$$

for $\mathcal{L}$ the cross-entropy loss. In other words, we ensure that predicted and ground-truth representations 'behave in the same way': they are close together, predicted representations $\widehat{[\![i_t]\!]}$ decode to tokens $i_t$ (output-space decoding), and the logit distribution when decoding incremented predicted representations $S(\widehat{[\![i_t]\!]})$ matches that when decoding incremented ground-truth representations $S([\![i_t]\!])$ (successor decoding).
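A minimal PyTorch sketch of this objective, under our own assumptions about parameterisation: $\pi_D$ is represented implicitly as $I - \pi_N$ so the constraint $\pi_N + \pi_D = I$ holds by construction, and all representations and matrices are assumed precomputed.

```python
import torch
import torch.nn.functional as F

def compositional_loss(pi_N, emb_is, emb_jt, emb_it, it_id, W_U, pi_O, S):
    """One term of the objective above (sketch).

    pi_N:   (d, d) learnable index-space projection; pi_D := I - pi_N.
    emb_*:  (d,) representations [[i_s]], [[j_t]], [[i_t]].
    it_id:  int token id of i_t.
    W_U:    (vocab, d) unembedding; pi_O: (d, d) output-space projection.
    S:      callable applying the successor-head transform.
    """
    pi_D = torch.eye(pi_N.shape[0]) - pi_N
    pred = pi_N @ emb_is + pi_D @ emb_jt              # predicted [[i_t]]

    close = (pred - emb_it).norm()                    # L2 closeness
    out_dec = F.cross_entropy(                        # output-space decoding
        (W_U @ (pi_O @ pred)).unsqueeze(0), torch.tensor([it_id]))
    succ_dec = F.cross_entropy(                       # successor decoding:
        (W_U @ S(pred)).unsqueeze(0),                 # match ground-truth logits
        (W_U @ S(emb_it)).softmax(-1).unsqueeze(0).detach())
    return close + out_dec + succ_dec
```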
More succinctly, we can frame the training procedure as learning $\pi_N, \pi_D$ such that a diagram commutes, relating index/domain pairs, their embeddings under $[\![\cdot]\!] \times [\![\cdot]\!]$, the sum $\pi_N \times \pi_D$ followed by $(+)$, and the decoders $W_U \circ \pi_O$, $W_U \circ S$, and $W_U \circ S \circ [\![\cdot]\!]$. [Commutative diagram not reproduced here.]

We trained for 10 epochs over valid token pairs sampled from the succession dataset, and evaluated on a held-out dataset of 500 randomly sampled token pairs.

M EFFECTS OF APPLYING THE SUCCESSOR HEAD TO COMPOSITIONAL REPRESENTATIONS

In Section 3.1 we assessed the performance of our factoring mechanism through output-space decoding (i.e. evaluating a representation $\widehat{[\![i_t]\!]}$ by testing whether it decodes to $i_t$). Below we also present results for successor-space decoding (i.e. evaluating a representation $\widehat{[\![i_t]\!]}$ by testing whether $S(\widehat{[\![i_t]\!]})$ decodes to $(i+1)_t$). In contrast to the high performance of output-space decoding, we found that successor-space decoding yields a top-1 accuracy of 0.90 on the held-out dataset of token pairs, and a top-1 accuracy of 0.125 on the out-of-distribution Roman numeral dataset (see Table 3).⁹

Table 3: Top-1 predicted tokens under successor decoding from $\pi_D(1_s) + \pi_N(i_{\mathrm{Rom}})$, with source index tokens I-XII crossed with sequence tokens ('1', ' one', ' first', ' Monday', ' Mon', ' January', ' Jan', ' A'). Green cells denote predictions matching the target $(i+1)_s$ exactly; yellow cells match the target index but not the target domain; red cells are incorrect predictions. Dashed cells denote pairs of $1_s$ and $i_{\mathrm{Rom}}$ for which $i_s$ is not a valid (single) token. [Table contents not reproduced; note the Roman-numeral predictions such as 'VII' and 'VIII' leaking into non-Roman rows.]

This drop in performance when switching from output-space to successor decoding (and in particular, the leakage of Roman-numeral information into $\pi_D(1_s) + \pi_N(i_{\mathrm{Rom}})$: notice the VII and VIII in Table 3) suggests our numeric projection $\pi_N$ might be capturing slightly more than just the numeric subspace. Specifically, there may be some components of domain space which are ignored by output-space decoding, but which the successor head lifts into output space.
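The two evaluation modes then reduce to a pair of argmax checks; a sketch under the same assumptions as the loss sketch above, with `S` again denoting the successor-head transform:

```python
def output_space_decode(pred, W_U, pi_O):
    # Output-space decoding: which token does pred itself decode to?
    return (W_U @ (pi_O @ pred)).argmax().item()

def successor_space_decode(pred, W_U, S):
    # Successor-space decoding: which token does S(pred) decode to?
    return (W_U @ S(pred)).argmax().item()

# Top-1 accuracy under each mode (it_id, succ_id are the target token ids):
# acc_out  = output_space_decode(pred, W_U, pi_O) == it_id
# acc_succ = successor_space_decode(pred, W_U, S) == succ_id
```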
N ADDITIONAL ARITHMETIC TABLES

We display additional arithmetic tables analogous to those in Figure 7: three number tables over randomly sampled ranges in Figure 24, and a number-word table in Figure 25. The results are similar to those in Figure 7.

Figure 24: Additional randomly selected number arithmetic tables, analogous to those in Figure 7: (a) range 50 to 59, (b) range 30 to 39, (c) range 120 to 129.

Figure 25: An additional arithmetic table for number words (zero to nine), analogous to those in Figure 7.

O LOGIT DISTRIBUTION FOR CELLS IN THE ARITHMETIC TABLE

The logit distributions across tokens for randomly sampled correct arithmetic examples are displayed in Figure 26.

Figure 26: Logit distributions for randomly sampled checkmarked cells, sampling two cells for numbers, one for number words, and one for months: (a) source token '83', target residue 7 modulo 10; (b) source token '50', target residue 6 modulo 10; (c) source token ' eleven', target residue 5 modulo 10; (d) source token 'September', target residue 0 modulo 10.

P LINEAR PROBES FOR MODULI OTHER THAN 10

While our experiments in Section 3 focus on the study of mod-10 features, one might hypothesise that there exist similar natural mod-$k$ features for other $k$. To explore this hypothesis, we repeat the linear probing experiment detailed in Section 3.3 for a range of moduli $m \in \{2, \dots, 25\}$. Specifically, we test whether we can learn linear probes $P^{(m)} \in \mathbb{R}^{m \times d_{\mathrm{model}}}$ to predict the value of $i \bmod m$ from the MLP$_0$-representations of numeric tokens $t_i$, and whether these probes generalise to non-numeric tokens. As per Appendix B.1, we train our probes on numeric tokens from '0' to '500' (both with and without a leading space), holding out 10% of these tokens as a validation set, and we test our probes on a dataset of unseen tasks including cardinal numbers, Roman numerals, months, and days.

⁹ Note, though, that as our successor dataset contains 1041 tokens, a random classifier (even when restricted to tokens in the successor dataset) would achieve an accuracy of only about 0.001.

We present the results of these experiments in Figure 27. Observe that, while validation and test performance are above random chance for almost all moduli, we cannot easily extract 'mod-$m$ data' from token representations for all $m$. Indeed, the only probes with out-of-distribution performance above 0.5 are those for moduli 2-5, 6, 10, 12, 15, and 20 (and the six probes with the lowest validation performance are those corresponding to the six primes between 5 and 25).

Figure 27: Validation (in-distribution task) and test (out-of-distribution task) accuracy for linear probes $P^{(m)}$, $m \in \{2, \dots, 25\}$. The vertical purple lines denote moduli divisible by 5.

We see some evidence, however, that 10 is a particularly significant modulus for token MLP$_0$-representations: $P^{(10)}$ has both the joint-highest validation performance (together with $P^{(2)}$, $P^{(4)}$, $P^{(5)}$ and $P^{(20)}$) and the highest out-of-distribution performance. Moreover, of all the probes whose validation accuracy is above 0.5, $P^{(10)}$ has the smallest drop in performance from in-distribution to out-of-distribution tasks.

In particular, for weeks (which we might expect to have 'mod-7 features') and months (which we might expect to have 'mod-12 features'), not only are the performances of $P^{(7)}$ and $P^{(12)}$ substantially lower than that of $P^{(10)}$, but $P^{(7)}$ fails to correctly identify the index mod 7 of any day of the week, while $P^{(12)}$ correctly identifies the index mod 12 of only 8/12 months (failing on February, May, July and August). By contrast, $P^{(10)}$ correctly identifies the index mod 10 of 4/7 days of the week and 10/12 months.
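The mod-$m$ sweep is a small generalisation of the probe sketch in Appendix B.1; an outline follows, with the representation and label tensors assumed precomputed as before.

```python
import torch
import torch.nn as nn

def train_mod_m_probe(reps, values, m, epochs=100, lr=1e-3):
    """Train a linear probe P^(m): d_model -> m predicting values mod m.

    reps:   (N, d_model) MLP_0 representations of numeric tokens.
    values: (N,) long tensor of the tokens' numeric values.
    """
    probe = nn.Linear(reps.shape[1], m)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    labels = values % m
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(probe(reps), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return probe

# Sweep over m = 2..25 and score on the out-of-distribution tasks:
# for m in range(2, 26):
#     probe = train_mod_m_probe(train_reps, train_vals, m)
#     acc = (probe(test_reps).argmax(-1) == test_vals % m).float().mean()
```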