← Back to papers

Paper deep dive

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Bilal Chughtai, Alan Cooney, Neel Nanda

Year: 2024 · Venue: ICML 2024 · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 111

Models: GPT-J, GPT2-XL, Pythia-12B, Pythia-160M, Pythia-1B, Pythia-2.8B, Pythia-410M, Pythia-6.9B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 7:30:07 PM

Summary

The paper investigates the mechanistic basis of factual recall in transformer-based LLMs, identifying an 'additive motif' where multiple independent mechanisms (Subject Heads, Relation Heads, Mixed Heads, and MLPs) contribute additively to the correct output. The authors extend Direct Logit Attribution (DLA) to disentangle these contributions by source token, revealing how models constructively interfere on correct attributes despite individual components often being insufficient or prone to misfiring.

Entities (6)

Direct Logit Attribution · interpretability-technique · 99%
Pythia-2.8b · llm · 98%
Additive Motif · mechanism · 95%
Mixed Heads · model-component · 95%
Relation Heads · model-component · 95%
Subject Heads · model-component · 95%

Relation Signals (4)

Subject Heads contributes to Factual Recall

confidence 90% · Subject heads extract specific attributes about subjects... these mechanisms all act on the END position.

Relation Heads contributes to Factual Recall

confidence 90% · Relation heads extract many attributes in the set R... these mechanisms all act on the END position.

Mixed Heads contributes to Factual Recall

confidence 90% · Mixed heads attend to both SUBJECT and RELATION, and perform the role of both.

Additive Motif governs Factual Recall

confidence 90% · We find that the mechanistic story behind factual recall is more complex... It comprises several distinct... mechanisms that additively combine.

Cypher Suggestions (2)

Find all model components contributing to factual recall · confidence 90% · unvalidated

MATCH (c:Component)-[:CONTRIBUTES_TO]->(t:Task {name: 'Factual Recall'}) RETURN c.name, c.type

Identify mechanisms that interact with specific token types · confidence 85% · unvalidated

MATCH (c:Component)-[:ATTENDS_TO]->(t:TokenGroup) RETURN c.name, t.name

Abstract

Abstract: How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form `Fact: The Colosseum is in the country of'. We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual source tokens. We use this technique to unpack what we call `mixed heads' -- which are themselves a pair of two separate additive updates from different source tokens.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

110,818 characters extracted from source content.


Summing Up The Facts: Additive Mechanisms Behind Factual Recall in LLMs

Bilal Chughtai, Alan Cooney, Neel Nanda
(Independent. Correspondence to: Bilal Chughtai <brchughtaii@gmail.com>. NeurIPS 2023 Attributing Model Behaviour at Scale Workshop.)

Abstract

How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task – factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual source tokens. We use this technique to unpack what we call 'mixed heads' – which are themselves a pair of two separate additive updates from different source tokens.

1. Introduction

How do large language models (LLMs) store and use factual knowledge? We study the factual recall setup, where models are explicitly tasked with surfacing knowledge as output tokens in prompts of the form "Fact: The Colosseum is in the country of". Our work falls within the field of mechanistic interpretability (Elhage et al., 2021; Olah et al., 2020), which focuses on reverse-engineering the algorithms that trained neural networks have learned. Much attention has recently been paid to interpreting decoder-only transformer-based large language models: while these models have demonstrated impressive capabilities (Brown et al., 2020; Wei et al., 2022), we have little understanding of how these models produce their outputs.

Prior work on interpreting factual recall has mostly focused on localizing knowledge within transformer parameters. Meng et al. (2023a) find that an important role of early MLP layers is to enrich the internal representations of subjects (The Colosseum), simultaneously looking up all known facts and storing them in activations on the final subject token. Since the model is autoregressive, this occurs before seeing which relation (country of) is requested. Our contribution is to study how this information is subsequently moved and used by the model. There are several possible mechanisms models could use to retrieve facts from these enriched subject representations. Geva et al. (2023a) suggest an algorithm that allows models to extract just the correct fact, ignoring other irrelevant facts in the enriched subject representation. Hernandez et al. (2023) more recently showed that such facts can be linearly decoded from the enriched subject representations. In this paper, we build on this prior work by carefully inspecting what models actually do, using tools from mechanistic interpretability.

Our core contribution in this work is showing that models primarily solve factual recall tasks additively. We say models produce outputs additively if:

1. There are multiple model components whose outputs independently directly contribute positively to the correct (mean-centred) logit.
2. These components are qualitatively different – their distributions over output logits are meaningfully different.

3. These components constructively interfere on the correct answer, even if the correct answer is not the argmax output logit of individual components in isolation.

We term this generic phenomenon the additive motif. We provide further discussion regarding this motif in Section 4.

What are these different mechanisms? Consider the example shown in Figure 1. There are two sources of information here – the subject Colosseum and the relation country. These correspond to two independent clusters of possible updates: updates that consider many different attributes about the Colosseum (e.g. Italy, Rome), and updates that consider many different countries (e.g. Italy, Spain).

[Figure 1. Four independent mechanisms models use for factual recall: (1) Subject Heads, (2) Relation Heads, (3) Mixed Heads and (4) MLPs (omitted). These combine additively, constructively interfering to elicit the correct answer. Each mechanism individually is less performant than the sum of them all, with most individual mechanisms incapable of performing the task alone.]

By using mechanistic interpretability to investigate how factual recall is performed by the model, we find four different internal model mechanisms implement these two updates. Each mechanism independently boosts the correct answer (condition 1). There are two qualitatively different clusters of output behavior (condition 2). And while each mechanism may not individually completely solve the task, we find that additively combining all four results in a large amount of constructive interference on correct attributes – this is significantly more robust (condition 3). Thus, factual recall is additive.

Our work highlights a limitation of narrow circuit analysis. We should expect models to make predictions based on multiple parts of their input. Prior mechanistic interpretability work has neglected to consider all sources of information in mechanistic analysis. For instance, in the work by Wang et al. (2022) models are tasked with completing sentences of the form "When John and Mary went to the store, John bought flowers for". This task has two components – (a) figure out that the answer should be a name, and then (b) figure out what the correct name is. Through a combination of using the logit difference Mary - John as a metric and heavily templated prompts, the authors isolate the circuit for (b), but neglect to study (a). Though just studying (b) and conditioning on the answer being a name is a valid research question, it is important to be explicit that part of the behaviour is left unexplained, and our work implies that (a) is also an important part of predicting the next token. In factual recall, this corresponds to updating outputs based on the relation, as well as the subject. We find additivity through studying both of these sources of information, and analyzing output attributes relating to both.
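The additive motif above can be made concrete with a small numerical sketch. The numbers below are hypothetical logit contributions chosen for illustration, not measured values from any model: two qualitatively different updates each boost the correct answer, neither ranks it first in isolation, but their sum does.

# Illustrative only: hand-picked numbers, not measured DLA values from the paper.
candidates = ["Italy", "Rome", "Spain", "France"]

# Two qualitatively different clusters of updates (condition 2): a subject-based
# update boosting facts about the Colosseum, and a relation-based update
# boosting countries in general.
subject_update = {"Italy": 1.0, "Rome": 1.4, "Spain": 0.0, "France": 0.0}
relation_update = {"Italy": 1.2, "Rome": 0.0, "Spain": 1.0, "France": 1.5}

# Each update contributes positively to the correct logit (condition 1),
# but neither has "Italy" as its argmax in isolation.
for name, update in [("subject", subject_update), ("relation", relation_update)]:
    print(name, "argmax:", max(update, key=update.get))

# Summing the contributions constructively interferes on "Italy" (condition 3).
total = {tok: subject_update[tok] + relation_update[tok] for tok in candidates}
print("sum argmax:", max(total, key=total.get))  # -> Italy (1.0 + 1.2 = 2.2)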
Our second contribution is to extend the technique of direct logit attribution (DLA) (Wang et al., 2022; Elhage et al., 2021; nostalgebraist, 2020). We find this technique crucial in our analysis. DLA is a technique that converts individual model component (attention head, MLP neuron) outputs into the space of output logits, through the insights that the map to logits from the residual stream is linear (up to LayerNorm, which may be reduced to just a scaling factor for our purposes (Nanda, 2022)) and that the residual stream is a cumulative sum of prior model components (Elhage et al., 2021). DLA by default considers the entire attention head as one unit, but Elhage et al. (2021) demonstrate that attention head outputs are a linear weighted sum over source positions. We may therefore split the DLA of an attention head up into contributions from different source tokens. This insight allows us to disentangle the two separate and additive contributions of particular attention heads from SUBJECT and RELATION tokens.

2. Methods

Task. We consider tuples (s, r, a) of factual information containing a subject s, attribute a (we use the words 'attribute' and 'fact' interchangeably), and relation r connecting the two. To elicit facts in models, we provide a natural language prompt describing the pair (s, r). See Table 1 for example tuples and prompts. At various points we study and aggregate over sets of tuples (s, r, a) with s or r held constant. We filter for tuples (s, r, a) for which the model attains the correct answer, which we define as a being within the top ten output logits. Most commonly the correct attribute a attains rank 0 (see Figure 7 in the Appendix). Our dataset is hand written, but is inspired by CounterFact (Meng et al., 2023a), ParaRel (Elazar et al., 2021), and Hernandez et al. (2023). See Appendix C for more information on our dataset, including a discussion of dataset requirements that limit size.

Table 1. Some examples of factual tuples (s, r, a). We prepend the prefix "Fact:" to the concatenated pair (s, r) for inference, as this slightly improves performance. We also include example elements in the sets S and R of attributes pertaining to the subject s and relation r respectively.
Subject s | Relation r | Attribute a | Attributes S\a | Attributes R\a
Kobe Bryant | plays the sport of | basketball | NBA, Lakers, USA | tennis, golf, football
The Eiffel Tower | is in the country of | France | Paris, iron, Gustave | Pakistan, China, Sudan
Germany | has capital city | Berlin | German, Rhine, BMW | London, Rome, Canberra

Model. We primarily investigate the Pythia-2.8b model (Biderman et al., 2023), though find similar mechanisms are present in other models. In Appendix E.2 we briefly study GPT2-XL (Radford et al.), GPT-J (Wang & Komatsuzaki, 2021), and Pythia models with fewer and greater parameters.

Counterfactual Attributes. We are interested in what mechanisms surface the correct attribute a. In order to better understand this, we find it useful to study two further sets of attributes, S and R (while these sets may not be complete or faithful to true model concepts, they do suffice to help us find mechanisms). The correct attribute a ∈ S ∩ R. S is the set of attributes relevant to the subject: an attribute a ∈ S if there exists some other relationship r′ such that (s, r′, a) is a valid factual tuple. R is the set of attributes relevant to the relation: an attribute a ∈ R if there exists some other subject s′ such that (s′, r, a) is a valid factual tuple. See Table 1 for example elements in S and R.
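To make the data format concrete, the following is a minimal Python sketch of one datapoint and the prompt construction. The class name, field names, and values are illustrative (the values are taken from Table 1); this is not the authors' actual dataset schema.

from dataclasses import dataclass, field

@dataclass
class FactualTuple:
    # One (s, r, a) datapoint plus the counterfactual attribute sets S\a and R\a.
    subject: str          # s
    relation_text: str    # r, as natural language
    attribute: str        # a, the correct answer
    subject_attrs: list = field(default_factory=list)   # other elements of S
    relation_attrs: list = field(default_factory=list)  # other elements of R

    def prompt(self) -> str:
        # The paper prepends "Fact:" to the concatenated (s, r) pair.
        return f"Fact: {self.subject} {self.relation_text}"

example = FactualTuple(
    subject="Kobe Bryant",
    relation_text="plays the sport of",
    attribute="basketball",
    subject_attrs=["NBA", "Lakers", "USA"],
    relation_attrs=["tennis", "golf", "football"],
)
print(example.prompt())  # Fact: Kobe Bryant plays the sport of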
Token Positions. We will often refer to particular groups of token positions in the input sequence.

• PREFIX – all tokens before the subject, usually "Fact:".
• SUBJECT – all tokens of the subject s, e.g. "The Colosseum".
• RELATION – all tokens of the relation r, e.g. "is in the country of".
• END – the final token, which is where factual information must be moved to in order to surface the correct answer, e.g. "of".

Logit Lens. The logit lens (nostalgebraist, 2020) is an interpretability technique for interpreting intermediate activations of language models, through the insights that (1) the residual stream is a linear sum of contributions from each layer (Elhage et al., 2021) and (2) the map to logits is approximately linear. It pauses model computation early, converting hidden residual stream activations to a set of logits over the vocabulary at each layer by directly applying the unembedding map.

Direct Logit Attribution (DLA) is an extension of the logit lens technique. It zooms in to individual model components, through the further insight that the residual stream of a transformer can be viewed as an accumulated sum of outputs from all model components (Elhage et al., 2021). DLA therefore gives a measure of the direct effect of individual model components on model outputs (DLA can be limited; see e.g. Rager et al. (2023)).

DLA by source token group is an extension to the DLA technique through the further insight that attention head outputs are a weighted sum of outputs corresponding to distinct attention source positions (Elhage et al., 2021). This allows us to quantify how each source token group directly affects the logits through individual attention heads. This is useful in disentangling head types, in particular mixed heads (Figure 1), which comprise two separate contributions from their attention paid to the subject and their attention paid to the relationship. See Appendix D for more details on this technique. We say the DLA can be attributed to either the SUBJECT tokens or the RELATION tokens. This mostly makes sense for the short prompts in our setup, but may be misleading in longer contexts, as models move information around and may store information on intermediate tokens.
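The following is a minimal sketch (not the authors' code) of how DLA split by source token might be computed with the TransformerLens library, which the paper uses for tooling. The head index, answer token, and position slices below are illustrative assumptions, and the final LayerNorm is approximated by its cached scale.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-2.8b")  # a smaller Pythia model also works
prompt = "Fact: Kobe Bryant plays the sport of"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

layer, head = 17, 2                              # illustrative head, not a claim about Pythia-2.8b
answer = model.to_single_token(" basketball")    # assumes " basketball" is a single token

pattern = cache["pattern", layer][0, head, -1]   # attention from END to each source position
v = cache["v", layer][0, :, head]                # value vectors per source position, [pos, d_head]

# Per-source contribution to the head's output at END, mapped to the answer logit.
per_source_out = pattern[:, None] * v @ model.W_O[layer, head]     # [pos, d_model]
scale = cache["ln_final.hook_scale"][0, -1]      # approximate final LayerNorm by its scale
per_source_dla = (per_source_out / scale) @ model.W_U[:, answer]   # [pos]
# The paper uses mean-centred logits; subtracting the mean over the vocabulary is omitted here.

# Group by source token type (slices are assumptions; check with model.to_str_tokens(prompt)).
SUBJECT = slice(3, 5)                  # e.g. " Kobe", " Bryant"
RELATION = slice(5, tokens.shape[1])   # e.g. " plays" ... " of"
print("DLA from SUBJECT :", per_source_dla[SUBJECT].sum().item())
print("DLA from RELATION:", per_source_dla[RELATION].sum().item())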
3. Results

In this section, we use mechanistic interpretability to find four separate mechanisms behind factual recall that correspond to two clusters of additive updates, relating to either the subject or the relation in the prompt. These updates constructively interfere on the correct attribute to elicit the correct answer. These mechanisms all act on the END position. We summarize these mechanisms as follows and in Figure 1.

1. Subject Heads (Section 3.1) – Attention Heads that attend strongly to SUBJECT and extract attributes pertaining to the subject, in the set S, from the enriched subject representation. Some such heads extract the correct attribute a, others extract a range of other attributes. These heads activate in response to any factual recall type prompt, even if the relationship given does not match their category – they can and do misfire, extracting irrelevant attributes.

2. Relation Heads (Section 3.2) – Attention Heads that attend strongly to RELATION for a particular relation and extract many attributes pertaining to that relation, in the set R. They do not preferentially extract the correct attribute associated with the subject, a.

3. Mixed Heads (Section 3.3) – Attention Heads that attend to both SUBJECT and RELATION, and perform the role of both (1) and (2) simultaneously. From SUBJECT, they extract the correct attribute a, among other things. From RELATION, they extract many attributes in the set R, often also privileging the correct attribute a, due to a phenomenon we term 'subject to relation propagation'. The sum of these two separate contributions is the total head direct effect.

4. MLPs (Section 3.4) – Part of the function of MLPs is to boost many attributes in the set R.

Inspecting the logit ranks is highly suggestive of an additive algorithm: many incorrect attributes in both of the sets S and R appear highly in output tokens (Table 4 in the Appendix). In the remainder of this section, we provide several lines of evidence that these four mechanisms implement an additive algorithm. In particular, we will show (a) all four mechanisms exist for a range of relationships and are distinct, (b) each mechanism contributes positively to both correct and incorrect attributes and matters for task performance, and therefore (c) each individual mechanism is inferior to the sum of all four mechanisms. Showing (a-c) suffices via our definition of additivity in the Introduction. We perform further experiments in Appendix E. Figures 2 and 3 summarise these results.

[Figure 2. Three different types of attention head for factual extraction prompts of form "s plays the sport of": subject heads, relation heads and mixed heads. (Left) DLA on the correct sport, split by attention head source token; top 10 heads by total DLA shown. Each data point is one prompt. The grey lines have gradients 1/10 and 10 and denote the boundary we use to define head types, after aggregating over the relationship r; these cleanly separate subject and relation heads. (Right) Attention patterns of the top four heads of each kind on each prompt in the dataset. Subject and Relation heads attend mostly to SUBJECT and RELATION respectively; Mixed heads attend to both. Attention patterns are not used to define head type, but correlate well with it.]

[Figure 3. Top heads by absolute DLA on a for the relationship "is in the country of", alongside the mean DLA on the 5 largest magnitude relation attributes in R−a (other countries), split by source token group. Heads are labelled as Subject (S), Relation (R) or Mixed (M). Studying a large set of counterfactual attributes and splitting by attention source token lets us disentangle these head types; all three head types emerge. Subject heads mostly extract the correct attribute a from SUBJECT, relation heads extract a range of relation attributes in R from RELATION, and mixed heads capture everything remaining.]

3.1. Subject Heads

Individual subject heads extract specific attributes about subjects in some set S∩C by attending from END to SUBJECT, but not meaningfully to RELATION.
5 These heads extract the same attributes from a given subjectno matter what relationship is given– the attributebasketballis still extracted significantly by some subject head on the promptMichael Jordan is from the country of. Such heads explain why we observe incorrect attributes about the subject (i.e. in the setS) appearing in the top few output tokens on factual recall prompts. These heads sometimes depend on the relationship indirectly, through their attention pattern. 5 Generically, since individual attention heads read and write from a low rank subspace of the residual stream (Elhage et al., 2021), we find them to be specialized to same category of attributes C, which may not perfectly align withSorR(See Appendix E.8 for more discussion on head categories). 4 Summing Up The Facts L21H9 Michael Jordan Roger Federer Stephen Hawking The Colosseum MLP subject enrichment basketball tennis --- boxing L17H2 USA Switzerland England Italy L21H9 PLAYS_SPORT AR=16.76 L16H20 PLAYS_SPORT AR=10.53 L17H2 IN_COUNTRY AR=10.05 L17H17 IN_COUNTRY AR=107.23 L17H2 CAPITAL_CITY AR=2.99 L16H12 CAPITAL_CITY AR=2.83 L23H2 PRODUCT_BY AR=24.61 L19H13 PRODUCT_BY AR=9.84 −0.5 0 0.5 1 1.5 a R-a, top-5-mean S-a, top-5-mean Head DLA on attribute(s) Figure 4.Subject Heads exist for a range of relations. (Top) The mechanism by which subject heads act. They read from enriched subject representations, and copy the relevant attributes to output directions. We show this for a ‘sport’ head and a ‘country’ head. Both pathways activate whenever a factual recall prompt with the given subject is presented, no matter what the stated relationship is – they ‘misfire’. No sport is extracted forStephen Hawking. Raw data for this figure is in the Appendix in Table 6. (Bottom) Top two subject heads for four different relationships. These heads individually extract the correct attribute (blue) significantly more than other relation attributesR(red) and other subject attributesS (green). This indicates their categoryCis mostly narrow. L17H2 is more general, extracting many correlated facts about countries (e.g. country, currency, cities, etc.). These heads also have a high attention ratio toSUBJECToverRELATION(shown in the x axis labels). We define subject heads for a relationrto be heads with average DLA attributed toSUBJECTtokens / average DLA attributed toRELATIONtokens>10, when aggregated over a dataset of prompts with the relation held constant. This captures the intuition that these heads primarily read at- tributes from the subject and not the relation. In Figure 4, we analyze subject heads for different relation- ships across a range of subjects. By composing headOV circuits with the model unembedding, we may view indi- vidual heads as linear probes for particular output tokens. (Elhage et al., 2021). This technique effectively saturates the attention of the subject head to one on the final subject token, performs the usual attention head calculation, and reads off some DLA from the output. Since subject heads alwaysattend to the subject, this is principled: we discuss attention patterns of subject heads in Appendix E.4. 
We evaluate each head-probe qualitatively on a range of sub- L13H31 PLAYS_SPORT AR=138.62 L14H24 PLAYS_SPORT AR=13.73 L13H31 IN_COUNTRY AR=650.78 L18H20 IN_COUNTRY AR=12.09 L18H20 CAPITAL_CITY AR=18.36 L13H31 CAPITAL_CITY AR=216.87 L28H17 PRODUCT_BY AR=4950.10 L31H20 PRODUCT_BY AR=17.39 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 a R-a, top-5-mean Head DLA on attribute(s) LondonParisSeoulMumbaiTokyoBeijingShanghaiBangkokAmsterdamMadrid 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 L18H20 AR=18.36 L13H31 AR=216.87 L14H24 AR=30.00 L23H12 AR=26.87 L18H25 AR=4.48 Attribute DLA on Attribute Figure 5.Relation heads exist for a range of different relationships. (Left) The top two relation heads for four different relationships. The heads extract the correct attribute (blue) about as much as they extract many other attributes in the setR(red). They also have a high attention ratio toRELATIONoverSUBJECT(shown in the x axis labels). (Right) Many cities are extracted by heads over a range of prompts with relationhas the capital citywith differentsubjects. The error bars denote the standard deviation over these subjects. While heads push for some cities more than others, small error bars indicate this variation is consistent across input subjects. This suggests relation head outputs do not causally depend on the subject. We include similar plots for other relation- ships in Appendix E.5. jects, showing they extract meaningful and interpretable attributes from the enriched subject representation. We note demonstrates that the head categoryCis not aligned withR orS: e.g. L22H17 extracts only the sport ofbasketball, but not other sports. We also often observe correlated facts NBAandbasketballbeing extracted simultaneously. 3.2. Relation Heads Individual relation heads extract many attributes in the set R∩Cby attending fromENDtoRELATION, but not signif- icantly toSUBJECT. These heads do not causally depend on the subject, even indirectly. Such heads explain why we observe incorrect attributes pertaining to the relation (i.e. in the setR) appearing in the top few output tokens on factual recall prompts. 5 Summing Up The Facts We define relation heads for a relationrto be heads with av- erage DLA attributed toRELATIONtokens / average DLA at- tributed toSUBJECTtokens>10, over a dataset of prompts with the the relation held constant. This captures the intu- ition that relation heads mostly read the correct attribute from the relation, and not the subject. Figure 5 demonstrates relation heads exist for a range of re- lationshipsrand that their direct effect on logits mostly does not depend on the subjects. Preliminary results suggest this latter finding is less true in larger models; a result which we expand on in Appendix E.5.4. These heads can additionally be characterized through high attention to theRELATION overSUBJECT. Interestingly, there are many shared heads between relationships, including L13H31, which is impor- tant for both sports and countries. Each relation head will push for certain attributes over others, with a small amount of variation from prompt to prompt. Which attributes a relation head prefers is affected minimally by the subject. A complication is that DLA can be affected by the norm of the accumulated residual stream (via LayerNorm), which varies slightly between prompts, leading to some variation. To show this is a large effect, we analyze the ordered DLA across all vocab tokens of the top few relation heads for several prompts in Appendix E.5. 
This demonstrates that theprimary functionof these heads is to extract attributes in R. We also perform causal activation patching experiments, where we patch the top few relation heads, and demonstrate that this does not reduce performance on average - indicating that these heads do not meaningfully depend on the subject, even indirectly. 3.3. Mixed Heads Individual mixed heads extract many attributes in some set (S∪R)∩C, and also privilege the correct attributeaamong such attributes. They behave as a combination of subject and relation heads – they attend to bothSUBJECTandRE- LATION. FromSUBJECT, they extract the correct attribute amore than other attributes fromR. FromRELATION, they extract many attributes inR, often also privileginga. This is due to significant propagation of subject information to theRELATION, which we do not rigorously study, but at- tempt to disentangle in Appendix E.6. We attribute the two contributions from different source positionsSUBJECTand RELATIONthrough our DLA by source technique. Figure 3 demonstrates this; We see mixed heads generally extract the correct attributeafrombothSUBJECTandRE- LATION(blue and green) more than other relation attributes R(red and purple). To further illustrate this effect, we an- alyze the top DLA token outputs of a selection of mixed heads in the Appendix in Table 8, split by source token, demonstrating these heads (a) attend to two distinct places and (b) extract significant information from these two dis- PLAYS_SPORTIN_COUNTRYCAPITAL_CITYPRODUCT_BY 0 1 2 3 4 5 6 7 8 9 DLA, a DLA, R-a, top-5-mean Relation DLA (sum over all MLP layers) squashsurfingskiingrugbyboxingvolleyballhockeycricketcyclingswimmingbasketballgolfsailingsoccertennisbaseballgymnasticslacrossecanoeingwrestling 0 1 2 3 4 5 6 7 8 9 Attribute DLA (sum over all MLP layers) Figure 6.(Left) The sum of all MLP outputs boosts relation at- tributesRfor a range of relationships. The MLPs boost the correct attribute (blue) less than they boost other attributes in the setR (red). The MLP boosts a wider set of attributes inRthan we au- tomatically check for. (Right) many sports are boosted by MLPs over a range of prompts with relationplays the sport of, independent of which subject is given. Error bars are standard deviation over different subjects. This suggests the direct effect of the MLP does not causally depend on the subject. tinct places. 3.4. MLPs MLP layers on theENDtoken often uniformly boost many attributes in the setR(like relation heads). The MLPs do not preferentially boost the correct attributea. We find that the categoryCof the MLP direct effect is significantly larger than those of individual heads, which intuitively makes sense given the MLP has many more parameters than indi- vidual attention heads. Individual neurons would likely have much more restricted categories. We note we only study part of the function of the MLP, and only on theENDposition. We hypothesize MLPs either compose with relation heads, or with relation information directly. In Figure 6, we show that for a range of relationships an aspect of the total direct effect of the MLP layers is to boost many attributes inR, includinga, but thatais not privileged among the attributesR. We too see that while 6 Summing Up The Facts the MLP layers up-weight certain attributes more than oth- ers, this variation is consistent across subjects, indicating these outputs do not causally depend on the subject. 
In Appendix E.7, we show that, at least for some relationships, this is the primary direct effect of the MLP layers, by analyzing the top DLA tokens of summed MLP outputs.

4. Discussion

Additivity. We speculate that models in general prefer to solve tasks in an additive manner via multiple independent circuits, as we describe in Section 1. This claim is supported by prior work in toy models (Nanda et al., 2023; Chughtai et al., 2023), but has not been shown in real language models. We do not explain why the additive mechanism is preferred, but speculate that compounding evidence through several simple circuits is significantly easier for models. The model is able to achieve comparable performance through fewer steps of composition by aggregating many shallow circuits. Additionally, due to a softmax being applied to model outputs when taking cross-entropy loss, models extremize their outputs. Outputting small amounts of incorrect answers is therefore not that costly to the model, so long as constructive interference results in a large logit difference between correct and incorrect answers pre-softmax.

As additional intuition regarding what additivity is, we present a toy example. Consider a two-class model tasked with predicting whether an integer is divisible by 6 (i.e. we have two classes, true or false). Consider the following two mechanistic ways of solving the task. (a) Solve the task directly, memorising which integers are divisible by 6. (b) Solve the task in two independent parts: assign a +1 true logit to all numbers divisible by 2; assign a +1 true logit to all numbers divisible by 3, with a different circuit; and apply a uniform bias of -1.5 to the true logit. Both mechanisms solve the task, in the argmax logit sense. In this example (a) is non-additive. (b) is additive, by the criteria (1-3) given in Section 1: there are two different components that contribute to the answer (1), they have qualitatively different outputs (2), and they constructively interfere on the correct answer, with each component insufficient alone (3). This example is analogous to how a transformer functions, since the residual stream is an additive sum of outputs from model components, and there is an (approximately) linear map from the residual stream to the output logits given by the unembedding, so each component can be considered to be writing to logits separately in a linear fashion (Elhage et al., 2021). Note that condition (2) is necessary to exclude cases where the model increases its confidence through adding two identical components, which we do not consider to be additive.

Reversal Curse. Our work on factual recall offers a partial mechanistic explanation for the reversal curse – the noted limitation of LLMs in generalizing to 'B is A' when trained on 'A is B' (Berglund et al., 2023), which has also been suggested by Grosse et al. (2023) and Thibodeau (2022). We provide indirect and suggestive evidence that this is to be expected. We find a circuit by which models may learn to output 'A is B', involving subject enrichment on the A tokens, and some attention head attending to A and extracting B. Importantly, this is a unidirectional circuit with two unidirectional components – it extracts the fact 'B' from 'A'. Our circuit suggests that the reason training on 'A is B' does not boost 'B is A' in general is because training on 'A is B' only boosts the unidirectional A→B mechanisms, and has no effect on potential B→A mechanisms.
As further evidence, assembling multi-token input representations is a different task mechanistically to outputting multi-token facts. This is in part due to input and output spaces being separate – Embeddings and unembeddings are untied in modern LLMs: W E ̸=W T U . So the ‘A’ in ‘A is B’ is internally represented differentlyto the ‘A’ in ‘B is A’, further suggesting these two tasks are separate. We view this as evidence that our work, and mechanistic interpretability more generally, can produce useful insights into the kinds of high level behavior neural networks may implement. 5. Related Work Interpreting Factual Knowledge.There has been much interest in understanding and editing factual knowledge in language models in a white box manner. Geva et al. (2021) demonstrated transformer MLP layers can be interpreted as key-value memories, and later extended this to show a partial function of transformer MLP layers is to perform computation to iteratively update the distribution over output vocabulary space (Geva et al., 2022). In a separate line of work, Meng et al. (2023a) found a separate function of MLP layers: to enrich the representa- tions in the residual stream of subjects with facts for the model to later use, which was discovered using a causal in- tervention based methodology. They also had success with using this localization to edit the weights of the model to change output predictions (ROME), which was later scaled up to 10000 facts (Meng et al., 2023b). Subsequent work has demonstrated this technique may just be introducing a “loud” fact (Thibodeau, 2022), and that the performance of editing in a layer may not be a reliable way to localize the fact (Hase et al., 2023). Equipped with this knowledge, an interesting question is that of how specific knowledge about a subject is isolated from other knowledge. Geva et al. (2023b) describe a circuit for factual recall with three steps: (1) subject enrichment in MLP sublayers, as in ROME, (2) relation propagation to 7 Summing Up The Facts theENDtoken, and (3) selective attribute extraction by later layer attention heads. Our work offers a fuller understanding of this circuitry and finds additional circuitry by zooming in more deeply into what individual model components are doing. Separately, Hernandez et al. (2023) demonstrate that facts can be linearly decoded from the enriched subject residual stream, which supports an aspect of the full picture we find. We build on this by zooming in to the actual transformer mechanisms, finding linear decoding maps ‘in the wild’ in headOVcircuits as opposed to trainng Extracting Knowledge from LMs. The standard approach to understand what a model knows is through prompting models in a black box fashion. (Petroni et al., 2019; Jiang et al., 2020; Roberts et al., 2020; Zhong et al., 2021). Elazar et al. (2021) study whether factual knowledge generalizes across paraphrasing. Our work gives initial insights into what mechanisms could explain when models may gener- alize to paraphrases and when they would not. Recently, Berglund et al. (2023) discuss a phenomena named the ‘reversal curse’, where models trained on “A is B” fail to generalize to “B is A”, which has also been observed by prior work (Grosse et al., 2023; Thibodeau, 2022). Our work explains why this phenomenon is to be expected mech- anistically – facts are stored as asymmetric look up tables in models, and so training on “A is B” is unlikely to reinforce the inverse look up table “B is A” too. 
Mechanistic interpretabilityencompasses understanding features learned by machine learning models (Olah et al., 2017), mathematical frameworks for understanding machine learning architectures (Elhage et al., 2021), and efforts to find circuits in models (Cammarata et al., 2021; Nanda et al., 2023; Chughtai et al., 2023; Heimersheim & Janiak; Wang et al., 2022). Mechanistic interpretability work encompasses manually inspecting model components, performing causal interventions to localize model behavior (Chan et al.; Geiger et al., 2022; 2021) and work on automating the discovery of causal mechanisms (Conmy et al., 2023; Bills et al., 2023). We make use of mechanistic interpretability techniques and frameworks in this paper. 6. Conclusion In this work, we analyze neural circuitry responsible for the recall of known facts about subjects. We show that in a small dataset factual recall mechanistically comprises several distinct moving parts. We find several simple and distinct mechanisms that interactadditivelyto extract facts. These constructively interfere to produce the correct answer. Each mechanism is insufficient alone, but the summing up of several contributions is significantly more robust. We call this theadditive motif. This motif seems core to the model’s functioning in this fairly general set up, and so likely generalizes to other tasks - we see this as a promising direction of future investigation. Our work contributes to the growing literature on factual recall, and opens up sev- eral interesting new directions, discussed in Appendix B. We also highlight some of the limitations of narrow circuit analysis. By expanding our scope of study were able to un- cover mechanisms for factual recall prior work had missed. We consider such study important forcomprehensivelyun- derstanding neural networks, a stated goal of the field of mechanistic interpretability (Elhage et al., 2021). 6.1. Impact Statement This paper presents work whose goal is to advance the field of AI interpretability. We hope that such work helps shed light on how black box machine learning systems function, which we expect to be vital in their safe and beneficial development. 6.2. Author Contributions Bilal Chughtaiwas the primary research contributor on the project. He contributed the DLA by source token technique and the idea to study the attribute setsSandR. He used this to propose the four seperate mechanisms. He ran many experiments verifying this distinction, and wrote the large majority of the paper. Alan CooneyAlan Cooney was the secondary research contributor on the project. He was heavily involved in the scoping and research stages of the project. He lead the re- search effort into mixed heads, and proposed the categorical head distinction. He was less involved in the writing of the paper, primarily taking a lead on writing the mixed heads section and drafting figure 1. Neel Nandaadvised on the project. He proposed the factual recall set up as an interesting set up to study, originally in the context of attention head superposition (Appendix F). He gave advice and feedback throughout the project, including on the final manuscript. 6.3. Acknowledgments We are grateful to Arthur Conmy, Oskar Hollinsworth, Jett Janiak and Tony Wang for providing generous and valu- able feedback on our manuscript. Over the course of the project, our thinking and exposition was also greatly clari- fied through correspondence with Callum McDougall. 
BC and AC would like to thank the London Initiative for Safe AI for providing an excellent research environment throughout the project. BC was supported by the Long Term Future Fund. We used PyTorch (Paszke et al., 2019) as our machine learn- ing framework. We made use of the TransformerLens li- brary (Nanda, 2023) for helpful transformer interpretability 8 Summing Up The Facts tooling. Our figures were made using Plotly (Inc., 2015). References Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stick- land, A. C., Korbak, T., and Evans, O. The Reversal Curse: LLMs trained on ”A is B” fail to learn ”B is A”, September 2023. URLhttp://arxiv.org/abs/ 2309.12288. arXiv:2309.12288 [cs]. Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, May 2023. URLhttp://arxiv.org/abs/2304. 01373. arXiv:2304.01373 [cs]. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models, 2023.URLhttps: //openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot Learners, July 2020. URLhttp://arxiv.org/abs/2005. 14165. arXiv:2005.14165 [cs]. Cammarata, N., Goh, G., Carter, S., Voss, C., Schubert, L., and Olah, C. Curve circuits.Distill, 2021. doi: 10.23915/distill.00024.006. Chan, L., Garriga-alonso, A., Goldowsky-Dill, N., ryangreenblatt, jenny, Radhakrishnan, A., Buck, and Thomas, N.Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]. URLhttps://w.alignmentforum. org/posts/JvZhhzycHu2Yd57RN/ causal-scrubbing-a-method-for-rigorously-testing . Chughtai, B., Chan, L., and Nanda, N. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations, May 2023. URLhttp:// arxiv.org/abs/2302.03025. arXiv:2302.03025 [cs, math]. Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimer- sheim, S., and Garriga-Alonso, A.Towards Auto- mated Circuit Discovery for Mechanistic Interpretability, July 2023. URLhttp://arxiv.org/abs/2304. 14997. arXiv:2304.14997 [cs]. Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L.Sparse Autoencoders Find Highly Interpretable Features in Language Models, Septem- ber 2023. URLhttp://arxiv.org/abs/2309. 08600. arXiv:2309.08600 [cs]. Elazar, Y., Kassner, N., Ravfogel, S., Ravichander, A., Hovy, E., Sch ̈ utze, H., and Goldberg, Y. Measuring and Improving Consistency in Pretrained Language Models, May 2021. URLhttp://arxiv.org/abs/2102. 01017. arXiv:2102.01017 [cs]. Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circuits, 2021. 
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition.Transformer Circuits Thread, 2022. Geiger, A., Lu, H., Icard, T., and Potts, C. Causal Abstrac- tions of Neural Networks, October 2021. URLhttp:// arxiv.org/abs/2106.02997 . arXiv:2106.02997 [cs]. Geiger, A., Wu, Z., Lu, H., Rozner, J., Kreiss, E., Icard, T., Goodman, N. D., and Potts, C.Induc- ing Causal Structure for Interpretable Neural Networks, July 2022. URLhttp://arxiv.org/abs/2112. 00826. arXiv:2112.00826 [cs]. Geva, M., Schuster, R., Berant, J., and Levy, O. Trans- former Feed-Forward Layers Are Key-Value Memories, September 2021. URLhttp://arxiv.org/abs/ 2012.14913. arXiv:2012.14913 [cs]. Geva, M., Caciularu, A., Wang, K. R., and Goldberg, Y. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space, Oc- tober 2022. URLhttp://arxiv.org/abs/2203. 14680. arXiv:2203.14680 [cs]. Geva, M., Bastings, J., Filippova, K., and Globerson, A. Dis- secting Recall of Factual Associations in Auto-Regressive Language Models, April 2023a. URLhttp://arxiv. org/abs/2304.14767. arXiv:2304.14767 [cs]. 9 Summing Up The Facts Geva, M., Bastings, J., Filippova, K., and Globerson, A. Dis- secting Recall of Factual Associations in Auto-Regressive Language Models, April 2023b. URLhttp://arxiv. org/abs/2304.14767. arXiv:2304.14767 [cs]. Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Taj- dini, A., Steiner, B., Li, D., Durmus, E., Perez, E., Hub- inger, E., Luko ˇ si ̄ ut ̇ e, K., Nguyen, K., Joseph, N., Mc- Candlish, S., Kaplan, J., and Bowman, S. R. Studying Large Language Model Generalization with Influence Functions, August 2023. URLhttp://arxiv.org/ abs/2308.03296. arXiv:2308.03296 [cs, stat]. Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troitskii, D., and Bertsimas, D. Finding Neurons in a Haystack: Case Studies with Sparse Probing, June 2023. URLhttp:// arxiv.org/abs/2305.01610. arXiv:2305.01610 [cs]. Hase, P., Bansal, M., Kim, B., and Ghandeharioun, A. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models, January 2023. URLhttp:// arxiv.org/abs/2301.04213. arXiv:2301.04213 [cs]. Heimersheim, S. and Janiak, J.A circuit for Python docstrings in a 4-layer attention-only trans- former.URLhttps://w.alignmentforum. org/posts/u6KXXmKFbXfWzoAXn/ a-circuit-for-python-docstrings-in-a-4-layer-attention-only. Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Watten- berg, M., Andreas, J., Belinkov, Y., and Bau, D. Linearity of Relation Decoding in Transformer Language Mod- els, August 2023. URLhttp://arxiv.org/abs/ 2308.09124. arXiv:2308.09124 [cs]. Inc., P. T. Collaborative data science, 2015. URLhttps: //plot.ly . Place: Montreal, QC Publisher: Plotly Technologies Inc. Jermyn, A., Olah, C., and Henighan, T.At- tentionHeadSuperposition,2023.URL https://transformer-circuits. pub/2023/may-update/index.html# attention-superposition. Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. How Can We Know What Language Models Know?, May 2020. URLhttp://arxiv.org/abs/1911. 12543. arXiv:1911.12543 [cs]. McGrath, T., Rahtz, M., Kramar, J., Mikulik, V., and Legg, S. The Hydra Effect: Emergent Self-repair in Language Model Computations, July 2023. URLhttp://arxiv. org/abs/2307.15771. arXiv:2307.15771 [cs]. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. 
Lo- cating and Editing Factual Associations in GPT, Jan- uary 2023a. URLhttp://arxiv.org/abs/2202. 05262. arXiv:2202.05262 [cs]. Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. Mass-Editing Memory in a Transformer, Au- gust 2023b. URLhttp://arxiv.org/abs/2210. 07229. arXiv:2210.07229 [cs]. Nanda,N.TransformerLens/furthercomments.md atmain·neelnanda-io/TransformerLens,2022. URLhttps://github.com/neelnanda-io/ TransformerLens/blob/main/further_ comments.md. Nanda, N.TransformerLens, January 2023.URL https://github.com/neelnanda-io/ TransformerLens.original-date:2022-08- 26T20:20:38Z. Nanda, N., Chan, L., Lieberum, T., Smith, J., and Stein- hardt, J. Progress measures for grokking via mechanistic interpretability, January 2023. URLhttp://arxiv. org/abs/2301.05217. arXiv:2301.05217 [cs]. nostalgebraist.interpretingGPT:the logitlens—LessWrong,January2020. URLhttps://w.lesswrong. com/posts/AcKRB8wDpdaN6v6ru/ interpreting-gpt-the-logit-lens. Olah, C., Mordvintsev, A., and Schubert, L.Fea- ture Visualization.Distill, 2(11):e7, November 2017.ISSN 2476-0757.doi:10.23915/distill. 00007.URLhttps://distill.pub/2017/ feature-visualization. Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learn- ing and induction heads.Transformer Circuits Thread, 2022. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. 10 Summing Up The Facts InAdvances in neural information processing systems 32, p. 8024–8035. Curran Associates, Inc., 2019. URLhttp://papers.neurips.c/paper/ 9015-pytorch-an-imperative-style-high-performance-deep-learning-library. pdf. Petroni, F., Rockt ̈ aschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., and Riedel, S. Language Models as Knowledge Bases?, September 2019. URLhttp:// arxiv.org/abs/1909.01066. arXiv:1909.01066 [cs]. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multi- task Learners. Rager, C., Lau, Y.-T., Dao, J., and Jett. An adversarial ex- ample for Direct Logit Attribution: memory management in gelu-4l. 2023. URLhttps://w.lesswrong. com/posts/2PucFqdRyEvaHb4Hn/ an-adversarial-example-for-direct-logit-attribution-memory. Roberts, A., Raffel, C., and Shazeer, N. How Much Knowl- edge Can You Pack Into the Parameters of a Language Model?, October 2020. URLhttp://arxiv.org/ abs/2002.08910. arXiv:2002.08910 [cs, stat]. Thibodeau, J. But is it really in Rome? An investigation of the ROME model editing technique — AI Alignment Fo- rum, 2022. URLhttps://w.alignmentforum. org/posts/QL7J9wmS6W2fWpofd/ but-is-it-really-in-rome-an-investigation-of-the-rome-model. Wang, B. and Komatsuzaki, A.GPT-J-6B: A 6 bil- lion parameter autoregressive language model, May 2021. URLhttps://github.com/kingoflolz/ mesh-transformer-jax. Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. 
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, Novem- ber 2022. URLhttp://arxiv.org/abs/2211. 00593. arXiv:2211.00593 [cs]. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Met- zler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent Abilities of Large Lan- guage Models, October 2022. URLhttp://arxiv. org/abs/2206.07682. arXiv:2206.07682 [cs]. Zhong, Z., Friedman, D., and Chen, D. Factual Probing Is [MASK]: Learning vs. Learning to Recall, Decem- ber 2021. URLhttp://arxiv.org/abs/2104. 05240. arXiv:2104.05240 [cs]. Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Uni- versal and Transferable Adversarial Attacks on Aligned Language Models, July 2023. URLhttp://arxiv. org/abs/2307.15043. arXiv:2307.15043 [cs]. 11 Summing Up The Facts A. Limitations Our investigation attempts to present evidence that a range of mechanisms for factual recall exist within models, but does not claim to explain all such mechanisms. We present some evidence that all of these are important, but do not attempt to quantify how important each mechanism is. Further, while our separation of heads into Subject, Relation and Mixed heads is useful in understanding head function, the true picture is less clean, and where we draw the boundaries is somewhat arbitrary. In this paper, we argue that the distinction of at least Subject and Relation heads makes sense, but we acknowledge the mixed head boundary is somewhat fuzzy. Since our focus is demonstrating a range of mechanisms exist, we primarily investigate one model, Pythia-2.8b, on a fairly small dataset. We discuss some of the limitations we faced during dataset curation in Appendix C. Finally, in the plots in the main body of the paper, we generally focus on attributes that have high unigram frequency - (sports, countries, etc.). This makes analysis of individual model components simpler, as polysemantic heads will generally write common attributes with higher norm than less common ones. Our additive and constructive interference picture does still hold up for less common categories of word - however, often with lower logit lens significance on individual model components. B. Future Work Understanding Correlation. Correlated features have been shown to be organize themselves into geometric patterns in toy set ups where there are more features than parameter count. This can be thought of as a form of lossy compression, and is known as superposition (Elhage et al., 2022). In our work, we found similar attention heads responsible for reading and writing very correlated features, eg. ‘France’ and ‘Paris’ or ‘basketball’ and ‘NBA’, suggesting these features are stored together in superposition. We know superposition exists in real language models (Gurnee et al., 2023), but an open problem is understanding how models performcomputationof compressed features in superposition, overcoming issues of interference. In particular, it is unlikely that a linear method such as (Hernandez et al., 2023) could disentangle these. It is possible that constructive interference of our four mechanisms suffice to, but something more complex may be at play. Understanding MLP neurons. In this work, we analyze MLPs very briefly, showing they generally boost many attributes in the setR. Understanding how this is done more precisely would be of interest. 
One could first zoom in to individual neurons, instead of MLP layers as a whole, and attempt to identify which inputs are responsible for boosting the unembedding directions of attributes inR. Is the relation information being used explicitly? Or do these neurons just compose directly with relation head outputs? MLP neurons remain a challenge in interpreting the algorithms implemented by transformers. ROME. The ROME technique (Meng et al., 2023a) is able to edit model outputs in a way that generalises across a range of prompts, but has some limitations. The localisation needed may not be precise (Hase et al., 2023), and the phenomenon of “loud facts” suggests ROME is not as surgical as initially thought (Thibodeau, 2022). Future work could use our understanding of the end to end mechanisms behind factual recall to try and understand how ROME works in an end-to-end manner, and explain mechanistically why these limitations exist. Prompting Set Up. One could study how different prompting set ups affect the task of factual recall. For instance, how does a few shot prompt, or prompt injection of form ”Never say ‘Paris’. The Eiffel Tower is in the city of” work in improving or reducing performance? One could also study paraphrasing, in a similar fashion to (Elazar et al., 2021). One could compare the internal mechanisms found in this paper to those found for different prompting set ups and analyze the difference. Olsson et al. (2022) argues induction heads are important in in-context learning, but our understanding of the general phenomenon remains poor in general. Similarly, our understanding why jailbreaks such as that presented in (Zou et al., 2023) occur would be productive in mitigating the prevalence of jailbreaks. Multi-Step Factual Recall. Consider prompts of formThe largest church in the world is located in the city of. Humans would solve this task sequentially, with two inference steps. However, models may be able to solve this task in one forward pass. Additivity may be able to explain why. Investigating the mechanisms behind model performance in this task would be an interesting area of further investigation. 12 Summing Up The Facts C. Dataset Our dataset is loosely inspired by Meng et al. (2023a) and Hernandez et al. (2023), but is manually generated. We found these preexisting datasets to be unsatisfactory for our analysis, due to some additional requirements our set up necessitated. We firstly required models to bothknowfacts and tosay factswhen asked in a simple prompting set up, and for the correct attributeato be completely determined in its tokenized form by the subject and relationship. For example ‘The Eiffel Tower is in’ permits both the answer ‘Paris’ and ‘France’. For simplicity we avoided prompts of this form. Synonyms also gave us issues, e.g. ‘football’ and ‘soccer’, or ‘unsafe’ and ‘dangerous’. This mostly restricted us to very categorical facts, like sports, countries, cities, colors etc. We also wanted to avoid attributes that mostly involved copying, such as ‘The Syndey Opera House is in the city of Sydney‘, as we expect this mechanism to differ substantially from the more general mechanism, and to rely mostly on induction heads (Olsson et al., 2022). Next, we wanted to create large datasets withrheld constant, and separately, withsheld constant. Holding the relation constant and generating many facts is fairly easy. But generally models know few facts about a given subject, e.g. 
'Michael Jordan' is associated very strongly with 'basketball', but other facts about him are less important and less well known. Certain kinds of attributes, like 'gender', are likely properties of the tokens themselves, and likely not reliant on the 'subject enrichment' circuitry – e.g. 'Michael' and 'male'. We try to avoid these cases. We also restrict to attributes where the first attribute token mostly uniquely identifies the attribute – often the attribute is just a single token. If the first token of the attribute is a single character, this can be vague, so we omitted these cases. These considerations limited the size of the dataset we studied.

Here, we provide further details regarding our dataset. Our dataset comprises 106 prompts, across 10 different relations r. We summarise the relations we study in Table 2, and validate that our primary model of study achieves high accuracy on the dataset in Figure 7.

Relation | Relation text | Number of subjects | Example subjects
PROFESSOR AT | is a professor at the university of | 9 | Charles Darwin, Isaac Newton, Alan Turing
PLAYS SPORT | plays the sport of | 15 | Tom Brady, Patrick Mahomes, LeBron James
PRIMARY MACRO | has the primary macronutrient of | 11 | Potatoes, Rice, Oil
PRODUCT BY | is a product by the company of | 9 | Wii Balance Board, Windows 10, Platform Controller hub
IN COUNTRY | is in the country of | 7 | The Eiffel Tower, Sydney Opera House, Machu Picchu
CAPITAL CITY | has the capital city of | 10 | Brazil, Spain, Russia
LEAGUE CALLED | plays in the league called the | 6 | Tom Brady, Patrick Mahomes, Mookie Betts
FROM COUNTRY | is from the country of | 12 | LeBron James, David Beckham, Kobe Bryant
IN CONTINENT | is in the continent of | 7 | The Eiffel Tower, Sydney Opera House, Machu Picchu
IN CITY | is in the city of | 7 | The Eiffel Tower, Sydney Opera House, Machu Picchu

Table 2. The factual tuples in our dataset, aggregated over the relation r.

[Figure 7: histogram of the rank of the correct attribute a over all prompts (x-axis: rank, 0–10; y-axis: count).]

Figure 7. Ranks of the correct attribute a for all prompts in our dataset on Pythia-2.8b. We filter for prompts where the attribute a is within the top 10 logits, though the model has very high top-1 accuracy – the rank is usually zero.

To generate the sets S and R, we used GPT-4 to generate a large list of relevant attributes for each subject s and relation r. We then manually filtered these lists of attributes, for instance removing attributes beginning with the token 'the'.

C.1. Example Datapoints

We include below three data points, corresponding to three separate tuples (s, r, a), along with the sets S and R.
Subject: Sydney Opera House
Relation: IN COUNTRY ("is in the country of")
Attribute: Australia
Prompt: Fact: Sydney Opera House is in the country of
Counterfactual subject attributes: ['1973', 'Sydney', 'modern architecture', 'iconic', 'Jørn Utzon', 'Bennelong Point', 'performing arts', 'shell roofs', 'UNESCO World Heritage site', 'Sydney Harbour', 'Danish architect', 'multi-venue', 'ceramic tiles', 'expressionist design']
Counterfactual relation attributes: ['China', 'France', 'Germany', 'Italy', 'Austria', 'USA', 'Canada', 'Finland', 'Hungary', 'Afghanistan', 'Albania', 'Algeria', 'Greece', 'Argentina', 'Bangladesh', 'Belgium', 'Brazil', 'Cambodia', 'Bulgaria', 'Chile', 'Colombia', 'Croatia', 'Cuba', 'Denmark', 'England', 'Egypt', 'Estonia', 'Ethiopia', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Kuwait', 'Lebanon', 'Malaysia', 'Mexico', 'Mongolia', 'Morocco', 'Nepal', 'New Zealand', 'Nigeria', 'Norway', 'Pakistan', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Qatar', 'Romania']

Subject: Cristiano Ronaldo
Relation: FROM COUNTRY ("is from the country of")
Attribute: Portugal
Prompt: Fact: Cristiano Ronaldo is from the country of
Counterfactual subject attributes: ['football', 'Real Madrid', 'Manchester United', 'Juventus', 'World Player', 'Euro', 'Nike', 'endorsements', "Ballon d'Or", 'Champions League', 'forward', 'La Liga', 'Serie A', 'free-kicks', 'hat-tricks', 'CR7 brand', 'foundation', 'Museu CR7', 'scoring records']
Counterfactual relation attributes: ['USA', 'China', 'France', 'Germany', 'England', 'Italy', 'Afghanistan', 'Albania', 'Algeria', 'Argentina', 'Australia', 'Austria', 'Bangladesh', 'Belgium', 'Brazil', 'Bulgaria', 'Cambodia', 'Canada', 'Chile', 'Colombia', 'Croatia', 'Cuba', 'Denmark', 'Egypt', 'Estonia', 'Ethiopia', 'Finland', 'Greece', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Kuwait', 'Lebanon', 'Malaysia', 'Mexico', 'Mongolia', 'Morocco', 'Nepal', 'New Zealand', 'Nigeria', 'Norway', 'Pakistan', 'Peru', 'Philippines', 'Poland', 'Qatar', 'Romania']

Table 3. Some full example data points from our dataset, (s, r, a, S, R).

D. Further Methods

Here, we provide details regarding how to calculate the logit lens, DLA, and DLA by source token. We borrow the notation presented in McGrath et al. (2023). The function $f_\theta$ that a standard transformer with $L$ layers and parameters $\theta$ implements can be expressed as $f_\theta(x_{\le t}) = \mathrm{softmax}(\pi_t(x_{\le t}))$, where $\pi_t$ is a vector of logits given by

$$\pi_t = \mathrm{LayerNorm}(z^L_t) W_U, \qquad z^l_t = z^{l-1}_t + a^l_t + m^l_t, \qquad a^l_t = \mathrm{Attn}(z^{l-1}_t), \qquad m^l_t = \mathrm{MLP}(z^{l-1}_t),$$

where $\mathrm{LayerNorm}(\cdot)$ is a LayerNorm normalisation layer, $W_U$ an unembedding matrix, $\mathrm{Attn}(\cdot)$ a multi-head attention layer, and $\mathrm{MLP}(\cdot)$ a two-layer perceptron. The dependence on the model parameters $\theta$ is left implicit. In common with much of the literature on mechanistic interpretability (Elhage et al., 2021), we refer to the series of residual activations $z^l_t$ as the residual stream.

Logit Lens. The logit lens (nostalgebraist, 2020) is an interpretability technique for interpreting intermediate activations of language models, through the insights that the residual stream is a linear sum of contributions from each layer (Elhage et al., 2021) and that the map to logits is approximately linear. It pauses model computation early, converting hidden residual stream activations to probability distributions over the vocabulary at each layer:

$$\tilde{\pi}^{l}_t = \mathrm{LayerNorm}(z^l_t) W_U, \quad l \le L.$$
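To make this concrete, the following is a minimal sketch of the logit lens computation. It assumes the open-source TransformerLens library and its conventions (the `pythia-2.8b` alias, the `resid_post` hook name); the paper does not specify its tooling, so treat this as illustrative rather than the authors' implementation.

```python
# Minimal logit-lens sketch (assumes TransformerLens; not the authors' code).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-2.8b")
prompt = "Fact: The Colosseum is in the country of"
answer_tok = model.to_single_token(" Italy")

logits, cache = model.run_with_cache(prompt)

# Logit lens: unembed the residual stream after every layer (the pi-tilde^l_t above).
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]           # residual stream at the END position
    layer_logits = model.ln_final(resid) @ model.W_U    # apply final LayerNorm, then unembed
    val = layer_logits[answer_tok].item()
    rank = int((layer_logits > layer_logits[answer_tok]).sum())
    print(f"layer {layer:2d}: answer logit {val:.2f}, rank {rank}")
```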
Direct Logit Attribution (DLA) is an extension of the logit lens technique. It zooms in to individual model components, through the insight that the residual stream of a transformer can be viewed as an accumulated sum of outputs from all model components (Elhage et al., 2021). DLA therefore gives a measure of the direct effect of individual model components on model outputs. Mathematically, we may write

$$a^l_t = \mathrm{Attn}(z^{l-1}_t) = \sum_{h=1}^{H} a^h(z^{l-1}_t), \qquad m^l_t = \mathrm{MLP}(z^{l-1}_t) = \sum_{n=1}^{N} m^n(z^{l-1}_t),$$

where we decompose the attention layer into individual attention heads, and the MLP layer into individual neurons (Elhage et al., 2021). DLA corresponds to the sets of logits

$$\tilde{\pi}^{l,h}_t = \mathrm{LayerNorm}(a^h(z^{l-1}_t)) W_U, \qquad \tilde{\pi}^{l,n}_t = \mathrm{LayerNorm}(m^n(z^{l-1}_t)) W_U.$$

DLA by source token. We extend this technique for attention heads through the further insight that attention head outputs are a weighted sum of outputs corresponding to distinct attention source positions (Elhage et al., 2021), allowing us to quantify how each group of source tokens in turn contributes directly to the logits. To do so, note that the attention head contribution at query position $t$ is a sum over key (source) positions:

$$a^h(z^{l-1}_t) = \sum_{k=1}^{t} \mathrm{attn\_prob}_{t,k}\, \mathrm{LayerNorm}(z^{l-1}_k) W_V W_O.$$

Unravelling this sum, just as above, gives a separation of attention head DLA contributions by source token:

$$\tilde{\pi}^{l,h,k}_t = \mathrm{LayerNorm}\big(\mathrm{attn\_prob}_{t,k}\, \mathrm{LayerNorm}(z^{l-1}_k) W_V W_O\big) W_U.$$
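The sketch below mirrors these formulas, reusing the `model`, `cache` and `prompt` objects from the logit-lens sketch above. The hook names (`z`, `pattern`, `v`) follow TransformerLens conventions and are assumptions; note also that applying the final LayerNorm to a single component in isolation, as in the equations above, is itself an approximation, since in the full model the LayerNorm scale depends on the whole residual stream.

```python
# DLA for one attention head, and DLA split by source token (a sketch, not the authors' code).
# Reuses `model`, `cache`, `prompt`, `answer_tok` from the logit-lens sketch.
layer, head = 23, 22                                   # e.g. L23H22, a country-extracting head

# --- DLA of the whole head at the END position ---
z = cache["z", layer][0, -1, head]                     # [d_head], head output before W_O
head_out = z @ model.W_O[layer, head]                  # [d_model], the head's write into the residual stream
head_dla = (model.ln_final(head_out) @ model.W_U)[answer_tok].item()
print(f"L{layer}H{head} total DLA on answer: {head_dla:.3f}")

# --- DLA split by source (key) position ---
pattern = cache["pattern", layer][0, head, -1, :]      # [seq], attention from END to each source token
v = cache["v", layer][0, :, head, :]                   # [seq, d_head], value vectors per source position
per_source = (pattern[:, None] * v) @ model.W_O[layer, head]   # [seq, d_model]
per_source_dla = model.ln_final(per_source) @ model.W_U        # [seq, d_vocab]

for pos, tok in enumerate(model.to_str_tokens(prompt)):
    print(f"{tok!r:>15}: DLA on answer = {per_source_dla[pos, answer_tok].item():.3f}")
# Grouping positions into SUBJECT and RELATION spans and summing recovers the
# per-token-group attributions used throughout the paper.
```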
E. Further Results

E.1. Many Attributes are Extracted

Prompt | Attribute | Counterfactual relation attributes | Counterfactual subject attributes
Fact: Tom Brady plays the sport of | football (0) | golf (2), baseball (3), hockey (5) | quarterback (4), NFL (23), Gisele Bündchen (34)
Fact: The Eiffel Tower is in the country of | France (0) | Belgium (1), China (13), Germany (14) | Paris (2), Europe (12), Seine River (347)
Fact: The Colosseum is in the country of | Italy (0) | Albania (1), Egypt (13), Greece (15) | Rome (2), ancient (33), ruins (97)
Fact: England has the capital city of | London (0) | Kuala Lumpur (35), Beijing (40), Dublin (43) | Queen Elizabeth (219), English (236), football (337)
Fact: Michael Jordan plays in the league called the | NBA (0) | NFL (9), PGA (13), NHL (34) | United States (6), USA (23), Chicago Bulls (39)
Fact: Pasta has the primary macronutrient of | carbohydrates (0) | protein (3), fiber (4), fat (12) | macaroni (49), fettuccine (54), spaghetti (217)
Fact: Stephen Hawking is a professor at the university of | Cambridge (0) | Edinburgh (1), Manchester (2), Oxford (3) | CBE (30), England (31), cosmology (46)
Fact: Alan Turing is a professor at the university of | Manchester (0) | Cambridge (1), Edinburgh (2), California Institute of Technology (6) | computer science (13), Bletchley Park (29), England (38)

Table 4. Many attributes are extracted from the sets S and R. Rank is displayed in brackets. We analyze the rank of many attribute logits, and display the top 3 from each category among those in our dataset. Generally the highest attributes in R are ranked higher (closer to the top) than the highest in S. Sometimes, the highest-ranked attributes in S are very correlated with a and therefore r, e.g. France with Paris. Often the counterfactual attributes are decorrelated with r – for instance 'is a professor at the university of' and 'CBE' or 'England'. This suggests subject heads 'misfire' and extract these attributes even in contexts that do not necessitate it.

E.2. Other Models

In this section, we provide some analogous summary plots to Figures 2 and 3 for the relations plays the sport of and is in the country of for several other models.

GPT2-XL (1.5B)

[Figure 8: scatter plots of DLA on subject vs. DLA on relation (left) and attention to subject vs. attention to relation (right) for the top GPT2-XL heads, colored by head type (Subject, Relation, Mixed).]

Figure 8. GPT2-XL. Three different types of attention head for factual extraction for prompts of form "s plays the sport of": Subject heads, Relation heads and Mixed heads. (Left) DLA on the correct sport, split by attention head source token, for the top 10 heads by total DLA. Each data point is one prompt for one factual tuple. The gray lines have gradients 1/10 and 10 and denote the boundary we use to define heads, post averaging, which is somewhat arbitrary. (Right) Attention patterns of the top four heads of each kind on each prompt. Subject and Relation heads attend mostly to subjects and relations respectively. Mixed heads attend to both. Attention is not used to define head type.

[Figure 9: bar charts of head DLA on the attribute a and on the top-5-mean of R−a, split by source token group (SUBJECT vs RELATION). Left ("plays the sport of"): L38H22 (M), L39H2 (S), L45H8 (M), L32H12 (M), L27H13 (M), L35H19 (M), L26H8 (M), L42H0 (M), L21H19 (R), L32H15 (M). Right ("is in the country of"): L35H19 (M), L30H8 (S), L31H0 (S), L42H24 (M), L39H0 (S), L24H4 (S), L34H20 (S), L43H15 (M), L32H3 (R), L36H3 (M).]

Figure 9. GPT2-XL. Top heads by absolute DLA for the relationships plays the sport of (left) and is in the country of (right), labeled as Subject (S), Relation (R) or Mixed (M). Studying a large set of counterfactual attributes, and splitting by attention source token, lets us disentangle these head types. We plot DLA on the attribute a, and for the mean of the top 5 attributes in the set R excluding a, both split by attention source token (SUBJECT vs RELATION). All three head types emerge.

GPT-J (5.6B)

[Figure 10: scatter plots of DLA on subject vs. DLA on relation (left) and attention to subject vs. attention to relation (right) for the top GPT-J heads, colored by head type.]

Figure 10. GPT-J. Three different types of attention head for factual extraction for prompts of form "s plays the sport of": Subject heads, Relation heads and Mixed heads. (Left) DLA on the correct sport, split by attention head source token, for the top 10 heads by total DLA. Each data point is one prompt for one factual tuple. The gray lines have gradients 1/10 and 10 and denote the boundary we use to define heads, post averaging, which is somewhat arbitrary. (Right) Attention patterns of the top four heads of each kind on each prompt. Subject and Relation heads attend mostly to subjects and relations respectively. Mixed heads attend to both. Attention is not used to define head type.

[Figure 11: bar charts of head DLA on the attribute a and on the top-5-mean of R−a, split by source token group. Left ("plays the sport of"): L19H4 (M), L21H5 (M), L21H14 (M), L16H15 (S), L15H12 (M), L18H9 (M), L17H8 (M), L21H15 (M), L11H8 (M), L19H14 (S). Right ("is in the country of"): L19H12 (M), L16H10 (M), L16H0 (S), L12H7 (S), L18H3 (R), L19H14 (M), L20H6 (S), L12H6 (R), L19H15 (S), L23H0 (M).]

Figure 11. GPT-J. Top heads by absolute DLA for the relationships plays the sport of (left) and is in the country of (right), labeled as Subject (S), Relation (R) or Mixed (M). Studying a large set of counterfactual attributes, and splitting by attention source token, lets us disentangle these head types.
We plot DLA on the attribute a, and for the mean of the top 5 attributes in the set R excluding a, both split by attention source token (SUBJECT vs RELATION). All three head types emerge.

Pythia-6.9b

[Figure 12: scatter plots of DLA attributed to SUBJECT vs. DLA attributed to RELATION (left) and attention from END to SUBJECT vs. attention from END to RELATION (right) for the top Pythia-6.9b heads, colored by head type.]

Figure 12. Pythia-6.9b. Three different types of attention head for factual extraction for prompts of form "s plays the sport of": Subject heads, Relation heads and Mixed heads. (Left) DLA on the correct sport, split by attention head source token, for the top 10 heads by total DLA. Each data point is one prompt for one factual tuple. The gray lines have gradients 1/10 and 10 and denote the boundary we use to define heads, post averaging, which is somewhat arbitrary. (Right) Attention patterns of the top four heads of each kind on each prompt. Subject and Relation heads attend mostly to subjects and relations respectively. Mixed heads attend to both. Attention is not used to define head type.

[Figure 13: bar charts of head DLA on the attribute a and on the top-5-mean of R−a, split by source token group. Left ("plays the sport of"): L18H24 (M), L26H6 (M), L22H12 (M), L21H7 (M), L20H0 (M), L12H22 (R), L18H22 (M), L19H11 (M), L26H11 (M), L16H5 (M). Right ("is in the country of"): L24H5 (M), L20H18 (S), L19H28 (S), L14H6 (S), L16H16 (R), L20H8 (S), L22H12 (M), L13H28 (R), L20H0 (R), L17H29 (S).]

Figure 13. Pythia-6.9b. Top heads by absolute DLA for the relationships plays the sport of (left) and is in the country of (right), labeled as Subject (S), Relation (R) or Mixed (M). Studying a large set of counterfactual attributes, and splitting by attention source token, lets us disentangle these head types. We plot DLA on the attribute a, and for the mean of the top 5 attributes in the set R excluding a, both split by attention source token (SUBJECT vs RELATION). All three head types emerge.

E.3. Relative Mechanism Importance

We consider here several measures of importance among the various mechanisms.

Fraction of heads in each class. The fraction of heads in each of (subject/relation/mixed) is one possible measure of importance. This varies depending on the choice of subject and relation. See Figure 3 for two examples. There, we see the split of (subject, relation, mixed) among the top 10 heads for the relation "plays the sport of" is (2, 1, 7), but for "is in the country of" it is (4, 2, 4). Inspecting the top 10 heads by DLA for each example in the entire dataset, we find 37% of heads get categorised as subject heads and 33% as relation heads, with the remaining 30% as mixed heads. This indicates all three head types are important for the task.

Relation | Baseline loss | Subject % change | Relation % change | Mixed % change | MLP % change
PLAYS SPORT | 0.68 | 16.71 | 254.31 | 345.56 | 483.11
IN COUNTRY | 0.51 | 250.29 | 710.23 | 375.23 | 207.43
CAPITAL CITY | 1.41 | 67.38 | 206.48 | 90.53 | 222.64
LEAGUE CALLED | 1.58 | 170.37 | 97.31 | 185.97 | 104.80
PROFESSOR AT | 0.62 | 127.90 | 503.27 | 78.52 | 714.40
PRIMARY MACRO | 1.60 | 190.55 | 10.92 | 69.08 | 189.48
PRODUCT BY | 0.75 | 112.21 | 578.38 | 145.64 | 249.04
FROM COUNTRY | 1.16 | 196.88 | 234.61 | 101.40 | 255.63

Table 5. Percent change in loss when ablating the direct path to logits of each component.

Logit contribution. The contribution to logits is another possible choice of metric.
Figure 2 visualises this – we can qualitatively see that all three head types are important. We may analyse the percentage of the final (mean-centered) logit contributed by each component type, across the entire dataset. We omitted several components with negative (suppressive) contributions for the purpose of this analysis. Again, we see that the contributions from each type of mechanism are important. Subject heads contribute 18%, relation heads 24%, mixed heads 27%, and the MLP layers 30%, across the entire dataset.

Ablations. Naive ablations have been noted in prior work to be counteracted by self-repair in the factual recall setup, a phenomenon known as the hydra effect (McGrath et al., 2023). We therefore followed the approach of Wang et al. (2022) and performed edge patching – ablating the direct path term between model components and logits. We present in Table 5 the baseline loss, together with the loss after knocking out one of the four model mechanisms. Each loss reported is aggregated over the relation dataset. We see knocking out any individual component significantly harms loss in each case.
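One simple way to approximate this kind of direct-path ablation for a single head is sketched below; it reuses the TransformerLens objects from the earlier sketches, and the choice of zero-ablation (rather than mean-ablation, or patching through a full computational graph) is an illustrative assumption, not the paper's exact procedure.

```python
# Knock out only the *direct* path from one head to the logits (a sketch).
# Reuses `model`, `cache`, `answer_tok` from the earlier sketches.
import torch

layer, head = 23, 22
final_layer = model.cfg.n_layers - 1
resid_final = cache["resid_post", final_layer][0, -1]       # full residual stream at END

z = cache["z", layer][0, -1, head]
head_out = z @ model.W_O[layer, head]                        # the head's write at END

# Subtract the head's write from the final residual only. Because we do not re-run the
# model, every downstream component still saw the original activations, so all indirect
# paths are left intact -- only the head -> logits edge is removed.
logits_clean = model.ln_final(resid_final) @ model.W_U
logits_ablated = model.ln_final(resid_final - head_out) @ model.W_U

loss_clean = -torch.log_softmax(logits_clean, dim=-1)[answer_tok].item()
loss_ablated = -torch.log_softmax(logits_ablated, dim=-1)[answer_tok].item()
print(f"loss {loss_clean:.3f} -> {loss_ablated:.3f} after removing the direct path")
```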
E.4. Subject Heads

Subject | L21H9 (PLAYS SPORT) | L16H20 (PLAYS SPORT) | L22H17 (PLAYS SPORT) | L17H2 (IN COUNTRY) | L16H12 (IN COUNTRY) | L17H17 (IN COUNTRY) | L18H14 (PROFESSOR AT)
Michael Jordan | basketball, shooting, shoot, Basketball, shot, Shot, shoots | Basketball, basketball, NBA | basketball, Basketball, NBA, basket, ho, asketball | USA, US, America, American, USA, Chicago, Americans | Jordan, Jordan, ordan, Nile | Chicago, Chicago, Illinois
David Beckham | Soccer, soccer, football, Football, Football, footballers, Soc | Soccer, soccer, FIFA, MLS | London, UK, England, British, London, Britain, English
Roger Federer | tennis | singles, ATP, tournament, tournaments, tennis court, final, final, court, courts, Rac, serve | Switzerland, Swiss, global, global, Swiss
Stephen Hawking | England, Britain, London, British, Brit, England, UK | Cambridge, Cambridge, calculation, mathematic
Niels Bohr | energies, ATP, energy | Swedish, Sweden, Swed, Å, Danish | Philosophy
The Colosseum | fight | Italy, Rome, Roman, Romans, Italian, Ital, Italian | Italy, Italian, Italian, Rome, Ital, Milan | Italy, Italian, Italian, Rome, Ital, Roma
The Taj Mahal | India, Indian | Indian, Indian, Indians, Pakistani, India, India, Shah
The Eiffel Tower | Paris, Paris, France, France, London, London | France, Paris

Table 6. Using head OV circuits as probes acting on the enriched subject representation's final-token residual stream elicits interpretable attributes in the head's category C as the top few DLA tokens. We include the relation for which the head is a subject head in the column titles. We only include attributes the head is sufficiently confident about (>1%). For instance, applying the head L21H9 to sports players usually elicits their sport. Applying it to The Colosseum elicits fight, and with lower confidence boxing, which falls within the same category.

[Figure 14: scatter plot of attention to subject vs. attention to relation for the subject heads L21H9, L22H17, L17H2 and L16H12 across prompts with subject Michael Jordan.]

Figure 14. Attention scores of several subject heads on prompts with subject Michael Jordan with a range of different relations, pertaining to sport, country, language, etc. We see several interesting attention patterns. L22H17 attends quite uniformly to the subject here, while other subject heads have more variable attention patterns.

In Figure 14 we analyze the attention patterns of several subject heads, across a range of prompts with a single fixed subject s, but different relationships r. We see significant attention to SUBJECT no matter what prompt is given, i.e. these heads often extract attributes irrelevant to the relationship. We find several kinds of interesting attention pattern. (1) Heads that always attend to the subject with very high probability, independent of the relationship given in the prompt (e.g. Layer 22 Head 17 (L22H17) for basketball players). This correlates with the attributes this head extracts: only the sport of basketball, no other sports. Notably, this head does not have as high attention on non-basketball sports players. (2a) Heads that pay variable attention to the subject, in a mostly uninterpretable way. (2b) Heads paying variable attention to the subject in an interpretable fashion, dependent on the prompt (e.g. L17H2, which attends more if the prompt requests a country or city). These latter heads, by virtue of attending from END to SUBJECT, can only be influenced by the relation on the query side. This is an instance of query composition (Elhage et al., 2021), as suggested by Geva et al. (2023a). We however note this mechanism is relatively unimportant among the studied examples – we only found a handful of instances of it, all of which related to country attributes.
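The OV-circuit probe used to build Table 6 could be sketched as follows; it again assumes TransformerLens-style access, and the particular head, prompt, and subject-token position are illustrative placeholders.

```python
# Probe the enriched subject representation with one head's OV circuit (a sketch).
# Reuses `model` from the earlier sketches.
layer, head = 21, 9                 # L21H9, a "plays the sport of" subject head
prompt = "Fact: Michael Jordan plays the sport of"
subj_pos = 4                        # assumed position of " Jordan"; check with model.to_str_tokens(prompt)

_, cache = model.run_with_cache(prompt)

subj_resid = cache["resid_pre", layer][0, subj_pos]          # enriched subject representation entering this layer
ov_out = model.blocks[layer].ln1(subj_resid) @ model.W_V[layer, head] @ model.W_O[layer, head]
probe_logits = model.ln_final(ov_out) @ model.W_U

top_vals, top_idx = probe_logits.topk(10)
print(model.to_str_tokens(top_idx))  # top tokens the head would write for this subject, e.g. sport-related
```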
E.5. Relation Heads

In this section, we provide further results on relation heads.

E.5.1. RELATION HEADS PULL OUT MANY ATTRIBUTES IN THE SET R

[Figure 15: per-attribute DLA of the top relation heads, averaged over subjects, for sports (left: tennis, soccer, cricket, basketball, golf, volleyball, baseball, hockey, gymnastics, rugby; heads L13H31 AR=138.62, L14H24 AR=13.73, L14H5 AR=15.86, L18H9 AR=10.21, L26H17 AR=7.77) and countries (right: England, France, Italy, Finland, Germany, Norway, USA, Ireland, Australia, Canada; heads L13H31 AR=650.78, L18H20 AR=12.09, L14H24 AR=35.86, L23H12 AR=3.62, L14H29 AR=7.25).]

Figure 15. Relation heads pull out many attributes consistently across a range of different subjects for prompts with relationship plays the sport of and is in the country of. The error bars are standard deviations over subjects, and their being small suggests that these heads do not meaningfully depend on the subject. We also include each head's mean ratio of attention to RELATION over attention to SUBJECT (AR).

E.5.2. RELATION HEADS' PRIMARY FUNCTION IS OFTEN TO BOOST ATTRIBUTES IN THE SET R

Prompt: Fact: Michael Jordan plays the sport of
Most important relation head: games (0), roles (1), role (2), genres (3), soccer (5), tennis (8), sport (9), Role (10), lite (12), football (13), genre (14), ballet (16), cricket (18), disciplines (21), athlet (23), games (26), violin (27), basketball (30), bass (31), biology (33), sports (34), afers (35), music (37), slots (40), slot (41), battles (42), golf (43), Wrestling (46), volley (49)
Second most important relation head: football (2), Football (5), chess (8), Wrestling (9), baseball (14), opera (16), football (17), switch (18), JavaScript (19), tennis (22), Football (26), JavaScript (31), guitar (34)

Prompt: Fact: The Eiffel Tower is in the country of
Most important relation head: territory (0), territories (1), countries (2), country (3), England (4), Kingdom (5), States (6), Region (7), Netherlands (8), province (9), France (10), region (11), Province (12), regions (13), place (14), Germany (15), provinces (16), Finland (17), Italy (18), Region (19), Norway (20), states (21), America (22), area (23), Britain (24), Spain (25), States (26), Territory (27), USA (28), regions (29), Sweden (30), region (31), Ireland (32), northern (33), country (34), Australia (35), Denmark (36), Canada (37), land (38), Arabia (39), place (41), France (42), England (43), kingdom (44), areas (45), area (46), realm (47), Switzerland (48), homeland (49)
Second most important relation head: abroad (0), country (1), countries (2), overseas (3), country (4), international (5), internationally (6), international (7), Country (8), expatri (9), foreigners (10), expatriate (11), foreign (12), France (13), national (15), national (16), France (17), país (18), nationality (19), Country (20), nation (21), nations (22), export (23), Germany (24), foreign (25), nationals (26), Germany (27), países (28), passport (29), exported (30), International (31), visa (32), England (33), International (34), USA (35), embassy (36), Foreign (37), visas (38), USA (39), travel (40), Foreign (41), Switzerland (42), England (43), extrad (44), UK (45), Europe (47), travel (48), Belgium (49)

Prompt: Fact: England has the capital city of
Most important relation head: city (0), cities (1), city (2), City (3), City (4), metropolitan (5), urban (6), London (7), Cities (8), street (9), CITY (10), London (11), streets (12), Mayor (13), municipal (14), NYC (15), street (16), downtown (17), urban (18), Metropolitan (19), metro (20), mayor (21), Municipal (22), Street (23), Paris (24), nationwide (25), Urban (26), borough (27), ciudad (28), Paris (29), Delhi (30), town (31), Metro (32), hometown (33), Dublin (34), suburbs (35), overseas (36), regional (37), Mumbai (38), Street (39), Downtown (40), residents (41), Amsterdam (42), Philadelphia (43), capital (44), Chicago (45), Edinburgh (46), abroad (47), national (48), Madrid (49)
Second most important relation head: cities (0), towns (1), Cities (2), city (3), town (4), hometown (5), city (6), municipalities (7), town (8), City (9), CITY (10), City (11), locations (12), metropolitan (13), Town (14), villages (15), Town (16), locations (17), ville (18), location (19), location (20), stown (21), centres (22), places (23), Sites (24), London (25), destinations (27), headquarters (28), neighborhoods (29), capital (30), localities (31), metro (32), London (33), place (34), centers (35), sites (36), downtown (37), sites (38), Location (39), located (40), Place (41), regions (42), counties (43), venues (44), ports (45), develop (46), ciudad (47), ville (48), apolis (49)

Prompt: Fact: Michael Jordan plays in the league called the
Most important relation head: bas (0), Draft (1), draft (4), NBA (6), fil (8), drafting (9), bas (10), draft (11), drafted (12), offseason (15), fil (16), MLB (29), NHL (32), preseason (36), Steelers (41), cent (42), (49)
Second most important relation head: player (0), players (1), league (2), championship (3), stadium (4), club (5), NFL (6), franchise (7), team (8), player (9), teams (10), clubs (11), NBA (12), fans (13), coaches (14), football (15), game (16), coach (17), soccer (18), coaching (19), Players (20), teammates (21), franch (22), leagues (23), squad (24), athletes (25), referee (26), training (27), athlete (28), games (29), hockey (30), tournament (31), basketball (32), team (33), Stadium (34), championships (35), rookie (36), Championship (37), baseball (38), ESPN (39), injury (40), preseason (41), club (42), competitive (43), roster (44), season (45), NCAA (46), teammate (47), Player (48)

Prompt: Fact: Stephen Hawking is a professor at the university of
Most important relation head: University (0), university (1), universities (2), University (3), College (4), UK (5), Academic (6), UK (8), college (9), colleges (10), College (11), England (12), institute (13), 's (14), Univers (15), ' (16), Institute (17), itself (18), British (19), Univers (20), UCLA (21), Cambridge (22), Faculty (23), institution (25), Zealand (26), undergraduate (27), Britain (28), Academy (29), (30), academy (31), Yale (32), academic (34), England (35), Cambridge (36), overrightarrow (39), Harvard (40), Enum (41), Oxford (42), achus (43), professors (44), Ireland (45), School (47), Scotland (48), $' (49)

Prompt: Fact: Chicken has the primary macronutrient of
Most important relation head: nutrients (0), nutrient (1), nutrition (2), dietary (3), vitamins (4), nutritional (5), sugars (6), carbohydrates (7), glucose (8), protein (9), carbohydrate (10), energy (11), proteins (12), calories (13), iron (14), minerals (15), amino (16), vitamin (17), metabolic (18), metabolism (19), Nutrition (20), Diet (21), Protein (22), diet (23), diets (24), nutrients (25), lipids (26), Vitamin (27), Energy (28), fatty (29), sugar (30), insulin (31), calcium (32), energy (33), lipid (34), Energy (35), protein (36), fats (37), nutrition (38), metabol (39), phosphorus (40), Proteins (41), Iron (42), carot (43), Protein (44), glucose (45), iron (46), selenium (47), collagen (48), fat (49)
Second most important relation head: Quantity (0), iv (1), olean (2), carbon (3), leen (6), Judaism (7), rice (9), organic (15), beef (25), electrons (42)

Table 7. Some of the top few DLA tokens for the top two relation heads corresponding to a range of relations. We manually sampled relevant words from the top 50 output tokens; ranks are shown in brackets. There are many interesting things to note. For example, the top relation head for plays the sport of extracts both sports and other things one can play – the category C of this head is wider than just sports.

E.5.3. RELATION HEADS (MOSTLY) DO NOT HAVE SIGNIFICANT INDIRECT EFFECT DEPENDENT ON THE SUBJECT

We perform patching experiments in which we patch the top 5 relation heads' outputs between two prompts with the same relationship but different subjects, and measure the difference in performance. We find that for some relationships, performance is invariant. If the relation heads causally depended on specific features of the subject, we would expect to see a large decrease in performance.
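The patching experiment just described could be implemented roughly as below. This is a sketch under the same TransformerLens assumptions as the earlier examples; the particular relation heads and prompts are chosen for illustration.

```python
# Patch relation-head outputs from a source prompt (subject A) into a destination prompt
# (subject B) to probe their subject-dependent indirect effect (a sketch).
from functools import partial
from transformer_lens import utils

relation_heads = [(13, 31), (14, 24), (14, 5), (18, 9), (26, 17)]   # illustrative top-5 heads

src_prompt = "Fact: Michael Jordan plays the sport of"   # subject A
dst_prompt = "Fact: Tom Brady plays the sport of"        # subject B
answer_tok = model.to_single_token(" football")          # correct attribute for the destination prompt

_, src_cache = model.run_with_cache(src_prompt)

def patch_z(z, hook, head, src):
    # z: [batch, pos, head, d_head]; overwrite this head's output at the END position.
    z[:, -1, head, :] = src[:, -1, head, :]
    return z

hooks = [
    (utils.get_act_name("z", layer), partial(patch_z, head=head, src=src_cache["z", layer]))
    for layer, head in relation_heads
]
patched_logits = model.run_with_hooks(model.to_tokens(dst_prompt), fwd_hooks=hooks)
print(patched_logits[0, -1, answer_tok].item())   # compare with the unpatched answer logit / log-prob
```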
[Figure 16: scatter plots of pre-patched vs. post-patched logit difference (to-logit minus from-logit, left) and pre- vs. post-patched log-prob (right).]

Figure 16. We patch the top 5 relation heads' outputs for prompts with relation plays the sport of between different subjects to study the indirect effect of relation heads. For this relationship, we see that on average, performance does not increase under both a logit-diff metric between the to-logit and from-logit (left) and a logprob-based metric (right). The gray line indicates no change.

[Figure 17: the corresponding pre- vs. post-patched logit-difference (left) and log-prob (right) scatter plots for is in the country of.]

Figure 17. We patch the top 5 relation heads' outputs for prompts of form is in the country of between different subjects to study the indirect effect of relation heads. We see that on average, logprob does not decrease, but the logit difference between the to-logit and from-logit decreases slightly. We generally see that patching improves performance on low-probability outputs. This suggests the model has some 'confidence' feature that gets modified through patching. The gray line indicates no change.

E.5.4. SUBJECT-RELATION PROPAGATION

In Pythia-2.8b, we found that relation heads generally did not privilege the correct attribute a among the set R. When investigating the larger Pythia-6.9b model, we observed relation heads frequently extract the correct attribute whilst attending only to RELATION, for a variety of different subjects and attributes. For example, with 's plays the sport of' prompts, we found attention head L26H6 can extract basketball when given Michael Jordan as the subject, and soccer when given David Beckham as the subject.

We hypothesize that there are two mechanisms here. Firstly, some subject head attends from 'sport' to the subject, and propagates facts (including the sport and other facts about the subject). Then the relation head L26H6 receives both a large number of sports from the usual mechanism, and also a boosted correct sport that was already moved into the residual stream at the 'sport' position.

We verified this hypothesis through 'attention knockout', zero-ablating all attention from RELATION to SUBJECT. This resulted in head L26H6 instead extracting a consistent set of sports regardless of the player, and no longer privileging the correct attribute a. This head remains the most important relation head by DLA for a variety of plays the sport of prompts.

The general takeaway from this finding is that our set of mechanisms may not be completely universal (Olah et al., 2020) across model scale, and we should expect larger models to implement more sophisticated circuits.
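A sketch of this attention knockout, under the same TransformerLens assumptions as the earlier sketches. The SUBJECT and RELATION position indices are placeholders that depend on the tokenization of each prompt.

```python
# Zero-ablate attention from RELATION positions to SUBJECT positions in every layer (a sketch).
from transformer_lens import utils

prompt = "Fact: Michael Jordan plays the sport of"
tokens = model.to_tokens(prompt)
subject_pos = [3, 4]            # placeholder: positions of " Michael", " Jordan"
relation_pos = [5, 6, 7, 8]     # placeholder: positions of " plays", " the", " sport", " of"

def knockout(pattern, hook):
    # pattern: [batch, head, query, key], post-softmax attention probabilities.
    # Zeroing (rather than renormalising) matches "zero-ablating all attention".
    for q in relation_pos:
        for k in subject_pos:
            pattern[:, :, q, k] = 0.0
    return pattern

hooks = [(utils.get_act_name("pattern", layer), knockout) for layer in range(model.cfg.n_layers)]
knocked_logits = model.run_with_hooks(tokens, fwd_hooks=hooks)
```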
E.6. Mixed Heads

E.6.1. INSPECTING SORTED DLA

To illustrate the facts extracted by a selection of mixed heads and prompts, we investigate DLA by source token, over all vocab tokens. We note that for some heads C ≈ R, i.e. the head's category of specialization is similar to the relationship r being investigated. For example, we show in Table 11 that L22H15 is specialized to the categories of sport and communication, which overlaps significantly with the plays the sport of prompts. Similarly, L23H22 was found to be a countries and languages extractor, overlapping significantly with the is in the country of prompts. With these heads, the correct attribute is consistently one of the top tokens from SUBJECT, and high but not top from RELATION.

In many cases there is less overlap between C and R. For example, L17H30 appears to specialize in "players/things that can be played". This head does have the correct attribute for plays the sport of prompts in the top tokens for the subject, but it gives a higher DLA for more generic terms that also could reasonably fit within C∩S and C∩R (e.g. player and team). Understanding the category of head specialization is therefore useful in interpreting this type of mixed head.

Head | Prompt | Subject | Relation | Total
L22H15 | Fact: Michael Jordan plays the sport of | 0. basketball (0.397) 1. football (0.354) 2. sports (0.344) 3. soccer (0.336) 4. footballers (0.327) | 0. games (1.082) 1. basketball (1.026) 2. sports (0.990) 3. game (0.987) 4. sport (0.948) | 0. basketball (1.427) 1. games (1.354) 2. sports (1.340) 3. game (1.267) 4. sport (1.264)
L22H15 | Fact: Mike Trout plays the sport of | 0. baseball (0.522) 1. Baseball (0.447) 2. MLB (0.434) 3. teammates (0.401) 4. sports (0.388) | 0. games (0.610) 1. baseball (0.603) 2. players (0.602) 3. Players (0.597) 4. player (0.593) | 0. baseball (1.127) 1. players (0.992) 2. sports (0.988) 3. Players (0.971) 4. player (0.957)
L22H15 | Fact: Tom Brady plays the sport of | 0. football (0.893) 1. Football (0.810) 2. NFL (0.806) 3. Football (0.792) 4. football (0.792) | 0. football (0.561) 1. games (0.546) 2. players (0.522) 3. player (0.518) 4. Football (0.514) | 0. football (1.456) 1. Football (1.313) 2. Football (1.305) 3. football (1.292) 4. NFL (1.198)
L17H30 | Fact: Michael Jordan plays the sport of | 0. games (0.355) 1. players (0.352) 2. player (0.338) ... 23. basketball (0.233) | 0. players (1.371) 1. player (1.330) 2. play (1.315) ... 43. basketball (0.550) | 0. players (1.716) 1. player (1.663) 2. play (1.596) ... 34. basketball (0.782)
L17H30 | Fact: Mike Trout plays the sport of | 0. players (0.229) 1. player (0.216) 2. teams (0.187) 3. games (0.185) 4. baseball (0.179) | 0. players (1.300) 1. player (1.260) 2. play (1.200) ... 55. baseball (0.488) | 0. players (1.513) 1. player (1.463) 2. play (1.333) ... 42. baseball (0.661)
L17H30 | Fact: Tom Brady plays the sport of | 0. players (0.265) 1. player (0.247) 2. Players (0.241) ... 10. football (0.214) | 0. players (1.428) 1. player (1.397) 2. play (1.365) ... 31. football (0.692) | 0. players (1.684) 1. player (1.638) 2. play (1.591) ... 28. football (0.902)
L18H25 | Fact: Michael Jordan plays the sport of | 0. skating (0.246) 1. skate (0.210) 2. Stadium (0.182) ... 9. basketball (0.151) | 0. sport (0.198) 1. Sport (0.192) 2. tennis (0.184) ... 58. basketball (0.107) | 0. skating (0.405) 1. skate (0.354) 2. sport (0.331) ... 19. basketball (0.257)
L18H25 | Fact: Mike Trout plays the sport of | 0. Golf (0.045) 1. leaf (0.038) 2. golf (0.037) ... 274. baseball (0.016) | 0. Sport (0.081) 1. sport (0.080) 2. skiing (0.077) ... 129. baseball (0.036) | 0. Golf (0.097) 1. Track (0.093) 2. golf (0.093) ... 29. baseball (0.062)
L18H25 | Fact: Tom Brady plays the sport of | 0. Formula (0.009) 1. luggage (0.008) 2. Stadium (0.008) ... 646. football (0.004) | 0. Sport (0.333) 1. sport (0.327) 2. skiing (0.308) ... 102. football (0.160) | 0. Sport (0.323) 1. sport (0.315) 2. sports (0.304) ... 70. football (0.168)
L23H22 | Fact: The Colosseum is in the country of | 0. Italy (0.960) 1. Italian (0.954) 2. Italian (0.914) 3. Ital (0.860) 4. Rome (0.722) | 0. Italy (0.304) 1. Italian (0.232) 2. Ital (0.222) 3. Rome (0.220) 4. Italian (0.216) | 0. Italy (1.257) 1. Italian (1.179) 2. Italian (1.125) 3. Ital (1.076) 4. Rome (0.938)
L23H22 | Fact: The Eiffel Tower is in the country of | 0. French (1.229) 1. France (1.176) 2. French (1.111) 3. Paris (1.081) 4. France (1.031) | 0. France (0.416) 1. France (0.364) 2. Paris (0.347) 3. French (0.305) 4. Paris (0.300) | 0. France (1.589) 1. French (1.531) 2. Paris (1.427) 3. France (1.394) 4. French (1.371)
L23H22 | Fact: The Taj Mahal is in the country of | 0. India (0.863) 1. India (0.795) 2. Indian (0.734) 3. Pakistan (0.684) 4. Indian (0.645) | 0. India (0.248) 1. Pakistan (0.223) 2. India (0.219) 3. istan (0.199) 4. Arabia (0.198) | 0. India (1.106) 1. India (1.010) 2. Pakistan (0.907) 3. Indian (0.851) 4. Indian (0.737)
L26H8 | Fact: The Colosseum is in the country of | 0. Rome (1.110) 1. Italy (0.963) 2. Italian (0.962) 3. Italian (0.918) 4. Ital (0.911) | 0. Italian (0.109) 1. Italy (0.109) 2. Italian (0.106) 3. Ital (0.099) 4. Rome (0.095) | 0. Rome (1.206) 1. Italy (1.072) 2. Italian (1.071) 3. Italian (1.024) 4. Ital (1.010)
L26H8 | Fact: The Eiffel Tower is in the country of | 0. French (1.953) 1. Paris (1.952) 2. Paris (1.845) 3. French (1.829) 4. France (1.815) | 0. French (0.095) 1. French (0.090) 2. France (0.086) 3. Paris (0.086) 4. Paris (0.081) | 0. French (2.048) 1. Paris (2.038) 2. Paris (1.926) 3. French (1.919) 4. France (1.901)
L26H8 | Fact: The Taj Mahal is in the country of | 0. Paris (0.211) 1. French (0.205) 2. Paris (0.198) 3. French (0.194) 4. France (0.190) | 0. Spanish (0.007) 1. Spain (0.007) 2. Spain (0.007) 3. Barcelona (0.007) 4. Portuguese (0.007) | 0. Paris (0.208) 1. Paris (0.196) 2. French (0.194) 3. France (0.191) 4. French (0.189)

Table 8. Sorted DLA over all vocab tokens, broken down by source tokens (SUBJECT or RELATION). We note that for mixed heads where C ≈ R, such as L22H15 (a head with a specialized category of sports), the correct attribute is consistently one of the top tokens from SUBJECT, and high but not top from RELATION. By contrast, for mixed heads with slightly different specializations, the correct attribute is high but not top from both SUBJECT and RELATION.

E.6.2. SUBJECT-RELATION PROPAGATION WITH MIXED HEADS

In order to provide a clear distinction between the attributes extracted at the SUBJECT and RELATION tokens, we also investigated knocking out attention from all but the last RELATION token to SUBJECT. This prevents the correct attribute from having already been moved into earlier RELATION tokens. We find that the rank of the correct attribute in the DLA from the relation tokens increases significantly, which demonstrates that some information about the subject had already propagated to earlier RELATION tokens. By isolating this effect through attention knockout, we confirm that mixed heads where C ≈ R regularly result in the attribute being the top token from SUBJECT and near, but not at, the top from RELATION.
Head | Prompt | Subject | Relation (without knockout) | Relation (with knockout) | Relation change
L22H15 | Fact: Michael Jordan plays the sport of | 0 | 1 | 0 | -1
L22H15 | Fact: Mike Trout plays the sport of | 0 | 1 | 3 | +2
L22H15 | Fact: Tom Brady plays the sport of | 0 | 0 | 9 | +9
L17H30 | Fact: Michael Jordan plays the sport of | 23 | 43 | 52 | +9
L17H30 | Fact: Mike Trout plays the sport of | 4 | 55 | 139 | +84
L17H30 | Fact: Tom Brady plays the sport of | 7 | 31 | 37 | +6
L18H25 | Fact: Michael Jordan plays the sport of | 9 | 58 | 60 | +2
L18H25 | Fact: Mike Trout plays the sport of | 277 | 129 | 84 | -45
L18H25 | Fact: Tom Brady plays the sport of | 550 | 102 | 93 | -9
L23H22 | Fact: The Colosseum is in the country of | 0 | 0 | 15 | +15
L23H22 | Fact: The Eiffel Tower is in the country of | 1 | 0 | 0 | 0
L23H22 | Fact: The Taj Mahal is in the country of | 0 | 0 | 0 | 0
L26H8 | Fact: The Colosseum is in the country of | 1 | 1 | 3 | +2
L26H8 | Fact: The Eiffel Tower is in the country of | 4 | 2 | 2 | 0
L26H8 | Fact: The Taj Mahal is in the country of | 22630 | 22594 | 25454 | +2860
L21H23 | Fact: The Colosseum is in the country of | 62 | 0 | 77 | +77
L21H23 | Fact: The Eiffel Tower is in the country of | 33 | 2 | 4 | +2
L21H23 | Fact: The Taj Mahal is in the country of | 1 | 2 | 2 | 0
Mean | | 1311 | 1278 | 1446 | +167

Table 9. The rank of the correct attribute from RELATION increases when we knock out attention from earlier RELATION tokens to SUBJECT. This suggests that significant subject-relation propagation of the correct fact otherwise occurs.

E.7. MLPs

E.7.1. ATTRIBUTES IN R ARE CONSISTENTLY BOOSTED BY MLPS

[Figure 18: attribute DLA summed over all MLP layers, with error bars over subjects, for country attributes (left: Cambodia, Peru, Estonia, Albania, Afghanistan, Nepal, Poland, Bulgaria, Algeria, Lebanon, Bangladesh, Jamaica, Ethiopia, Kenya, Morocco, Iraq, Austria, Qatar, Finland, Romania) and capital-city attributes (right: Budapest, Oslo, Copenhagen, Vienna, Dublin, Auckland, Warsaw, Sydney, Mumbai, Melbourne, Amsterdam, Buenos Aires, Istanbul, Tokyo, Stockholm, Lisbon, Jakarta, Toronto, Bangkok, Hong Kong).]

Figure 18. Many attributes in R are boosted by MLPs over a range of prompts with relations is in the country of (left) and has the capital city of (right), independent of which subject is given. Error bars are standard deviations over different subjects, which are small. This suggests the direct effect of the MLP does not causally depend on the subject.
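The per-attribute MLP DLA summed over layers, as plotted above, could be computed roughly as follows; this is a sketch under the same TransformerLens assumptions as the earlier examples, with the attribute tokens chosen for illustration.

```python
# Direct logit attribution of every MLP layer at the END position, summed over layers (a sketch).
# Reuses `model`, `cache` from the earlier sketches.
import torch

attribute_toks = [model.to_single_token(t) for t in [" France", " Germany", " Italy", " Spain"]]

mlp_outs = torch.stack(
    [cache["mlp_out", layer][0, -1] for layer in range(model.cfg.n_layers)]
)                                                    # [n_layers, d_model]
mlp_dla = model.ln_final(mlp_outs) @ model.W_U       # [n_layers, d_vocab]

for tok in attribute_toks:
    total = mlp_dla[:, tok].sum().item()             # sum the direct effect over all MLP layers
    print(f"{model.tokenizer.decode(tok)!r}: summed MLP DLA = {total:.2f}")
```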
E.7.2. THE PRIMARY DIRECT EFFECT OF MLPS IS OFTEN TO BOOST MANY ATTRIBUTES IN THE SET R

Prompt | Top MLP logit lens tokens
Fact: Michael Jordan plays the sport of | Wrest (0), squash (1), skiing (4), lineback (11), surfing (12), rugby (13), Rugby (14), volley (15), Running (17), boxing (21), Baseball (22), cricket (27), Floor (28), cycling (29), shooting (30), Mixed (31), gardening (32), Golf (34), Forward (35), swimming (39), bridge (40), coward (42), paddle (43), impat (44), CLUDE (45), ping (46), escaping (47), contacting (48), flag (49)
Fact: The Eiffel Tower is in the country of | Niger (0), Burk (1), Georgia (2), Aust (3), Eston (4), Zimbabwe (5), Utt (6), Gren (7), Trin (8), Haiti (9), Lithuan (10), Guatemala (11), Lub (12), Hond (13), Liber (14), Equ (15), Bangladesh (16), Yug (17), Tun (18), Ly (19), Belf (20), Myanmar (21), Kenya (22), Hawai (23), Nepal (24), Sen (25), Ecuador (26), Yemen (27), Iraq (28), Cambodia (29), Chin (30), Afghanistan (31), Turk (32), Chad (33), Somalia (34), Alaska (35), Continuous (36), Tanzania (37), Sloven (38), Peru (39), Idaho (40), Bul (41), Aqu (42), Albany (43), fered (44), Norfolk (45), Byz (46), Kazakh (47), Tuc (48), Bulgaria (49)
Fact: Stephen Hawking is a professor at the university of | Adelaide (0), Cape (5), Alaska (6), Cinc (7), Cincinnati (8), Hawai (12), Manchester (13), Manit (21), Cam (22), fered (23), Chester (24), Chel (25), Gib (26), icago (28), Manila (29), Sussex (31), Minn (33), Buenos (40), Ald (41), Ald (42), Malta (45), Calgary (46), Leicester (48)
Fact: England has the capital city of | Budapest (0), Oslo (1), Birmingham (2), Belfast (3), Cincinnati (4), Constantin (5), Sask (6), Manchester (7), Lancaster (8), Kingston (9), Vienna (10), Malta (11), Copenhagen (12), Guatemala (13), Byz (14), Fuk (15), Chester (16), Brighton (17), Ottawa (18), Trin (19), Helsinki (20), Sacramento (21), Adelaide (22), Omaha (23), Winnipeg (24), Lah (25), Newcastle (26), Mumbai (27), Concord (28), Manila (29), Prague (30), Warsaw (31), Newport (32), Lans (33), Hartford (34), Rochester (35), Glasgow (36), Bulgaria (37), Card (38), Pret (39), Derby (40), Richmond (41), Windsor (42), Buenos (43), Calgary (44), Leeds (45), Dublin (46), Tun (47), Lok (48), Hull (49), Jak (50)

Table 10. Top DLA tokens on the sum of all MLP layers tend to be attributes in the set R. Rank is included in brackets. Often, they are attributes we did not pragmatically check through inclusion in our dataset sets S and R. For instance, our set R for is in the country of did not include the country of Burkina Faso, which is the rank-1 attribute pushed for by the MLP for the prompt The Eiffel Tower is in the country of. The correct attribute a is not privileged among these, and is often quite low in rank.

MLPs on END (mostly) do not have significant indirect effect dependent on the subject.

[Figure 19: pre- vs. post-patched logit difference (left) and log-prob (right) scatter plots for patching all MLP outputs at END between subjects.]

Figure 19. We patch all MLP outputs on END for prompts of form plays the sport of between different subjects to study the indirect effect of MLPs on the END token. For this relationship, we see that on average, performance does not increase under both a logit-diff metric between the to-logit and from-logit (left) and a logprob-based metric (right). The grey line indicates no change. Note that, for some relations, the MLP does have a significant indirect effect. We do not explain these cases, instead opting to only explain part of the function of the MLP through its direct effect.
E.8. Category Identification

Here we try to better understand head categories, by inspecting the top head DLA tokens on a wider distribution of factual recall prompts. We looked at top DLA tokens extracted from 10,000 randomly selected prompts from the CounterFact dataset (Meng et al., 2023a). A summary of some of these is included below for the top 3 Subject, Relation and Mixed heads. We find that head categories are not quite aligned with S or R – heads are polysemantic (Elhage et al., 2022).

In Section 3.2, we saw the relation head L13H31 responded to both sports and countries. It may do this because it appears to be specialized to locations/positions/places, and most sports are also the start of places (e.g. basketball can be the sport or the first token in basketball stadium). We also note that some heads misfire, extracting irrelevant attributes as well as relevant ones. L18H25 appears to be specialized to the category of transport and consumables. However, this head is the 8th most important mixed head across the plays the sport of and is in the country of prompts (by DLA). Upon investigation, we believe this is due to there being some cross-over between sports and transport words, such as Golf (a car brand and also a sport), swimming (a means of traveling and also a sport) and track (a railway track and also the sport of track and field).

Head | Type | Theorized categories | Top 50 tokens
L13H31 | Relation | Location, positioning, places | locations, location, places, cities, place, towns, locate, sites, languages, positions, located, spots, position, hometown, spot, locating, where, continents, loc, placement, territory, professions, city, states, site, town, headquarters, anywhere, kingdom, countries, municipalities, metropolitan, wherever, roles, regions, country, territories, vicinity, camps, venues, venue, centers, placed, destinations, france, residence, placing, finland, island, positioning
L14H24 | Relation | Location, direction, languages | locations, location, anywhere, places, loc, located, wherever, somewhere, downtown, place, placement, english, regions, nearby, east, everywhere, vicinity, positions, geographical, localization, north, zones, situated, languages, nearest, southeast, locating, localities, sites, geographic, northeast, elsewhere, placed, south, locate, northwest, language, proximity, locality, geography, locale, nearer, point, spots, outside, areas, travels, hebrew, centralized, centers
L17H2 | Subject | International relations, politics | france, french, paris, european, international, europeans, europe, german, germany, public, germans, london, global, eu, england, uk, british, translated, franc, fran, worldwide, britain, monsieur, eur, europ, us, translations, globally, translation, internationally, euro, franois, brit, translate, translator, russian, europa, europea, deutsch, russia, montreal, philippe, publicly, canada, russians, translating, canadian, berlin, jacques, english
L17H17 | Subject | Countries, ethnicities, politicians | arizona, alabama, ariz, tamil, indian, kerala, india, nigeria, japanese, pakistan, az, nigerian, japan, ala, pakistani, phoenix, seoul, poland, greek, ari, indians, italy, polish, tokyo, istanbul, delhi, athens, birmingham, punjab, cyprus, greece, turkish, italian, turkey, hindu, niger, venice, lebanese, hawaiian, tampa, warsaw, turk, sic, hawaii, pak, ital, greeks, mumbai, abama, tuc
L18H20 | Relation | Places, diplomacy | countries, city, country, nations, international, globally, diplomatic, abroad, worldwide, global, diplomats, ads, europe, internationally, nation, governor, continents, campus, france, legislators, treaty, street, legislative, foreigners, diplomacy, cities, european, overseas, ticket, national, expatri, diplomat, attendees, pases, foreign, capitol, germany, delegates, asia, conference, nationals, student, expatriate, globe, americas, downtown, students, eur, faculty, australia
L21H9 | Subject | Hedonism, wealth, sport, violence | stock, wrest, beer, tennis, wrestling, coffee, gun, beers, brewery, soccer, brew, tenn, drink, atp, stocks, football, drunk, guns, fighters, drank, fighter, drinking, alcohol, shooting, footballers, drinks, golf, firearm, drunken, shoot, wwe, fight, fifa, alcoholic, beverage, brewing, play, firearms, bullets, nra, vince, caffeine, shooter, mma, starbucks, fighting, train, beverages, shot, liquor
L22H15 | Mixed | Communication, sports | television, tv, games, football, soccer, broadcast, game, sports, broadcasting, players, sport, fifa, player, broadcasts, basketball, radio, payment, hockey, baseball, footballers, tournament, tennis, league, tele, sporting, rugby, gamers, espn, athletes, gaming, footballer, tournaments, athlet, payments, cameras, internet, playing, watches, athletic, camera, cricket, stadium, play, athlete, aired, nfl, golf, advertising, gameplay, storage
L23H22 | Mixed | Countries, languages, ethnicity | chinese, china, greek, japanese, japan, beijing, french, russian, italian, spanish, france, mexican, italy, greece, russia, tokyo, greeks, shanghai, russians, finnish, ital, mexico, mex, german, spani, latino, dutch, germany, portuguese, moscow, cyprus, taiwan, brazilian, spain, soviet, ukrainian, germans, swedish, brazil, quebec, guang, hispanic, zhang, jiang, norwegian, ukraine, korean, paris, qing, belgian
L26H8 | Mixed | Places, culture, universities | van, dutch, von, las, brazilian, los, filip, french, la, portuguese, han, brazil, hait, france, mexican, lap, italian, mexico, paris, mex, portugal, louis, philippine, spanish, ital, chile, italy, sierra, spain, holland, manila, louisiana, netherlands, so, philippines, jean, monsieur, portug, argentine, barcelona, spani, rio, ucla, argentina, lisbon, haiti, pierre, madrid, brasil, buenos

Table 11. For a selection of important heads, we display the top 50 tokens that they output (by maximum DLA) from a broader data set of 10,000 prompts. We also include hand-written categories that the head specializes in, based on human evaluation of the top 500 tokens that they output. We note that Subject, Relation and Mixed attention heads all seem to specialize to just a few categories.

F. Attention Head Superposition

Our initial motivation for studying the factual recall setup was to find real-world examples of attention head superposition (Jermyn et al., 2023). In this section, we explain this motivation.

In mechanistic interpretability, we wish to explain the behavior of neural networks through understanding the representations and algorithms implemented in weights and activations. This requires a notion of the 'fundamental units' of networks. It is a reasonable place to start to investigate the natural structures we find in networks. In some cases, this seems very reasonable: non-linear activations produce a privileged basis in the space of neuron activations, which could result in feature representations being aligned to the neuron basis, and individual neurons being interpretable. Unfortunately, we find that neurons are often polysemantic, encoding many different features.
We hypothesize this occurs due to superposition: the network is incentivized to encode more features than it has dimensions. It seems the correct place to look for features is not in the neurons, but in directions in the neuron activation space. Through a similar argument, we also expect that the residual stream of a transformer stores features in superposition, which is termed bottleneck superposition. Much work is being put into the problem of 'solving' superposition, and finding meaningful, interpretable and sparsely activating directions in activation space (Cunningham et al., 2023).

A natural further question to ask is: where else are we studying the wrong fundamental units? In language model interpretability, we often care about localising the computational graph of particular behaviors. This often initially consists of a set of attention heads and MLP layers that "matter" for a given task. But are attention heads themselves the correct unit of study? We know neurons are not; is it possible that attention heads are also not? We have reason to believe the network may try to introduce compression in attention heads themselves. We should expect that models may use a mechanism like this to implement many more behaviors than they have heads. It is possible that each head is individually polysemantic and implements several distinct behaviors, but in any given context a specific subset of heads work together, attend to the same place, and the output is the residual stream times the weighted sum of their OV matrices.

Is there meaningful structure on the set of (n_layers × n_heads) attention heads? Can we productively think of attention heads as being in superposition in certain contexts? This idea was first introduced by Jermyn et al. (2023), who suggested attention head superposition as a phenomenon and thought they had found a toy example of it, which they later concluded was not quite attention head superposition. There is some evidence in LLMs for attention head superposition – we often find many heads that seem to be doing the same thing on some sub-distribution. For instance, why are there often several induction heads? In the IOI circuit (Wang et al., 2022), why are there several name mover heads? Can these be productively thought of as a single superposed name mover head? This could additionally explain why negative name mover heads exist: the heads should then be thought of as a single coherent unit, rather than the model learning a real circuit (name movers) and learning a weird anti-circuit (negative name movers) on top.

Here is a theoretical example of head superposition. Say we have 2 heads, X and Y, that extract 3 different things (depending on the context): A, B and C. X activates in contexts A and B, giving +A+2B−2C. Y activates in contexts A and C, giving A−2B+2C. Then, in the A context the total output X+Y = 2A; in the B context the output is A+2B−2C; and in the C context the output is A−2B+2C. In each context the target feature dominates, and we have compressed 3 tasks into 2 heads.

If the "relation propagation" hypothesis of Geva et al. (2023a) were the primary story behind factual recall, factual recall may be a good place to hunt for attention head superposition. Models likely know many more kinds of facts than they have heads, and so may use heads in combination to extract the correct fact. We however found that models did not use enough relation propagation for this theoretical picture to hold up. Nevertheless, finding examples of attention head superposition is still an interesting future direction.
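A tiny numerical illustration of the toy X/Y example above (not from the paper; the feature directions are arbitrary orthogonal vectors chosen for the sketch).

```python
# Two "heads" compressing three context-dependent outputs (toy sketch of the X/Y example).
import numpy as np

A, B, C = np.eye(3)            # three features as orthogonal directions

X = A + 2 * B - 2 * C          # head X's write when it activates (contexts A and B)
Y = A - 2 * B + 2 * C          # head Y's write when it activates (contexts A and C)

outputs = {
    "A": X + Y,                # both heads fire: the B/C terms cancel, leaving 2A
    "B": X,                    # only X fires
    "C": Y,                    # only Y fires
}
for ctx, out in outputs.items():
    scores = {name: float(out @ feat) for name, feat in zip("ABC", (A, B, C))}
    print(ctx, scores)         # in each context, the target feature has the largest score
```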