Paper deep dive
How Do Language Models Bind Entities in Context?
Jiahai Feng, Jacob Steinhardt
Models: LLaMA family, Pythia family
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 7:54:06 PM
Summary
The paper identifies a 'binding ID mechanism' in large language models (LLMs) from the Pythia and LLaMA families, which allows them to associate entities with their attributes in context. Using causal mediation analysis, the authors demonstrate that LLMs represent these bindings by attaching binding ID vectors to entity and attribute activations. These vectors form a continuous subspace, and the mechanism is shown to be robust, transferable, and increasingly precise with model scale.
Entities (5)
Relation Signals (4)
Causal Mediation Analysis → verifies → Binding ID mechanism
confidence 96% · Using causal mediation analysis we empirically verify two key properties of the binding ID mechanism
Pythia → utilizes → Binding ID mechanism
confidence 95% · we observe in every sufficiently large model from the Pythia and LLaMA families
Llama → utilizes → Binding ID mechanism
confidence 95% · we observe in every sufficiently large model from the Pythia and LLaMA families
Binding ID mechanism → represents via → Binding ID vectors
confidence 94% · LMs’ internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes
Cypher Suggestions (2)
Find all model families that utilize the binding ID mechanism. · confidence 90% · unvalidated
MATCH (m:ModelFamily)-[:UTILIZES]->(b:Mechanism {name: 'Binding ID mechanism'}) RETURN m.name
Identify methodologies used to study the binding ID mechanism. · confidence 90% · unvalidated
MATCH (m:Methodology)-[:VERIFIES]->(b:Mechanism {name: 'Binding ID mechanism'}) RETURN m.name
Abstract
Abstract: To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.
Tags
Links
- Source: https://arxiv.org/abs/2310.17191
- Canonical: https://arxiv.org/abs/2310.17191
Full Text
134,705 characters extracted from source content.
How do Language Models Bind Entities in Context?
Jiahai Feng & Jacob Steinhardt
UC Berkeley
Correspondence to fjiahai@berkeley.edu

Abstract
To correctly use in-context information, language models (LMs) must bind entities to their attributes. For example, given a context describing a "green square" and a "blue circle", LMs must bind the shapes to their respective colors. We analyze LM representations and identify the binding ID mechanism: a general mechanism for solving the binding problem, which we observe in every sufficiently large model from the Pythia and LLaMA families. Using causal interventions, we show that LMs' internal activations represent binding information by attaching binding ID vectors to corresponding entities and attributes. We further show that binding ID vectors form a continuous subspace, in which distances between binding ID vectors reflect their discernability. Overall, our results uncover interpretable strategies in LMs for representing symbolic knowledge in-context, providing a step towards understanding general in-context reasoning in large-scale LMs.

1 Introduction
Modern language models (LMs) excel at many reasoning benchmarks, suggesting that they can perform general purpose reasoning across many domains. However, the mechanisms that underlie LM reasoning remain largely unknown (Räuker et al., 2023). The deployment of LMs in society has led to calls to better understand these mechanisms (Hendrycks et al., 2021), so as to know why they work and when they fail (Mu & Andreas, 2020; Hernandez et al., 2021; Vig et al., 2020b). In this work, we seek to understand binding, a foundational skill that underlies many compositional reasoning capabilities (Fodor & Pylyshyn, 1988) such as entity tracking (Kim & Schuster, 2023). How humans solve binding, i.e. recognize features of an object as bound to that object and not to others, is a fundamental problem in psychology (Treisman, 1996). Here, we study binding in LMs.
Binding arises any time the LM has to reason about two or more objects of the same kind. For example, consider the following passage involving two people and two countries:

Context: Alice lives in the capital city of France. Bob lives in the capital city of Thailand.
Question: Which city does Bob live in? (1)

In this example the LM has to represent the associations lives(Alice, Paris) and lives(Bob, Bangkok). We call this the binding problem: for the predicate lives, Alice is bound to Paris and Bob to Bangkok. Since predicates are bound in-context, binding must occur in the activations, rather than in the weights as with factual recall (Meng et al., 2022). This raises the question: how do LMs represent binding information in the context such that it can later be recalled?

Overall, our key technical contribution is the identification of a robust general mechanism in LMs for solving the binding problem. The mechanism relies on binding IDs, which are abstract concepts that LMs use internally to mark variables in the same predicate apart from variables in other predicates (Fig. 1). Using causal mediation analysis we empirically verify two key properties of the binding ID mechanism (section 3): factorizability and position independence. Turning to the structure of binding IDs, we find that binding IDs are represented as vectors which are bound to variables by simple addition (section 4) in the activation space. Further, we show that binding IDs occupy a subspace, in the sense that linear combinations of binding IDs are still valid binding IDs, even though random vectors are not. Lastly, we find that binding IDs are ubiquitous and transferable (section 5). They are used by every sufficiently large model in the LLaMA (Touvron et al., 2023) and Pythia (Biderman et al., 2023) families, and their fidelity increases with scale. They are used for a variety of synthetic binding tasks with different surface forms, and binding vectors from one task transfer to other tasks.
Finally, we qualify our findings by showing that despite their ubiquity, binding IDs are not universal: we exhibit a question-answering task where an alternate mechanism, "direct binding", is used instead (appendix E). We release code and datasets here: https://github.com/jiahai-feng/binding-iclr

Figure 1: The Binding ID mechanism. The LM learns abstract binding IDs (drawn as triangles or squares) which distinguish between entity-attribute pairs. Binding functions Γ_E and Γ_A bind entities and attributes to their abstract binding ID, and store the results in the activations. To answer queries, the LM identifies the attribute that shares the same binding ID as the queried entity.

2 Preliminaries
In this section we define the binding task and explain causal mediation analysis, our main experimental technique.

Binding task. Formally, the binding task consists of a set of entities E and a set of attributes A. An n-entity instance of the binding task consists of a context that is constructed from n entities e_0, …, e_{n−1} ∈ E and n attributes a_0, …, a_{n−1} ∈ A, and we denote the corresponding context as c = ctxt(e_0 ↔ a_0, …, e_{n−1} ↔ a_{n−1}). For a context c, we use E_k(c) and A_k(c) to denote the k-th entity and the k-th attribute of the context c, for k ∈ [0, n−1]. We will drop the dependence on c for brevity when the choice of c is clear from context. In the capitals task, which is the main task we study for most of the paper, E is a set of single-token names, and A is a set of single-token countries.
Quote 1 is an example instance of the capitals task with context c = ctxt(Alice ↔ France, Bob ↔ Thailand). In this context, E_0 is Alice, A_0 is France, etc. Given a context c, we are interested in the model's behavior when queried with each of the n entities present in c. For any k ∈ [0, n−1], when queried with the entity E_k the model should place high probability on the answer matching A_k. In our running example, the model should predict "Paris" when queried with "Alice", and "Bangkok" when queried with "Bob".

To evaluate a model's behavior on a binding task, we sample N = 100 contexts. For each context c, we query the LM with every entity mentioned in the context, which returns a vector of log probabilities over every token in the vocabulary. The mean log prob metric measures the mean of the log probability assigned to the correct attribute token. Top-1 accuracy measures the proportion of queries where the correct attribute token has the highest log probability out of all attribute tokens. However, we will instead use the median-calibrated accuracy (Zhao et al., 2021), which calibrates the log probabilities with the median log probability before taking the top-1 accuracy. We discuss this choice in appendix A.

Figure 2: a) Causal diagram for autoregressive LMs. From input context ctxt(e_0 ↔ a_0, e_1 ↔ a_1), the LM constructs internal representations Z_context. We will mainly study the components of Z_context boxed in blue. b) A secondary run of the LM on context ctxt(e_2 ↔ a_2, e_3 ↔ a_3) to produce Z′_context. c) An example intervention where Z_context is modified by replacing Z_{A_0} → Z′_{A_0} from Z′_context.

Causality in autoregressive LMs. We utilize inherent causal structure in autoregressive LMs. Let an LM have n_layers transformer layers and a d_model-dimensional activation space. For every token position p, we use Z_p ∈ R^{n_layers × d_model} to denote the stacked set of internal activations (the pre-transformer-layer activations, sometimes referred to as the residual stream) at token p (see Fig. 2a). We refer to the collective internal activations of the context as Z_context. In addition, we denote the activations at the token for the k-th entity as Z_{E_k}, and the k-th attribute as Z_{A_k}. We sometimes write Z_{A_k}(c), Z_context(c), etc. to make clear the dependence on the context c. Fig. 2a shows that Z_context contains all the information about the context that the LM uses.
We thus study the structure of Z_context using causal mediation analysis, a widely used tool for understanding neural networks (Vig et al., 2020a; Geiger et al., 2021; Meng et al., 2022). Causal mediation analysis involves substituting one set of activations in a network for another, and we adopt the /. notation (from Mathematica) to denote this. For example, for activations Z_* ∈ R^{n_layers × d_model} and a token position p in the context, Z_context /. {Z_p → Z_*} = [Z_0, …, Z_{p−1}, Z_*, Z_{p+1}, …]. Similarly, for a context c = ctxt(e_0 ↔ a_0, …, e_{n−1} ↔ a_{n−1}), we have c /. {E_k → e_*} = ctxt(e_0 ↔ a_0, …, e_* ↔ a_k, …, e_{n−1} ↔ a_{n−1}).

Given a causal graph, causal mediation analysis determines the role of an intermediate node by experimentally intervening on the value of the node and measuring the model's output on various queries. For convenience, when the model answers queries in accordance with a context c, we say that the model believes c. (We do not claim or assume that LMs actually have beliefs in the sense that humans do; this is a purely notational choice to reduce verbosity.) If there is no context consistent with the language model's behavior, then we say that the LM is confused.

As an example, suppose we are interested in the role of the activations Z_{A_0} in Fig. 2a. To apply causal mediation analysis, we would:
1. Obtain Z_context by running the model on the original context c (which we also refer to as the target context) (Fig. 2a).
2. Obtain Z′_context by running the model on a different context c′ (i.e. the source context) (Fig. 2b).
3. Modify Z_context by replacing Z_{A_0} from the target context with Z′_{A_0} from the source context (Fig. 2c), while keeping all other aspects of Z_context the same, resulting in Z_context^intervened = Z_context /. {Z_{A_0} → Z′_{A_0}}.
4. Evaluate the model's beliefs based on the new Z_context^intervened.

We can infer the causal role of Z_{A_0} from how the intervention Z_context /. {Z_{A_0} → Z′_{A_0}} changes the model's beliefs. Intuitively, if the model retains its original beliefs c, then Z_{A_0} has no causal role in the model's behavior on the binding task. On the other hand, if the model now believes the source context c′, then Z_{A_0} contains all the information in the context. In reality both hypothetical extremes are implausible, and in section 3 we discuss a more realistic hypothesis.
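The four-step procedure above can be illustrated on a toy stand-in for the LM (not the paper's code; the names and data structures here are purely illustrative). The sketch represents Z_context as a dictionary of per-token activations, where each entity or attribute token carries a (value, binding ID) pair as the binding ID mechanism hypothesizes, and the query step retrieves the attribute whose binding ID matches the queried entity. Under this simplification, the intervention Z_context /. {Z_{A_0} → Z′_{A_0}} is literally a dictionary-entry replacement:

```python
# Toy stand-in for Z_context: each entity/attribute token stores a
# (value, binding ID) pair, mimicking the hypothesized representation.
def make_context(pairs):
    """pairs: list of (entity, attribute); binding ID k = position in the list."""
    Z = {}
    for k, (e, a) in enumerate(pairs):
        Z[f"E{k}"] = (e, k)   # Z_{E_k}: entity value plus binding ID k
        Z[f"A{k}"] = (a, k)   # Z_{A_k}: attribute value plus binding ID k
    return Z

def query(Z, entity):
    """Answer by retrieving the attribute whose binding ID matches the
    queried entity's binding ID (step 4 of the mechanism)."""
    _, k = next(v for key, v in Z.items()
                if key.startswith("E") and v[0] == entity)
    return next(v[0] for key, v in Z.items()
                if key.startswith("A") and v[1] == k)

# Step 1: target context c, and step 2: source context c' (capitals task)
Z = make_context([("Alice", "France"), ("Bob", "Thailand")])
Zp = make_context([("Carol", "Kenya"), ("Dave", "Peru")])

# Step 3: Z_context /. {Z_{A_0} -> Z'_{A_0}} -- splice in the source
# activation while freezing every other activation.
Z_int = dict(Z)
Z_int["A0"] = Zp["A0"]
```

Step 4 then reads off the beliefs: query(Z_int, "Alice") returns "Kenya" while query(Z_int, "Bob") still returns "Thailand", i.e. the toy model now believes c /. {A_0 → A′_0}, the factorizable outcome the paper tests for in real LMs.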
A subtle point is that we study how different components of Z_context store information about the context (and thus influence behavior), and not how Z_context itself is constructed. We thus suppress the causal influence that Z_{A_0} has on downstream parts of Z_context (such as Z_{E_1} and Z_{A_1}) by freezing the values of Z_{E_1} and Z_{A_1} in Z_context^intervened instead of recomputing them based on Z′_{A_0}.

3 Existence of Binding IDs
In this section, we first describe our hypothesized binding ID mechanism. Then, we identify two key predictions of the mechanism, factorizability and position independence, and verify them experimentally. We provide an informal argument in appendix B for why this binding ID mechanism is the only mechanism consistent with factorizability and position independence.

Binding ID mechanism. We claim that to bind attributes to entities, the LM learns abstract binding IDs that it assigns to entities and attributes, so that entities and attributes bound together have the same binding ID (Fig. 1). In more detail, our informal description of the binding ID mechanism is:
1. For the k-th entity-attribute pair, construct an abstract binding ID that is independent of the entity/attribute values. Thus, for a fixed n-entity binding task (e.g. the capitals task) we can identify the k-th abstract binding ID with the integer k ∈ {0, …, n−1}.
2. For entity E_k, encode both the entity E_k and the binding ID k in the activations Z_{E_k}.
3. For attribute A_k, encode both the attribute A_k and the binding ID k in the activations Z_{A_k}.
4. To answer a query for entity E_k, retrieve from Z_context the attribute that shares the same binding ID as E_k.

Further, for activations Z_{E_k} and Z_{A_k}, the binding ID and the entity/attribute are the only information they contain that affects the query behavior. More formally, there are binding functions Γ_E(e, k) and Γ_A(a, k) that fully specify how Z_E and Z_A bind entities/attributes with binding IDs. Specifically, if E_k = e ∈ E, then we can replace Z_{E_k} with Γ_E(e, k) without changing the query behavior, and likewise for Z_A.
More generally, given Z_context with entity representations Γ_E(e_0, 0), …, Γ_E(e_{n−1}, n−1) and attribute representations Γ_A(a_0, π(0)), …, Γ_A(a_{n−1}, π(n−1)) for a permutation π, the LM should answer queries according to the context c = ctxt(e_0 ↔ a_{π^{−1}(0)}, …, e_{n−1} ↔ a_{π^{−1}(n−1)}). This implies two properties in particular, which we will test in the following subsections:

• Factorizability: if we replace Z_{A_k} with Z′_{A_k} (the corresponding activation from a source context c′), then the model will bind E_k to A′_k instead of A_k, i.e. it will believe c /. {A_k → A′_k}. This is because Z′_{A_k} encodes Γ_A(A′_k, k) and Z_{A_k} encodes Γ_A(A_k, k). Substituting Z_{A_k} → Z′_{A_k} will overwrite Γ_A(A_k, k) with Γ_A(A′_k, k), causing the model to bind E_k to A′_k.

• Position independence: if we e.g. swap Z_{A_0} and Z_{A_1}, the model still binds A_0 ↔ E_0 and A_1 ↔ E_1, because it looks up attributes based on binding ID and not position in the context.

In section 4, we construct fine-grained modifications to the activations Z that modify the binding ID but not the attributes, allowing us to test the binding hypothesis more directly. In section 5 we extend this further by showing that binding IDs can be transplanted from entirely different tasks.

Figure 3: Factorizability results. (a) Swapping entity/attribute for (E_0, A_0). (b) Swapping entity/attribute for (E_1, A_1). Each row corresponds to querying for a particular entity. Plotted are the mean log probs for all four attributes. Highlighted squares are predicted by factorizability.

3.1 Factorizability of activations
The first property of Z_context we test is factorizability. In our claimed mechanism, information is highly localized: Z_{A_k} contains all relevant information about A_k, and likewise for Z_{E_k}.
Therefore, we expect LMs that implement this mechanism to have factorizable activations: for any contexts c, c′, substituting Z_{E_k} → Z_{E_k}(c′) into Z_context(c) will cause the model to believe c /. {E_k → E′_k}, and substituting Z_{A_k} → Z_{A_k}(c′) will cause the model to believe c /. {A_k → A′_k}.

To test this concretely, we considered the capitals task from section 2 with n = 2 entity-attribute pairs. We computed representations for two contexts c = ctxt(e_0 ↔ a_0, e_1 ↔ a_1) and c′ = ctxt(e′_0 ↔ a′_0, e′_1 ↔ a′_1), and used causal mediation analysis (section 2) to swap representations from the source context c′ into the target context c. Specifically, we fix k ∈ {0, 1} and intervene on either just the entity (Z_{E_k} → Z′_{E_k}), just the attribute, neither, or both. We then measure the mean log probs for all possible queries (E_0, E_1, E′_0, E′_1).

For instance, swapping A_k with A′_k in Z_context should lead A′_k (and not A_k) to have high log-probability when E_k is queried. Results are shown in Fig. 3 and support the factorizability hypothesis. As an example, consider Fig. 3(a). In the None setting (no intervention), we see high log probs for A_0 when queried for E_0, and for A_1 when queried for E_1. This indicates that the LM is able to solve this task. Next, consider the Attribute intervention setting (A_0 → A′_0): querying for E_0 now gives high log probs for A′_0, and querying for E_1 gives A_1 as usual. Finally, in the Both setting (where both entity and attribute are swapped), querying E′_0 returns A′_0 while querying E_0 leads to approximately uniform predictions.

Experiment details. We use LLaMA-30b here and elsewhere unless otherwise stated. In practice, we found that activations for both the entity token and the subsequent token encode the entity binding information. Thus for all experiments in this paper, we expand the definition of Z_{E_k} to include the token activations immediately after E_k.

3.2 Position independence

Figure 4: Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. For brevity, let Z_k refer to Z_{E_k} or Z_{A_k}. The grey and green vertical lines indicate the original positions for Z_0 and Z_1 respectively.
The x-axis marks x, Z_0's new position. Under the position interventions {X_0 → x, X_1 → X_1 − (x − X_0)}, the grey line is the control condition with no interventions, and the green line is the swapped condition where Z_0 and Z_1 have swapped positions.

We next turn to position independence, which is the other property we expect LMs implementing the binding ID mechanism to have. This says that permuting the order of the Z_{E_k} and Z_{A_k} should have no effect on the output, because the LM looks only at the binding IDs and not the positions of entity or attribute activations. To apply causal interventions to the positions, we use the fact that transformers use positional embeddings to encode the (relative) position of each token in the input. We can thus intervene on these embeddings to "move" one of the Z_k's to another location k′. Formally, we let X_k denote the position embedding for Z_k, and denote the position intervention as {X_k → k′}. In appendix C we describe how to do this for rotary position embeddings (RoPE), which underlie all the models we study. For now, we will assume this intervention as a primitive and discuss experimental results.

For our experiments, we again consider the capitals task with n = 2. Let X_{E_0} and X_{E_1} denote the positions of the two entities. We apply interventions of the form {X_{E_0} → x, X_{E_1} → X_{E_1} − (x − X_{E_0})}, for x ∈ {X_{E_0}, X_{E_0}+1, …, X_{E_1}}. This measures the effect of gradually moving the two entity positions past each other: when x = X_{E_0}, no intervention is performed (control condition), and when x = X_{E_1} the entity positions are swapped (swapped condition). We repeat the same experiment with attribute activations and measure the mean log probs in both cases.

Results are shown in Fig. 4. As predicted under position independence, position interventions result in little change in model behavior. Consider the swapped condition at the green line. Had the binding information been entirely encoded in position, we would expect a complete switch in beliefs compared to the control condition. In reality, we observe almost no change in mean log probs for entities and a small change in mean log probs for attributes that seems to be part of an overall gradual trend. We interpret this gradual trend as an artifact of position-dependent bias, and not as evidence against position independence. We view it as a bias because it affects all attributes regardless of how they are bound: attributes that are shifted to later positions always have higher log probs.
We provide further discussion of this bias, as well as other experimental details, in appendix C.

4 Structure of Binding ID
The earlier section shows evidence for the binding ID mechanism. Here, we investigate two hypotheses on the structure of binding IDs and binding functions. The first is that the binding functions Γ_A and Γ_E are additive, which lets us think of binding IDs as binding vectors. The second is contingent on the first, and asks if binding vectors have a geometric relationship between each other.

4.1 Additivity of Binding Functions
Prior interpretability research has proposed that transformers represent features linearly (Elhage et al., 2021). Therefore a natural hypothesis is that both entity/attribute representations and abstract binding IDs are vectors in activation space, and that the binding function (section 3) simply adds the vectors for entity/attribute and binding ID. We let the binding ID k be represented by the pair of vectors [b_E(k), b_A(k)], and the representations of entity e and attribute a be f_E(e) and f_A(a) respectively. Then, we hypothesize that the binding functions can be linearly decomposed as:

Γ_A(a, k) = f_A(a) + b_A(k),   Γ_E(e, k) = f_E(e) + b_E(k).   (1)

Binding ID vectors seem intuitive and plausibly implementable by transformer circuits. To experimentally test this, we seek to extract b_A(k) and b_E(k) in order to perform vector arithmetic on them.
We use (1) to extract the differences $\Delta_E(k) := b_E(k) - b_E(0)$ and $\Delta_A(k) := b_A(k) - b_A(0)$. Rearranging (1), we obtain

$$\Delta_A(k) = \Gamma_A(a, k) - \Gamma_A(a, 0), \qquad \Delta_E(k) = \Gamma_E(e, k) - \Gamma_E(e, 0). \tag{2}$$

We estimate $\Delta_A(k)$ by sampling $\mathbb{E}_{c, c'}[Z_{A_k}(c) - Z_{A_0}(c')]$, and likewise for $\Delta_E(k)$.

Table 1: Mean calibrated accuracies for mean interventions on four test conditions. Columns are the test conditions, and rows are queries. Left: interventions with the estimated difference vectors. Right: mean interventions with random vectors.

| Test condition  | Control | Attribute | Entity | Both | Attribute (random) | Entity (random) | Both (random) |
| Querying $E_0$  | 0.99    | 0.00      | 0.00   | 0.97 | 0.98               | 0.98            | 0.97          |
| Querying $E_1$  | 1.00    | 0.03      | 0.01   | 0.99 | 1.00               | 1.00            | 1.00          |

Mean interventions. With the difference vectors, we can modify binding IDs by performing mean interventions, and observe how model behavior changes.
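The sampling estimator for the difference vectors can be sketched as follows, using a toy additive model in place of real LM activations (all quantities are synthetic; `Z_A` is a hypothetical stand-in for extracting $Z_{A_k}(c)$ from the model):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 16, 500
b_A = rng.normal(size=(2, d))          # ground-truth binding vectors (toy)

def Z_A(k: int, c: np.ndarray) -> np.ndarray:
    # Toy stand-in for Z_{A_k}(c): attribute content plus binding vector,
    # where c plays the role of f_A(A_k(c)).
    return c + b_A[k]

contexts = rng.normal(size=(n_samples, d))    # samples of c
contexts2 = rng.normal(size=(n_samples, d))   # independent samples of c'
# Estimate Delta_A(1) = E_{c,c'}[Z_{A_1}(c) - Z_{A_0}(c')]; the attribute
# content terms average out across independent context pairs.
delta_A1 = np.mean([Z_A(1, c) - Z_A(0, cp)
                    for c, cp in zip(contexts, contexts2)], axis=0)
assert np.linalg.norm(delta_A1 - (b_A[1] - b_A[0])) < 0.5
```

With 500 samples the residual attribute content shrinks roughly as $1/\sqrt{500}$, which is why the estimate lands close to the true $b_A(1) - b_A(0)$.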
The attribute mean intervention switches the binding ID vectors in $Z_{A_0}$ and $Z_{A_1}$ with the interventions $Z_{A_0} \to Z_{A_0} + \Delta_A(1)$ and $Z_{A_1} \to Z_{A_1} - \Delta_A(1)$. The entity mean intervention similarly switches the binding ID vectors in $Z_{E_0}$ and $Z_{E_1}$. Additivity predicts that performing either mean intervention will reverse the model behavior: $E_0$ will be associated with $A_1$, and $E_1$ with $A_0$.

Experiments. In our experiments, we fix $n = 2$ and use 500 samples to estimate $\Delta_E(1)$ and $\Delta_A(1)$. We then perform four tests, and evaluate the model accuracy under the original belief. The Control test has no interventions, and its accuracy reflects the model's base performance. The Attribute and Entity tests perform the attribute and entity mean interventions, which should lead to a complete switch in model beliefs, so that accuracy is near 0. Table 1 shows agreement with additivity: the accuracies are above 99% for Control, and below 3% for Attribute and Entity. As a further check, we perform both attribute and entity mean interventions simultaneously, which should cancel out and thus restore accuracy. Indeed, Table 1 shows that accuracy for Both is above 97%.
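Under the additivity hypothesis, the attribute mean intervention amounts to simple vector arithmetic. A toy sketch (synthetic vectors standing in for real activations):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
f_A = rng.normal(size=(2, d))      # attribute representations (toy)
b_A = rng.normal(size=(2, d))      # binding ID vectors (toy)
delta_A1 = b_A[1] - b_A[0]         # Delta_A(1)

Z_A0, Z_A1 = f_A[0] + b_A[0], f_A[1] + b_A[1]
# Attribute mean intervention: Z_A0 -> Z_A0 + Delta_A(1), Z_A1 -> Z_A1 - Delta_A(1).
Z_A0_new, Z_A1_new = Z_A0 + delta_A1, Z_A1 - delta_A1
# Under additivity this swaps the binding IDs while leaving attribute content intact.
assert np.allclose(Z_A0_new, f_A[0] + b_A[1])
assert np.allclose(Z_A1_new, f_A[1] + b_A[0])
```

Because the intervention adds and subtracts the same vector, it also preserves the sum of the two activations, which is one reason the Both condition cancels back to the original beliefs.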
Finally, to show that the specific directions obtained by the difference vectors matter, we sample random vectors with the same magnitude but random directions, and perform the same mean interventions with the random vectors. These random vectors have no effect on model behavior (Table 1, right).

4.2 The Geometry of Binding ID Vectors

Figure 5: The plots show the mean median-calibrated accuracy when one pair of binding IDs, $v_0$, is fixed at the green circle, and the other, $v_1$, is varied across the grid. The binding IDs $b(0)$, $b(1)$, and $b(2)$ are shown as the origin of the arrows, the end of the horizontal arrow, and the end of the diagonal arrow respectively. We use LLaMA-13b for computational reasons.

Section 4.1 shows that we can think of binding IDs as pairs of ID vectors, and that randomly chosen vectors do not function as binding IDs. We next investigate the geometric structure of valid binding vectors and find that linear interpolations or extrapolations of binding vectors are often also valid binding vectors. This suggests that binding vectors occupy a continuous binding subspace. We find evidence of a metric structure in this space, such that nearby binding vectors are hard for the model to distinguish, but far-apart vectors can be reliably distinguished and thus used for the binding task.

To perform our investigation, we apply variants of the mean interventions in section 4.1. As before, we start with an $n = 2$ context, thus obtaining representations $Z_0 = (Z_{E_0}, Z_{A_0})$ and $Z_1 = (Z_{E_1}, Z_{A_1})$.
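The random-direction baseline only needs to match the magnitude of the estimated difference vector; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
delta = rng.normal(size=16)        # stand-in for an estimated difference vector
# Baseline control: a random direction rescaled to the same magnitude as delta.
g = rng.normal(size=delta.shape)
random_vec = g / np.linalg.norm(g) * np.linalg.norm(delta)
assert np.isclose(np.linalg.norm(random_vec), np.linalg.norm(delta))
```

Matching the norm isolates direction as the variable under test: any behavioral difference between `delta` and `random_vec` must come from where the vector points, not how large it is.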
We first erase the binding information by subtracting $(\Delta_E(1), \Delta_A(1))$ from $Z_1$, which reduces accuracy to chance. Next, we add vectors $v_0 = (v_{E_0}, v_{A_0})$ and $v_1 = (v_{E_1}, v_{A_1})$ to the representations $Z$; if doing so restores accuracy, then we view $(v_{E_0}, v_{A_0})$ and $(v_{E_1}, v_{A_1})$ as valid binding pairs. To generate different choices of $v$, we take linear combinations across a two-dimensional space. The basis vectors for this space are $(\Delta_E(1), \Delta_A(1))$ and $(\Delta_E(2), \Delta_A(2))$, the latter obtained by averaging across an $n = 3$ context. Fig. 5 shows the result for several different combinations, where the coordinates of $v_0$ are fixed and shown in green while the coordinates of $v_1$ vary. When $v_1$ is close to $v_0$, the LM gets close to 50% accuracy, which indicates confusion. When $v_1$ is far from $v_0$, the network consistently achieves high accuracy, demonstrating that linear combinations of binding IDs (even with negative coefficients) are themselves valid binding IDs. See appendix G for details.
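Generating the grid of candidate binding vectors is a small amount of linear algebra; a sketch with synthetic stand-ins for the two basis difference vectors (grid range and resolution are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
# Hypothetical stand-ins for the basis vectors Delta(1) and Delta(2).
d1, d2 = rng.normal(size=d), rng.normal(size=d)

v0 = 0.0 * d1                      # v_0 held fixed (here at the grid origin)
# v_1 sweeps linear combinations, including negative coefficients (extrapolation).
coeffs = [(a, b) for a in np.linspace(-1.0, 2.0, 7)
                 for b in np.linspace(-1.0, 2.0, 7)]
v1_grid = [a * d1 + b * d2 for a, b in coeffs]
# The quantity predicted to govern accuracy is the distance between v_1 and v_0.
dists = np.array([np.linalg.norm(v1 - v0) for v1 in v1_grid])
assert len(v1_grid) == 49
```

Plotting accuracy against `dists` is the toy analogue of Fig. 5: near-zero distance should give chance accuracy, large distance high accuracy.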
The geometry of the binding subspace hints at circuits (Elhage et al., 2021) in LMs that process binding vectors. For example, we speculate that certain attention heads might be responsible for comparing binding ID vectors, since the attention mechanism computes attention scores using a quadratic form, which could provide the metric over the binding subspace.

5 Generality and Limitations of Binding ID

The earlier sections investigate binding IDs for one particular task: the capitals task. In this section, we evaluate their generality. We first show that binding vectors are used for a variety of tasks and models. We then show evidence that the binding vectors are task-agnostic: vectors from one task transfer across many different tasks. However, our mechanism is not fully universal; appendix E describes a question-answering task that uses an alternative binding mechanism.

Figure 6: Left: models in the Pythia and LLaMA families on capitals. LLaMA-65b not present for computational reasons. Right: LLaMA-30b on binding tasks. Unlike the others, the bios task has attributes that are several tokens long.

Generality of binding ID vectors. We evaluate the generality of binding vectors across models and tasks. For a (model, task) pair, we compute the median-calibrated accuracy on the $n = 3$ context under three conditions: (1) the control condition, in which no interventions are performed, and the (2) entity and (3) attribute conditions, in which entity or attribute mean interventions (section 4.1) are performed. We use the mean interventions to permute binding pairs by a cyclic shift and measure accuracy according to this shift (see appendix F). As shown in Figure 6, the interventions induce the expected behavior on most tasks; moreover, their effectiveness increases with model scale, suggesting that larger models have more robust structured representations.
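The cyclic-shift intervention used for the $n = 3$ evaluation can be sketched under the additive model (all vectors are synthetic; the real experiments apply the analogous arithmetic to LM activations):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 16, 3
b = rng.normal(size=(n, d))              # binding vectors b(0), b(1), b(2) (toy)
deltas = b - b[0]                        # Delta(k) = b(k) - b(0)

# Bound representations: content plus binding vector, one per position k.
Z = np.stack([rng.normal(size=d) + b[k] for k in range(n)])
# Cyclic shift: move each representation's binding ID from k to (k+1) mod n
# by adding Delta((k+1) mod n) - Delta(k).
Z_shift = np.stack([Z[k] + deltas[(k + 1) % n] - deltas[k] for k in range(n)])
for k in range(n):
    # Representation k now carries binding vector b((k+1) mod n).
    assert np.allclose(Z_shift[k] - Z[k], b[(k + 1) % n] - b[k])
```

A cyclic shift (rather than a single swap) gives an unambiguous accuracy criterion for $n = 3$: every binding pair must move to its shifted target for the query to be counted correct.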
Table 2: The mean median-calibrated accuracy and mean log prob for mean interventions on $n = 3$ capitals using binding ID estimates from other tasks. Random chance has 0.33 mean accuracy.

| Source task   | capitals | parallel | shapes | fruits | bios  | Zeros | Random |
| Mean accuracy | 0.88     | 0.87     | 0.71   | 0.80   | 0.47  | 0.30  | 0.31   |
| Mean log prob | -1.01    | -1.07    | -1.18  | -1.21  | -1.64 | -1.86 | -2.15  |

Transfer across tasks. We next show that binding vectors often transfer across tasks. Without access to the binding vectors $[b_E(k), b_A(k)]$ themselves, we instead test whether the difference vectors $[\Delta^{\mathrm{src}}_E(k), \Delta^{\mathrm{src}}_A(k)]$ from a source task, when applied to a target task, result in valid binding IDs. To do so, we follow a procedure similar to section 4.2: first, we erase binding information by subtracting $[\Delta^{\mathrm{tar}}_E(k), \Delta^{\mathrm{tar}}_A(k)]$ from each target-task representation $[Z_{E_k}, Z_{A_k}]$, resulting in near-chance accuracy. Then, we add back $[\Delta^{\mathrm{src}}_E(k), \Delta^{\mathrm{src}}_A(k)]$ computed from the source task, with the hope of restoring performance. Table 2 shows results for a variety of source tasks when using capitals as the target task. Accuracy is consistently high, even when the source task has limited surface similarity to the target task. For example, the shapes task asks questions about colored shapes, and parallel lists all entities before any attributes instead of interleaving them as in capitals.
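Under additivity, the erase-then-restore transfer procedure reduces to two vector operations; a minimal sketch with synthetic stand-ins for the source- and target-task difference vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 16
delta_tar = rng.normal(size=d)   # Delta^tar(k): estimated on the target task
delta_src = rng.normal(size=d)   # Delta^src(k): estimated on a source task

content = rng.normal(size=d)             # attribute content (toy)
Z = content + delta_tar                  # target representation carrying binding ID k
Z_erased = Z - delta_tar                 # step 1: erase binding information
Z_transfer = Z_erased + delta_src        # step 2: install the source task's vector
# If binding vectors are task-agnostic, Z_transfer should behave like the original Z.
assert np.allclose(Z_erased, content)
```

The empirical question in Table 2 is whether the model treats `Z_transfer` as equivalent to `Z`, i.e. whether `delta_src` occupies the same functional role as `delta_tar`.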
We include two baselines for comparison: replacing $\Delta^{\mathrm{src}}(k)$ with the zero vector ("Zeros"), or picking a randomly oriented difference vector as in Table 1 ("Random"). Both lead to chance accuracy. See appendix D for task details. The fact that binding vectors transfer across tasks, together with the results from section 4, suggests that there could be a task-agnostic subspace in the model's activations reserved for binding vectors.

6 Related work

Causal mediation analysis. In recent years, causal methods have gained popularity in post hoc interpretability (Meng et al., 2022; Geiger et al., 2021). Instead of relying on correlations, which can surface spurious features (Hewitt & Liang, 2019), causal mediation analysis (Vig et al., 2020a) performs causal interventions on the internal states of LMs to understand their causal role in LM behavior. Our work shares the causal perspective adopted by many in this field.

Knowledge recall. A line of work studies the recall of factual associations that LMs learn from pretraining (Geva et al., 2020; Dai et al., 2021; Meng et al., 2022; Geva et al., 2023; Hernandez et al., 2023b). This is spiritually related to binding, as entities must be associated with facts about them. However, that work studies factual relations learned from pretraining and how they are recalled from model weights. In contrast, we study representations of relations learned from context, and how they are recalled from model activations. More recently, Hernandez et al. (2023a) found a method to construct bound representations by directly binding attribute representations to entity representations. In contrast, our work investigates bound representations constructed by the LM itself, and identifies that the binding ID mechanism (and not direct binding) is the mechanism that LM representations predominantly use. An avenue for future work is to study how the bound representations constructed by Hernandez et al. (2023a) relate to the direct binding mechanism we identified in appendix E.

Symbolic representations in connectionist systems. Many works have studied how neural networks represent symbolic concepts in activation space (Mikolov et al., 2013; Tenney et al., 2019; Belinkov & Glass, 2019; Rogers et al., 2021; Patel & Pavlick, 2021). To gain deeper insight into how these representations are used for reasoning, recent works have studied representations used for specialized reasoning tasks (Nanda et al., 2023; Li et al., 2022; 2021). Our work shares the motivation of uncovering how neural networks implement structured representations that enable reasoning.

Mechanistic interpretability. Mechanistic interpretability aims to uncover circuits (Elhage et al., 2021; Wang et al., 2022; Wu et al., 2023), often composed of attention heads, embedded in language models. In our work, we study language model internals at a more coarse-grained level, by identifying structures in representations that have causal influences on model behavior. Concurrent work by Prakash et al. (2024) complements ours by analyzing the circuits involved in the binding problem.

7 Conclusion

In this paper we identify and study the binding problem, a common and fundamental reasoning subproblem. We find that pretrained LMs can solve the binding task by binding entities and attributes to abstract binding IDs. We then identify that the binding IDs are vectors from a binding subspace with a notion of distance. Lastly, we find that the binding IDs are used broadly across a variety of binding tasks and are present in all sufficiently large models that we studied. Taking a broader view, we see our work as part of the endeavor to interpret LM reasoning by decomposing it into primitive skills. In this work we identified the binding skill, which is used in several settings and has a simple and robust representation structure.
An interesting direction for future work would be to identify other primitive skills that support general-purpose reasoning and have similarly interpretable mechanisms. Our work also suggests that ever-larger LMs may still have interpretable representations. A common intuition is that larger models are more complex, and hence more challenging to interpret. Our work provides a counterexample: as LMs become larger, their representations can become more structured and interpretable, since only the larger models exhibited binding IDs (Fig. 6). Speculating further, the fact that large enough models in two unrelated LM families learn the same structured representation strategy points to a convergence in representations with scale. Could there be an ultimate representation that these LMs are converging towards? Perhaps the properties of natural language corpora and LM inductive biases lead to certain core representation strategies that are invariant to changes in model hyperparameters or exact dataset composition. This would encouragingly imply that interpretability results can transfer across models: studying the core representations of any sufficiently large model would yield insights into other similarly large models.

Acknowledgments

We thank Danny Halawi, Fred Zhang, Erik Jenner, Cassidy Laidlaw, Shawn Im, Arthur Conmy, Shivam Singhal, and Olivia Watkins for their helpful feedback. JF was supported by the Long-Term Future Fund. JS was supported by the National Science Foundation under Grants No. 2031899 and 1804794. In addition, we thank Open Philanthropy for its support of both JS and the Center for Human-Compatible AI.

References

Belinkov & Glass (2019) Yonatan Belinkov and James Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019. doi: 10.1162/tacl_a_00254. URL https://aclanthology.org/Q19-1004.

Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.

Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Knowledge neurons in pretrained transformers. ArXiv, abs/2104.08696, 2021. URL https://api.semanticscholar.org/CorpusID:233296761.

De-Arteaga et al. (2019) Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 120–128, 2019.

Elhage et al. (2021) Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.

Fodor & Pylyshyn (1988) Jerry A Fodor and Zenon W Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71, 1988.

Geiger et al. (2021) Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021.

Geva et al. (2020) Mor Geva, R. Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. ArXiv, abs/2012.14913, 2020. URL https://api.semanticscholar.org/CorpusID:229923720.

Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.

Hendrycks et al. (2021) Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021.

Hernandez et al. (2021) Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2021.

Hernandez et al. (2023a) Evan Hernandez, Belinda Z Li, and Jacob Andreas. Measuring and manipulating knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023a.

Hernandez et al. (2023b) Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023b.

Hewitt & Liang (2019) John Hewitt and Percy Liang. Designing and interpreting probes with control tasks. arXiv preprint arXiv:1909.03368, 2019.

Kim & Schuster (2023) Najoung Kim and Sebastian Schuster. Entity tracking in language models. arXiv preprint arXiv:2305.02363, 2023.

Li et al. (2021) Belinda Z Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.

Li et al. (2022) Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.

Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
Mikolov et al. (2013) Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, 2013.

Mu & Andreas (2020) Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33:17153–17163, 2020.

Nanda et al. (2023) Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

Patel & Pavlick (2021) Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations, 2021.

Prakash et al. (2024) Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, and David Bau. Fine-tuning enhances existing mechanisms: A case study on entity tracking. arXiv preprint arXiv:2402.14811, 2024.

Räuker et al. (2023) Tilman Räuker, Anson Ho, Stephen Casper, and Dylan Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 464–483. IEEE, 2023.

Rogers et al. (2021) Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842–866, 2021.

Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170.

Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.

Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1905.05950.

Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Treisman (1996) Anne Treisman. The binding problem. Current Opinion in Neurobiology, 6(2):171–178, 1996.

Vig et al. (2020a) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020a.

Vig et al. (2020b) Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020b.

Wang et al. (2022) Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.

Wang et al. (2023) Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.

Wu et al. (2023) Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah D Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. arXiv preprint arXiv:2305.08809, 2023.

Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pp. 12697–12706. PMLR, 2021.

Appendix A Evaluation details

In all of our evaluations, we sample $N = 100$ instances of contexts from the binding task, obtaining $\{c_i\}_{i=1}^N$. For succinctness, we write $E_k^{(i)} := E_k(c_i)$ and $A_k^{(i)} := A_k(c_i)$. For the $i$-th context instance, we query $E_0^{(i)}$ and $E_1^{(i)}$, which return log probabilities $\Phi_{E_0}^{(i)}(t)$ and $\Phi_{E_1}^{(i)}(t)$ over tokens $t$ in the vocabulary. However, we consider only the log probabilities for the relevant attributes, $\Phi_{E_k}^{(i)}(A_0^{(i)})$ and $\Phi_{E_k}^{(i)}(A_1^{(i)})$. We then compute the summary statistics (described below) over the entire population of samples, yielding two scalars, $\sigma_{E_0}$ and $\sigma_{E_1}$, that describe the performance under each query entity.
• The mean log prob is given by $\sigma_{E_k} = \frac{1}{N} \sum_{i=1}^N \Phi_{E_k}^{(i)}(A_k^{(i)})$.

• The top-1 accuracy is $\sigma_{E_k} = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\left[k = \arg\max_l \Phi_{E_k}^{(i)}(A_l^{(i)})\right]$.

• We adopt the median-calibrated accuracy from Zhao et al. (2021). First, we obtain a baseline by computing a median for every attribute, $m(A_l) := \mathrm{median}_{i,k}\,\{\Phi_{E_k}^{(i)}(A_l^{(i)})\}$. Then, we compute calibrated log probs $\tilde{\Phi}_{E_k}^{(i)}(A_l) := \Phi_{E_k}^{(i)}(A_l) - m(A_l)$. The median-calibrated accuracy is then $\sigma_{E_k} = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\left[k = \arg\max_l \tilde{\Phi}_{E_k}^{(i)}(A_l^{(i)})\right]$.
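The median-calibrated accuracy can be implemented directly; a sketch in numpy (the array layout is our own convention for this illustration, not from the paper's code):

```python
import numpy as np

def median_calibrated_accuracy(log_probs: np.ndarray) -> np.ndarray:
    """log_probs[i, k, l] holds Phi^{(i)}_{E_k}(A_l^{(i)}): the log prob of
    attribute l when querying entity k on context instance i.
    Returns the accuracy per query entity k."""
    # Per-attribute median baseline m(A_l), taken over instances i and queries k.
    m = np.median(log_probs, axis=(0, 1))           # shape (n,)
    calibrated = log_probs - m                      # calibrated log probs
    preds = np.argmax(calibrated, axis=2)           # predicted attribute per (i, k)
    n = log_probs.shape[1]
    return np.mean(preds == np.arange(n), axis=0)   # fraction with prediction l = k

# Toy check: log probs that always favour the correct attribute.
lp = np.full((4, 2, 2), -3.0)
lp[:, 0, 0] = lp[:, 1, 1] = -1.0
assert np.allclose(median_calibrated_accuracy(lp), [1.0, 1.0])
```

Subtracting the per-attribute median removes any constant preference the model has for particular attribute tokens, which is the same reason this metric is robust to the position-dependent bias discussed in appendix C.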
Zhao et al. (2021) discuss motivations for the median-calibrated accuracy. In our case, the position-dependent bias provides an additional reason, which we discuss in appendix C.

Appendix B Necessity of the Binding ID mechanism

In this section, we provide one definition of the binding ID mechanism, and argue informally that under this definition, factorizability and position independence necessarily imply the binding ID mechanism. First, let us define the binding ID mechanism. Fix $n = 2$ for simplicity. There are two claims:

1. Representation. There exists a binding function $\Gamma_E$ such that for any context $c$, $Z_{E_k}$ is represented by $\Gamma_E(E_k, k)$, in the sense that for any $e \in \mathcal{E}$, the substitution $Z_{E_k} \to \Gamma_E(e, k)$ leads to the belief $c\,/.\,\{E_k \to e\}$. Likewise, there exists a binding function $\Gamma_A$ such that for any context $c$, $Z_{A_k}$ is represented by $\Gamma_A(A_k, k)$, in the sense that for any $a \in \mathcal{A}$, the substitution $Z_{A_k} \to \Gamma_A(a, k)$ leads to the belief $c\,/.\,\{A_k \to a\}$. These substitutions should also compose appropriately.

2. Query.
Further, the binding functions $\Gamma_A$ and $\Gamma_E$ satisfy the following property: choose any two permutations $\pi_E(k)$ and $\pi_A(k)$ over $\{0, 1\}$, and consider a $Z_{\text{context}}$ containing $[\Gamma_E(e_0, \pi_E(0)), \Gamma_A(a_0, \pi_A(0)), \Gamma_E(e_1, \pi_E(1)), \Gamma_A(a_1, \pi_A(1))]$. The query system will then believe $e_0 \leftrightarrow a_0,\, e_1 \leftrightarrow a_1$ if $\pi_E = \pi_A$, and $e_0 \leftrightarrow a_1,\, e_1 \leftrightarrow a_0$ otherwise.

The first claim follows from factorizability. From factorizability, we can construct the candidate binding functions simply by picking an arbitrary context consistent with the parameters. For any $e \in \mathcal{E}$ and any binding ID $k \in [0, n-1]$, pick any context $c$ such that $E_k(c) = e$. Then, let $\Gamma_E(e, k) = Z_{E_k}(c)$. $\Gamma_A$ can be constructed similarly. Our factorizability results show that the binding functions constructed this way satisfy the Representation claim.

The second claim follows from Representation and position independence. Pick an arbitrary context $c$ to generate $Z_{\text{context}}$.
Then, by factorizability we can make the substitutions $Z_{E_k} \to \Gamma_E(e_{\pi_E^{-1}(k)}, k)$ and $Z_{A_k} \to \Gamma_A(a_{\pi_A^{-1}(k)}, k)$, to obtain

$$[\Gamma_E(e_{\pi_E^{-1}(0)}, 0),\ \Gamma_A(a_{\pi_A^{-1}(0)}, 0),\ \Gamma_E(e_{\pi_E^{-1}(1)}, 1),\ \Gamma_A(a_{\pi_A^{-1}(1)}, 1)].$$

Because of factorizability, the model believes $e_0 \leftrightarrow a_0,\ e_1 \leftrightarrow a_1$ if $\pi_E = \pi_A$, and $e_0 \leftrightarrow a_1,\ e_1 \leftrightarrow a_0$ otherwise.
Now, position independence lets us freely permute $\{Z_{E_0}, Z_{E_1}\}$ and $\{Z_{A_0}, Z_{A_1}\}$ without changing beliefs, which achieves the context $[\Gamma_E(e_0, \pi_E(0)), \Gamma_A(a_0, \pi_A(0)), \Gamma_E(e_1, \pi_E(1)), \Gamma_A(a_1, \pi_A(1))]$ with the desired beliefs.

Appendix C Details for Position Independence

RoPE Intervention. In Fig. 2a, the context activations $Z_{\text{context}}$ are drawn in a line, suggesting a linear form: $Z_{\text{context}} = [\ldots, Z_{E_0}, \ldots, Z_{A_0}, \ldots, Z_{E_1}, \ldots, Z_{A_1}, \ldots]$. We can equivalently think of $Z_{\text{context}}$ as a set of pairs: $Z_{\text{context}} = \{(p, Z_p) \mid p \text{ is an index for a context token}\}$.
LMs that use Rotary Position Embedding (RoPE) (Su et al., 2021), such as those in the LLaMA and Pythia families, have architectures that allow arbitrary interventions on the apparent position of an activation, $(p, Z_p) \to (p', Z_p)$, even if this results in overall context activations that cannot be written down as a list of activations. This is because position information is applied at every layer, and not injected into the residual stream as in absolute position embeddings. Specifically, equation 16 in Su et al. (2021) provides the definition of RoPE (recreated verbatim as follows):

$$q_m^\intercal k_n = (R^d_{\Theta,m} W_q x_m)^\intercal (R^d_{\Theta,n} W_k x_n) \qquad (3)$$

Then, making the intervention $R^d_{\Theta,n} \to R^d_{\Theta,n^*}$ changes the apparent position of the activations at position $n$ to the position $n^*$.

Is the position-dependent bias just a bias? For the purposes of determining whether position encodes binding, the fact that the LM does not substantially change its beliefs when we switch the positions of the attribute activations (or the entity activations) suggests that position can only play a limited role. However, calling the position dependency of attributes a "bias" implies that it is an artifact that we should correct for. To what extent is this true? The case for regarding it as a bias is two-fold. First, as discussed by Su et al. (2021), RoPE exhibits long-term position decay, which systematically lowers the attention paid to activations that are further away from the query (i.e., earlier in the context).
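As an aside, the apparent-position intervention described above can be illustrated with a minimal NumPy sketch. This is not the paper's code: the head dimension, the base, and the half-split pairing convention (used by GPT-NeoX-style models such as Pythia) are illustrative assumptions, and the query/key projections $W_q, W_k$ are folded into the vectors.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation R^d_{Theta,pos} to a vector x of even dim d."""
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) * 2.0 / d)  # per-pair frequencies
    angles = pos * theta
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each 2D pair (x1_i, x2_i) by angle pos * theta_i.
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)

# The attention logit q_m^T k_n depends only on the relative offset m - n ...
score = rope_rotate(q, 7) @ rope_rotate(k, 3)
assert np.isclose(score, rope_rotate(q, 9) @ rope_rotate(k, 5))

# ... so overwriting the key's rotation angle moves its apparent position:
intervened = rope_rotate(q, 7) @ rope_rotate(k, 5)  # key now "appears" at n* = 5
```

Because the rotation acts per layer inside the attention dot product, this intervention needs no change to the residual stream, matching the discussion above.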
Plausibly, at some point when computing the query mechanism, the LM has to decide whether to pay attention to the first or the second attribute, and the presence of long-term position decay can bias this decision, leading to the position-dependent bias we see in the final prediction. The second reason is that there are systematic and unbiased ways of calibrating the LM to recover the correct answer, in spite of the position-dependent bias. We discuss two strategies. Because the position-dependent bias modifies the log probs for $A_0$ (or $A_1$) regardless of which entity is being queried, we can estimate this effect by averaging the log probs for $A_0$ (or $A_1$) over both queries $E_0$ and $E_1$. Then, when making a prediction, we can subtract this average from the log probs for $A_0$ (or $A_1$). This corresponds to the median calibrated accuracy metric discussed earlier. The second procedure to mitigate the position-dependent bias is an intervention that sets all attribute activations to the same position, which limits the amount of bias position dependency can introduce. These procedures do not require foreknowledge of the ground-truth predicates, and hence do not leak knowledge into the prediction process: if the calibrated LM answers queries correctly, the information must have come from the context activations and not from the calibration process. Nonetheless, there are features of the position-dependent bias that could be interesting to study. For example, we might hope to predict the magnitude of the position-dependent bias from RoPE's parameters. However, such an investigation will most likely involve a deeper mechanistic understanding of the query system, which we leave as future work.
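The first calibration strategy can be sketched in a few lines of NumPy. The numbers are made up for illustration, not from the paper; note that with only two queries, the mean and the median of the per-attribute log probs coincide.

```python
import numpy as np

def calibrated_prediction(logprobs):
    """logprobs[q][a] = log p(attribute a | query entity E_q), shape (2, 2).

    Estimate the query-independent (position-dependent) bias on each
    attribute by averaging its log prob over both queries, then subtract
    it before taking the argmax.
    """
    logprobs = np.asarray(logprobs, dtype=float)
    bias = logprobs.mean(axis=0)          # per-attribute bias estimate
    return (logprobs - bias).argmax(axis=1)

# Hypothetical example: a bias of +2.0 on A_0 makes the raw argmax
# answer A_0 for both queries, but calibration recovers the truth.
raw = np.array([[ 1.0 + 2.0, 0.0],        # querying E_0 (true answer A_0)
                [-1.0 + 2.0, 0.5]])       # querying E_1 (true answer A_1)
assert raw.argmax(axis=1).tolist() == [0, 0]          # biased: A_0 twice
assert calibrated_prediction(raw).tolist() == [0, 1]  # calibrated: correct
```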
Appendix D Binding Task Details

D.1 capitals

Construct a list of one-token names and a list of country-capital pairs that are also each one token wide. Then, apply the following template:

    Answer the question based on the context below. Keep the answer short.
    Context: E_0 lives in the capital city of A_0. E_1 lives in the capital city of A_1.
    Question: Which city does qn_subject live in?
    Answer: qn_subject lives in the city of

The LM is expected to answer with the capital of the country that is bound to the queried entity. Note that the LM must simultaneously solve the factual recall task of looking up the capital city of a country.

D.2 parallel

The parallel task uses the same country-capital setup, but with the prompt template:

    Answer the question based on the context below. Keep the answer short.
    Context: E_0 and E_1 live in the capital cities of A_0 and A_1 respectively.
    Question: Which city does qn_subject live in?
    Answer: qn_subject lives in the city of

This prompt format breaks the confounder in the capitals task that entities always appear in the same sentence as attributes, suggesting that binding ID is not merely a syntactic property.

D.3 fruits

The fruits task uses the same set of names, but for attributes it uses a set of common fruits and foods that are one token wide. The prompt format is:

    Answer the question based on the context below. Keep the answer short.
    Context: E_0 likes eating the A_0. E_1 likes eating the A_1.
    Question: What food does qn_subject like?
    Answer: qn_subject likes the

D.4 shapes

The shapes task has entities that are one-token colors, and attributes that are one-token shapes. The prompt looks like:

    Answer the question based on the context below. Keep the answer short.
    Context: The A_0 is E_0. The A_1 is E_1.
    Question: Which shape is colored qn_subject?
    Answer: The qn_subject shape is

This task inverts the assumption that entities have to be nouns and attributes adjectives.
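For concreteness, here is a minimal sketch of how such prompts might be assembled. The exact whitespace and the sample names/countries are assumptions for illustration, not the paper's actual one-token lists.

```python
# Hypothetical filler for the capitals template; newline placement and the
# example names/countries are illustrative assumptions.
CAPITALS_TEMPLATE = (
    "Answer the question based on the context below. Keep the answer short.\n"
    "Context: {e0} lives in the capital city of {a0}. "
    "{e1} lives in the capital city of {a1}.\n"
    "Question: Which city does {qn} live in?\n"
    "Answer: {qn} lives in the city of"
)

def capitals_prompt(e0, a0, e1, a1, qn):
    assert qn in (e0, e1), "query subject must be one of the two entities"
    return CAPITALS_TEMPLATE.format(e0=e0, a0=a0, e1=e1, a1=a1, qn=qn)

prompt = capitals_prompt("Alice", "France", "Bob", "Spain", qn="Bob")
```

The other tasks differ only in the template string, so the same filler pattern applies.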
D.5 Bios

This task is adapted from the Bias in Bios dataset (De-Arteaga et al., 2019), with a prompt format following Hernandez et al. (2023a). The entities are the set of one-token names, and the attributes are a set of biography descriptions obtained using the procedure from Hernandez et al. (2023a). The LM is expected to infer the occupation from this description. This time, the attributes are typically one sentence long, and are no longer one token wide. We thus do not expect the mean interventions for attributes to work, although we may still expect entity interventions to work. Just inferring the correct occupation is also a much more challenging task than the other synthetic tasks. The prompt format is:

    Answer the question based on the context below. Keep the answer short.
    Context: About E_0: A_0 About E_1: A_1
    Question: What occupation does qn_subject have?
    Answer: qn_subject has the occupation of

Appendix E MCQ Task

Figure 7: Direct binding in the MCQ task. $O_k$ and $L_k$ denote options and labels respectively. Under direct binding, $Z_{O_0}$ and $Z_{O_1}$ are represented by a binding function $\Lambda_O$ that directly binds option and label together, whereas $Z_{L_0}$ and $Z_{L_1}$ are causally irrelevant.

Direct binding in MCQ. While binding IDs are used for many tasks, they are not universal. We briefly identify an alternate binding mechanism, the direct binding mechanism, which is used for a multiple-choice question-answering (MCQ) task. In MCQ, each label (A or B) has to be bound to its associated option text. In this task, instead of binding variables to an abstract binding ID, the model directly binds the label to the option (Fig. 7).
Multiple choice questions (MCQs) can be formulated as a binding task if we put the options before the question. This forces the LM to represent the binding between label and option text before it sees the question. We study the SST-2 task (Socher et al., 2013), a binary sentiment classification task on movie reviews (either positive or negative). The attributes are single-letter labels from A to E, and the entities are "Positive" and "Negative". The prompt is as follows:

    Classify the review using the following options:
    A_0: E_0
    A_1: E_1
    Review: question
    Answer:

Then, when prompted with a question with a certain sentiment, the LM is expected to retrieve its corresponding label.

E.1 Experiments

It turns out that the reversed MCQ format is too far out of distribution for LLaMA-30b to solve. However, we find that the instruction-finetuned tulu-13b model (Wang et al., 2023) is able to solve this task. We find that the activations for this task are not factorizable in the same way. Consider the target context:

    C: Negative
    A: Positive

and the source context:

    A: Negative
    C: Positive

We denote the labels as $L_0$ and $L_1$, so that $L_0$ is C in the target context and A in the source context. We denote the option texts as $O_0$ and $O_1$. We perform an experiment where we intervene by copying over a suffix of every line from the source context into the target context, and plot the accuracy based on whether the intervention successfully changes the belief (Fig. 8). The rightmost point of the plot is the control condition where no interventions are made. The accuracy is near zero because the model still believes in the original context. At the leftmost point, we intervene on the entire statement, which is a substitution of the entire $Z_{\text{context}}$. Thus, we observe near-perfect accuracy.
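The suffix-copy sweep can be modeled schematically as follows. This is a toy with strings standing in for activation vectors, and the tokenization shown is an assumption for illustration.

```python
# Toy model of the suffix-copy sweep: each context is a list of lines,
# each line a list of per-token "activations" (strings here). For suffix
# length s, the activations at the last s tokens of every line are
# copied from the source context into the target context.
def suffix_copy(target, source, s):
    out = []
    for t_line, s_line in zip(target, source):
        assert len(t_line) == len(s_line)
        keep = len(t_line) - s
        out.append(t_line[:keep] + s_line[keep:])
    return out

# Assumed tokenization of "C: Negative / A: Positive" and its swap.
target = [["C", ":", " Neg", "ative", " "], ["A", ":", " Pos", "itive", " "]]
source = [["A", ":", " Neg", "ative", " "], ["C", ":", " Pos", "itive", " "]]

# s = 0 is the control (no intervention); s = 5 substitutes everything.
assert suffix_copy(target, source, 0) == target
assert suffix_copy(target, source, 5) == source
# s = 2 copies just "ative"/"itive" and the trailing whitespace: token
# positions whose surface form is identical in source and target.
assert suffix_copy(target, source, 2)[0][-2:] == ["ative", " "]
```

At the string level an s = 2 copy is a no-op; the experiment's point is that copying the *activations* at those positions still flips the belief.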
Interestingly, copying over the activations for the tokens corresponding to "ative" and the whitespace following it suffices to almost completely change the belief, despite the surface token form being identical at those two positions ("ative ⟨WS⟩" for both source and target contexts). This suggests that those activations capture binding information containing both the label and the option text, and leads to the conclusion that binding information is bound directly at those activations, instead of indirectly via binding IDs. In contrast, binding ID would have predicted that substituting these two tokens makes no difference, because the option activations $Z_O$ should contain only information about the option text and the binding ID, which is identical for our choice of source and target contexts.

Figure 8: Substitutions for MCQ option suffix

Appendix F Generality details

Suppose $\pi$ is a cyclic shift, say $\pi(0) = 1, \pi(1) = 2, \pi(2) = 0$. Then, we can perform mean interventions based on the cyclic shift on entities as follows:

$$Z_{E_k} \to Z_{E_k} + b_E(\pi(k)) - b_E(k) = Z_{E_k} + \Delta_E(\pi(k)) - \Delta_E(k).$$

We then expect the belief to follow the same shift, so that the LM believes $E_k \leftrightarrow A_{\pi(k)}$.
Similarly, we can perform mean interventions on attributes as follows:

$$Z_{A_k} \to Z_{A_k} + b_A(\pi(k)) - b_A(k) = Z_{A_k} + \Delta_A(\pi(k)) - \Delta_A(k).$$

However, this time we expect the belief to follow the inverse shift, i.e., $E_k \leftrightarrow A_{\pi^{-1}(k)}$, which is the same as $E_{\pi(k)} \leftrightarrow A_k$. As usual, we sample $\Delta$ using 500 samples. We perform the intervention using both cyclic shifts over 3 elements (i.e., $\pi$ and $\pi^{-1}$), and report the mean results over these two shifts.

Appendix G Geometry details

An experimental challenge we face is that we do not have access to the binding ID vectors $b_A, b_E$ themselves, only differences between them, $\Delta_A, \Delta_E$. For clarity of exposition, we first describe the procedure we would perform if we had access to the binding ID vectors, before describing the actual experiment. In the ideal case, we would obtain two pairs of binding ID vectors, $[b_E(0), b_A(0)]$ and $[b_E(1), b_A(1)]$.
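Under an additive model $Z_k = \text{content}_k + b(k)$ with $\Delta(k) = b(k) - b(0)$, the cyclic-shift intervention above can be sanity-checked numerically. The sketch below uses random stand-in vectors, not model activations.

```python
import numpy as np

def cyclic_shift_intervention(Z, delta, pi):
    """Apply Z_k -> Z_k + Delta(pi(k)) - Delta(k) for each binding ID k."""
    return np.stack([Z[k] + delta[pi[k]] - delta[k] for k in range(len(Z))])

rng = np.random.default_rng(1)
b = rng.normal(size=(3, 8))        # hypothetical binding ID vectors b(k)
content = rng.normal(size=(3, 8))  # entity-specific content
Z = content + b                    # toy additive activations
delta = b - b[0]                   # Delta(k) = b(k) - b(0)
pi = [1, 2, 0]                     # cyclic shift pi(0)=1, pi(1)=2, pi(2)=0

Z_new = cyclic_shift_intervention(Z, delta, pi)
# Each position now carries binding ID pi(k) instead of k:
for k in range(3):
    assert np.allclose(Z_new[k], content[k] + b[pi[k]])
```

The same function applies to attribute activations; only the expected belief shift differs, as described above.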
Then, we can construct two linear combinations of these binding ID vectors as candidate binding IDs, $[v_{E_0}, v_{A_0}]$ and $[v_{E_1}, v_{A_1}]$. Now, we can take an $n = 2$ context $c$ and intervene on each of $Z_{E_0}, Z_{A_0}, Z_{E_1}, Z_{A_1}$ to change their binding IDs to our candidate binding IDs. If the model retains its beliefs, then we infer that the candidate binding IDs are valid.

There are two main problems with this procedure. The first is that we only have access to $\Delta_A$ and $\Delta_E$, and not $b_E, b_A$. Instead of choosing $[b_E(0), b_A(0)], [b_E(1), b_A(1)]$ as our basis vectors, we can use contexts with $n = 3$ to obtain $[\Delta_E(1), \Delta_A(1)], [\Delta_E(2), \Delta_A(2)]$. These new basis vectors are still linear combinations of binding IDs, and if binding ID vectors do form a subspace, these would be part of the subspace too. The second problem is that we cannot arbitrarily set the binding ID vector of an activation to another binding ID vector. Instead, we can only add vectors to activations. We thus perform two sets of interventions.
We first perform the mean interventions on the second binding ID pair to turn $[b_E(1), b_A(1)]$ into $[b_E(0), b_A(0)]$. At this point, the LM sees two entities with the same binding ID and two attributes with the same binding ID, and is confused. Then, we can add candidate binding ID offsets to these activations. More precisely, let $\eta, \nu$ be coefficients for the linear combinations of the basis vectors. Define $h_A(\eta, \nu) = \eta \Delta_A(1) + \nu \Delta_A(2)$ and $h_E(\eta, \nu) = \eta \Delta_E(1) + \nu \Delta_E(2)$ as the candidate binding ID offsets. Then, we add $[h_E(\eta_0, \nu_0), h_A(\eta_0, \nu_0)]$ and $[h_E(\eta_1, \nu_1), h_A(\eta_1, \nu_1)]$ to the respective two pairs of binding IDs, and evaluate whether the model has regained its beliefs.
Concretely, the intervention we apply is parameterized by $(\eta_0, \nu_0, \eta_1, \nu_1)$ and is as follows:

$$Z_{A_0} \to Z_{A_0} - \Delta_A(0) + h_A(\eta_0, \nu_0), \quad Z_{E_0} \to Z_{E_0} - \Delta_E(0) + h_E(\eta_0, \nu_0),$$
$$Z_{A_1} \to Z_{A_1} - \Delta_A(1) + h_A(\eta_1, \nu_1), \quad Z_{E_1} \to Z_{E_1} - \Delta_E(1) + h_E(\eta_1, \nu_1).$$

We are now interested in the question: if we have coefficients $(\eta_0, \nu_0)$ and $(\eta_1, \nu_1)$, are the binding vectors constructed from those coefficients valid binding IDs? In our experiments (Fig. 5), we fix the values of $\eta_0$ and $\nu_0$ at varying positions (green circles), and vary $\eta_1$ and $\nu_1$. We plot the mean median-calibrated accuracy. We find that near the green circle, the model is completely confused, responding with near-chance accuracy. This verifies that the erasure step works as intended.
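A toy NumPy sketch of this erase-then-add intervention under the same additive model (random stand-in vectors, not actual activations; taking $\Delta(0) = 0$ and $\Delta(k)$ measured relative to binding ID 0 is an assumption of the sketch):

```python
import numpy as np

def h(eta, nu, d1, d2):
    """Candidate binding ID offset h(eta, nu) = eta*Delta(1) + nu*Delta(2)."""
    return eta * d1 + nu * d2

def geometry_intervention(Z, delta, coeffs, d1, d2):
    """Z_k -> Z_k - Delta(k) + h(eta_k, nu_k), for k = 0, 1."""
    return np.stack([Z[k] - delta[k] + h(*coeffs[k], d1, d2)
                     for k in range(2)])

rng = np.random.default_rng(2)
d1, d2 = rng.normal(size=(2, 8))     # basis vectors Delta(1), Delta(2)
base = rng.normal(size=8)            # shared b(0) component
content = rng.normal(size=(2, 8))    # attribute-specific content
delta = np.stack([np.zeros(8), d1])  # Delta(0) = 0, Delta(1) = d1
Z = content + base + delta           # toy activations with binding offsets

coeffs = [(0.0, 0.0), (1.5, -0.5)]   # (eta_k, nu_k) for the two pairs
Z_new = geometry_intervention(Z, delta, coeffs, d1, d2)
# Each pair's old offset is erased and replaced by the candidate offset:
for k in range(2):
    assert np.allclose(Z_new[k], content[k] + base + h(*coeffs[k], d1, d2))
```

The experiment then asks whether the LM treats two such candidate offsets as distinguishable binding IDs as a function of the coefficients.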
In addition, we find that there appears to be a binding metric subspace, in that as long as candidate binding IDs are sufficiently far apart, the LM recovers its ability to distinguish between the two, even when the candidate binding IDs lie outside the convex hull of the three binding IDs used to generate the basis vectors.

Appendix H One Hop Experiment

One sign that the binding ID mechanism correctly captures semantic binding information is that the LM is able to reason with representations modified according to the binding ID theory. To some extent, the capitals task already requires a small reasoning step: in the context the LM is given that "Alice lives in the capital city of France", and is asked to answer "Paris". This means that the binding mechanism that binds "Alice" to "France" has to create representations robust enough to support the inference step from "France" to "Paris". To further push on reasoning, we introduce the onehop task, an augmented version of capitals. The context remains the same as in the capitals task, i.e., we provide a list of people and where they live. However, the LM has to apply an additional reasoning step to answer the question. An example context and question is below:

    Answer the question based on the context below. Keep the answer short.
    Context: Elizabeth lives in the capital city of France. Ryan lives in the capital city of Scotland.
    Question: The person living in Vienna likes rose. The person living in Edinburgh likes rust. The person living in Tokyo likes orange. The person living in Paris likes tan. What color does Elizabeth like?
    Answer: Elizabeth likes

In the onehop task, the LM has to perform two inference steps based on the binding information in the context. The first is to infer that the capital city of France is Paris. The second is to infer, from the additional information in the question, that the person living in Paris likes tan, and output tan as the correct answer.
This is a more challenging task than our other tasks, and we thus present results on LLaMA-65b instead of LLaMA-30b. Overall, we find that all of our results still hold. We show results for factorizability (Fig. 9), position independence (Fig. 10), and mean interventions (Table 3).

(a) Swapping entity/attribute for $(E_0, A_0)$ (b) Swapping entity/attribute for $(E_1, A_1)$

Figure 9: Factorizability results for onehop

Figure 10: Position independence for onehop. Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. Different from Fig. 4, we only compute the local neighborhood around the control and swapped conditions.

Table 3: Mean intervention results for onehop

    Test condition   Control  Attribute  Entity  Both
    Querying E_0     0.73     0.25       0.24    0.71
    Querying E_1     0.79     0.28       0.26    0.77

Appendix I Three-term binding

In all of our tasks, we studied binding between two terms: binding an entity to an attribute. Here, we extend our results to three-term binding. An example context looks like:

    Answer the question based on the context below. Keep the answer short.
    Context: Carol from Italy likes arts. Samuel from Italy likes swimming. Carol from Japan likes hunting. Samuel from Japan likes sketching.
    Question: What does Carol from Italy like?
    Answer: Carol from Italy likes

In general, each statement in the context binds three terms together: a name, a country, and a hobby. We can query any two of the three terms, and ask the language model to retrieve the third. The above example shows how we query for the hobby, given the name and the country. We query for country and name by asking instead:

    Which country is Carol who likes hunting from?
    Who from Italy likes hunting?

We extend our analysis to three-term binding in the following way.
Of the three attribute classes, namely names, countries, and hobbies, choose one to be the fixed attribute, one to be the query attribute, and one to be the answer attribute. Altogether, there are $3! = 6$ possible assignments. For every such assignment, we can perform the same set of analyses as before. To illustrate, suppose we choose country as the fixed attribute, name as the query attribute, and hobby as the answer attribute. An example prompt for this assignment will look like:

    Answer the question based on the context below. Keep the answer short.
    Context: Carol from Italy likes arts. Samuel from Italy likes swimming.
    Question: What does Carol from Italy like?
    Answer: Carol from Italy likes

We then report the median-calibrated accuracy for the mean interventions under all 6 assignments (Table 4). The accuracy is better than for capitals (Table 1) because capitals requires inferring the capital city from the country, whereas here the answer attribute only needs to be looked up and copied from the context.

Table 4: Mean intervention results for three-term binding. The intervened model performs near perfectly for most test conditions.

    Fixed    Query    Answer   Test condition  Control  Attribute  Entity  Both
    name     country  hobby    Query 0         1.00     0.00       0.00    1.00
    name     country  hobby    Query 1         1.00     0.00       0.00    1.00
    name     hobby    country  Query 0         1.00     0.00       0.00    1.00
    name     hobby    country  Query 1         1.00     0.00       0.00    1.00
    country  name     hobby    Query 0         1.00     0.01       0.00    1.00
    country  name     hobby    Query 1         1.00     0.00       0.00    1.00
    country  hobby    name     Query 0         1.00     0.02       0.01    0.99
    country  hobby    name     Query 1         1.00     0.03       0.01    0.99
    hobby    name     country  Query 0         1.00     0.00       0.00    1.00
    hobby    name     country  Query 1         1.00     0.00       0.00    1.00
    hobby    country  name     Query 0         1.00     0.00       0.00    1.00
    hobby    country  name     Query 1         1.00     0.00       0.00    1.00

Appendix J Additional factorizability and position independence plots

This section contains the experiments for factorizability (Fig. 3) and position independence (Fig. 4) reproduced for the other binding tasks, namely parallel (Fig.
11, 15), shapes (Fig. 12, 16), fruits (Fig. 13, 17), and bios (Fig. 14, 18).

(a) Swapping entity/attribute for $(E_0, A_0)$ (b) Swapping entity/attribute for $(E_1, A_1)$

Figure 11: Factorizability results for parallel

(a) Swapping entity/attribute for $(E_0, A_0)$ (b) Swapping entity/attribute for $(E_1, A_1)$

Figure 12: Factorizability results for shapes

(a) Swapping entity/attribute for $(E_0, A_0)$ (b) Swapping entity/attribute for $(E_1, A_1)$

Figure 13: Factorizability results for fruits

(a) Swapping entity/attribute for $(E_0, A_0)$ (b) Swapping entity/attribute for $(E_1, A_1)$

Figure 14: Factorizability results for bios

Notice that for bios (Fig. 14), entity factorizability works, but not attribute factorizability. This is because the attribute information is represented by many tokens, while the attribute factorizability test only substitutes the first token of the attribute representation.

Figure 15: Position independence for parallel. Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. Different from Fig. 4, we only compute the local neighborhood around the control and swapped conditions.

Figure 16: Position independence for shapes. Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. Different from Fig. 4, we only compute the local neighborhood around the control and swapped conditions.

Figure 17: Position independence for fruits. Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. Different from Fig. 4, we only compute the local neighborhood around the control and swapped conditions.

Figure 18: Position independence for bios. Top: Mean log probs for entity interventions. Bottom: Mean log probs for attributes. Different from Fig. 4, we only compute the local neighborhood around the control and swapped conditions.