
Paper deep dive

Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 71

Models: Llama 3.1-8B Instruct, Qwen 3-4B Instruct

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/11/2026, 12:38:02 AM

Summary

The paper introduces a 'contrastive covariance' method to decompose the Query-Key (QK) space in Transformer attention heads into low-rank, human-interpretable subspaces. By analyzing the difference between positive and negative covariance terms, the authors isolate specific latent features that drive attention scores. The method is validated on toy models and applied to large language models (Llama 3.1, Qwen 3) to identify subspaces for categorical semantic and binding features, providing a mechanism to attribute attention logits to specific feature components.

Entities (5)

Contrastive Covariance · method · 99%
Llama-3.1-8B-Instruct · model · 98%
Qwen 3-4B Instruct · model · 98%
QK Space · concept · 95%
Superposition · phenomenon · 92%

Relation Signals (3)

Contrastive Covariance applied to Llama-3.1-8B-Instruct

confidence 98% · Next, we apply our method to Llama 3.1-8B Instruct

Contrastive Covariance applied to Qwen 3-4B Instruct

confidence 98% · Next, we apply our method to... Qwen 3-4B Instruct

Contrastive Covariance decomposes QK Space

confidence 95% · We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components.

Cypher Suggestions (2)

Find all models analyzed using the contrastive covariance method. · confidence 95% · unvalidated

MATCH (m:Model)-[:APPLIED_TO]-(method:Method {name: 'Contrastive Covariance'}) RETURN m.name

Identify concepts related to the QK space decomposition. · confidence 90% · unvalidated

MATCH (c:Concept {name: 'QK Space'})<-[:DECOMPOSES]-(m:Method) RETURN m.name

Abstract

Abstract: Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

71,052 characters extracted from source content.


Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg

Abstract

Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space – the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.

1 Introduction

Attention is at the heart of Transformers, yet we struggle to answer "why did the model attend to this token?" Attention heads produce key and query vectors for each token, and their dot product determines an attention score. However, these dot products return a single scalar value, concealing how the two tokens interact. To understand this, we instead study the QK space – the bilinear joint embedding space between queries and keys. Understanding the structure of QK spaces reveals how queries and keys interact. We demonstrate a simple way to decompose a QK space into interpretable low-rank features. As we will see, it is when these features in keys and queries align that high attention scores are produced.

Figure 1: Contrastive covariance method schema. We define positive and negative covariance terms between queries and keys, each capturing the presence (or absence) of a feature. The resulting contrastive covariance term isolates the feature in QK space.
Our method relies on the covariance of keys and queries. We define positive and negative covariance terms between keys and queries, each of which corresponds to the presence (or absence) of a feature of interest, while holding all other factors constant. Their difference, i.e., the contrastive covariance, isolates the subspace of a feature: see Figure 1. Our method allows us to 1) recover the rank of features in QK space, and 2) recover the subspaces in which features lie, in both the query and key spaces. To show this, we design a task in which queries and keys are constructed from known latent features with varying degrees of freedom. We show analytically that, in our setting, our method recovers the correct ranks and subspaces of the latent features in query and key spaces. We then empirically verify our method by training attention heads and conducting causal interventions in our recovered QK subspaces. We also use our setup to study superposition (Elhage et al., 2022) in QK space and the limitations of our method. Next, we apply our method to Llama 3.1-8B Instruct (Grattafiori et al., 2024) and Qwen 3-4B Instruct (Yang et al., 2025) to find interpretable, low-rank QK subspaces. We study two examples where attention plays a central role: categorical semantic features in Filter Heads (Sharma et al., 2025) and binding features (Gur-Arieh et al., 2025). While prior works demonstrate such mechanisms, we localize the subspaces in which they are encoded. Finally, we show how attention logits (attention scores prior to softmax) can be attributed to the QK features that we identify. This follows naturally from the logits being linear in query space: decomposing the query space directly allows us to decompose the logit space. Put differently, we can identify how much each feature component contributes to the final attention logits, and also how much of the logits is left unexplained.
In summary, we demonstrate a simple method to decompose QK spaces to study their inner structures.

2 Toy Model for QK Decomposition

To motivate our study of QK feature decompositions, we design a simple payload retrieval task. We use italic letters ($a, B$) for scalars, bold lowercase ($q, k$) for vectors, and bold uppercase ($W$) for matrices. For a brief review of attention heads, see Appendix A.

2.1 Task: Payload Retrieval from Context

In our task, an attention head is given a set of payload embeddings, each of which contains some payload information (e.g., a class label). The model is then given a "selector" embedding, which the model must use to attend to the correct payload embedding and retrieve the correct payload information. Concretely, we generate data of the form $(x_{1:T}, x_q, i^*, y_{i^*})$, where $x_i \in \mathbb{R}^d$ is the payload embedding for timestep $i \in \{1, \dots, T\}$, $x_q \in \mathbb{R}^d$ is the selector embedding, $i^*$ is the target timestep to retrieve the payload from, and $y_{i^*} \in \{1, \dots, P\}$ is the correct payload label that the model must predict. We study two variants of this task, in which embeddings are generated as follows.

Variant 1: Discrete Latent Variables. Our data generation relies on $K$ latent variables. For simplicity we set $K = 2$. Our latent variables are binary sign vectors of length $r_1$ and $r_2$: $z_1 \in \{-1,1\}^{r_1}$, $z_2 \in \{-1,1\}^{r_2}$. We refer to them as latent keys. Each payload embedding $x_i$ is generated by first randomly sampling latent keys $z_{1,i}, z_{2,i}$ independently, which are then mapped to the embedding space via linear maps $A_1 \in \mathbb{R}^{d \times r_1}, A_2 \in \mathbb{R}^{d \times r_2}$, each randomly sampled from a standard Gaussian and fixed. Each payload embedding is also assigned a random payload $y_i \in \{1, \dots, P\}$, which is mapped to the embedding space via a fixed linear map $A_y \in \mathbb{R}^{d \times P}$.
Thus the payload embedding is given by:

$$x_i = A_1 z_{1,i} + A_2 z_{2,i} + A_y e_{y_i} + \varepsilon_i \quad (1)$$

where $e_{y_i}$ is a one-hot encoding of $y_i$ and $\varepsilon_i$ is standard Gaussian noise. The selector embedding $x_q$ is generated similarly. We first randomly select a target timestep $i^* \in \{1, \dots, T\}$. We then use the same latent keys $z_{1,i^*}, z_{2,i^*}$ that were used to construct the payload embedding at timestep $i^*$, but now embed them with a different set of embedding matrices $B_1 \in \mathbb{R}^{d \times r_1}, B_2 \in \mathbb{R}^{d \times r_2}$, which are also randomly sampled from a standard Gaussian and fixed. The selector embedding is then given by:

$$x_q = B_1 z_{1,i^*} + B_2 z_{2,i^*} + \varepsilon_q. \quad (2)$$

Unlike the payload embeddings, the selector embedding does not contain any payload information. To summarize, the payload and selector embeddings share two sets of latent features, $z_1$ and $z_2$, but are embedded via different linear maps. Payload embeddings also contain payload information, and the attention head must attend to the correct payload embedding to retrieve the payload.

Variant 2: Continuous Latent Variables. The second variant is similar, except that the latent variables are continuous vectors sampled from a standard Gaussian distribution, i.e., $s_1 \sim \mathcal{N}(0, I_{r_1})$, $s_2 \sim \mathcal{N}(0, I_{r_2})$. The payload and selector embeddings are generated as follows:

$$x_i = A_1 s_{1,i} + A_2 s_{2,i} + A_y e_{y_i} + \varepsilon_i \quad (3)$$
$$x_q = B_1 s_{1,i^*} + B_2 s_{2,i^*} + \varepsilon_q \quad (4)$$

2.2 Toy Attention Model

We train a single attention head, i.e., weights $W_Q, W_K, W_V \in \mathbb{R}^{d_{head} \times d}$, $W_O \in \mathbb{R}^{P \times d_{head}}$. Given a data sample $(x_{1:T}, x_q, i^*, y_{i^*})$, the forward pass and loss are given by:

$$q = W_Q x_q, \quad k_i = W_K x_i, \quad v_i = W_V x_i,$$
$$\alpha_i = \frac{\exp(q^\top k_i / \sqrt{d_{head}})}{\sum_{j=1}^{T} \exp(q^\top k_j / \sqrt{d_{head}})}, \quad o = W_O \sum_{i=1}^{T} \alpha_i v_i,$$
$$\hat{y} = \mathrm{softmax}(o), \quad \mathcal{L} = \mathrm{CrossEntropy}(\hat{y}, y_{i^*}).$$
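As a concrete illustration, the discrete variant's data generation and the head's forward pass can be sketched in numpy. This is a minimal sketch with variable names of our choosing; the head weights here are random stand-ins purely to fix shapes, whereas the paper trains them with cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_head, T, P = 32, 16, 16, 10          # embedding dim, head dim, context length, payload classes
r1, r2 = 3, 5                             # ranks of the two latent variables

# Fixed random embedding maps, sampled once from a standard Gaussian (Eqs. 1-2).
A1, A2 = rng.standard_normal((d, r1)), rng.standard_normal((d, r2))
B1, B2 = rng.standard_normal((d, r1)), rng.standard_normal((d, r2))
Ay = rng.standard_normal((d, P))

def sample_example():
    """One (x_{1:T}, x_q, i*, y_{i*}) sample for the discrete variant."""
    z1 = rng.choice([-1.0, 1.0], size=(T, r1))     # latent keys per timestep
    z2 = rng.choice([-1.0, 1.0], size=(T, r2))
    y = rng.integers(0, P, size=T)                 # random payload labels
    X = z1 @ A1.T + z2 @ A2.T + np.eye(P)[y] @ Ay.T + rng.standard_normal((T, d))
    i_star = int(rng.integers(0, T))               # target timestep
    x_q = B1 @ z1[i_star] + B2 @ z2[i_star] + rng.standard_normal(d)  # selector: no payload info
    return X, x_q, i_star, y[i_star]

# Forward pass of a single attention head. Random stand-in weights;
# in the paper W_Q, W_K (and W_V, W_O) are trained.
WQ = rng.standard_normal((d_head, d)) / np.sqrt(d)
WK = rng.standard_normal((d_head, d)) / np.sqrt(d)
X, x_q, i_star, y_star = sample_example()
q, K = WQ @ x_q, X @ WK.T                          # query (d_head,), stacked keys (T, d_head)
logits = K @ q / np.sqrt(d_head)                   # pre-softmax attention logits
attn = np.exp(logits - logits.max())
attn /= attn.sum()                                 # softmax over the T positions
```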
Thus the model must use $W_Q, W_K$ to attend to the correct payload embedding $x_{i^*}$ and use $W_V, W_O$ to decode the correct payload information.

3 QK Decomposition using Contrastive Covariance

Here we describe our method of recovering the ranks and subspaces of latent variables in the attention head's query and key spaces. More succinctly, we refer to their bilinear joint embedding space as the QK space. One can think of the QK space ($\mathbb{R}^{d_{head} \times d_{head}}$) as the space of all possible interactions between queries and keys. Note that in all of our analyses, one can replace all instances of $z_1, z_2$ with $s_1, s_2$.

Our method constructs a contrastive covariance matrix $\Delta$ between queries and keys that isolates their interactions attributable to a single latent variable. For instance, consider latent variable $z_1$. For a sampled query vector $q$ (associated with target value $z_{1,i^*}$), we construct two keys:

• $k^+_{(z_1)}$, whose $z_1$ value matches the query ($z_1 = z_{1,i^*}$)
• $k^-_{(z_1)}$, whose $z_1$ value differs ($z_1 \neq z_{1,i^*}$).

Crucially, we hold $z_2$ fixed across the two conditions: both keys share the same value $\tilde{z}_2$ (drawn randomly) for $z_2$. Given a large sample of such triplets $(q, k^+_{(z_1)}, k^-_{(z_1)})$, we compute positive and negative covariances:

$$C^+_{(z_1)} := \mathbb{E}[q k^\top \mid +] \in \mathbb{R}^{d_{head} \times d_{head}} \quad (5)$$
$$C^-_{(z_1)} := \mathbb{E}[q k^\top \mid -] \in \mathbb{R}^{d_{head} \times d_{head}} \quad (6)$$

We use the term "covariance" informally, as we are not mean-centering $q, k$. Intuitively, $C^+_{(z_1)}$ captures query-key correlations when $z_1$ matches, while $C^-_{(z_1)}$ captures correlations when $z_1$ does not match.
Importantly, because $z_2$ is held constant across the two conditions, the difference of the covariance terms, $\Delta_{(z_1)}$, isolates the component of query-key interactions that is specifically due to the matching of latent variable $z_1$ (see Appendix B for the derivation):

$$\Delta_{(z_1)} := C^+_{(z_1)} - C^-_{(z_1)} = W_Q B \begin{bmatrix} \mathbb{E}[z_{1,i^*} z_{1,i^*}^\top] - \mathbb{E}[z_{1,i^*} z_{1,i \neq i^*}^\top] & 0 \\ 0 & 0 \end{bmatrix} A^\top W_K^\top$$

where $B := [B_1, B_2]$ and $A := [A_1, A_2]$. The same procedure can be repeated for $\Delta_{(z_2)}$ by defining positive and negative conditions accordingly.

Recovering the ranks and subspaces of latent variables. Given $\Delta_{(z_1)}$, we can recover the rank and subspace of latent variable $z_1$ by performing SVD:

$$\Delta_{(z_1)} = U_{(z_1)} \Sigma_{(z_1)} V_{(z_1)}^\top \quad (7)$$

The rank of $z_1$ (denoted $r_1$) can be estimated by counting the number of singular values that capture 99% of the squared Frobenius norm of $\Delta_{(z_1)}$. Denoting the top-$r_1$ singular vectors as $U_{(z_1)}^{[:r_1]}$ and $V_{(z_1)}^{[:r_1]}$, $U_{(z_1)}^{[:r_1]}$ gives a basis in query space that encodes $z_1$, while $V_{(z_1)}^{[:r_1]}$ gives a basis in key space. This can be repeated for each latent variable to recover their respective ranks and subspaces.

4 Empirical Validation of QK Decomposition

Here we apply our method to attention heads trained on the payload retrieval task.

Figure 2: Contrastive QK decomposition recovers the ground-truth rank of each latent variable, as long as there is no superposition (i.e., $r_1 + r_2 < d_{head}$). Each cell annotates the recovered ranks $r_1, r_2$, while the x- and y-axes indicate the ground-truth ranks. The color of each cell indicates the difference between ground-truth and recovered ranks.

Experimental Setup. We train a single attention head under various task settings and hyperparameters. We study both task variants (Section 2.1): discrete ($z_1, z_2$) and continuous ($s_1, s_2$) latent variables.
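The contrastive covariance of Equations (5)-(6) and the SVD-based rank estimate can be sketched in a few lines of numpy. Function names are ours; rows of each array are samples, and the synthetic check at the end uses a hand-built rank-3 interaction rather than a trained head.

```python
import numpy as np

def contrastive_covariance(qs, ks_pos, ks_neg):
    """Delta = E[q k^T | +] - E[q k^T | -] (Eqs. 5-6); no mean-centering."""
    C_pos = qs.T @ ks_pos / len(qs)        # (d_head, d_head) average outer product
    C_neg = qs.T @ ks_neg / len(qs)
    return C_pos - C_neg

def recover_rank_and_bases(delta, energy=0.99):
    """SVD of Delta (Eq. 7); rank = number of singular values capturing
    99% of the squared Frobenius norm."""
    U, S, Vt = np.linalg.svd(delta)
    cum = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(cum, energy)) + 1
    return r, U[:, :r], Vt[:r].T           # query-space basis, key-space basis

# Synthetic check: a rank-3 interaction plus small noise should give r = 3.
rng = np.random.default_rng(1)
U0 = np.linalg.qr(rng.standard_normal((16, 3)))[0]
V0 = np.linalg.qr(rng.standard_normal((16, 3)))[0]
delta = U0 @ np.diag([5.0, 4.0, 3.0]) @ V0.T + 1e-3 * rng.standard_normal((16, 16))
r, U_q, V_k = recover_rank_and_bases(delta)
```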
We train attention heads with either $d_{head} = 8$, varying $r_1, r_2 \in \{2, \dots, 6\}$, or $d_{head} = 16$, varying $r_1, r_2 \in \{4, \dots, 12\}$. In every setting, we set $d = 32$, context length $T = 16$, and the number of payloads (classes) $P = 10$. Under these settings, the attention heads achieve 99% accuracy, except for the continuous task when $d_{head} = 8$, in which accuracy drops to around 85%. For additional training details, see Appendix D.

Recovering the Rank of Latent Variables. We first verify that our method recovers the rank of each latent variable. Figure 2 shows the results for one of our models (all other results are in Appendix F). The x- and y-axes indicate the ground-truth ranks $r_1$ and $r_2$ of $z_1$ and $z_2$. The text annotations indicate the ranks recovered by our method. The colors indicate the difference between the ground-truth and recovered ranks. When the model has enough dimensions to encode both latent variables ($r_1 + r_2 < d_{head}$), our method recovers the ranks of both latent variables (dark green cells). Otherwise, we see superposition (Elhage et al., 2022), in which the model compresses both variables into fewer dimensions than their total rank. We discuss superposition in more detail below.

Recovering Latent Variable Subspaces in QK Space. We can apply SVD on $\Delta$ to recover the subspaces in which each latent variable is encoded. As a reminder, we denote the top-$r_1$ singular vectors of $\Delta_{(z_1)}$ as $U_{(z_1)}^{[:r_1]} \in \mathbb{R}^{d_{head} \times r_1}$ and $V_{(z_1)}^{[:r_1]} \in \mathbb{R}^{d_{head} \times r_1}$. $U_{(z_1)}^{[:r_1]}$ provides a basis for $z_1$ in query space, while $V_{(z_1)}^{[:r_1]}$ provides a basis in key space.

Figure 3: PCA of Latent Variable Subspace. We project key and query vectors onto the recovered subspaces of latent variable $z_1$ (of rank $r_1 = 3$), then perform PCA, which recovers the 3D-cube structure of $z_1$. Also note that keys and queries align onto the same clusters.
See Figure 12 for the continuous task variant, in which our method recovers the spherical structure of latent variable $s_1$. We visualize these subspaces by projecting the query and key vectors $q, k \in \mathbb{R}^{d_{head}}$ onto $U_{(z_1)}^{[:r_1]}$ and $V_{(z_1)}^{[:r_1]}$, followed by PCA. Figure 3 shows an example for a model with $d_{head} = 16$ and $r_1 = 3, r_2 = 5$. Note two observations. First, $z_1$ is sampled from $\{-1,1\}^{r_1}$, which corresponds to the vertices of a 3D cube; this structure is faithfully recovered by PCA. Second, the key and query projections are aligned, both collapsing to the same clusters. For an example of the second task variant, see Figure 12, in which we recover the Gaussian sphere structure of latent key $s_1$.

Causal Interventions in QK Space. To validate the role of the recovered subspaces, we perform causal interventions. Namely, we intervene on the key vectors by first projecting them onto their latent variable subspaces. We then change the coordinates in these subspaces (imagine moving from one vertex to another in Figure 3), and measure how the attention scores change. More specifically, consider intervening on $z_1$. Given an original timestep $i_{orig}$, we randomly select a new target timestep $i_{target}$. We then project the key vectors $k_{i_{orig}}, k_{i_{target}}$ onto the subspace of $z_1$, then replace the coordinates of $k_{i_{orig}}$ in this subspace with those of $k_{i_{target}}$, and vice versa:

$$P_v = V_{(z_1)}^{[:r_1]} V_{(z_1)}^{[:r_1]\top}, \quad (8)$$
$$\tilde{k}_{i_{orig}} = k_{i_{orig}} + P_v (k_{i_{target}} - k_{i_{orig}}), \quad (9)$$
$$\tilde{k}_{i_{target}} = k_{i_{target}} + P_v (k_{i_{orig}} - k_{i_{target}}). \quad (10)$$

Finally, we compute attention scores with these modified keys and measure how much of the attention score has shifted from timestep $i_{orig}$ to $i_{target}$. Note that this step can be repeated using $z_2$ to intervene on both latent variables.
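The key-swapping intervention of Equations (8)-(10) amounts to exchanging coordinates inside the recovered subspace while leaving the orthogonal complement untouched. A minimal numpy sketch (the helper name is ours, and the basis here is random rather than recovered from a trained head):

```python
import numpy as np

def swap_in_subspace(k_orig, k_target, V_r):
    """Swap two keys' coordinates inside the recovered subspace (Eqs. 8-10).

    V_r: (d_head, r) orthonormal key-space basis for one latent variable."""
    P_v = V_r @ V_r.T                                # orthogonal projector, Eq. (8)
    k_orig_new = k_orig + P_v @ (k_target - k_orig)  # Eq. (9)
    k_target_new = k_target + P_v @ (k_orig - k_target)  # Eq. (10)
    return k_orig_new, k_target_new

# Sanity check: inside the subspace the coordinates are exchanged,
# outside it the keys are unchanged.
rng = np.random.default_rng(2)
V_r = np.linalg.qr(rng.standard_normal((16, 3)))[0]
k_a, k_b = rng.standard_normal(16), rng.standard_normal(16)
k_a2, k_b2 = swap_in_subspace(k_a, k_b, V_r)
P = V_r @ V_r.T
```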
Figure 4 shows the results on a test set of 51,200 samples (for more examples see Appendix F). $z_1$, $z_2$, and $z_1 + z_2$ correspond to intervening on $V_{(z_1)}^{[:r_1]}$, $V_{(z_2)}^{[:r_2]}$, or both. "Rand $r_1$, $r_2$, $r_1 + r_2$" correspond to intervening on random subspaces of the same dimension as $z_1$, $z_2$, or both. Note that intervening on both subspaces ($z_1 + z_2$) moves all the attention from $i_{orig}$ to $i_{target}$, while intervening on the random baseline counterparts induces a much smaller shift. This validates that our QK decomposition method recovers the correct subspaces in which latent variables are encoded.

Figure 4: Causal Interventions on Latent Variable Subspaces. Intervening on the recovered subspaces for latent variables $z_1$ and $z_2$ shifts all the attention from the original token to the target token, while intervening on random subspaces of the same dimension (i.e., "Rand $r_1, r_2, r_1 + r_2$") has less of an effect.

Pitfalls of Contrastive Covariance: Feature Splits and Superposition. Our toy model also reveals pitfalls of our QK decomposition method. To illustrate them, we study how our latent variables interact with each other in QK space by analyzing their bilinear interactions $G$:

Figure 5: Interactions between latent variables in QK space reveal feature splits and superposition. When the model has enough dimensions ($r_1 + r_2 \leq d_{head}$), the model further decomposes the latent variables into independent components (feature splits: strong diagonals in $G$, as opposed to block diagonals). When there are not enough dimensions ($r_1 + r_2 > d_{head}$), we observe superposition, in which the model compresses both latent variables into fewer dimensions than their total rank (off-diagonal interactions in $G$).
$$q^\top k = (W_Q B z_q)^\top (W_K A z_k) \quad (11)$$
$$= z_q^\top \underbrace{B^\top W_Q^\top W_K A}_{G}\, z_k = z_q^\top G z_k \quad (12)$$

where $A := [A_1, A_2], B := [B_1, B_2] \in \mathbb{R}^{d \times (r_1 + r_2)}$ and $z_q = [z_{1,i^*}; z_{2,i^*}], z_k = [z_1; z_2] \in \mathbb{R}^{r_1 + r_2}$. Here, $G \in \mathbb{R}^{(r_1 + r_2) \times (r_1 + r_2)}$ captures the bilinear interactions between the latent variables in the query and key spaces, where $G[i,j]$ indicates how strongly latent coordinates $z_{q,i}$ and $z_{k,j}$ interact via the weights $W_Q^\top W_K$.

Figure 5 visualizes $G$ under varying $d_{head}$ sizes and ranks of each latent variable, for the second task variant. For results on the first task, see Appendix F. We make two observations. First, when the model has enough dimensions to represent both latent variables ($r_1 + r_2 \leq d_{head}$), we observe feature splits, as indicated by the strong diagonals in such settings. Namely, while our latent variables $z_1, z_2$ have $r_1, r_2$ degrees of freedom, their coordinates (e.g., $z_1[0], z_1[1]$) are independent of one another. Thus the model further decomposes these latent variables into independent components. On the contrary, without enough dimensions ($r_1 + r_2 > d_{head}$), we observe superposition, where the model compresses both latent variables into fewer dimensions. This is indicated by the off-diagonal interactions in $G$, where multiple components from $z_1$ and $z_2$ interact. The subsequent softmax operation likely allows such compression to occur through its "winner-takes-all" behavior. This raises two questions: how often does superposition occur in "real" models, and how do we interpret superposed features?

So What is a Feature? Note that our method relies on a human-defined notion of what constitutes a "feature", which is manifested in how the positive and negative covariance conditions are defined.
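In the toy setting, where the embedding maps are known, the bilinear interaction matrix $G$ of Equation (12) can be computed directly. A small numpy sketch with random stand-in weights (shapes and names are ours), checking the noise-free identity $q^\top k = z_q^\top G z_k$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_head, r1, r2 = 32, 16, 3, 5

# Concatenated embedding maps A := [A1, A2], B := [B1, B2], Eq. (12).
A = np.concatenate([rng.standard_normal((d, r1)), rng.standard_normal((d, r2))], axis=1)
B = np.concatenate([rng.standard_normal((d, r1)), rng.standard_normal((d, r2))], axis=1)
WQ = rng.standard_normal((d_head, d))
WK = rng.standard_normal((d_head, d))

# G[i, j] measures how strongly latent coordinate i (query side) interacts
# with latent coordinate j (key side) through W_Q^T W_K.
G = B.T @ WQ.T @ WK @ A                            # (r1+r2, r1+r2)

# Bilinear identity of Eqs. (11)-(12), with noise-free embeddings.
z_q = rng.choice([-1.0, 1.0], r1 + r2)
z_k = rng.choice([-1.0, 1.0], r1 + r2)
q, k = WQ @ (B @ z_q), WK @ (A @ z_k)
```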
Though our method faithfully recovers the targeted latent variables as designed by our positive and negative pairs, this human-defined notion of features may not always align with the "unit" in which the model represents features, as our examples demonstrate. All of this adds to the ongoing discourse around "what is a feature?" (Olah et al., 2020; Elhage et al., 2022).

5 QK Features in Large Language Models

Here we apply our method to Llama 3.1-8B Instruct (Grattafiori et al., 2024) and Qwen 3-4B Instruct (Yang et al., 2025). Results for Qwen are in Appendix G.

Figure 6: (a) PCA visualization of the categorical QK subspace. We project key and query vectors onto their respective categorical subspaces and perform PCA. Note the alignment between keys and queries of the same category. (b) PCA visualization of additional categories (keys only). Visualizing additional categories exhibits clear semantic clusters (e.g., locations (Country, States, Cities), names (Male, Female), animals (Animal, Bird), food (Food, Liquid, Fruit)).

5.1 Categorical Semantic Space in Filter Heads

Filter Heads (Sharma et al., 2025) are attention heads that mirror "filter" functions: for instance, given a list of items, they attend to items pertaining to a queried category: "Cat, apple, dog, truck, orange, tea, car, duck. Find the fruits." We apply our method to identify QK subspaces that encode various categories in Filter Heads. To do so, we emulate the setup of Sharma et al. to identify Filter Heads. We construct 2,000 prompts containing a list of items from various categories $c \in C$ (e.g., fruits, animals, vehicles), followed by a query category $c^*$. Each prompt includes at least 5 items per category. We select the top three heads based on the ratio of attention given to the queried items versus all other items.
We use the last token for our query vector, and use key vectors for positive and negative QK covariances as defined below (per category):

• $C^+_{category}$: tokens belonging to the queried category $c^*$.
• $C^-_{category}$: tokens not belonging to the queried category.

The remaining steps follow as in Section 3.

Figure 7: Causal interventions on categorical QK subspaces. We intervene by replacing the QK components of tokens from one category (e.g., fruits) with those from another category (e.g., animals).

Visualizing Categorical Semantic QK Space. We provide two visualizations of the recovered categorical semantic space. In the first, we consider 5 categories: fruits, animals, vehicles, drinks, and countries. Interestingly, their contrastive covariances ($\Delta_{fruits}, \Delta_{animals}, \dots$) are all rank 1. We thus define the categorical QK subspace as the span of these 5 directions. Figure 6(a) visualizes the keys and queries projected onto this categorical subspace using PCA. We observe clear clusters corresponding to each category, but more importantly, we also observe alignment between keys and queries of the same category. Namely, the first principal component (PC 1) separates keys from queries, while the structures of queries and keys in PCs 2 and 3 are symmetric to one another. In Figure 6(b) we expand the list of categories to 13 and visualize only the keys, which again reveals clear semantic clusters.

Causal Interventions. We validate the role of the identified subspace with interventions. We use a test set of 1,000 samples, each of which has 5 categories. In each sample, we randomly select a target token $i_{target}$ that does not belong to the queried category. We then intervene on the recovered subspaces as described in Equations (9), (10).
Figure 7 shows that intervening on the recovered 5-dimensional subspace successfully shifts attention from one categorical token to another (e.g., from fruits to animals), and is much more effective than a random 5-dimensional baseline. It does not, however, shift all the attention, suggesting additional features in QK space not captured by our method.

5.2 Binding Features

Researchers have studied how language models bind entities together (Feng and Steinhardt, 2023; Dai et al., 2024; Prakash et al., 2025). Gur-Arieh et al. (2025) show that models rely on multiple mechanisms. Consider the following prompt: "The hat is in box O. The jam is in box Z … (omitted) … Which box is the jam in?"

One mechanism is dubbed order-ID, in which the model uses the order in which entity groups appear: given a query entity (e.g., jam), the model retrieves the box with the same order (e.g., second) as the queried entity. Another mechanism is the lexical mechanism: the model uses the identity of the queried entity (e.g., jam) to retrieve the associated box. This is perhaps the most intuitive, "correct" mechanism. For more details on these mechanisms, see Appendix E.

Figure 8: PCA, UMAP of order-ID and lexical subspaces. PC1/UMAP1 encode keys versus queries, while PCs/UMAPs 2 and 3 encode order or lexical IDs. Note the alignment between keys and queries in order-IDs. Because the lexical subspace is higher dimensional, we include both PCA and UMAP: the clusters are easier to see in UMAP, while the alignment between keys and queries is easier to see in the PCA (note that UMAP does not preserve the notion of distance, and thus alignment information is not visually observable). Visualizing the same PCAs on key and query vectors without projecting to our QK subspaces reveals that order-ID features are encoded in the first few PCs (see Figure 17).

We use our method to identify QK subspaces corresponding to these two mechanisms.
We construct 3,000 prompts, each containing 9 entity-box pairs (e.g., hat-box O, jam-box Z, etc.). We filter for attention heads that attend to the correct box with at least 30% accuracy. This results in 9 heads; we demonstrate results from a few heads here, while all others can be found in Appendix F. We use the last token as our query and box label tokens (e.g., box "Z") as our keys. For order-ID, the positive and negative covariances are:

• $C^+_{order}$: the box whose order matches that of the queried entity.
• $C^-_{order}$: boxes whose order does not match that of the queried entity.

Importantly, we keep the same set of entities in all of our samples (although their orders are shuffled across samples), and use the same fixed query entity across all samples. However, in our intervention test data, we use query entities not seen when constructing $\Delta_{order}$. For the lexical mechanism, we make counterfactual prompts: for every prompt, we make a copy but replace the entity being queried ("…the jam is in box Z…Which box is the jam in?" → "…the pen is in box Z…Which box is the pen in?"). Our positive and negative covariances are defined as:

• $C^+_{Lex}$: the box of the original queried entity.
• $C^-_{Lex}$: the box of the queried entity in the counterfactual prompt.

Similar to the order-IDs, this allows us to isolate signals coming from lexical information.

Figure 9: Causal interventions on binding QK subspaces. We intervene by modifying the order-ID or lexical components (or both) of the QK space. Intervening on both components yields a larger shift in attention.

Visualizing Binding QK Subspaces. Here we visualize our recovered binding QK subspaces. We use 3,000 samples with 9 entities each to construct $\Delta_{order}$ and $\Delta_{Lex}$. We find that $\Delta_{order}$ is usually rank 2 or 3, while $\Delta_{Lex}$ is usually rank 9 or 10 (the ranks do not appear to depend on the number of entities used in constructing $\Delta$; see Figure 16).
We project our key and query vectors onto these respective subspaces and visualize them using PCA or UMAP. Because the lexical subspace has more dimensions, we include a UMAP visualization. Figure 8 shows the results. Similar to the categorical features, we observe clear clusters corresponding to order-IDs and lexical-IDs, as well as alignment between keys and queries.

Causal Interventions. We also perform causal interventions on these binding QK subspaces. We use 1,000 test samples. Similar to previous experiments, given an original timestep $i_{orig}$ corresponding to the correct box, we select a random target timestep $i_{target}$ corresponding to a different box. We then intervene on the key vectors $k_{i_{orig}}$ and $k_{i_{target}}$ in either the order-ID subspace, the lexical subspace, or both. Results are shown in Figure 9, in which we see a similar trend as before: intervening on each individual subspace shifts some of the attention, while intervening on both subspaces shifts the majority of the attention. Intervening on random subspaces of the same ranks has negligible effects.

5.3 Attention Logit Attributions

How much of the attention logits (attention scores prior to softmax) can be explained by our recovered features, and how much is left unexplained? Because the logits are linear in query space, we can easily check how much our features contribute towards an attention head's logits. Namely, given $q, k_i \in \mathbb{R}^{d_{head}}$ for key positions $i \in \{1, \dots, T\}$, let $K \in \mathbb{R}^{T \times d_{head}}$ be the stacked matrix of keys, with each row $K[i] = k_i^\top$. The pre-softmax attention logits are $\ell = Kq / \sqrt{d_{head}} \in \mathbb{R}^T$.

Figure 10: Attention logit attributions to low-rank feature components. Blue and orange bars refer to logit contributions from the order-ID and lexical subspaces. The green bars indicate logits left unexplained by our two features.
Now consider our recovered feature bases for order-ID and lexical-ID in query space: $U_{order}$, $U_{Lex}$, each of rank $r_{order}, r_{Lex} \ll d_{head}$. Let $P_{order} := U_{order} U_{order}^\top$ be an orthogonal projector. Intuitively, $P_{order}\, q \in \mathbb{R}^{d_{head}}$ is the component of $q$ that encodes order-ID, as everything orthogonal to the column space of $U_{order}$ is removed. Defining a similar orthogonal projector $P_{Lex}$ for lexical-ID, we can iteratively decompose our query vector:

$$q_{order} = P_{order}\, q, \quad (13)$$
$$q_{Lex} = P_{Lex}\, (q - q_{order}), \quad (14)$$
$$q_{\perp} = q - q_{order} - q_{Lex}, \quad (15)$$

where $q_{Lex}$ captures the lexical component of $q$ after the order-ID subspace has been removed, and $q_{\perp}$ is the residual query component not accounted for by order-ID and lexical-ID. By construction, $q = q_{order} + q_{Lex} + q_{\perp}$. Note that when the two feature subspaces are not disjoint, this decomposition is sensitive to the order in which we project out feature subspaces, as the overlapping space counts towards the first feature. In our case we project out $U_{order}$ first because it has lower rank than $U_{Lex}$.

Finally, with our decomposed query vectors, we can also define feature-specific logit vectors:

$$\ell_{order} = \frac{K q_{order}}{\sqrt{d_{head}}}, \quad \ell_{Lex} = \frac{K q_{Lex}}{\sqrt{d_{head}}}, \quad \ell_{\perp} = \frac{K q_{\perp}}{\sqrt{d_{head}}}.$$

Because the logit space is linear in $q$, we have the following decomposition:

$$\ell = \ell_{order} + \ell_{Lex} + \ell_{\perp}, \quad \ell_i = \ell_i^{(order)} + \ell_i^{(Lex)} + \ell_i^{(\perp)} \;\; \forall i,$$

where $\ell_i$ is the logit at timestep $i$. This yields token-level attributions in logit space: $\ell_i^{(order)}$ and $\ell_i^{(Lex)}$ measure how much that token's logit is accounted for by the recovered order-ID and lexical subspaces, with $\ell_i^{(\perp)}$ capturing the residual contribution not explained by these subspaces.
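The iterative projection and logit split of Equations (13)-(15) can be sketched in a few lines of numpy. The function name is ours, and the bases below are random stand-ins rather than recovered subspaces; by linearity the three parts always sum exactly to the full logits.

```python
import numpy as np

def attribute_logits(q, K, U_order, U_lex):
    """Split pre-softmax logits into order-ID / lexical / residual parts.

    Projects out U_order first (Eqs. 13-15), so any overlap between the
    two subspaces counts toward order-ID."""
    d_head = q.shape[0]
    P_order = U_order @ U_order.T       # orthogonal projectors
    P_lex = U_lex @ U_lex.T
    q_order = P_order @ q               # Eq. (13)
    q_lex = P_lex @ (q - q_order)       # Eq. (14)
    q_perp = q - q_order - q_lex        # Eq. (15)
    scale = np.sqrt(d_head)
    return K @ q_order / scale, K @ q_lex / scale, K @ q_perp / scale

# Check the decomposition identity with random stand-in bases.
rng = np.random.default_rng(4)
d_head, T = 16, 8
U_order = np.linalg.qr(rng.standard_normal((d_head, 2)))[0]   # rank-2 order-ID basis
U_lex = np.linalg.qr(rng.standard_normal((d_head, 9)))[0]     # rank-9 lexical basis
q, K = rng.standard_normal(d_head), rng.standard_normal((T, d_head))
l_order, l_lex, l_perp = attribute_logits(q, K, U_order, U_lex)
```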
Figure 10 demonstrates an example: given an input sentence, per token, blue and orange bars indicate logits attributable to the order-ID and lexical subspaces, while green bars indicate residual logits that are left unexplained. Beyond the unexplained attention logits, this example provides a couple more insights. For instance, this head seems to rely on lexical-IDs more than order-IDs, although this may be a result of the lexical subspace having higher rank. We can also observe mistakes that may have gone unnoticed (especially post-softmax), as we see the model incorrectly assigning mass to the order-ID subspace of Box B, or the lexical subspace of Box A.

6 Related Work

Here we provide an abridged overview of prior work, with a much more thorough review in Appendix C. QK spaces have been studied before, in both language and vision models. In language, Kamath et al. (2025), Ge et al. (2024), and Friedman et al. (2025) decompose query-key interactions using features from sparse autoencoders, while Gurnee et al. (2026) use features from probes to study their interactions in QK space. Lastly, Wynrow and Sharkey (2024) learn a sparse mask in QK space to detect features. Unlike prior work, our method does not rely on pre-existing features, nor any training, to find QK features. In vision, Pan et al. (2024) and Doshi et al. (2026) similarly apply SVD to query-key interactions, finding "channels" that communicate positional or content information, while Li et al. (2025) study how vision models bind tokens belonging to the same entity via bilinear probes in QK space. Researchers have also viewed attention as a "communication channel" (Elhage et al., 2021). Merullo et al. (2024) study heads that "talk" with one another, while Franco and Crovella (2025) recover low-rank QK subspaces that are causally relevant for upstream usage within a circuit.
Lastly, researchers have also studied attention heads by visualizing query-key interactions (Yeh et al., 2023), uncovering global patterns in their interactions.

7 Discussion

We demonstrate a simple method to decompose the QK space of attention heads into interpretable low-rank components. Here we briefly discuss potential future directions.

Multi-dimensional Features. In our work and others (Engels et al., 2025), we have seen multi-dimensional features. How might we detect other multi-dimensional features?

Unsupervised QK Decomposition. One limitation of our method is its reliance on positive and negative covariance terms, which requires knowing beforehand what features to look for. A natural next step is decomposing QK spaces without human supervision. One potential challenge is dealing with multi-dimensional features of varying ranks. Another is interpreting such decomposed components: even if we identify multiple QK components, their observable behaviors may be identical (e.g., both attend to token X). When multiple components exhibit the same behavior, how might we interpret each component? We leave these questions to future work.

Ethical Statement

This paper takes a step towards interpreting the internal computations of large language models. We hope such interpretable systems will lead to safer and more reliable use cases in the future.

Acknowledgements

AL thanks Eric Todd, Andy Arditi, Sheridan Feucht, and Yida Chen for constructive feedback. AL acknowledges support from a Superalignment Fast Grant from OpenAI. YB was funded by Coefficient Giving, the Israel Science Foundation (grant No. 2942/25), and the European Union (ERC, Control-LM, 101165402). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
Lastly, FV and MW acknowledge support from a Superalignment Fast Grant from OpenAI, and Coefficient Giving.

References

E. Aflalo, M. Du, S. Tseng, Y. Liu, C. Wu, N. Duan, and V. Lal (2022) VL-InterpreT: an interactive visualization tool for interpreting vision-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21406–21415.

A. Ahmad, A. Joshi, and A. Modi (2025) Beyond components: singular vector-based interpretability of transformer circuits. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

D. Bahdanau (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

F. Barbero, A. Vitvitskyi, C. Perivolaropoulos, R. Pascanu, and P. Veličković (2025) Round and round we go! What makes rotary positional encodings useful? In The Thirteenth International Conference on Learning Representations.

Q. Dai, B. Heinzerling, and K. Inui (2024) Representational analysis of binding in language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 17468–17493.

F. R. Doshi, T. Fel, T. Konkle, and G. Alvarez (2026) Bi-orthogonal factor decomposition for vision transformers. arXiv preprint arXiv:2601.05328.

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah (2022) Toy models of superposition. Transformer Circuits Thread.

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformer-circuits.pub/2021/framework/index.html

J. Engels, E. J. Michaud, I. Liao, W. Gurnee, and M. Tegmark (2025) Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations.

J. Feng and J. Steinhardt (2023) How do language models bind entities in context? In The Twelfth International Conference on Learning Representations.

G. Franco and M. Crovella (2025) Pinpointing attention-causal communication in language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

D. Friedman, A. Bhaskar, A. Wettig, and D. Chen (2025) Extracting rule-based descriptions of attention features in transformers. arXiv preprint arXiv:2510.18148.

X. Ge, F. Zhu, W. Shu, J. Wang, Z. He, and X. Qiu (2024) Automatically identifying local and global circuits with linear computation graphs. arXiv preprint arXiv:2405.13868.

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Y. Gur-Arieh, M. Geva, and A. Geiger (2025) Mixing mechanisms: how language models retrieve bound entities in-context. arXiv preprint arXiv:2510.06182.

W. Gurnee, E. Ameisen, I. Kauvar, J. Tarng, A. Pearce, C. Olah, and J. Batson (2026) When models manipulate manifolds: the geometry of a counting task. arXiv preprint arXiv:2601.04480.

B. Hoover, H. Strobelt, and S. Gehrmann (2020) exBERT: a visual analysis tool to explore learned representations in transformer models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 187–196.

X. Huang and M. Hahn (2025) Decomposing representation space into interpretable subspaces with unsupervised learning. In Mechanistic Interpretability Workshop at NeurIPS 2025.

S. Jain and B. C. Wallace (2019) Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3543–3556.

H. Kamath, E. Ameisen, I. Kauvar, R. Luger, W. Gurnee, A. Pearce, S. Zimmerman, J. Batson, T. Conerly, C. Olah, and J. Lindsey (2025) Tracing attention computation through feature interactions. Transformer Circuits Thread.

O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019) Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4365–4374.

A. Lee, L. Sun, C. Wendler, F. Viégas, and M. Wattenberg (2025) The geometry of self-verification in a task-specific reasoning model. arXiv preprint arXiv:2504.14379.

J. Li, W. Monroe, and D. Jurafsky (2016) Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220.

Y. Li, S. Salehi, L. Ungar, and K. P. Kording (2025) Does object binding naturally emerge in large pretrained vision transformers? arXiv preprint arXiv:2510.24709.

S. Liu, T. Li, Z. Li, V. Srikumar, V. Pascucci, and P. Bremer (2018) Visual interrogation of attention-based models for natural language inference and machine comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 36–41.

J. Merullo, C. Eickhoff, and E. Pavlick (2024) Talking heads: understanding inter-layer communication in transformer language models. Advances in Neural Information Processing Systems 37, pp. 61372–61418.

N. Nanda, A. Lee, and M. Wattenberg (2023) Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941.

C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter (2020) Zoom in: an introduction to circuits. Distill. https://distill.pub/2020/circuits/zoom-in

X. Pan, A. Philip, Z. Xie, and O. Schwartz (2024) Dissecting query-key interaction in vision transformers. Advances in Neural Information Processing Systems 37, pp. 54595–54631.

N. Prakash, N. Shapira, A. S. Sharma, C. Riedl, Y. Belinkov, T. R. Shaham, D. Bau, and A. Geiger (2025) Language models use lookbacks to track beliefs. arXiv preprint arXiv:2505.14685.

A. S. Sharma, G. Rogers, N. Shapira, and D. Bau (2025) LLMs process lists with general filter heads. arXiv preprint arXiv:2510.26784.

H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush (2018) Seq2Seq-Vis: a visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics 25 (1), pp. 353–363.

S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end memory networks. Advances in Neural Information Processing Systems 28.

J. Vig (2019) A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, pp. 37–42.

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022) Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

K. Wynrow and L. Sharkey (2024) Decomposing the QK circuit with bilinear sparse dictionary learning. AI Alignment Forum.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.

C. Yeh, Y. Chen, A. Wu, C. Chen, F. Viégas, and M. Wattenberg (2023) AttentionViz: a global view of transformer attention. IEEE Transactions on Visualization and Computer Graphics 30 (1), pp. 262–272.

Appendix A Attention Review

We use lowercase letters ($a, b$) for scalars, bold lowercase ($q, k$) for vectors, and bold uppercase ($W$) for matrices. Consider a single attention head with key, query, and value weight matrices $W_K, W_Q, W_V \in \mathbb{R}^{d_{head} \times d}$.
At position $t$ in the sequence, the activations are $x_t \in \mathbb{R}^d$, and the head computes

$q_t = W_Q x_t \in \mathbb{R}^{d_{head}}, \qquad k_s = W_K x_s \in \mathbb{R}^{d_{head}}$,

$\ell_{t,s} = q_t^\top k_s, \qquad \alpha_{t,s} = \frac{\exp(\ell_{t,s} / \sqrt{d_{head}})}{\sum_{s'} \exp(\ell_{t,s'} / \sqrt{d_{head}})}$,

where $\ell_{t,s}$ is the unnormalized attention logit from query position $t$ to key position $s$, and $\alpha_{t,s}$ is the corresponding attention weight. The logit is a bilinear form in the residual stream:

$\ell_{t,s} = q_t^\top k_s = x_t^\top W_Q^\top W_K x_s = x_t^\top B x_s, \qquad \mathrm{rank}(B) \le d_{head} \ll d$.

We are interested in decomposing $x^\top B x$ further into interpretable subspaces that encode specific features.

Appendix B Contrastive Covariance Derivation

Our contrastive covariance matrix $\Delta C_{(z_1)}$ captures the interaction between query and key vectors that is specifically due to the matching of latent variable $z_1$. To see this, we start from the definitions of the key and query terms. Recall from Equations 1 and 2 that the payload embeddings and selector embeddings are generated as follows:

$x_i = A_1 z_{1,i} + A_2 z_{2,i} + A_y e_{y_i} + \varepsilon_i$,

$x_q = B_1 z_{1,i^*} + B_2 z_{2,i^*} + \varepsilon_q$.

Thus the query and key vectors are given by:

$q = W_Q x_q = W_Q (B_1 z_{1,i^*} + B_2 z_{2,i^*} + \varepsilon_q)$,

$k_i = W_K x_i = W_K (A_1 z_{1,i} + A_2 z_{2,i} + A_y e_{y_i} + \varepsilon_i)$.

Assuming that the attention head's key vectors do not encode payload information (i.e., $W_K A_y \approx 0$) and ignoring noise terms, we can express the above in block form:

$q = W_Q B \begin{bmatrix} z_{1,i^*} \\ z_{2,i^*} \end{bmatrix}, \qquad k_i = W_K A \begin{bmatrix} z_{1,i} \\ z_{2,i} \end{bmatrix}$,

where $B := [B_1 \; B_2]$ and $A := [A_1 \; A_2]$. Now consider the positive covariance term $C^+_{(z_1)}$. The positive covariance is defined over pairs $(q, k)$ where the latent variable $z_1$ matches, while $z_2$ is held constant (i.e., $z_{2,i} = \tilde{z}_2$).
Thus we have:

$k_i^+ = W_K A \begin{bmatrix} z_{1,i^*} \\ \tilde{z}_2 \end{bmatrix}$,

$\mathbb{E}[q k_i^{+\top} \mid +] = \mathbb{E}\left[ \left( W_Q B \begin{bmatrix} z_{1,i^*} \\ z_{2,i^*} \end{bmatrix} \right) \left( W_K A \begin{bmatrix} z_{1,i^*} \\ \tilde{z}_2 \end{bmatrix} \right)^{\top} \right]$

$= W_Q B \, \mathbb{E}\left[ \begin{bmatrix} z_{1,i^*} \\ z_{2,i^*} \end{bmatrix} \begin{bmatrix} z_{1,i^*}^\top & \tilde{z}_2^\top \end{bmatrix} \right] A^\top W_K^\top$

$= W_Q B \begin{bmatrix} \mathbb{E}[z_{1,i^*} z_{1,i^*}^\top] & \mathbb{E}[z_{1,i^*}] \mathbb{E}[\tilde{z}_2^\top] \\ \mathbb{E}[z_{2,i^*}] \mathbb{E}[z_{1,i^*}^\top] & \mathbb{E}[z_{2,i^*} \tilde{z}_2^\top] \end{bmatrix} A^\top W_K^\top$

$= W_Q B \begin{bmatrix} \mathbb{E}[z_{1,i^*} z_{1,i^*}^\top] & 0 \\ 0 & \mathbb{E}[z_{2,i^*} \tilde{z}_2^\top] \end{bmatrix} A^\top W_K^\top$,

where the zeros in the last equality follow from the independence of $z_1$ and $z_2$ and the fact that $\mathbb{E}[z_1] = \mathbb{E}[z_2] = 0$. Similarly computing the expectation for the negative condition (pairs $(q, k)$ where the latent variable $z_1$ differs, while $z_2$ is held constant at $\tilde{z}_2$) yields

$\mathbb{E}[q k^\top \mid -] = W_Q B \begin{bmatrix} \mathbb{E}[z_{1,i^*} z_{1, i \ne i^*}^\top] & 0 \\ 0 & \mathbb{E}[z_{2,i^*} \tilde{z}_2^\top] \end{bmatrix} A^\top W_K^\top$.

We are now left with the contrastive covariance matrix:

$\Delta C_{(z_1)} = C^+_{(z_1)} - C^-_{(z_1)} = W_Q B \begin{bmatrix} \mathbb{E}[z_{1,i^*} z_{1,i^*}^\top] - \mathbb{E}[z_{1,i^*} z_{1, i \ne i^*}^\top] & 0 \\ 0 & 0 \end{bmatrix} A^\top W_K^\top$.

Thus $\Delta C_{(z_1)}$ isolates the contribution of latent variable $z_1$ to the query-key interaction. The same derivation applies to $z_2$ by defining positive and negative conditions based on $z_2$ while holding $z_1$ constant. The ranks and subspaces of the latent variables can then be recovered by performing SVD on $\Delta C_{(z_1)}$ and $\Delta C_{(z_2)}$ respectively:

$\Delta C_{(z_1)} = U_{(z_1)} \Sigma_{(z_1)} V_{(z_1)}^\top, \qquad \Delta C_{(z_2)} = U_{(z_2)} \Sigma_{(z_2)} V_{(z_2)}^\top$.

The rank of $z_1$ (denoted $r_1$) can be estimated by counting the number of singular values that capture 99% of the squared Frobenius norm of $\Delta C_{(z_1)}$.
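Under the assumptions of this derivation, the pipeline can be simulated end to end: sample matched and mismatched pairs from the toy generative model, estimate $\Delta C_{(z_1)}$ from outer products, and read the rank off the singular values. A minimal sketch with random weights (all names are ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_head, r1, r2, n = 32, 8, 2, 3, 20000

# Toy generative model: x_q mixes (z_1, z_2) via B_1, B_2; x_i via A_1, A_2.
A1, A2 = rng.standard_normal((d, r1)), rng.standard_normal((d, r2))
B1, B2 = rng.standard_normal((d, r1)), rng.standard_normal((d, r2))
W_Q, W_K = rng.standard_normal((d_head, d)), rng.standard_normal((d_head, d))

def sample(match_z1):
    """Sample n (q, k) pairs; z_2 is shared within each pair, and z_1
    either matches (positive condition) or is resampled (negative)."""
    z1q, z2 = rng.standard_normal((n, r1)), rng.standard_normal((n, r2))
    z1k = z1q if match_z1 else rng.standard_normal((n, r1))
    q = (B1 @ z1q.T + B2 @ z2.T).T @ W_Q.T
    k = (A1 @ z1k.T + A2 @ z2.T).T @ W_K.T
    return q, k

q_pos, k_pos = sample(match_z1=True)
q_neg, k_neg = sample(match_z1=False)
delta = (np.einsum('ni,nj->ij', q_pos, k_pos)
         - np.einsum('ni,nj->ij', q_neg, k_neg)) / n

# Rank estimate: singular values capturing 99% of the squared Frobenius norm.
s = np.linalg.svd(delta, compute_uv=False)
cum = np.cumsum(s**2) / np.sum(s**2)
rank_z1 = int(np.searchsorted(cum, 0.99) + 1)
```

With enough samples, the shared-$z_2$ terms cancel between the two conditions and `rank_z1` recovers $r_1$; the top singular vectors of `delta` give the corresponding query- and key-side bases.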
The top-$r_1$ singular vectors $U_{(z_1)}^{[:r_1]}$ and $V_{(z_1)}^{[:r_1]}$ give bases in query and key space, respectively, that encode $z_1$.

Appendix C Related Work

Since the adoption of attention modules in neural NLP models (Bahdanau, 2014; Sukhbaatar et al., 2015), researchers have sought to better understand them. Often, attention patterns themselves are used as an explanation of a neural network's behavior (Li et al., 2016; Xu et al., 2015; Lee et al., 2025). This practice is not without contention: for instance, Jain and Wallace (2019) claim that "attention is not explanation," carefully studying the relationship between attention weights and model outputs and finding low correlation between attention weights and feature importance. On the other hand, Wei et al. (2022) push back, suggesting that under certain conditions attention scores can provide meaningful interpretations. While attention patterns themselves may provide insight into a neural network's behavior, this begs the question: why did the model attend to this token? A growing line of work thus studies the inner mechanisms of attention, approached from multiple angles. Similar to our work, some researchers have studied the QK space of attention heads to understand why a certain token is attended to, in both language and vision models. In language, many such works leverage features learned by sparse autoencoders (SAEs). Kamath et al. (2025), Ge et al. (2024), and Friedman et al. (2025) decompose activations into SAE features and study aligned features at the query and key positions. Alternatively, researchers have used features recovered by training linear probes to observe how features interact in QK space. Gurnee et al. (2026) study the mechanisms underlying a character-counting task, in which the model implicitly decides to produce a newline character when an implicit character limit is reached.
By training probes for line widths and character counts, they demonstrate that the two features interact in QK space. Lastly, Wynrow and Sharkey (2024) learn a sparse mask in QK space to detect matching features. Unlike prior work, our method relies neither on features from trained sparse autoencoders or probes, nor on any training, to retrieve QK features. In vision, Pan et al. (2024) and Doshi et al. (2026) apply SVD to query-key interactions to find QK features, such as channels communicating positional or content information, while Li et al. (2025) study how vision models bind tokens belonging to the same entity via bilinear probes trained in QK space. Attention is often viewed as a "communication channel" that allows the model to exchange information from one token to another (Elhage et al., 2021). Merullo et al. (2024) study attention heads that likely "talk" to one another by decomposing attention weights using SVD and searching for aligned singular vectors across heads. Ahmad et al. (2025) extend this to include additional components (e.g., MLPs), showing low-rank subspaces that can be viewed as units of a computational circuit. Barbero et al. (2025) study communication channels in the rotary positional encodings of attention heads. Perhaps most related to our work is Franco and Crovella (2025), who similarly look for low-rank structure in attention heads that is critical for upstream usage in a circuit (i.e., a computational graph). Note that many of the works described above entail decomposing model weights or activations. While sparse autoencoders have been a popular choice of decomposition, other unsupervised methods include Neighbor Distance Minimization (Huang and Hahn, 2025), which may also be a suitable tool for decomposing QK spaces. Lastly, researchers have also studied attention by visualizing feature interactions.
Early works often visualized attention patterns over individual inputs as bipartite graphs (Liu et al., 2018; Strobelt et al., 2018; Vig, 2019) or heatmaps (Aflalo et al., 2022; Hoover et al., 2020; Kovaleva et al., 2019; Nanda et al., 2023), while subsequent work visualized the joint embedding space of keys and queries using PCA or UMAP to uncover global patterns of attention (Yeh et al., 2023).

Appendix D Training Details for Toy Model

Table 1 provides the hyperparameters used for training the toy model described in Section 4. We train until validation loss has not improved for 5 consecutive validation checks, where validation is performed every 200 training batches.

Table 1: Hyperparameters used for training the toy model.

Hyperparameter          Value
d                       32
d_head                  8, 16
Batch size              256
Learning rate           0.0001
Weight decay            0.01
Validation batches      20
Validation batch size   512
Validation patience     5

Appendix E Review of Binding Mechanisms

Here we review binding mechanisms from prior work (Feng and Steinhardt, 2023; Dai et al., 2024; Prakash et al., 2025; Gur-Arieh et al., 2025). As a running example, consider a set of prompts that contain multiple pairs of entities that are grouped together (e.g., boxes containing objects), followed by a query regarding one of the entities:

"The apple is in Box O. The banana is in Box Z. …(omitted)… Which box is the banana in? Answer: Box"

Assume we have $n$ entity-box pairs. We refer to each pair as an entity group, denoted $(e_g, b_g)$ with entity $e_g$ and box $b_g$, for $g = 1, \dots, n$. How does the model answer this prompt? To our knowledge, Feng and Steinhardt (2023) were the first to suggest that models use "binding IDs": entities belonging to the same group are "tagged" with the same binding ID, which the model uses to associate the two entities when queried later. Prakash et al. (2025) and Dai et al. (2024) further study similar settings and suggest that models assign "order-IDs" to entity groups based on their positions: the first entity group is assigned the first order-ID, the second group the second order-ID, and so on. When queried about an entity, the model retrieves the entity group associated with the corresponding order-ID. Finally, Gur-Arieh et al. (2025) show that order-IDs are not the only "tags" used by models: they can also deploy "lexical" and "reflexive" tags to bind entities belonging to the same group. To summarize, we outline these three mechanisms of binding below:

Order-ID (positional) mechanism. The positional mechanism retrieves the answer based on the group index $g$. When queried about an entity $e_{g^*}$, the model uses the group index $g^*$ (e.g., "the third group") to fetch the corresponding box $b_{g^*}$. Put differently, it assumes an intermediate variable $Z_{pos}$ that encodes $g^*$ and retrieves the box associated with that index, regardless of the actual entity:

Order-ID: $Z_{pos} = g^* \;\Rightarrow\; \hat{b} = b_{Z_{pos}}$,

where $\hat{b}$ is the retrieved box token.

Lexical mechanism. The lexical mechanism retrieves the answer using the identity of the queried entity. This is perhaps the most intuitive, "correct" mechanism. When queried about an entity $e_{g^*}$, it assumes an intermediate variable $Z_{lex}$ that encodes the entity identity, and retrieves the box from the group whose entity matches this identity:

Lexical: $Z_{lex} = e_{g^*} \;\Rightarrow\; \hat{b} = b_g$ such that $e_g = Z_{lex}$.

Reflexive mechanism. The reflexive mechanism retrieves the entity group based on the target box itself. Informally, it assumes an intermediate variable $Z_{ref}$ that encodes the target box, suggesting that the model has already solved the query at an earlier computation step:

Reflexive: $Z_{ref} = b_{g^*} \;\Rightarrow\; \hat{b} = b_g$ such that $b_g = Z_{ref}$.
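To make the three mechanisms concrete, here is a hypothetical toy illustration (our own code, not the paper's) of how each intermediate variable would resolve the running example:

```python
# Toy context: ordered (entity, box) groups, as in the running example.
groups = [("apple", "O"), ("banana", "Z"), ("pear", "Q")]

def order_id_answer(g_star):
    # Z_pos = g*: fetch the box at group index g*, ignoring entity identity.
    return groups[g_star][1]

def lexical_answer(entity):
    # Z_lex = e_{g*}: fetch the box of the group whose entity matches.
    return next(b for e, b in groups if e == entity)

def reflexive_answer(box):
    # Z_ref = b_{g*}: the target box is already encoded; retrieve it directly.
    return next(b for _, b in groups if b == box)
```

All three agree on a well-formed query ("banana" sits at group index 1, in Box Z), but only the lexical mechanism would remain correct if the groups were reordered.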
Appendix F Additional Results

Here we provide additional results.

F.1 Additional Results on Toy Model

Figure 11 shows the ground-truth ranks versus the ranks recovered by our method on additional models and tasks. Figures 14 and 15 show results from causal interventions on additional models. Note that as the ranks of the two latent variables approach the number of attention head dimensions ($r_1 + r_2 \approx d_{head}$), the performance of the random baseline increases, because at that point we are completely swapping out $k_{i_{orig}}$ for $k_{i_{targ}}$. Figure 13 shows the interactions between the two latent variables in QK space when trained on our first task variant, i.e., discrete latent variables. Interestingly, unlike the continuous case, we no longer see symmetry in the interactions.

F.2 Additional Results on Semantic Categories and Binding Features

Figure 16 shows the effective ranks of $\Delta C_{order}$ and $\Delta C_{Lex}$ versus the number of entities used in constructing $\Delta C$. While each head uses a different number of ranks, the effective ranks plateau once enough entities are used. Figure 17 shows the PCA of keys and queries without projecting onto our recovered order-ID and lexical subspaces. This reveals that order-ID is embedded in the first few principal components (PCs). While order-ID happens to have rank ≤ 3 and thus can be captured with the first 3 principal components, PCA alone cannot tell us the rank of QK features. Furthermore, PCA alone cannot tell us where other features (e.g., lexical) are encoded, unless one enumerates all possible PC combinations. Figure 18 shows causal intervention results on additional attention heads that attend to the correct binding entity (see Section 5.2).

Appendix G Qwen3-4B Results

Figure 19 provides causal interventions on Filter Heads of Qwen 3-4B-Instruct. Figure 20 provides causal intervention results for binding features.
Figure 11: Contrastive QK decomposition recovers the expected rank of each latent variable, as long as there is no superposition (i.e., $r_1 + r_2 \le d_{head}$). Each cell annotates the recovered ranks $r_1, r_2$, while the x- and y-axes indicate the expected ranks. The color of each cell indicates the difference between expected and recovered ranks.

Figure 12: PCA of latent variable subspace (second task variant). The second toy task variant uses Gaussian hyperspheres as latent keys $s_1, s_2$, which are recovered by our method.

Figure 13: Interactions between latent variables in QK space for models trained on discrete latent variables. Note that unlike the task with continuous latent variables (Figure 5), we do not see symmetric interactions in this case.

Figure 14: Additional results for causal interventions on our toy model, for an attention head with $d_{head} = 16$.

Figure 15: Additional results for causal interventions on our toy model, for an attention head with $d_{head} = 8$.

Figure 16: Effective ranks vs. number of entities used in constructing $\Delta C$. While each head uses a different number of ranks, the effective ranks plateau after enough entities.

Figure 17: PCA of keys and queries directly, before projecting onto our recovered QK subspaces. Applying PCA to the keys and queries reveals that order-ID is encoded in the first few principal components (PCs). While order-ID happens to have rank ≤ 3 and thus can be captured with the first 3 principal components, PCA alone cannot tell us the rank of QK features. Furthermore, PCA alone does not localize where other features (e.g., lexical) are encoded.

Figure 18: Causal intervention results on additional binding heads.

Figure 19: Causal intervention results for Filter Heads on Qwen3-4B.

Figure 20: Causal intervention results for binding on Qwen3-4B.