Paper deep dive
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
Andres Saurez, Yousung Lee, Dongsoo Har
Models: GPT2-Small, LLaMA3-8B, LLaMA3.2-3B, Mistral-7B
Abstract
Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations -- yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the Invariant Subspace Necessity theorem and derive the Self-Reference Property: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
Tags
Links
- Source: https://arxiv.org/abs/2602.09783
- Canonical: https://arxiv.org/abs/2602.09783
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:53:47 AM
Summary
The paper introduces the 'Invariant Subspace Necessity' theorem, which posits that linear interpretability methods (like probes and SAEs) succeed in transformers because the architecture forces semantic features to be communicated through context-invariant linear subspaces. It further derives the 'Self-Reference Property,' demonstrating that tokens themselves provide the geometric directions for their associated features, enabling zero-shot semantic identification without labeled data.
Entities (5)
Relation Signals (3)
Linear Communication Interfaces → necessitates → Invariant Subspace
confidence 97% · any semantic feature decoded through such an interface must occupy a context-invariant linear subspace
Self-Reference Property → enables → Zero-shot identification
confidence 96% · enabling zero-shot identification of semantic structure without labeled data or learned probes
Transformer → uses_interface → Linear Communication Interfaces
confidence 95% · transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices)
Cypher Suggestions (2)
Map the relationship between model architectures and their communication interfaces. · confidence 95% · unvalidated
MATCH (a:Architecture)-[:USES_INTERFACE]->(i:Interface) RETURN a.name, i.name
Find all interpretability methods linked to the Invariant Subspace Necessity theorem. · confidence 90% · unvalidated
MATCH (m:Method)-[:EXPLOITS]->(s:Subspace {type: 'Invariant'})<-[:NECESSITATES]-(t:Theorem {name: 'Invariant Subspace Necessity'}) RETURN m
Full Text
66,781 characters extracted from source content.
Expand or collapse full text
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints Andres Saurez Yousung Lee Dongsoo Har Abstract Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations—yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the Invariant Subspace Necessity theorem and derive the Self-Reference Property: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation in eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders. Machine Learning, ICML 1 Introduction Linear structure has become a central organizing principle in modern interpretability for transformers (Vaswani et al., 2017). Linear probes recover semantic attributes from hidden states (Alain and Bengio, 2016; Belinkov, 2022), sparse autoencoders (SAE) identify interpretable feature directions (Bricken et al., 2023; Cunningham et al., 2023), and single-vector activation steering reliably modifies model behavior (Turner et al., 2023; Zou et al., 2023). Across a wide range of methods and objectives, simple linear operations repeatedly succeed at isolating meaningful internal structures. Figure 1: Context-invariant directional representation. 
The explicit token “France” provides a reference vector for this direction (self-reference), while contextual mentions such as “I went to Paris” and “I visited Marseille” share the same invariant direction. Yet this success raises a fundamental theoretical puzzle. Transformers are deep, highly nonlinear systems trained with simple objectives but exhibiting complex emergent behavior. Why, then, should their internal representations admit such simple and reliable linear access to semantic information? Recent theoretical work suggests that linear representations can emerge from the next-token prediction objective together with the implicit bias of gradient descent (Jiang et al., 2024). We provide a complementary explanation: transformers communicate information through linear interfaces—most notably the attention OV circuit and the unembedding matrix—and any semantic feature read out through such an interface must reside in a context-invariant linear subspace. While optimization determines how representations are learned, architecture constrains what form they must take. We formalize this claim in our central theoretical result, the Invariant Subspace Necessity theorem: whenever a semantic feature is decoded through a linear interface, its representation must lie in an invariant subspace shared across all contexts expressing that feature. This provides a principled explanation for why linear probes, sparse autoencoders, and direction-based steering are able to recover stable semantic structure. Crucially, if such an invariant subspace exists, how can we identify its direction in practice? A key derivation from it is the Self-Reference Property: tokens and expressions that encode a feature directly provide its geometric direction in activation space. 
For example, the token “France” does not merely instantiate the France concept, but serves as a reference vector for locating that concept in any representation, enabling zero-shot identification of semantic directions without labeled data or learned probes. We empirically validate these predictions across eight semantic classification tasks spanning taxonomic, affective, stylistic, linguistic, and descriptive domains, and across multiple model families (LLaMA3-8B (Grattafiori and others, 2024), Mistral-7B (Jiang et al., 2023), GPT2-Small (Radford et al., 2019), LLaMA3.2-3B (Meta AI, 2024)). Our results reveal consistent evidence of context-invariant directional structure, supporting the generality of our theory. In summary, our contributions are threefold: (1) We provide a principled architectural explanation for why linear interpretability methods succeed in transformers, showing that linear structure arises as a necessary consequence of linear communication interfaces. (2) We introduce the Self-Reference Property, establishing that tokens directly define directions for their associated features—this enables zero-shot identification of semantic directions and yields an unsupervised probe that classifies instances using only class token geometry. (3) We demonstrate convergent evidence for directional invariance: sparse autoencoders trained without class supervision recover features that align with class token directions, validating that both methods access the same underlying structure. 2 Related Work 2.1 Mechanistic Interpretability Mechanistic interpretability seeks to explain model behavior by identifying internal features, circuits, and causal mechanisms inside neural networks rather than relying solely on input–output correlations (Elhage et al., 2021; Nanda et al., 2023). 
Transformer models contain identifiable computational substructures implementing specific behaviors, and targeted interventions on internal activations can causally alter model outputs (Meng et al., 2022; Wang et al., 2022; Turner et al., 2023; Zou et al., 2023). Our work connects this literature to representation geometry by providing a theoretical account of the linear, context-invariant feature structure that underlies many of these observations. 2.2 Linear Representation Semantic attributes can often be captured by low-dimensional linear structure in representation spaces, dating back to linear regularities in word embeddings (Mikolov et al., 2013; Pennington et al., 2014). In modern LMs, linear structure is operationalized as measurement (linear probes that extract a property) and as intervention (steering representations along a direction) (Kim et al., 2018; Ravfogel et al., 2020). Park et al. (2024) formalize these intuitions, clarifying what “linear representation” means and how measurement and intervention relate geometrically. Most closely related, Jiang et al. (2024) explain linear representations through the next-token prediction objective and implicit bias of gradient descent, whereas we give an architectural account of why linear, context-invariant directions are necessary in models with linear communication interfaces. 2.3 Linear Probes Linear probes test what information is present in hidden states (Alain and Bengio, 2016; Hewitt and Manning, 2019; Tenney et al., 2019). A range of successful techniques instantiate this idea: the logit lens decodes intermediate representations using the unembedding matrix, revealing that hidden states already resemble next-token distributions (nostalgebraist, 2020), and the tuned lens extends this with layer-specific affine transformations (Belrose et al., 2023). 
Relatedly, linear directions identified by probes have been used not only for measurement but also for intervention, enabling activation steering of hidden states (Li et al., 2023; Rimsky and others, 2024). Recent evaluations further show that sparse autoencoder latents do not consistently outperform simple linear baselines on probing tasks (Kantamneni et al., 2025). Together, these results highlight the surprising effectiveness of linear readouts and interventions across diverse settings. However, they remain largely descriptive and do not explain why such linear decodability should arise in deep nonlinear models. Our theory addresses this gap. 3 Theoretical Framework: Identity-Projection We develop a geometric framework explaining how transformers encode semantic features. Our central result establishes that features communicated through linear interfaces must occupy invariant subspaces—a structural consequence of architecture—and we show that optimization dynamics favor low-dimensional realizations. 3.1 Preliminaries and Assumptions We consider decoder-only transformers with the following properties: Assumption 3.1 (Architectural Requirements). Let M be a transformer with L layers and hidden dimension d. We assume: 1. Additive Residual Stream. The residual stream updates additively: r^(ℓ+1) = r^(ℓ) + Δ^(ℓ), where Δ^(ℓ) is the output of layer ℓ. 2. Linear Communication Interfaces. The OV circuit in attention (W_O W_V) and the unembedding layer (W_U) act as linear maps on the residual stream. 3. Shared Parameters. The same parameters are applied across all token positions; optimization occurs over parameters shared across contexts. 4. Linear Output Layer. Token logits are computed via linear projection: logits = W_U r^(L). These assumptions hold for standard transformer architectures including GPT-2, LLaMA, and similar models. 3.2 Formal Definitions Definition 3.2 (Context).
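As a toy illustration of Assumption 3.1, the additive residual stream and the linear output layer can be sketched in NumPy; the dimensions and the random "layer outputs" below are arbitrary stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, V = 16, 4, 50  # toy hidden dim, layer count, vocab size (assumed)

# Additive residual stream: r^(l+1) = r^(l) + Delta^(l)  (Assumption 3.1.1)
r = rng.normal(size=d)
for layer in range(L):
    delta = rng.normal(scale=0.1, size=d)  # stand-in for a layer's output
    r = r + delta

# Linear output layer: logits = W_U r^(L)  (Assumption 3.1.4)
W_U = rng.normal(size=(V, d))
logits = W_U @ r
```

Because every layer only adds to `r` and the readout is a single matrix multiply, anything a downstream consumer extracts from `r` must be reachable through a linear map — the setting in which the theorems below apply.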
A context c = (x_1, …, x_n; i) consists of an input token sequence and a position of interest i. We write h^(ℓ)(c) ∈ ℝ^d for the hidden state at layer ℓ and position i in context c. Definition 3.3 (Semantic Feature). A semantic feature is a function f: C → Y mapping contexts to a value space Y. For classification tasks, Y = {y_1, …, y_K} is finite. Definition 3.4 (Communicable Feature). A semantic feature f is communicable if it satisfies both: 1. Multi-context: There exist distinct contexts c_1, c_2 with f(c_1) = f(c_2). 2. Linear decodability: The feature value is recoverable through a linear interface. For unembedding: there exists φ ∈ ℝ^|V| such that φ^⊤ W_U h(c) = g(f(c)) for all c ∈ C (1), for some function g: Y → ℝ. The multi-context requirement captures that meaningful features appear across different surface realizations (e.g., “France” and “the country of the Eiffel Tower” both express the France feature). Linear decodability captures that the feature must be extractable by downstream linear operations. Definition 3.5 (Invariant Subspace). A communicable feature f exhibits contextual invariance if there exists a subspace S_f ⊆ ℝ^d, determined by f and model parameters alone, such that for all contexts c expressing f, the f-relevant information in h(c) lies entirely within S_f. Definition 3.6 (Directional Invariance). A communicable feature f exhibits directional invariance if it exhibits contextual invariance with dim(S_f) = 1, i.e., S_f = span({d_f}) for some direction d_f ∈ ℝ^d. Figure 2: Linear readout layers constrain representation geometry. We train a transformer on modular division with an MLP classification head instead of linear unembedding. (Left) When the model finds a non-Fourier solution, embeddings lack circular structure and linear probes fail (∼20% accuracy). (Right) When the model discovers Fourier structure, linear probes succeed.
Across random seeds, linear probe accuracy correlates with whether Fourier structure emerges: Fourier representations are permitted but not required by MLP heads, whereas linear readout interfaces would necessitate such directional structure. 3.3 Main Theoretical Results Our first result establishes that invariant subspace structure is necessary for any feature communicated through linear interfaces. Theorem 3.7 (Invariant Subspace Necessity). Let M be a transformer satisfying Assumption 3.1, and let f be a communicable feature decoded through a linear interface W. Then there exists a context-invariant subspace S_f ⊆ ℝ^d such that the f-relevant component of h(c) lies in S_f for all contexts c expressing f. Proof sketch. By linear decodability, the f-relevant output is o_f(c) = w_f^⊤ h(c) for some w_f ∈ ℝ^d. Contexts requiring identical outputs differ only orthogonally to w_f, so f-relevant information lies in a subspace determined by w_f alone—independent of context. Full proof in Appendix A. ∎ 3.4 Capacity Constraints Force Factorized Representations The unembedding matrix W_U ∈ ℝ^{|V|×d} maps hidden states to logits over |V| tokens. With |V| ≫ d, tokens cannot occupy orthogonal directions—they must share structure. Proposition 3.8 (Capacity Constraint Implies Feature Sharing). Let M be a transformer with vocabulary size |V| and hidden dimension d, where |V| ≫ d. If (i) token logits are computed via linear readout logit_t = w_t^⊤ h(c), (ii) each context activates a sparse subset of features, and (iii) multiple tokens share semantic attributes, then the optimal representation factorizes as: w_t = Σ_{f ∈ F_t} α_{t,f} d_f (2), where {d_f} are shared feature directions with |F| ≪ |V|. Proof sketch. With |V| ≫ d, tokens must share directions. Sharing incurs interference only when tokens co-occur; for sparse features, this cost is minimal. Factorization achieves |F| dimensions plus sparse interference, versus the infeasible |V| dimensions for unique encodings.
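A minimal numeric sketch of the factorization in Proposition 3.8, with toy sizes chosen purely for illustration: token readout vectors are built as sparse combinations of a small set of shared feature directions, and the logit computed from w_t equals the sum of per-feature projections:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_features, n_tokens = 32, 8, 200  # |F| << |V|, toy sizes (assumed)

D = rng.normal(size=(n_features, d))            # shared directions d_f
A = rng.random((n_tokens, n_features))          # coefficients alpha_{t,f}
A *= rng.random((n_tokens, n_features)) < 0.3   # sparse feature sets F_t
W = A @ D                                       # w_t = sum_f alpha_{t,f} d_f

# Under factorization, logit_t = sum_f alpha_{t,f} (d_f . h(c)):
h = rng.normal(size=d)
logits_direct = W @ h
logits_factored = A @ (D @ h)
assert np.allclose(logits_direct, logits_factored)
```

The equality of the two logit computations is exactly the step the paper uses to hand Proposition 3.8 back to Theorem 3.7: each factor d_f is itself read out linearly, so it must be context-invariant.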
Under factorization, logit_t = Σ_{f ∈ F_t} α_{t,f} (d_f^⊤ h(c)), so each factor d_f must be linearly decodable and context-invariant—satisfying the conditions of Theorem 3.7. Full proof in Appendix B. ∎ Remark 3.9 (Implicit Classification Revisited). From this view, predicting a token is implicit classification: the model checks “does this context encode the factor combination for token t?” Each factor d_f partitions contexts into those expressing f versus those that do not. The compression pressure ensures factors are reusable across tokens, maximizing the number of features that can coexist in the representation. This compression pressure aligns with the information bottleneck principle (Tishby and Zaslavsky, 2015): representations must discard context-specific details while preserving task-relevant features, favoring low-dimensional factorized encodings. Corollary 3.10 (Directional Decomposition). If feature f exhibits directional invariance with direction d_f, then for all contexts c expressing f: h(c) = α_f(c)·d_f + η(c) (3), where α_f(c) ∈ ℝ is a context-dependent magnitude and η(c) ⟂ d_f captures the rest of the features. This decomposition is central: the direction d_f is invariant across contexts, while the magnitude α_f(c) varies. Linear interfaces preserve this structure: W(α·d_f) = α·(W d_f) (4). Figure 3: Token alignment validation of the Self-Reference Property across four datasets in LLaMA3-8B. Each point represents one attention head; the x-axis shows mean cosine similarity between class tokens and other-class implicit instances, while the y-axis shows similarity to same-class implicit instances. Points above the diagonal indicate stronger alignment with the correct class. Percentages indicate heads above the diagonal: Countries 91.5%, Animals 97.6%, Cartoon Characters 86.0%, Emotions 89.6%. 3.5 The Identity-Projection Operator We now introduce the operator that enables practical applications.
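The decomposition of Corollary 3.10 and the linearity property (4) can be checked numerically; the direction, magnitude, and residual below are synthetic stand-ins rather than real activations:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
d_f = rng.normal(size=d)
d_f /= np.linalg.norm(d_f)          # unit feature direction (toy)

alpha = 3.7                          # context-dependent magnitude (assumed)
noise = rng.normal(size=d)
eta = noise - (noise @ d_f) * d_f    # residual made orthogonal to d_f
h = alpha * d_f + eta                # h(c) = alpha_f(c) d_f + eta(c)  (Eq. 3)

# Projection onto d_f recovers alpha exactly, since eta is orthogonal:
assert np.isclose(h @ d_f, alpha)

# Linear interfaces preserve the structure: W(alpha d_f) = alpha (W d_f)  (Eq. 4)
W = rng.normal(size=(16, d))
assert np.allclose(W @ (alpha * d_f), alpha * (W @ d_f))
```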
Definition 3.11 (Identity-Projection Operator). For a context c and feature f with invariant direction d_f, the Identity-Projection is: I_f(c) ≜ h(c)^⊤ d̂_f (5), where d̂_f = d_f/‖d_f‖ is the unit direction. The Identity-Projection operator focuses on feature f within the representation h(c). By architectural necessity (Theorem 3.7), f flows through h(c) along direction d_f; projecting onto this direction extracts the feature’s signal from the superposition of all features present. Proposition 3.12 (Feature Operations). If feature f satisfies directional invariance, then: (i) Detection: I_f(c) > τ indicates f ∈ c for an appropriate threshold τ. (ii) Measurement: |I_f(c)| quantifies feature strength. Proof. From the decomposition (3): I_f(c) = α_f(c)‖d_f‖ + η(c)^⊤ d̂_f (6). When features are approximately orthogonal (η(c)^⊤ d̂_f ≈ 0), we have I_f(c) ∝ α_f(c), and the operations follow from linearity. ∎ 3.6 Linear Readouts Constrain Representation Geometry To validate Theorem 3.7, we compare representation geometry under linear versus nonlinear readout layers. If linear interfaces necessitate invariant subspace structure, then replacing the linear head with an MLP should relax this constraint. Setup. Prior work on grokking (Power et al., 2022) shows that transformers learning modular arithmetic develop Fourier representations—embeddings arranged in a circle where addition becomes rotation. Jiang et al. (2024) argue this linear structure emerges from training dynamics (gradient descent + cross-entropy) regardless of architecture. We test an alternative hypothesis: linear structure emerges because linear readout layers require it. Experiment. We train a 2-layer transformer on modular division (p = 97) with an MLP classification head instead of the standard linear unembedding. A separate linear probe is trained on the same hidden states (with gradients detached).
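A sketch of the Identity-Projection operator and the detection rule of Proposition 3.12, using synthetic activations in which contexts expressing f carry a component along d_f; the threshold τ and all magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 64
d_f = rng.normal(size=d)
d_f /= np.linalg.norm(d_f)  # unit direction \hat d_f (toy)

def identity_projection(h, d_hat):
    """I_f(c) = h(c)^T \hat d_f  (Definition 3.11)."""
    return h @ d_hat

# Synthetic contexts: those expressing f carry alpha = 2.0 along d_f,
# the rest are small residual noise.
pos = [2.0 * d_f + 0.1 * rng.normal(size=d) for _ in range(5)]
neg = [0.1 * rng.normal(size=d) for _ in range(5)]

tau = 1.0  # illustrative detection threshold (Proposition 3.12 (i))
assert all(identity_projection(h, d_f) > tau for h in pos)
assert all(identity_projection(h, d_f) < tau for h in neg)
```

Detection is just thresholding the projection; measurement uses its magnitude, which under near-orthogonal residuals is proportional to α_f(c).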
The MLP head achieves ∼95% validation accuracy, but the linear probe fails (∼20%). Fourier analysis (Fig. 2) confirms the embeddings lack circular structure—PCA shows scattered points rather than an ordered circle. Key finding. Across random seeds, some runs do develop Fourier structure (see Fig. 2)—and in exactly those runs, the linear probe also succeeds. This correlation demonstrates that Fourier representations are not inevitable: they emerge when the model happens to find that solution, but nonlinear readouts permit alternatives. Linear interfaces constrain representations to be directional; nonlinear interfaces permit but do not require such structure. 3.7 The Self-Reference Property A key question remains: how do we obtain d_f without supervision? Our central insight is that tokens themselves provide these directions. Theorem 3.13 (Self-Reference Property). Let t be a token with associated semantic feature f_t. If f_t is communicated through linear interfaces, then t’s representation provides the invariant direction for f_t: h_t ∝ d_{f_t}, where h_t is obtained by passing token t through the model. Proof sketch. By Theorem 3.7, f_t occupies an invariant subspace S_{f_t}. Under directional invariance (Proposition B.2), S_{f_t} = span({d_{f_t}}). Since t canonically expresses f_t, its representation must lie in S_{f_t}, hence h_t = λ_t d_{f_t} for some scalar λ_t. Full proof in Appendix A. ∎ Remark 3.14 (Tokens Provide Directions). The Self-Reference Property has a simple practical implication: tokens tell you where to look. To find the “France” feature in any representation, obtain the direction from the token “France” and project onto it. The feature is there—flowing through invariant directions by architectural necessity—and focusing on the right direction reveals it. Remark 3.15 (Why “Identity-Projection”).
The framework’s name reflects this self-referential structure: projecting onto the direction provided by token t focuses on t’s identity—the features that define t. The token serves as its own reference point. 3.8 Empirical Validation of the Self-Reference Property Theorem 3.13 predicts that token representations align with the invariant directions of their associated features. We test this by comparing each class token’s hidden state against the mean instance vector for each class, computed across attention heads. Specifically, for each attention head and class k, we compute the cosine similarity between the class token direction h_{t_k} and the mean instance direction h̄_k for the same class (within-class) versus other classes (between-class). If tokens provide feature directions as predicted, within-class similarity should consistently exceed between-class similarity. Figure 3 confirms this prediction: across 89.6%–98.6% of attention heads, the class token aligns more strongly with its own class instances than with any other class. This pattern validates the Self-Reference Property—tokens encode the same directional features present in contexts expressing those features, enabling zero-shot extraction of semantic directions. 4 Methods for Extracting Directions The Self-Reference Property (Theorem 3.13) has a practical consequence: since tokens provide feature directions and communicable features are linear (Theorem 3.7), a classifier built on token directions should generalize to instances it was never trained on. A probe anchored to the “France” or “South Korea” direction (which we call a class token or class prompt) should correctly classify implicit prompts like “the country of the Eiffel Tower” or “the country of the Han River” (which we call instance prompts)—not because it memorized this mapping, but because the France feature flows through the same invariant direction regardless of the surface form.
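The within-class versus between-class comparison described above can be sketched as follows. The "token" and "instance" vectors here are synthetic, constructed so that instances scatter around their class token — which is what the Self-Reference Property predicts for real attention-head activations:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_inst = 48, 20  # toy head dimension and instances per class (assumed)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical class-token states; instances scatter around their token.
tok = {k: rng.normal(size=d) for k in ("France", "Korea", "Japan")}
mean_inst = {k: np.mean([v + 0.5 * rng.normal(size=d)
                         for _ in range(n_inst)], axis=0)
             for k, v in tok.items()}

for k in tok:
    within = cos(tok[k], mean_inst[k])
    between = max(cos(tok[k], mean_inst[j]) for j in tok if j != k)
    assert within > between  # the above-diagonal pattern of Figure 3
```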
We compare three approaches that leverage this insight, each with different tradeoffs between simplicity and expressiveness. All three methods perform classification using the Identity-Projection operator (Definition 3.11), projecting the query onto normalized class directions. They differ only in how representations are transformed before projection. All probes operate on individual attention-head outputs, and we evaluate each attention head independently. Zero-Shot Probe. The most direct instantiation of the Self-Reference Property is to use class-token hidden states themselves as class directions, without any training. To isolate class-specific components, we mean-center the token representations to remove features shared across all classes (e.g., “being a country”): d̂_k = (h_{t_k} − h̄)/‖h_{t_k} − h̄‖, where h̄ = (1/K) Σ_{j=1}^K h_{t_j} (7). Classification is then performed by projecting instance representations onto these normalized directions. This procedure requires no training data. Unsupervised Probe. We train a lightweight linear probe that aligns instance activations with class-token directions using only class tokens. Given an instance activation h and class-token prototypes {c_k} from the same head, the probe learns a transformation W by minimizing the contrastive objective L = −Σ_k log[ exp((W h_{t_k})^⊤ c_k / τ) / Σ_j exp((W h_{t_k})^⊤ c_j / τ) ] (8). Crucially, W is trained using only class tokens—no instance labels are used. The Self-Reference Property predicts that this transformation should generalize: if tokens and instances share invariant directions, orthogonalizing tokens automatically orthogonalizes instances. Table 2 confirms this—the unsupervised probe matches or exceeds zero-shot performance despite never observing instances during training. Sparse Autoencoder Probe.
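A minimal sketch of the zero-shot probe (Eq. 7): mean-center the class-token states to strip shared features, normalize, and classify by Identity-Projection. The class-token vectors here are random stand-ins rather than real hidden states:

```python
import numpy as np

rng = np.random.default_rng(5)
d, K = 48, 4  # toy head dimension and number of classes (assumed)

# Stand-ins for class-token hidden states h_{t_k}.
H_tok = rng.normal(size=(K, d))

# Eq. (7): remove the shared mean, then normalize each class direction.
h_bar = H_tok.mean(axis=0)
D = H_tok - h_bar
D /= np.linalg.norm(D, axis=1, keepdims=True)  # rows are \hat d_k

def zero_shot_classify(h):
    # Identity-Projection; the query is NOT normalized, so projection
    # magnitude reflects feature strength.
    return int(np.argmax(D @ h))

# An instance built around class 2's direction is classified as class 2.
h_inst = 3.0 * D[2] + 0.2 * rng.normal(size=d)
assert zero_shot_classify(h_inst) == 2
```

No training data is involved: the class directions come entirely from the class tokens themselves, which is the point of the Self-Reference Property.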
We train sparse autoencoders (SAEs) on attention-head outputs to learn reusable latent features in an unsupervised manner (Cunningham et al., 2023; Bricken et al., 2023; Kissane et al., 2024). Given an attention output h ∈ ℝ^d, the encoder produces sparse latents z = TopK(W_enc h + b_enc) and the decoder reconstructs ĥ = z W_dec + b_dec. SAEs are trained only on implicit instance activations (no class tokens). Consistent with our theory, many SAE features activate for both class tokens and semantically related instances, indicating that SAEs recover the same invariant directions as token-based probes. Unified Classification. For all methods, let φ(·) denote the transformation (mean-centering, W, or the SAE encoder). Classification uses the Identity-Projection operator with respect to the chosen class vectors: k̂ = argmax_{k ∈ [K]} φ(h(c))^⊤ φ̂_k, where φ̂_k = φ(h_{t_k})/‖φ(h_{t_k})‖ (9). The query is not normalized, so projection magnitude reflects feature strength. Strong classification across all methods would confirm that the invariant structure from Theorem 3.7 is recoverable through multiple independent approaches—providing converging evidence for directional invariance and the Self-Reference Property.

Table 1: Summary of task-specific prompt datasets used for probing invariant feature directions across semantic, stylistic, linguistic, and affective domains.

Dataset | Classes | Instances/Class
Animals | 6 | 50
Countries | 5 | 39
Emotional Sentences | 6 | 60
Literary Quotes | 6 | 50
Cartoon Phrases | 6 | 50
Languages | 6 | 50
Fruits | 4 | 50
Companies | 4 | 50

5 Experiments

Model | Method | Animals | Countries | C. Chars | Authors | Langs | Emotions | Fruits | Companies
LLaMA3-8B | SAE | 92.05 | 75.60 | 54.28 | 70.29 | 93.92 | 38.32 | 64.14 | 77.32
LLaMA3-8B | Unsupervised | 94.32 | 79.60 | 61.29 | 75.13 | 87.80 | 53.52 | 73.64 | 81.32
LLaMA3-8B | Zero-Shot | 84.22 | 82.97 | 54.46 | 62.95 | 89.12 | 52.43 | 59.36 | 80.26
LLaMA3-8B | Text Output | 99.67 | 79.48 | 42.00 | 68.75 | 94.00 | 68.40 | 49.00 | 88.50
LLaMA3.2-3B | SAE | 78.81 | 82.43 | 53.09 | 49.29 | 92.63 | 35.76 | 61.26 | 72.63
LLaMA3.2-3B | Unsupervised | 90.04 | 78.47 | 45.69 | 59.05 | 90.98 | 54.59 | 63.46 | 76.26
LLaMA3.2-3B | Zero-Shot | 72.24 | 80.36 | 43.96 | 53.68 | 91.50 | 46.73 | 60.63 | 74.50
LLaMA3.2-3B | Text Output | 97.67 | 46.15 | 33.33 | 39.37 | 62.50 | 62.20 | 51.00 | 88.50
Mistral-7B | SAE | 82.10 | 74.48 | 49.95 | 52.69 | 97.62 | 43.64 | 56.52 | 75.72
Mistral-7B | Unsupervised | 85.98 | 88.55 | 52.33 | 54.51 | 97.86 | 51.34 | 62.04 | 84.51
Mistral-7B | Zero-Shot | 75.71 | 81.37 | 47.16 | 46.23 | 98.15 | 50.15 | 65.18 | 79.03
Mistral-7B | Text Output | 92.33 | 33.85 | 33.33 | 40.00 | 98.67 | 75.20 | 84.50 | 89.00
GPT2-Small | SAE | 26.65 | 39.62 | 20.21 | 15.73 | 84.02 | 17.69 | 30.83 | 52.59
GPT2-Small | Unsupervised | 26.29 | 56.67 | 28.14 | 21.76 | 83.37 | 21.32 | 37.81 | 63.33
GPT2-Small | Zero-Shot | 25.39 | 37.62 | 20.83 | 20.11 | 88.11 | 17.37 | 33.90 | 29.06
GPT2-Small | Text Output | 35.33 | 25.64 | 17.00 | 12.50 | 15.67 | 49.80 | 37.50 | 29.50

Table 2: Classification accuracy (%) across methods and models, showing the accuracy of the single best-performing (top-1) attention head for each method. Supervised probes were omitted due to their tendency to overfit. Text Output denotes a zero-shot conditional log-likelihood baseline that scores each class by the sum of token log-likelihoods given the prompt.

5.1 Datasets We evaluate across all attention-head outputs on eight classification tasks spanning diverse semantic domains (Table 1): taxonomic (Animals), geographic (Countries), affective (Emotional Sentences (Ghazi et al., 2015)), stylistic (Literary Quotes, Cartoon Phrases), linguistic (Languages (Artetxe et al., 2020)), and descriptive (Fruits, Companies). For each task, we obtained a list of implicit sentences that express the class without mentioning it (e.g., “the country of the Eiffel Tower”). The full description of the datasets can be found in Appendix D.1.
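The TopK sparse autoencoder used by the SAE probe in Section 4 can be sketched as follows; the weights here are random, untrained stand-ins, so the sketch shows only the shapes and the TopK sparsity mechanism, not learned features:

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, k = 32, 128, 8  # toy input dim, dictionary size, TopK sparsity (assumed)

W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_dec = np.zeros(d)

def topk(x, k):
    """Keep the k largest activations, zero the rest."""
    z = np.zeros_like(x)
    idx = np.argsort(x)[-k:]
    z[idx] = x[idx]
    return z

def sae(h):
    z = topk(W_enc @ h + b_enc, k)  # sparse latents z = TopK(W_enc h + b_enc)
    h_hat = z @ W_dec + b_dec       # reconstruction h_hat = z W_dec + b_dec
    return z, h_hat

z, h_hat = sae(rng.normal(size=d))
assert np.count_nonzero(z) <= k
```

After training on instance activations only, the encoder plays the role of φ(·) in the unified classification rule (Eq. 9).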
“The fruit associated with Newton’s discovery of gravity” → Apple The Fruits and Companies datasets test behavior under polysemy (Section 5.3). Our geometric analysis uses LLaMA3-8B, while classification results span four model families (Table 2) to demonstrate generalization. 5.2 Classification Results We do not aim to determine which method is best, but rather whether invariant directions alone support zero-shot and unsupervised classification. Table 2 presents classification accuracy across methods, models, and semantic domains. Several findings emerge. All methods achieve strong classification. Zero-shot, unsupervised, and SAE-based classification all distinguish between classes of similar semantic meaning, confirming that the invariant structure predicted by Theorem 3.7 is recoverable through multiple independent approaches. The shared success across methods that differ only in how directions are extracted—directly from tokens, via learned orthogonalization, or through sparse decomposition—provides converging evidence for directional invariance. Learned transformations improve separability. The unsupervised method consistently outperforms zero-shot classification, indicating that while token directions capture the correct semantic structure, learned orthogonalization improves class separability. This aligns with our motivation: raw token directions may share features (e.g., all country tokens encode “being a country”), and the contrastive loss learns to disentangle these. Figure 4: SAE shared peak analysis across classes. We compare top-k SAE dimensions of a class token with top-k SAE dimensions derived from its instances. Red markers denote shared dimensions, revealing shared invariant features between tokens and contexts. Figure 5: PCA and t-SNE projections of embeddings for the polysemous word Apple (fruit vs. company) using domain-specific tokens. (Left) A single ”Apple” token representing both classes (69% accuracy). 
(Right) Separate tokens “Fruit apple” and “Company Apple” treated as distinct classes (65.7% accuracy). Both methods successfully disentangle the two senses, with instances clustering around their respective class prototypes. SAEs recover communicable features. SAE-based classification achieves competitive performance, demonstrating that features learned through reconstruction objectives correspond to the communicable features defined in our framework. This validates Theorem 3.13 from a different angle: SAEs trained on diverse contexts learn directions that align with class token directions, confirming that both methods recover the same underlying invariant structure. Moreover, this establishes a connection between our framework and SAE interpretability—the Identity-Projection operator can serve as a tool for analyzing which SAE features correspond to semantically meaningful directions. Generalization across models. Consistent performance across LLaMA3-8B, Mistral-7B, GPT2-Small, and LLaMA3.2-3B suggests that directional invariance is not an artifact of a particular architecture or scale, but reflects a general property of transformer representations. 5.3 Polysemy as the Inverse of Partial Synonyms Polysemy provides a natural test case: “Apple” encodes both fruit and company meanings as superposed directions. Table 2 shows both Fruits and Companies datasets classify accurately despite sharing this token. To test explicit disentanglement, we combine both datasets and compare using a single “Apple” token versus distinct tokens (“Fruit apple”, “Company Apple”). Both strategies achieve comparable accuracy (Fig. 5; 69% vs. 65.7%), confirming that both meanings co-exist in superposition—context modulates magnitude, not selection. This geometric view inverts naturally: “Dog” and “Cat” are distinct tokens sharing a component along d_mammal—partial synonyms with respect to that feature.
Polysemy (one token, multiple directions) and partial synonymy (multiple tokens, shared direction) are two sides of the same coin. 5.4 Tokens Predict Feature Activation in SAEs We trained sparse autoencoders (SAEs) for each attention head, following the approach of Kissane et al. (2024), using only implicit prompts. Notably, class tokens were never introduced during training, ensuring that the SAEs learned to extract features purely from context, without direct supervision. Despite this, the SAEs retain enough structural information to reliably distinguish between classes (see Table 2). More striking is how much class instances share with their corresponding class vectors. Figure 4 illustrates this phenomenon in the animal dataset: the top-$k$ SAE dimensions of a single class token (selected by activation magnitude) overlap substantially with the top-$k$ most frequent SAE dimensions across its class instances. With $k = 32$, the intersection between these two sets is large: mammals share 15/32 features, fish 23/32, reptiles 25/32, and birds 22/32. This pattern is consistent across multiple datasets and holds as model size increases. Crucially, this high overlap indicates that a class token activates the same invariant feature subspace that repeatedly emerges across diverse contextual instances of that class. In other words, tokens and their contexts converge onto a shared set of stable semantic features, suggesting that tokens themselves can serve as powerful predictors of the semantic organization learned by SAEs and transformers. 6 Conclusion We have shown that linear interpretability in transformers is a structural consequence of architecture: the Invariant Subspace Necessity theorem establishes that any semantic feature communicated through linear interfaces must occupy a context-invariant subspace.
This complements prior work on training dynamics (Jiang et al., 2024)—architecture constrains what representations must look like, while optimization determines how they are learned. The Self-Reference Property offers a practical application: tokens directly encode their associated feature directions, enabling zero-shot semantic classification without labeled data. Conversely, instances that are not easily classifiable in this way may indicate a feature that has not collapsed into an invariant form. Experiments across eight domains and four model families confirm that zero-shot classification, a novel unsupervised probe, and SAE-based methods all recover the same directional structure. Limitations. We focus on features communicated through linear interfaces; features operating through nonlinear gating (e.g., QK routing) may not exhibit the same invariance. We also do not characterize when invariant subspaces collapse to single directions versus higher-dimensional structures. Future Directions. The Self-Reference Property suggests paths toward scalable, unsupervised circuit discovery, and the connection between SAE features and token directions offers new evaluation criteria for dictionary learning. Interpretability methods succeed not despite transformer complexity, but because of how that complexity is structured—transformers are interpretable by design. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. References A. Achille and S. Soatto (2018) Emergence of invariance and predictability in deep neural networks. Proceedings of the National Academy of Sciences 115 (2), p. E215–E224. External Links: Document Cited by: §B.1. G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644. Cited by: §1, §2.3. M. Artetxe, S. Ruder, and D.
Yogatama (2020) On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th annual meeting of the association for computational linguistics, p. 4623–4637. Cited by: §D.1, §5.1. Y. Belinkov (2022) Probing classifiers: promises, shortcomings, and advances. Computational Linguistics 48 (1), p. 207–219. External Links: Link, Document Cited by: §1. N. Belrose, T. Henighan, B. Mann, et al. (2023) Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112. Cited by: §2.3. L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. C. Stickland, T. Korbak, and O. Evans (2023) The reversal curse: LLMs trained on “A is B” fail to learn “B is A”. arXiv preprint arXiv:2309.12288. Cited by: Corollary B.5. T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2023/monosemantic-features/index.html Cited by: §1, §4. H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. Cited by: §1, §4. N. Elhage, N. Nanda, C. Olah, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread. External Links: Link Cited by: §2.1. N. Elhage, N. Nanda, C. Olah, et al. (2022) Toy models of superposition. Transformer Circuits Thread. External Links: Link Cited by: §B.1. D. Ghazi, D. Inkpen, and S. Szpakowicz (2015) Detecting emotion stimuli in emotion-bearing sentences. In International Conference on Intelligent Text Processing and Computational Linguistics, p. 152–165. Cited by: §D.1, §5.1. A.
Grattafiori et al. (2024) The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: §1. J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT, Cited by: §2.3. A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. (2023) Mistral 7B. arXiv preprint arXiv:2310.06825. Cited by: §1. Y. Jiang, G. Rajendran, P. Ravikumar, B. Aragam, and V. Veitch (2024) On the origins of linear representations in large language models. arXiv preprint arXiv:2403.03867. Cited by: §1, §2.2, §3.6, §6. S. Kantamneni, J. Engels, S. Rajamanoharan, M. Tegmark, and N. Nanda (2025) Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681. Cited by: §2.3. B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (tcav). In ICML, Cited by: §2.2. C. Kissane, R. Krzyzanowski, J. Bloom, A. Conmy, and N. Nanda (2024) Interpreting attention layer outputs with sparse autoencoders. arXiv preprint arXiv:2406.17759. Cited by: §4, §5.4. K. Li, A. K. Ober, D. Bau, and M. Wattenberg (2023) Inference-time intervention: eliciting truthful answers from a language model. NeurIPS. Cited by: §2.3. K. Meng, D. Bau, A. Andonian, Y. Belinkov, and C. Olah (2022) Locating and editing factual associations in gpt. arXiv preprint arXiv:2202.05262. Cited by: §2.1. Meta AI (2024) Llama 3.2: revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. External Links: Link Cited by: §1. T. Mikolov, W. Yih, and G. Zweig (2013) Linguistic regularities in continuous space word representations. In NAACL-HLT, Cited by: §2.2. N. Nanda, T. Lieberum, J. Smith, et al. (2023) Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217. Cited by: §2.1. 
nostalgebraist (2020) Interpreting GPT: the logit lens. Note: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens (LessWrong blog post) Cited by: §2.3. K. Park, J. Lee, S. Kim, et al. (2024) The linear representation hypothesis. arXiv preprint arXiv:2311.03658. Cited by: §2.2. J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, Cited by: §2.2. A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra (2022) Grokking: generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177. Cited by: §3.6. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog. External Links: Link Cited by: §1. S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg (2020) Null it out: guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667. Cited by: §2.2. N. Rimsky et al. (2024) Steering llama 2 via contrastive activation addition. In ACL, Cited by: §2.3. I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical nlp pipeline. In Proceedings of ACL, Cited by: §2.3. N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. IEEE Information Theory Workshop. Cited by: §3.4. A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023) Steering language models with activation engineering. arXiv preprint arXiv:2308.10248. Cited by: Corollary B.4, §1, §2.1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1. K. Wang, N. Nanda, et al. (2022) Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593. Cited by: §2.1. A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R.
Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, et al. (2023) Representation engineering: a top-down approach to ai transparency. arXiv preprint arXiv:2310.01405. Cited by: Corollary B.4, §1, §2.1.

Appendix A Proofs of Main Results

A.1 Proof of Theorem 3.7 (Invariant Subspace Necessity)

Proof. Let $W \in \mathbb{R}^{m \times d}$ be the linear interface (either $W_U$ for unembedding or $W_O W_V$ for attention). By Definition 3.4, there exists a linear functional $\phi \in \mathbb{R}^m$ such that the $f$-relevant output is:

$$o_f(c) = \phi^\top W h(c) = w_f^\top h(c) \quad (10)$$

where $w_f = W^\top \phi \in \mathbb{R}^d$.

Step 1: Equivalence class structure. Partition contexts by their required $f$-output. For contexts $c_i, c_j$ requiring identical outputs ($o_f(c_i) = o_f(c_j)$):

$$w_f^\top h(c_i) = w_f^\top h(c_j) \implies h(c_i) - h(c_j) \in \ker(w_f^\top) = w_f^\perp \quad (11)$$

That is, contexts with the same feature value can differ arbitrarily in directions orthogonal to $w_f$, but must agree on their projection onto $w_f$.

Step 2: Decomposition. For any context $c$ expressing $f$, decompose:

$$h(c) = h_f(c) + \eta(c) \quad (12)$$

where $h_f(c)$ is the $f$-relevant component satisfying $w_f^\top h_f(c) = o_f(c)$, and $\eta(c) \in w_f^\perp$ captures $f$-irrelevant variation. This decomposition is unique given the constraint.

Step 3: Invariant subspace construction. The $f$-relevant components $\{h_f(c_i)\}$ must satisfy $w_f^\top h_f(c_i) = o_f(c_i)$ for all $i$. Define:

$$S_f = \mathrm{span}\{h_f(c) : c \text{ expresses } f\} \quad (13)$$

This subspace satisfies the required properties: 1. Context-invariance: $S_f$ is determined by feature $f$ and the linear interface $W$ (through $w_f$), not by any individual context. 2. Completeness: By construction, $S_f$ contains all $f$-relevant information from any context expressing $f$. 3. Minimality: The subspace is the span of $f$-relevant components, excluding $f$-irrelevant variation in $w_f^\perp$.
The key insight is that the linear interface $W$ acts as a bottleneck: to produce consistent outputs for the same feature value across different contexts, the model must encode $f$-relevant information in a subspace that projects identically through $w_f$. This subspace exists by construction and is independent of which specific contexts are considered. ∎

A.2 Proof of Proposition B.2 (Dimensional Bound)

Proof. By Theorem 3.7, there exists a context-invariant subspace $S_f$ containing all $f$-relevant information. We show that linear separability of $K$ feature values constrains $\dim(S_f) \leq K-1$.

Let $\{h_{y_k}\}_{k=1}^{K} \subset S_f$ be representative embeddings (e.g., class centroids) for each feature value $y_k$. For these to be linearly separable, they must be affinely independent: no $h_{y_k}$ lies in the affine hull of the others. The affine hull of $K$ points has dimension at most $K-1$. To see this, define centered representations:

$$\tilde{h}_{y_k} = h_{y_k} - \bar{h}, \quad \text{where } \bar{h} = \frac{1}{K} \sum_{k=1}^{K} h_{y_k} \quad (14)$$

These centered vectors satisfy $\sum_k \tilde{h}_{y_k} = 0$, so at most $K-1$ are linearly independent. Thus $\dim(\mathrm{span}\{\tilde{h}_{y_k}\}) \leq K-1$. Since linear separability requires the $f$-relevant subspace to distinguish all $K$ values, and this can be achieved in $K-1$ dimensions, the minimal invariant subspace satisfies $\dim(S_f) \leq K-1$.

For binary features ($K=2$), this bound yields $\dim(S_f) \leq 1$. Combined with the requirement that $S_f$ be non-trivial (the feature must be detectable), we have $\dim(S_f) = 1$, establishing directional invariance. ∎

Appendix B Capacity Constraints and Feature Factorization

B.1 Detailed Proof of Proposition 3.8

We provide a complete argument that capacity constraints in transformers force factorized representations with shared, invariant feature directions.

Proposition (Capacity Constraint Implies Feature Sharing, restated). Let $\mathcal{M}$ be a transformer with vocabulary size $|V|$ and hidden dimension $d$, where $|V| \gg d$. Under the following conditions: 1.
Linear readout: Token logits are computed via $\mathrm{logit}_t = w_t^\top h(c)$. 2. Sparse activation: Each context $c$ expresses a small subset of all possible features. 3. Shared features: Multiple tokens share semantic attributes. Then the optimal representation factorizes tokens as:

$$w_t = \sum_{f \in F_t} \alpha_{t,f}\, d_f \quad (15)$$

where $F_t$ is the feature set for token $t$, and $\{d_f\}$ are shared feature directions with $|F| \ll |V|$.

Proof. We proceed in four steps, establishing that factorization is optimal under capacity constraints when features are sparse.

Step 1: Dimensional constraint. The hidden dimension $d$ bounds the number of orthogonal directions available. With typical values $|V| \approx 50{,}000$ and $d \approx 4{,}000$, at most $d$ tokens can have mutually orthogonal representations in $W_U$. The remaining $|V| - d$ tokens must share directions with others, which creates potential interference in the readout.

Step 2: Interference cost depends on co-occurrence. Suppose tokens $t_1$ and $t_2$ share a direction $d$, i.e., both $w_{t_1}$ and $w_{t_2}$ have non-zero projections onto $d$. Interference occurs when both tokens are relevant predictions for a context $c$: the hidden state $h(c)$ cannot independently modulate the logits for $t_1$ and $t_2$ along their shared direction. The expected interference cost is:

$$\mathcal{L}_{\text{interference}}(t_1, t_2) \propto P(t_1, t_2 \text{ both relevant in } c) \cdot |w_{t_1}^\top d| \cdot |w_{t_2}^\top d| \quad (16)$$

For sparse features—those with low co-occurrence probability across the corpus—this interference cost is small.

Step 3: Factorization minimizes total representation cost. Consider the total cost of a representation as the sum of dimensional cost (number of directions used) and interference cost. Compare two strategies: • Unique directions: Each token $t$ receives a dedicated direction $d_t$. This eliminates interference but requires $|V|$ dimensions—infeasible when $|V| \gg d$.
• Factorized features: Tokens share $|F|$ feature directions, where each token $t$ is represented as a sparse combination of features $f \in F_t$. This requires only $|F|$ dimensions plus interference costs proportional to feature co-occurrence. When semantic features are sparse and $|F| \ll |V|$, the factorized strategy achieves lower total cost. Intuitively, “mammal” and “European” rarely co-occur as the dominant semantic features of a context, so tokens sharing these directions (e.g., “cat” and “France”) experience minimal interference despite their shared structure.

Step 4: Each factor is communicable and invariant. Under factorization, the logit computation for token $t$ decomposes as:

$$\mathrm{logit}_t = w_t^\top h(c) = \sum_{f \in F_t} \alpha_{t,f}\, d_f^\top h(c) = \sum_{f \in F_t} \alpha_{t,f}\, \phi_f(c) \quad (17)$$

where $\phi_f(c) = d_f^\top h(c)$ is the activation of feature $f$ in context $c$. For the model to correctly predict token probabilities, the hidden state $h(c)$ must encode each relevant feature $f$ such that $\phi_f(c)$ reflects the appropriate magnitude. This imposes three requirements on each feature direction $d_f$: 1. Linearly decodable: The feature activation is extracted via a linear projection $d_f^\top h(c)$. 2. Multi-context: The same direction $d_f$ is used across all tokens $t$ with $f \in F_t$ and all contexts $c$ where feature $f$ is relevant. 3. Context-invariant: The direction $d_f$ is fixed in $W_U$ and does not depend on $c$; only the magnitude $\phi_f(c)$ varies with context.

These three properties—linear decodability, multi-context usage, and context-invariance—are precisely the conditions of Theorem 3.7. Therefore, each semantic factor $d_f$ occupies an invariant subspace in the representation geometry. ∎

Connection to prior work. This result connects to the superposition hypothesis (Elhage et al., 2022), which demonstrates that neural networks exploit high-dimensional geometry to represent more features than dimensions when features are sparse.
Our contribution is to show that this superposition, combined with the linear readout constraint of transformers, forces the specific geometric structure of invariant feature directions. While Achille and Soatto (2018) establish that information-minimal representations are necessarily invariant, we characterize the form this invariance takes: fixed directions with context-dependent magnitudes.

B.2 Proof of Proposition B.1

Proposition B.1 (LayerNorm preserves directional invariance). Let $h(c) = \alpha_f(c)\, d_f + \eta(c)$, where $d_f \in \mathbb{R}^d$ is a context-invariant feature direction and $\alpha_f(c) \in \mathbb{R}$ is a context-dependent coefficient. If $d_f \notin \mathrm{span}(\mathbf{1})$, then there exists a context-invariant direction $\tilde{d}_f$ such that

$$\mathrm{LN}(h(c)) = \tilde{\alpha}_f(c)\, \tilde{d}_f + \tilde{\eta}(c) + \beta,$$

where $\tilde{\alpha}_f(c) := \alpha_f(c) / \sigma(h(c))$ and $\beta$ is the learned bias.

Proof. We analyze each component of Layer Normalization.

Step 1: Mean-centering. Define the mean-centering projection $\Pi_\perp := I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top$, where $\mathbf{1} \in \mathbb{R}^d$ is the all-ones vector. Applying this linear operator to our decomposition:

$$\Pi_\perp h(c) = \alpha_f(c)\, \Pi_\perp d_f + \Pi_\perp \eta(c).$$

Since $d_f \notin \mathrm{span}(\mathbf{1})$ by assumption, the projected direction $d_f' := \Pi_\perp d_f \neq 0$.

Step 2: Scalar normalization. LayerNorm divides by the standard deviation $\sigma(h(c)) > 0$:

$$\frac{\Pi_\perp h(c)}{\sigma(h(c))} = \frac{\alpha_f(c)}{\sigma(h(c))}\, d_f' + \frac{\Pi_\perp \eta(c)}{\sigma(h(c))}.$$

Since $\sigma(h(c))$ is a positive scalar, this operation preserves the subspace $\mathrm{span}(d_f')$.

Step 3: Learned affine transformation. LayerNorm applies elementwise scaling by $\gamma \in \mathbb{R}^d$ and adds bias $\beta \in \mathbb{R}^d$:

$$\mathrm{LN}(h(c)) = \frac{\alpha_f(c)}{\sigma(h(c))}\, (\gamma \odot d_f') + \gamma \odot \frac{\Pi_\perp \eta(c)}{\sigma(h(c))} + \beta.$$

Define the transformed feature direction $\tilde{d}_f := \gamma \odot d_f' = \gamma \odot \Pi_\perp d_f$. This direction depends only on $d_f$ and the learned parameter $\gamma$, and is therefore context-invariant.
LayerNorm maps the context-invariant subspace $\mathrm{span}(d_f)$ to a new context-invariant subspace $\mathrm{span}(\tilde{d}_f)$ via the fixed transformation $d_f \mapsto \gamma \odot \Pi_\perp d_f$. The feature remains affinely decodable: a linear probe along $\tilde{d}_f$ with appropriate bias recovers a signal proportional to $\alpha_f(c)$, scaled by $1/\sigma(h(c))$. ∎

B.3 Proof of Proposition B.2 (Dimensional Bound)

Proposition B.2 (Dimensional Bound). If $f$ takes $K$ linearly separable values, then $\dim(S_f) \leq K-1$. For binary features, this yields directional invariance: $S_f = \mathrm{span}\{d_f\}$.

B.4 Proof of Theorem 3.13 (Self-Reference Property)

Theorem (Self-Reference Property). Let $t$ be a token with associated semantic feature $f_t$. If $f_t$ is communicated through linear interfaces, then $t$'s representation provides the invariant direction for $f_t$: $h_t \propto d_{f_t}$, where $h_t$ is obtained by passing token $t$ through the model.

Proof. We establish the result in three steps.

Step 1: Existence of invariant subspace. By Theorem 3.7, any communicable feature $f_t$ is associated with an invariant subspace $S_{f_t} \subseteq \mathbb{R}^d$ that contains all $f_t$-relevant information. This subspace is determined by $f_t$ and the model parameters, independent of any particular context.

Step 2: Directional invariance for tokens. Under the assumption of directional invariance (which holds for binary features by Proposition B.2, and empirically for many semantic features), this subspace is one-dimensional: $S_{f_t} = \mathrm{span}\{d_{f_t}\}$, where $d_{f_t} \in \mathbb{R}^d$ is the unique direction (up to sign) representing $f_t$.

Step 3: Token as canonical expression. The token $t$ is a canonical expression of feature $f_t$—by definition, presenting $t$ to the model expresses $f_t$ maximally and unambiguously. Therefore, when $t$ is processed, the resulting representation $h_t$ must encode $f_t$. By the invariant subspace property, all $f_t$-relevant information lies in $S_{f_t}$.
Since $t$ canonically expresses $f_t$, the $f_t$-component of $h_t$ must be non-zero and lie entirely within $S_{f_t} = \mathrm{span}\{d_{f_t}\}$. Thus:

$$h_t = \lambda_t\, d_{f_t} + \eta_t$$

where $\lambda_t \neq 0$ is a scalar magnitude and $\eta_t \perp d_{f_t}$ captures features orthogonal to $f_t$. For tokens that primarily express a single dominant feature (e.g., “France” primarily expresses the France feature), $\eta_t$ is small relative to $\lambda_t\, d_{f_t}$, yielding $h_t \propto d_{f_t}$ to good approximation.

Consistency across contexts. For any context $c$ expressing $f_t$, the same invariant direction applies:

$$h(c) = \lambda_c\, d_{f_t} + \eta(c)$$

where $\lambda_c$ varies with context but $d_{f_t}$ remains fixed. This confirms that the token direction $h_t$ and context directions $h(c)$ share the same orientation, differing only in magnitude—exactly as predicted by directional invariance. ∎

B.5 Additional Theoretical Implications

Our framework yields several additional consequences for transformer interpretability:

Corollary B.3 (Attention Preserves Feature Identity). The OV circuit linearly transforms features while preserving their identity. If position $j$ encodes feature $f$ as $h_j = \alpha_f^{(j)}\, d_f + \eta_j$, the attention output at position $i$ is:

$$o_i^{(f)} = \underbrace{\Big(\sum_j a_{ij}\, \alpha_f^{(j)}\Big)}_{\text{context-dependent magnitude}} \cdot \underbrace{W_O W_V d_f}_{\text{linearly transformed direction}} \quad (18)$$

The QK circuit modulates how much of the feature transfers (magnitude); the OV circuit determines how the feature direction is transformed.

Proof. By linearity of the OV circuit:

$$o_i = \sum_j a_{ij}\, W_O W_V h_j \quad (19)$$
$$= \sum_j a_{ij}\, W_O W_V \big(\alpha_f^{(j)}\, d_f + \eta_j\big) \quad (20)$$
$$= \Big(\sum_j a_{ij}\, \alpha_f^{(j)}\Big) W_O W_V d_f + \sum_j a_{ij}\, W_O W_V \eta_j \quad (21)$$

The $f$-relevant component is the first term, with magnitude determined by the attention-weighted sum and direction determined by the linear transformation of $d_f$. ∎

Corollary B.4 (Distributional Influence).
Adding a feature direction $d_f$ to the hidden state shifts the output distribution predictably:

$$\Delta \mathrm{logit}_t = \lambda \cdot w_t^\top d_f \quad (22)$$

Tokens aligned with $d_f$ are boosted; anti-aligned tokens are suppressed. This provides the theoretical foundation for activation steering (Turner et al., 2023; Zou et al., 2023).

Proof. Let $h$ be the original hidden state and $h' = h + \lambda\, d_f$ the steered state. By linearity of unembedding:

$$\mathrm{logit}_t(h') = w_t^\top h' = w_t^\top h + \lambda\, w_t^\top d_f = \mathrm{logit}_t(h) + \lambda\, w_t^\top d_f$$

Thus $\Delta \mathrm{logit}_t = \lambda\, w_t^\top d_f$, which is positive when $w_t$ and $d_f$ are aligned. ∎

Corollary B.5 (Non-Bidirectionality). The invariant subspace for “A predicts B” is generally distinct from that for “B predicts A”:

$$S_{A \to B} \not\subseteq S_{B \to A} \quad \text{and} \quad S_{B \to A} \not\subseteq S_{A \to B} \quad (23)$$

This provides a geometric explanation for the reversal curse (Berglund et al., 2023).

Proof. The features “A predicts B” and “B predicts A” are distinct communicable features—they produce different outputs and are learned from different training examples. By Theorem 3.7, each occupies its own invariant subspace determined by that feature. Since these are different features, their subspaces need not coincide. Models trained on “A is B” learn $S_{A \to B}$ but have no reason to simultaneously learn $S_{B \to A}$, explaining the failure to infer reverse relations. ∎

Appendix C Additional Testing on Unsupervised Probes

C.1 Unsupervised Probes

We validate these properties through comprehensive visualization analysis, comparing embeddings before and after probe transformation using both PCA and t-SNE projections. These visualizations confirm that unsupervised probes learn transformations that improve class separability—instances cluster more tightly around their corresponding prototypes in the transformed space—while the prototype vectors themselves maintain interpretable directional structure aligned with semantic features. Figure 6: Zero-shot generalization of unsupervised probes.
Probes trained only on odd months (Jan, Mar, May, Jul, Sep, Nov) successfully classify held-out even months, demonstrating that the learned transformation captures generalizable temporal structure rather than memorizing training classes. Zero-shot generalization to unseen classes. PCA (top) and t-SNE (bottom) projections of month embeddings after unsupervised probe transformation. Training used only odd months; even months were held out. In both visualizations, held-out classes cluster tightly around their respective prototypes, demonstrating that the learned geometric structure generalizes without exposure to these classes during training. Figure 7: Zero-shot (left) and unsupervised probe (right) classification performance on the Animal dataset in Llama-3-8B. Both show similar accuracy distributions. Appendix D Supplementary Visualizations This section provides additional SAE-based analyses supporting the invariant feature sharing and head-level sparsity patterns discussed in the main paper. Across different datasets and models, we report (i) shared top-k SAE latent dimensions between instance-mean and class-prototype representations within individual heads, and (ii) unsupervised head-level classification performance using SAE latents. Figures are grouped by dataset and model to demonstrate the consistency of these patterns. Figure 8: Shared top-k SAE latent activations between instance-mean and class-prototype representations for the Apple and Orange classes in Llama-3-8B, highlighting class-consistent sparse features within individual attention heads. Figure 9: Unsupervised head-level classification performance on the Fruits task using SAE latent representations extracted from Llama-3-8B. Figure 10: Shared top-k SAE latent activations between instance-mean and class-prototype representations for the Apple and Microsoft classes in Mistral, indicating head-specific sparse features aligned with company semantics.
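The shared top-k computation behind these figures can be sketched as follows. This is a minimal illustration, not the paper's code: synthetic arrays stand in for SAE latent activations, and the helper names are assumptions.

```python
import numpy as np

def topk_indices(values, k):
    # Indices of the k largest entries (by magnitude already applied by caller).
    return set(np.argsort(values)[-k:])

def shared_peak_count(token_acts, instance_acts, k=32):
    """Overlap between the top-k SAE dimensions of a class token
    (by activation magnitude) and the top-k most frequently active
    dimensions across that class's instances."""
    token_top = topk_indices(np.abs(token_acts), k)
    # Frequency = how often each latent appears in an instance's top-k.
    freq = np.zeros(instance_acts.shape[1])
    for row in instance_acts:
        for idx in topk_indices(np.abs(row), k):
            freq[idx] += 1
    instance_top = topk_indices(freq, k)
    return len(token_top & instance_top)

# Toy SAE latents: 40 shared "class" dimensions drive both the class token
# and every instance, mimicking a shared invariant feature subspace.
rng = np.random.default_rng(1)
n_latents, n_instances, shared = 512, 100, 40
token = rng.normal(scale=0.1, size=n_latents)
token[:shared] += 5.0
instances = rng.normal(scale=0.1, size=(n_instances, n_latents))
instances[:, :shared] += 5.0

overlap = shared_peak_count(token, instances, k=32)
print(overlap)  # between 24 and 32 by construction (two 32-subsets of 40 dims)
```

Because the token's top-32 and the instances' top-32 are both subsets of the same 40 boosted dimensions, the pigeonhole principle guarantees an overlap of at least 24, analogous to the 15–25/32 overlaps reported in Section 5.4.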
Figure 11: Unsupervised head-level classification performance on the Companies task using SAE latent representations extracted from Mistral. Figure 12: Shared top-k SAE latent activations between instance-mean and class-prototype representations for the Fear and Shame emotion classes in GPT-2. Figure 13: Unsupervised head-level classification performance on the Emotions task using SAE latent representations extracted from GPT-2. D.1 Dataset Explanation We constructed several datasets to evaluate where semantic attributes are encoded in LLMs and whether semantic features exhibit this linear invariance. Countries. We analyze all the world's countries, grouped by their respective continents. Languages. To evaluate cross-lingual representations, we employed the XQuAD dataset (Artetxe et al., 2020), a benchmark for cross-lingual question answering. We sampled 200 questions per language, covering English, Spanish, Russian, Hindi, German, and Mandarin Chinese. Emotions. We used a subset of the Emotion Cause dataset (Ghazi et al., 2015), which includes 1,594 English sentences annotated with seven emotion labels (fear, sadness, anger, happiness, surprise, disgust, and shame). We sampled 600 examples for evaluation. Cartoon Characters. To probe stylistic and character-specific features, we constructed a dataset centered on fictional characters with distinctive linguistic patterns. For each of six characters—Elmer Fudd, Foghorn Leghorn, Jar Jar Binks, Porky Pig, Scooby-Doo, and Yoda—we collected 50 iconic phrases from publicly available sources. Literary Authors. We obtained 20 book quotes per author—William Faulkner, Gabriel García Márquez, Ernest Hemingway, Edgar Allan Poe, Virginia Woolf, William Shakespeare, and Mark Twain—from Goodreads. Animals. This dataset includes animal species names categorized by biological class: mammals, invertebrates, birds, amphibians, reptiles, and fish. Fruits and Companies.
To test polysemy, we compiled a descriptive dataset where tokens like “Apple” belong to both the fruit and company categories, enabling evaluation of how context disambiguates shared representational directions. The company classes were “Apple”, “Google”, “Microsoft”, and “Amazon”, while the fruit classes were “Apple”, “Grape”, “Orange”, and “Banana”.
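As a worked illustration of this setup, the following toy sketch mimics the shared-token disambiguation of Section 5.3: random vectors stand in for model embeddings, the superposed “Apple” token carries both sense directions, and the prototype construction mirrors the “Fruit apple” / “Company Apple” split (all helper names and magnitudes here are illustrative assumptions, not the paper's code):

```python
import numpy as np

def nearest_prototype(x, prototypes):
    """Classify x by cosine similarity to per-sense prototype vectors."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(prototypes, key=lambda name: cos(x, prototypes[name]))

rng = np.random.default_rng(2)
d = 128
fruit_dir, company_dir = rng.normal(size=d), rng.normal(size=d)

# Superposition: the shared "Apple" token carries both sense directions;
# context modulates the magnitude of each component, not which one exists.
apple_token = fruit_dir + company_dir

def contextual_embedding(sense_dir, noise=0.3):
    # A context emphasizing one sense amplifies that component.
    return apple_token + 2.0 * sense_dir + noise * rng.normal(size=d)

prototypes = {"Fruit apple": apple_token + 2.0 * fruit_dir,
              "Company Apple": apple_token + 2.0 * company_dir}

fruit_ctx = contextual_embedding(fruit_dir)
company_ctx = contextual_embedding(company_dir)
print(nearest_prototype(fruit_ctx, prototypes),
      nearest_prototype(company_ctx, prototypes))
```

Despite both contexts containing the full superposed token, each is assigned to its sense-specific prototype, reflecting the paper's finding that context modulates component magnitude rather than selecting a single meaning.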