Paper deep dive
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel, Atticus Geiger, Mor Geva
Models: Gemma-2-2B, Llama-3.1-8B
Abstract
Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 12:31:27 AM
Summary
The paper introduces Mixture of Factor Analyzers (MFA) as an unsupervised method to decompose language model activations into local geometric regions, each defined by a centroid and a low-dimensional subspace. This approach addresses the limitations of global direction-based methods by capturing nonlinear and multi-dimensional conceptual structures. Experiments on Llama-3.1-8B and Gemma-2-2B demonstrate that MFA outperforms sparse autoencoders and other baselines in localization and steering tasks, providing a more interpretable and scalable unit of analysis for model control.
Entities (5)
Relation Signals (3)
Mixture of Factor Analyzers → appliedto → Llama-3.1-8B
confidence 100% · We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B
Mixture of Factor Analyzers → appliedto → Gemma-2-2B
confidence 100% · We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B
Mixture of Factor Analyzers → outperforms → Sparse Autoencoders
confidence 90% · MFA outperforms unsupervised baselines... and often achieves stronger steering performance than sparse autoencoders.
Cypher Suggestions (2)
Find all language models analyzed using the MFA method. · confidence 95% · unvalidated
MATCH (m:Method {name: 'Mixture of Factor Analyzers'})-[:APPLIED_TO]->(lm:LanguageModel) RETURN lm.name
Identify methods that are outperformed by MFA. · confidence 90% · unvalidated
MATCH (m:Method {name: 'Mixture of Factor Analyzers'})-[:OUTPERFORMS]->(other:Method) RETURN other.name
Full Text
107,040 characters extracted from source content.
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry Or Shafran 1 Shaked Ronen 1 Omri Fahn 1 Shauli Ravfogel 2 Atticus Geiger 3 Mor Geva 1 Abstract Activation decomposition methods in language models are tightly coupled to geometric assump- tions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear sepa- rability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian re- gions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region’s centroid in ac- tivation space, and the local variation from the centroid. We train large-scale MFAs for Llama- 3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsuper- vised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, ex- pressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture. 1. Introduction Disentangling the representations of language models (LMs) into causal interpretable units has been a hallmark of in- terpretability research (Sharkey et al., 2025; Geiger et al., 2024; Mueller et al., 2024). A growing consensus in re- cent years suggests global directions in activation space as candidate units, with many concepts empirically exhibiting 1 Blavatnik School of Computer Science and AI, Tel Aviv University, Israel 2 New York University, New York, NY, USA 3 Goodfire.Correspondence to: Or Shafran<or- davids1@mail.tau.ac.il>. Preprint. February 3, 2026. w 1 w 2 w 3 Mourning Crying Tears Laughed Nodded Smiles Shaking μ 4 μ 7 μ 6 μ 5 μ 4 μ 3 μ 2 μ 1 Anger Happiness & Sadness relaxation Exhaustion Surprise Intimacy Love MFA maps the activation space into Gaussian regions Each region is modeled with a low-dimensional subspace Smiles Shaking μ 4 μ 2 x Absolute positions and local directions act as features in localization and steering I am thinking about... +exhausted +tiring +drained +smile +happy +joyful -shake -shiver -shaking Figure 1. MFA decomposes each activation into a region as- signment and a within-region offset. Left: the region structure is modeled by Gaussian components (centroidsμ k ), with complex concepts typically spanning multiple Gaussians – here, the broader Emotions neighborhood is spanned by several interpretable Gaus- sians. Right: each component is equipped with a low-dimensional subspace that parameterizes structured within-region variation. linear structure (Ravfogel et al., 2020; Elhage et al., 2021; Gurnee et al., 2023; Nanda et al., 2023; Park et al., 2024). Consequently, much attention has been given to developing methods that disentangle activations into combinations of global directions (Yun et al., 2021; Bricken et al., 2023; Cunningham et al., 2023; Gao et al., 2025, inter alia). 
However, such decomposition methods are tightly coupled to strong geometric assumptions (Hindupur et al., 2025) that overlook growing evidence of representations with more complex geometrical structures (Cai et al., 2021; Chang et al., 2022; Engels et al., 2025; Park et al., 2025; Gurnee et al., 2026). Specifically, nonlinear or multi-dimensional concepts are dispersed across many global directions with no built-in structure relating them to one another, so recover- ing the concept requires post hoc assumptions about which directions form a single representation (Chanin et al., 2025; Engels et al., 2025). This limitation has recently driven a shift toward analysis units naturally modeled as subspaces rather than as isolated global directions (Sun et al., 2025; Huang & Hahn, 2025; Tiblias et al., 2025). Yet, while recent work shows meaning- ful geometric structure in activation space, how to turn these insights into practical tools for decomposition and steering remains an open challenge. In this work, we tackle this challenge through a local- geometry lens, building on evidence that LMs exhibit local 1 arXiv:2602.02464v1 [cs.CL] 2 Feb 2026 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry low-dimensional structure (Cai et al., 2021; Lee et al., 2025; Saglam et al., 2025). We propose a scalable, unsupervised method that partitions the activation space into regions and, within each region, learns a low-rank subspace that captures dominant modes of variation. Then, a given activation is decomposed into two compositional geometric objects: a region in activation space and a within-region offset (see Figure 1). To formalize the decomposition, we use classical statistical methods and employ Mixtures of Factor Analyz- ers (MFA; Ghahramani et al., 1996), a generative model that represents each region as a low-rank Gaussian distribution. We apply our approach to Llama-3.1-8B (Grattafiori et al., 2024) and Gemma-2-2B (Team et al., 2024), training MFAs with 1K, 8K, and 32K components. Analyzing the discov- ered regions reveals two classes: narrow Gaussians that concentrate on a constrained lexical pattern (e.g., the word “in” in varying contexts), and broad Gaussians that encom- pass wide thematic topics (e.g., movies or emotions). Broad Gaussians often exhibit semantic local variation, while nar- row Gaussians show more syntactic variance. Yet, with a larger number of components, Gaussians become narrower and their local variance differentiates based on context. Moreover, we observe that neighboring components tend to encode related semantics and, collectively, tile broader conceptual neighborhoods. These observations suggest that concepts may be realized not by a single component, but by constellations of nearby Gaussians that jointly cover a semantic, complex region. Next, we contrast the decomposition induced by MFA with that of sparse autoencoders (SAEs), the predominant dictio- nary learning method. We find that MFA yields a simple decomposition in which both the assigned region and the local variation are highly interpretable. In contrast, SAEs rely on a single dictionary of global directions. In our exper- iments we found that on average 75% of the active features were not directly interpretable from the context. Finally, we evaluate MFA’s decomposition as a practical tool, showing it outperforms existing disentanglement methods on localization and steering benchmarks. 
For localization, MFA outperforms large-scale SAEs and various supervised baselines, beating Desiderata-Based Masking (strong supervised baseline) (De Cao et al., 2020; 2022; Csordás et al., 2021; Davies et al., 2023; Chaudhary & Geiger, 2024) on 5 out of 8 tasks across models, and often being competitive with the state-of-the-art DAS (Geiger et al., 2023). On steering, utilizing MFA centroids steers better than SAE features in the majority of settings, typically exhibiting a twofold gain on coherence and conceptual alignment. Together, these results indicate that MFA's mixture structure supports both causal localization and controllable generation.

To conclude, we propose a local-geometry view of activation space, partitioning it into low-dimensional regions and modeling the intrinsic modes of variation within each region. This approach yields an interpretable decomposition and scales gracefully to thousands of subspaces. Empirically, MFA outperforms existing unsupervised and supervised baselines on localization and causal mediation benchmarks, positioning local subspace structure as a promising unit of analysis for understanding how LMs organize information. We release our code and trained MFAs at: https://github.com/ordavid-s/decomposing-activations-local-geometry.

2. Preliminaries and Notation

Factor Analysis (FA) FA is a statistical method that models observed data with a small number of latent factors that explain correlations between variables. Unlike standard PCA¹, FA is a generative probabilistic model, which defines a likelihood for the data and explicitly models noise. Intuitively, the model assumes that most correlations among observed dimensions arise from a few underlying factors, while the remaining variation is dimension-specific independent noise. This yields a low-rank approximation that captures shared structure without requiring a full-covariance model. Formally, we assume each observed sample x ∈ R^d is generated from the following generative model:

x = Wz + ε,  (1)

with latent factors z ∼ N(0, I) and noise term ε ∼ N(0, Ψ) with a diagonal matrix Ψ. The covariance of x is therefore C = WW^⊤ + Ψ, combining shared variation (via W) and independent noise (via Ψ).

Notably, W is invariant to orthogonal rotations. For any Q defining an orthogonal rotation, WQ and W induce an equivalent covariance C, since (WQ)(WQ)^⊤ = WW^⊤. Thus, FA identifies the low-rank subspace span(W), while the interpretation of individual axes depends on an additional rotation convention.

Mixtures of Factor Analyzers (MFA) MFA (Ghahramani et al., 1996) extends FA by allowing different regions of the representation space to express their own local geometry. Rather than a single FA modeling directions of global variation, MFA models the space as a collection of local low-dimensional Factor Analyzers². This property is useful when the data exhibits factors of variation unique to different regions in the observation space, such as those observed in the activation space of LMs (Cai et al., 2021; Lee et al., 2025; Saglam et al., 2025).

¹ While standard PCA is not generative, probabilistic PCA provides a closely related generative formulation, differing from MFA primarily in its noise model.
² MFA is a low-rank variant of GMMs, making it more efficient and providing a local low-dimensional structure.
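To make the FA generative model in Eq. 1 concrete, here is a minimal numpy sketch that samples from it and checks the two facts used above: the implied covariance C = WW^⊤ + Ψ, and the rotation invariance that leaves only span(W) identified. Dimensions, seeds, and variable names are illustrative and are not taken from the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R, n = 16, 3, 200_000                     # observed dim, latent rank, samples (illustrative)

W = rng.normal(size=(d, R))                  # factor loadings
Psi = np.diag(rng.uniform(0.1, 0.5, size=d)) # diagonal noise covariance

# Eq. 1: x = W z + eps, with z ~ N(0, I) and eps ~ N(0, Psi)
z = rng.normal(size=(n, R))
eps = rng.multivariate_normal(np.zeros(d), Psi, size=n)
X = z @ W.T + eps

# Implied covariance: C = W W^T + Psi
C_model = W @ W.T + Psi
C_empirical = np.cov(X, rowvar=False)
print("max |C_model - C_empirical|:", np.abs(C_model - C_empirical).max())  # small for large n

# Rotation invariance: W and WQ induce the same covariance, so only span(W) is identified
Q, _ = np.linalg.qr(rng.normal(size=(R, R)))  # random orthogonal matrix
assert np.allclose((W @ Q) @ (W @ Q).T, W @ W.T)
```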
To represent this, MFA introduces a discrete latent variable ω ∈ {1, ..., K} that indicates which FA component generated a sample. Each component k models a different region of the representation space by having its own mean μ_k, which sets the center of the region. After centering by μ_k, variability within that region is described by an FA model with a component-specific matrix W_k, which determines the orientation of the component's local low-dimensional subspace. The columns of W_k, commonly referred to as the loadings, describe how latent factors translate into changes in the local region of the observation space: each column corresponds to one factor, and its entries specify how much each observed dimension changes when that factor varies. The generative model is the same as for FA, with the addition of a component-specific mean which anchors the FA to a region of the observation space. Conditioned on ω = k,

x = μ_k + W_k z_k + ε,  (2)

which yields a component covariance:

C_k = W_k W_k^⊤ + Ψ.  (3)

Given all components, the overall density is

p(x) = Σ_k π_k N(x | μ_k, C_k),  (4)

where π_k is the mixture weight of component k. Each Gaussian therefore contributes according to how well its mean and subspace geometry explain the sample. See Ghahramani et al. (1996) for additional details.

3. Mapping the Activation Space with Mixtures of Factor Analyzers

We show how MFA can map regions of the activation space into a set of reusable and interpretable geometric units. These units reflect how the model organizes information in its latent space. Our approach is motivated by previous work (Coenen et al., 2019a; Cai et al., 2021; Lee et al., 2025; Saglam et al., 2025) showing that activations do not cover the entire activation space uniformly, but rather cluster semantically, where within-cluster variation is well approximated by a small number of degrees of freedom. Therefore, we seek a model that (i) partitions the activation space into coherent regions and (ii) captures the intrinsic low-dimensional directions of variation within each region. MFA satisfies both of these desiderata: it achieves (i) by learning a mixture over components and assigning each activation to components via posterior responsibilities, effectively carving the activation space into regions. Moreover, it attains (ii) as each component is a factor analysis model: a Gaussian whose covariance is parameterized by a low-rank subspace, so variation within that region is modeled along a small set of learned directions.

Initialization Given a set of activations X ⊂ R^d extracted from the residual stream at a fixed layer, we initialize an MFA with K components and latent rank R for each component. For simplicity, we use a uniform rank across components. This choice acts as a conservative approximation to the local intrinsic dimension of each region, capturing the dominant modes of variation while mitigating ill-conditioned loadings. We initialize the component means {μ_k}_{k=1}^K by running K-means on X and setting each μ_k to the corresponding cluster centroid. The mixture weights are initialized uniformly, π_k = 1/K for all k. For each component, we initialize the factor loadings W_k ∈ R^{d×R} with random values sampled from N(0, 1), and set the (component-shared) diagonal noise covariance to Ψ = I_D. We also experimented with other initializations. See additional discussion in §A.
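A minimal sketch of this initialization, assuming scikit-learn's KMeans for the centroid step; the function name, array shapes, and return convention are illustrative and do not correspond to the paper's released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_mfa(X, K, R, seed=0):
    """Initialization described above: K-means centroids for the means,
    uniform mixture weights, N(0, 1) loadings, shared identity noise covariance."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    mu = KMeans(n_clusters=K, random_state=seed).fit(X).cluster_centers_  # (K, d) centroids
    pi = np.full(K, 1.0 / K)                                              # uniform mixture weights
    W = rng.normal(size=(K, d, R))                                        # per-component loadings
    psi_diag = np.ones(d)                                                 # Psi = I, stored as its diagonal
    return mu, pi, W, psi_diag
```

In the paper, the K-means step is run on a 4M-activation subsample, and all parameters are then trained jointly on 100M activations under the mixture negative log-likelihood introduced next.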
Training Each mixture component k defines a Gaussian density over activations,

p(x | k) = N(x | μ_k, C_k),  (5)

with the same covariance as Eq. 3. The mixture weights {π_k}_{k=1}^K combine these component densities into the marginal likelihood

p(x) = Σ_{k=1}^K π_k p(x | k).  (6)

We learn the parameters θ = {μ_k, W_k, Ψ, π} by minimizing the negative log-likelihood with gradient descent:

L(θ) = −(1/B) Σ_{i=1}^B log( Σ_{k=1}^K π_k N(x_i | μ_k, C_k) ),  (7)

where B is the batch size. This objective allows us both to learn the clustering of the data and the local directions of variation together under one optimization problem.

Component Assignment To assign a given activation x ∈ R^d to its best fitting component, we inspect the likelihood of component k given the activation, normalized across all components. We denote this term as the activation's responsibilities, where the responsibility of component k for the activation x is computed using Bayes' theorem as

R_k(x) = p(k | x) = π_k N(x | μ_k, C_k) / Σ_i π_i N(x | μ_i, C_i).  (8)

These responsibilities assign each activation to the component whose local subspace best explains it, allowing us to express the activation as a mixture of the components.

Decomposing an Activation Each activation can be expressed using a dictionary of all the component means μ_k and loadings W_k. Specifically, for a given activation x ∈ R^d, we compute the component's latent coordinates using the posterior mean under FA (Ghahramani et al., 1996):

ẑ_k = Z_k (x − μ_k),  (9)
Z_k := (I_R + W_k^⊤ Ψ^{−1} W_k)^{−1} W_k^⊤ Ψ^{−1}.  (10)

Under Eq. 2, ẑ_k is the posterior-mean latent vector whose projection W_k ẑ_k best explains the residual x − μ_k. Given the generative assumption of FA (Eq. 1), the activation is represented by a scalar weight R_k(x) for each mean μ_k, and coordinates z_k for each axis w_{k,j} (the j-th column of W_k). Collecting these into matrices, the reconstruction can be written as a linear product:

x ≈ A b(x),  (11)
A := [ μ_1 | W_1 | ··· | μ_K | W_K ],  (12)
b(x) := [ R_1(x) ; R_1(x) ẑ_1(x) ; … ; R_K(x) ; R_K(x) ẑ_K(x) ],  (13)

where A ∈ R^{d×K(1+R)} is formed by concatenating along the column dimension; each μ_k ∈ R^d contributes one column and each W_k ∈ R^{d×R} contributes R columns. In contrast, b(x) ∈ R^{K(1+R)} is formed by concatenating along the row dimension; for each k, we append the scalar R_k(x) ∈ R followed by the length-R vector R_k(x) ẑ_k(x) ∈ R^R. Thus, each x is decomposed into activation coefficients of the shared dictionary of means and axes, and the entire reconstruction is a single matrix multiplication between the responsibilities and the components.

Global versus Local Decomposition Most activation decomposition methods treat the residual stream as governed by a single set of global directions. Eq. 11 instead induces a region-conditioned parameterization. Each activation is described by where it sits in activation space, via responsibilities over centroids (R_k(x) μ_k), and how it varies locally, via within-component coordinates (R_k(x) ẑ_k(x)). Since the factorization within a component is rotationally invariant (§2), the meaningful object is not any single loading vector, but the local subspace span(W_k). This motivates a shift in the unit of analysis, moving from isolated global directions to local regions with their own low-rank geometry. In the following sections, we use MFA to map the activation space of modern LMs.
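The assignment and decomposition steps above (Eqs. 8-13) reduce to a few linear-algebra operations per component. The sketch below computes responsibilities, posterior-mean latents, and the resulting reconstruction for a single activation; it forms each dense covariance C_k explicitly for clarity, whereas a large-K implementation would exploit the low-rank-plus-diagonal structure. Function and variable names are illustrative, not from the paper's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def decompose(x, mu, pi, W, psi_diag):
    """Decompose one activation x (shape (d,)) under an MFA with K components:
    responsibilities R_k(x) (Eq. 8), posterior-mean latents z_hat_k (Eqs. 9-10),
    and the reconstruction sum_k R_k (mu_k + W_k z_hat_k) (Eq. 11 in dictionary form)."""
    K, d, R = W.shape
    Psi = np.diag(psi_diag)

    # Component log-densities log N(x | mu_k, C_k) with C_k = W_k W_k^T + Psi (Eqs. 3, 5)
    log_p = np.array([
        multivariate_normal.logpdf(x, mean=mu[k], cov=W[k] @ W[k].T + Psi)
        for k in range(K)
    ])

    # Responsibilities via Bayes' theorem (Eq. 8), normalized in log space for stability
    log_r = np.log(pi) + log_p
    resp = np.exp(log_r - log_r.max())
    resp /= resp.sum()

    # Posterior-mean latent coordinates per component (Eqs. 9-10)
    z_hat = np.empty((K, R))
    for k in range(K):
        Wk = W[k]
        Zk = np.linalg.solve(np.eye(R) + Wk.T @ (Wk / psi_diag[:, None]),   # (I_R + W^T Psi^-1 W)^-1 ...
                             Wk.T / psi_diag[None, :])                      # ... W^T Psi^-1
        z_hat[k] = Zk @ (x - mu[k])

    # Reconstruction x ~ sum_k R_k(x) * (mu_k + W_k z_hat_k)
    x_hat = sum(resp[k] * (mu[k] + W[k] @ z_hat[k]) for k in range(K))
    return resp, z_hat, x_hat
```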
We train large-scale MFAs on residual-stream activations from Llama-3.1-8B (Grattafiori et al., 2024) and Gemma-2-2B (Team et al., 2024) (§4) and characterize the resulting regions and within-region varia- tion. We then show that globally nonlinear concepts are National sport teams National Security National Assembly National Register National Health Institutions μ k AB Fictional Reality Comedy Horror Fantasy Romance μ k Figure 2. Example MFA Gaussians in the activation space of Llama-3.1-8B, visualized in 3D using three loadings as axes. (Left) A broad region spanning multiple movie genres, where the load- ings separate genre-related themes. (Right) A narrow region cen- tered on the token National, where the loadings capture context- dependent usage. often expressed as neighborhoods of multiple nearby Gaus- sians, and contrast MFA’s decomposition with the global dictionary decomposition of SAEs (§5). Finally, we eval- uate MFA on causal localization and steering benchmarks, where it outperforms unsupervised baselines and remains competitive with supervised methods (§6). Steering with MFALetx∈R d be a hidden state from a layer at the last position of an input sequence. Fix an MFA component with centroidμ∈R d and loadingsW ∈R d×R , and letv∈R R be latent coordinates in the Gaussian’s local subspace. We define the following interventions: f μ (x) = (1− α)x + αμ(14) f w (x) = x + Wv(15) Hereα∈ [0, 1]controls how strongly we move toward the centroid andvcontrols the direction and magnitude of the offset from the centroid. We interpolate towardμbecause it is an absolute location in activation space. In contrast, the loadings parameterize within-region displacements (di- rections in the centered space aroundμ), so we apply them additively as an offset. 4. Activation Structures Discovered by MFA We train 12 MFAs on residual-stream activations from Gemma-2-2B (Team et al., 2024) and Llama-3.1-8B (Grattafiori et al., 2024). Specifically, we use layers 6 and 18 in Gemma, and layers 8 and 22 in Llama (approximately 1/3and2/3of each model’s depth), while varying the MFA scale withK ∈ 1K, 8K, 32Kand settingR = 10. Each MFA was trained on 100M activations from The Pile (Gao et al., 2020), initialized with K-Means on a random sample of 4M activations. For additional discussion on parameter choice, see §A. Analyzing the trained MFAs reveals rich structures in the activation space, with semantically coher- 4 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry ent nonlinear manifolds captured as a collection of diverse, locally linear regions. Discovered Regions and Within-Region Variation We observe substantial diversity across components. Some re- gions are narrow, concentrating probability mass on a highly specific set of tokens or contexts, while others are broad, spanning a more comprehensive theme (Figure 2). The learned local subspaces also differ in the types of varia- tion they capture. Mirroring observations from previous work (Coenen et al., 2019a; Simon et al., 2024; Park et al., 2025), within-region variation often reflects both semantic and syntactic differences. Some directions separate meaning and high-level content, while others track form and local structure, such as letter case and punctuation. To quantify the types of structures captured by MFA, we sample 50 Gaussians from every MFA (600 in total), and annotate them as broad or narrow and their loadings as semantic or syntactic. 
Labeling of a Gaussian is done based on the theme of 25 sampled contexts with high likelihood under the Gaussian (Eq. 5). To label loadings, we first compute each context's coordinates within the Gaussian's subspace (Eq. 9). For loading i, we collect the 12 contexts with the largest value of the i-th latent coordinate z_i, and separately the 12 contexts with the smallest value. Since the sign of a loading is arbitrary, we label the two extremes separately, as we find both ends to be interpretable. We obtain the labels using an automated pipeline based on GPT-5-mini (Singh et al., 2025), which was validated against labels by NLP graduate students using statistical testing (Calderon et al., 2025). Although a "no pattern" option was provided as part of the annotation task, it was rarely selected by either humans or the LLM. Thus, we omit it from the results. Statistical testing results, annotation instructions and model prompts are provided in §B and §G respectively.

Figure 3 presents the annotation results, showing the ratios of broad/narrow Gaussians and semantic/syntactic loadings stratified based on the Gaussian type. In both models, larger K increases the portion of narrow Gaussians, but the magnitude of this shift is model-dependent. In Gemma-2-2B, larger K shifts mass toward narrow Gaussians, whereas in Llama-3.1-8B the partition remains predominantly broad even at K=32,000. This suggests that different models induce different notions of similarity in their activation spaces; Gemma's activation space tends to cluster primarily by token type (narrow), whereas Llama's clusters are driven more by semantics (broad). Additionally, increasing K not only raises the fraction of narrow regions, but also increases the frequency of semantic loadings, suggesting that within-component variation becomes more context-dependent. Across all settings, narrow Gaussians skew more syntactic, while broad Gaussians skew more semantic.

Figure 3. Characterizing MFA regions. (a) Broad vs. narrow regions differ across model families (Gemma skews narrow/token-driven; Llama stays mostly broad/semantic). (b) Semantic vs. syntactic loadings, split by broad/narrow components, become more semantic as K increases, indicating more context-dependent within-region variation.

Multi-Gaussian Concepts Consistent with prior work (Coenen et al., 2019b; Wiedemann et al., 2019; Park et al., 2025) showing that transformer representation spaces exhibit semantic organization, MFA components form coherent semantic neighborhoods, where nearby Gaussians tend to correspond to related meanings. Moreover, globally nonlinear concepts are often expressed not by a single component but by a cluster of neighboring Gaussians. MFA enables extracting these structures at scale. By treating components as nodes in a neighborhood graph, we construct a kNN graph using Euclidean distance between centroids and traverse local neighborhoods via BFS from selected components. Figure 1 illustrates one such neighborhood. Although individual components specialize in a narrower topic, such as happiness or surprise, together they form a unified "emotions" theme. See more examples in §F.1.

5. MFA vs. Dictionary Learning

Both MFA and dictionary learning are generative models that decompose representations into components.
Ideally, such a decomposition should be “simple”—that is, each example should be explained by only a few interpretable components. To clarify how MFA’s decomposition differs from dictionary learning, we conduct a side-by-side com- parison with state-of-the-art SAEs. Global versus Local Decomposition We sample activa- tions from Gemma-2-2B and Llama-3.1-8B for Wikipedia inputs, and compare their decompositions by our 8K MFAs (§4) and by the Gemmascope/Llamascope SAEs (Lieberum et al., 2024; He et al., 2024). We decompose each acti- 5 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry Table 1. Centroid vs. loading token promotion. Representative top-promoted tokens under centroid (μ) vs. loading (w j ) inter- ventions. In the narrow region, the centroid promotes a general “National” theme, while loadings separate subthemes. In the broad region, the centroid promotes genres, and loadings refine it into subgenres/media-specific patterns. Term Top promoted tokens Genres Gaussian (Figure 2A, broad) μ thriller, horror, sitcom, fiction, romance, fantasy, comedy, novel w 1 fantasy, tale, RPG, adventure, realms w 2 sitcom, television, TV, series, show w 3 detective, espionage, spy, thriller, saga “National” Gaussian (Figure 2B, narrow) μ Association, Institute, Museum, Newspaper, Infantry, Championship, Organization w 1 Register, register, Historic, historic w 2 League, Football, Hockey, football, hockey w 3 Commission, Committee, Congress, caucus vationx ∈R d with MFA into its responsibilities vector b(x)and the corresponding MFA componentsAsuch that x ≈ Ab(x)(Eq. 11). Similarly, for an SAE with hidden dimensionNwe encodexinto its SAE activations,a∈R N and the corresponding SAE featuresF ∈R N×d such that x ≈ aF. For each method, we collect components with nonzero activation or responsibility and plot the cumulative reconstruction path along the top three PCA directions, sort- ing vectors by magnitude. This visualization shows how each method incrementally assembles its reconstruction. Figure 4 visualizes the reconstruction by MFA (purple-blue) and SAEs (orange-yellow) for two representative examples. SAEs often use many features to reach the target, producing longer trajectories in PCA space. This reflects the dictionary learning geometry, SAEs represent an activation as a sparse sum of global directions, so the reconstruction is assembled via many incremental additions. In contrast, MFA trajec- tories consist of two segments. The first explainsxat the region level through its centroid, and the second explains the remaining local variation. Moreover, we qualitatively find that the decomposition is often causally interpretable as well. Centroid interventions (Eq. 14) often promote a broad semantic theme, while local offsets (Eq. 15) can produce more fine-grained shifts within the broader theme of the region. We provide examples in Table 1 for the broad and narrow Gaussians shown in Figure 2 and more in §F.2. Interpretability of DecompositionWe evaluate whether the decompositions by MFA and SAEs yield features that are coherent and human-interpretable, rather than artifacts of their training objectives. To this end, we take for each method 50 activations from The Pile (Gao et al., 2020) and Figure 4. MFA vs. SAE reconstructions. MFA reconstructs an activation by anchoring it to a region (centroid) and refining it with a region-specific direction, whereas SAEs reconstruct by accumulating many global dictionary features. 
Left: Llama-3.1-8B, layer 22; right: Gemma-2-2B, layer 18.

decompose them into parts as previously described. Then, we label each feature as interpretable or not in the context of the decomposition based on a reference feature description. MFA centroids are described as in §4. For the local term Wẑ, we do not interpret individual loadings in isolation, as a single direction does not necessarily correspond to a single concept. Instead, meaning is captured by the subspace as a whole and the local coordinate system it defines. Therefore, we label relative features as interpretable by asking NLP graduate student annotators to judge whether ẑ_k places x in a coherent local cluster in the Gaussian's latent space. We compare x's nearest neighbors to a within-component contrast set of farthest points, and mark it interpretable if the shared concept is strong among neighbors but absent (or much weaker) in the contrast set. For SAEs, we use feature descriptions from Neuronpedia (Lin, 2023). We define an SAE feature as interpretable if its description relates to the activation context. Since SAEs activate numerous features, we use an LLM judge for labeling and validate it with 100 annotations done by NLP graduate students, finding substantial agreement (κ = 0.61). For the prompt see §G.

Let x̂ = Σ_i v_i be the decomposition of x by a method, written as a sum of features. For MFA, we have v_1 = μ_k and v_2 = W_k ẑ_k, and for SAEs v_j = a_j f_j for each active feature j. We quantify the interpretability fraction (IF) of x̂ using the magnitude of each feature's contribution:

IF(x) = (Σ_{i∈I} ‖v_i‖_2) / (Σ_i ‖v_i‖_2),  (16)

where I is the set of features labeled interpretable. Across all settings, MFA achieves an average IF of 0.96 ± 0.2, indicating that most of the high-contribution features in its decomposition are interpretable, compared to 0.29 ± 0.2 for SAEs. This indicates that MFA decomposes activations into a small set of interpretable features, whereas SAEs rely on many features, most of which are not directly interpretable from their context.

Table 2. Localization performance of MFA versus unsupervised and supervised baseline methods on RAVEL and MCQA.

Model      Task       PCA   SAE   MFA   DBM   DAS
Gemma-2    Continent  70.0  71.7  85.7  69.7  84.2
Gemma-2    Language   56.0  58.9  64.0  58.0  69.7
Gemma-2    Country    53.0  56.1  60.0  65.0  67.7
Gemma-2    MCQA       77.9  64.9  80.2  82.1  92.0
Llama-3.1  Continent  74.0  70.6  81.6  78.1  81.1
Llama-3.1  Language   57.0  56.8  67.3  63.1  69.1
Llama-3.1  Country    54.0  57.6  62.8  60.4  64.1
Llama-3.1  MCQA       74.3  65.6  70.5  75.0  91.7

6. Evaluations

We evaluate MFA on localization and steering benchmarks, where it surpasses both unsupervised and supervised methods on localization and often outperforms SAEs on steering.

6.1. Localization

Experiment We evaluate MFA on the MIB benchmark (Mueller et al., 2025), using the published code for the MCQA (Wiegreffe et al., 2025) and RAVEL (Huang et al., 2024) settings. Both settings test a method's ability to isolate a causal variable in the model's computation and manipulate its behavior by intervening on that variable's representation. MCQA focuses on a positional pointer variable and tests if interventions reliably change the model's answers on multiple-choice questions, whereas RAVEL targets entity-level causal variables (Continent, Country, and Language). We train MFA using the MIB training split and report results on the validation split.
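Returning to the interpretability fraction of Eq. 16 above, the metric is simply a magnitude-weighted fraction over labeled features. A minimal sketch, assuming the interpretability labels have already been collected; the function name and input convention are illustrative.

```python
import numpy as np

def interpretability_fraction(features, interpretable_mask):
    """Eq. 16: share of the decomposition's total feature magnitude that comes
    from features labeled interpretable.
    features: sequence of vectors v_i whose sum reconstructs x_hat.
    interpretable_mask: boolean array, True where v_i was labeled interpretable."""
    norms = np.array([np.linalg.norm(v) for v in features])
    return norms[np.asarray(interpretable_mask)].sum() / norms.sum()

# For MFA the features are v_1 = mu_k and v_2 = W_k z_hat_k;
# for an SAE they are v_j = a_j * f_j for each active feature j.
```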
For causal localization, we utilize Desiderata-Based Masking (DBM) (De Cao et al., 2020; 2022; Csord ́ as et al., 2021; Davies et al., 2023) on top of MFA’s components, mirroring the way SAEs are evalu- ated on the benchmark. DBM learns a sparse mask over a method’s learned basis, selecting a sparse set that best aligns with the benchmark’s causal variable. The method is then evaluated using only the chosen basis vectors. We compare MFA to existing methods: PCA, SAEs (Bricken et al., 2023; Cunningham et al., 2023), DBM and DAS (Geiger et al., 2023), whose scores were taken directly from Mueller et al. (2025). We also ablate MFA on Gemma- 2-2B to identify whether the causal variables reside in the loadings or centroids. To this end, we restricted DBM’s candidate set to the centroids which removes its ability to utilize the loadings. For additional details see §E. Results Table 2 report accuracy on RAVEL and MCQA. On RAVEL, MFA outperforms PCA and SAEs by large mar- gins (3-16 points) and beats DBM in 5 out of 6 cases. More- over, MFA performs better on the Continent task than DAS, the current state-of-the-art supervised method and for Llama- 3.1-8B comes within two points on the rest of the tasks. For MCQA, MFA outperforms SAEs by up to 15 points and on Gemma, it also exceeds PCA and nearly matches DBM. Inspecting the ablation results, we find that utilizing only the centroids for RAVEL maintains performance (Continent 86, Language 64, Country 59), indicating that these vari- ables are captured primarily by regional information. In contrast, the same restriction substantially degrades MFA performance on MCQA (80%→39%), suggesting that more fine-grained variables require within-region variation that cannot be recovered from global position alone. Overall, MFA performs strongly on causal localization, con- sistently improving over the unsupervised baselines, while often being competitive with supervised methods. More- over, both the centroids and local covariance structures are important for isolating causal variables, with some only showing up as local variation. 6.2. Causal Steering Experiment We benchmark MFA against SAEs and su- pervised difference-in-means (DiffMeans) (Rimsky et al., 2024; Marks & Tegmark, 2024; Turner et al., 2024; Singh et al., 2024). For MFA, we intervene with the centroids (Eq. 14). For SAEs and DiffMeans, we use the standard additive intervention (addingαtimes the feature direction), as this was found most effective in Wu et al. (2025). For evaluation, we follow Wu et al. (2025), using the same prompts and 0–2 scoring rubric. Interventions are applied during model inference on the prompt: “<BOS> I think that ”. We sweep 15 values of the intervention coefficient α, and sample 8 completions per value. We then use GPT-4o- mini (OpenAI et al., 2024) to rate each completion on two axes: a concept score (alignment with the target concept) and fluency (coherence preservation). As in Wu et al. (2025), we report for each centroid/feature the highest scoring set of completions across the sweep and aggregate the concept and fluency scores with a harmonic mean as the final score. As some SAE features suppress rather than promote a concept, we also intervene with the negation, and report the better performing sign for each feature when aggregating over α. We provide steering parameters, intervention method ablations and the concept/fluency scores – where we see MFA has signficantly higher conceptual alignment – in §E. 
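The centroid intervention of Eq. 14 and the coefficient sweep described above can be sketched as a forward hook plus a scoring loop. The snippet below assumes a Hugging Face-style causal LM where the hooked module returns the residual stream as the first element of its output; judge_completion stands in for the GPT-4o-mini concept/fluency rubric, and all names, module paths, and generation settings are illustrative rather than taken from the paper's code.

```python
import torch

def centroid_steering_hook(mu, alpha):
    """Eq. 14: interpolate the last-position residual stream toward centroid mu."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = (1 - alpha) * hidden[:, -1, :] + alpha * mu.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def sweep(model, tokenizer, layer_module, mu, alphas, judge_completion, n_samples=8):
    """Sweep the intervention coefficient (the paper uses 15 values and 8 samples each),
    score completions on concept and fluency, and keep the best harmonic-mean score."""
    prompt = tokenizer("I think that", return_tensors="pt")
    best = 0.0
    for alpha in alphas:
        handle = layer_module.register_forward_hook(centroid_steering_hook(mu, alpha))
        try:
            outs = [model.generate(**prompt, do_sample=True, max_new_tokens=40)
                    for _ in range(n_samples)]
        finally:
            handle.remove()
        texts = [tokenizer.decode(o[0], skip_special_tokens=True) for o in outs]
        concept, fluency = judge_completion(texts)          # each on the 0-2 rubric
        score = 2 * concept * fluency / (concept + fluency + 1e-9)  # harmonic mean
        best = max(best, score)
    return best
```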
To calculate the concept score, we provide a concept descrip- tion for each centroid/feature. For MFA, we use the same procedure as in §4. For SAEs, we use descriptions from Neuronpedia (Lin, 2023), which are generated based on the 7 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry 618 Layer 0.0 0.5 1.0 Score Gemma-2-2B: CS + FL 822 Layer Llama-3.1-8B: CS + FL SAEDiffMeansMFA-32KMFA-8KMFA-1K Figure 5. Steering results across layers in Gemma-2-2B and Llama- 3.1-8B of state-of-the-art SAEs, DiffMeans and 1K, 8K, 32K Gaussian MFAs. Across the majority of settings MFA significantly outperforms DiffMeans and SAEs. max-activating samples for each feature (Bills et al., 2023; Bricken et al., 2023; Choi et al., 2024; Paulo et al., 2024). DiffMeans is supervised and has labels by construction. We apply this evaluation to Llama-3.1-8B and Gemma-2-2B across two layers each for 250 randomly sampled features/- centroids. We compare the trained MFAs of three scales from §4 and the state-of-the-art publicly available SAEs, Llamascope and Gemmascope (He et al., 2024; Lieberum et al., 2024). To train DiffMeans, we use the provided SAE descriptions as target concepts and generate 72 activating and 72 neutral examples per feature. We then calculate the feature vector using the difference of the average token representation for each set (Wu et al., 2025). Results Figure 5 presents the results, showing that MFA outperforms SAEs and DiffMeans across most settings. On Gemma-2-2B it roughly doubles the median score, and on Llama-3.1-8B it improves the median by about one third, with Layer 8 bringing the average down as an exception. Notably, we observe no consistent gains from increasing MFA’s capacity. This suggests that higher capacity shifts centroids toward more specific concepts that remain well described by the Gaussian’s high-likelihood contexts. This aligns with §4, where complex structures are represented by multiple Gaussians, hinting that increasing capacity may primarily split broad concepts into additional Gaussians. These results highlight absolute positions learned by the centroids as an effective unit for steering, driven by higher concept scores, it significantly improves over both SAEs and DiffMeans in the majority of settings. 7. Related Work Geometric Structures in LMs Early work showed that language representations have rich geometric structure, in- cluding linear relations (Mikolov et al., 2013a;b; Levy & Goldberg, 2014; Pennington et al., 2014; Arora et al., 2016; Smilkov et al., 2016, inter alia). Later analyses showed that contextual LM geometry is more structured and context- dependent, with token embeddings forming usage-specific regions rather than a single global linear space (Coenen et al., 2019a; Ethayarajh, 2019). More recent work has found that modern LM representations span many directions globally but have low intrinsic dimension locally, consis- tent with a manifold hypothesis Lee et al. (2025); Saglam et al. (2025). Motivated by these results, we use MFA to model LM activation space as a collection of local regions and their low-dimensional directions of variation, enabling scalable feature discovery grounded in local geometry. To our knowledge, our work is the first to apply MFA to LMs. Feature Discovery in LMs SAEs have become the pre- dominant approach for unsupervised activation decompo- sition in LMs (Bricken et al., 2023; Cunningham et al., 2023), assuming activations are well modeled by a global sparse dictionary of directions. 
This view has limitations be- cause it treats directions as the basic unit, despite evidence that meaningful variation is often multi-dimensional and local (Lee et al., 2025; Saglam et al., 2025; Engels et al., 2025). Unlike SAEs, we tackle decomposition by explicitly modeling activation space as locally organized. We learn a collection of regions with low-dimensional structure, using subspaces as the basic unit of analysis. This view is further supported by recent work moving be- yond single directions to subspace-level structure. Sun et al. (2025) learned concept-conditioned low-rank subspaces for causal intervention, Huang & Hahn (2025) proposed an unsupervised objective that partitions representation space into multiple subspaces, and Tiblias et al. (2025) identified task-relevant feature manifolds with measurable causal ef- fects. These results motivate subspaces as a natural unit of interpretation, but they are either learned via concept- or task-specific objectives or lack a shared, scalable represen- tation of local geometry. We instead learn a single model of locally organized representation space that provides a common basis for decomposition, localization, and steering. 8. Conclusion We propose MFA as a generative model of an LM’s ac- tivation space, decomposing it into a mixture of low-rank Gaussian regions and their local axes of variation. This local geometric view yields interpretable components that can rep- resent complex, nonlinear structures beyond what a single global set of directions can express. MFA offers a scalable approach to model control that generalizes across layers and models. On recent benchmarks, it not only surpasses exist- ing unsupervised methods but also remains competitive with supervised ones, often exceeding their performance. From a high-level view, our work introduces a new approach for practical activation decomposition in LMs, relying on local geometry rather than a dictionary of isolated directions. We release our code and 12 trained MFAs for Gemma-2-2B and Llama-3.1-8B to facilitate further community research. 8 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry Impact Statement This work introduces a local-geometry framework for activa- tion decomposition in LMs. Instead of modeling activations as combinations of isolated global directions, we identify regions and local low-rank structure that explains how acti- vations vary locally. This yields better localization of where a feature “lives” in representation space and more precise interventions that target either the region or specific within- region variations. These tools are potentially dual-use: the same ability to isolate and manipulate internal mechanisms could potentially be used to evade safety measures or am- plify undesirable behaviors. We highlight this risk while pre- senting the method for its intended purpose—transparency and controllable, interpretable behavior. Acknowledgments This work was supported in part by a grant from Coefficient Giving, the Academic Research Program at Google, Len Blavatnik and the Blavatnik Family foundation, and the Israel Science Foundation grant 1083/24. References Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A.Linear algebraic structure of word senses, with applications to polysemy.Transactions of the As- sociation for Computational Linguistics, 6:483–495, 2016.URLhttps://api.semanticscholar. org/CorpusID:9285053. 
Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W.Language models can explain neurons in language models.https: //openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html, 2023. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer- circuits.pub/2023/monosemantic-features/index.html. Cai, X., Huang, J., Bian, Y., and Church, K. Isotropy in the contextual embedding space: Clusters and manifolds. In International Conference on Learning Representations, 2021. URLhttps://openreview.net/forum? id=xYGNO86OWDH. Calderon, N., Reichart, R., and Dror, R. The alterna- tive annotator test for LLM-as-a-judge: How to statis- tically justify replacing human annotators with LLMs. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 16051–16081, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long. 782. URLhttps://aclanthology.org/2025. acl-long.782/. Chang, T. A., Tu, Z., and Bergen, B. K. The geometry of multilingual language model representations. In Gold- berg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, p. 119–136, Abu Dhabi, United Arab Emirates, December 2022. Association for Compu- tational Linguistics. doi: 10.18653/v1/2022.emnlp-main. 9.URLhttps://aclanthology.org/2022. emnlp-main.9/. Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Golechha, S., and Bloom, J. I. A is for absorption: Studying feature splitting and absorption in sparse autoen- coders. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps: //openreview.net/forum?id=R73ybUciQF. Chaudhary, M. and Geiger, A. Evaluating open-source sparse autoencoders on disentangling factual knowledge in gpt-2 small, 2024. URLhttps://arxiv.org/ abs/2409.04478. Choi, D., Huang, V., Meng, K., Johnson, D. D., Stein- hardt, J., and Schwettmann, S.Scaling automatic neuron description.https://transluce.org/ neuron-descriptions, October 2024. Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Vi ́ egas, F., and Wattenberg, M. Visualizing and measuring the geometry of BERT. Curran Associates Inc., Red Hook, NY, USA, 2019a. Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Vi ́ egas, F. B., and Wattenberg, M. Visualizing and measuring the geometry of bert.In Neural Infor- mation Processing Systems, 2019b.URLhttps: //api.semanticscholar.org/CorpusID: 174802633. Csord ́ as, R., van Steenkiste, S., and Schmidhuber, J. Are neural nets modular? inspecting functional mod- ularity through differentiable weight masks.In In- ternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum? id=7uVcpu-gMD. 9 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models, 2023. 
URLhttps:// arxiv.org/abs/2309.08600. Davies, X., Nadeau, M., Prakash, N., Shaham, T. R., and Bau, D. Discovering variable binding circuitry with desiderata, 2023. URLhttps://arxiv.org/abs/ 2307.03637. De Cao, N., Schlichtkrull, M. S., Aziz, W., and Titov, I. How do decisions emerge across layers in neural models? interpretation with differentiable masking. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), p. 3243– 3255, Online, November 2020. Association for Computa- tional Linguistics. doi: 10.18653/v1/2020.emnlp-main. 262. URLhttps://aclanthology.org/2020. emnlp-main.262/. De Cao, N., Schmid, L., Hupkes, D., and Titov, I. Sparse interventions in language models with differentiable masking.In Bastings, J., Belinkov, Y., Elazar, Y., Hupkes, D., Saphra, N., and Wiegreffe, S. (eds.), Pro- ceedings of the Fifth BlackboxNLP Workshop on An- alyzing and Interpreting Neural Networks for NLP, p. 16–27, Abu Dhabi, United Arab Emirates (Hy- brid), December 2022. Association for Computational Linguistics.doi: 10.18653/v1/2022.blackboxnlp-1. 2.URLhttps://aclanthology.org/2022. blackboxnlp-1.2/. Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield- Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A math- ematical framework for transformer circuits.Trans- former Circuits Thread, 2021.https://transformer- circuits.pub/2021/framework/index.html. Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M.Not all language model features are one-dimensionally linear.In The Thirteenth In- ternational Conference on Learning Representations, 2025. URLhttps://openreview.net/forum? id=d63a4AM4hb. Ethayarajh, K. How contextual are contextualized word rep- resentations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), p. 55–65, Hong Kong, China, November 2019. Association for Compu- tational Linguistics. doi: 10.18653/v1/D19-1006. URL https://aclanthology.org/D19-1006/. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling, 2020. URLhttps: //arxiv.org/abs/2101.00027. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum? id=tcsZt9ZNKD. Geiger, A., Wu, Z., Potts, C., Icard, T., and Goodman, N. D. Finding alignments between interpretable causal variables and distributed neural representations. Ms., Stanford University, 2023. URLhttps://arxiv.org/abs/ 2303.02536. Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C., and Icard, T. Causal abstraction: A theoretical foundation for mechanistic interpretability, 2024. URLhttps:// arxiv.org/abs/2301.04709. Ghahramani, Z., Hinton, G. E., et al. The em algorithm for mixtures of factor analyzers. 
Technical report, Technical Report CRG-TR-96-1, University of Toronto, 1996. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Allonsius, D., Song, D., Pintz, D., Livshits, D., Wyatt, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Guzm ́ an, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L., Thattai, G., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Ko- revaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I., Misra, I., Evtimov, I., Zhang, J., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Prasad, K., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., El-Arini, K., Iyer, K., Malik, 10 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry K., Chiu, K., Bhalla, K., Lakhotia, K., Rantala-Yeary, L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher, L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M., Tsimpoukelli, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh, M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Zhang, N., Duchenne, O.,C ̧elebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal, P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer, R., Cabral, R. S., Stojnic, R., Raileanu, R., Maheswari, R., Girdhar, R., Patel, R., Sauvestre, R., Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabasappa, S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S., Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan, S., Borodinsky, S., Herman, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speck- bacher, T., Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet, V., Do, V., Vogeti, V., Albiero, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet, X., Wang, X., Wang, X., Tan, X. E., Xia, X., Xie, X., Jia, X., Wang, X., Gold- schlag, Y., Gaur, Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. D., Yan, Z., Chen, Z., Papakipos, Z., Singh, A., Srivastava, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria, A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Baevski, A., Feinstein, A., Kallet, A., Sangani, A., Teo, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poul- ton, A., Ryan, A., Ramchandani, A., Dong, A., Franco, A., Goyal, A., Saraf, A., Chowdhury, A., Gabriel, A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B., Loyd, B., Paola, B. 
D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B., Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Liu, C., Wang, C., Kim, C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Gao, C., Civin, D., Beaty, D., Kreymer, D., Li, D., Adkins, D., Xu, D., Testuggine, D., David, D., Parikh, D., Liskovich, D., Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Mont- gomery, E., Presani, E., Hahn, E., Wood, E., Le, E.-T., Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F., Tian, F., Kokkinos, F., Ozgenel, F., Cag- gioni, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G., Badeer, G., Swee, G., Halpern, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G., Inan, H., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren, H., Goldman, H., Zhan, H., Damlaj, I., Molybog, I., Tufanov, I., Leontiadis, I., Veliche, I.-E., Gat, I., Weissman, J., Geboski, J., Kohli, J., Lam, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J., Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard, J., McPhie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. H., Saxena, K., Khandelwal, K., Zand, K., Matosich, K., Veeraraghavan, K., Michelena, K., Li, K., Jagadeesh, K., Huang, K., Chawla, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L., Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Mankus, M., Hasson, M., Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Liu, M., Seltzer, M. L., Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang, M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White, N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Mehta, N., Laptev, N. P., Dong, N., Cheng, N., Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner, P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao, R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Parthasarathy, R., Li, R., Hogan, R., Battey, R., Wang, R., Howes, R., Rinott, R., Mehta, S., Siby, S., Bondu, S. J., Datta, S., Chugh, S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Mahajan, S., Verma, S., Yamamoto, S., Ramaswamy, S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Patil, S., Shankar, S., Zhang, S., Zhang, S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satter- field, S., Govindaprasad, S., Gupta, S., Deng, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S., Goldman, S., Remez, T., Glaser, T., Best, T., Koehler, T., Robinson, T., Li, T., Zhang, T., Matthews, T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla, V., Ionescu, V., Poenaru, V., Mi- hailescu, V. T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz, W., Constable, W., Tang, X., Wu, X., Wang, X., Wu, X., Gao, X., Kleinman, Y., Chen, Y., Hu, Y., Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Zhao, Y., Hao, Y., Qian, Y., Li, Y., He, Y., Rait, Z., DeVito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., Zhao, Z., and Ma, Z. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. 
Gurnee, W., Nanda, N., Pauly, M., Harvey, K., Troit- skii, D., and Bertsimas, D.Finding neurons in a haystack: Case studies with sparse probing. Transac- tions on Machine Learning Research, 2023. ISSN 2835- 8856. URLhttps://openreview.net/forum? id=JYs1R9IMJr. Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., and Batson, J. When models manipulate mani- folds: The geometry of a counting task. arXiv preprint arXiv:2601.04480, 2026. 11 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., et al. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. arXiv preprint arXiv:2410.20526, 2024. Hindupur, S. S. R., Lubana, E. S., Fel, T., and Ba, D. E. Projecting assumptions: The duality between sparse autoencoders and concept geometry. In ICML 2025 Workshop on Methods and Opportunities at Small Scale, 2025. URLhttps://openreview.net/forum? id=AKaoBzhIIF. Huang, J., Wu, Z., Potts, C., Geva, M., and Geiger, A. RAVEL: Evaluating interpretability methods on disentan- gling language model representations. In Ku, L.-W., Mar- tins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 8669–8687, Bangkok, Thailand, August 2024. Association for Com- putational Linguistics. doi: 10.18653/v1/2024.acl-long. 470. URLhttps://aclanthology.org/2024. acl-long.470/. Huang, X. and Hahn, M. Decomposing representation space into interpretable subspaces with unsupervised learning, 2025. URLhttps://arxiv.org/abs/ 2508.01916. Lee, J. H., Jiralerspong, T., Yu, L., Bengio, Y., and Cheng, E. Geometric signatures of compositionality across a language model’s lifetime. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers), p. 5292–5320, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251- 0. doi: 10.18653/v1/2025.acl-long.265. URLhttps: //aclanthology.org/2025.acl-long.265/. Levy, O. and Goldberg, Y. Linguistic regularities in sparse and explicit word representations. In Morante, R. and Yih, S. W.-t. (eds.), Proceedings of the Eighteenth Con- ference on Computational Natural Language Learning, p. 171–180, Ann Arbor, Michigan, June 2014. Associ- ation for Computational Linguistics. doi: 10.3115/v1/ W14-1618. URLhttps://aclanthology.org/ W14-1618/. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramar, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse au- toencoders everywhere all at once on gemma 2. In Be- linkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.), Proceedings of the 7th Black- boxNLP Workshop: Analyzing and Interpreting Neu- ral Networks for NLP, p. 278–300, Miami, Florida, US, November 2024. Association for Computational Linguistics.doi: 10.18653/v1/2024.blackboxnlp-1. 19.URLhttps://aclanthology.org/2024. blackboxnlp-1.19/. Lin, J. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023. URLhttps: //w.neuronpedia.org . Software available from neuronpedia.org. Marks, S. and Tegmark, M. The geometry of truth: Emer- gent linear structure in large language model representa- tions of true/false datasets. In First Conference on Lan- guage Modeling, 2024. URLhttps://openreview. 
net/forum?id=aajyHYjjsk. Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013a. URLhttps://api.semanticscholar. org/CorpusID:5959482. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Inter- national Conference on Neural Information Processing Systems - Volume 2, NIPS’13, p. 3111–3119, Red Hook, NY, USA, 2013b. Curran Associates Inc. Mueller, A., Brinkmann, J., Li, M., Marks, S., Pal, K., Prakash, N., Rager, C., Sankaranarayanan, A., Sharma, A. S., Sun, J., Todd, E., Bau, D., and Belinkov, Y. The quest for the right mediator: A history, survey, and theo- retical grounding of causal interpretability, 2024. URL https://arxiv.org/abs/2408.01416. Mueller, A., Geiger, A., Wiegreffe, S., Arad, D., Arcuschin, I., Belfki, A., Chan, Y. S., Fiotto-Kaufman, J. F., Hak- lay, T., Hanna, M., Huang, J., Gupta, R., Nikankin, Y., Orgad, H., Prakash, N., Reusch, A., Sankaranarayanan, A., Shao, S., Stolfo, A., Tutek, M., Zur, A., Bau, D., and Belinkov, Y. MIB: A mechanistic interpretability bench- mark. In Forty-second International Conference on Ma- chine Learning, 2025. URLhttps://openreview. net/forum?id=sSrOwve6vb. Nanda, N., Lee, A., and Wattenberg, M. Emergent linear rep- resentations in world models of self-supervised sequence models. In Belinkov, Y., Hao, S., Jumelet, J., Kim, N., McCarthy, A., and Mohebbi, H. (eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpret- ing Neural Networks for NLP, BlackboxNLP@EMNLP 2023, Singapore, December 7, 2023, p. 16–30. Associa- tion for Computational Linguistics, 2023. doi: 10.18653/ V1/2023.BLACKBOXNLP-1.2. URLhttps://doi. org/10.18653/v1/2023.blackboxnlp-1.2. 12 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry OpenAI, :, Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., Madry, A., Baker-Whitcomb, A., Beutel, A., Borzunov, A., Carney, A., Chow, A., Kirillov, A., Nichol, A., Paino, A., Renzin, A., Passos, A. T., Kir- illov, A., Christakis, A., Conneau, A., Kamali, A., Jabri, A., Moyer, A., Tam, A., Crookes, A., Tootoochian, A., Tootoonchian, A., Kumar, A., Vallone, A., Karpathy, A., Braunstein, A., Cann, A., Codispoti, A., Galu, A., Kon- drich, A., Tulloch, A., Mishchenko, A., Baek, A., Jiang, A., Pelisse, A., Woodford, A., Gosalia, A., Dhar, A., Pan- tuliano, A., Nayak, A., Oliver, A., Zoph, B., Ghorbani, B., Leimberger, B., Rossen, B., Sokolowsky, B., Wang, B., Zweig, B., Hoover, B., Samic, B., McGrew, B., Spero, B., Giertler, B., Cheng, B., Lightcap, B., Walkin, B., Quinn, B., Guarraci, B., Hsu, B., Kellogg, B., Eastman, B., Lu- garesi, C., Wainwright, C., Bassin, C., Hudson, C., Chu, C., Nelson, C., Li, C., Shern, C. J., Conger, C., Barette, C., Voss, C., Ding, C., Lu, C., Zhang, C., Beaumont, C., Hallacy, C., Koch, C., Gibson, C., Kim, C., Choi, C., McLeavey, C., Hesse, C., Fischer, C., Winter, C., Czar- necki, C., Jarvis, C., Wei, C., Koumouzelis, C., Sherburn, D., Kappler, D., Levin, D., Levy, D., Carr, D., Farhi, D., Mely, D., Robinson, D., Sasaki, D., Jin, D., Valladares, D., Tsipras, D., Li, D., Nguyen, D. 
P., Findlay, D., Oiwoh, E., Wong, E., Asdar, E., Proehl, E., Yang, E., Antonow, E., Kramer, E., Peterson, E., Sigler, E., Wallace, E., Brevdo, E., Mays, E., Khorasani, F., Such, F. P., Raso, F., Zhang, F., von Lohmann, F., Sulit, F., Goh, G., Oden, G., Salmon, G., Starace, G., Brockman, G., Salman, H., Bao, H., Hu, H., Wong, H., Wang, H., Schmidt, H., Whitney, H., Jun, H., Kirchner, H., de Oliveira Pinto, H. P., Ren, H., Chang, H., Chung, H. W., Kivlichan, I., O’Connell, I., O’Connell, I., Osband, I., Silber, I., Sohl, I., Okuyucu, I., Lan, I., Kostrikov, I., Sutskever, I., Kanitscheider, I., Gulrajani, I., Coxon, J., Menick, J., Pachocki, J., Aung, J., Betker, J., Crooks, J., Lennon, J., Kiros, J., Leike, J., Park, J., Kwon, J., Phang, J., Teplitz, J., Wei, J., Wolfe, J., Chen, J., Harris, J., Varavva, J., Lee, J. G., Shieh, J., Lin, J., Yu, J., Weng, J., Tang, J., Yu, J., Jang, J., Candela, J. Q., Beut- ler, J., Landers, J., Parish, J., Heidecke, J., Schulman, J., Lachman, J., McKay, J., Uesato, J., Ward, J., Kim, J. W., Huizinga, J., Sitkin, J., Kraaijeveld, J., Gross, J., Ka- plan, J., Snyder, J., Achiam, J., Jiao, J., Lee, J., Zhuang, J., Harriman, J., Fricke, K., Hayashi, K., Singhal, K., Shi, K., Karthik, K., Wood, K., Rimbach, K., Hsu, K., Nguyen, K., Gu-Lemberg, K., Button, K., Liu, K., Howe, K., Muthukumar, K., Luther, K., Ahmad, L., Kai, L., Itow, L., Workman, L., Pathak, L., Chen, L., Jing, L., Guy, L., Fedus, L., Zhou, L., Mamitsuka, L., Weng, L., McCal- lum, L., Held, L., Ouyang, L., Feuvrier, L., Zhang, L., Kondraciuk, L., Kaiser, L., Hewitt, L., Metz, L., Doshi, L., Aflak, M., Simens, M., Boyd, M., Thompson, M., Dukhan, M., Chen, M., Gray, M., Hudnall, M., Zhang, M., Aljubeh, M., Litwin, M., Zeng, M., Johnson, M., Shetty, M., Gupta, M., Shah, M., Yatbaz, M., Yang, M. J., Zhong, M., Glaese, M., Chen, M., Janner, M., Lampe, M., Petrov, M., Wu, M., Wang, M., Fradin, M., Pokrass, M., Castro, M., de Castro, M. O. T., Pavlov, M., Brundage, M., Wang, M., Khan, M., Murati, M., Bavarian, M., Lin, M., Yesil- dal, M., Soto, N., Gimelshein, N., Cone, N., Staudacher, N., Summers, N., LaFontaine, N., Chowdhury, N., Ryder, N., Stathas, N., Turley, N., Tezak, N., Felix, N., Kudige, N., Keskar, N., Deutsch, N., Bundick, N., Puckett, N., Nachum, O., Okelola, O., Boiko, O., Murk, O., Jaffe, O., Watkins, O., Godement, O., Campbell-Moore, O., Chao, P., McMillan, P., Belov, P., Su, P., Bak, P., Bakkum, P., Deng, P., Dolan, P., Hoeschele, P., Welinder, P., Tillet, P., Pronin, P., Tillet, P., Dhariwal, P., Yuan, Q., Dias, R., Lim, R., Arora, R., Troll, R., Lin, R., Lopes, R. 
G., Puri, R., Miyara, R., Leike, R., Gaubert, R., Zamani, R., Wang, R., Donnelly, R., Honsby, R., Smith, R., Sahai, R., Ramchandani, R., Huet, R., Carmichael, R., Zellers, R., Chen, R., Chen, R., Nigmatullin, R., Cheu, R., Jain, S., Altman, S., Schoenholz, S., Toizer, S., Miserendino, S., Agarwal, S., Culver, S., Ethersmith, S., Gray, S., Grove, S., Metzger, S., Hermani, S., Jain, S., Zhao, S., Wu, S., Jomoto, S., Wu, S., Shuaiqi, Xia, Phene, S., Papay, S., Narayanan, S., Coffey, S., Lee, S., Hall, S., Balaji, S., Broda, T., Stramer, T., Xu, T., Gogineni, T., Christian- son, T., Sanders, T., Patwardhan, T., Cunninghman, T., Degry, T., Dimson, T., Raoux, T., Shadwell, T., Zheng, T., Underwood, T., Markov, T., Sherbakov, T., Rubin, T., Stasi, T., Kaftan, T., Heywood, T., Peterson, T., Walters, T., Eloundou, T., Qi, V., Moeller, V., Monaco, V., Kuo, V., Fomenko, V., Chang, W., Zheng, W., Zhou, W., Man- assra, W., Sheu, W., Zaremba, W., Patil, Y., Qian, Y., Kim, Y., Cheng, Y., Zhang, Y., He, Y., Zhang, Y., Jin, Y., Dai, Y., and Malkov, Y. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276. Park, K., Choe, Y. J., and Veitch, V. The linear representa- tion hypothesis and the geometry of large language mod- els. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024. Park, K., Choe, Y. J., Jiang, Y., and Veitch, V. The ge- ometry of categorical and hierarchical concepts in large language models. In The Thirteenth International Confer- ence on Learning Representations, 2025. URLhttps: //openreview.net/forum?id=bVTM2QKYuA. Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automati- cally interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024. Pennington, J., Socher, R., and Manning, C.GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings 13 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry of the 2014 Conference on Empirical Methods in Nat- ural Language Processing (EMNLP), p. 1532–1543, Doha, Qatar, October 2014. Association for Computa- tional Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162/. Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., and Gold- berg, Y. Null it out: Guarding protected attributes by iterative nullspace projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, p. 7237–7256, 2020. Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering llama 2 via contrastive activa- tion addition. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 15504–15522, Bangkok, Thailand, August 2024. Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.828. URLhttps: //aclanthology.org/2024.acl-long.828/. Saglam, B., Kassianik, P., Nelson, B., Weerawardhena, S., Singer, Y., and Karbasi, A. Large language mod- els encode semantics in low-dimensional linear sub- spaces, 2025.URLhttps://arxiv.org/abs/ 2507.09709. Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J., Biderman, S., Garriga-Alonso, A., Conmy, A., Nanda, N., Rumbelow, J., Wattenberg, M., Schoots, N., Miller, J., Michaud, E. 
J., Casper, S., Tegmark, M., Saunders, W., Bau, D., Todd, E., Geiger, A., Geva, M., Hoogland, J., Murfet, D., and McGrath, T. Open problems in mechanistic interpretability, 2025. URL https://arxiv.org/abs/2501.16496. Simon, P. J. D., d’Ascoli, S., Chemla, E., Lakretz, Y., and King, J.-R. A polar coordinate system represents syntax in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum? id=x2780VcMOI. Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., Nathan, A., Luo, A., Helyar, A., Madry, A., Efremov, A., Spyra, A., Baker-Whitcomb, A., Beutel, A., Karpenko, A., Makelov, A., Neitz, A., Wei, A., Barr, A., Kirchmeyer, A., Ivanov, A., Christakis, A., Gillespie, A., Tam, A., Bennett, A., Wan, A., Huang, A., Sandjideh, A. M., Yang, A., Kumar, A., Saraiva, A., Vallone, A., Gheorghe, A., Garcia, A. G., Braunstein, A., Liu, A., Schmidt, A., Mereskin, A., Mishchenko, A., Applebaum, A., Rogerson, A., Rajan, A., Wei, A., Kotha, A., Srivas- tava, A., Agrawal, A., Vijayvergiya, A., Tyra, A., Nair, A., Nayak, A., Eggers, B., Ji, B., Hoover, B., Chen, B., Chen, B., Barak, B., Minaiev, B., Hao, B., Baker, B., Lightcap, B., McKinzie, B., Wang, B., Quinn, B., Fioca, B., Hsu, B., Yang, B., Yu, B., Zhang, B., Brenner, B., Zetino, C. R., Raymond, C., Lugaresi, C., Paz, C., Hud- son, C., Whitney, C., Li, C., Chen, C., Cole, C., Voss, C., Ding, C., Shen, C., Huang, C., Colby, C., Hallacy, C., Koch, C., Lu, C., Kaplan, C., Kim, C., Minott-Henriques, C., Frey, C., Yu, C., Czarnecki, C., Reid, C., Wei, C., Decareaux, C., Scheau, C., Zhang, C., Forbes, C., Tang, D., Goldberg, D., Roberts, D., Palmie, D., Kappler, D., Levine, D., Wright, D., Leo, D., Lin, D., Robinson, D., Grabb, D., Chen, D., Lim, D., Salama, D., Bhattacharjee, D., Tsipras, D., Li, D., Yu, D., Strouse, D., Williams, D., Hunn, D., Bayes, E., Arbus, E., Akyurek, E., Le, E. Y., Widmann, E., Yani, E., Proehl, E., Sert, E., Cheung, E., Schwartz, E., Han, E., Jiang, E., Mitchell, E., Sigler, E., Wallace, E., Ritter, E., Kavanaugh, E., Mays, E., Nikishin, E., Li, F., Such, F. 
P., de Avila Belbute Peres, F., Raso, F., Bekerman, F., Tsimpourlas, F., Chantzis, F., Song, F., Zhang, F., Raila, G., McGrath, G., Briggs, G., Yang, G., Parascandolo, G., Chabot, G., Kim, G., Zhao, G., Valiant, G., Leclerc, G., Salman, H., Wang, H., Sheng, H., Jiang, H., Wang, H., Jin, H., Sikchi, H., Schmidt, H., Aspegren, H., Chen, H., Qiu, H., Lightman, H., Covert, I., Kivlichan, I., Silber, I., Sohl, I., Hammoud, I., Clavera, I., Lan, I., Akkaya, I., Kostrikov, I., Kofman, I., Etinger, I., Singal, I., Hehir, J., Huh, J., Pan, J., Wilczynski, J., Pachocki, J., Lee, J., Quinn, J., Kiros, J., Kalra, J., Samaroo, J., Wang, J., Wolfe, J., Chen, J., Wang, J., Harb, J., Han, J., Wang, J., Zhao, J., Chen, J., Yang, J., Tworek, J., Chand, J., Lan- don, J., Liang, J., Lin, J., Liu, J., Wang, J., Tang, J., Yin, J., Jang, J., Morris, J., Flynn, J., Ferstad, J., Heidecke, J., Fishbein, J., Hallman, J., Grant, J., Chien, J., Gordon, J., Park, J., Liss, J., Kraaijeveld, J., Guay, J., Mo, J., Lawson, J., McGrath, J., Vendrow, J., Jiao, J., Lee, J., Steele, J., Wang, J., Mao, J., Chen, K., Hayashi, K., Xiao, K., Salahi, K., Wu, K., Sekhri, K., Sharma, K., Singhal, K., Li, K., Nguyen, K., Gu-Lemberg, K., King, K., Liu, K., Stone, K., Yu, K., Ying, K., Georgiev, K., Lim, K., Tirumala, K., Miller, K., Ahmad, L., Lv, L., Clare, L., Fauconnet, L., Itow, L., Yang, L., Romaniuk, L., Anise, L., Byron, L., Pathak, L., Maksin, L., Lo, L., Ho, L., Jing, L., Wu, L., Xiong, L., Mamitsuka, L., Yang, L., McCallum, L., Held, L., Bourgeois, L., Engstrom, L., Kuhn, L., Feuvrier, L., Zhang, L., Switzer, L., Kondraciuk, L., Kaiser, L., Joglekar, M., Singh, M., Shah, M., Stratta, M., Williams, M., Chen, M., Sun, M., Cayton, M., Li, M., Zhang, M., Aljubeh, M., Nichols, M., Haines, M., Schwarzer, M., Gupta, M., Shah, M., Huang, M., Dong, M., Wang, M., Glaese, M., Carroll, M., Lampe, M., Malek, M., Shar- man, M., Zhang, M., Wang, M., Pokrass, M., Florian, M., Pavlov, M., Wang, M., Chen, M., Wang, M., Feng, M., Bavarian, M., Lin, M., Abdool, M., Rohaninejad, M., 14 From Directions to Regions: Decomposing Activations in Language Models via Local Geometry Soto, N., Staudacher, N., LaFontaine, N., Marwell, N., Liu, N., Preston, N., Turley, N., Ansman, N., Blades, N., Pancha, N., Mikhaylin, N., Felix, N., Handa, N., Rai, N., Keskar, N., Brown, N., Nachum, O., Boiko, O., Murk, O., Watkins, O., Gleeson, O., Mishkin, P., Lesiewicz, P., Baltescu, P., Belov, P., Zhokhov, P., Pronin, P., Guo, P., Thacker, P., Liu, Q., Yuan, Q., Liu, Q., Dias, R., Puckett, R., Arora, R., Mullapudi, R. T., Gaon, R., Miyara, R., Song, R., Aggarwal, R., Marsan, R., Yemiru, R., Xiong, R., Kshirsagar, R., Nuttall, R., Tsiupa, R., Eldan, R., Wang, R., James, R., Ziv, R., Shu, R., Nigmatullin, R., Jain, S., Talaie, S., Altman, S., Arnesen, S., Toizer, S., Toyer, S., Miserendino, S., Agarwal, S., Yoo, S., Heon, S., Ethersmith, S., Grove, S., Taylor, S., Bubeck, S., Banesiu, S., Amdo, S., Zhao, S., Wu, S., Santurkar, S., Zhao, S., Chaudhuri, S. R., Krishnaswamy, S., Shuaiqi, Xia, Cheng, S., Anadkat, S., Fishman, S. 
P., Tobin, S., Fu, S., Jain, S., Mei, S., Egoian, S., Kim, S., Golden, S., Mah, S., Lin, S., Imm, S., Sharpe, S., Yadlowsky, S., Choudhry, S., Eum, S., Sanjeev, S., Khan, T., Stramer, T., Wang, T., Xin, T., Gogineni, T., Christianson, T., Sanders, T., Patwardhan, T., Degry, T., Shadwell, T., Fu, T., Gao, T., Garipov, T., Sriskandarajah, T., Sherbakov, T., Kaftan, T., Hiratsuka, T., Wang, T., Song, T., Zhao, T., Peter- son, T., Kharitonov, V., Chernova, V., Kosaraju, V., Kuo, V., Pong, V., Verma, V., Petrov, V., Jiang, W., Zhang, W., Zhou, W., Xie, W., Zhan, W., McCabe, W., DePue, W., Ellsworth, W., Bain, W., Thompson, W., Chen, X., Qi, X., Xiang, X., Shi, X., Dubois, Y., Yu, Y., Khakbaz, Y., Wu, Y., Qian, Y., Lee, Y. T., Chen, Y., Zhang, Y., Xiong, Y., Tian, Y., Cha, Y., Bai, Y., Yang, Y., Yuan, Y., Li, Y., Zhang, Y., Yang, Y., Jin, Y., Jiang, Y., Wang, Y., Wang, Y., Liu, Y., Stubenvoll, Z., Dou, Z., Wu, Z., and Wang, Z. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601.03267. Singh, S., Ravfogel, S., Herzig, J., Aharoni, R., Cotterell, R., and Kumaraguru, P. Representation surgery: theory and practice of affine steering. In Proceedings of the 41st International Conference on Machine Learning, p. 45663–45680, 2024. Smilkov, D., Thorat, N., Nicholson, C., Reif, E., Vi ́ egas, F. B., and Wattenberg, M. Embedding projector: In- teractive visualization and interpretation of embeddings. ArXiv, abs/1611.05469, 2016. URLhttps://api. semanticscholar.org/CorpusID:14293681. Sun, J., Huang, J., Baskaran, S., D’Oosterlinck, K., Potts, C., Sklar, M., and Geiger, A. HyperDAS: Towards au- tomating mechanistic interpretability with hypernetworks. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=6fDjUoEQvm. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram ́ e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsit- sulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J.-B., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Parrish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Paterson, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozi ́ nska, D., Her- bison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak- Pluci ́ nska, H., Batra, H., Dhand, H., Nardini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., yeong Ji, J., Mohamed, K., Badola, K., Black, K., Milli- can, K., McDonell, K., Nguyen, K., Sodhia, K., Greene, K., Sjoesund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., McNealus, L., Soares, L. 
B., Kilpatrick, L., Dixon, L., Martins, L., Reid, M., Singh, M., Iverson, M., Görner, M., Velloso, M., Wirth, M., Davidow, M., Miller, M., Rahtz, M., Watson, M., Risdal, M., Kazemi, M., Moynihan, M., Zhang, M., Kahng, M., Park, M., Rahman, M., Khatwani, M., Dao, N., Bardoliwalla, N., Devanathan, N., Dumai, N., Chauhan, N., Wahltinez, O., Botarda, P., Barnes, P., Barham, P., Michel, P., Jin, P., Georgiev, P., Culliton, P., Kuppala, P., Comanescu, R., Merhej, R., Jana, R., Rokni, R. A., Agarwal, R., Mullins, R., Saadat, S., Carthy, S. M., Cogan, S., Perrin, S., Arnold, S. M. R., Krause, S., Dai, S., Garg, S., Sheth, S., Ronstrom, S., Chan, S., Jordan, T., Yu, T., Eccles, T., Hennigan, T., Kocisky, T., Doshi, T., Jain, V., Yadav, V., Meshram, V., Dharmadhikari, V., Barkley, W., Wei, W., Ye, W., Han, W., Kwon, W., Xu, X., Shen, Z., Gong, Z., Wei, Z., Cotruta, V., Kirk, P., Rao, A., Giang, M., Peran, L., Warkentin, T., Collins, E., Barral, J., Ghahramani, Z., Hadsell, R., Sculley, D., Banks, J., Dragan, A., Petrov, S., Vinyals, O., Dean, J., Hassabis, D., Kavukcuoglu, K., Farabet, C., Buchatskaya, E., Borgeaud, S., Fiedel, N., Joulin, A., Kenealy, K., Dadashi, R., and Andreev, A. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118.

Tiblias, F., Bigoulaeva, I., Niu, J., Balloccu, S., and Gurevych, I. Shape happens: Automatic feature manifold discovery in LLMs via supervised multi-dimensional scaling, 2025. URL https://arxiv.org/abs/2510.01025.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248.

Wiedemann, G., Remus, S., Chawla, A., and Biemann, C. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. ArXiv, abs/1909.10430, 2019. URL https://api.semanticscholar.org/CorpusID:202719403.

Wiegreffe, S., Tafjord, O., Belinkov, Y., Hajishirzi, H., and Sabharwal, A. Answer, assemble, ace: Understanding how LMs answer multiple choice questions. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6NNA0MxhCH.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders, 2025. URL https://arxiv.org/abs/2501.17148.

Yun, Z., Chen, Y., Olshausen, B., and LeCun, Y. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Agirre, E., Apidianaki, M., and Vulić, I. (eds.), Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, p. 1–10, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.deelio-1.1. URL https://aclanthology.org/2021.deelio-1.1/.

Table 3. Convergence and centroid diversity across initializations. We report the convergence rate in number of gradient descent steps and the mean ± std of pairwise Euclidean distances between learned component means (μ).

Initialization   Steps    Mean pairwise dist. (± std)
Random Point     26136    143.76 ± 106.18
K-Means          28008    108.26 ± 65.41
Random Init      38880    71.92 ± 10.33
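For concreteness, the three initialization schemes compared in Table 3 (detailed in Appendix A below), the convergence criterion, and the centroid-diversity metric reported in the table can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, the MFA training loop itself is omitted, and activations are assumed to sit in a NumPy array X of shape (N, d).

```python
# Illustrative sketch (hypothetical names, not the released code): the three
# centroid initializations compared in Table 3, the convergence criterion of
# Appendix A, and the centroid-diversity metric (mean +/- std of pairwise
# Euclidean distances between component means).
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans


def init_random(X, K, sigma=1.0, seed=0):
    """Fully random init: mu_k ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(K, X.shape[1]))


def init_random_point(X, K, seed=0):
    """Random point init: set each mu_k to a uniformly sampled activation."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx].copy()


def init_kmeans(X, K, seed=0):
    """K-Means init: use the cluster centers as component means."""
    return KMeans(n_clusters=K, random_state=seed).fit(X).cluster_centers_


def centroid_diversity(mu):
    """Mean and std of pairwise Euclidean distances between component means."""
    dists = pdist(mu)  # all K*(K-1)/2 pairwise distances
    return dists.mean(), dists.std()


def converged(loglik_prev, loglik_curr, tol=1e-3):
    """Stop training once the change in log-likelihood drops below tol."""
    return abs(loglik_curr - loglik_prev) < tol
```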
A. MFA Initialization and Training

While we primarily initialize with K-Means, we also tried two simpler alternatives: (1) fully random initialization and (2) initializing each component from random data points.

Random Initialization. We initialize each component mean by sampling from an isotropic normal distribution, $\mu_k \sim \mathcal{N}(0, \sigma^2 I)$. All other parameters are initialized as described in §3.

Random Point Initialization. We initialize each component mean by sampling activations uniformly from the dataset. Let $\{x_i\}_{i=1}^{N}$ be the corpus of activations and $K$ the number of components. We sample indices $i_1, \dots, i_K$ uniformly at random and set

$$\mu_k \leftarrow x_{i_k}, \quad k = 1, \dots, K. \qquad (17)$$

All other parameters are initialized as described in §3. Additionally, we note that future work may make this initialization more robust by resampling Gaussians that are initialized close to one another in activation space, thus encouraging more diverse Gaussians.

Initialization Comparison. To compare initialization strategies, we train MFAs with identical hyperparameters while varying only the initialization method (K-Means, random, and random point). We measure convergence efficiency as the number of training iterations until convergence, defined as the point at which the change in log-likelihood between successive iterations falls below $10^{-3}$. To assess whether the learned components remain diverse after training, we compute pairwise Euclidean distances between the component means $\{\mu_k\}_{k=1}^{K}$.

Results. We present the results in Table 3. Fully random initialization often converges to poor solutions, with little variation in the pairwise Euclidean distances and a relatively higher NLL. Random point initialization scales to very large datasets and converges fast, but is more prone to local minima, as indicated by its higher NLL compared to K-Means; its variance is also much higher, indicating a less uniform spread of centroids. Lastly, K-Means converges fast, but its initialization is slower and less scalable than random point initialization. However, it yields better coverage of the dataset and converges to a better minimum.

B. Annotation Statistical Testing

In this section we provide details on the statistical testing conducted to validate the LLM's capacity to label both the loadings and the Gaussians in §4. We provide prompts and annotation instructions in §G.

Method. First, to identify consistency between human and LLM annotators, we examine agreement using Cohen's κ (i) between humans and (ii) between the LLM and each human. Second, we use the Alternative Annotator Test (Calderon et al., 2025): we compare how well the LLM agrees with the other human annotators versus how well each individual human agrees with the others (leave-one-out). Following the procedure outlined in Calderon et al. (2025), we compute the winning rate (ω, the fraction of humans for whom the LLM is judged better) and deem the LLM replaceable if ω ≥ 0.5 (with q = 0.05). For all tasks we use ε = 0.1.

For the loading annotation task (which requires interpreting context), labels were provided by a mix of NLP graduate and doctoral students. We ran statistical testing on 58 randomly sampled loadings, using the annotation instructions in §G and the procedure described in §4. For the Gaussian annotation task, we had non-expert human annotators annotate 29 samples with the procedure outlined in §4.
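For reference, the agreement statistics underlying this comparison can be computed roughly as in the sketch below. This is a simplified, assumed implementation: it computes leave-one-out Cohen's κ agreement and the winning rate ω, but omits the formal hypothesis test (with parameters ε and q) and the BY correction used by the Alternative Annotator Test of Calderon et al. (2025). The function name is hypothetical; cohen_kappa_score is from scikit-learn.

```python
# Simplified sketch of the agreement analysis, not the full Alternative
# Annotator Test (Calderon et al., 2025), which additionally runs a
# hypothesis test with parameters epsilon and q and a BY correction.
import numpy as np
from sklearn.metrics import cohen_kappa_score


def winning_rate(human_labels, llm_labels):
    """human_labels: list of label arrays, one per human annotator, all over
    the same items. Returns the leave-one-out winning rate omega."""
    wins = 0
    for i, h in enumerate(human_labels):
        others = [o for j, o in enumerate(human_labels) if j != i]
        # Agreement of the excluded human with the remaining humans.
        kappa_human = np.mean([cohen_kappa_score(h, o) for o in others])
        # Agreement of the LLM with the same remaining humans.
        kappa_llm = np.mean([cohen_kappa_score(llm_labels, o) for o in others])
        wins += kappa_llm >= kappa_human
    return wins / len(human_labels)  # LLM deemed replaceable if omega >= 0.5
```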
Results. For the loading task, across 58 sampled loadings, human annotators showed moderate agreement (mean Cohen's κ = 0.29 across pairs). The LLM matched each human annotator at least as well, with higher agreement on average (mean κ = 0.44). Using the alternative annotator test (leave-one-out with BY correction, q = 0.05), the LLM significantly outperformed the excluded human in all three comparisons, yielding a winning rate of ω = 1.0 and therefore meeting the replacement criterion (ω ≥ 0.5).

For the Gaussian labeling task, across the sampled Gaussians the three human annotators agreed perfectly (pairwise Cohen's κ = 1.0). The LLM annotations also matched each human perfectly (mean κ = 1.0). Under the alternative annotator test (leave-one-out with BY correction, q = 0.05), the LLM met the replacement criterion (ω = 1.0 ≥ 0.5), indicating it is statistically indistinguishable from the human annotators on this task.

C. Labeling Loadings

For a single activation $x$ assigned to component $k$, we test whether its latent coordinates $\hat{z}_k$ place it in the right concept group within that region. We retrieve the 10 nearest neighbors to $\hat{z}_k$ in the low-dimensional latent space defined by the Gaussian, along with a contrast set given by the 10 farthest points in the same space. This within-component contrast isolates what the subspace is separating inside the region. We then judge whether $x$ shares a coherent concept with its nearest neighbors, and require this concept to be absent or much weaker in the contrast set. If so, we treat the placement induced by $\hat{z}_k$ as an interpretable account of the local refinement.

D. Reconstruction Analysis

Experiment. To evaluate whether a local low-rank factor model is a reasonable approximation of activations, we measure reconstruction error on a held-out validation set. We compare MFA with SAEs as a baseline, using 10 million activations sampled from Wikipedia. For each activation $x \in \mathbb{R}^d$, MFA reconstructs $\hat{x}$ using Eq. 11, while the SAE reconstructs via encoding then decoding. We report the mean squared reconstruction error (MSE) over all samples,

$$\mathrm{MSE} = \frac{1}{Nd} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2, \qquad (18)$$

along with the standard error of this estimate to indicate measurement precision. We measure MSE for the 12 MFAs of Gemma-2-2B and Llama-3.1-8B from §4. For SAEs, we evaluate Gemmascope-65k (Lieberum et al., 2024) and Llamascope (He et al., 2024) at the same layers as the MFAs. In addition, we evaluate a weak K-Means baseline that reconstructs a given activation by assigning it to its closest centroid.

Results. Table 4 shows that reconstruction quality improves with the number of MFA components $K$. Increasing $K$ from 1K to 8K yields a substantial reduction in error, while the additional improvement from 8K to 32K is comparatively smaller, indicating diminishing returns once $K$ is sufficiently large for the corpus. Across all settings, SAEs achieve lower reconstruction error. This is expected because SAEs allow a more flexible, sample-specific reconstruction: each activation can be expressed as a sparse combination of dictionary elements whose support can change across inputs. MFA instead commits to a single component and reconstructs within a fixed rank-$R$ subspace around that centroid, which makes the reconstruction less expressive with the $R = 10$ used here. The remaining gap is therefore consistent with the rank constraint. Importantly, MFA substantially outperforms the K-Means baseline, which consistently has 1.3–1.5 times higher MSE. This indicates that modeling within-region variation with a learned low-rank structure captures a considerable portion of the structured representation of the activation.
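A minimal sketch of this reconstruction comparison, under stated assumptions: the exact MFA reconstruction rule (Eq. 11) and the component-assignment rule are not reproduced in this excerpt, so the code below assumes a hard assignment to the nearest centroid and an orthogonal projection onto that component's rank-R loadings W[k] (assumed orthonormal, shape (R, d)); the actual model may instead use Gaussian responsibilities and the factor-analysis posterior. All names are hypothetical.

```python
# Sketch of the Appendix D comparison under assumed simplifications: hard
# nearest-centroid assignment and orthonormal per-component loadings W[k].
import numpy as np


def assign(X, mu):
    """Hard-assign each activation to its nearest component centroid.
    (For large N and K this should be computed in chunks.)"""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K)
    return d2.argmin(axis=1)


def reconstruct_mfa(X, mu, W):
    """Reconstruct each activation inside its component's local subspace."""
    k = assign(X, mu)
    Xc = X - mu[k]                                    # offset from the centroid
    Z = np.einsum("nd,nrd->nr", Xc, W[k])             # latent coordinates z_hat
    return mu[k] + np.einsum("nr,nrd->nd", Z, W[k])   # centroid + low-rank offset


def reconstruct_kmeans(X, mu):
    """Weak baseline: reconstruct with the nearest centroid alone."""
    return mu[assign(X, mu)]


def mse(X, X_hat):
    """Eq. 18: squared reconstruction error averaged over all N*d entries."""
    return np.mean((X - X_hat) ** 2)
```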
Limitation. A key limitation of MFA is that it explicitly models the activation distribution it is trained on: the centroids and low-rank subspaces capture the regions of activation space that occur in the training corpus. As a result, when an activation lies in a region that is rare or out of distribution for the training set (i.e., not part of the Gaussians we model), MFA may assign it to the nearest available component even if none provides a good local fit, yielding high reconstruction error. This failure mode is inherent to the model class: increasing $K$ can expand coverage of the training distribution, but it does not guarantee low error on regions that were never observed. Even so, we find that MFA isolates meaningful and useful features that generalize to tasks such as steering and localization.

Table 4. Reconstruction error reported as MSE for MFAs, SAEs, and a K-Means baseline. For the K-Means baseline we additionally provide the error ratio relative to the MFA of the same size. All results were found to be statistically significant, with standard error < 1e-4.

Model      Layer   SAE     MFA-1K   KMeans-1K      MFA-8K   KMeans-8K      MFA-32K   KMeans-32K
Gemma-2    6       0.592   1.098    1.530 (×1.4)   0.896    1.265 (×1.4)   0.800     1.104 (×1.4)
Gemma-2    18      4.595   5.430    7.629 (×1.4)   4.678    6.622 (×1.4)   4.477     5.966 (×1.3)
Llama-3.1  8       0.023   0.005    0.007 (×1.3)   0.004    0.006 (×1.4)   0.004     0.005 (×1.4)
Llama-3.1  22      0.038   0.064    0.086 (×1.3)   0.051    0.074 (×1.5)   0.045     0.066 (×1.5)

E. Benchmarking: Additional Details

Localization (MCQA and RAVEL). We train MFAs for the localization experiments on both Gemma-2-2B and Llama-3.1-8B for all layers, using the train sets of the Country, Language, and Continent tasks combined. We fit an MFA independently per layer with K ∈ {100, 250} for MCQA and K ∈ {25, 50} for RAVEL, and sweep ranks R ∈ {25, 50} for MCQA and R ∈ {10, 20} for RAVEL. MFAs are trained for 400 epochs with batch size 256 and learning rate $10^{-3}$. In the paper we report the results for the best combination over these parameters. For MIB, we use the default parameters provided in its codebase.

Causal. Causal steering is performed on Gemma-2-2B (layers 6 and 18) and Llama-3.1-8B (layers 8 and 22). For MFA, we apply interpolation-based interventions and sweep α ∈ {0.15, 0.20, 0.25, 0.30, 0.325, 0.35, 0.375, 0.425, 0.45, 0.475, 0.50, 0.55, 0.60, 0.80}. For SAE baselines, we use Gemmascope 65k (Lieberum et al., 2024) and Llamascope 131k (He et al., 2024), randomly sampling 250 features per layer. We report additive interventions (more effective in our setting) and sweep α ∈ {0.4, 0.8, 1.2, 1.6, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 20.0, 40.0, 60.0, 100.0}, using the same set of values as in Wu et al. (2025). For each condition, we generate 8 continuations from the prompt "I think that" with at most 50 new tokens, top-k = 30, and top-p = 0.3. For scoring prompts, see the Concept Score and Fluency prompts of Wu et al. (2025).
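The two intervention schemes compared in the ablations described below can be sketched as follows. Eqs. 14–15 are not reproduced in this excerpt, so the exact forms are assumptions: interpolation pulls an activation toward a target vector (e.g., a Gaussian centroid), while the additive scheme shifts it along a steering direction (e.g., an SAE decoder row or an MFA loading) scaled by α.

```python
# Hypothetical sketch of the two intervention schemes; the paper's exact
# Eqs. 14-15 are not shown in this excerpt, so these forms are assumptions.
import numpy as np


def interpolate_intervention(x, target, alpha):
    """Move a fraction alpha of the way from x toward the target vector
    (e.g., a component centroid mu_k)."""
    return (1.0 - alpha) * x + alpha * target


def additive_intervention(x, direction, alpha):
    """Shift x along a unit-normalized steering direction (e.g., an SAE
    decoder row or an MFA loading w_j) by a scale alpha."""
    d = direction / np.linalg.norm(direction)
    return x + alpha * d
```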
Additionally, we run ablations on Gemma-2-2B (layer 18) to select the most appropriate intervention scheme (interpolation vs. additive) for both SAEs and MFA, on a randomly sampled set of 100 features/Gaussians that was not used for the final results in the paper. For the 32K-MFA, we applied an additive intervention to the centroids using the same α values as above. Performance dropped substantially, with a mean final score of 0.124 ± 0.14 as opposed to 0.24 ± 0.20 with interpolation. This supports the view that MFA centroids primarily encode absolute position in activation space, so additive perturbations are poorly matched to their inductive bias. We performed the complementary ablation for SAEs, using an interpolation-based intervention with α ∈ {0.3, 0.35, 0.4, 0.45, 0.475, 0.5, 0.525, 0.55, 0.6, 0.65, 0.7}. SAE performance degraded sharply, yielding a mean final score of 0.177 ± 0.19 as opposed to 0.195 ± 0.21; moreover, SAEs are commonly used with additive interventions, which further supports these results. We therefore used the best-performing intervention method from the ablations for each decomposition.

Causal Fluency and Concept Scores. We report here the Concept and Fluency scores for the steering evaluation. Notably, MFA scores significantly higher than SAEs and DiffMeans in Concept Score, indicating that it promotes highly interpretable concepts that are well described by the high-likelihood samples. Meanwhile, Fluency scores are fairly consistent across methods, with MFA having a slightly lower upper bound.

Figure 6. Concept Score steering results across layers in Gemma-2-2B (6, 18) and Llama-3.1-8B (8, 22) for state-of-the-art SAEs, DiffMeans, and 1K, 8K, and 32K Gaussian MFAs. Across all settings MFA significantly outperforms DiffMeans and SAEs, strongly promoting its associated concepts.

Figure 7. Fluency Score steering results across layers in Gemma-2-2B (6, 18) and Llama-3.1-8B (8, 22) for state-of-the-art SAEs, DiffMeans, and 1K, 8K, and 32K Gaussian MFAs. Across all settings scores are consistent for all methods, with MFA showing a slight decline.

F. Examples

F.1. Neighborhood Examples

We provide a range of example Gaussian neighborhoods. For each Gaussian, we annotate a small set of randomly sampled tokens to give a qualitative sense of its content; these tokens are illustrative and may not fully represent the broader concept captured by the component. We show 10 examples in total: 5 from Llama-3.1-8B (layer 22) and 5 from Gemma-2-2B (layer 18).

F.2. Steering Examples

We provide additional examples of steering towards the centroids and of independently steering with the local covariance structure. We consistently see that steering with centroids promotes a broad topic that aligns with the cluster it represents, while intervening with the directions of variation, even when the activation does not lie within the region, often promotes a sub-concept of the broader theme. We interpret this as follows: the centroid move changes the activation's absolute position, pushing the representation into a region that similar contexts tend to pass through, so it reliably brings out the broad topic associated with that Gaussian. In contrast, the local covariance captures how activations vary within the region.
Because points in the same region are already very similar, this variation is often especially clean, separating a single semantic attribute and producing a targeted sub-concept shift within the broader theme. This holds for narrow Gaussians as well: even when a Gaussian captures only very specific tokens or structures, the learned covariance models a wide range of concepts. We present this as qualitative evidence, since the MFA loadings define the local subspace of variation rather than a unique semantic direction; as a result, stepping along a single loading may not always line up with the dominant direction of variation in that region. The first table shows examples of top promoted tokens using the centroid intervention (Eq. 14) and loading interventions (Eq. 15) from Llama-3.1-8B layer 22, and the second from Gemma-2-2B layer 18.

Table 5. More centroid vs. loading examples (Llama-3.1-8B). Each centroid shows broad theme promotion, while loadings isolate sharper subthemes.

Term   Descriptor   Top promoted tokens

(1) Superheroes / comics universe
  μ     superheroes             Superman, Batman, Joker, Gotham, Deadpool, superhero
  w_1   Superman                Superman, Clark, krypton, rypton, SUPER, restoring
  w_2   DC film studio          Batman, Warner, WB, Nolan, Snyder, films
  w_3   Batman                  Joker, Gotham, Commissioner, Mafia, inmates, profiler

(2) Vascular procedures / clinical devices
  μ     catheters, needles      biopsy, cath, artery, needle, infusion, injection
  w_1   angioplasty tools       balloon, angi, coronary, flexible, tapered, strut
  w_2   injection verbs         injections, injection, injecting, inject, injected, Injection
  w_3   connectors, tubing      adapter, connector, nozzle, hose, cartridge, apparatus

(3) Places / US cities and neighborhoods
  μ     US city names           Newark, Lexington, Bronx, Omaha, Honolulu, Albany
  w_1   East Coast metro        Hartford, Stamford, Greenwich, Brooklyn, Manhattan, Bronx
  w_2   downtown / inner-city   downtown, Downtown, Harlem, Detroit, inner, ghetto
  w_3   suburbs framing         suburban, suburbs, suburb, Anaheim, Los

(4) Organs / biomedical anatomy
  μ     organs anatomy          intestine, liver, uterus, sple, pancre, kidneys
  w_1   neuroscience terms      brain, brains, neurons, neuro, neuroscience, cerebral
  w_2   genetics / immunity     antigen, antibody, gene, genome, mouse, vaccine
  w_3   organ physiology        lungs, kidneys, arteries, renal, blood, pulmonary

Table 6. More centroid vs. loading examples (Gemma-2-2B). Each centroid shows broad theme promotion, while loadings isolate sharper subthemes. Additionally, we label narrow Gaussians to show that local variation is often meaningful.
Term   Descriptor   Top promoted tokens

(1) Continents / countries
  μ     continents             continent, Europe, EU, France, Parlamento, Countries
  w_1   continent names        Europe, Asia, Africa, continents, European
  w_2   Africa                 Africa, Ebola, malaria, Angola, regions
  w_3   SE Asia                Nang, Cambodian, Ceylon, Nepali, Indon

(2) Sleep / tiredness
  μ     sleep                  apnea, deprivation, sleep, sleepless
  w_1   sleep disorders        apnea, insomnia, deprivation, circadian, disorders
  w_2   sleep lexemes          slept, Sleep, sleeps, SLEEP
  w_3   waking / until         until, Until, hrs, woken, UNTIL

(3) Security / secure (narrow Gaussian)
  μ     secure, affixes        safeguard, unlock, Secured, Against, ness, able
  w_1   email security         Gmail, gmail, password, inbox, email, browser
  w_2   cryptography           encrypted, ciphertext, passphrase, authenticated, smtp
  w_3   fasten / attach        attaches, affixed, fastened, fastening, attach

(4) Rooms / venues
  μ     rooms, interiors       room, rooms, foyer, showroom, floor, located
  w_1   theaters               Theater, Theatre, theater, thtre, theatrical
  w_2   casinos / gambling     Casino, Gambling, Poker, Slots, Betting
  w_3   restaurants / menus    eateries, restaurants, menus, diners, seafood, gourmet

(5) "Developing" (narrow Gaussian)
  μ     developing             methodologies, rapidly, methodology, scalable, accordingly
  w_1   developing symptoms    symptoms, leukemia, jaundice, cancer, lymphoma
  w_2   developing countries   countries, nations, continent, hemisphere, Nations
  w_3   drug development       prophylactic, adjuvant, assays, analgesic, sterilized

G. Prompts and Annotation Instructions

Centroid Description Prompt

You are a meticulous AI researcher conducting an important investigation into patterns found in language. You will be given examples where certain tokens or sequences are highlighted between delimiters like << >>. These highlighted segments represent the most strongly activating tokens or spans for a concept. They may or may not be meaningful on their own, and sometimes the surrounding context is what carries the real pattern.

Guidelines:
- Analyze the examples and produce one concise natural-language interpretation that captures the shared latent patterns present in the highlighted text and examples.
- Focus on describing the semantic, syntactic, stylistic, or conceptual pattern uniting the examples.
- Each example line includes a normalized activation score between 0 and 1 (and sometimes relative to max), derived from an energy measure. Higher values indicate that this example is more important for the concept.
- If the examples are uninformative, do not dwell on them or list them; instead, give the most concise possible summary of the pattern you can infer across the examples.
- Do not repeat or reference the marker tokens (<< >>) in your interpretation.
- Do not list multiple possible interpretations or speculate; provide a single, clear, crisp description.
- Keep your interpretation short, direct, and precise.
- Be as specific as possible so that the interpretation of the description is unambiguous.

RESPONSE FORMAT (STRICT):
- You may optionally include a brief explanation first.
- The FINAL line of your response MUST be exactly of the form:
  [interpretation]: <your one-sentence description>
- That line:
  * MUST start with "[interpretation]:"
  * MUST be plain text (no markdown bullets, no code fences, no quotes around it).
  * MUST appear exactly once.
  * MUST be the last line of your response. Do NOT output anything after that line.

Example of a valid final line:
[interpretation]: A concept describing X that appears mostly in Y contexts.

Now, follow the instructions carefully, adhering to the format outlined above. The examples:
examples
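The strict response format above lends itself to simple mechanical checking. The paper does not describe its parsing code, so the following is only a sketch of how the final [interpretation] line could be extracted and validated; the function name is hypothetical.

```python
# Hypothetical post-processing for the strict response format above (the
# paper does not specify how model responses are parsed).
def extract_interpretation(response):
    """Return the description from the final '[interpretation]: ...' line,
    or None if the strict format is violated."""
    marker = "[interpretation]:"
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines or not lines[-1].startswith(marker):
        return None  # the interpretation must be the last non-empty line
    if sum(ln.startswith(marker) for ln in lines) != 1:
        return None  # the marker must appear exactly once
    return lines[-1][len(marker):].strip()
```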
SAE Feature Relevance Prompt

System Message
You analyze neural network interpretability features. For each feature, explain why it is or isn't relevant to the given token. Be concise but specific in your reasoning.

User Prompt
Analyze which SAE (Sparse Autoencoder) features are semantically relevant to this specific token.

TOKEN: "<TOKEN>"
ARTICLE: <ARTICLE>
CONTEXT: "<... left context ...>[<TOKEN>]<... right context ...>"

Active SAE features:
[<FEATURE_ID_1>]: <FEATURE_DESC_1>
[<FEATURE_ID_2>]: <FEATURE_DESC_2>
...
[<FEATURE_ID_N>]: <FEATURE_DESC_N>

For EACH feature above, determine if it is RELEVANT (true) or IRRELEVANT (false) to the token. Your response must be a JSON object with this structure:
{
  "features": [
    {
      "atom_id": <integer - the feature ID from the list above>,
      "reasoning": <string - your explanation of why this feature is or isn't relevant>,
      "relevant": <boolean - true if relevant, false if irrelevant>
    },
    ... (one entry for each feature listed above)
  ]
}
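The JSON format requested above can likewise be validated mechanically. The paper does not specify its parsing code; the sketch below (hypothetical names) checks that every listed feature receives exactly one boolean relevance judgment.

```python
# Hypothetical validation of the JSON response format requested above.
import json

REQUIRED_KEYS = {"atom_id", "reasoning", "relevant"}


def parse_feature_relevance(response, expected_ids):
    """Parse the model's JSON reply and check one entry exists per feature.
    Returns a mapping from feature ID to its boolean relevance judgment."""
    data = json.loads(response)
    entries = data["features"]
    assert {e["atom_id"] for e in entries} == set(expected_ids), "missing or extra features"
    for e in entries:
        assert REQUIRED_KEYS <= e.keys(), f"entry missing required keys: {e}"
        assert isinstance(e["relevant"], bool), "relevant must be a boolean"
    return {e["atom_id"]: e["relevant"] for e in entries}
```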
Annotating Task

In this task, you are given two groups of English examples, each with 12 examples: Group A (which we label as the positive group) and Group B (labeled the negative group). Each example includes:
● a token (a short piece of text, often a word or a subword such as "App" in "Apple")
● a context snippet (the surrounding text where that token appeared); the token will be highlighted with <<token>>
● a value (positive or negative).

The examples shown are meant to be the extremes: Group A contains examples with the largest positive values, and Group B contains examples with the most negative values. Your goal is to identify the main theme or pattern in each group (if one exists), and then label it with a category from the ones provided below.

Important: Occasionally, a few examples may have a value that is small compared to the maximum magnitude in that group, or may be of opposite sign (negative vs. positive). In those cases, treat them as noise and do not include them when formulating a pattern.

Pattern types
Patterns can be:
● Semantic (meaning-based): The shared pattern depends on what the text is saying. If you paraphrase the sentence but keep the meaning, the pattern still holds (same topic, entity type, situation, or expressed idea). Examples: "countries/cities", "sports teams", "medical terms", "dates/years"
● Syntactic / stylistic (surface-form-based): The shared pattern depends on how the text is written or where the token appears, and would still hold even if the topic changed completely. It's about format, punctuation, casing, position, or a structural template. Examples: opening quotes, sentence-initial token, parentheses/citations, capitalization patterns, sentences with the same template
● None: no pattern

Notes:
- A good test is: imagine paraphrasing the sentence (order, words, etc.); will it still belong in the set of examples? If yes, it is semantic; if not, it is syntactic.
- A pattern should be considered "real" only if it appears in the majority of the high-magnitude examples in that group. (If you notice 2–3 examples that don't fit, that's okay.)

To complete the task, please do the following:

1) For Group A (positive)
Write a short description of the main theme/pattern you see among the strongest positive examples.
● If the shared pattern is mostly about the token itself, say so.
● If it's mostly about the context/style/structure, say so.
● If it's mostly about a topic, describe the topic.
Then label Group A as one of:
● Semantic
● Syntactic
● None (you cannot find a meaningful shared pattern)

2) For Group B (negative)
Do the same for the strongest negative examples:
● Description
● Label: Semantic / Syntactic / None

Notes:
- Some words may be uncommon or unfamiliar. If needed, do a quick web search to understand what a word/name refers to.
- Always pay attention to the sentence that the token is actually in, for example:

Figure 9. Annotation instructions provided to graduate NLP students for labeling the loadings as either semantic or syntactic, for the analysis of §4.

Annotating Task

In this task, you are given a set of 25 English texts (examples). Each example includes a token (a short piece of text, often a word) and a context snippet (the surrounding text where that token appeared). Your goal is to identify the main theme or pattern shared across the examples.

Patterns can be:
● Shallow: the theme is very tight and specific (e.g., the same token or a repeated phrase/structure).
● Broad: the theme is clear, but the examples cover multiple subtypes/categories within that theme (more variety under one umbrella).
● None

A pattern should be considered "real" only if it appears in the majority of examples. (If you notice 2 or 3 examples that don't fit, that's ok.)

To complete the task, please do the following:
1. Write a short description of the main theme/pattern you see in the 25 examples.
● If the shared pattern is mostly about the token itself, say so.
● If it's mostly about the context/style/structure, say so.
● If it's mostly about a topic, describe the topic.
2. Choose one label for the set:
● Shallow: the theme is very tight and specific (the exact same token or a repeated phrase/structure).
● Broad: the theme is clear, but the examples cover multiple subtypes/categories within that theme (more variety under one umbrella).
● No theme: you cannot find a meaningful shared theme or pattern.

You should always look at the token: if it is the same token, then it is shallow; if it is a range of tokens within a theme, then it is broad. Additionally, if all the contexts have the same structure with only very slight variation, then it is shallow too.

Please note that some of the words might be uncommon words that you are not familiar with. In such cases, you will need to do a quick search over the Web to understand the meaning of the words.

Figure 10. Annotation instructions provided to annotators for labeling the Gaussians as either broad or narrow, for the analysis of §4.