
Paper deep dive

Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 88

Models: Gemma-2-9B, Llama-3.1-8B, Qwen3-8B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 1:06:20 AM

Summary

The paper identifies that transformer attention outputs reside in a low-dimensional subspace (approx. 60% effective rank) compared to MLP outputs and residual streams. This 'dimensional collapse' is a primary cause of the 'dead feature' problem in sparse dictionary learning. The authors propose 'Active Subspace Initialization' (ASI) for sparse autoencoders, which aligns feature directions with the active subspace of activations, reducing dead features from 87% to below 1% in large-scale models.

Entities (5)

Sparse Autoencoder · method · 100%
Transformer · architecture · 100%
Active Subspace Initialization · method · 95%
Attention Output · component · 95%
Dead Feature · phenomenon · 95%

Relation Signals (4)

Attention Output exhibits Low-Rank Structure

confidence 95% · attention outputs consistently display the strongest low-rank structure compared to MLP outputs and residual streams.

Active Subspace Initialization reduces Dead Feature

confidence 95% · Our approach reduces dead features from 87% to below 1% in Attention Output SAEs

Low-Rank Structure causes Dead Feature

confidence 90% · we find this low-rank structure as a key factor of the prevalent dead feature problem in sparse dictionary learning

Output Projection Matrix shapes Low-Rank Structure

confidence 90% · This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix.

Cypher Suggestions (2)

Find methods that mitigate the dead feature problem · confidence 90% · unvalidated

MATCH (m:Method)-[:REDUCES]->(p:Phenomenon {name: 'Dead Feature'}) RETURN m.name

Identify components exhibiting low-rank structure · confidence 90% · unvalidated

MATCH (c:Component)-[:EXHIBITS]->(p:Property {name: 'Low-Rank Structure'}) RETURN c.name

Abstract

Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60\%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full-rank, exhibiting effective ranks around $90\%$. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key factor in the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87\% to below 1\% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

87,796 characters extracted from source content.


Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

Junxuan Wang 1,2, Xuyang Ge 1,2, Wentao Shu 1,2, Zhengfu He 1,2, Xipeng Qiu 1,2

1 Shanghai Innovation Institute, Shanghai, China. 2 OpenMOSS Team, School of Computer Science, Fudan University, Shanghai, China. Correspondence to: Xipeng Qiu <xpqiu@fudan.edu.cn>.

Preprint. February 12, 2026. arXiv:2508.16929v4 [cs.LG] 11 Feb 2026.

Abstract

Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about 60% of the full space. In contrast, MLP outputs and residual streams remain much closer to full-rank, exhibiting effective ranks around 90%. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key factor in the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.

1. Introduction

Over the past years, mechanistic interpretability has shifted from a collection of proof-of-concept tools (Olsson et al., 2022; Wang et al., 2022; Meng et al., 2023; Gould et al., 2023) toward a fast-growing, scale-driven field (Ameisen et al., 2025; Lindsey et al., 2025). This transformation is driven by a wave of sparse dictionary learning methods, such as sparse autoencoders (SAEs) and their variants (Cunningham et al., 2023; Bricken et al., 2023b; Lindsey et al., 2024b), transcoders (Dunefsky et al., 2024; Ge et al., 2024), and low-rank sparse attention (He et al., 2025), that once targeted small models but are now being pushed to larger architectures and wider model families (Templeton et al., 2024; Gao et al., 2024; Hazra et al., 2025). As these approaches scale in performance and model coverage, they provide increasingly complete and fine-grained explanations of neural network behavior (Lindsey et al., 2024a; Gao et al., 2024).

However, scaling these approaches presents substantial practical challenges (Templeton et al., 2024; Gao et al., 2024; Mudide et al., 2025). As models and feature dictionaries grow, the parameter count increases rapidly, leading to significant computational overhead. Moreover, a large fraction of learned features remain inactive, resulting in considerable waste in both computation and memory (Templeton et al., 2024; Kissane et al., 2024). In this work, we identify the low-rank structure of the activations as a primary driver of dead features (Section 5.1).
Through singular value decomposition and effective dimensionality analyses (Roy & Vetterli, 2007; Staats et al., 2025), we show for the first time that the outputs of multi-head self-attention in transformer-based language models exhibit a remarkably low-rank structure (Section 4). Compared to multilayer perceptron (MLP) outputs and residual streams, attention outputs consistently concentrate in a low-dimensional subspace. We show that this behavior is robust across layers, datasets, and model families, including GPT-2 (Radford et al., 2019), Llama 3.1 (Dubey et al., 2024), Gemma 2 (Rivière et al., 2024), and Qwen 3 (Yang et al., 2025). This universality aligns with prior observations of shared structures across models (Olah et al., 2020; Chughtai et al., 2023; Gurnee et al., 2024; Wang et al., 2025), while revealing a distinct form of representation collapse (Hua et al., 2021; Jing et al., 2022) at the level of attention outputs. We trace the origin of this low-rank structure to the anisotropy of the output projection matrix $W^O$, which further compresses multi-head activations into a lower-dimensional subspace.

Figure 1. (left) Attention outputs exhibit pronounced low-rank structure compared to residual streams and MLP outputs, indicating that the attention layer writes into a subspace of the residual stream. (right) Low effective dimensionality of activations is a root cause of dead features in sparse dictionary learning methods; setting feature directions in the active subspace mitigates this issue. [Figure labels: Residual Stream, Full Space, Attention Active Subspace, Alive Feature, Dead Feature, Learned Dictionary, Activations with Low Effective Dimensionality.]

In Section 5, we investigate how the low-rank structure of attention outputs interacts with SAE training. By evaluating the full suite of open-source SAEs from LlamaScope (He et al., 2024), we show that low effective dimensionality strongly correlates with the number of dead features, suggesting a mismatch between random initialization and the low-dimensional geometry of the activations. Drawing inspiration from Phan et al. (2025)'s principal component initialization for the first network layer, we propose Active Subspace Initialization (ASI), which aligns SAE features with the active subspace of activations, substantially reducing dead features while improving reconstruction. Following Lindsey et al. (2024a) and Gao et al. (2024), we conduct scaling experiments, which further reveal that ASI achieves superior reconstruction across feature counts, and when combined with SparseAdam [1], it achieves the best reconstruction at large scale and reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features trained on Llama-3.1-8B (Dubey et al., 2024).

Furthermore, we show that Active Subspace Init can generalize to sparse replacement models (He et al., 2025; Dunefsky et al., 2024; Ameisen et al., 2025) (Section 5.4). When applied to other sparse dictionary learning methods, our initialization procedure systematically reduces the prevalence of dead parameters across architectures.

[1] See the SparseAdam documentation for details.
2. Related Work

2.1. Representation Collapse in Neural Representations and Low-Rankness in Attention Mechanisms

A long line of research has shown that neural representations frequently concentrate in low-dimensional subspaces, forming representation collapse (Hua et al., 2021; Tian et al., 2021; Jing et al., 2022). These works generally focus on visual models trained using self-supervised methods.

Within attention mechanisms, prior work has investigated various notions of "low-rankness": low-rank approximation of attention patterns (Wang et al., 2020; Tay et al., 2020; Raganato et al., 2020), low-rank parameterization for model compression (Noach & Goldberg, 2020; Hu et al., 2022), and the inherent low-rank bottleneck in single-head outputs (Bhojanapalli et al., 2020). Different from these prior lines of work, we demonstrate that multi-head self-attention outputs exhibit a low-rank structure, revealing a distinct and under-explored phenomenon.

2.2. Superposition Hypothesis and Sparse Dictionary Learning Methods

The superposition hypothesis posits that neurons encode multiple non-orthogonal underlying features (Arora et al., 2018; Olah et al., 2020; Elhage et al., 2022; Park et al., 2024). Motivated by this view, a variety of sparse dictionary learning methods have been developed for interpretability, including sparse autoencoders and their variants (Cunningham et al., 2023; Bricken et al., 2023b; Lindsey et al., 2024b), transcoders (Dunefsky et al., 2024; Ge et al., 2024), and low-rank sparse attention (He et al., 2025). These approaches decompose activations into sparse combinations of learned features while differing in their mechanisms for predicting or approximating feature activations. Their successful application across a wide range of model scales (Templeton et al., 2024; Lieberum et al., 2024; He et al., 2024), architectures (Wang et al., 2025), and modalities (Abdulaal et al., 2024) highlights their practical effectiveness for interpretability; however, they do not constitute direct hypothesis tests of the superposition hypothesis, which remains an active topic of debate (Sharkey et al., 2025).

2.3. Dead Features in Sparse Dictionary Learning Methods

A persistent challenge in sparse dictionary learning methods is the emergence of dead features [2] (Templeton et al., 2024; Kissane et al., 2024), which are also referred to as dead units in sparse replacement models (Dunefsky et al., 2024; Ge et al., 2024; He et al., 2025). These features contribute nothing to reconstruction quality, wasting parameters and computation. Existing approaches to mitigate this issue rely on auxiliary loss terms (Gao et al., 2024; Conerly et al., 2025) or resampling strategies (Bricken et al., 2023b) to encourage feature usage.

[2] Following Bricken et al. (2023b), we define a feature as dead if it never activates over 10 million tokens in this paper.

2.4. PCA-Inspired Network Initialization

A common practice applies PCA to input data for dimensionality reduction before network training (Hastie et al., 2009; Montavon et al., 2012; Jolliffe, 1986; Bishop & Nasrabadi, 2007). Recently, Phan et al. (2025) proposed PCsInit, which initializes the first-layer weights of a network with the top principal components of the data, embedding the PCA transform directly into the network. This provides the model with a superior parameter set (Gu et al., 2025), boosting performance by construction.
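To make the idea in Section 2.4 concrete, the sketch below shows one way to embed PCA into a first layer in PyTorch. It is a generic illustration in the spirit of PCsInit, not Phan et al.'s implementation; the layer sizes and data are placeholders.

import torch

def pca_init_first_layer(layer: torch.nn.Linear, data: torch.Tensor) -> None:
    """Initialize a Linear layer's weight rows with top principal components:
    the leading right singular vectors of the mean-centered data."""
    centered = data - data.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # rows of vh = principal directions
    k = layer.out_features
    with torch.no_grad():
        layer.weight.copy_(vh[:k])  # top-k principal components become the weights
        layer.bias.zero_()

layer = torch.nn.Linear(64, 16)
activations = torch.randn(10_000, 64)  # placeholder data
pca_init_first_layer(layer, activations)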
3. Preliminaries

3.1. Multi-Head Self-Attention and Notations

We consider a Transformer block with multi-head self-attention (MHSA) (Vaswani et al., 2017). Given input activations $X \in \mathbb{R}^{n \times d}$, where $n$ is the token count and $d$ is the model hidden size, each attention head $i$ computes:

$$Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V, \qquad W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_h},$$

where $d_h = d/H$ is the dimensionality of each head, and $H$ is the total number of heads. The attention weights and head outputs are then given by:

$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_h}}\right), \qquad Z_i = A_i V_i \in \mathbb{R}^{n \times d_h}.$$

Let $Z = \mathrm{Concat}[Z_1, \ldots, Z_H] \in \mathbb{R}^{n \times d}$ denote the concatenated outputs of all attention heads (Nanda & Bloom, 2022). The final attention output is obtained by applying the output projection:

$$O = ZW^O = [Z_1, \ldots, Z_H]\begin{bmatrix} W_1^O \\ \vdots \\ W_H^O \end{bmatrix} = \sum_{i=1}^{H} Z_i W_i^O = \sum_{i=1}^{H} O_i,$$

where each $W_i^O \in \mathbb{R}^{d_h \times d}$ is the submatrix of $W^O \in \mathbb{R}^{d \times d}$ associated with head $i$. This formulation makes explicit that $O$ is the sum of the outputs from all heads, where each head produces a rank-$d_h$ output that is projected into the residual stream space through its corresponding $W_i^O$. Thus, $O$ represents the attention block's total contribution to the residual stream.

3.2. TopK Sparse Autoencoders

In this work, we adopt the TopK sparse autoencoder (TopK SAE) introduced by Gao et al. (2024). Unlike standard SAEs that impose an $\ell_1$ penalty, TopK SAE enforces exact sparsity by keeping only the top-$k$ activations in the latent representation for each input. Formally, given an input vector $x \in \mathbb{R}^d$, the encoder produces

$$z = \mathrm{TopK}(W_e x + b_e),$$

where $\mathrm{TopK}(v)$ sets to zero all but the largest $k$ entries of $v$. The decoder then reconstructs

$$\hat{x} = W_d z + b_d.$$

The model is trained to minimize the reconstruction loss, optionally augmented with an auxiliary loss to prevent dead latents:

$$\mathcal{L}_{\text{TopK-SAE}} = \|x - \hat{x}\|_2^2 + \alpha \cdot \mathcal{L}_{\text{aux}},$$

where $\mathcal{L}_{\text{aux}}$ is an optional term designed to penalize latents that never activate over a training period, and $\alpha$ balances reconstruction fidelity and latent utilization.
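For concreteness, here is a minimal PyTorch sketch of the TopK SAE just described. This is a generic reading of the equations in Section 3.2, not the authors' training code: TopK is applied directly to the pre-activations and the auxiliary loss is omitted.

import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK SAE: z = TopK(W_e x + b_e), x_hat = W_d z + b_d."""

    def __init__(self, d: int, h: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d, h)  # W_e, b_e
        self.decoder = nn.Linear(h, d)  # W_d, b_d

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre = self.encoder(x)                    # (batch, h) pre-activations
        top = torch.topk(pre, self.k, dim=-1)    # keep only the k largest entries
        z = torch.zeros_like(pre).scatter_(-1, top.indices, top.values)
        return self.decoder(z), z

sae = TopKSAE(d=4096, h=32768, k=50)             # sizes from the paper's Llama-3.1-8B setup
x = torch.randn(8, 4096)
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).sum(dim=-1).mean()     # reconstruction loss; L_aux omitted here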
4. Low-Rank Structure of Attention Outputs

We begin by presenting our central empirical finding: in Transformer models, attention outputs consistently display the strongest low-rank structure compared to MLP outputs and residual streams. As shown in Figure 2, attention outputs have a significantly lower effective rank. This phenomenon is remarkably robust, holding across different layers, model families, and datasets. Further details regarding the activation sources are provided in Section 4.2 and Appendix A. These observations highlight that the attention block modifies a subspace of the residual stream, while the MLP operates on nearly the full space.

Figure 2. Across layers, model families, and datasets, attention outputs exhibit dramatically lower effective rank than residual streams and MLP outputs, indicating that the attention layer writing into a low-dimensional subspace of the residual stream is a universal phenomenon. Details in Section 4.1. (left) Evaluation of Llama-3.1-8B on the SlimPajama (Soboleva et al., 2023) dataset, across layers. (mid) Middle-layer analysis across model families (GPT-2, Llama-3.1-8B, Gemma-2-9B, Qwen3-8B) on SlimPajama. (right) Middle-layer analysis of Llama-3.1-8B across datasets (Multi-Corpora, Github, Arxiv).

4.1. Quantifying Low-Rankness with Effective Rank

We consider the activation matrix $A \in \mathbb{R}^{n \times d}$, where each row corresponds to the activation vector of a single token. Here, $n$ denotes the number of data points and $d$ the dimensionality of the activation space (e.g., the model's hidden size). Unless otherwise specified, $\tilde{A}$ represents mean-centered activations.

To quantify the effective dimensionality of the activations, we adopt the effective rank metric introduced by Roy & Vetterli (2007).

Definition 4.1 (Effective Rank, Roy & Vetterli (2007)). Let $\tilde{A}$ be a nonzero matrix with singular value decomposition $\tilde{A} = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ contains singular values in descending order. Define the normalized singular value distribution as

$$p_k = \frac{\sigma_k}{\sum_{j=1}^{r} \sigma_j}, \qquad k = 1, 2, \ldots, r.$$

The (Shannon) entropy of this distribution is

$$H(p_1, p_2, \ldots, p_r) = -\sum_{k=1}^{r} p_k \log p_k.$$

Then, the effective rank of $\tilde{A}$ is defined as

$$\mathrm{erank}(\tilde{A}) = \exp\big(H(p_1, p_2, \ldots, p_r)\big).$$

Intuitively, the effective rank captures how evenly the singular values are distributed: higher values indicate a more isotropic spectrum, whereas lower values reflect concentration along a few dominant directions. The fraction of effective rank used in Figure 2 is the effective rank divided by the dimension of the activation space.

Following Bricken et al. (2023b) and Rajamanoharan et al. (2024a), we compute the fraction of downstream loss recovered by varying the number of retained components. We decompose the activations into singular value components and take the language model loss under full activation ablation as a baseline. As components are progressively reintroduced, we report the fraction of this ablated loss that is recovered. See Appendix B for a formal definition. These quantitative measures complement our core analyses by providing a numerical characterization of the low-rank structure present in activations.
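Definition 4.1 translates directly into a few lines of NumPy. A generic sketch (not the paper's analysis code), assuming the activation matrix fits in memory:

import numpy as np

def effective_rank(acts: np.ndarray) -> float:
    """Effective rank (Roy & Vetterli, 2007) of mean-centered activations."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    s = s[s > 0]
    p = s / s.sum()                                # normalized singular value distribution
    entropy = -(p * np.log(p)).sum()               # Shannon entropy
    return float(np.exp(entropy))

acts = np.random.randn(10_000, 256) @ np.random.randn(256, 256)
print(effective_rank(acts) / acts.shape[1])        # fraction of effective rank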
4.2. Experiment Settings

For each activation type, we collect a total of 10 million activation vectors [3]. We empirically verified that this sample size suffices to ensure stable and reproducible singular spectrum analyses in Appendix C. Unless otherwise specified, all experiments in Section 4 run on the middle layer of Llama-3.1-8B (layer 15, zero-indexed), using the SlimPajama dataset.

[3] In extremely rare cases, outlier activations inflate variance along certain directions, potentially biasing variance-based dimensionality estimates. To mitigate this, we exclude activations whose norms exceed 5σ from the mean.

4.3. Empirical Evidence of Low-Rank Structure

We draw our findings from three lines of evidence:

Low Effective Rank of Attention Output. Attention outputs have an effective rank of around 60% of the total dimensionality. In contrast, MLP outputs and the residual streams show much higher effective ranks, around 90% (Figure 2).

Rapid Singular Spectral Decay in Attention Output. This is quantitatively evidenced by the number of components retaining significant energy: only 74.7% of singular values exceed 1% [4] of the maximum in attention output, versus 100.0% for MLP output and residual stream (Figure 3a).

[4] The resolution of bfloat16 (Kalamkar et al., 2019) is 0.01.

Efficient Downstream Loss Recovery. Compared to zero ablation, attention output requires only 39.1% of dimensions to recover over 99% of the downstream loss, versus 95.3% and 96.9% of the dimensions for MLP outputs and residual streams, respectively, to recover the same proportion (Figure 3b).

Figure 3. (a) Singular value spectra: the attention output is the most low-rank, as indicated by the sharpest decay in singular values. (b) Fraction of loss recovered using varying numbers of top singular components.

More results for these metrics across different layers, models, datasets, and activation positions are shown in Appendix D.

4.4. The Output Projection Matrix Further Reduces the Effective Rank of Attention Outputs

Among all activation types, attention outputs consistently exhibit the most rapid singular spectral decay. To investigate whether this low-rank structure originates from the attention head outputs ($Z$), the output projection matrix ($W^O$), or their interaction, we perform a decomposition-based analysis.

Recall that the attention output is computed as $O = ZW^O$, where $Z \in \mathbb{R}^{n \times d}$ is the concatenated output of all attention heads, and $W^O \in \mathbb{R}^{d \times d}$ is a learned linear projection. To understand how the singular value spectrum of $O$ is shaped, we analyze the variance [5] of $O$ along a unit direction $\hat{e} \in \mathbb{R}^d$, given by:

$$\mathrm{Var}(O\hat{e}) = \mathrm{Var}(ZW^O\hat{e}).$$

[5] For zero-mean activations, singular values correspond to the standard deviations of activations along the associated singular directions.

This expression highlights that the variance along $\hat{e}$ is determined by two factors: the norm of $W^O\hat{e}$ and the variance of $Z$ along $W^O\hat{e}$. Specifically, we can rewrite the variance as:

$$\mathrm{Var}(O\hat{e}) = \mathrm{Var}(Z\hat{v}) \cdot \|v\|_2^2, \qquad \text{where } v = W^O\hat{e}, \quad \hat{v} = \frac{v}{\|v\|_2}.$$

We refer to $\mathrm{Var}(Z\hat{v})$ as the contribution of $Z$, capturing how much variance the head output $Z$ provides in that direction, and $\|v\|_2^2$ as the contribution of $W^O$, measuring how much the output projection $W^O$ scales or suppresses that direction. We compute and visualize the singular values of the attention output $O$ and these two contributions, as shown in Figure 4.

Figure 4. Decomposition of singular value spectra in attention output $O$. We analyze the contributions of the concatenated head outputs $Z$ and the projection matrix $W^O$ to the singular values of $O$ ($= ZW^O$). For each component, the red value is the product of the purple and blue values. The curve of $O$ closely follows that of $Z$ for the top components, whereas its downward trend at the tail is mainly due to the $W^O$ contribution.

This analysis reveals that the low-rank structure of attention outputs is strongly influenced by $W^O$, which further compresses $Z$ into a lower-dimensional subspace. The analysis of the effective rank of $Z$ in Figure 11 further supports this conclusion. From a mechanistic perspective, an intuitive explanation is that although each attention head contributes a $d_{\text{head}}$-dimensional subspace, the superposition of attention heads (Jermyn et al., 2024; He et al., 2025) inevitably induces overlaps among these subspaces. Let $O_i$ denote the output of the $i$-th attention head. Consequently, the dimension of the MHSA output satisfies

$$\dim\Big(\bigcup_i \mathrm{span}(O_i)\Big) \le \sum_i \dim\big(\mathrm{span}(O_i)\big) = d_{\text{head}} \cdot n_{\text{head}} = d_{\text{model}} \quad \text{(in standard MHSA)}.$$
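The decomposition in Section 4.4 can be checked numerically. The sketch below uses synthetic stand-ins for $Z$ and $W^O$ (not model weights) and verifies that the variance along a singular direction of $O$ factors into the $Z$ and $W^O$ contributions:

import numpy as np

rng = np.random.default_rng(0)
n, d = 20_000, 64
Z = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # stand-in head outputs
W_O = rng.standard_normal((d, d)) * 0.1                        # stand-in output projection
O = Z @ W_O

Oc = O - O.mean(axis=0)                            # mean-center before taking directions
_, s, vt = np.linalg.svd(Oc, full_matrices=False)

e_hat = vt[0]                                      # top right singular direction of O
v = W_O @ e_hat                                    # that direction pulled back into Z-space
v_hat = v / np.linalg.norm(v)

var_O = (Oc @ e_hat).var()                         # Var(O e_hat)
var_Z = ((Z - Z.mean(axis=0)) @ v_hat).var()       # Var(Z v_hat): contribution of Z
scale = np.linalg.norm(v) ** 2                     # ||v||^2: contribution of W_O
print(np.isclose(var_O, var_Z * scale))            # True: the two factors multiply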
5. Active Subspace Initialization for Sparse Autoencoders

5.1. Empirical Correlation Between Low-Rank Structure and Dead Features

To study how low-rankness affects the interpretability of attention, we adopt the same framework and dataset as the original LlamaScope study (He et al., 2024) to evaluate their SAEs trained on attention outputs, MLP outputs, and the residual stream [6]. We find that the number of dead features is strongly related to effective rank, as shown in Figure 5. This observation suggests that dead features may stem from the low-rank geometry of the activation space. We also train SAEs using different SAE hyperparameters and further systematically verify that this phenomenon is prevalent in Appendix F.

[6] Another prominent set of open-source SAEs, GemmaScope (Lieberum et al., 2024), trains its attention SAEs on $Z$ rather than the attention output.

Figure 5. The number of dead features (left) and the effective rank (mid) of each activation in Llama-3.1-8B show a surprising consistency (right): activations with lower effective rank have more dead features, corresponding to all layers of attention output and the last two layers of MLP output.

5.2. Active Subspace Initialization for Sparse Autoencoders

Based on this observation, we propose Active Subspace Initialization (ASI), a lightweight and generalizable strategy for scaling SAEs to high capacities. Let $d$ denote the input dimension, $h$ the hidden dimension of the SAE, and $n$ the number of data points. Given activation matrices $\tilde{A} \in \mathbb{R}^{n \times d}$ with singular value decomposition $\tilde{A} = U \Sigma V^\top$, where $V \in \mathbb{R}^{d \times d}$ contains the right singular vectors, we select the top $d_{\text{init}}$ singular vectors to define the active subspace:

$$V_{\text{active}} = V_{:, :d_{\text{init}}} \in \mathbb{R}^{d \times d_{\text{init}}}.$$

To initialize the SAE within this subspace, we first randomly initialize the decoder weights $W_D \in \mathbb{R}^{h \times d}$ and then project them onto the active subspace:

$$W_D \leftarrow W_D V_{\text{active}} V_{\text{active}}^\top, \qquad W_E = W_D^\top,$$

where $W_E$ is the encoder weight matrix and $W_D$ is the decoder weight matrix. Intuitively, ASI aligns the initial SAE parameters with the effective directions of the data, ensuring that SAEs start in a meaningful low-dimensional subspace. As $d_{\text{init}}$ decreases from the full space dimension [7] within a certain range, the number of dead features in the Attention Output SAE rapidly drops, with a corresponding improvement in Mean Square Error (MSE) and Delta LM loss [8] (Figure 6). We refer readers to Appendix E for full SAE training details. Additional ablations, including the random subspace initialization baseline and the effect of activation rank, are reported in Appendix J.

[7] Setting $d_{\text{init}}$ equal to the full space dimension is equivalent to not using Active Subspace Initialization.

[8] Following He et al. (2024), this metric is defined as the difference between the original language model loss and the loss when the SAE is inserted at the corresponding position, evaluated over 1 million tokens.

Figure 6. After applying ASI: proportion of dead features (left), normalized MSE (mid), and Delta LM loss (right) across different $d_{\text{init}}$, for activations with a full-space dimension of 4096. All experiments repeat 3 times with different random seeds; error bars indicate mean ± std.
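Read off the equations above, ASI amounts to a few lines. The following is a generic sketch with toy dimensions (the paper's Llama-3.1-8B setup uses $d = 4096$ and $d_{\text{init}} = 768$), assuming mean-centered activations and a tied encoder; it is not the authors' implementation (their pseudocode is in Appendix K).

import torch

def active_subspace_init(w_d: torch.Tensor, acts: torch.Tensor, d_init: int):
    """Project a randomly initialized decoder W_D (h, d) onto the active
    subspace of the activations (n, d); return W_D and tied W_E = W_D^T."""
    centered = acts - acts.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # rows of vh = right singular vectors
    v_active = vh[:d_init].T                                    # (d, d_init) top singular directions
    w_d = w_d @ v_active @ v_active.T                           # project every feature direction
    return w_d, w_d.T

h, d, n = 2048, 256, 10_000  # toy sizes for illustration
w_d, w_e = active_subspace_init(torch.randn(h, d) * 0.01, torch.randn(n, d), d_init=48)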
Pseudocode for Active Subspace Initialization (ASI) is provided in Appendix K.

Using Active Subspace Initialization offers several benefits:

Reduced Dead Features and an Enhanced Sparsity-Reconstruction Frontier Without Additional Compute. It achieves near-zero dead features and slightly superior results compared to the auxiliary loss approach (AuxK), at no additional computational cost of the same order (Figure 7).

Optimal Scaling Characteristics. Our approach demonstrates optimal scaling behavior across various SAE training methods. It outperforms TopK and AuxK at every evaluated scale, from 16K to 1M features (Section 5.3).

General Applicability. The technique maintains applicability to diverse architectural variants and activation functions, as it operates directly on the intrinsic properties of activations. This generalizability is further explored in Section 5.4 and Appendix L.

Figure 7. At a fixed number of features (n = 32768), ASI (TopK SAE with Active Subspace Init) achieves a better reconstruction-sparsity trade-off than TopK (standard TopK SAE) and AuxK (TopK SAE with auxiliary loss), measured by dead features (%), normalized MSE, and Delta LM loss against sparsity (L0). A similar trend is observed in its impact on Delta LM Loss. All experiments repeat 3 times using different random seeds and show the mean; std results are in Appendix G due to resolution constraints.

Figure 8. Scaling results of TopK SAEs and their variants enhanced with AuxK, Active Subspace Init, and SparseAdam, all trained on attention output from the middle layer of Llama-3.1-8B. (a) Loss at convergence across different feature counts: Active Subspace Init consistently achieves lower reconstruction error than TopK and AuxK; Active Subspace Init with SparseAdam achieves the best results at large scale. (b) Dead features: Active Subspace Init reduces dead features compared to TopK, but still retains many at extremely large scales; enhanced with SparseAdam, dead features can be reduced to less than 1%. (c) Loss across different numbers of alive features: Active Subspace Init achieves the most efficient utilization of alive features, while AuxK shows the lowest efficiency. Details in Section 5.3.

We conduct a statistical significance test in Appendix H to demonstrate the statistical significance of our conclusions. We validate the effectiveness of ASI across different layers, models, and datasets in Appendix I.
We further compare the features of TopK and ASI in Appendix N, analyzing both the degree of monosemanticity and the behavior of SAE features in the dead subspace, to ensure that ASI increases the number of alive features while preserving feature quality and maintaining the dictionary's coverage and reconstruction performance in the dead subspace.

5.3. Scaling Laws

To assess scaling, we evaluate our method as the number of SAE features grows from 16K to 1M, keeping other hyperparameters fixed (see Appendix E).

Active Subspace Init Improves Reconstruction. As shown in Figure 8a, Active Subspace Init consistently outperforms TopK and AuxK across all scales.

Caveat: Some Dead Features Remain at Extremely Large Scales with Active Subspace Init. Figure 8b shows that, when scaling to extremely large feature counts, Active Subspace Init produces more dead features than AuxK. However, reconstruction performance remains better, indicating that the features revived by AuxK contribute little to actual reconstruction quality (Figure 8c).

Using Active Subspace Init with SparseAdam Further Improves Performance. Prior work (Bricken et al., 2023a) identified stale momentum as a key factor in dead feature formation. Building on this insight, we propose using SparseAdam, an optimizer specifically designed for sparse activation settings. By updating only the momentum terms and parameters corresponding to non-zero gradients, SparseAdam avoids stale momentum and thus mitigates the dead feature issue. As shown in Figures 8a and 8b, combining Active Subspace Init with SparseAdam substantially reduces dead features while reaching the lowest reconstruction error. While orthogonal to our initialization method, this choice provides a practical complement that further stabilizes training when scaling SAEs to very large capacities. We discuss stale momentum and SparseAdam further in Appendix M.
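The paper does not spell out how gradients are handed to torch.optim.SparseAdam, so the masking step below is an illustrative assumption: the row-sparse (but densely stored) gradient of the decoder is re-expressed as a sparse tensor, so the optimizer only touches the moments of active latents and dead latents accumulate no stale momentum.

import torch

h, d, k = 1024, 64, 16
decoder = torch.nn.Parameter(torch.randn(h, d) * 0.01)  # SAE decoder W_D
opt = torch.optim.SparseAdam([decoder], lr=6e-5)         # lr from the paper's Appendix E

# One illustrative step: only k latents fire, so only k decoder rows get gradient.
z = torch.zeros(1, h)
z[0, torch.randperm(h)[:k]] = 1.0                        # stand-in TopK activations
loss = ((z @ decoder - torch.randn(1, d)) ** 2).sum()
loss.backward()

# Keep only the nonzero rows of the dense gradient as a sparse COO tensor
# (the same layout SparseAdam receives from sparse embedding gradients).
rows = decoder.grad.abs().sum(dim=1).nonzero().squeeze(1)
decoder.grad = torch.sparse_coo_tensor(rows.unsqueeze(0), decoder.grad[rows], decoder.shape)
opt.step()
opt.zero_grad()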
5.4. Generalizing to Sparse Replacement Models

Recent work by He et al. (2025) shows that Lorsa, a sparse replacement for MHSA, has a substantial fraction of dead parameters. We hypothesize this is partly due to initialization: standard random initialization ignores the low-dimensional active subspace of attention outputs (Section 4).

Applying ASI to Lorsa. To test whether our subspace-based approach extends beyond SAEs, we incorporate ASI (Section 5.2) into Lorsa training. When training Lorsa to approximate a target MHSA, we partition Lorsa heads into groups, matching the number of attention heads in the MHSA. For each group, we initialize the encoder and decoder matrices directly within the corresponding input and output active subspaces of the MHSA head it corresponds to. In addition, we initialize each group's Q/K weights from the MHSA Q/K parameters. Pseudocode is provided in Appendix K.

Results. This initialization sharply reduces dead parameters under identical hyperparameters (Figure 9) while improving reconstruction quality. Further Lorsa training details are in Appendix O.

Figure 9. At a fixed number of Lorsa heads (n = 32768), ASI (TopK Lorsa with Active Subspace Init) achieves a better reconstruction-sparsity trade-off than TopK (standard TopK Lorsa) and AuxK (TopK Lorsa with auxiliary loss), measured by dead features (%), normalized MSE, and Delta LM loss against sparsity (L0). A similar trend is observed in its impact on Delta LM Loss.

6. Discussion and Limitations

Low-rank attention outputs suggest a new direction for model improvement. Modern large language models are predominantly built by stacking attention and MLP blocks with residual connections. Our finding that attention outputs show a low-rank structure suggests a potentially new avenue for improving model capacity and expressivity: future architectures may benefit from explicitly mitigating this rank collapse within attention modules. Notably, this phenomenon persists across models with and without grouped-query attention (GQA), and under both relative and absolute positional encodings (Section 4), indicating that it may reflect a more fundamental property of the attention mechanism itself rather than an artifact of specific design choices.

Causality between Low-Rank Structure and Dead Features. We find a strong correlation between low-rank activations and the emergence of dead features (Section 5), but the underlying causal mechanism is unresolved. This effect may arise from optimization dynamics or feature competition, and we leave a rigorous explanation to future work.

When to Use Active Subspace Initialization. Active Subspace Initialization is most beneficial when activations exhibit pronounced low-rank structure, such as attention outputs, residual streams on some narrow datasets, or certain specialized situations. For activation sites without clear low-rank behavior, the improvements are marginal (Appendix J.2).

7. Conclusion

We identified the low-rank structure of attention outputs as a fundamental property of Transformer models and a key cause of dead features in sparse dictionary learning. Our proposed Active Subspace Initialization method addresses this by aligning SAE features with the intrinsic geometry of activations, reducing dead features while improving reconstruction quality. The approach generalizes beyond SAEs to sparse replacement models.

References
Abdulaal, A., Fry, H., Brown, N. M., Ijishakin, A., Gao, J., Hyland, S. L., Alexander, D. C., and Castro, D. C. An X-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation. CoRR, abs/2410.03334, 2024. https://doi.org/10.48550/arXiv.2410.03334

Ameisen, E., Lindsey, J., Pearce, A., et al. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html

Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. Linear algebraic structure of word senses, with applications to polysemy. Trans. Assoc. Comput. Linguistics, 6:483–495, 2018. https://doi.org/10.1162/tacl_a_00034

Bhojanapalli, S., Yun, C., Rawat, A. S., Reddi, S. J., and Kumar, S. Low-rank bottleneck in multi-head attention models. In ICML 2020, volume 119 of PMLR, pp. 864–873, 2020. http://proceedings.mlr.press/v119/bhojanapalli20a.html

Bishop, C. M. and Nasrabadi, N. M. Pattern Recognition and Machine Learning. J. Electronic Imaging, 16(4):049901, 2007. https://doi.org/10.1117/1.2819119

Bricken, T., Davies, X., Singh, D., Krotov, D., and Kreiman, G. Sparse distributed memory is a continual learner. In ICLR 2023, 2023a. https://openreview.net/forum?id=JknGeelZJpHP

Bricken, T., Templeton, A., Batson, J., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023b. https://transformer-circuits.pub/2023/monosemantic-features/index.html

Chughtai, B., Chan, L., and Nanda, N. A toy model of universality: Reverse engineering how networks learn group operations. In ICML 2023, volume 202 of PMLR, pp. 6243–6267, 2023. https://proceedings.mlr.press/v202/chughtai23a.html

Conerly, T., Cunningham, H., Templeton, A., Lindsey, J., Hosmer, B., and Jermyn, A. Circuits updates, January 2025. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/january-update/index.html#DL

Cunningham, H. and Conerly, T. Circuits updates, June 2024. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/june-update/index.html

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. CoRR, abs/2309.08600, 2023. https://doi.org/10.48550/arXiv.2309.08600

Dubey, A., Jauhri, A., Pandey, A., et al. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024. https://doi.org/10.48550/arXiv.2407.21783

Dunefsky, J., Chlenski, P., and Nanda, N. Transcoders find interpretable LLM feature circuits. CoRR, abs/2406.11944, 2024. https://doi.org/10.48550/arXiv.2406.11944

Elhage, N., Hume, T., Olsson, C., et al. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. CoRR, abs/2406.04093, 2024. https://doi.org/10.48550/arXiv.2406.04093

Ge, X., Zhu, F., Shu, W., Wang, J., He, Z., and Qiu, X. Automatically identifying local and global circuits with linear computation graphs. CoRR, abs/2405.13868, 2024. https://doi.org/10.48550/arXiv.2405.13868

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS 2010, volume 9 of JMLR Proceedings, pp. 249–256, 2010. http://proceedings.mlr.press/v9/glorot10a.html

Gould, R., Ong, E., Ogden, G., and Conmy, A. Successor heads: Recurring, interpretable attention heads in the wild, 2023. https://arxiv.org/abs/2312.09230

Gu, N., Chen, Y., Zhang, Z., Fu, P., Lin, Z., Wang, S., Sun, Y., Wu, H., Wang, W., and Wang, H. Advantageous parameter expansion training makes better large language models, 2025. https://arxiv.org/abs/2505.24241

Gurnee, W., Horsley, T., Guo, Z. C., Kheirkhah, T. R., Sun, Q., Hathaway, W., Nanda, N., and Bertsimas, D. Universal neurons in GPT2 language models. Trans. Mach. Learn. Res., 2024. https://openreview.net/forum?id=ZeI104QZ8I

Hastie, T., Tibshirani, R., and Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer, 2009. https://doi.org/10.1007/978-0-387-84858-7

Hazra, D., Loeffler, M., Cubuktepe, M., Avagyan, L., Gorton, L., Bissell, M., Lewis, O., McGrath, T., and Balsam, D. Under the hood of a reasoning model. Goodfire Research blog, 2025. https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model (accessed 2025-09-15)

He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., Jiang, Y., and Qiu, X. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders. CoRR, abs/2410.20526, 2024. https://doi.org/10.48550/arXiv.2410.20526

He, Z., Wang, J., Lin, R., Ge, X., Shu, W., Tang, Q., Zhang, J., and Qiu, X. Towards understanding the nature of attention with low-rank sparse decomposition. arXiv preprint arXiv:2504.20938, 2025.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In ICLR 2022, 2022. https://openreview.net/forum?id=nZeVKeeFYf9

Hua, T., Wang, W., Xue, Z., Ren, S., Wang, Y., and Zhao, H. On feature decorrelation in self-supervised learning. In ICCV 2021, pp. 9578–9588, 2021. https://doi.org/10.1109/ICCV48922.2021.00946

Jermyn, A., Olah, C., and Conerly, T. Circuits updates, January 2024. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/jan-update/index.html#attn-superposition

Jing, L., Vincent, P., LeCun, Y., and Tian, Y. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR 2022, 2022. https://openreview.net/forum?id=YevsQ05DEN7

Jolliffe, I. T. Principal Component Analysis. Springer, 1986. https://doi.org/10.1007/978-1-4757-1904-8

Kalamkar, D. D., Mudigere, D., Mellempudi, N., et al. A study of BFLOAT16 for deep learning training. CoRR, abs/1905.12322, 2019. http://arxiv.org/abs/1905.12322

Kissane, C., Krzyzanowski, R., Conmy, A., and Nanda, N. Sparse autoencoders work on attention layer outputs. Alignment Forum, 2024. https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A. D., Shah, R., and Nanda, N. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. CoRR, abs/2408.05147, 2024. https://doi.org/10.48550/arXiv.2408.05147

Lindsey, J., Conerly, T., Templeton, A., Marcus, J., and Henighan, T. Circuits updates, April 2024. Transformer Circuits Thread, 2024a. https://transformer-circuits.pub/2024/april-update/index.html#scaling-laws

Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J., and Olah, C. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024b. https://transformer-circuits.pub/2024/crosscoders/index.html

Lindsey, J., Gurnee, W., Ameisen, E., et al. On the biology of a large language model. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT, 2023. https://arxiv.org/abs/2202.05262

Montavon, G., Orr, G. B., and Müller, K. (eds.). Neural Networks: Tricks of the Trade, Second Edition. Volume 7700 of Lecture Notes in Computer Science, Springer, 2012. https://doi.org/10.1007/978-3-642-35289-8

Mudide, A., Engels, J., Michaud, E. J., Tegmark, M., and de Witt, C. S. Efficient dictionary learning with switch sparse autoencoders, 2025. https://arxiv.org/abs/2410.08201

Nanda, N. and Bloom, J. TransformerLens, 2022. https://github.com/TransformerLensOrg/TransformerLens

Noach, M. B. and Goldberg, Y. Compressing pre-trained language models by matrix decomposition. In AACL/IJCNLP 2020, pp. 884–889, 2020. https://doi.org/10.18653/v1/2020.aacl-main.88

Olah, C., et al. Zoom in: An introduction to circuits. Distill, 2020.

Olsson, C., Elhage, N., Nanda, N., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In ICML 2024, 2024. https://openreview.net/forum?id=UGpGkLzwpP

Phan, N., Nguyen, T., Halvorsen, P., and Riegler, M. A. Principal components for neural network initialization. CoRR, abs/2501.19114, 2025. https://doi.org/10.48550/arXiv.2501.19114

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Raganato, A., Scherrer, Y., and Tiedemann, J. Fixed encoder self-attention patterns in transformer-based machine translation. In Findings of EMNLP 2020, pp. 556–568, 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.49

Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramár, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders. CoRR, abs/2404.16014, 2024a. https://doi.org/10.48550/arXiv.2404.16014

Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. CoRR, abs/2407.14435, 2024b. https://doi.org/10.48550/arXiv.2407.14435

Rivière, M., Pathak, S., Sessa, P. G., et al. Gemma 2: Improving open language models at a practical size. CoRR, abs/2408.00118, 2024. https://doi.org/10.48550/arXiv.2408.00118

Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. In EUSIPCO 2007, pp. 606–610, 2007. https://ieeexplore.ieee.org/document/7098875/

Sharkey, L., Chughtai, B., Batson, J., et al. Open problems in mechanistic interpretability. Transactions on Machine Learning Research, 2025. https://openreview.net/forum?id=91H76m9Z94

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023. https://huggingface.co/datasets/cerebras/SlimPajama-627B

Staats, M., Thamm, M., and Rosenow, B. Small singular values matter: A random matrix analysis of transformer models, 2025. https://arxiv.org/abs/2410.17770

Tay, Y., Bahri, D., Metzler, D., Juan, D., Zhao, Z., and Zheng, C. Synthesizer: Rethinking self-attention in transformer models. CoRR, abs/2005.00743, 2020. https://arxiv.org/abs/2005.00743

Templeton, A., Conerly, T., Marcus, J., et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Tian, Y., Chen, X., and Ganguli, S. Understanding self-supervised learning dynamics without contrastive pairs. In ICML 2021, volume 139 of PMLR, pp. 10268–10278, 2021. http://proceedings.mlr.press/v139/tian21a.html

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000–6010, 2017.

Wang, J., Ge, X., Shu, W., Tang, Q., Zhou, Y., He, Z., and Qiu, X. Towards universality: Studying mechanistic similarity across language model architectures. In ICLR 2025, 2025. https://openreview.net/forum?id=2J18i8T0oI

Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small, 2022. https://arxiv.org/abs/2211.00593

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768, 2020. https://arxiv.org/abs/2006.04768

Yang, A., Li, A., Yang, B., et al. Qwen3 technical report. CoRR, abs/2505.09388, 2025. https://doi.org/10.48550/arXiv.2505.09388

Author Contributions

Junxuan Wang and Zhengfu He co-discovered the low-rank structure in attention outputs and its correlation with dead features. Junxuan Wang proposed the Active Subspace Initialization method and the use of SparseAdam for SAE training, and conducted all experiments. Xuyang Ge, Zhengfu He, and Wentao Shu made core contributions to the SAE training codebase infrastructure. Zhengfu He helped to edit the manuscript. Xipeng Qiu supervised the research and provided research guidance.

A. Activation Sources

The spectral characteristics of activations vary substantially across model architectures, datasets, and positional contexts. Below, we describe the experimental configurations used to support a broad and representative analysis.

Models. We study four large language models of different families, GPT-2 [9], Llama-3.1-8B [10], Qwen3-8B [11], and Gemma-2-9B [12], all based on the Transformer architecture. This allows us to assess the robustness of spectral properties under varying model training configurations.

Datasets. To investigate how dataset diversity affects activation spectra, we select two datasets with varying linguistic and domain characteristics: (1) SlimPajama [13], an English corpus comprising web text, Github, Arxiv, and other sources. Multi-Corpora in Figure 2 denotes the setting where all SlimPajama components are jointly used with random mixing.
GitHub and arXiv correspond to the code and scientific-paper subsets of SlimPajama, respectively. (2) CCI3-Data (https://huggingface.co/datasets/BAAI/CCI3-Data), a Chinese dataset with broad domain coverage, used in Appendix D as a supplement.

Activation Positions: We analyze three types of activations: (1) attention output, (2) MLP output, and (3) residual stream (post-layer).

B. Formal Definition of Fraction of Loss Recovered

Let loss_original be the original language-model cross-entropy loss, loss_zero the loss after ablating the activation at a specific position to zero, and loss_recovered the loss after replacing the original activation with its projection onto the subspace spanned by the first n singular vectors. Then, for these n components, the fraction of loss recovered is calculated as:

fraction of loss recovered = (loss_zero − loss_recovered) / (loss_zero − loss_original)

C. Error Analysis in Singular Value Decomposition

For the attention output of layer 15 of Llama-3.1-8B, we performed five singular value decompositions, each on a different set of 10 million tokens, and calculated the coefficient of variation (CV) of each singular value across these five runs. The maximum CV was only 4.9 × 10⁻³, and the mean and standard deviation of the effective rank computed from these five SVD results were 2523.165 and 0.404, respectively, with a CV of 1.5 × 10⁻⁴. These error experiments show that using 10 million tokens for singular value decomposition is sufficiently stable.

Figure 10. Relative singular values of the attention output, MLP output, and residual stream. (a) Middle layer of pythia-2.8b, SlimPajama; effective dimensionality: attention output 1670, MLP output 2252, residual stream 2327. (b) Middle layer of Qwen3-8B, CCI3-Data; attention output 2356, MLP output 3558, residual stream 3140. (c) Middle layer of Qwen3-8B, SlimPajamaGithub; attention output 2410, MLP output 3495, residual stream 2000. (d) Middle layer of Qwen3-4B, SlimPajama; attention output 1122, MLP output 1485, residual stream 990. (e) Middle layer of gpt2, SlimPajama; attention output 515, MLP output 703, residual stream 656. (f) Middle layer of pythia-160m, SlimPajama; attention output 434, MLP output 594, residual stream 662.

D. More Singular Spectrum and Effective Rank Results

D.1. Across Models and Datasets

We present relative singular values for additional model-dataset pairs in Figure 10. Models include pythia-160m (https://huggingface.co/EleutherAI/pythia-160m) and pythia-2.8b (https://huggingface.co/EleutherAI/pythia-2.8b). Datasets include SlimPajamaGithub (a subset of SlimPajama) and CCI3-Data.
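The effective dimensionality figures above follow the effective rank of Roy & Vetterli (2007). As a reference point, here is a minimal sketch of that computation, assuming the standard entropy-of-singular-values definition and demeaned activations (function and variable names are ours):

```python
import torch

def effective_rank(activations: torch.Tensor) -> float:
    """Entropy-based effective rank (Roy & Vetterli, 2007) of an
    activation matrix of shape [n_tokens, d_model]."""
    X = activations - activations.mean(dim=0)   # demean each dimension
    s = torch.linalg.svdvals(X)                 # singular values, descending
    p = (s / s.sum()).clamp_min(1e-12)          # normalized spectrum
    entropy = -(p * p.log()).sum()              # Shannon entropy in nats
    return entropy.exp().item()                 # exp(H) = effective rank
```

A full-rank matrix with a flat spectrum gives an effective rank close to d_model, while a spectrum dominated by a few directions gives a much smaller value.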
D.2. Across Layers and Activation Positions

We present in Figure 11 the effective rank of activations commonly used to train SAEs: the concatenated outputs of all attention heads (Z), the attention output, the hidden activations of the MLP (post activation function), the MLP output, and the residual stream. All effective ranks are computed on SlimPajama.

Figure 11. Effective rank ratio per layer for the concatenated outputs of all attention heads, attention output, MLP hidden (post-activation), MLP output, and residual stream in (a) Llama-3.1-8B, (b) Gemma-2-9B, and (c) Qwen3-8B.

E. SAE Training Details

We train SAEs as described below.

E.1. Hyperparameters

Model, Dataset, Layer, Position: Llama-3.1-8B, SlimPajama, layer 15 (zero-indexed), attention output.

Sparsity: We set k = 50 for a reasonable sparsity level following He et al. (2024), except in the experiments that sweep k.

Dictionary Size: We set n_features = 32768, which is 8 × d_model, except in the experiments that sweep the dictionary size (scaling law).

Batch Size: We set the batch size to 4096.

Optimizer: We use the Adam and SparseAdam optimizers, both with β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸. Unless otherwise specified, Adam is used by default.

Learning Rate: The learning rate is swept separately for Adam and SparseAdam over [1e-5, 2e-5, 4e-5, 6e-5, 8e-5, 1e-4, 2e-4, 4e-4]; we ultimately use 4e-5 for Adam and 6e-5 for SparseAdam. We employ a three-phase learning-rate schedule consisting of a linear warm-up, a stable phase, and a linear decay. The learning rate increases linearly from zero to its maximum value over the first 500 steps, remains constant during the intermediate phase, and then decays linearly to 1% of the maximum value over the final 20% of the total training steps.

AuxK: We follow Gao et al. (2024) and set the auxiliary loss coefficient α to 1/32. We sweep k_aux over [256, 512, 1024, 2048] and choose 512. We also sweep α and find the results are not sensitive to α within a reasonable interval.

Dimension of Subspace for SAE Initialization (d_init): We use 768 for all experiments, except those that sweep d_init. We refer readers to Appendix J.3 for the rationale.

Total Tokens: We use 800M tokens for each SAE training run.
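For concreteness, a minimal sketch of the three-phase learning-rate schedule described above, written as a function of the step index (function and argument names are ours):

```python
def lr_at_step(step: int, total_steps: int, max_lr: float,
               warmup_steps: int = 500, decay_frac: float = 0.2,
               final_frac: float = 0.01) -> float:
    """Linear warm-up to max_lr, a constant phase, then linear decay
    to final_frac * max_lr over the last decay_frac of training."""
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        return max_lr * step / warmup_steps          # warm-up phase
    if step < decay_start:
        return max_lr                                # stable phase
    progress = (step - decay_start) / (total_steps - decay_start)
    return max_lr * (1.0 - (1.0 - final_frac) * progress)  # decay phase
```

In PyTorch, a function like this can drive training via torch.optim.lr_scheduler.LambdaLR by dividing out max_lr.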
E.2. Collecting Activations

We truncate each document to 1024 tokens and prepend a <bos> token to the beginning of each document. During training, we exclude the activations corresponding to the <bos>, <eos>, and <pad> tokens. It has been observed that activations from different sequence positions within the same document are often highly correlated and may lack diversity. To mitigate this issue, it is common to introduce randomness into the training data. Our shuffling strategy maintains a buffer that is reshuffled whenever it is refilled.

E.3. Initialization

The decoder columns W_dec[:, i] are initialized uniformly, and their norm is chosen by a grid search that minimizes the initial reconstruction loss. We find that the specific initialization norm has little impact, as long as it lies in a reasonable range; for example, initializing W_dec[:, i] uniformly with a fixed bound, as in Conerly et al. (2025), yields similar results. The encoder weights W_enc are initialized as the transpose of W_dec, while both the encoder bias b_enc and decoder bias b_dec are set to zero.
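A minimal sketch of this tied initialization, storing features as rows of the decoder as in the Appendix K pseudo-code (the grid search over the norm is elided; d_model, d_sae, and init_norm are illustrative values, and the random-direction construction is our assumption):

```python
import torch

d_model, d_sae = 4096, 32768
init_norm = 0.1  # in practice chosen by grid search on the initial reconstruction loss

# Decoder rows: random directions rescaled to a fixed norm
W_dec = torch.randn(d_sae, d_model)
W_dec = init_norm * W_dec / W_dec.norm(dim=1, keepdim=True)

# Tied initialization: encoder is the decoder transpose, biases are zero
W_enc = W_dec.T.clone()        # [d_model, d_sae]
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)
```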
E.4. JumpReLU SAEs

We trained JumpReLU SAEs (Rajamanoharan et al., 2024b) under two distinct hyperparameter configurations: one maintaining a consistently low ℓ0 value throughout training, and another where ℓ0 is gradually decreased from a higher initial value. Unless otherwise specified, all JumpReLU SAEs were trained using the same settings as Conerly et al. (2025), which corresponds to the latter configuration. The key modifications for the former setting are as follows: (1) we initialized the encoder bias to zero instead of applying the heuristic that equalizes feature activation counts at initialization, and (2) we kept the sparsity coefficient fixed rather than employing a global warm-up schedule. As a result, the ℓ0 sparsity level started at a relatively low value early in training. This design is critical to our approach: we observed that if the model remains in a high-ℓ0 regime (e.g., on the order of d_model/2) for an extended period before sparsity increases, the feature directions tend to drift away from the active subspace during this phase, diminishing the effectiveness of our method (Appendix L.1).

F. Additional Analysis on Effective Dimensionality and Dead Features

This section provides additional experimental evidence supporting the claim that low effective dimensionality strongly correlates with a higher proportion of dead features in sparse autoencoders (SAEs). In Figure 5, we report the effective rank of the residual stream, attention output, and MLP output at each layer of Llama-3.1-8B, along with the proportion of dead features in the SAEs trained on these activations. All SAEs were obtained from Llamascope (He et al., 2024), which uses the same dictionary size (32768 features) and sparsity level (L0 = 50).

To provide a systematic analysis, we additionally train SAEs across multiple dictionary sizes (16384, 32768, 65536) and sparsity levels (L0 ∈ {32, 64, 128}) following Appendix E. The SAEs are trained on the activations of Llama-3.1-8B, and the corresponding effective ranks of these activations can be found in Figure 4 of the main paper. Tables 1–3 summarize the proportion of dead features across these configurations. Across all settings, attention outputs consistently show substantially higher dead-feature ratios than residual streams. This trend holds even as the dictionary size varies by a factor of four and the sparsity level varies by a factor of four.

Table 1. Proportion of dead features for L0 = 32 across different dictionary sizes.

| Activation (Effective Rank) | 16384 | 32768 | 65536 |
|---|---|---|---|
| Layer 7 attention (2351) | 84.80% | 90.31% | 94.02% |
| Layer 7 residual (3664) | 1.61% | 6.69% | 17.10% |
| Layer 15 attention (2506) | 68.70% | 79.86% | 87.22% |
| Layer 15 residual (3611) | 27.01% | 45.70% | 61.33% |
| Layer 23 attention (2654) | 58.45% | 70.48% | 78.81% |
| Layer 23 residual (3634) | 0.13% | 0.26% | 1.35% |

Table 2. Proportion of dead features for L0 = 64 across different dictionary sizes.

| Activation (Effective Rank) | 16384 | 32768 | 65536 |
|---|---|---|---|
| Layer 7 attention (2351) | 66.97% | 75.65% | 82.96% |
| Layer 7 residual (3664) | 0.02% | 0.06% | 0.15% |
| Layer 15 attention (2506) | 41.65% | 54.68% | 67.43% |
| Layer 15 residual (3611) | 1.73% | 7.96% | 18.83% |
| Layer 23 attention (2654) | 41.58% | 56.70% | 67.05% |
| Layer 23 residual (3634) | 0.15% | 0.09% | 0.08% |

Table 3. Proportion of dead features for L0 = 128 across different dictionary sizes.

| Activation (Effective Rank) | 16384 | 32768 | 65536 |
|---|---|---|---|
| Layer 7 attention (2351) | 49.85% | 56.96% | 64.83% |
| Layer 7 residual (3664) | 0.00% | 0.00% | 0.02% |
| Layer 15 attention (2506) | 15.09% | 25.61% | 37.11% |
| Layer 15 residual (3611) | 0.07% | 0.12% | 0.44% |
| Layer 23 attention (2654) | 21.90% | 39.02% | 52.68% |
| Layer 23 residual (3634) | 0.14% | 0.09% | 0.07% |

G. Complete Results of SAE Metrics

We use 3 different random seeds for all experiments in Figure 7 and compute the mean and standard deviation of each metric, as shown in Table 4.

Table 4. Comparison of Base, AuxK, and ASI across different L0 settings. Numbers show mean ± std over 3 seeds.

| L0 | Metric | Base | AuxK | ASI |
|---|---|---|---|---|
| 40 | Dead Feature | 20395.00 ± 72.77 | 37.00 ± 5.20 | 10.67 ± 1.15 |
| 40 | Normalized MSE | 0.36323 ± 0.00018 | 0.34328 ± 0.00011 | 0.33724 ± 0.00022 |
| 40 | Delta LM Loss (×10⁻³) | 9.650 ± 0.040 | 8.801 ± 0.053 | 8.697 ± 0.033 |
| 50 | Dead Feature | 16144.33 ± 129.52 | 54.67 ± 5.51 | 4.00 ± 1.73 |
| 50 | Normalized MSE | 0.33367 ± 0.00014 | 0.32241 ± 0.00020 | 0.31680 ± 0.00006 |
| 50 | Delta LM Loss (×10⁻³) | 8.375 ± 0.029 | 7.958 ± 0.032 | 7.847 ± 0.050 |
| 60 | Dead Feature | 12239.33 ± 165.51 | 76.00 ± 9.17 | 2.67 ± 1.53 |
| 60 | Normalized MSE | 0.31106 ± 0.00007 | 0.30555 ± 0.00007 | 0.30000 ± 0.00005 |
| 60 | Delta LM Loss (×10⁻³) | 7.503 ± 0.059 | 7.325 ± 0.017 | 7.214 ± 0.026 |
| 70 | Dead Feature | 8854.67 ± 40.51 | 115.33 ± 15.37 | 1.67 ± 0.58 |
| 70 | Normalized MSE | 0.29295 ± 0.00008 | 0.29064 ± 0.00008 | 0.28575 ± 0.00013 |
| 70 | Delta LM Loss (×10⁻³) | 6.805 ± 0.074 | 6.694 ± 0.021 | 6.664 ± 0.066 |
| 80 | Dead Feature | 6311.33 ± 55.77 | 109.67 ± 11.50 | 1.00 ± 1.73 |
| 80 | Normalized MSE | 0.27787 ± 0.00003 | 0.27715 ± 0.00003 | 0.27341 ± 0.00003 |
| 80 | Delta LM Loss (×10⁻³) | 6.259 ± 0.005 | 6.217 ± 0.028 | 6.168 ± 0.033 |

H. Statistical Significance Test

To assess whether the performance improvements introduced by ASI are statistically significant, we conducted a comprehensive significance analysis across multiple evaluation metrics.

H.1. Experimental Setup

We evaluate the statistical significance of performance differences between ASI and two baseline methods (TopK and AuxK) under the following controlled setting:

• Model / Layer / L0 / Dictionary Size: Llama-3.1-8B, layer 15, L0 = 50, dictionary size = 32768.
• Number of runs: 15 independent trials for each method, each with a different random seed.
• Evaluation metrics: Dead Feature Count, Normalized MSE, and ΔLM Loss. All metrics follow a "lower is better" criterion.
• Comparisons performed: ASI vs. TopK and ASI vs. AuxK for all three metrics.

H.2. Hypothesis Testing Framework

For each metric and each baseline method, we perform Welch's t-test (Welch's unequal-variance t-test), which does not assume equal variances between groups. For ASI and a given baseline method, we test the following hypotheses:

H0: μ_ASI ≥ μ_baseline (ASI is worse than or equal to the baseline)
H1: μ_ASI < μ_baseline (ASI outperforms the baseline)

This is a one-tailed test, as we explicitly test whether ASI achieves significantly lower metric values. We use the following SciPy function for all tests: scipy.stats.ttest_ind(ASI, baseline, equal_var=False, alternative='less').
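For concreteness, a runnable example of this test on synthetic per-seed values (the numbers below are illustrative placeholders, not the paper's measurements):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical Normalized MSE over 15 seeds for each method
asi = rng.normal(loc=0.3168, scale=0.0002, size=15)
topk = rng.normal(loc=0.3337, scale=0.0002, size=15)

# One-tailed Welch's t-test: H1 is that ASI's mean is lower
res = ttest_ind(asi, topk, equal_var=False, alternative="less")
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3e}")
```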
H.3. Results

Table 5 reports the resulting p-values for all comparisons. A smaller p-value indicates stronger evidence that ASI outperforms the baseline.

Table 5. Welch's t-test results for ASI compared with TopK and AuxK across 15 random seeds.

| Comparison | Metric | p-value |
|---|---|---|
| ASI vs. TopK | Dead Feature Count | 3.26 × 10⁻³⁵ |
| ASI vs. TopK | Normalized MSE | 2.99 × 10⁻⁴⁰ |
| ASI vs. TopK | ΔLM Loss | 1.14 × 10⁻²³ |
| ASI vs. AuxK | Dead Feature Count | 4.33 × 10⁻¹⁵ |
| ASI vs. AuxK | Normalized MSE | 6.33 × 10⁻⁴⁰ |
| ASI vs. AuxK | ΔLM Loss | 1.27 × 10⁻⁶ |

Across all evaluation metrics and both baseline methods, the p-values are far below standard significance thresholds (e.g., α = 0.05). We therefore reject the null hypothesis for all comparisons. These results demonstrate that the improvements achieved by ASI are statistically significant and robust across random seeds.

I. Additional Evaluation Across Layers, Models, and Datasets

To assess the robustness and generality of ASI, we extend our experiments beyond the primary configuration used in the main paper (Llama-3.1-8B, layer 15, SlimPajama). In particular, we investigate whether the advantages of ASI over baseline approaches (TopK and AuxK) persist across different layers, models, and datasets. We consider this evaluation essential, as mechanisms in sparse autoencoding can vary substantially with architectural depth, data distribution, and model family.

I.1. Evaluation on Llama-3.1-8B Across Multiple Layers

We first evaluate ASI on two additional layers of Llama-3.1-8B (layers 7 and 23), using the SlimPajama dataset. Activations are taken from the attention output. Results are summarized in Table 6.

Table 6. Performance of ASI and baseline methods on Llama-3.1-8B layers 7 and 23. Lower values indicate better performance.

| Layer | Metric | Base | AuxK | ASI |
|---|---|---|---|---|
| 7 | Dead Feature Count | 25836 | 985 | 308 |
| 7 | Normalized MSE | 0.32870 | 0.29800 | 0.28882 |
| 7 | ΔLM Loss (×10⁻³) | 5.414 | 4.589 | 4.430 |
| 23 | Dead Feature Count | 15542 | 332 | 27 |
| 23 | Normalized MSE | 0.21302 | 0.20235 | 0.20141 |
| 23 | ΔLM Loss (×10⁻³) | 1.942 | 1.865 | 1.845 |

We observe that ASI consistently achieves the lowest reconstruction error (Normalized MSE) and the smallest degradation in language modeling performance (ΔLM Loss). For layer 7, ASI still retains a number of dead features, which may be attributed to its smaller effective rank (2351) compared to layer 23 (2654). Since we use a fixed d_init across layers, this mismatch can leave some dead features. Despite this, ASI still achieves the lowest MSE on both layers.
I.2. Evaluation on Qwen3-8B Across Layers and a New Dataset

To further test cross-model and cross-dataset generality, we conduct experiments on the Qwen3-8B model using the fineweb-edu dataset, again with attention-output activations. Results for layers 8 and 26 are shown in Table 7.

Table 7. Performance comparison on Qwen3-8B layers 8 and 26 using the fineweb-edu dataset. Lower values indicate better performance.

| Layer | Metric | Base | AuxK | ASI |
|---|---|---|---|---|
| 8 | Dead Feature Count | 19048 | 4 | 7 |
| 8 | Normalized MSE | 0.31228 | 0.28849 | 0.28533 |
| 8 | ΔLM Loss (×10⁻³) | 2.9561 | 2.5912 | 2.5667 |
| 26 | Dead Feature Count | 16286 | 66 | 566 |
| 26 | Normalized MSE | 0.30090 | 0.28127 | 0.28087 |
| 26 | ΔLM Loss (×10⁻³) | 1.3406 | 1.2980 | 1.2922 |

The results again confirm that ASI achieves the lowest reconstruction error and the smallest increase in LM loss across both layers. The near-dead-feature-free representation produced by AuxK is also observed, but ASI consistently outperforms AuxK in reconstruction quality and LM preservation.

I.3. Summary

Across all tested configurations, spanning multiple layers, two large language model families, and two datasets, ASI exhibits consistent advantages over baseline methods. These evaluations provide strong empirical evidence that the benefits of ASI are not confined to a specific layer, model, or dataset, but generalize across diverse settings.

J. Ablation Study

J.1. Active Subspace Init vs. Random Subspace Init

We employ random subspace initialization as a baseline and observe that it consistently degrades SAE training across all metrics, as shown in Figure 12 (a sketch of this baseline construction appears at the end of this appendix).

Figure 12. For activations with a full space dimension of 4096: proportion of dead features (left), normalized MSE (middle), and Delta LM loss (right) across different subspace dimensions d_init. Random subspaces are used as the baseline; only initialization with the active subspace yields improvement.

J.2. Applying ASI to Near-Full-Rank Activations

We also apply Active Subspace Initialization (ASI) to near-full-rank activations, such as those in the residual stream, to evaluate its generality. When training an SAE on the post-layer-15 residual stream of Llama-3.1-8B, we find ASI yields minimal gains (Figure 13). This is consistent with our expectation, as these activations inherently exhibit a lower rate of dead features even with standard initialization.

Figure 13. For the post-layer-15 residual stream of Llama-3.1-8B (full space dimension 4096): proportion of dead features (left), normalized MSE (middle), and Delta LM loss (right) across different subspace dimensions d_init; ASI yields minimal gains on these near-full-rank activations.

J.3. Choice of the Initial Subspace Dimension d_init

As shown in Figure 6, d_init is a hyperparameter with a wide range of acceptable values (from 256 to 2048). We hypothesize that the performance degradation when d_init is very low is due to a combination of the dictionary failing to cover the subspace containing the key information needed for effective reconstruction and the features being too crowded.
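As referenced in Appendix J.1, the random-subspace baseline replaces the SVD basis of ASI with a random orthonormal one. A minimal sketch, for contrast with the active-subspace pseudo-code in Appendix K below (the QR-based construction and variable names are our assumption):

```python
import torch

d_model, d_sae, d_init = 4096, 32768, 768

# Random orthonormal basis via QR, instead of the top right-singular vectors
Q, _ = torch.linalg.qr(torch.randn(d_model, d_init))   # Q: [d_model, d_init]

# Fold the random projection into a freshly initialized decoder,
# so every feature direction lies in the random subspace
W_dec = torch.randn(d_sae, d_model)
W_dec = W_dec @ Q @ Q.T
```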
K. Pseudo-code for Implementing Active Subspace Initialization

Below is PyTorch-style pseudo-code for Active Subspace Initialization.

Use on SAE

# X: activation batch [batch_size, d_model]
# W_E: encoder weight [d_model, d_sae]
# W_D: decoder weight [d_sae, d_model], initialized uniformly
# d_init: target subspace dimension

# 1. Demean the activations
demeaned_X = X - X.mean(dim=0)               # [batch_size, d_model]

# 2. Compute the SVD of the demeaned activations
U, S, V = torch.svd(demeaned_X)              # V: [d_model, d_model]

# 3. Take the top-d_init right-singular vectors
proj_weight = V[:, :d_init]                  # [d_model, d_init]

# 4. Fold the projection into the decoder weights
W_D.copy_(W_D @ proj_weight @ proj_weight.T)

# 5. Initialize W_E as W_D.T (tied initialization)
W_E.copy_(W_D.T)

Use on Lorsa

# Input:
#   X: input activations [b, s, d] (b=batch_size, s=seq_len, d=d_model)
#   mhsa: pretrained MHSA module
#   mhsa.W_V: [n, d, h] (n=n_heads, h=d_head; note n * h = d)
#   mhsa.W_O: [n, h, d]
# Lorsa parameters to initialize:
#   W_V: [n_lorsa, d] (n_lorsa = number of Lorsa heads)
#   W_O: [n_lorsa, d]
#   d_qk: Lorsa head dimension for initialization

# 1. Compute per-head V projections
X_flat = X.reshape(b * s, d)                             # [b*s, d]
W_V_cat = mhsa.W_V.permute(1, 0, 2).reshape(d, d)        # [d, d]
V_per_head = (X_flat @ W_V_cat).reshape(b * s, n, h)     # [b*s, n, h]

# 2. Project V back to d_model space for each head:
#    captured_v[:, i, :] = V_per_head[:, i, :] @ mhsa.W_V[i].T
captured_v = torch.einsum('bnh,nhd->bnd', V_per_head, mhsa.W_V.permute(0, 2, 1))
# captured_v: [b*s, n, d]

# 3. Initialize Lorsa heads from each original head's active subspace
rate = n_lorsa // n
for i in range(n):
    slice_i = slice(rate * i, rate * (i + 1))

    # 3.1 Extract this head's captured V
    v = captured_v[:, i, :]                              # [b*s, d]

    # 3.2 Demean
    demeaned_v = v - v.mean(dim=0)                       # [b*s, d]

    # 3.3 SVD on the transposed data to get principal directions
    U, S, _ = torch.svd(demeaned_v.T)                    # U: [d, d]

    # 3.4 Take the top-d_qk principal directions as the projection
    proj = U[:, :d_qk]                                   # [d, d_qk]

    # 3.5 Update W_V: project from the initial d_qk space to the principal subspace
    W_V[slice_i] = W_V[slice_i, :d_qk] @ proj.T          # [rate, d]

    # 3.6 Update W_O: chain the updated W_V through the original head's OV circuit
    #     OV_i = mhsa.W_V[i] @ mhsa.W_O[i]: [d, h] @ [h, d] = [d, d]
    W_O[slice_i] = W_V[slice_i] @ mhsa.W_V[i] @ mhsa.W_O[i]   # [rate, d]

# 4. Normalize all Lorsa weights (row-wise)
W_V = W_V / W_V.norm(dim=1, keepdim=True)                # [n_lorsa, d]
W_O = W_O / W_O.norm(dim=1, keepdim=True)                # [n_lorsa, d]

This strategy for initializing W_O in Lorsa is analogous to the tied initialization used in SAEs, which ensures alignment between feature encoding and decoding; that is, the encoder is initialized to predict reasonably accurate feature activation values for the decoder. Such alignment has been shown to be crucial for reducing dead features in SAEs (Gao et al., 2024). We believe the same idea could also improve replacement models for the MLP (transcoders and cross-layer transcoders), which we leave to future work.
L. Using ASI with Other Activation Functions

L.1. JumpReLU

Another widely used activation function is JumpReLU (Rajamanoharan et al., 2024b). We trained JumpReLU SAEs under two different hyperparameter settings: one with a consistently low ℓ0 value and another where ℓ0 gradually decreases from a higher initial value, as described in Appendix E.4. We observe that our method is effective in the former case (Figure 14) but shows little improvement in the latter (Figure 15). We train these SAEs following Appendix E.

For schedules that gradually reduce ℓ0 from a high initial value, we recommend first applying PCA to reduce the dimensionality of the data. The SAE can then be trained on the reduced representation until the ℓ0 level reaches the target range. Afterwards, the PCA projection matrix can be folded into the model parameters, and training can continue in the original space. This achieves a similar effect without the drawbacks of prolonged training in the high-ℓ0 regime.

Figure 14. For the attention output of the middle layer of Llama-3.1-8B, using ASI on JumpReLU SAEs with a low initial ℓ0 is effective. Details in Appendix L.1.

Figure 15. For the attention output of the middle layer of Llama-3.1-8B, using ASI on JumpReLU SAEs with a high initial ℓ0 that gradually decreases shows little improvement. Details in Appendix L.1.

L.2. TopK with K Annealing

To reinforce the finding in Appendix L.1, we conduct experiments on a variant of TopK that sets K to a high value and then lets it decrease during training (He et al., 2024). We find that ASI also fails in this case (Figure 16).

Figure 16. For the attention output of the middle layer of Llama-3.1-8B, using ASI on TopK SAEs that set K to a high value and then let it decrease during training fails. Details in Appendix L.2.

M. Stale Momentum as Another Root Cause of Dead Features

Recent work by Bricken et al. (2023a) identifies stale momentum as a key cause of dead-feature formation. Specifically, when a feature remains inactive over training steps, its associated optimizer momentum continues to accumulate. If the feature later activates, the stale momentum results in disproportionately large updates, destabilizing training and potentially suppressing that feature permanently.

To directly address this, we adopt SparseAdam, an optimizer tailored to sparse activation settings and designed for more efficient use of compute and memory. SparseAdam updates both parameters and moments only when the corresponding feature is active, which effectively prevents the harmful accumulation of stale momentum. Empirically, we observe that this change substantially reduces the rate of dead-feature formation in large-scale SAE training. We believe this is a core technique for scaling sparse dictionary methods, as stale momentum is a common problem for them.
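A minimal sketch of the update semantics described above; this illustrates the masking behavior rather than reproducing torch.optim.SparseAdam, and it omits Adam's bias correction (function and argument names are ours):

```python
import torch

def masked_adam_step(param, grad, m, v, active_rows, lr,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update applied only to rows of `param` whose features
    were active this step (`active_rows` is a LongTensor of indices).
    Moments of inactive features stay frozen, so no stale momentum
    accumulates between activations."""
    g = grad[active_rows]
    m[active_rows] = beta1 * m[active_rows] + (1 - beta1) * g
    v[active_rows] = beta2 * v[active_rows] + (1 - beta2) * g * g
    param.data[active_rows] -= lr * m[active_rows] / (v[active_rows].sqrt() + eps)
```

The key design choice is that inactive rows receive neither a parameter update nor a moment update, unlike standard Adam, which keeps moving every parameter as long as its momentum is nonzero.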
N. Comparing Features of SAEs With and Without ASI

N.1. Monosemanticity

We conducted an additional analysis to assess the degree of monosemanticity exhibited by features learned by the base TopK SAE and the ASI-enhanced SAE. Following the rubric of Cunningham & Conerly (2024), we performed a blinded evaluation of 100 randomly sampled features and recorded the semantic consistency score assigned to each feature. For clarity, we reproduce below the scoring rubric used for evaluating activation consistency:

• 5: Clear pattern with no deviating examples
• 4: Clear pattern with one or two deviating examples
• 3: Clear overall pattern but quite a few examples not fitting that pattern
• 2: Broad consistent theme but lacking structure
• 1: No discernible pattern

To provide transparency, Table 8 summarizes the distribution of scores for TopK and ASI. Across both SAE variants, we found no statistically significant differences in score distributions at this scale of analysis. Variations between the two variants were marginal and did not indicate systematic differences in feature quality. This result is consistent with our expectations: our method does not modify the SAE architecture and is not designed to intervene in how features are formed, so the potential risk of degrading feature quality remains low.

Table 8. Monosemanticity score distributions for the ASI-enhanced SAE and the base TopK SAE.

(a) ASI-enhanced SAE

| Score | Count |
|---|---|
| 5 | 17 |
| 4 | 14 |
| 3 | 6 |
| 2 | 5 |
| 1 | 8 |

(b) Base TopK SAE

| Score | Count |
|---|---|
| 5 | 18 |
| 4 | 12 |
| 3 | 6 |
| 2 | 6 |
| 1 | 8 |

N.2. Analysis of SAE Features in the Dead Subspace

To assess whether ASI alters the SAE's behavior in directions corresponding to the dead subspace, we perform a comparative analysis between Attention Output SAEs trained with and without ASI under identical configurations (same number of features, same K, and all other hyperparameters).

Feature alignment with the dead subspace. We compute the cosine similarity between each SAE feature and the dead subspace, restricting the analysis to alive features (which accounts for the difference in total counts). The distribution of cosine values is summarized in Table 9. Across all intervals, the ASI-initialized SAE exhibits a larger number of alive features, while both methods exhibit very few features that align closely with the dead subspace. This suggests two possible explanations: (i) features in the dead subspace have extremely small magnitude and provide insufficient signal for the SAE to learn, or (ii) the dead subspace does not contain meaningful standalone features, and only small components of features reside in this region.

Table 9. Distribution of cosine similarity between SAE features and the dead subspace (alive features only).

| Method | [0.0, 0.05) | [0.05, 0.1) | [0.1, 0.15) | [0.15, 1] |
|---|---|---|---|---|
| TopK | 10176 | 189 | 6 | 0 |
| ASI | 24576 | 936 | 8 | 0 |

Reconstruction error in the dead subspace. We further project the reconstruction error onto the dead subspace to quantify the SAE's reconstruction performance on components lying in this region. As shown in Table 10, the reconstruction errors are nearly identical between the two methods, with ASI showing only a marginal improvement.

Table 10. Reconstruction error projected onto the dead subspace.

| Method | MSE in dead subspace |
|---|---|
| TopK | 0.00350 |
| ASI | 0.00334 |

Overall, these analyses indicate that ASI has minimal impact on the SAE's behavior within the dead subspace, while substantially reducing the number of dead features.
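The feature-to-subspace cosine similarity used for Table 9 can be sketched as follows, assuming the dead subspace is given as an orthonormal basis (e.g., trailing right-singular vectors of the activations); the function name is ours:

```python
import torch

def cosine_to_subspace(features: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each feature direction and the subspace
    spanned by the orthonormal columns of `basis`.
    features: [n_features, d_model]; basis: [d_model, k].
    For the orthogonal projector P = B B^T, cos(f, subspace) = ||P f|| / ||f||,
    and ||P f|| = ||B^T f|| when B has orthonormal columns."""
    coords = features @ basis                    # [n_features, k]
    return coords.norm(dim=1) / features.norm(dim=1)
```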
O.1. Hyperparameters

Model, Dataset, Layer: Llama-3.1-8B, SlimPajama, layer 15 (zero-indexed).

Dictionary Size: We set the number of Lorsa heads n_lorsa = 32768, which is 8 × d_model. We set the number of Lorsa QK groups n_qk = 256, which is 8 × n_MHSA_heads, and the Lorsa QK dimension d_qk = 128, which equals the MHSA QK dimension.

Batch Size: We set the batch size to 32768.

Optimizer: We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸.

Learning Rate: The learning rate is swept over [1e-5, 2e-5, 4e-5, 6e-5, 8e-5, 1e-4, 2e-4, 4e-4]; we ultimately use 2e-4. We employ a three-phase learning-rate schedule consisting of a linear warm-up, a stable phase, and a linear decay. The learning rate increases linearly from zero to its maximum value over the first 500 steps, remains constant during the intermediate phase, and then decays linearly to 1% of the maximum value over the final 20% of the total training steps.

AuxK: We follow Gao et al. (2024) and set the auxiliary loss coefficient α to 1/32. We sweep k_aux over [256, 512, 1024, 2048] and choose 512.

Dimension of Subspace for Initialization (d_init): Because the active subspace of the input and output of each MHSA head is very close to the MHSA head dimension (d_head), we set d_init directly to d_head. We found that increasing or decreasing this value did not improve performance.

Total Tokens: We use 800M tokens for each Lorsa training run.

Sequence Length: We truncate each document to 2048 tokens. During training, we exclude the activations corresponding to the <bos>, <eos>, and <pad> tokens.

O.2. Initialization

We initialize the query and key matrices W_Q and W_K using Xavier uniform initialization (Glorot & Bengio, 2010). The value matrix W_V is initialized from a normal distribution N(0, 1/√d_sae), while the output matrix W_O is initialized from N(0, 1/√d_model). All bias terms b_Q, b_K, b_V, and b_D (if used) are initialized to zero.
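A minimal sketch of this initialization, with tensor shapes following the Lorsa pseudo-code in Appendix K; we read N(0, 1/√d) as a standard deviation of 1/√d and take d_sae = n_lorsa, both of which are our assumptions:

```python
import math
import torch

d_model, n_lorsa = 4096, 32768   # n_lorsa = 8 * d_model
n_qk, d_qk = 256, 128            # QK groups and QK dimension

# Query/key matrices: Xavier uniform (Glorot & Bengio, 2010)
W_Q = torch.empty(n_qk, d_model, d_qk)
W_K = torch.empty(n_qk, d_model, d_qk)
torch.nn.init.xavier_uniform_(W_Q)
torch.nn.init.xavier_uniform_(W_K)

# Value/output matrices: Gaussian with std 1/sqrt(d_sae) and 1/sqrt(d_model)
W_V = torch.randn(n_lorsa, d_model) / math.sqrt(n_lorsa)
W_O = torch.randn(n_lorsa, d_model) / math.sqrt(d_model)

# All bias terms start at zero
b_Q = torch.zeros(n_qk, d_qk)
b_K = torch.zeros(n_qk, d_qk)
b_V = torch.zeros(n_lorsa)
b_D = torch.zeros(d_model)
```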