Paper deep dive
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu
Models: Llama-3.1-8B, Pythia-160M
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/12/2026, 6:00:20 PM
Summary
The paper introduces Low-Rank Sparse Attention (Lorsa), a sparse replacement model for Transformer attention layers designed to disentangle Multi-Head Self-Attention (MHSA) into interpretable, atomic components. By employing an overcomplete set of 1D attention heads with sparsity constraints, Lorsa addresses attention superposition, allowing for the discovery of fine-grained behaviors such as induction heads, successor heads, and arithmetic-specific circuits in models like Llama-3.1-8B.
Entities (5)
Relation Signals (3)
Lorsa → implemented in → Llama-3.1-8B
confidence 95% · We identify a group of arithmetic-specific Lorsa heads in Llama-3.1-8B
Lorsa → replaces → MHSA
confidence 95% · Lorsa serves as a replacement model for Transformer attention, substituting sparse interpretable components for attention modules.
Lorsa → disentangles → Attention Superposition
confidence 90% · Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction
Cypher Suggestions (2)
Find all models that utilize the Lorsa architecture · confidence 90% · unvalidated
MATCH (m:Model)-[:USES_ARCHITECTURE]->(a:Architecture {name: 'Lorsa'}) RETURN m.name
Identify relationships between interpretability methods and models · confidence 85% · unvalidated
MATCH (m:Model)-[r:APPLIES_INTERPRETABILITY]->(i:Method) RETURN m.name, i.name, type(r)
Abstract
We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle the original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoders (SAEs) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B. Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.
Tags
Links
Full Text
72,690 characters extracted from source content.
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

Zhengfu He 1,2∗ Junxuan Wang 1,2∗ Rui Lin 1,2 Xuyang Ge 1,2 Wentao Shu 1,2 Qiong Tang 2 Junping Zhang 2 Xipeng Qiu 1,2†
1 Shanghai Innovation Institute  2 OpenMOSS Team, School of Computer Science, Fudan University
zfhe19@fudan.edu.cn
∗ Equal Contribution. † Corresponding Author.
arXiv:2504.20938v1 [cs.LG] 29 Apr 2025

Figure 1: (A) Low-Rank Sparse Attention (Lorsa) comprises thousands of sparsely activated attention heads with 1D outputs, designed to extract interpretable attention units from the original Multi Head Self Attention (MHSA). (B) Lorsa serves as a replacement model for Transformer attention, substituting sparse interpretable components for attention modules. (C) Each Lorsa head explains an atomic feature-feature interaction across token positions, which was originally part of an MHSA head or spread across multiple heads, i.e. put in attention superposition.

Abstract

We propose Low-Rank Sparse Attention (Lorsa), a sparse replacement model of Transformer attention layers to disentangle the original Multi Head Self Attention (MHSA) into individually comprehensible components. Lorsa is designed to address the challenge of attention superposition to understand attention-mediated interaction between features in different token positions. We show that Lorsa heads find cleaner and finer-grained versions of previously discovered MHSA behaviors like induction heads, successor heads and attention sink behavior (i.e., heavily attending to the first token). Lorsa and Sparse Autoencoders (SAEs) are both sparse dictionary learning methods applied to different Transformer components, and lead to consistent findings in many ways. For instance, we discover a comprehensive family of arithmetic-specific Lorsa heads, each corresponding to an atomic operation in Llama-3.1-8B.
Automated interpretability analysis indicates that Lorsa achieves parity with SAE in interpretability while Lorsa exhibits superior circuit discovery properties, especially for features computed collectively by multiple MHSA heads. We also conduct extensive experiments on architectural design ablation, Lorsa scaling law and error analysis.

Code: https://github.com/OpenMOSS/Lorsa
Lorsa Weights: https://huggingface.co/fnlp/Lorsa

1 Introduction

When examining the function of individual attention heads in a Transformer model, one might identify some of these heads implementing a specific behavior. A canonical example is induction heads, which predict 'Potter' following the token 'Harry' when 'Harry Potter' is present in the context [Olsson et al., 2022]. Ablating these heads substantially prevents the model from correctly performing the corresponding tasks, which indicates a causal relation between these heads and the model's macroscopic behaviors. These interpretable attention units constitute the basic building blocks of the model's inter-token information mixing algorithm. Not all attention heads, however, exhibit clear functionality. Most heads distribute attention across diverse contexts. Although some heads exhibit identifiable patterns, there might be inter-head collaboration that explains the whole story. These challenges in attention head interpretation are analogous to feature superposition in understanding individual neurons, which suggests the existence of attention superposition [Jermyn et al., 2024] in Multi Head Self Attention (MHSA), which we will further discuss in Section 2.
Inspired by the recent success of Sparse Autoencoders (SAEs) in extracting monosemantic features from Transformers' hidden space [Templeton et al., 2024b] or approximating part of the network's computation as a sparse computation [Templeton et al., 2024a, Ge et al., 2024, Dunefsky et al., 2024], we propose Low-Rank Sparse Attention (Lorsa) to disentangle atomic attention units from attention superposition (Section 3). Lorsa serves as a replacement module for the original MHSA with an overcomplete set of attention heads featuring a single-dimensional OV circuit [Elhage et al., 2021] and sparsity constraints. In Section 4, we introduce our exploration interface following Bricken et al. [2023], providing multifaceted information on each Lorsa head. We also quantitatively assess Lorsa head interpretability using top activations and their attribution patterns (z patterns) with automated interpretability [Bills et al., 2023]. The results indicate that Lorsa's monosemanticity is comparable to SAE features. Section 5 presents findings with Lorsa on Pythia-160M [Biderman et al., 2023] and Llama-3.1-8B [Dubey et al., 2024]. For validation, we first identify the Lorsa instantiations of known attention mechanisms: induction heads, name mover heads [Wang et al., 2023], successor heads [Gould et al., 2024], and attention sinks [Xiao et al., 2024]. Furthermore, we characterize a family of arithmetic-specific Lorsa heads in Llama-3.1-8B. We also identify a subset of Lorsa heads in Llama-3.1-8B that function as thematic anchors by exhibiting long-range, topic-specific attention patterns. To the best of our knowledge, Lorsa is the first attempt to extract sparse and interpretable attentional computation, yet it still has significant room for improvement in the aspects discussed in Section 6. We hope these discussions and findings will facilitate future research along this direction.
Note on Terminology: While prior work refers to the atomic computational units we aim to independently understand as attentional features [Jermyn et al., 2024, Ameisen et al., 2025], we adopt attention units to avoid conflating them with activation-space features (which denote 1D linear features in representation spaces [Elhage et al., 2022]). The term head flexibly denotes either MHSA heads or Lorsa heads as context dictates.

2 Attention Superposition

Analogous to how post-ReLU neurons in Transformer MLPs learn to represent more features than they have dimensions [Elhage et al., 2022], a similar phenomenon may occur in Multi-Head Self Attention (MHSA). We hypothesize MHSA may comprise multiple attention units in attention superposition, each attending between certain token pairs with interpretable read/write operations on the residual stream. Under this hypothesis, we would expect that (1) an atomic attention unit is spread across multiple MHSA heads, and (2) one MHSA head includes multiple units. We list three points of evidence of attention superposition in Transformer language models.

1. A Few Neurons (Heads) Are Polysemantic. Gurnee et al. [2023] discovered compound word neurons activating across diverse unrelated n-grams, while Bricken et al. [2023] reported neurons responding to mixed stimuli including academic citations and Korean text. Similarly, successor heads [Gould et al., 2024], which increment 'Monday' into 'Tuesday' and '1' into '2', simultaneously exhibit Acronym behavior, Copying behavior and Greater-than behavior.

2. Most Neurons (Heads) Exhibit Uninterpretable Activating (Attention) Patterns. Multiple studies report the predominance of MLP neurons lacking clear activation patterns [Arora et al., 2018, Bricken et al., 2023]. Likewise, Krzyzanowski et al. [2024] report failed interpretation attempts for more than 90% of heads in GPT-2.

3. Attention Superposition in the Wild. He et al. [2024a] and Kissane et al.
[2024] both found attention output SAE features collectively contributed by multiple attention heads. If we consider SAE features to represent monosemantic directions, such distribution provides evidence for attention superposition. Furthermore, Jermyn et al. [2024] directly demonstrate this phenomenon through a constructed case where 5 ground-truth attention units are put in superposition over 2 attention heads. We also show that about 25% of our learned attention units are spread across multiple MHSA heads (Appendix F.2).

Why Does Attention Superposition Matter? Practically, attribution-based circuit tracing [Ge et al., 2024, Ameisen et al., 2025] becomes challenging when features are computed collectively: individual QK patterns do not explain the full mechanism and may be misleading due to interference from other features' computations within the same heads. The structure of attention superposition may reflect intriguing motifs of model biology. For example, what makes some privileged attention units like induction heads mostly implemented by a single MHSA head [Olsson et al., 2022] while others are put in superposition? This parallels privileged bases in MLP neurons [Elhage et al., 2023].
3 Low-Rank Sparse Attention

3.1 Lorsa Architecture

Algorithm 1: Low-Rank Sparse Attention (Lorsa)
Input:
  X ∈ R^(n×d): input sequence (n tokens, d dimensions)
  W_q^h, W_k^h ∈ R^(d×d_h): query/key weights for head h
  w_v^h ∈ R^(d×1): 1-dim value weights
  w_o^h ∈ R^(1×d): 1-dim output weights
  H ∈ Z+: number of Lorsa heads
  K ∈ Z+: max number of activated Lorsa heads
Output: Ŷ ∈ R^(n×d): output sequence
1  for h ← 1 to H do
2    Q^h = X W_q^h ∈ R^(n×d_h)                    // query projection for head h
3    K^h = X W_k^h ∈ R^(n×d_h)                    // key projection
4    v^h = X w_v^h ∈ R^(n×1)                      // 1-dim value projection
5    A^h = softmax(Q^h (K^h)^T / √d_h) ∈ R^(n×n)  // attention pattern
6    z^h = A^h v^h ∈ R^(n×1)                      // 1-dim weighted sum of values
7    Ŷ^h = z^h w_o^h ∈ R^(n×d)                    // output of a single Lorsa head
8  S ← TopKIndices({z^h | h = 1, …, H}, K)        // select top-K heads by z
9  Ŷ = Σ_{h∈S} Ŷ^h                                // combine selected heads
10 return Ŷ

We detail Lorsa's architectural designs in this section, with Algorithm 1 highlighting how the Lorsa architecture differs from a standard MHSA. Lorsa takes in the same inputs as MHSA and is trained to predict MHSA outputs. The training objective is simply minimizing the mean squared error (MSE): L = E_{x∈D} ||Lorsa(x) − MHSA(x)||².

One-Dimensional Output and Values. Each MHSA head reads from and writes to a residual stream subspace via its OV circuit [Elhage et al., 2021], whose dimension is decided by its head dimension d_h. Under the linear representation hypothesis that unidimensional features are encoded in the residual stream, we design Lorsa heads with 1D OV circuits. This offers the advantage of restricting read/write operations to one or a few residual stream features (directions). Although ideal implementations would use 1D QK and OV circuits, we restrict dimensionality reduction to OV circuits for practical reasons.

Query and Key Weights with Parameter Sharing. We observe a significant performance drop as D_QK^Lorsa decreases, which is more severe when D_QK^Lorsa < D_QK^MHSA.
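The forward pass in Algorithm 1 can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: it uses per-head QK weights (the paper's QK parameter sharing is omitted for clarity), assumes a causal attention mask, and all shapes and names are illustrative.

```python
import numpy as np

def lorsa_forward(X, Wq, Wk, wv, wo, K):
    """Sketch of Algorithm 1: H heads with 1-D OV circuits and top-K selection.

    X:  (n, d)        input sequence
    Wq: (H, d, dqk)   per-head query weights (QK sharing omitted for clarity)
    Wk: (H, d, dqk)   per-head key weights
    wv: (H, d)        1-D value weights w_v^h
    wo: (H, d)        1-D output directions w_o^h
    K:  number of heads kept active per token
    """
    n, d = X.shape
    H, _, dqk = Wq.shape
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal attention mask
    Z = np.empty((n, H))
    for h in range(H):
        scores = (X @ Wq[h]) @ (X @ Wk[h]).T / np.sqrt(dqk)
        scores = np.where(mask, scores, -np.inf)
        scores -= scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)        # attention pattern A^h
        Z[:, h] = A @ (X @ wv[h])                 # scalar activation z^h per token
    # per token, keep only the K largest activations z and zero out the rest
    top = np.argsort(Z, axis=-1)[:, -K:]
    Zs = np.zeros_like(Z)
    np.put_along_axis(Zs, top, np.take_along_axis(Z, top, axis=-1), axis=-1)
    return Zs @ wo                                # (n, d): sum of z^h * w_o^h
```

The loop over heads is written for readability; a practical implementation would batch the head dimension.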
This may suggest that QK circuits for attention units are multidimensional. As a result, we choose D_QK^Lorsa = D_QK^MHSA and implement parameter sharing for QK weights across every D_QK^Lorsa heads as the default setting. This strategy maintains a parameter count of 4 D_model per head, equivalent to setting D_QK^Lorsa to 1 without parameter sharing, which is crucial for Lorsa scalability. Our parameter binding strategy renders the Lorsa QK circuit strikingly similar to MHSA: a QK-sharing group of Lorsa heads is almost identical to an original MHSA head except for the sparsity constraints applied to each OV dimension. We describe Lorsa heads as individual heads with shared QK circuits rather than a sparse dimension in the MHSA architecture because they often exhibit correlated yet distinct interpretable functionalities, as we will show in Section 5. And there are cases where a QK-sharing group of Lorsa heads shows no clear semantic correlation. We also show that Lorsa QK circuits are not solely learning to copy the original QK circuits, as shown in Appendix B.3. This distinguishes Lorsa from only applying sparse dictionary learning or Independent Component Analysis on OV circuits [Ameisen et al., 2024].

Orders of Magnitude More Heads and Sparsity. To capture numerous underlying attention units, Lorsa employs an overcomplete architecture with N_Lorsa ≫ N_MHSA heads per layer, activating only K ≪ N_Lorsa heads per token. This parallels Sparse Autoencoders' approach of learning more features than the input dimension while enforcing sparsity. For a given token position, Lorsa's output aggregates the Top-K heads with the largest z's, where z is the scalar activation value of a Lorsa head.³ The active head subset dynamically varies across token positions. This sparsity mechanism resembles TopK-SAEs [Gao et al., 2024], as both select the K most salient linear components.

Connection to Sparse Autoencoders. Lorsa shows notable resemblance to attention SAEs [Kissane et al., 2024] for its 1D OV circuits.
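The 4 D_model per-head budget can be checked against Table 1's per-layer counts. This is a back-of-the-envelope sketch; it assumes d_model = 768 for Pythia-160M and 4096 for Llama-3.1-8B, and reads the table's "6K"/"32K" head counts as 6144/32768.

```python
# Amortized per-head cost under QK sharing: the 2 * d_model * d_qk QK weights
# are shared by every d_qk heads (so 2 * d_model per head), and w_v plus w_o
# add d_model each -> 4 * d_model parameters per head.
def lorsa_params_per_layer(d_model, n_heads):
    return 4 * d_model * n_heads

assert lorsa_params_per_layer(768, 6144) == 18_874_368     # ~18M (Pythia-160M)
assert lorsa_params_per_layer(4096, 32768) == 536_870_912  # 512M (Llama-3.1-8B)
```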
Lorsa learns an overcomplete linear basis of the attention output space {w_o^h | h = 1, …, H} with sparsely activated scalar components {z_i^h | h = 1, …, H} at the i-th position, which is analogous to the SAE decoder and sparse feature activations. However, whereas SAE features are computed via single linear encoders with ReLU, a Lorsa head's activation at a given position z_i^h derives from the attention pattern A_i^h and v^h of previous tokens. Moreover, SAEs take in and predict the same activations while Lorsa, like Transcoders [Ge et al., 2024, Dunefsky et al., 2024], learns to predict downstream activations. Lorsa is similar to a Gated [Rajamanoharan et al., 2024] Transcoder taking in activations from multiple positions, where the QK circuit resembles the gate for its non-linearity and w_v is simply a linear encoder.

3.2 Lorsa Training

The Low-Rank Sparse Attention modules we study throughout this work are trained on all layers of Pythia-160M and Llama-3.1-8B. The training set is sampled from 800 million tokens for each model, which is adequate to train Lorsa models to convergence. The prompts are collected from SlimPajama [Soboleva et al., 2023], truncated to a context size of 256 for Pythia and 1024 for Llama. We report our experimental settings in Table 1. Appendix C details Lorsa training settings along with a Lorsa L(N, K) scaling law compared against TopK SAEs.

³ Conceptually, a Lorsa head's activation on a sequence should be z^h ||w_o^h||₂ rather than z^h. For analytic simplicity and clarity, we construct a model with identical predictions but set w_v^h ← w_v^h ||w_o^h||₂, b_v^h ← b_v^h ||w_o^h||₂ and w_o^h ← w_o^h / ||w_o^h||₂. This operation isolates the activation z^h from the output direction w_o^h.
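The rescaling in the footnote, which folds the output-direction norm into the value side, can be illustrated numerically. A minimal sketch with random stand-in weights; the attention pattern is collapsed to a single position for brevity, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)
w_v, w_o, b_v = rng.normal(size=d), rng.normal(size=d), rng.normal()

# original parameterization: activation z, output z * w_o
z = x @ w_v + b_v
out = z * w_o

# fold ||w_o||_2 into the value side so the output direction is unit-norm
s = np.linalg.norm(w_o)
z2 = x @ (w_v * s) + b_v * s
out2 = z2 * (w_o / s)

assert np.allclose(out, out2)                    # identical predictions
assert np.isclose(np.linalg.norm(w_o / s), 1.0)  # z2 now carries the scale
```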
Target Model  | # Heads: MHSA / Indep. Lorsa QK / Lorsa QK / Lorsa OV | Head Dim: MHSA / Lorsa QK / Lorsa OV | Active Heads per Token: MHSA / Lorsa | Params per Layer: MHSA / Lorsa
Pythia-160M   | 12 / 96 / 6K / 6K    | 64 / 64 / 1   | 12 / 64  | 2.25M / 18M
Llama-3.1-8B  | 32 / 256 / 32K / 32K | 128 / 128 / 1 | 32 / 128 | 64M / 512M

Table 1: Experimental setups for both target models.

We primarily focus on Lorsa modules with 500-1,000 times more heads than the original MHSA. For instance, we have 6K Lorsa heads for an MHSA layer in Pythia-160M, with every D_QK^Lorsa = D_QK^MHSA = 64 heads sharing QK weights. This gives us 96 independent QK weights.

4 Assessing Lorsa Interpretability

4.1 Interpreting Individual Lorsa Heads

Top Activations. With Lorsa heads' output restricted to a single direction, their activation strength at a given position i can be described with a scalar z_i^h (Section 3.1). Similar to SAE interpretation methods [Bricken et al., 2023, Templeton et al., 2024b], we iterate over 100M activations from a held-out dataset to identify the 16 highest-activating tokens for each Lorsa head.

z Pattern. According to Algorithm 1, the top activations z_i^h decompose linearly into token-wise contributions from preceding positions: z_i^h = A_i^h v^h = Σ_{j=1}^{i} A_{i,j}^h v_j^h, where A_{i,j}^h denotes the attention weight from token i to token j and v_j^h = w_v^h x_j. Conceptually, this tells from which previous tokens the activation z_i^h is computed. Thus we call it the z pattern. This is analogous to direct feature attribution (DFA) analysis for attention SAEs [Kissane et al., 2024, He et al., 2024a]. An SAE feature's activation at the i-th token f_i can be decomposed along heads and sequence positions, i.e., f_i = Σ_{j≤i} Σ_{h∈H} W_enc^f o_j^h, where o_j^h is a linear component of the MHSA output at token j from head h. The DFA from token j is then defined as Σ_{h∈H} W_enc^f o_j^h. In comparison, Lorsa's attribution includes only one 1D OV circuit and a single, though shared, QK circuit without multi-head aggregation.
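The linearity of the z pattern decomposition can be verified on a toy example. A sketch only: the causal attention pattern and 1-D values are random stand-ins, not taken from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.random((n, n)) * np.tri(n)      # causal attention pattern A^h
A /= A.sum(axis=-1, keepdims=True)      # rows sum to 1, as after softmax
v = rng.normal(size=n)                  # 1-D values v_j^h = w_v^h . x_j

i = n - 1
z_i = A[i] @ v                          # head activation z_i^h at position i
z_pattern = A[i, : i + 1] * v[: i + 1]  # contribution of each source token j
assert np.isclose(z_i, z_pattern.sum()) # z_i^h = sum_j A_{i,j}^h v_j^h
```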
This enables QK circuit attribution for attention units distributed across multiple MHSA heads.

4.2 Visualization Interface

Figure 2: Visualization dashboard for a "you"-specific induction Lorsa head. We provide an example interpretation of each item below.

Our visualization interface provides multifaceted information on Lorsa head interpretation. We illustrate our dashboards with the example in Figure 2, which visualizes an induction Lorsa head specifically firing for the token "you". The methods used to identify correlated MHSA heads and SAE features are described in Appendix F and G.

• Correlation to SAE features / Logits via OV: It mainly reads from current token is "you"/"your" features via its w_v^h; it strongly activates a say "you" feature (i.e., a feature amplifying the logit of "you" via the logit lens [nostalgebraist, 2020]); it amplifies the logits of a variety of "you" tokens.

• Correlation to SAE features via QK: Its QK attention pattern is mainly computed by current token is "X" features on the query position and previous token is "X" & current token is "you" features, where "X" includes a number of tokens that often precede "you", such as "with", "thank" or "do".

• Correlation to MHSA heads: This Lorsa head is almost equally distributed across MHSA.5.0 and MHSA.5.7. Both MHSA heads exhibit induction functionality as shown in Appendix F.

4.3 Quantitative Evaluation with Automated Interpretability

Figure 3: Automated interpretability scores of Lorsa heads and SAE features. Each distribution is estimated with 100 heads / features. The average score of each group is represented by a horizontal dashed line. We highlight distributions with larger mean values suggested by t-tests with α = 0.05.

To quantify the interpretability of Lorsa heads in terms of their top activations and z patterns, we perform automated interpretability (autointerp) [Bills et al., 2023] to estimate how comprehensible each Lorsa head is.
We apply standard autointerp on max activating samples, Lorsa z patterns and direct feature attribution of attention output SAEs. Prompt design and choice of few-shot examples are detailed in Appendix D. All results are obtained with Pythia-160M Lorsa and SAEs of the same size. As shown in Figure 3, Lorsa achieves a higher score in 6 cases, with 3 losses and 15 ties at α = 0.05 significance across 24 layer-wise comparisons, suggesting comparable interpretability to SAE features. Both methods exhibit descending scores in deeper layers. Potential explanations include: (1) increased polysemanticity in later layers, or (2) autointerp's limited capacity for capturing long-range dependencies.

5 Searching for Specific Lorsa Heads

We use path patching [Wang et al., 2023, Conmy et al., 2023] to find the Lorsa heads involved in specialized tasks. For a given Lorsa head, path patching ablates its output and allows the influence to propagate only through residual connections and MLPs (but not through other attention heads). This measures the head's counterfactual influence on the model's behavior. Using this approach, we re-discover previously documented, relatively monosemantic heads (Section 5.1), identify a family of arithmetic-specific Lorsa heads (Section 5.2), and an interesting set of broadcasting Lorsa heads (Section 5.3).

Figure 4: Examples of Lorsa heads re-discovering previously reported heads. Lorsa.5.3378: The Acronym Head attends to the parentheses and preceding text to predict the abbreviation. Lorsa.6.2814: The Successor Head attends to the previous number token and predicts the next number. Lorsa.8.5963: The Copy Suppression Head attends to the previous subject and suppresses its copy. Lorsa.10.4066: The Attention Sink Head simply attends to the '<|beginoftext|>' token.

5.1 Lorsa Re-discovers Previously Reported Heads

Previous works have documented attention heads with specific functionalities in well-characterized contexts.
We demonstrate that Lorsa rediscovers these attention heads. Through experiments on Pythia-160M, we show that Lorsa rediscovers heads replicating these functionalities, such as induction heads [Olsson et al., 2022], name mover heads [Wang et al., 2023], copy suppression heads [McDougall et al., 2023], and successor heads [Gould et al., 2024]. We also observe an important attention behavior called attention sinks [Xiao et al., 2024]. Figure 4 showcases four such heads, with their complete information provided in Appendix E.2. A representative selection of interpretable Lorsa heads is presented in Table 2.

Lorsa Head ID | Autointerp (Function)
Lorsa.5.3955  | Induction for "ve"
Lorsa.5.4010  | Induction for last names
Lorsa.7.4203  | Induction for abbreviations
Lorsa.9.132   | Induction after "and"/"with"
Lorsa.4.32    | "define"/"include" in PHP
Lorsa.4.3013  | "public static" in Java
Lorsa.5.4035  | Say "Four"/"Five"
Lorsa.8.142   | Apple Inc. and products (iPhone etc.)
Lorsa.4.5167  | Previous token is "can"/"could"
Lorsa.11.6084 | Previous token is "make"
Lorsa.4.487   | Abbreviations (parentheses/quotes)
Lorsa.6.1491  | Abbreviations in parentheses
Lorsa.6.1787  | Abbreviations in parentheses
Lorsa.6.5499  | Abbreviations in parentheses
Lorsa.4.1420  | Russian words
Lorsa.9.1622  | Induction in Italian
Lorsa.4.4388  | Attention sinks
Lorsa.7.862   | Attention sinks
Lorsa.6.2592  | "the other"/"another"
Lorsa.10.1232 | Year of birth and death

Table 2: A non-exhaustive collection of interpretable Lorsa heads we have found, grouped by color from top to bottom: induction heads, specific token heads, previous token heads, acronym heads, language-specific heads, attention sink heads, and miscellaneous heads.

Figure 5: For the prompt "36 + 62 =", Lorsa moves the two operands to the last position with 3 heads each. The first operand (36) is attended to, in terms of z pattern, by an "op1 ∈ 27−43" head, an "op1 % 10 ∈ [4, 5, 6]" head and an "op1 % 10 ∈ [6, 7, 8]" head, which uniquely determines op1 = 36. The same applies to op2.
5.2 A Family of Arithmetic Lorsa Heads in Llama-3.1-8B

We identify a group of arithmetic-specific Lorsa heads in Llama-3.1-8B that activate during simple arithmetic operations following the template [op1][operator][op2][=]. One observation is that each head fetches certain operands with a number of unrelated heuristics, consistent with prior findings on arithmetic mechanisms at the neuron level [Nikankin et al., 2024], despite Lorsa's architectural differences. Figure 5 demonstrates an example with the prompt "36 + 62 =". Similar to Ameisen et al. [2025], we visualize the function of each Lorsa head with an operand plot, displaying its activity on the 100×100 grid of potential inputs of the template "op1 + op2 =". Visualization dashboards are provided in Appendix E.3 for these six heads to support the claims made in this section, along with more arithmetic Lorsa heads and their explanations. We also observe a striking similarity between the heuristics used by Lorsa and SAE.

5.3 Lorsa Heads as Thematic Anchors

While exploring Lorsa heads in Llama-3.1-8B, we notice a distinctive subset of Lorsa heads attending to keywords with remarkable thematic consistency from all subsequent tokens in a sentence. Figure 6 illustrates a representative case which exhibits relatively selective, long-range attention to tokens related to presidency, as evidenced by its z pattern. Through manual inspection we also find Lorsa heads activating on topics like alcohol addiction, dynamic systems, medication instructions and terms of service. An intuitive hypothetical function of these heads is as thematic anchors that maintain persistent topic representations to bias subsequent token predictions toward domain-appropriate vocabulary and syntactic structures. We believe these heads to be closely related to SAE features "smeared" across token positions, as mentioned in Lindsey et al. [2025].

Figure 6: z pattern of a presidency-related topic broadcasting Lorsa head.
6 Discussion and Limitations

We report a number of intriguing findings and limitations of Low-Rank Sparse Attention. We believe there remains significant room for improvement for future work in each of the following aspects.

Reducing QK Dimension and Unbinding QK Circuits. One significant limitation of our approach is that we do not get completely independent or low-rank Lorsa heads. The shared QK circuit of Lorsa heads raises concerns about whether they can be independently understood, despite our current positive findings with z patterns, which are an artifact of Q, K and V. If we could overcome the performance degradation of low-dimensional QK circuits, it would be possible to scale up Lorsa with more independent QK circuits and fewer residual stream features interacting via QK.⁴ This is also crucial for circuit tracing methods to clearly understand QK circuits.

Dark Matters. We find a non-trivial correlation between Lorsa errors and SAE errors trained on the same attention layer in terms of (1) average loss per layer, (2) loss per token on the same context and (3) error direction, as shown in Appendix H. This may suggest the existence of universal dark matter [Olah and Jermyn, 2024, Engels et al., 2024] for sparse dictionary learning methods like SAE and Lorsa. Any progress along this direction to reduce or understand SAE / Lorsa dark matter should reveal many interesting behaviors of neural networks.

Inactive Attention SAE Features and Lorsa Heads. Despite efforts on hyperparameter search, we find that attention SAEs and Lorsa both contain a majority of inactive features / heads. This phenomenon renders most computation wasted and raises a question about the difference between the structure of the attention output space and the MLP output space or residual streams, where SAEs of the same size have only a few dead features if configured properly.
⁴ It might also be the case that attention units must be described in multidimensional QK circuits, like induction heads requiring attending to multiple "previous token is X" features.

Cross Layer Attention Superposition. If a certain inter-token feature interaction is performed in more than one layer, our current method, which decomposes only one MHSA layer, does not suffice to find such relations. This parallels the problem of cross-layer superposition [Templeton et al., 2024b] for residual stream features. A cross-layer variant of Lorsa [Lindsey et al., 2024] might be tractable.

Global Weights and Systematic Q/K/V Composition. To better understand the global attention behavior of Transformers, one important research direction is to identify systematic Q/K/V composition like induction heads and previous token heads. Since Lorsa reveals finer-grained versions of MHSA heads, we can expect to find more such cross-layer collaboration behavior. However, we failed in our early attempts to find Lorsa heads with Q/K composition.

7 Related Work

7.1 Explaining Individual Attention Heads

With the help of activation patching [Meng et al., 2022, Zhang and Nanda, 2024] or path patching [Wang et al., 2023, Conmy et al., 2023], the literature has discovered a number of heads that exhibit certain functionality in pre-defined contexts. This line of research starts from a composition of previous token heads and induction heads [Olsson et al., 2022], which is closely related to in-context learning. More work along this line includes name mover heads [Wang et al., 2023], number comparison heads [Hanna et al., 2023], copy suppression heads [McDougall et al., 2023], successor heads [Gould et al., 2024] and long context retrieval heads [Wu et al., 2024].

7.2 Superposition Hypothesis and Sparse Autoencoders

The superposition hypothesis [Arora et al., 2018, Olah et al., 2020, Elhage et al., 2022] assumes that neurons are related to multiple non-orthogonal underlying features.
Sparse Autoencoders [Cunningham et al., 2023, Bricken et al., 2023] are proposed to extract an overcomplete set of sparse, linear, comprehensible features. Importantly, the success of the technique also sheds light on the universality of superposition across model sizes [Templeton et al., 2024b, Lieberum et al., 2024, He et al., 2024b], model architectures [Wang et al., 2024] and modalities [Abdulaal et al., 2024].

7.3 Sparse Autoencoder Variants

SAEs have developed multiple forms along with their rapid evolution in the past year. Some variants improve initialization [Conerly et al., 2024], the loss function [Conerly, 2024, Bussmann et al., 2024] or sparsity constraints [Gao et al., 2024] to solve specific issues such as shrinkage [Wright and Sharkey, 2024] and massive inactive features [Bricken et al., 2023]. Another direction of improvement is the SAE architecture. For instance, Gated SAEs [Rajamanoharan et al., 2024] are proven effective in mitigating shrinkage. Transcoders [Ge et al., 2024, Dunefsky et al., 2024] aim to simplify sparse circuit analysis by replacing MLPs, whose non-linear nature makes causal attribution intractable.

8 Conclusion

In this work, we introduced Low-Rank Sparse Attention (Lorsa) to disentangle atomic attention units from attention superposition in Transformer models. Our experiments validated that Lorsa can recover known attention mechanisms (e.g., induction heads, name movers) and uncover new interpretable behaviors (e.g., arithmetic-specific heads). While Lorsa improves attention interpretability, key challenges remain, particularly in unbinding QK circuits to achieve fully independent heads and reducing superposition effects. Future work should explore low-dimensional QK structures, cross-layer superposition, and systematic Q/K/V composition to further decompose attention mechanisms.
Addressing these limitations could enable a complete, sparse, interpretable reconstruction of Transformer computations, advancing our understanding of in-context learning and feature interactions.

References

Ahmed Abdulaal, Hugo Fry, Nina Montaña Brown, Ayodeji Ijishakin, Jack Gao, Stephanie L. Hyland, Daniel C. Alexander, and Daniel C. Castro. An x-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation. CoRR, abs/2410.03334, 2024. doi: 10.48550/ARXIV.2410.03334. URL https://doi.org/10.48550/arXiv.2410.03334.

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 4895–4901. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.298. URL https://doi.org/10.18653/v1/2023.emnlp-main.298.

Emmanuel Ameisen, Joshua Batson, and Jack Lindsey. Investigating successor heads. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/september-update/index.html.

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/methods.html.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski.
Linear algebraic structure of word senses, with applications to polysemy. Trans. Assoc. Comput. Linguistics, 6:483–495, 2018. doi: 10.1162/TACL_A_00034. URL https://doi.org/10.1162/tacl_a_00034.

Joshua Batson, Brian Chen, and Andy Jones. Circuits updates - March 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/march-update/index.html.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Bart Bussmann, Patrick Leask, and Neel Nanda. Learning multi-level features with matryoshka saes. LessWrong, 2024.
URL https://www.lesswrong.com/posts/rKM9b6B2LqwSB5ToN/learning-multi-level-features-with-matryoshka-saes.

Tom Conerly. Circuits updates - February 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html#dict-learning-resampling.

Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Circuits updates - April 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#training-saes.

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Abstract-Conference.html.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. CoRR, abs/2309.08600, 2023. doi: 10.48550/ARXIV.2309.08600. URL https://doi.org/10.48550/arXiv.2309.08600.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, and et al. The llama 3 herd of models. CoRR, abs/2407.21783, 2024. doi: 10.48550/ARXIV.2407.21783. URL https://doi.org/10.48550/arXiv.2407.21783.

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. CoRR, abs/2406.11944, 2024. doi: 10.48550/ARXIV.2406.11944. URL https://doi.org/10.48550/arXiv.2406.11944.
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.

Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer residual stream. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/privileged-basis/index.html.

Joshua Engels, Logan Riggs, and Max Tegmark. Decomposing the dark matter of sparse autoencoders. CoRR, abs/2410.14670, 2024. doi: 10.48550/ARXIV.2410.14670. URL https://doi.org/10.48550/arXiv.2410.14670.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. CoRR, abs/2406.04093, 2024. doi: 10.48550/ARXIV.2406.04093. URL https://doi.org/10.48550/arXiv.2406.04093.

Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, and Xipeng Qiu. Automatically identifying local and global circuits with linear computation graphs. CoRR, abs/2405.13868, 2024. doi: 10.48550/ARXIV.2405.13868. URL https://doi.org/10.48550/arXiv.2405.13868.

Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild.
In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=kvcbV8KQsi.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=JYs1R9IMJr.

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/efbba7719c5172d175240f24be11280-Abstract-Conference.html.

Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, and Xipeng Qiu. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello-gpt. CoRR, abs/2402.12201, 2024a. doi: 10.48550/ARXIV.2402.12201. URL https://doi.org/10.48550/arXiv.2402.12201.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. CoRR, abs/2410.20526, 2024b. doi: 10.48550/ARXIV.2410.20526. URL https://doi.org/10.48550/arXiv.2410.20526.

Adam Jermyn, Chris Olah, and Tom Conerly. Circuits updates - January 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#attn-superposition.

Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, and Neel Nanda.
Interpreting attention layer outputs with sparse autoencoders. CoRR, abs/2406.17759, 2024. doi: 10.48550/ARXIV.2406.17759. URL https://doi.org/10.48550/arXiv.2406.17759.

Robert Krzyzanowski, Connor Kissane, Arthur Conmy, and Neel Nanda. We inspected every head in gpt-2 small using saes so you don't have to. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca D. Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. CoRR, abs/2408.05147, 2024. doi: 10.48550/ARXIV.2408.05147. URL https://doi.org/10.48550/arXiv.2408.05147.

Jack Lindsey, Adly Templeton, Jonathan Marcus, Thomas Conerly, Joshua Batson, and Christopher Olah. Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/crosscoders/index.html.

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.

Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. Copy suppression: Comprehensively understanding an attention head. CoRR, abs/2310.04625, 2023. doi: 10.48550/ARXIV.2310.04625. URL https://doi.org/10.48550/arXiv.2310.04625.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Sanmi Koyejo, S.
Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html.

Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov. Arithmetic without algorithms: Language models solve math with a bag of heuristics. CoRR, abs/2410.21272, 2024. doi: 10.48550/ARXIV.2410.21272. URL https://doi.org/10.48550/arXiv.2410.21272.

nostalgebraist. interpreting gpt: the logit lens. LessWrong, 2020. URL https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.

Chris Olah and Adam Jermyn. Circuits updates - July 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/july-update/index.html#hurdles.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. CoRR, abs/2404.16014, 2024. doi: 10.48550/ARXIV.2404.16014. URL https://doi.org/10.48550/arXiv.2404.16014.
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. CoRR, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864.

Adly Templeton, Joshua Batson, Adam Jermyn, and Chris Olah. Circuits updates - January 2024. Transformer Circuits Thread, 2024a. URL https://transformer-circuits.pub/2024/jan-update/index.html#predict-future.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024b. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Junxuan Wang, Xuyang Ge, Wentao Shu, Qiong Tang, Yunhua Zhou, Zhengfu He, and Xipeng Qiu. Towards universality: Studying mechanistic similarity across language model architectures. CoRR, abs/2410.06672, 2024. doi: 10.48550/ARXIV.2410.06672. URL https://doi.org/10.48550/arXiv.2410.06672.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
Benjamin Wright and Lee Sharkey. Addressing feature suppression in SAEs. LessWrong, 2024. URL https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes.

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. CoRR, abs/2404.15574, 2024. doi: 10.48550/ARXIV.2404.15574. URL https://doi.org/10.48550/arXiv.2404.15574.

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.

Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.

Appendices

A Applying Lorsa to MHSA Variants
B Ablation Study on Crucial Architectural Designs
B.1 Ablation Study on QK Dimension
B.2 Ablation Study on Binding Ratio
B.3 Ablation Study on QK Initialization
B.4 Implementation Details
C L(N, K) Scaling Laws
D Automated Interpretability Details
E Additional Case Studies
E.1 Attribution Algorithm for Identifying Lorsa Heads with Specific Functionalities
E.2 Examples of Lorsa's Rediscovery of Reported Functional Heads
E.3 Arithmetic Lorsa Heads
F Assessing Correlation with MHSA
F.1 Oblique Projection Method for Attribution
F.2 How Many Attention Units are Distributed Across MHSA Heads?
F.3 Induction Heads in Pythia-160M
G Analyzing SAE Features' Interaction on Lorsa Heads
G.1 Quantifying Feature Impacts on Q and K
G.2 Quantifying Feature Correlations with O and V
H Lorsa Dark Matter
I Towards Full Sparsification of a 2-Layer Transformer

A Applying Lorsa to MHSA Variants

Modern transformer-based models commonly employ variants of multi-head self-attention (MHSA), such as those incorporating rotary position embeddings (RoPE) [Su et al., 2021] and grouped-query attention (GQA) [Ainslie et al., 2023]. Our proposed Lorsa method is compatible with these MHSA variants through straightforward adaptations.

• For RoPE-enhanced MHSA, we apply the same rotary transformations to Lorsa's computed queries and keys before computing attention scores, maintaining the positional information encoding.

• In GQA implementations, Lorsa operates without modification; specifically, we intentionally avoid introducing grouped queries within the Lorsa framework. Empirical results on both Pythia-160M and Llama-3.1-8B demonstrate that this design choice does not adversely affect performance.

B Ablation Study on Crucial Architectural Designs

We conduct ablation studies on two crucial architectural designs: (1) the query and key dimension and (2) the binding ratio. Our experiments validate the necessity of maintaining both the QK dimension and the binding mechanism in our proposed architecture. Additional ablation tests on other implementation details further validate our decisions.
Furthermore, we derive two hard constraints for parameter selection (violating either constraint leads to significant performance degradation):

• The QK dimension must not be smaller than the head dimension in MHSA.
• The number of QK pairs must not be fewer than the number of attention heads in MHSA.

B.1 Ablation Study on QK Dimension

Figure 7: Ablation study on the QK dimension using Pythia-160M under different context lengths (panel (a): 256 tokens; panel (b): 1024 tokens). We fix the parameter budget across all settings and observe that reducing the QK dimension below the original MHSA head dimension (d_head = 64) results in significant performance degradation, highlighting the importance of maintaining a high QK dimension.

We conduct ablation studies on the QK dimension using Pythia-160M, evaluating performance under different context lengths (256 and 1024 tokens). To ensure a fair comparison, we fix the parameter budget at 4 d_model per attention head and maintain a total parameter count equivalent to 4× the original MHSA configuration throughout all experiments. As shown in Figure 7, reducing the QK dimension below the original MHSA head dimension (d_head = 64) leads to severe performance degradation. This empirical evidence supports our design choice to maintain a high QK dimension.
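Figures 7 and 8 and Table 3 report the Fraction of Variance Unexplained (FVU). The paper does not restate the formula in this appendix; the sketch below assumes the standard definition (reconstruction residual normalized by the target's variance), with function and variable names chosen for illustration:

```python
# Hedged sketch of the FVU metric, assuming the standard definition:
# FVU = ||target - reconstruction||^2 / ||target - mean(target)||^2.
def fvu(target, reconstruction):
    mean = sum(target) / len(target)
    residual = sum((t - r) ** 2 for t, r in zip(target, reconstruction))
    total = sum((t - mean) ** 2 for t in target)
    return residual / total
```

A perfect reconstruction gives FVU = 0, while predicting the mean everywhere gives FVU = 1, matching the intuition that lower FVU means a better replacement model.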
Figure 8: Ablation study on the binding ratio at context lengths 256 and 1024, for 768, 1536, 3072, and 6144 Lorsa heads. We vary the number of independent Lorsa QK heads and evaluate model performance under different settings. Appropriate binding maintains performance while reducing QK circuit cost, whereas overly aggressive binding (below the number of original MHSA heads) leads to substantial degradation.

B.2 Ablation Study on Binding Ratio

We conduct a systematic study of the impact of the number of independent Lorsa QK heads (i.e., the number of Lorsa heads divided by the binding ratio) across a range of configurations, as illustrated in Figure 8. Our experimental results highlight two key observations:

• Appropriate binding effectively preserves model performance while substantially reducing both the parameter count and the computational cost of the QK circuit (scaling proportionally with the binding ratio).

• Model performance deteriorates significantly when the number of independent QK heads falls below the original MHSA head count, establishing this threshold as a critical lower bound for binding ratio selection.

B.3 Ablation Study on QK Initialization

Given that our QK matrices maintain high dimensionality and adopt a binding strategy, a natural question arises: can we directly reuse the original MHSA QK parameters in Lorsa?
To investigate this, we evaluate three settings: (1) randomly initializing the QK parameters of Lorsa, (2) initializing the QK parameters of Lorsa with the original MHSA QK parameters and allowing them to be updated during training, and (3) fixing the QK parameters to the original MHSA QK parameters throughout training. The results, summarized in Table 3, show that fixing the QK parameters to those of MHSA leads to worse performance than the other two setups. This suggests that during optimization, Lorsa learns QK parameters that capture information not present in the original MHSA parameters.

Initialization Strategy                     | Fraction Variance Unexplained (FVU)
Random Initialization                       | 11.3%
Initialization with Original QK (Trainable) | 11.2%
Initialization with Original QK (Fixed)     | 12.4%

Table 3: Comparison of different QK initialization strategies for Lorsa.

B.4 Implementation Details

To align with the superposition hypothesis and the architectural design of the SAE, we apply a ReLU to ensure that the activations z are non-negative. However, we observe that this modification has negligible impact on training dynamics, as the top-k activations are almost always positive for reasonable choices of k.

Figure 9: Comparison of the scaling law of convergence loss with the number of parameters at fixed sparsity (K) between SAE and Lorsa trained on layer 3 (out of 12) in Pythia-160M.

C L(N, K) Scaling Laws

We explore Low-Rank Sparse Attention scaling laws with respect to both the number of learnable parameters N and the sparsity K (i.e., the number of active Lorsa heads per token). This joint scaling law is analogous to the TopK Sparse Autoencoder L(N, K) scaling law reported in Gao et al. [2024], which we also replicate for comparison with Lorsa. The L(N, K) scaling laws at layer 3 (out of 12) in Pythia-160M for both SAE and Lorsa are shown in Figure 9. It is reasonable that Lorsa has a larger loss than SAE, because SAE has identical input and output, which makes reconstruction easier.
D Automated Interpretability Details

Evaluation Protocol. Our automated interpretability assessment employs a two-phase paradigm adapted from Bills et al. [2023]:

1. Explanation Phase: GPT-4o generates mechanistic explanations using:
• For activation patterns: 8 top-activating token contexts
• For z-patterns/DFAs: contribution graphs to max-activating tokens

2. Simulation Phase: GPT-4o predicts activations/patterns for:
• 4 top-activating contexts (testing pattern recognition)
• 4 randomly sampled contexts (testing generalization)

Top Activation Explanation Phase Prompt.

Prompt
We are analyzing the activation levels of features in a neural network, where each feature activates certain tokens in a text. Each token's activation value indicates its relevance to the feature, with higher values showing stronger association. Your task is to infer the common characteristic that these tokens collectively suggest based on their activation values. Consider the following activations for a feature in the neural network. Activation values are non-negative, with higher values indicating a stronger connection between the token and the feature. Summarize in a single sentence what characteristic the feature is identifying in the text. Don't list examples of words. Do not start with "This feature is identifying...". Go straight to the explanation.
Sentence 1:
<START>
<|endoftext|><tab>-0.0
/<tab>-0.0
*/<tab>0.2
... (omitted)
<END>
Sentence 2:
... (omitted)

Top Activation Simulation Phase Prompt.

Prompt
We're studying neurons in a neural network. Each neuron looks for certain things in a short document. Your task is to read the explanation of what the neuron does, and predict the neuron's activations for each token in the document. For each document, you will see the full text of the document, then the tokens in the document with the activation left blank. You will print the exact same tokens verbatim, but with the activation values filled in according to the explanation.
Pay special attention to the explanation's description of the context and order of tokens or words. Fill out the activation values with integer values from 0 to 10. Don't use negative numbers. Please think carefully. No need to include rationales. Directly start with the first token and do not use code blocks, i.e., ```.
Neuron 1 explanation: This feature is identifying vowels.
Sequence 1 Tokens without Activations:
a<tab> b<tab> c<tab> d<tab> e<tab> f<tab>
Sequence 1 Tokens with Activations:
a<tab>10 b<tab>0 c<tab>0 d<tab>0 e<tab>10 f<tab>0
Neuron 2 explanation: <Autointerp explanations generated in the previous phase>
<Few shot examples>

z Pattern / DFA Explanation Phase Prompt.

Prompt
We are analyzing the attention map of attention heads in a neural network, where each head attends between tokens in a text. Given a head and a query token, we provide each previous token's contribution value, with higher values showing stronger association. Your task is to infer the common characteristic of this head that these sequences collectively suggest based on their attention map. Consider the following attention maps for an attention head. Each line is in the format of <token><tab><value>. Query tokens are additionally highlighted with <token><tab><value><tab>**Query token**. Note that query tokens also attend to themselves. Higher values indicate a stronger contribution from this token to the query token. Summarize in a single sentence what characteristic the head is attending from and to in the text. It might be helpful to summarize both the commonality of query tokens and source tokens (if any). It is also recommended to mention if this head is often attending to itself. Don't list examples of words. Do not start with "This head is ...". Directly start with the explanation.
Sentence 1:
<START>
<|endoftext|><tab>-0.0
/<tab>0.0
... (omitted)
*/<tab>0.0<tab>**Query token**

z Pattern / DFA Simulation Phase Prompt.
Prompt
We're studying attention heads in a neural network. Each head follows a certain attention pattern in a short document. Your task is to read the explanation of what the head does, and predict the head's attention pattern for each previous token in the document, given a specific query token. For each document, you will see the full text of the document, then the tokens in the document with the activation left blank. You will print the exact same tokens verbatim, but with the contribution values filled in according to the explanation. Pay special attention to the explanation's description of the context and order of tokens or words. Each line is in the format of <token><tab>. Query tokens are additionally highlighted with <token><tab>**Query token**<tab>. Fill out the contribution values with integer values from 0 to 10. Don't use negative numbers. Please think carefully. No need to include rationales. Directly start with the first token and do not use code blocks, i.e., ```.
Head 1 explanation: This head is attending from one vowel to previous vowels and itself.
Sequence 1 Tokens without Activations:
a<tab> b<tab> c<tab> d<tab> e<tab>**Query token**
Sequence 1 Tokens with Activations:
a<tab>10 b<tab>0 c<tab>0 d<tab>0 e<tab>**Query token**<tab>10
Head 2 explanation: <Autointerp explanations generated in the previous phase>
<Few shot examples>

E Additional Case Studies

E.1 Attribution Algorithm for Identifying Lorsa Heads with Specific Functionalities

In addition to the path patching method discussed in Section 5.1, we employ an attribution algorithm, inspired by the approach for detecting important features with attribution in Batson et al. [2024], to identify Lorsa heads associated with specific functionalities. The attribution score for a given Lorsa head h is defined as:

attr_h := O_h · ∇_x L

Here, ∇_x L is the gradient of the logit of the target token with respect to the attention output O_h of the Lorsa head.
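The attribution score above is a simple inner product between the head's output and a gradient. A minimal sketch, with hypothetical names and the gradient assumed to be precomputed by a backward pass:

```python
# Hedged sketch of the attribution score attr_h = O_h · ∇_x L.
# head_output is the Lorsa head's output vector O_h; grad_logit is the
# gradient of the target-token logit w.r.t. that output (precomputed elsewhere).
def attribution_score(head_output, grad_logit):
    return sum(o * g for o, g in zip(head_output, grad_logit))
```

In practice one would compute this score for every Lorsa head on a prompt and rank heads by it to surface candidates for a given behavior.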
For different prompts, we also use the logit difference or probability difference to compute ∇_x L. The score attr_h quantifies the contribution of Lorsa head h to the prediction of the correct token.

E.2 Examples of Lorsa's Rediscovery of Reported Functional Heads

Figure 10: Detailed information on Lorsa's rediscovery of reported functional heads.

Detailed information on the Lorsa heads discussed in Section 5.1 is provided in Figure 10, where we visually demonstrate the logit differences induced by each Lorsa head, along with the most strongly correlated MHSA heads and SAE features.

E.3 Arithmetic Lorsa Heads

We present the SAE features related to the reported arithmetic Lorsa heads in Table 4, which shows consistent interpretations in terms of the operand plot and z pattern. Additionally, Table 5 provides a broader set of examples for these arithmetic Lorsa heads, including functional descriptions and the z-patterns of their top activations.

Lorsa Head ID  | Manual Interpretation with Operand Plot | Manual Interpretation with z Pattern
Lorsa.16.20791 | op1 ∈ 27–43                             | near 30
Lorsa.16.20931 | op1 % 10 ∈ [4, 5, 6]                    | ending with 4 or 6
Lorsa.16.20947 | op1 % 10 ∈ [6, 7, 8]                    | ending with 7, sometimes 6
Lorsa.15.3646  | op2 % 10 = 2                            | ending with 2
Lorsa.15.3813  | op2 ∈ 55–99                             | from 50–99
Lorsa.15.4001  | op2 ∈ 38–63                             | near 50

Table 4: Supplementary information on the Lorsa heads in Figure 5.
ID                Operator         Operand
Lorsa.15.3646     Addition         op2 ends with 2
                  Subtraction      min(op1, op2) ends with 2
                  Multiplication   op2 = 2 or 12
                  Division         op2 = 2
Lorsa.15.3648     Addition         op2 ends with 4
                  Subtraction      min(op1, op2) ends with 4
                  Multiplication   op2 = 4, 24, or 40
                  Division         op2 = 4
Lorsa.15.2668     Addition         Unrelated
                  Subtraction      Unrelated
                  Multiplication   op2 = 3, 6, 30, or 60
                  Division         op2 around 3 or 30
Lorsa.15.2770     Addition         Unrelated
                  Subtraction      Unrelated
                  Multiplication   op2 around 62 and its multiples
                  Division         op2 around 62 and its multiples
Lorsa.15.2945     Addition         Unrelated
                  Subtraction      Unrelated
                  Multiplication   op2 = 7, 11 and their multiples
                  Division         op2 = 7, 11 and their multiples

Table 5: Additional cases of arithmetic heads, with top-activation z patterns per operator.

F Assessing Correlation with MHSA

Lorsa is proposed as a method to attack attention superposition. A natural question arises: how is each Lorsa head composed in terms of the original attention heads? We address this by computing the attribution of each Lorsa head to the original attention heads using an oblique projection method (Appendix F.1). Analyzing all Lorsa heads trained on Pythia-160M (Appendix F.2), we find that roughly half of the Lorsa heads originate from a single original head, while the other half are superpositions across multiple original heads.

F.1 Oblique Projection Method for Attribution

Given the output of an original attention head, we project it obliquely onto the (generally non-orthogonal) basis formed by the outputs of all Lorsa heads at the same layer. The resulting coefficients represent the contribution of the original head to each Lorsa head. Since the summed outputs of the original heads and of the Lorsa heads closely match, the contribution coefficients for a given Lorsa head approximately sum to one. Conversely, we similarly compute the fraction of each Lorsa head's output that can be attributed to each original attention head by projecting the Lorsa head's output onto the basis formed by the original heads' outputs.
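A minimal numpy sketch of the two computations involved here: the oblique-projection coefficients (which we formulate as a least-squares solve against the non-orthogonal basis, an assumption on our part) and the effective head count n defined in Appendix F.2. All names are ours, not the authors' released code.

```python
import numpy as np

def oblique_coefficients(basis_outputs, target_output):
    """Coefficients of `target_output` in the (generally non-orthogonal)
    basis spanned by the rows of `basis_outputs`, via least squares.

    basis_outputs: (n_heads, d_model) outputs of the heads forming the basis.
    target_output: (d_model,) output of the head being decomposed.
    """
    coeffs, *_ = np.linalg.lstsq(basis_outputs.T, target_output, rcond=None)
    return coeffs  # (n_heads,); sums to ~1 when the two sums closely match

def effective_num_heads(contributions, threshold=0.9):
    """Minimum number n of heads whose cumulative (absolute) contribution
    exceeds `threshold` of the total."""
    sorted_contrib = np.sort(np.abs(contributions))[::-1]
    cumulative = np.cumsum(sorted_contrib) / sorted_contrib.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)
```

Sorting by absolute contribution before accumulating is our choice for handling the negative coefficients an oblique projection can produce.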
All reported results are averaged over more than 1M tokens.

F.2 How Many Attention Units are Distributed Across MHSA Heads?

Figure 11: Distribution of Lorsa heads by the number n of original attention heads they are superposed over, per layer. No clear trend is observed across different layers. Approximately half of the Lorsa heads are primarily associated with a single original head, about one quarter are superposed over two different original heads, around one eighth are superposed over three different original heads, and the remaining one eighth are superposed over more than three original heads.

We compute the attribution statistics for all Lorsa heads trained on Pythia-160M. For a given Lorsa head, we define n as the minimum number of original heads whose cumulative contributions exceed 90%. We interpret n as the effective number of original heads a Lorsa head superposes over. As shown in Figure 11, approximately half of the Lorsa heads are primarily derived from a single original head, about a quarter involve two original heads, and the remaining quarter involve three or more original heads.

F.3 Induction Heads in Pythia-160M

We use path patching to measure the contribution of each MHSA head in Pythia-160M to induction behavior. The results are shown in Table 6. We find that heads L5.0, L4.6, L5.7, L9.0, and L5.6 exhibit the most prominent induction signals.

G Analyzing SAE Features' Interaction with Lorsa Heads

We train Sparse Autoencoders (SAEs) on both the inputs and outputs of Lorsa to facilitate understanding of its functionality. Since Lorsa's Q, K, and V are computed from the input, with the output derived from O contributing to the final result, interactions between SAE features and these components exist across all four aspects: Q, K, O, and V. To evaluate the influence of SAE features on Q and K, we employ an ablation method (Appendix G.1).
The correlations between SAE features and the O and V components are assessed using cosine similarity (Appendix G.2). For each Lorsa head, we identify the SAE features most strongly correlated with each of these aspects. The results are visualized in the Lorsa head dashboard.

Table 6: Contribution of each MHSA head to induction behavior in Pythia-160M, measured via path patching. Notable induction heads (L5.0, L4.6, L5.7, L9.0, L5.6) are marked with an asterisk.

Layer     0      1      2      3      4      5      6      7      8      9      10     11
0         0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
1         0.07  -0.15  -0.10   0.03   0.09  -0.08  -0.07   0.06  -0.01   0.11   0.34  -0.05
2        -0.14   0.07   0.10   0.14   0.14  -0.13   0.60  -0.03  -0.14   0.10   0.04   0.03
3        -0.24  -0.14  -0.96  -1.20  -0.49  -0.14   0.20  -0.38  -0.10   0.06  -0.11  -0.07
4         0.13  -0.26   0.09  -0.16  -0.10  -0.02   0.89*  0.13   0.09  -0.28  -0.14   0.30
5         4.00* -0.20   0.05   0.06  -0.53  -0.04   0.48*  0.62*  0.06   0.08   0.05  -0.23
6        -0.04  -0.23  -0.04  -0.22   0.02   0.09   0.04  -0.33   0.02  -0.04  -0.38   0.04
7        -0.28   0.17   0.03   0.06  -0.28  -0.07   0.01  -0.18  -0.23  -0.03  -0.02   0.18
8        -0.07   0.03   0.50   0.00   0.15  -0.02   0.01  -0.22   0.02  -0.02  -0.08   0.38
9         0.54* -0.03   0.07  -0.09  -1.10  -0.04   0.04   0.00   0.04   0.10  -0.01   0.02
10       -0.01   0.03   0.00   0.00  -0.03  -0.10   0.01  -0.01   0.00  -0.04   0.03   0.01
11       -0.14  -0.13  -0.05  -0.04   0.00  -0.02  -0.11  -0.02   0.01  -0.07  -0.02   0.06

G.1 Quantifying Feature Impacts on Q and K

For a given Lorsa head, the impact of a specific feature on Q is calculated as follows. First, we compute the attention pattern at the activation locations of the Lorsa head. Then, the feature is ablated from the input, and Q′ and the new attention pattern are computed (with K remaining unaffected). The Kullback-Leibler (KL) divergence between the original and modified attention patterns quantifies the effect of the feature on Q. After iterating over 1 million tokens, the maximum KL divergence observed across all activations of the Lorsa head is taken as the measure of the feature's influence on Q for this head.
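The Q-side measurement can be sketched for a single query position as follows. This is a simplified illustration under our own assumptions (softmax attention over a short key sequence, feature ablation already applied upstream); names and shapes are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def q_impact_kl(q, q_ablated, keys, eps=1e-12):
    """KL divergence between a head's attention pattern before and after
    ablating a feature from the input used to compute Q (K held fixed).

    q, q_ablated: (d_head,) query vectors from the original / ablated input.
    keys:         (seq_len, d_head) key vectors (unchanged).
    """
    p = softmax(keys @ q)                 # original attention pattern
    p_prime = softmax(keys @ q_ablated)   # pattern with the feature ablated
    return float(np.sum(p * np.log((p + eps) / (p_prime + eps))))
```

A feature whose ablation leaves Q unchanged gives a KL of zero; over a corpus, the maximum KL across all activations of the head is kept. The K-side measure is symmetric: recompute all keys from the ablated input while holding Q fixed.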
A similar approach is used to calculate the impact of a feature on K, with the difference that when recalculating the attention pattern, all instances of K are recomputed using the modified input, while Q remains unchanged.

G.2 Quantifying Feature Correlations with O and V

For a given Lorsa head, both weight vectors W_O and W_V are one-dimensional vectors of size D_model. Therefore, for each SAE feature trained on the Lorsa input, the contribution to V is linear, meaning that the contribution of each feature to V scales proportionally with the feature's activation value. Similarly, for each activation z of the head, the contribution of SAE features trained on the Lorsa output to the activation value is also linear. We compute the cosine similarity between the decoder direction of each SAE feature trained on the Lorsa input and W_V, which quantifies its correlation with V for the given Lorsa head. Similarly, the cosine similarity between the encoder direction of each SAE feature trained on the Lorsa output and W_O measures its correlation with O for the given Lorsa head.

H Lorsa Dark Matter

In Figure 12, we examine the Fraction of Variance Unexplained (FVU) for both Lorsa and SAE on the same attention layers of Pythia-160M and Llama-3.1-8B. To further explore error patterns, Figure 13 illustrates the per-token error norms of Lorsa and SAE across layers 2, 6, and 10 of Pythia-160M on a set of 64 tokens. Figure 14 then quantifies the distribution of cosine similarity between Lorsa's and SAE's per-token errors on the same layers, measured on approximately 10,000 tokens. These results indicate that the per-token loss patterns of Lorsa and SAE have a nontrivial correlation. It is interesting that both Lorsa and SAE exhibit a positive correlation in their magnitudes and trends for FVU and per-token error norms.
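The FVU metric compared in these figures is a standard quantity; a minimal sketch, assuming variance is taken around the per-dimension mean over tokens:

```python
import numpy as np

def fraction_variance_unexplained(target, reconstruction):
    """FVU = ||x - x_hat||^2 / ||x - mean(x)||^2, summed over tokens.

    target, reconstruction: (n_tokens, d_model) activations and their
    reconstructions (e.g., Lorsa or SAE outputs).
    """
    residual = np.sum((target - reconstruction) ** 2)
    variance = np.sum((target - target.mean(axis=0)) ** 2)
    return float(residual / variance)
```

An FVU of 0 is perfect reconstruction; an FVU of 1 means the model explains no more variance than predicting the mean.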
We propose that this is not a coincidence, and hypothesize that it stems from a shared gap between sparse dictionary learning and the representation structure of the data within the model. Alternatively, this correlation may arise from the difficulty sparse dictionary learning faces in capturing super-rare data features or certain nonlinear or dense components of the features. While we observe that Lorsa generally yields a higher FVU and error norm than SAE, this could suggest that Lorsa captures a greater amount of "dark matter" relative to SAE. This distinction arises because SAE is trained to reconstruct activations, while Lorsa is optimized to predict the outputs of the original attention heads.

Figure 12: FVU of Lorsa and SAE across each layer in (a) Pythia-160M and (b) Llama-3.1-8B. Both models show a similar trend in FVU.

Figure 13: Per-token error norms of Lorsa and SAE on layers 2, 6, and 10 of Pythia-160M for 64 tokens.

Figure 14: Distribution of per-token error cosine similarity between Lorsa and SAE on layers 2, 6, and 10 of Pythia-160M, measured on approximately 10,000 tokens.

I Towards Full Sparsification of a 2-Layer Transformer

Since our final goal is to understand Transformers' inner workings by breaking down MHSA and MLPs into atomic units (Figure 1), we train Lorsa and a Transcoder [Dunefsky et al., 2024] on a 2-layer Transformer (link). We follow the method introduced in Ge et al. [2024], where features are multiplied through the QK circuit to find the most salient feature pairs contributing to QK scores. Alternately applying attribution through Transcoder features / Lorsa heads and QK ablation gives us a clear attribution graph for induction behavior (Figure 15). Due to the capability constraints of this model, we failed to observe more interesting behaviors or attribution graphs involving Transcoder features.
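The feature-pair scoring through the QK circuit can be sketched as a bilinear form. This is our reading of the Ge et al. [2024] procedure, with shapes and names assumed for a single low-rank head:

```python
import numpy as np

def qk_pair_salience(query_feats, key_feats, w_q, w_k):
    """Bilinear QK score for every (query-feature, key-feature) pair:
    s[i, j] = (W_Q f_i) . (W_K g_j).

    query_feats: (n_q, d_model) decoder directions of query-side features.
    key_feats:   (n_k, d_model) decoder directions of key-side features.
    w_q, w_k:    (d_head, d_model) the head's low-rank Q / K maps.
    """
    return (query_feats @ w_q.T) @ (w_k @ key_feats.T)  # (n_q, n_k)
```

The most salient feature pairs for the head are then the largest entries of this matrix.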
Nonetheless, we believe applying Lorsa and Cross-Layer Transcoders [Ameisen et al., 2025] to a larger model may reveal many surprising behaviors, following the spirit of Lindsey et al. [2025].

Figure 15: Induction circuit found in our fully sparsified replacement model.