
Paper deep dive

Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features

Yiting Liu, Zhi-Hong Deng

Year: 2026 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 93

Models: Gemma-2, Llama-3.1

Abstract

Abstract:Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features. Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data. Through three experiments on Gemma-2 and Llama-3.1 models, we demonstrate that (1) 1/4 of features directly predict output tokens, (2) features actively participate in attention mechanisms with depth-dependent structure, and (3) semantic and non-semantic feature populations exhibit distinct distribution profiles in attention circuits. Our analysis provides the missing out-of-context half of SAE feature interpretability.

Tags

ai-safety (imported, 100%), empirical (suggested, 88%), mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. View the paper on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/11/2026, 1:32:12 AM

Summary

The paper introduces a weight-based, out-of-context interpretation framework for Sparse Autoencoder (SAE) features, moving beyond activation-based methods. By analyzing direct weight interactions with model components, the authors identify that approximately 1/4 of SAE features are semantically coherent and predict output tokens, while others participate in attention mechanisms with depth-dependent structures. The study reveals distinct functional specializations across layers in Gemma-2 and Llama-3.1 models, providing a mechanistic understanding of SAE features.

Entities (4)

Gemma-2 · language-model · 100%
Llama-3.1 · language-model · 100%
Sparse Autoencoders · methodology · 100%
Weight-based interpretation · framework · 95%

Relation Signals (3)

Weight-based interpretation analyzes Sparse Autoencoders

confidence 95% · We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions

Sparse Autoencoders decomposes Language Model Representations

confidence 95% · Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features.

Gemma-2 exhibits U-shaped distribution

confidence 90% · In Gemma-2-9B, 44.15% of the features in layer 5 pass the joint threshold. Across both model scales, we find that about 1/4 of SAE features meet our criteria... the pass rate for semantic features follows a U-shaped distribution.

Cypher Suggestions (2)

Find all models analyzed in the paper · confidence 90% · unvalidated

MATCH (e:Entity {entity_type: 'Language Model'}) RETURN e.name

Map the relationship between methodologies and models · confidence 85% · unvalidated

MATCH (m:Methodology)-[:ANALYZES]->(lm:LanguageModel) RETURN m.name, lm.name

Full Text

92,847 characters extracted from source content.


Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features

Yiting Liu, Zhi-Hong Deng

State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University. Correspondence to: Zhi-Hong Deng <zhdeng@pku.edu.cn>. Preprint. February 2, 2026.

Abstract

Sparse autoencoders (SAEs) have emerged as a powerful technique for decomposing language model representations into interpretable features. Current interpretation methods infer feature semantics from activation patterns, but overlook that features are trained to reconstruct activations that serve computational roles in the forward pass. We introduce a novel weight-based interpretation framework that measures functional effects through direct weight interactions, requiring no activation data. Through three experiments on Gemma-2 and Llama-3.1 models, we demonstrate that (1) 1/4 of features directly predict output tokens, (2) features actively participate in attention mechanisms with depth-dependent structure, and (3) semantic and non-semantic feature populations exhibit distinct distribution profiles in attention circuits. Our analysis provides the missing out-of-context half of SAE feature interpretability.

1. Introduction

A language model represents more features than it has directions, forcing it to encode features in superposition, which likely explains why many neurons are polysemantic and difficult to explain (Elhage et al., 2022; Bills et al., 2023). Sparse autoencoders (SAEs) have emerged as a promising approach for decomposing such representations into more interpretable features (Huben et al., 2023; Bricken et al., 2023; Gao et al., 2024; Templeton et al., 2024). By training higher-dimensional reconstructions of model activations, SAEs aim to disentangle the superposed representations into sparse and semantically meaningful components.

[Figure 1: An SAE feature (Gemma-2-9B, layer 35, index 11433) shown with activated tokens highlighted and the highest activation values boxed. The activation-based explanation generated by GPT-4o-mini ("scientific terminology and statistical data related to genetic research and health outcomes") is superficial and fails to capture the causal effect; the weight-based interpretation is "predict next token 'among'".]

Formally, given an activation vector $x \in \mathbb{R}^{d_{model}}$ from one layer of the model, an SAE is trained to find a sparse feature activation vector $z \in \mathbb{R}^{d_{sae}}$ (where the feature dimension $d_{sae} \gg d_{model}$) that can reconstruct the original activation:

$z = \sigma(x W_{enc} + b_{enc})$   (1)

$\hat{x} = z W_{dec} + b_{dec}$   (2)

where $\sigma$ is a nonlinear activation function, typically ReLU. The model is trained to minimize reconstruction loss, and the columns of the decoder matrix $W_{dec}$ are interpreted as "features" in the model's activation space (Galichin et al., 2025).
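As a concrete illustration of Equations (1)–(2), and of the L1 sparsity penalty discussed next, here is a minimal sketch of an SAE forward pass. Dimensions, initialization, and the plain ReLU-plus-L1 setup are illustrative assumptions rather than the authors' (or Gemma Scope's) implementation.

```python
import torch

# Illustrative sizes only (roughly a Gemma-2-9B residual width and a 16k-wide SAE).
d_model, d_sae = 3584, 16384

W_enc = torch.randn(d_model, d_sae) / d_model ** 0.5   # encoder weights
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) / d_sae ** 0.5     # W_dec[i] is feature i's direction here
b_dec = torch.zeros(d_model)

def sae_forward(x: torch.Tensor):
    """x: (batch, d_model) residual-stream activations."""
    z = torch.relu(x @ W_enc + b_enc)   # Eq. (1): sparse feature activations
    x_hat = z @ W_dec + b_dec           # Eq. (2): reconstruction
    return z, x_hat

x = torch.randn(4, d_model)
z, x_hat = sae_forward(x)
recon_loss = (x - x_hat).pow(2).sum(-1).mean()
l1_penalty = z.abs().sum(-1).mean()     # one common way to enforce sparsity
loss = recon_loss + 1e-3 * l1_penalty
```

In the experiments that follow, the paper analyzes pretrained Gemma Scope and Llama Scope SAEs rather than training new ones, so only the trained encoder and decoder matrices are needed.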
Sparsity is enforced either by applying ReLU in combination with anL 1 penalty onz, or by substitutingσ with more sophisticated operators (e.g., Gao et al., 2024; Bussmann et al., 2025), which encourages most features to remain inactive for any givenx. This decomposition aims to make language models more interpretable, helping explain how they process information and make decisions. Because these features emerge as unlabeled latent variables 1 arXiv:2601.22447v1 [cs.LG] 30 Jan 2026 A Weight-Based OOC Explanation of SAE Features from the training process, their meanings must be inferred post-hoc. Existing interpretation methods primarily address this by asking “When does this feature activate?” and ana- lyzing input contexts where the feature exhibits high activa- tion. This top-down, contextual approach has been scaled through automated explanation generation, where large lan- guage models are prompted with activating text examples (Bricken et al., 2023; Gao et al., 2024; Paulo et al., 2025) to assign plausible meanings to SAE features. However, this perspective only addresses half of the inter- pretability challenge. LLM-generated descriptions are often high-level summaries of activation contexts, and they are noted for being overly broad and oversimplified (Ma et al., 2025). Even if such explanations are descriptively accurate, they cannot clearly specify the impact on the model when a feature is activated or deactivated, which is essential for interventional applicability. Critically, SAEs are optimized for reconstruction rather than for explicitly isolating human-interpretable concepts. A fea- ture’s primary role is to accurately rebuild activations in the computation graph, and it should therefore inherit the computational function of those activations. We argue that mainstream interpretation methods are more correlational than causal. By focusing on activation patterns, they over- look the mechanistic effects inherent to the SAE training process. This creates a gap between semantic descriptions and functional understanding, leaving the computational roles of most features unexplored. To bridge this gap, we introduce a complementary, bottom- up perspective that asks “What does this feature do?” We propose out-of-context interpretation, a purely weight-based methodology that analyzes features through their direct in- teractions with downstream model components, requiring no activation data. By computing the products of feature vectors with the model’s weight matrices, we can directly assess a feature’s potential to influence output predictions and participate in computational circuits. Our contributions are threefold. First, we introduce a novel paradigm for SAE interpretability by developing the first principled framework for analyzing features from a purely weight-based perspective. Second, we identify and quantify computational roles that remain hidden to activation-based methods: semantic features exhibit a U-shaped distribution, concentrating in the early and late layers, while attention- specialized features display an inverted-U distribution, peak- ing in the mid-layers. Third, we validate and visualize these findings across 100 SAEs from three models, demonstrat- ing that weight-based analysis uncovers both architectural invariants and architecture-specific computational strategies. To facilitate future research, we will publicly release the code required to reproduce all experiments. 2. 
Related Work As discussed, the predominant approach to SAE inter- pretability focuses on activation-based analysis, in which textual examples that activate a feature are presented to LLMs to generate semantic explanations (Bricken et al., 2023; Gao et al., 2024; Paulo et al., 2025). This paradigm has spurred a rich ecosystem, including online tools for in- teractive inspection (Lin, 2023), and has been extended to downstream applications like classification (Gallifant et al., 2025) and specialized training for rare concepts (Muhamed et al., 2025). However, the quality of these automated expla- nations remains a known challenge: benchmarking efforts have revealed difficulties in effective steering (Wu et al., 2025), and providing more context can adversely lead to vaguer explanations (Juang et al., 2024). Meanwhile, previous studies have only preliminarily con- sidered the computational roles. As a proof of concept, Templeton et al. (2024) explored the idea of features as com- putational intermediates, though their analysis remained in- context by examining feature attributions and ablations on specific inputs. More related to our methodology, Gur-Arieh et al. (2025) have conducted “output-centric” analyses to enhance feature explanation within the framework of LLM- based generation. However, they focused only on the interac- tion between the feature decoder and the unembedding ma- trix (W dec W U ), as they reported lower performance scores for the embedding matrix even in models with tied weights. Similarly, Paulo et al. (2025) proposed intervention-based scoring, which touched upon a computational direction but still served to refine correlational explanations. The concept of using next-token predictions for interpretabil- ity was also explored for vanilla LLM neurons by Bills et al. (2023), who found it generally fails to meet the baseline. Joseph Bloom (2024) characterizes SAE features by analyz- ing the distribution of logit weights, confirming that many features can be interpreted as promoting or suppressing specific tokens. We build on these observations by intro- ducing a multi-metric framework that distinguishes between semantic and computational features, revealing underlying architectural patterns and functional trade-offs. Our analyses are conducted on publicly available models. We utilize SAEs of 16k width trained on the residual stream at the end of each layer from the Gemma Scope project (Lieberum et al., 2024) for Gemma-2 models (Team et al., 2024) and SAEs of 32k width from Llama Scope (He et al., 2024) for Llama-3.1 (Dubey et al., 2024). 3. Features as Output Predictors Activation-based studies suggest that many SAE features correspond to human-interpretable semantic concepts. 
We hypothesize that if a feature is truly semantic in a functional sense, this property must be encoded in the SAE's weights.

[Figure 2: A uniform sample of eight features from Gemma-2-9B layer 27 that met all 3 thresholds, along with their top 10 tokens. "L" denotes Levenshtein similarity, "C" denotes cosine similarity, and "E" denotes top-100 entropy; the boxed scores correspond to the range of displayed tokens. The sampled features promote token families such as freezer/cold/refrigerated, trust/trustworthiness, positive/favorable, your/you, perfection/perfect, association/connotations, point/location, and "ex"/"Ex" variants.]

3.1. Objective

This leads to our first question: Can we identify a population of semantic features that causally predict output tokens in an out-of-context setting? To answer this, we first formalize what a feature is in terms of architectural components. An SAE feature $i$ consists of two components: an encoder vector $W_{enc}^{(\cdot,i)} \in \mathbb{R}^{d_{model}}$ and a decoder vector $W_{dec}^{(i)} \in \mathbb{R}^{d_{model}}$. When a feature is active, its decoder vector remains a linear component of the residual stream up to the unembedding layer $W_U \in \mathbb{R}^{d_{model} \times d_{vocab}}$ and the preceding normalization. This allows its functional effect to be measured directly using a technique known as the logit lens (nostalgebraist, 2020; Belrose et al., 2025), yielding a logit vector

$l_D^i = \mathrm{FinalNorm}(W_{dec}^{(i)}) W_U \in \mathbb{R}^{d_{vocab}}$,   (3)

which reveals the tokens promoted by the feature by examining its highest-scoring tokens. To measure what constitutes a semantic feature, we require metrics that quantify $l_D^i$.

3.2. Method

[Figure 3: Main results for Experiment 1 on Gemma-2-2B and 9B (pass ratio by metric combination versus layer, for the Gemma-Scope-2B/9B-Pt-Res-Canonical SAEs): semantic features display a U-shaped distribution, with average joint pass rates of 24.01% and 25.62%, respectively. Ablation using subsets of the 3 metrics leads to a stepwise decrease in pass rates across layers, demonstrating their complementary effectiveness.]

We designed 23 candidate metrics (Section A) and analyzed the pairwise correlations among them (Figure 8) in order to select a comprehensive set. Ultimately, we chose the following 3 metrics, which demonstrate only moderate correlation
(mean absolute Spearman across layers: 0.53 < |ρ| < 0.69 for Gemma-2-9B; see Figure 9 for an example of ρ in layer 27), thereby providing complementary perspectives on a feature's properties. A feature is considered semantic if it passes all 3 metrics:

Levenshtein similarity. Semantic features should have a low average Levenshtein distance (Levenshtein, 1965) among the top-10 predicted tokens, indicating that these tokens are mostly lexical variants of the same underlying concept (e.g., "http", "HTTP", "https", with or without a leading space).

Cosine similarity. Semantic features should have a high cosine similarity among the embeddings of the top-10 predicted tokens, indicating that the model represents them as closely related. While the Levenshtein similarity reflects what we expect, cosine similarity captures the model's perspective and complements the former.

Top-100 Entropy. Semantic features should have low entropy over logits, resulting in a "spiky" distribution where the feature's influence is decisive rather than diffuse, often affecting only a single token.

After selecting the metrics, we need to determine the thresholds. In practice, the distributions of metric scores across the feature population are continuous and do not display a natural cutoff point (see Figure 10). Therefore, we employ a percentile-based approach to establish a consistent and relative standard across each model. To achieve this, we calibrate our metric thresholds on a single representative layer. We choose the layer at 2/3 depth for this purpose (e.g., layer 27 in Gemma-2-9B) because it is sufficient to capture well-formed predictive features, yet not so late as to risk becoming overly reflective of the final layers. We set the cutoff for all three metrics at their respective 50th percentile scores at this depth (Table 3), then conducted manual inspections of hundreds of sampled features by applying these thresholds to all other layers of the model. These layers effectively serve as our "validation set," ensuring that the resulting classification aligns with human judgment. See Figure 2 for an example of eight passing features from a uniform sample.

Importantly, this approach enables us to compare the proportion of features that meet this semantic standard as a function of network depth. While adjusting the percentile cutoff would alter the absolute pass rates, we have confirmed that our qualitative findings remain robust to reasonable variations in this parameter, and the overall shape of the resulting distributions remains stable.

3.3. Results

As shown in Figure 3, applying our multi-metric filter to the Gemma-2-2B and 9B models provides an affirmative answer: a substantial number of causally semantic features exist. For instance, in Gemma-2-9B, 44.15% of the features in layer 5 pass the joint threshold. Across both model scales, we find that about 1/4 of SAE features meet our criteria for being semantically coherent. More revealing is the distribution of these features across the model's depth: the pass rate for semantic features follows a U-shaped distribution. The proportion of semantic features is high in the initial layers, decreases substantially through the middle layers, and rises again toward the final layers of the network.

This U-shaped distribution reveals a depth-dependent functional specialization.
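Before examining this depth profile in more detail, here is a hedged sketch of how the classification above can be computed from weights alone: the logit-lens vector of Equation (3) followed by the three metrics. The helper names, the python-Levenshtein dependency, and the assumption that the model's final normalization and unembedding matrix are available as `final_norm` and `W_U` are ours, not the paper's released code.

```python
import torch
from itertools import combinations
import Levenshtein  # python-Levenshtein package; a tooling assumption

def logit_lens(w_dec_i, final_norm, W_U):
    """Eq. (3): l_D^i = FinalNorm(W_dec^(i)) W_U, computed without any activation data."""
    return final_norm(w_dec_i) @ W_U                       # (d_vocab,)

def levenshtein_similarity(tokens):
    """Mean pairwise normalized Levenshtein similarity of the top-k token strings (cf. Section A)."""
    sims = [1 - Levenshtein.distance(a, b) / max(len(a), len(b))
            for a, b in combinations(tokens, 2)]
    return sum(sims) / len(sims)

def embedding_cosine_similarity(token_embs):
    """Mean pairwise cosine similarity between the top-k token embeddings, shape (k, d_model)."""
    e = torch.nn.functional.normalize(token_embs, dim=-1)
    sim = e @ e.T
    k = e.shape[0]
    return ((sim.sum() - k) / (k * (k - 1))).item()        # mean of the off-diagonal entries

def topk_entropy(logits, k=100, tau=1.0):
    """Shannon entropy of the softmax over the top-k logits; lower means a spikier feature."""
    p = torch.softmax(torch.topk(logits, k).values / tau, dim=-1)
    return -(p * p.log()).sum().item()
```

A feature is then labeled semantic if it clears the 50th-percentile cutoffs (calibrated at the 2/3-depth layer) on all three scores.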
Features in early layers, which are trained to reconstruct activations close to the initial token embeddings, naturally align with input vocabulary concepts. Conversely, features in the final layers become increasingly specialized for predicting output tokens as the network pre- pares its final logit distribution. The significant dip in se- mantic prevalence through the mid-layers suggests a shift towards more abstract, compositional processing, where features are less directly tied to specific tokens. This find- ing is consistent with broader observations of hierarchical representation in deep networks, where early and late lay- ers often handle more concrete input/output representations 051015202530 Layer 0 20 40 60 80 Pass Ratio (%) Pass Ratios by Metric Combination Llama-3.1-8B / Llama-Scope-Lxr-8X (Decoder-Unembedding) 1/3: Levenshtein 1/3: Cosine 1/3: Entropy 2/3: L & C 2/3: L & E 2/3: C & E 3/3: All 051015202530 Layer 0 20 40 60 80 100 Pass Ratio (%) Pass Ratios by Metric Combination Llama-3.1-8B / Llama-Scope-Lxr-8X (Encoder-Embedding) 1/3: Levenshtein 1/3: Cosine 1/3: Entropy 2/3: L & C 2/3: L & E 2/3: C & E 3/3: All Figure 4. Main results for Experiment 1 on Llama-3.1-8B: Se- mantic features display monotonic distributions, with an average joint pass rate of 23.32% and 11.82%, respectively. Thresholds are obtained from layer 20 for the decoder–unembedding pair and from layer 5 for the encoder–embedding pair. while middle layers perform more abstract transformations (e.g., Vig & Belinkov, 2019; Zou et al., 2025). Our analysis of Llama-3.1-8B, however, revealed a different pattern. Despite also having an approximate average pass ratio of 1/4, the analysis of the decoder–unembedding attri- bution in Equation (3) revealed a clear monotonic increase with network depth. The rate starts near zero in the initial layers and climbs steadily to over 60% in the final layer, as illustrated in Figure 4 (top). This unexpected result raised a new question: if early-layer features in Llama have almost no semantic alignment with the output vocabulary, what is their function? The absence of the U-shape’s left arm suggested that an output-centric perspective was only capturing part of the story. This led us to investigate the interaction that is often overlooked in SAE analysis: the interpretation of features using the embedding layerW E ∈R d vocab ×d model . When we performed the symmetric analysis using the encoder vectorsW (·,i) enc ∈ R d model to produce an “input logit” l i E = W (·,i) enc W ⊤ E ∈R d vocab ,(4) we found the missing half. The pass rate for input-space alignment was exceptionally high in the initial layers but declined rapidly, becoming negligible after roughly the first third of the model, as shown in Figure 4 (bottom). 4 A Weight-Based OOC Explanation of SAE Features Taken together, these diverging trajectories reveal an archi- tectural consequence: untying the embedding and unembed- ding matrices enables functional specialization across depth. In Gemma, where the two matrices are tied (W E = W ⊤ U ), a feature’s relationship with tokens is constrained to be con- sistent across both input and output spaces, explaining the U-shaped distribution and the similar results obtained from encoder–embedding analysis (Figure 11). In Llama, early- layer features exhibit strong encoder-embedding alignment, specializing for input token recognition, while late-layer fea- tures show decoder-unembedding alignment, specializing for output token prediction. 
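The encoder-side attribution used in this comparison is the one-line counterpart of the logit lens above; a minimal sketch of Equation (4), assuming the embedding matrix has shape (d_vocab, d_model):

```python
def input_logit(w_enc_i, W_E):
    """Eq. (4): l_E^i = W_enc^(.,i) W_E^T -- which input tokens most strongly drive feature i."""
    return w_enc_i @ W_E.T        # (d_vocab,)
```

For Gemma, where $W_E = W_U^\top$, Equations (3) and (4) probe essentially the same token geometry; for Llama's untied matrices they diverge, which is what produces the bifurcated pattern in Figure 4.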
This bifurcation demonstrates that the model independently optimizes its representations for encoding versus prediction, which underscores a critical point: even an output-centric analysis is insufficient. As a final note, we discuss why the combination of encoder– unembedding or decoder–embedding is not considered. We analyze encoder-embedding and decoder-unembedding pair- ings because they respect the computational structure: em- beddings are received by the encoder alongside the residual stream, while outputs from the decoder proceed toward the unembedding. Cross-pairings violate this causal ordering. While they may produce similar results due to encoder– decoder similarity, they are logically incongruent. 4. Features as Attention Participants Our initial experiment demonstrated that a significant pro- portion of SAE features are semantically coherent. However, this also indicates that approximately 3/4 of the features do not have a semantically clear influence on the output log- its. This presents a challenge to mainstream interpretation efforts that seek a semantic label for every feature and sug- gests that more features may serve a different purpose. 4.1. Objective If these features are not directly shaping the output, what do they do? Given that SAEs are trained to reconstruct acti- vations that participate in the forward pass, it is natural to expect that their features would inherit these computational roles. The primary mechanism for information routing and composition within a transformer layer is the attention head (Vaswani et al., 2017). We therefore hypothesize that many of these features are specialized for participating in attention mechanisms. In this experiment, we test this by measuring the alignment of SAE features from a given layer with the query and key (QK) circuits of the subsequent layer’s atten- tion heads. 4.2. Method To assess a feature’s participation in QK circuits, we analyze the interaction between its decoder vector from layerL, W (i) dec,L , and the QK weight matrices of the subsequent layer, L + 1. For each attention headhin layerL + 1, we first apply the input normalization to the feature vectors, then project them to obtain their corresponding query and key representations: ̃ W (i) dec,L = AttnNorm L+1 (W (i) dec,L ) ∈R d model (5) q (h) i = ̃ W (i) dec,L W (h) Q,L+1 ∈R d head (6) k (h) i = ̃ W (i) dec,L W (h) K,L+1 ∈R d head (7) The pre-softmax attention score between featureiacting as a query and featurejacting as a key is their scaled dot product: s (h) i,j = q (h) i · k (h) j √ d head ∈R.(8) By computing the matrixS (h) = (s (h) i,j ) ∈R d sae ×d sae for each headh, we can exhaustively approximate the attention between all possible decompositions of actual activations in the forward pass, providing a powerful analytical tool. For the mathematical derivation, see Section B. While the pre-softmax scores provide direct evidence of raw participation strength across model layers, it is challenging to quantitatively define what qualifies as a computational feature using these scores in an out-of-context manner. Addi- tionally, since attention is inherently competitive, applying a pre-softmax threshold does not necessarily ensure influence across different queries. To address this, we further adopt a post-softmax perspective that simulates the competitive dynamics of attention. 
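A minimal sketch of Equations (5)–(8) for a single head, assuming the layer-(L+1) input normalization is available as a callable `attn_norm` and the per-head projections have shape (d_model, d_head); names and shapes are our assumptions, not the authors' code.

```python
import math

def qk_score_matrix(W_dec_L, attn_norm, W_Q_h, W_K_h):
    """
    Eqs. (5)-(8): project every feature's decoder vector through one head's
    query/key maps in layer L+1 and form the full pre-softmax score matrix.
    W_dec_L: (d_sae, d_model); W_Q_h, W_K_h: (d_model, d_head).
    """
    w_tilde = attn_norm(W_dec_L)          # Eq. (5): layer-(L+1) input normalization
    q = w_tilde @ W_Q_h                   # Eq. (6): (d_sae, d_head)
    k = w_tilde @ W_K_h                   # Eq. (7): (d_sae, d_head)
    d_head = q.shape[-1]
    return q @ k.T / math.sqrt(d_head)    # Eq. (8): S^(h), shape (d_sae, d_sae)
```

Each row of S^(h) is the pre-softmax score vector used in the post-softmax analysis that follows.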
For a given featureiin layerLand attention headhin layerL+1, we construct a vector of pre-softmax attention scores by evaluating its interaction with all featuresj ∈1,...,d sae in the same SAE: s (h) i = h s (h) i,1 ,s (h) i,2 ,...,s (h) i,d sae i ∈R d sae ,(9) which is thei-th row of the complete attention score matrix S (h) . We then apply the softmax function to obtain the simulated post-softmax attention distribution: a (h) i = softmax(s (h) i ) ∈R d sae ,(10) where thej-th componenta (h) i,j represents the fraction of attention that featurei(as a query) would allocate to feature j(as a key) in headh. This gives us the complete post- softmax attention matrixA (h) ∈R d sae ×d sae for headh, by applying the softmax function to each row of S (h) . We now define two complementary post-softmax criteria that consider both the query and key aspects to identify computationally specialized features. LetA (h) i,· , A (h) ·,j ∈ R d sae denote thei-th row and thej-th column, respectively, of the post-softmax attention matrix A (h) . 5 A Weight-Based OOC Explanation of SAE Features 012345678910111213141516171819202122232425262728293031323334353637383940 SAE Layer 1500 1000 500 0 500 1000 1500 Cumulative Mean Gemma-2-9B / Gemma-Scope-9B-Pt-Res-Canonical Magnitude Evolution (Pre-Softmax, Sum of Per-Query Means) 15 7 0 -1 -8 -15 Rank (+pos / -neg) 0123456789101112131415161718192021222324252627282930 SAE Layer 200000 150000 100000 50000 0 50000 Cumulative Mean Llama-3.1-8B / Llama-Scope-Lxr-8X Magnitude Evolution (Pre-Softmax, Sum of Per-Query Means) 31 15 0 -1 -16 -31 Rank (+pos / -neg) Figure 5. Main results for Experiment 2 on Gemma-2-9B and Llama-3.1-8B: Magnitude of attention participation across layers, with the total height at each layer representing the aggregate pre- softmax score. These scores are calculated as the mean across all keys for each query feature, then summed for each head. The stack- ing direction of the head values is determined based on whether they are positive or negative. Each color represents a rank of mag- nitude, rather than a specific head index, across all layers. Query-based specialization. A featureiis classified as a query specialist if, for at least one headh, the sum of its top-k attention allocations exceeds threshold τ Q : P j∈I k (A (h) i,· ) a (h) i,j > τ Q ,(11) whereI k (v) =j 1 ,...,j k denotes the index set of thek largest components of vectorv. This identifies features that concentrate their attention on a small set of targets, acting as specialized “lookers” for certain keys. Key-based specialization. A featurejis classified as a key hub if, for at least one headh, the mean attention it receives from its top-kattending queries exceeds threshold τ K . Then: 1 k P i∈I k (A (h) ·,j ) a (h) i,j > τ K .(12) This identifies features that are consistently emphasized by multiple queries, serving as information hubs. To analyze the heads within a layer collectively, we futher measure the number of features where the mean or median score across all heads in a layer exceeds the threshold, pro- viding a broader view of the overall trend within the SAE. 0510152025303540 SAE Layer 0 200 400 600 800 Mean/Median Count Head Mean Head Median Head Mean (excl. self-attn) Head Median (excl. self-attn) Any Head (Union) Any Head (excl. 
self-attn) 0 2000 4000 6000 8000 10000 12000 14000 16000 Any Head Count Gemma-2-9B / Gemma-Scope-9B-Pt-Res-Canonical Feature Count (Post-Softmax, Query, Sum Weight, threshold=0.9500) 0510152025303540 SAE Layer 0 20 40 60 80 100 120 140 Mean/Median Count Head Mean Head Median Head Mean (excl. self-attn) Head Median (excl. self-attn) Any Head (Union) Any Head (excl. self-attn) 0 500 1000 1500 2000 Any Head Count Gemma-2-9B / Gemma-Scope-9B-Pt-Res-Canonical Feature Count (Post-Softmax, Key, Mean Weight, threshold=0.2500) Figure 6. Feature count trends across Gemma-2-9B layers showing features with top-10 query-based post-softmax attention weight sum above 0.95 (top) and key-based attention weight mean above 0.25 (bottom). Note that there are two y-axes of different mag- nitudes: the left y-axis is for the mean and median counts across heads, while the right y-axis is for the union across all heads (“any head”). Solid lines include weights of self-attention; dashed lines exclude self-attention. 4.3. Results Pre-softmax. To visualize the raw computational potential across layers, we first analyze the evolution of pre-softmax attention scores. Positive pre-softmax values signify a ten- dency for constructive alignment within a head’s QK circuit, while negative values suggest anti-alignment. In Gemma-2-9B (Figure 5 top) and Gemma-2-2B (Fig- ure 12), we observe a distinct low–high–low pattern in the aggregate attention potential. The scores are negative in early and late layers and become strongly positive in the middle layers, mirroring the inverse of the semantic U-shape from Experiment 1. This suggests that middle-layer features are indeed configured for constructive, high-potential inter- actions within QK circuits. At first glance, Llama-3.1-8B (Figure 5 bottom) appears to operate “wrongly”. Its aggregate pre-softmax scores are almost uniformly negative across all layers. However, this is not necessarily a contradiction. Consider that the soft- max function is shift-invariant; only the relative differences between scores matter for the final attention distribution. 6 A Weight-Based OOC Explanation of SAE Features In line with this, the absolute magnitude of Llama’s scores follows a similar low–high–medium pattern, peaking in the middle layers where computational activity is expected to be highest. Our out-of-context analysis thus reveals that both architectures achieve similar functional specialization by depth, albeit through geometrically distinct approaches. Post-softmax. We setk = 10,τ Q = 0.95, andτ K = 0.25 (see Section C for a detailed rationale). Figure 6 presents the distribution of computationally specialized features across the layers of Gemma-2-9B. The primary metric (triangles, bold red line) shows the count of features that meet the criteria in at least one attention head, as previously defined, which we term “any head.” Notably, this metric also exhibits a rough inverted U-shape in all cases, corroborating our major findings. For query specialists (top plot), the any-head pass rate is remarkably high, exceeding 90% in some layers. There- fore, we also tried excluding “self-attention” cases (i = j excluded, dashed lines), which reasonably reduces the pass rate, though a significant portion remains. We observe that the contribution of self-attention diminishes in later layers, suggesting a shift from features processing their own con- cepts to integrating information from others. Conversely, the “mean across heads” count is near zero for all layers. 
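A sketch of how the post-softmax criteria in Equations (9)–(12) can be evaluated with the thresholds used above (k = 10, τ_Q = 0.95, τ_K = 0.25); the `exclude_self` flag mirrors the "excl. self-attn" variant, and all names are assumptions.

```python
import torch

def classify_head(S, k=10, tau_q=0.95, tau_k=0.25, exclude_self=False):
    """
    S: (d_sae, d_sae) pre-softmax score matrix for one head (Eq. (8)).
    Returns boolean masks for query specialists (Eq. (11)) and key hubs (Eq. (12)).
    """
    if exclude_self:
        S = S.clone()
        S.fill_diagonal_(float("-inf"))                 # drop i == j pairs before the softmax
    A = torch.softmax(S, dim=-1)                        # Eqs. (9)-(10), row-wise

    topk_by_query = torch.topk(A, k, dim=-1).values     # (d_sae, k)
    query_specialist = topk_by_query.sum(-1) > tau_q    # Eq. (11)

    topk_by_key = torch.topk(A, k, dim=0).values        # (k, d_sae)
    key_hub = topk_by_key.mean(0) > tau_k                # Eq. (12)
    return query_specialist, key_hub
```

The "any head" counts in Figure 6 then correspond to taking the union of these masks over all heads in layer L+1.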
This difference confirms that the computational role of fea- tures is partitioned across different heads, with each feature acting as a specialist in only a few specific contexts. The distribution of key hubs (bottom plot) presents a dif- ferent and more structured picture. The overall pass rate is much lower, peaking at around 15% in the union metric, suggesting that serving as a widely-used information hub is a rarer function. More interestingly, the key hub population displays a clear oscillatory pattern, with pronounced peaks in layers 3–10 and again in layers 26–32. This rhythmic emergence of information hubs suggests that the model’s architecture has certain “consolidation” layers where key concepts are made available for processing by subsequent layers in an alternating fashion. Similar patterns, albeit over fewer layers, were observed in Gemma-2-2B (Figure 13) and Llama-3.1-8B (Figure 14). Overall, our analysis provides a clear answer to the role beyond semanticity: SAE features can form a sophisticated computational architecture, executing the distributed com- putations that precede the final semantic output. 5. Semantic vs. Computational Specialization The previous experiments established two distinct charac- terizations of SAE features: semantic features that directly predict output tokens and computational features that partic- ipate in attention circuits. However, these analyses treated the two roles as independent properties. This raises a natural question: are these roles mutually exclusive, or can features exhibit both semantic and computational characteristics si- multaneously? 5.1. Objective More specifically, we ask: do semantic features exhibit systematically different attention participation patterns com- pared to non-semantic features? If the two roles are inversely related, we would expect se- mantic features to show reduced engagement with QK cir- cuits. If they are orthogonal, both populations should display similar attention distributions. A third possibility is that they are positively correlated in certain layers, suggesting that some features simultaneously serve both interpretive and computational functions. 5.2. Method To investigate the relationship between semantic properties and attention participation, we analyze the distribution of attention weights across the two feature populations defined in Experiment 1: semantic features (those passing all three metric thresholds) and non-semantic features (those failing at least one threshold). For each featureiin a given SAE at layerL, we compute its aggregate attention weight as the sum of its top-kpre- softmax scores across all attention heads in layerL + 1. Specifically, for query-based analysis: w Q i = n heads X h=1 X j∈I k (s (h) i ) s (h) i,j ,(13) wheres (h) i,j is the pre-softmax attention score from Equa- tion (8), andI k (·)selects the indices of theklargest com- ponents. The key-based aggregate weightw K j is computed analogously by summing over the top-kqueries attending to featurejas a key. We usek = 10as in Experiment 2. This produces a scalar weight for each feature, quantifying its highest degree of participation in the attention mechanisms of the subsequent layer. We compare the distributions of these weights between semantic and non-semantic feature populations using two complementary visualizations. For per-layer analysis, we construct probability density estimates and, as simplified representations, the corresponding boxplots. 
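The per-feature scalar in Equation (13) can be read directly off the stacked per-head score matrices from Equation (8); a short sketch with assumed shapes:

```python
import torch

def aggregate_query_weight(S_heads, k=10):
    """
    Eq. (13): w_i^Q = sum over heads of the sum of feature i's top-k pre-softmax scores.
    S_heads: (n_heads, d_sae, d_sae) stack of per-head score matrices.
    """
    topk = torch.topk(S_heads, k, dim=-1).values   # (n_heads, d_sae, k)
    return topk.sum(-1).sum(0)                     # (d_sae,)
```

The key-based analogue w_j^K sums the top-k entries of each column instead, and these scalars are what the density plots and boxplots compare across the semantic and non-semantic populations.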
To provide a reference point, we include a baseline consisting of a ran- dom sample from the full feature population, matched in size to the semantic feature set. For cross-layer analysis, we compute three aggregate statis- tics at each SAE layer: (1) the mean and (2) median attention weight for each population, and (3) the percentage of fea- tures exceeding the 75th percentileP 75 of all features across 7 A Weight-Based OOC Explanation of SAE Features 25050075010001250150017502000 Attention Weight 0.00000 0.00025 0.00050 0.00075 0.00100 0.00125 0.00150 0.00175 0.00200 Probability Density = 745.073 = 973.258 = 1018.860 Gemma-2-9B / Gemma-Scope-9B-Pt-Res-Canonical / Layer 20 SAE Layer 20: Sum of Per-Query Sum Semantic (n=2,056) Population Mean (n=2,056) Non-Semantic (n=14,328) 4006008001000 Attention Weight 0.000 0.001 0.002 0.003 0.004 Probability Density = 825.363 = 685.139 = 606.115 Llama-3.1-8B / Llama-Scope-Lxr-8X (Encoder-Embedding) / Layer 4 SAE Layer 4: Sum of Per-Query Sum Semantic (n=11,833) Population Mean (n=11,833) Non-Semantic (n=20,935) Figure 7. Probability densities of attention weights in Gemma-2- 9B layer 20 and Llama-3.1-8B layer 4. all layers. This pooled threshold provides a architecture- wide standard for identifying features with high attention participation, allowing for the comparison of their long yet important tails. 5.3. Results Per-layer distributions. As shown in Figure 7 (top) and Figure 16, the semantic and non-semantic populations ex- hibit substantial overlap, yet their distributions are measur- ably distinct. The distribution of semantic feature atten- tion weights is shifted toward lower values compared to non-semantic features, with the population mean similar to that of the non-semantic features. This validates that our semantic classification captures a functionally meaningful partition: semantic features participate less intensively in attention circuits than their non-semantic counterparts. Cross-layer trends. In both Gemma-2-9B (Figures 24 and 28) and Gemma-2-2B (Figures 23 and 27), non- semantic features maintain a modest lead over semantic features in both mean and median attention weights across most layers. A clearer separation emerges when examin- ing the upper tail of the attention weight distribution. Fig- ures 31 and 32 show the percentage of features exceeding the globalP 75 threshold. Here, semantic features demonstrate a markedly lower representation in the high-attention regime. This suggests that, despite their overlap, non-semantic fea- tures are more prevalent among the most computationally influential features. Llama’s dual-phase architecture. The untied architecture of Llama-3.1-8B necessitates a partitioned analysis. As shown in Figure 4, semantic specialization diverges near layer 13, where semantic pass rates for both phases drop below 3%. We therefore adopt this point as the boundary. For the output-semantic phase (Figures 21 and 33), the pat- tern mirrors Gemma. However, the input-semantic phase, as shown in Figure 7 (bottom), reveals the opposite trend: embedding-aligned semantic features obtain higher atten- tion weights than non-semantic features. This shows that input-semantic features, which focus on recognizing and encoding incoming tokens, must aggregate and transform their information for downstream layers. Therefore, high attention participation is functionally necessary for them. Our findings reveal that the computational roles of semantic and non-semantic features are different. 
However, their re- lationship is not universally inverse; instead, it varies with functional context. A feature’s role is shaped by its depth in the model, the embedding–unembedding structure, and whether it is oriented toward interpreting inputs or predict- ing outputs. 6. Conclusion We introduced a weight-based interpretation framework that reveals the computational roles SAE features inherit from their training objective, independent of activation patterns. We demonstrated that approximately 1/4 of the features directly predict output tokens with semantic coherence, ex- hibiting depth-dependent distributions shaped by architec- tural constraints: U-shaped in tied-weight designs and bi- furcated in untied architectures. SAE features also exhibit systematic specialization for attention mechanisms, with inverted-U distributions peaking in mid-layers where ab- stract computation occurs. We found that semantic inter- pretability and computational participation are inversely related in output-oriented contexts; however, this relation- ship is reversed for input-encoding features in architectures with untied embeddings. These findings establish that activation-based methods cap- ture only half of feature interpretability: understanding what contexts activate a feature must be complemented by under- standing what that feature does within the model’s compu- tational graph. By analyzing features through their weight interactions, we provide the mechanistic foundation neces- sary for causally grounded interventions and for developing a complete account of how sparse autoencoders decompose language model computation. 8 A Weight-Based OOC Explanation of SAE Features Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. We believe that advancing from cor- relational observations to mechanistic interpretability is a foundational step toward the development of more robust, predictable, and safely-aligned AI systems. While the po- tential societal consequences of this line of research are sig- nificant, we do not feel there are consequences that must be specifically highlighted here beyond those well-established for the field. References Belrose, N., Ostrovsky, I., McKinney, L., Furman, Z., Smith, L., Halawi, D., Biderman, S., and Steinhardt, J. Eliciting Latent Predictions from Transformers with the Tuned Lens, November 2025. Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models.https://openaipublic.blob.core .windows.net/neuron-explainer/paper/i ndex.html, 2023. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N. L., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decompos- ing language models with dictionary learning, October 2023. Bussmann, B., Nabeshima, N., Karvonen, A., and Nanda, N. Learning Multi-Level Features with Matryoshka Sparse Autoencoders. In Forty-Second International Conference on Machine Learning, June 2025. Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Golechha, S., and Bloom, J. I. A is for Absorption: Studying Feature Splitting and Absorption in Sparse Au- toencoders. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, October 2025. 
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravanku- mar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Al- lonsius, D., Song, D., Pintz, D., Livshits, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Zhang, F., Syn- naeve, G., Lee, G., Anderson, G. L., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Korevaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I., Misra, I., Evtimov, I., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., El-Arini, K., Iyer, K., Malik, K., Chiu, K., Bhalla, K., Rantala-Yeary, L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher, L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh, M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Duchenne, O.,C ̧elebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal, P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer, R., Cabral, R. S., Stojnic, R., Raileanu, R., Girdhar, R., Patel, R., Sauvestre, R., Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabas- appa, S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S., Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan, S., Borodinsky, S., Her- man, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speckbacher, T., Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet, V., Do, V., Vogeti, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet, X., Wang, X., Tan, X. E., Xie, X., Jia, X., Wang, X., Goldschlag, Y., Gaur, Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. D., Yan, Z., Chen, Z., Papakipos, Z., Singh, A., Grattafiori, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria, A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Vaughan, A., Baevski, A., Feinstein, A., Kallet, A., Sangani, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poulton, A., Ryan, A., Ramchandani, A., Franco, A., Saraf, A., Chowdhury, A., Gabriel, A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B., Loyd, B., Paola, B. 
D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B., Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Wang, C., Kim, C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Civin, D., Beaty, D., Kreymer, D., Li, D., Wyatt, D., Adkins, D., Xu, D., Testuggine, D., David, D., Parikh, D., Liskovich, D., Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Montgomery, E., Presani, E., Hahn, E., Wood, E., 9 A Weight-Based OOC Explanation of SAE Features Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F., Tian, F., Ozgenel, F., Caggioni, F., Guzm ́ an, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G., Badeer, G., Swee, G., Halpern, G., Thattai, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren, H., Goldman, H., Damlaj, I., Molybog, I., Tufanov, I., Veliche, I.-E., Gat, I., Weissman, J., Geboski, J., Kohli, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J., Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard, J., McPhie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. H., Saxena, K., Prasad, K., Khandelwal, K., Zand, K., Matosich, K., Veeraragha- van, K., Michelena, K., Li, K., Huang, K., Chawla, K., Lakhotia, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L., Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Tsimpoukelli, M., Mankus, M., Hasson, M., Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Seltzer, M. L., Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang, M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White, N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Laptev, N. P., Dong, N., Zhang, N., Cheng, N., Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner, P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao, R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Li, R., Hogan, R., Battey, R., Wang, R., Maheswari, R., Howes, R., Rinott, R., Bondu, S. J., Datta, S., Chugh, S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Verma, S., Yamamoto, S., Ramaswamy, S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Shankar, S., Zhang, S., Zhang, S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satter- field, S., Govindaprasad, S., Gupta, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S., Goldman, S., Remez, T., Glaser, T., Best, T., Kohler, T., Robinson, T., Li, T., Zhang, T., Matthews, T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla, V., Albiero, V., Ionescu, V., Poenaru, V., Mi- hailescu, V. T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz, W., Constable, W., Tang, X., Wang, X., Wu, X., Wang, X., Xia, X., Wu, X., Gao, X., Chen, Y., Hu, Y., Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Hao, Y., Qian, Y., He, Y., Rait, Z., DeVito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., and Zhao, Z. The Llama 3 Herd of Models, August 2024. 
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy Models of Superposition, September 2022. Galichin, A., Dontsov, A., Druzhinina, P., Razzhigaev, A., Rogov, O. Y., Tutubalina, E., and Oseledets, I. I Have Covered All the Bases Here: Interpreting Reasoning Fea- tures in Large Language Models via Sparse Autoencoders, March 2025. Gallifant, J., Chen, S., Sasse, K., Aerts, H., Hartvigsen, T., and Bitterman, D. Sparse Autoencoder Features for Classifications and Transferability. In Christodoulopou- los, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Meth- ods in Natural Language Processing, p. 29927–29951, Suzhou, China, November 2025. Association for Com- putational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1521. Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, October 2024. Gur-Arieh, Y., Mayan, R., Agassy, C., Geiger, A., and Geva, M. Enhancing Automated Interpretability with Output- Centric Feature Descriptions. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), p. 5757– 5778, Vienna, Austria, July 2025. Association for Com- putational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.288. He, Z., Shu, W., Ge, X., Chen, L., Wang, J., Zhou, Y., Liu, F., Guo, Q., Huang, X., Wu, Z., Jiang, Y.-G., and Qiu, X. Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders, October 2024. Huben, R., Cunningham, H., Smith, L. R., Ewart, A., and Sharkey, L. Sparse Autoencoders Find Highly Inter- pretable Features in Language Models. In The Twelfth International Conference on Learning Representations, October 2023. Joseph Bloom, J. L. Understanding sae features with the logit lens.https://w.lesswrong.com/post s/qykrYY6rXXM7EEs8Q/understanding-sae -features-with-the-logit-lens, 2024. Juang, C., Paulo, G., Drori, J., and Belrose, N. Open Source Automated Interpretability for Sparse Autoencoder Fea- tures. https://blog.eleuther.ai/autointerp/, July 2024. Levenshtein, V. I. Binary codes capable of correcting dele- tions, insertions, and reversals. Soviet physics. Doklady, 10 A Weight-Based OOC Explanation of SAE Features 10:707–710, 1965. URLhttps://api.semantic scholar.org/CorpusID:60827152. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kram ́ ar, J., Dragan, A., Shah, R., and Nanda, N. Gemma Scope: Open Sparse Autoen- coders Everywhere All At Once on Gemma 2, August 2024. Lin, J. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023. URLhttps: //w.neuronpedia.org . Software available from neuronpedia.org. Ma, G., Pfrommer, S., and Sojoudi, S. Revising and Falsi- fying Sparse Autoencoder Feature Explanations. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, October 2025. Muhamed, A., Diab, M. T., and Smith, V. Decoding Dark Matter: Specialized Sparse Autoencoders for Interpret- ing Rare Concepts in Foundation Models. In Chiruzzo, L., Ritter, A., and Wang, L. 
(eds.), Findings of the As- sociation for Computational Linguistics: NAACL 2025, p. 1604–1635, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. ISBN 979-8- 89176-195-7. doi: 10.18653/v1/2025.findings-naacl.87. nostalgebraist. Interpreting GPT: The logit lens, August 2020. Paulo, G. S., Mallen, A. T., Juang, C., and Belrose, N. Au- tomatically Interpreting Millions of Features in Large Language Models. In Forty-Second International Confer- ence on Machine Learning, June 2025. Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram ́ e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsit- sulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J.-B., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Parrish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Paterson, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozi ́ nska, D., Her- bison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak- Pluci ́ nska, H., Batra, H., Dhand, H., Nardini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., Ji, J.-y., Mohamed, K., Badola, K., Black, K., Millican, K., McDonell, K., Nguyen, K., Sodhia, K., Greene, K., Sjoesund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., McNealus, L., Soares, L. B., Kilpatrick, L., Dixon, L., Martins, L., Reid, M., Singh, M., Iverson, M., G ̈ orner, M., Velloso, M., Wirth, M., Davidow, M., Miller, M., Rahtz, M., Watson, M., Risdal, M., Kazemi, M., Moynihan, M., Zhang, M., Kahng, M., Park, M., Rahman, M., Khatwani, M., Dao, N., Bardoliwalla, N., Devanathan, N., Dumai, N., Chauhan, N., Wahltinez, O., Botarda, P., Barnes, P., Barham, P., Michel, P., Jin, P., Georgiev, P., Culliton, P., Kuppala, P., Comanescu, R., Merhej, R., Jana, R., Rokni, R. A., Agarwal, R., Mullins, R., Saadat, S., Carthy, S. M., Cogan, S., Perrin, S., Arnold, S. M. R., Krause, S., Dai, S., Garg, S., Sheth, S., Ronstrom, S., Chan, S., Jordan, T., Yu, T., Eccles, T., Hennigan, T., Kocisky, T., Doshi, T., Jain, V., Yadav, V., Meshram, V., Dharmadhikari, V., Barkley, W., Wei, W., Ye, W., Han, W., Kwon, W., Xu, X., Shen, Z., Gong, Z., Wei, Z., Cotruta, V., Kirk, P., Rao, A., Giang, M., Peran, L., Warkentin, T., Collins, E., Bar- ral, J., Ghahramani, Z., Hadsell, R., Sculley, D., Banks, J., Dragan, A., Petrov, S., Vinyals, O., Dean, J., Hass- abis, D., Kavukcuoglu, K., Farabet, C., Buchatskaya, E., Borgeaud, S., Fiedel, N., Joulin, A., Kenealy, K., Dadashi, R., and Andreev, A. Gemma 2: Improving Open Lan- guage Models at a Practical Size, October 2024. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDi- armid, M., Tamkin, A., Durmus, E., Hume, T., Mosconi, F., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet, May 2024. 
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Vig, J. and Belinkov, Y. Analyzing the Structure of Attention in a Transformer Language Model. In Linzen, T., Chrupała, G., Belinkov, Y., and Hupkes, D. (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 63–76, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4808.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. In Forty-Second International Conference on Machine Learning, June 2025.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation Engineering: A Top-Down Approach to AI Transparency, March 2025.

A. Interpretability Metrics

We introduce a comprehensive suite of interpretability metrics designed to quantify different aspects of SAE feature behavior based on their logit attribution patterns.

A.1. Semantic Coherence Metrics

Cosine Similarity measures the mean pairwise cosine similarity between embeddings of the top-k tokens. Let T_k = {t_1, ..., t_k} denote the top-k tokens by attribution score, and e_i ∈ R^d their corresponding embeddings. We compute:

\mathrm{CosineSim}_k = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \frac{e_i \cdot e_j}{\|e_i\| \, \|e_j\|}    (14)

Higher values indicate that top tokens occupy similar regions in embedding space, suggesting semantic coherence.

Levenshtein Similarity quantifies string-level similarity between tokens using normalized edit distance. For each pair of tokens (t_i, t_j) in the top-k set:

\mathrm{LevenSim}_k = \frac{2}{k(k-1)} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \left( 1 - \frac{d_{\mathrm{Lev}}(t_i, t_j)}{\max(|t_i|, |t_j|)} \right)    (15)

where d_Lev is the Levenshtein distance and |t| denotes string length. This metric captures syntactic patterns such as shared morphological features.

String Overlap measures the fraction of unique normalized tokens in the top-k set. After removing non-word characters and lowercasing, we compute:

\mathrm{Overlap}_k = 1 - \frac{|\{\mathrm{normalize}(t) : t \in T_k\}|}{k}    (16)

Higher values indicate repeated tokens (possibly with different capitalization or punctuation).
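For concreteness, the sketch below is a minimal illustration of these three coherence metrics (ours, not the authors' released code). It assumes the top-k token strings and their embedding vectors have already been extracted; all function names are illustrative.

```python
# Sketch of the A.1 coherence metrics for one feature's top-k tokens.
# `tokens` is a list of the top-k token strings; `embs` is a (k, d) numpy array
# of their embedding (or unembedding) vectors.
import re
from itertools import combinations
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cosine_sim_k(embs: np.ndarray) -> float:
    # Mean pairwise cosine similarity over the top-k embeddings (Eq. 14).
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = [normed[i] @ normed[j] for i, j in combinations(range(len(embs)), 2)]
    return float(np.mean(sims))

def leven_sim_k(tokens: list) -> float:
    # Mean pairwise normalized string similarity (Eq. 15).
    sims = [1 - levenshtein(a, b) / max(len(a), len(b))
            for a, b in combinations(tokens, 2)]
    return float(np.mean(sims))

def overlap_k(tokens: list) -> float:
    # Fraction of repeated tokens after normalization (Eq. 16).
    normalized = {re.sub(r"\W+", "", t).lower() for t in tokens}
    return 1 - len(normalized) / len(tokens)
```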
A.2. Distribution Concentration Metrics

Entropy quantifies the concentration of attribution mass across the vocabulary. Given attribution scores s = (s_1, ..., s_V) for vocabulary size V, we compute the Shannon entropy of the softmax distribution:

H_k(s) = -\sum_{i \in \text{top-}k} p_i \log p_i, \quad p_i = \frac{\exp(s_i / \tau)}{\sum_{j \in \text{top-}k} \exp(s_j / \tau)}    (17)

where τ is the temperature (default 1.0). Lower entropy indicates more concentrated attribution, suggesting clearer feature semantics. We compute this at multiple scales: top-10, top-25, top-50, top-100, and full vocabulary.

Mass Concentration measures the fraction of total positive attribution mass contained in the top-k tokens:

\mathrm{Mass}_k = \frac{\sum_{i=1}^{k} s_i}{\sum_{j : s_j > 0} s_j}    (18)

Higher values indicate that a small number of tokens account for most of the feature's impact.

Gini Coefficient quantifies inequality in the score distribution. For top-k scores sorted in ascending order (s_(1), ..., s_(k)):

G_k = \frac{2 \sum_{i=1}^{k} i \cdot s_{(i)}}{k \sum_{i=1}^{k} s_{(i)}} - \frac{k+1}{k}    (19)

Values range from 0 (perfect equality) to 1 (maximal inequality). Higher values indicate sparser, more concentrated attribution.

Figure 8. Spearman rank correlation (ρ) matrix for 23 candidate semantic metrics, calculated on all features from the Gemma-2-9B layer 27 SAE. (The full 23×23 heatmap of correlation values is omitted here.)
Coefficient of Variation (CV) measures relative dispersion of the top-k scores:

\mathrm{CV}_k = \frac{\sigma(\{s_i \mid i \in \text{top-}k\})}{\mu(\{s_i \mid i \in \text{top-}k\})}    (20)

where σ and μ denote the standard deviation and mean. Lower values indicate more uniform score magnitudes among top tokens.

Max/Min Ratio captures the dominance of the top-scoring token:

\mathrm{Ratio}_k = \frac{s_1}{s_k}    (21)

where s_1 is the highest score and s_k is the k-th highest. Higher values indicate a single dominant token.

Figure 9. Pairwise scatter samples illustrating the relationships between the three selected semantic metrics for the Gemma-2-9B layer 27 SAE (Levenshtein similarity vs. cosine similarity, r = 0.477; Levenshtein similarity vs. top-100 entropy, r = -0.617; cosine similarity vs. top-100 entropy, r = -0.632). The plots demonstrate that while the metrics are not entirely independent, their correlations are moderate. This supports their selection as a complementary set, where each metric provides a distinct perspective for evaluating the semantic coherence of a feature.

Figure 10. Distributions of the three primary semantic metrics—Levenshtein similarity (higher is better), cosine similarity (higher is better), and top-100 entropy (lower is better)—for all features in the Gemma-2-9B layer 27 SAE. The distributions are continuous and generally unimodal, lacking a distinct bimodal structure or natural cutoff point. This characteristic motivates our use of a percentile-based thresholding approach in Experiment 1, as it provides a consistent relative standard for what constitutes a "semantic" feature, rather than relying on an arbitrary absolute value. The distribution plot applies normalization (RMSNorm) because, without it, the entropy is highly concentrated near the maximum value with extreme outliers, making visualization difficult.

L2 Norm measures the overall magnitude of feature impact, normalized by vocabulary size:

\|s\|_{\mathrm{norm}} = \frac{\|s\|_2}{\sqrt{V}}    (22)

This provides scale-independent comparison of attribution strength across features.
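The concentration metrics are straightforward to compute once a feature's full attribution vector is available. The following sketch is a hedged illustration under the same assumptions as the previous one: `scores` is a plain NumPy array of per-token attribution scores, and the function names are ours.

```python
# Sketch of the A.2 concentration metrics for one feature's attribution vector.
import numpy as np

def entropy_top_k(scores: np.ndarray, k: int = 100, tau: float = 1.0) -> float:
    # Shannon entropy of the softmax over the top-k scores (Eq. 17).
    top = np.sort(scores)[-k:]
    p = np.exp(top / tau)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def mass_concentration(scores: np.ndarray, k: int = 10) -> float:
    # Fraction of total positive attribution mass in the top-k tokens (Eq. 18).
    top = np.sort(scores)[-k:]
    return float(top.sum() / scores[scores > 0].sum())

def gini(scores: np.ndarray, k: int = 10) -> float:
    # Gini coefficient of the top-k scores, sorted ascending (Eq. 19).
    s = np.sort(scores)[-k:]          # top-k, already in ascending order
    i = np.arange(1, k + 1)
    return float(2 * (i * s).sum() / (k * s.sum()) - (k + 1) / k)

def coeff_var(scores: np.ndarray, k: int = 10) -> float:
    # Coefficient of variation of the top-k scores (Eq. 20).
    top = np.sort(scores)[-k:]
    return float(top.std() / top.mean())
```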
A.3. Morphological Pattern Metrics

Prefix/Suffix Diversity quantify the variety of token beginnings and endings. For prefix length ℓ (default 3):

\mathrm{PrefixDiv}_k = \frac{|\{t[:\ell] : t \in T_k\}|}{k}    (23)

and analogously for suffixes using t[-ℓ:]. Lower diversity suggests morphological consistency (e.g., shared verb conjugations).

Token Length Statistics include the mean, standard deviation, and maximum character length across the top-k tokens. These capture whether features specialize to tokens of particular lengths.

Case Consistency measures the fraction of top-k tokens sharing the dominant case pattern (lowercase, uppercase, title case, or mixed):

\mathrm{CaseConsist}_k = \max_{c \in \{\text{lower, upper, title, mixed}\}} \frac{|\{t \in T_k : \mathrm{case}(t) = c\}|}{k}    (24)

Whitespace Pattern computes the fraction of top-k tokens beginning with whitespace, capturing formatting conventions:

\mathrm{Whitespace}_k = \frac{|\{t \in T_k : t[0] \in \text{whitespace}\}|}{k}    (25)

To avoid redundancy and select a minimal set of metrics that provide complementary perspectives, we performed a correlation analysis. We computed the pairwise Spearman rank correlation coefficient (ρ) among all 23 metrics for the features in a representative SAE (Gemma-2-9B, layer 27). As illustrated in Figure 8, this analysis revealed distinct clusters of highly correlated metrics, indicating that many captured similar underlying properties.

B. Feature-Level Attention Analysis

We now formalize the relationship between our out-of-context feature-feature attention scores and the actual attention computations in the forward pass. In our experiments, we apply normalization to individual feature contributions, following the convention used in prior work employing the logit lens. In contrast, during the model's actual forward pass, normalization operates on the full combination of features prior to attention computation. Applying normalization at the level of individual features is therefore an approximation, since normalization is a non-linear operation. For analytical clarity, the derivation below omits normalization.

B.1. SAE Reconstruction of the Residual Stream

In the actual forward pass, the residual stream at the end of layer L is:

x_{\text{true}} = x_{\text{SAE}} + x_{\text{error}}    (26)

where x_SAE is the SAE reconstruction and x_error is the reconstruction error. Our analysis focuses on x_SAE, which for well-trained SAEs captures the dominant systematic structure. When features activate with strengths z_1, z_2, ..., z_{d_sae}, the SAE reconstruction is:

x_{\text{SAE}} = \sum_{i=1}^{d_{\text{sae}}} z_i W^{(i)}_{\text{dec},L} + b_{\text{dec}}    (27)

B.2. Attention Computation in Layer L + 1

For attention head h in layer L + 1, under our approximation where normalization effects are set aside, the query and key projections are:

q^{(h)} = x_{\text{SAE}} W^{(h)}_{Q,L+1} = \sum_{i=1}^{d_{\text{sae}}} z_i W^{(i)}_{\text{dec},L} W^{(h)}_{Q,L+1} + b_{\text{dec}} W^{(h)}_{Q,L+1}    (28)

k^{(h)} = x_{\text{SAE}} W^{(h)}_{K,L+1} = \sum_{i=1}^{d_{\text{sae}}} z_i W^{(i)}_{\text{dec},L} W^{(h)}_{K,L+1} + b_{\text{dec}} W^{(h)}_{K,L+1}    (29)

Defining the per-feature query and key vectors:

q^{(h)}_i = W^{(i)}_{\text{dec},L} W^{(h)}_{Q,L+1}, \quad k^{(h)}_i = W^{(i)}_{\text{dec},L} W^{(h)}_{K,L+1}    (30)

and the bias contributions:

q^{(h)}_b = b_{\text{dec}} W^{(h)}_{Q,L+1}, \quad k^{(h)}_b = b_{\text{dec}} W^{(h)}_{K,L+1}    (31)

we can rewrite the projections as:

q^{(h)} = \sum_{i=1}^{d_{\text{sae}}} z_i q^{(h)}_i + q^{(h)}_b, \quad k^{(h)} = \sum_{j=1}^{d_{\text{sae}}} z_j k^{(h)}_j + k^{(h)}_b    (32)
B.3. Decomposition of Attention Scores

The pre-softmax attention score in head h is the scaled dot product:

s^{(h)} = \frac{q^{(h)} \cdot k^{(h)}}{\sqrt{d_{\text{head}}}}    (33)

= \frac{1}{\sqrt{d_{\text{head}}}} \left[ \sum_{i=1}^{d_{\text{sae}}} \sum_{j=1}^{d_{\text{sae}}} z_i z_j (q^{(h)}_i \cdot k^{(h)}_j) + \sum_{i=1}^{d_{\text{sae}}} z_i (q^{(h)}_i \cdot k^{(h)}_b) + \sum_{j=1}^{d_{\text{sae}}} z_j (q^{(h)}_b \cdot k^{(h)}_j) + q^{(h)}_b \cdot k^{(h)}_b \right]    (34)

The most interesting term is the double sum over feature pairs. Defining the per-feature pairwise attention potential:

s^{(h)}_{i,j} = \frac{q^{(h)}_i \cdot k^{(h)}_j}{\sqrt{d_{\text{head}}}}    (35)

the forward-pass attention score decomposes as:

s^{(h)} = \sum_{i=1}^{d_{\text{sae}}} \sum_{j=1}^{d_{\text{sae}}} z_i z_j s^{(h)}_{i,j} + \text{(bias terms)}    (36)

This decomposition shows that the pre-softmax matrix S^{(h)} = [s^{(h)}_{i,j}] captures the weight-based structural potentials that, when weighted by the actual feature activations z_i z_j, determine the attention scores in the forward pass (up to the normalization approximation discussed above).
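In code, the pairwise potentials of Eq. (35) reduce to two matrix products. The sketch below is a minimal illustration (not the authors' implementation) that ignores biases and normalization, as in the derivation above; array names and shapes are assumptions for the example.

```python
# Sketch of the out-of-context attention potentials S^(h) for a single head.
# W_dec: layer-L SAE decoder, shape (d_sae, d_model)
# W_Q, W_K: one layer-(L+1) head's projections, shape (d_model, d_head)
import numpy as np

def pairwise_attention_potentials(W_dec: np.ndarray,
                                  W_Q: np.ndarray,
                                  W_K: np.ndarray) -> np.ndarray:
    d_head = W_Q.shape[1]
    q = W_dec @ W_Q                    # per-feature query vectors q_i^(h), (d_sae, d_head)
    k = W_dec @ W_K                    # per-feature key vectors   k_j^(h), (d_sae, d_head)
    # S[i, j] = q_i . k_j / sqrt(d_head); for large dictionaries this would be
    # computed row by row rather than materialized as a full (d_sae, d_sae) matrix.
    return (q @ k.T) / np.sqrt(d_head)

def row_softmax(S: np.ndarray) -> np.ndarray:
    # Row-wise softmax, giving the post-softmax scores used by the criteria in Appendix C.
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```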
C. Rationale for Attention Specialization Metrics

The decision to analyze computational features from both a query-based and a key-based perspective is motivated by the inherent asymmetry of the attention mechanism. The row-wise application of the softmax function imparts fundamentally different properties and, consequently, different computational roles to features acting as queries versus those acting as keys.

C.1. The Asymmetry of Query and Key

The attention score from a query feature i to a key feature j is calculated as part of a distribution over all possible keys: a^{(h)}_i = softmax(s^{(h)}_i). This means that for a given query i, all keys j ∈ {1, ..., d_sae} are in direct competition to receive its attention.

Query-based (Outgoing) Perspective. A query feature must allocate a fixed budget of attention (summing to 1.0) across all possible keys. To achieve a high score for a specific key, it only needs a pre-softmax alignment with that key that is significantly higher than its alignment with any other key. This identifies features that are specialized "lookers" or "actors," designed to perform a targeted information lookup.

Key-based (Incoming) Perspective. A key feature's incoming attention is the sum of the attention it receives from many independent, row-wise softmax operations. There is no column-wise competition. For a key to be considered "dominant," it must consistently win the attention competition across many different queries. This identifies features that represent "information hubs": concepts so useful that multiple computational circuits attend to them.

C.2. Justification of Metric Choices

This asymmetry necessitates different metrics for each perspective, as illustrated in the sketch following the tables below.

For query specialists: We use sum of top-k > 0.95. This metric is ideal for identifying "specialist lookers." It asks: "Does this feature concentrate nearly all of its attention budget on a very small set of targets?" This is an intuitive and powerful definition of a computationally specialized feature. It directly identifies features that participate in targeted circuits, capturing the essence of "doing something specific" in attention.

For key hubs: We use mean of top-k > 0.25. This is a much stricter criterion designed to identify "information hubs." A simple sum could be inflated by one query giving 0.95 attention and nine others giving almost none. The mean, however, requires that a feature be consistently and substantially important to multiple independent queries (e.g., each of its top-10 admirers dedicates, on average, >25% of their attention to it). This filters for features that are important to the model's computations, for which a lower pass rate is expected and informative.

We chose k = 10 primarily as a compromise between retaining sufficient information and managing the computational budget required for all features across all layers and models, since each column or row stores d_sae scores for both pre- and post-softmax matrices during exploration. Given that these thresholds may be somewhat arbitrary, we also provide several alternative values for τ_Q and τ_K for reference in Table 1 and Table 2, respectively.

Table 1. Average percentage of features qualifying as query specialists across all layers for a range of thresholds (τ_Q). The criterion requires the sum of a feature's top-10 post-softmax attention scores to exceed the threshold. The first column corresponds to our primary choice (τ_Q = 0.95).

Threshold τ_Q     0.95      0.90      0.75      0.50      0.25
Gemma-2-2B       39.54%    43.79%    50.94%    59.29%    68.98%
Gemma-2-9B       53.82%    58.49%    65.60%    72.84%    79.73%
Llama-3.1-8B      9.75%    11.71%    15.62%    21.14%    28.83%

Table 2. Average percentage of features qualifying as key hubs across all layers for a range of thresholds (τ_K). The criterion requires the mean of a feature's top-10 incoming post-softmax attention scores to exceed the threshold. The second column corresponds to our primary choice (τ_K = 0.25).

Threshold τ_K     0.50      0.25      0.10      0.05      0.01
Gemma-2-2B        1.49%     2.90%     8.88%    14.16%    25.82%
Gemma-2-9B        3.16%     6.15%    17.52%    27.31%    43.59%
Llama-3.1-8B      0.42%     0.85%     2.15%     3.57%     8.03%
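Given the post-softmax matrix, the two criteria amount to a row-wise and a column-wise top-k reduction. The sketch below illustrates this under assumed names: `A` is a row-softmaxed potential matrix such as the one produced by the Appendix B sketch, and the default thresholds mirror τ_Q = 0.95 and τ_K = 0.25.

```python
# Sketch of the query-specialist and key-hub criteria from Appendix C.
import numpy as np

def query_specialists(A: np.ndarray, k: int = 10, tau_q: float = 0.95) -> np.ndarray:
    # Sum of each row's top-k outgoing attention weights must exceed tau_q.
    top = np.sort(A, axis=1)[:, -k:]
    return top.sum(axis=1) > tau_q        # boolean mask over query features

def key_hubs(A: np.ndarray, k: int = 10, tau_k: float = 0.25) -> np.ndarray:
    # Mean of each column's top-k incoming attention weights must exceed tau_k.
    top = np.sort(A, axis=0)[-k:, :]
    return top.mean(axis=0) > tau_k       # boolean mask over key features
```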
D. Analysis of Lower Computational Feature Pass Rates in Llama-3.1-8B

A notable finding from our computational analysis is the lower pass rate for both query specialists and key hubs in Llama-3.1-8B compared to the Gemma-2 models (e.g., 9.75% vs. 53.82% for query specialists with τ_Q = 0.95). We hypothesize this discrepancy arises from several interacting factors related to model and SAE architecture, as well as the nature of our out-of-context methodology.

D.1. SAE dictionary size

The most direct architectural difference is the SAE dictionary size: 32,768 for Llama-3.1-8B versus 16,384 for Gemma-2-9B. This has two potential consequences. First, a larger dictionary size naturally leads to a lower percentage pass rate, even if the absolute number of specialized features for a given function were similar across models.

D.2. Feature Splitting

Second, and more importantly, a larger dictionary may encourage the SAE to learn a more distributed or compositional code (Chanin et al., 2025). Instead of representing a concept with a single feature, it might reconstruct the same activation using a linear combination of several more granular features. Our methodology, which evaluates each feature's decoder vector in isolation, would register a diluted signal in such cases, as no single feature vector carries the full computational weight of the concept. Consequently, fewer individual features would meet the dominance thresholds.

We can formalize how feature splitting would lead to lower measured scores for individual features. For simplicity, we ignore normalizations and reconstruction coefficients. Let x ∈ R^{d_model} be a target vector in the residual stream that represents a specific computational role. An ideal SAE might learn a single feature, say feature k, to represent this vector. In this scenario, the feature's decoder vector is proportional to the target vector:

W^{(k)}_{\text{dec}} = x    (37)

Our methodology measures this feature's participation in a QK circuit by projecting it. For example, its query vector for head h would be:

q^{(h)}_k = W^{(k)}_{\text{dec}} W^{(h)}_Q = x W^{(h)}_Q    (38)

The resulting pre-softmax attention scores are directly proportional to the full magnitude of x.

Now, consider an SAE with a larger dictionary. It might learn to represent the same vector x using a linear combination of two or more features. Let's say features m and n combine to form x:

x = W^{(m)}_{\text{dec}} + W^{(n)}_{\text{dec}}    (39)

A simple solution for the SAE optimization is to distribute the representation, for instance:

W^{(m)}_{\text{dec}} = \alpha x \quad \text{and} \quad W^{(n)}_{\text{dec}} = (1 - \alpha) x    (40)

for some fraction α ∈ (0, 1). Our methodology analyzes each feature in isolation. When we evaluate feature m, its query vector is:

q^{(h)}_m = W^{(m)}_{\text{dec}} W^{(h)}_Q = (\alpha x) W^{(h)}_Q = \alpha (x W^{(h)}_Q) = \alpha q^{(h)}_k    (41)

The query vector for the split feature m is a scaled-down version of the query vector for the ideal single feature k. Consequently, its pre-softmax attention score with any key feature j is also scaled down:

s^{(h)}_{m,j} = \frac{q^{(h)}_m \cdot k^{(h)}_j}{\sqrt{d_{\text{head}}}} = \alpha \left( \frac{q^{(h)}_k \cdot k^{(h)}_j}{\sqrt{d_{\text{head}}}} \right) = \alpha s^{(h)}_{k,j}    (42)

The measured score for the individual split feature is only a fraction of the score it would have if it represented the concept alone. This makes it significantly less likely for any single feature to pass a fixed, high dominance threshold like τ_Q or τ_K.
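The dilution argument can be checked numerically with a toy example. The snippet below uses random matrices of illustrative sizes; it is not drawn from the paper's experiments and only demonstrates the linear scaling in Eq. (42).

```python
# Toy check: splitting a concept vector x into alpha*x and (1 - alpha)*x
# scales each split feature's pre-softmax potential by alpha.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16                       # illustrative sizes only
x = rng.normal(size=d_model)                   # ideal single-feature decoder direction
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
key = rng.normal(size=d_model) @ W_K           # an arbitrary key feature's projection

alpha = 0.4
s_full = (x @ W_Q) @ key / np.sqrt(d_head)             # score of the unsplit feature
s_split = ((alpha * x) @ W_Q) @ key / np.sqrt(d_head)  # score of one split feature

assert np.isclose(s_split, alpha * s_full)     # the split feature sees only a fraction
```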
D.3. Strictness of out-of-context competition

Our out-of-context framework is inherently strict, as it simulates attention competition across the entire SAE dictionary without the benefit of contextual sparsity from a real forward pass. In an actual inference step, only a few hundred features might be active, and the softmax competition would occur among this small set. Our method pits one feature against all others in the dictionary. This effect is likely magnified for Llama-3.1's larger dictionary; the bar for a single feature to dominate a competition against 32,767 others is substantially higher than against 16,383, potentially contributing to its lower pass rate.

D.4. Focus on intra-layer interactions

Our analysis focuses exclusively on intra-layer attention, where features from layer L interact with the QK circuits of the immediately subsequent layer, L + 1. It is plausible that some features are specialized for longer-range, inter-layer interactions, for example, by being preserved in the residual stream to be used primarily by attention heads in layers L + 2 or beyond. Such long-range specialists would not be captured by our current methodology. While this limitation is common to our analysis of all models, architectural differences could lead to Llama relying more heavily on such mechanisms, which would not be reflected in our reported scores.

Table 3. Thresholds for semantic feature classification, set at the 50th percentile of scores from a representative layer. The arrows indicate the desired direction for each metric.

Analysis Type         Model          Layer   Levenshtein Sim. ↑   Cosine Sim. ↑   Top-100 Entropy ↓
Decoder-Unembedding   Gemma-2-2B      16     0.157385             0.292363        0.612010
                      Gemma-2-9B      27     0.194815             0.332252        0.550316
                      Llama-3.1-8B    20     0.154048             0.058610        2.911260
Encoder-Embedding     Gemma-2-2B       3     0.149182             0.282485        4.604819
                      Gemma-2-9B       6     0.153151             0.348118        4.604956
                      Llama-3.1-8B     5     0.129334             0.023965        4.603228

Figure 11. Supplementary results for Experiment 1 on Gemma-2-2B and 9B, using encoder–embedding alignment. The panels plot pass ratios (%) by metric combination (Levenshtein, cosine, entropy, their pairwise intersections, and all three) across layers. Semantic features display a U-shaped distribution, with average joint pass rates of 24.66% and 25.16%, respectively.

E. Additional Results

This appendix presents additional results that were omitted from the main text for brevity.

Figure 12. Main results for Experiment 2 on Gemma-2-2B: magnitude of attention participation across layers, with the total height at each layer representing the aggregate pre-softmax score. These scores are calculated as the mean across all keys for each query feature, then summed for each head. The stacking direction of the head values is determined by whether they are positive or negative. Each color represents a rank of magnitude, rather than a specific head index, across all layers.

Figure 13. Feature count trends across Gemma-2-2B layers showing features with top-10 query-based post-softmax attention weight sum above 0.95 (left) and key-based attention weight mean above 0.25 (right). Note that there are two y-axes of different magnitudes: the left y-axis is for the mean and median counts across heads, while the right y-axis is for the union across all heads ("any head"). Solid lines include weights of self-attention; dashed lines exclude self-attention.

Figure 14. Feature count trends across Llama-3.1-8B layers showing features with top-10 query-based post-softmax attention weight sum above 0.95 (left) and key-based attention weight mean above 0.25 (right). Note that there are two y-axes of different magnitudes: the left y-axis is for the mean and median counts across heads, while the right y-axis is for the union across all heads ("any head"). Solid lines include weights of self-attention; dashed lines exclude self-attention.
Figure 15. Attention weight distributions in Gemma-2-2B layer 12 (a: key-based; b: query-based; semantic n=2,583, non-semantic n=13,801). White diamonds mark means and white lines mark medians.

Figure 16. Attention weight distributions in Gemma-2-9B layer 20 (a: key-based; b: query-based; semantic n=2,056, non-semantic n=14,328). White diamonds mark means and white lines mark medians.

Figure 17. Attention weight distributions in Llama-3.1-8B layer 28, decoder–unembedding phase (a: key-based; b: query-based; semantic n=18,820, non-semantic n=13,948). White diamonds mark means and white lines mark medians.

Figure 18. Attention weight distributions in Llama-3.1-8B layer 4, encoder–embedding phase (a: key-based; b: query-based; semantic n=11,833, non-semantic n=20,935). White diamonds mark means and white lines mark medians.

Figure 19. Probability densities of attention weights in Gemma-2-2B layer 12 (a: key-based; b: query-based). Dotted lines mark means.
Figure 20. Probability densities of attention weights in Gemma-2-9B layer 20 (a: key-based; b: query-based). Dotted lines mark means.

Figure 21. Probability densities of attention weights in Llama-3.1-8B layer 28, decoder–unembedding phase (a: key-based; b: query-based). Dotted lines mark means.

Figure 22. Probability densities of attention weights in Llama-3.1-8B layer 4, encoder–embedding phase (a: key-based; b: query-based). Dotted lines mark means.

Figure 23. Mean attention weights across layers in Gemma-2-2B (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 24. Mean attention weights across layers in Gemma-2-9B (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 25. Mean attention weights across layers in Llama-3.1-8B, decoder–unembedding phase (layers 14–30) (a: key-based; b: query-based), comparing semantic and non-semantic features.
Figure 26. Mean attention weights across layers in Llama-3.1-8B, encoder–embedding phase (layers 0–12) (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 27. Median attention weights across layers in Gemma-2-2B (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 28. Median attention weights across layers in Gemma-2-9B (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 29. Median attention weights across layers in Llama-3.1-8B, decoder–unembedding phase (layers 14–30) (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 30. Median attention weights across layers in Llama-3.1-8B, encoder–embedding phase (layers 0–12) (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 31. Percentage of features exceeding the global P75 threshold across layers in Gemma-2-2B (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 32. Percentage of features exceeding the global P75 threshold across layers in Gemma-2-9B (a: key-based; b: query-based), comparing semantic and non-semantic features.
Figure 33. Percentage of features exceeding the global P75 threshold across layers in Llama-3.1-8B, decoder–unembedding phase (layers 14–30) (a: key-based; b: query-based), comparing semantic and non-semantic features.

Figure 34. Percentage of features exceeding the global P75 threshold across layers in Llama-3.1-8B, encoder–embedding phase (layers 0–12) (a: key-based; b: query-based), comparing semantic and non-semantic features.