
Paper deep dive

Decomposing The Dark Matter of Sparse Autoencoders

Joshua Engels, Logan Smith, Max Tegmark

Year: 2024 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 120

Models: Gemma 2 9B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 7:59:30 PM

Summary

The paper investigates 'dark matter' in Sparse Autoencoders (SAEs), defined as unexplained variance in language model activations. The authors demonstrate that a significant portion of SAE error vectors and their norms can be linearly predicted from initial model activations. They distinguish between 'linearly predictable' and 'nonlinear' error, finding that nonlinear error is qualitatively different, harder to learn, and more impactful on downstream cross-entropy loss. The study also explores methods to mitigate nonlinear error, such as inference-time gradient pursuit and linear transformations from earlier layer SAE outputs.

Entities (6)

Gemma-2 · language-model · 100%
Llama-3.1 · language-model · 100%
Sparse Autoencoders · model-architecture · 100%
Dark Matter · concept · 95%
Linear Representation Hypothesis · hypothesis · 95%
Nonlinear SAE Error · metric/phenomenon · 90%

Relation Signals (3)

Nonlinear SAE Error increases Cross Entropy Loss

confidence 95% · it is responsible for a proportional amount of the downstream increase in cross entropy loss

Sparse Autoencoders produces Dark Matter

confidence 90% · current SAEs fall short of completely explaining model performance, resulting in 'dark matter'

Linear Representation Hypothesis is supported by Sparse Autoencoders

confidence 85% · The LRH has seen recent empirical support with sparse autoencoders

Cypher Suggestions (2)

Find all models studied in the context of SAE dark matter · confidence 90% · unvalidated

MATCH (m:Model)-[:STUDIED_IN]->(p:Paper {id: 'e9136c37-a4e0-4789-b0fd-de8d6c72e836'}) RETURN m.name

Map the relationship between SAE error types and their impact · confidence 85% · unvalidated

MATCH (e:Entity {name: 'Nonlinear SAE Error'})-[r:AFFECTS]->(m:Metric) RETURN e.name, type(r), m.name

Abstract

Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in "dark matter": unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter -- about half of the error vector itself and >90% of its norm -- can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations. These insights imply that the part of the SAE error vector that cannot be linearly predicted ("nonlinear" error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, and 3) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

120,063 characters extracted from source content.


Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels jengels@mit.edu MIT Logan Smith logansmith5@gmail.com Independent Max Tegmark tegmark@mit.edu MIT & IAIFI Abstract Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in “dark matter”: unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter—about half of the error vector itself and >90% of its norm—can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations. These insights imply that the part of the SAE error vector that cannot be linearly predicted (“nonlinear” error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, and 3) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.
1 Introduction The ultimate goal for ambitious mechanistic interpretability is to understand neural networks from the bottom up by breaking them down into programs (“circuits”) and the variables (“features”) that those programs operate on (Olah, 2023). One recent successful technique for finding features in language models has been sparse autoencoders (SAEs), which learn a dictionary of one-dimensional representations that can be sparsely combined to reconstruct model hidden activations (Cunningham et al., 2023; Bricken et al., 2023). However, as observed by Gao et al. (2024), the scaling behavior of SAE width (number of latents) vs. reconstruction mean squared error (MSE) is best fit by a power law with a constant error term. Gao et al. (2024) speculate that this component of SAE error below the asymptote might best be explained by model activations having components with denser structure than simple SAE features (e.g. Gaussian noise). This is a concern for the ambitious agenda because it implies that there are components of model hidden states that are harder for SAEs to learn and which might not be eliminated by simple scaling of SAEs. Motivated by this discovery, in this work our goal is to specifically study the SAE error vector itself, and in doing so gain insight into the failures of current SAEs, the dynamics of SAE scaling, and possible distributions of model activations. Thus, our direction differs from the bulk of prior work that seeks to quantify SAE failures, as these mostly focus on downstream benchmarks or simple cross entropy loss (see e.g. Gao et al. (2024); Templeton et al. (2024); Anders & Bloom (2024)). The structure of this paper is as follows: 1. In Section 4, we introduce the fundamental mystery that we will explore throughout the rest of the paper: SAE errors are shockingly predictable.
To the best of our knowledge, we are the first to show that a large fraction of SAE error vectors can be explained with a linear transformation of the input activation, that the norm of SAE error vectors can be accurately predicted by a linear projection of the input activation, and that on a per-token level, error norms of large SAEs are linearly predictable from small SAEs. 2. In Section 5, we investigate the linearly predictable and nonlinearly predictable components of SAE error. We find that although the nonlinear component affects downstream cross entropy loss in proportion to its norm, it is qualitatively different from linear error: as compared to linear error, nonlinear error is harder to learn SAEs for and has a norm that is harder to linearly predict from activations, suggesting that it consists of a smaller proportion of not-yet-learned linear features. 3. In Section 6, we investigate methods for reducing nonlinear error. We show that inference time optimization increases the fraction of variance explained by SAEs, but only slightly decreases nonlinear error. Additionally, we show that we can use SAEs trained on previous components to decrease nonlinear error and total SAE error. Figure 1: A breakdown of SAE dark matter for layer 20 Gemma 9B SAEs, with dotted lines assuming that observed trends continue for larger SAEs. See Section 4 for how we break down the overall fraction of unexplained variance into absent features, linear error, and nonlinear error. See Section 6.1 for further separating encoder error from nonlinear error. 2 Related Work Language Model Representation Structure: The linear representation hypothesis (LRH) (Park et al., 2023; Elhage et al., 2022) claims that language model hidden states can be decomposed into a sparse sum of linear feature directions. 
The LRH has seen recent empirical support with sparse autoencoders (Makhzani & Frey, 2013; Bricken et al., 2023; Cunningham et al., 2023), which have succeeded in decomposing much of the variance of language model hidden states into such a sparse sum, as well as a long line of work that has used probing and dimensionality reduction to find causal linear representations for specific concepts (Alain, 2016; Nanda et al., 2023; Marks et al., 2024; Gurnee, 2024). On the other hand, some recent work has questioned whether the linear representation hypothesis is true: Engels et al. (2024) find multidimensional circular representations in Mistral (Jiang et al., 2023) and Llama (AI@Meta, 2024), and Csordás et al. (2024) examine synthetic recurrent neural networks and find “onion-like” non-linear features not contained in a linear subspace. This has inspired recent discussion about what a true model of activation space might be: Mendel (2024) argues that the linear representation hypothesis ignores the growing body of results showing the multi-dimensional structure of SAE latents, and Smith (2024b) argues that we only have evidence for a “weak” form of the superposition hypothesis holding that only some features are linearly represented (such a hypothesis has also been studied in the sparse coding literature (Tasissa et al., 2020)). SAE Errors and Benchmarking: Multiple works have introduced techniques to benchmark SAEs and characterize their error: Bricken et al. (2023), Gao et al. (2024), and Templeton et al. (2024) use manual human analysis of features, automated interpretability, downstream cross entropy loss when SAE reconstructions are inserted back into the model, and feature geometry visualizations; Karvonen et al. 
(2024) use the setting of board games, where the ground truth features are known, to determine what proportion of the true features SAEs learn; and Anders & Bloom (2024) use the performance of the model on NLP benchmarks when the SAE reconstruction is inserted back into the model. More specifically relevant to our main direction in this paper, Gurnee (2024) finds that SAE reconstruction errors are pathological, that is, when SAE reconstructions are inserted into the model, they have a larger effect on cross entropy loss than random perturbations with the same error norm. Follow-up work by Heimersheim & Mendel (2024) and Lee & Heimersheim (2024) finds that this effect disappears when the random baseline is replaced by a perturbation in the direction of the difference between two random activations. SAE Scaling Laws: Anthropic (2024), Templeton et al. (2024), and Gao et al. (2024) study how SAE MSE scales with respect to FLOPS, sparsity, and SAE width, and define scaling laws with respect to these quantities. Templeton et al. (2024) also study how specific groups of language features like chemical elements, cities, animals, and foods are learned by SAEs, and show that SAEs predictably learn these features in terms of their occurrence. Finally, Bussmann et al. (2024) find that larger SAEs learn two new types of dictionary vectors as compared to smaller SAEs: features not present at all in smaller SAEs, and more fine-grained “feature split” versions of features in smaller SAEs. 3 Notation We consider neural network activations x ∈ ℝ^d and sparse autoencoders Sae: ℝ^d → ℝ^d. Sparse autoencoders map inputs to a latent space ℝ^m with m ≫ d and then back into ℝ^d, while also requiring that only a sparse set of the m latent dimensions are nonzero.
SAEs have the general architecture

hidden(x) := σ(W_enc · (x − b_dec) + b_enc)   (1)
Sae(x) := W_dec · hidden(x) + b_dec   (2)

where σ is an architecture-dependent activation function, and seek to minimize

L_Sae(x) = ∥x − Sae(x)∥₂² + S(hidden(x))   (3)

where S is an architecture-dependent sparsity function. For convenience, we define SaeError(x) as the reconstruction error of the SAE and L0 as the average number of nonzeros in hidden(x):

SaeError(x) := x − Sae(x)   (4)
L0 := E_x(∥hidden(x)∥₀)   (5)

In this work, we study the relationship between x and SaeError(x) and are agnostic to the specifics of SAE architecture. Thus, we examine both TopK SAEs (Gao et al., 2024) and JumpReLU SAEs (Rajamanoharan et al., 2024). The TopK SAE is defined as

σ_TopK(z) = set all but the k largest dimensions of z to zero   (6)
S_TopK(hidden(x)) = 0   (7)

Note that L0 = k for a TopK SAE.
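As a concrete illustration of Eqs. 1–7, here is a minimal numpy sketch of a TopK SAE forward pass. The weights are random placeholders (not a trained SAE), and the shapes are toy values chosen only for the example:

```python
import numpy as np

def topk(z, k):
    """sigma_TopK: zero all but the k largest entries of each row (Eq. 6)."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=-1)[:, -k:]  # indices of the k largest latents per row
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """TopK SAE: hidden (Eq. 1 with sigma = TopK), reconstruction (Eq. 2),
    and reconstruction error SaeError(x) (Eq. 4)."""
    hidden = topk((x - b_dec) @ W_enc + b_enc, k)
    recon = hidden @ W_dec + b_dec
    return recon, x - recon, hidden

# toy dimensions: d = 8 model dim, m = 32 latents, k = 4 active (L0 = k, Eq. 5)
rng = np.random.default_rng(0)
d, m, k = 8, 32, 4
x = rng.normal(size=(5, d))
W_enc, b_enc = rng.normal(size=(d, m)), np.zeros(m)
W_dec, b_dec = rng.normal(size=(m, d)) / m, np.zeros(d)
recon, err, hidden = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

By construction Sae(x) + SaeError(x) = x, and each row of hidden(x) has exactly k nonzeros, matching L0 = k for a TopK SAE.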
The JumpReLU SAE is defined as

σ_JumpReLU(z)_i = z_i if z_i − θ_i > 0, and 0 otherwise   (8)
S_JumpReLU(hidden(x)) = λ ∥hidden(x)∥₀   (9)

where θ is a learned bias vector and λ is a sparsity parameter. 4 Predicting SAE Error In this section, we evaluate the extent to which SAE errors SaeError(x) can be predicted from input model activations x. We run experiments (code at https://anonymous.4open.science/r/SAE-Dark-Matter-1163) on Gemma 2 2B and 9B (Team et al., 2024) and Llama 3.1 8B (AI@Meta, 2024). We use the suite of Gemma Scope (Lieberum et al., 2024) sparse autoencoders for Gemma 2 2B experiments and Llama Scope (He et al., 2024) for Llama 3.1 8B. For experiments where we analyze the effect of L0 and SAE width, we focus on layer 12 of Gemma 2 2B and layer 20 of Gemma 2 9B in the main text, as these layers have the most SAEs in Gemma Scope, and we include additional layers in the appendix (we do not run on Llama Scope for these experiments, as Llama Scope has just one SAE L0 and two SAE widths per layer). For experiments where we analyze the effect of layer, we show the set of SAEs with L0 closest to a target L0 across all layers for both the Gemma Scope and Llama Scope suites of SAEs. We use 300 contexts of 1024 tokens from the uncopyrighted subset of the Pile (Gao et al., 2020) and then filter to only activations of tokens after position 200 in each context, as Lieberum et al.
(2024) find that earlier tokens are easier for sparse autoencoders to reconstruct, and we wish to ignore the effect of token position on our results. This results in a dataset of about 247k activations. For linear regressions, we use a random subset of size 150k as training examples (since all models have a dimension of less than 5000, this prevents overfitting) and report the R² on the other 97k activations. For linear transformations to a multi-dimensional output, we report the average R² across dimensions. We include bias terms in our linear regressions but omit them from equations for simplicity.

(a) Linear prediction results for layer 12 Gemma 2 2B SAEs from Gemma Scope. FVU_nonlinear is roughly constant for a fixed L0. (b) Linear prediction results for layer 20 Gemma 2 9B SAEs from Gemma Scope. FVU_nonlinear is roughly constant for a fixed L0.

Figure 2: Results of linearly predicting SAE error norm and SAE error from model activations on Gemma 2 2B layer 12 (top) and Gemma 2 9B layer 20 (bottom). The left plots show the R² of predicting SAE error norms (see Eq. 10), the middle plots show the R² of predicting SAE error vectors (see Eq. 11), and the right plots show 1 − R² of predicting model activations given the SAE reconstruction and the SAE error vector prediction. We note that these 2D heatmaps are somewhat sparse and only the black dots represent actual SAEs. This is because we use Gemma Scope SAEs, which are trained only on some L0s and SAE widths. We do a linear interpolation between SAEs to predict R² between hyperparameters.
4.1 Predicting SAE Error Norm For our first set of experiments, we find the optimal linear probe a* from x to ∥SaeError(x)∥₂². Formally (with a slight abuse of notation, since x is a random variable and not a dataset), we solve for

a* = argmin_{a ∈ ℝ^d} ∥ aᵀ · x − ∥SaeError(x)∥₂² ∥₂   (10)

The R² of these probes are all extremely high: across all combinations of SAE width, L0, layer, and model, between 70% and 95% of the variance in SAE error norm is explained by the optimal linear probe. Results across SAE width and L0: We plot the R² of probes from Eq. 10 across SAE width and SAE L0 for layer 12 of Gemma 2 2B and layer 20 of Gemma 2 9B as a contour plot (on the left) in Fig. 2. We also include plots for additional layers in Appendix A. Overall, sparser and wider SAEs have less predictable error norms, although interestingly for some layers, extremely dense SAEs have less predictable error norms as well. Results across layers and models: In Fig. 3 we plot the R² of probes from Eq. 10 for the SAEs with L0 closest to 50 on Gemma 2 2B and 9B and Llama Scope SAEs (which all have k = L0 = 50). We find that the R² is low for the first few layers, increases rapidly to a value of about 0.8 to 0.95, remains at this level with a slight dip for mid-late layers, and then drops off steeply in the last few layers.
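The norm-prediction probe of Eq. 10 is an ordinary least-squares regression from activations to squared error norms. A minimal numpy sketch, using synthetic data as a stand-in for real activations and SAE error norms (real experiments would use the 150k/97k split described above):

```python
import numpy as np

def norm_prediction_r2(x_train, y_train, x_test, y_test):
    """Fit a linear probe a* (plus bias) from activations x to
    ||SaeError(x)||^2 (Eq. 10) and report R^2 on held-out data."""
    X = np.hstack([x_train, np.ones((len(x_train), 1))])  # append bias column
    a, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    Xt = np.hstack([x_test, np.ones((len(x_test), 1))])
    pred = Xt @ a
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# synthetic stand-in: "error norms" that are mostly a linear function of x,
# so the probe should recover most of the variance
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 16))
w = rng.normal(size=16)
y = x @ w + 0.1 * rng.normal(size=2000)
r2 = norm_prediction_r2(x[:1500], y[:1500], x[1500:], y[1500:])
```

With a mostly-linear target as above, the held-out R² is close to 1; on real activations the paper reports R² between roughly 0.7 and 0.95.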
In Appendix A, we compare the R² of this approach across layers versus other baselines like token identity; we find that using activations does much better, and thus it is not “easy” to predict error norm. 4.2 Predicting SAE Error Vectors We next examine the R² of the optimal linear transform b* from x to SaeError(x):

b* := argmin_{b ∈ ℝ^{d×d}} ∥ b · x − SaeError(x) ∥₂   (11)

As we show in Fig. 2 (middle column) and in Appendix A for additional layers, the R² of these transforms varies widely, between 15% and 72%; this is less than the R² for our norm prediction experiments, but still much higher than we might expect. Intuitively, this result implies that there are large linear subspaces that the SAE is mostly failing to learn. There is also a clear pattern across SAE L0 and width: R² decreases with increasing SAE width and L0. Interestingly, this pattern is not the same as it was above for SAE error norm: the R² of error norm predictions increases with SAE L0, while it decreases for error vector predictions. One concern might be that b* is mostly reversing feature shrinkage; in Section A.1, we show that this is not the case. 4.3 Nonlinear FVU and Breaking Down SAE Dark Matter Another related metric we are interested in is the total amount of the original activation x that we fail to explain using both Sae(x) and a linear projection of x. That is, assuming we have found b* as in Eq.
11, we are interested in the fraction of variance unexplained (FVU) by the sum of Sae(x) and b* · x:

FVU_nonlinear := 1 − R²(x, Sae(x) + b* · x)   (12)

We label this quantity FVU_nonlinear because it is intuitively the amount of the SAE's unexplained variance that is not a linear projection of the input. Interestingly, we find that for middle layers (layer 12 Gemma 2 2B and layer 20 Gemma 2 9B) at a fixed L0, FVU_nonlinear is approximately constant across SAE width (see Fig. 2), while for other layers it decreases slightly as width increases (see Appendix A). That is, even though we can linearly predict a smaller portion of the error vector in larger SAEs, this effect is counteracted by the fact that the SAE error vector itself is getting smaller. In contrast, FVU_nonlinear decreases as SAE L0 increases. As stated above, for middle layers we observe that FVU_nonlinear is roughly constant at a fixed sparsity. Because we observe this across multiple orders of magnitude of SAE scale, we hypothesize that this will continue to hold as we scale the SAE. Using this assumption, we plot a horizontal fit for FVU_nonlinear and a power law fit with a constant for Gemma SAE reconstructions as SAE width scales (choosing the SAEs closest to L0 ≈ 60) in Fig. 1. The power law fit asymptotes above the horizontal fit, which implies the presence of linear error even at very large SAE width. The absent features component of the error in Fig.
1 comes from the following observation: if our hypothesis is correct that nonlinear error is roughly constant as the layer 20 SAEs scale, then the not yet learned features must reside in the linear error component, since this is the only component that decreases as the SAE scales and learns more features. We investigate this hypothesis further in Section B.2.

Figure 3: R² of SAE error norm predictions (see Eq. 10) for Gemma Scope SAEs with L0 ≈ 50 and Llama Scope SAEs with L0 = k = 50.

Figure 4: R² for linear probes of per token SAE errors of larger SAEs from smaller SAEs. Prediction accuracy decreases as the SAEs get farther apart in scale, but overall remains high.

4.4 Predicting SAE Per-Token Error Norms We now examine per-token SAE scaling behavior. Given two SAEs, SAE_1 and SAE_2, we are interested in how much of the variance in error norms of SAE_2 is predictable from error norms of SAE_1. That is, we want to find c* such that

c* := argmin_{c ∈ ℝ} ∥ c · ∥SaeError_1(x)∥₂² − ∥SaeError_2(x)∥₂² ∥₂   (13)

Note that in practice, although a* can also predict the norm of SAE error, it requires training the target SAE to learn a probe. Here, on the other hand, although we formulate finding c* as an optimization problem that requires a larger SAE, in practice we do not need to actually train the larger SAE to get interesting insights: since c* has just one component, it simply measures how well small SAE error can be multiplied by a scalar to predict large SAE error.
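Because c* in Eq. 13 is a single scalar, fitting it reduces to a one-parameter regression without an intercept. A small sketch on synthetic per-token squared error norms (the numbers below are illustrative, not taken from the paper):

```python
import numpy as np

def per_token_scaling_r2(small_err_sq, large_err_sq):
    """Fit a single scalar c* so that c* * ||SaeError_1(x)||^2 predicts
    ||SaeError_2(x)||^2 (Eq. 13), and report the R^2 of that prediction."""
    # closed-form least squares for a single coefficient, no intercept
    c = np.dot(small_err_sq, large_err_sq) / np.dot(small_err_sq, small_err_sq)
    pred = c * small_err_sq
    ss_res = np.sum((large_err_sq - pred) ** 2)
    ss_tot = np.sum((large_err_sq - large_err_sq.mean()) ** 2)
    return c, 1.0 - ss_res / ss_tot

# toy scenario: a larger SAE whose per-token errors are ~0.6x the smaller SAE's
rng = np.random.default_rng(0)
small = rng.uniform(1.0, 4.0, size=500)            # per-token squared error norms
large = 0.6 * small + 0.05 * rng.normal(size=500)  # mostly a rescaling
c, r2 = per_token_scaling_r2(small, large)
```

When large-SAE errors are close to a rescaling of small-SAE errors, as in this toy, R² is near 1; this is the regime the paper reports for Gemma Scope SAE pairs.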
If the R² is high, we know that on tokens that small SAEs perform poorly on, larger SAEs will as well. In Fig. 4, we plot the R² of c* probes on all pairs of Gemma Scope 9B layer 20 SAEs with L0 ≈ 60 (restricting to pairs where SAE_2 is larger than SAE_1), and find that indeed, per token SAE errors are highly predictable. Additionally, we show concretely what these correlated SAE error norms look like on a set of 100 tokens from the Pile in Fig. 5.

Figure 5: Per token scaling with average nonlinear error, layer 20 Gemma 9B SAEs from Gemma Scope closest to L0 = 60.

5 Analyzing Components of SAE Error In previous sections, we broke down the SAE error vector into a linearly predictable component and a non-linearly predictable component. A reasonable question is whether these error subsets meaningfully differ. Thus, in this section, we run experiments on these components to determine how the linear and non-linear components of SaeError(x) differ. For convenience, given a probe b* from Eq. 11, we write

LinearError(x) := b* · x   (14)
NonlinearError(x) := SaeError(x) − LinearError(x) = SaeError(x) − b* · x   (15)

In Section B.2 we discuss a hypothesis for SAE feature activations that might explain why we observe these differences; this appendix section is not necessary for understanding the experiments in this section, and we will refer to it in the few cases where it provides additional intuition.
We run the following experiments in this section to understand how the linear and nonlinear components of SAE error differ: 1. In Section 5.1, we run the norm prediction test from Eq. 10 on different components of the error. We find that the norm of the nonlinear component is less predictable from activations than other components, implying that it might consist of fewer not-yet-learned features. 2. We train new SAEs directly on the linear and nonlinear components of error. The SAE trained on nonlinear error converges to higher reconstruction loss and produces less interpretable features. 3. We examine how much each component contributes to downstream model performance by intervening in the forward pass, and find that both components contribute proportionally to their size to increased cross entropy loss. Overall, these experiments suggest that our split of SAE dark matter into these two categories is indeed a meaningful one.

Figure 6: Violin plot of norm prediction tests for all SAEs (layer 12 Gemma 2B SAEs, layer 20 Gemma 9B SAEs, all Gemma Scope SAEs with L0 closest to 60 across layers, and all Llama Scope SAEs). We plot the R² of a linear regression from x to each random vector's norm squared.

Figure 7: Auto-interpretability results on SAEs trained on the linear and nonlinear components of SaeError(x) on a width 16k, L0 ≈ 60, layer 20 Gemma Scope SAE. “Not” represents contexts that the SAE latent did not activate on, while each Qi represents activating examples from decile i.

5.1 Applying the Norm Prediction Test: For our first experiment, we run the norm prediction test from Eq. 10 on five different random vectors: x, Sae(x), SaeError(x), LinearError(x), and NonlinearError(x).
The results are shown as a violin plot for each component across all 329 SAEs we experiment with in Section 4 (layer 12 Gemma 2B SAEs, layer 20 Gemma 9B SAEs, all Gemma Scope SAEs with L0 closest to 60 across layers, and all Llama Scope SAEs) in Fig. 6 (the Sae(x) bar can be seen as a summary of Fig. 2). There are one or two SAEs with an outlier R² equal to or lower than 0; we omit these from the plot because they are likely due to numeric instability in the linear regression routine. The intuition behind this test is that if a random vector consists of a sum of almost-orthogonal linear features from x, then its norm should be linearly predictable from x (see Appendix B for more on this intuition). Firstly, we note that ∥x∥₂² can be almost perfectly predicted from x. This is reassuring news for the linear representation hypothesis, as it implies that x may be well modeled as the sum of many one-dimensional features, at least from the perspective of this test. We also find that NonlinearError(x) (and SaeError(x), which consists partly of NonlinearError(x)) has a notably lower score on this test than LinearError(x). This result suggests that NonlinearError(x) does not consist of a sparse sum of linear features from x. 5.2 Training SAEs on SaeError(x) Components: Another empirical test we run is training an SAE on NonlinearError(x) and LinearError(x).
We choose a fixed Gemma 9B Gemma Scope layer 20 SAE with 16k latents and L0 ≈ 60 to generate SaeError(x) from. This SAE has nonlinear and linear error components that are approximately equal in norm and in the R^2 of the total SaeError(x) they explain, so it presents a fair comparison. We train SAEs to convergence (about 100M tokens) on each of these components of error and find that the SAE trained on NonlinearError(x) converges to a fraction of variance unexplained an absolute 5 percent higher than the SAE trained on the linear component of SAE error (≈ 0.59 and ≈ 0.54 respectively). We additionally examine the interpretability of the learned SAE latents using automated interpretability (this technique was first proposed by Bills et al. (2023) for interpreting neurons, and first applied to SAEs by Cunningham et al. (2023)). Specifically, we use the implementation introduced by Juang et al. (2024), where a language model (we use Llama 3.1 70b (AI@Meta, 2024)) is given top activating examples to generate an explanation, and then must use only that explanation to predict whether the feature fires on a test context. Our results in Fig. 7 show that the SAE trained on linear error produces latents that are about an absolute 5% more interpretable across all activation firing deciles (we average results across 1000 random features for both SAEs, where for each feature we use 7 examples from each of the 10 feature activation deciles as well as 50 negative examples, and show 95% confidence intervals).
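As a rough sketch of how such an evaluation set can be assembled (the function name and sampling details below are hypothetical illustrations; the actual pipeline is the Juang et al. (2024) implementation), one can bucket a latent's activating contexts into activation deciles and sample positives from each decile plus non-activating negatives:

```python
import numpy as np

def sample_eval_contexts(acts, n_per_decile=7, n_negative=50, rng=None):
    """Split contexts where a latent fires into activation deciles and sample
    examples for detection-style auto-interp scoring (a sketch of the setup).

    acts: per-context activation values of one SAE latent (0 where it is off).
    Returns (list of 10 index arrays, one per decile; array of negative indices).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    pos = np.flatnonzero(acts > 0)
    neg = np.flatnonzero(acts == 0)
    order = pos[np.argsort(acts[pos])]           # ascending activation strength
    deciles = np.array_split(order, 10)          # Q1 = weakest ... Q10 = strongest
    sampled = [rng.choice(dec, size=min(n_per_decile, len(dec)), replace=False)
               for dec in deciles]
    negatives = rng.choice(neg, size=min(n_negative, len(neg)), replace=False)
    return sampled, negatives
```

The scoring model then sees only the generated explanation and must classify each sampled context as activating or not, which is what the per-decile accuracies in Fig. 7 report.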
Figure 8: Results of intervening in the forward pass and replacing x with Sae(x) + NonlinearError(x) and with Sae(x) + LinearError(x) during the forward pass on all layer 20 Gemma 9B SAEs with L0 ≈ 60 or width of 16k. Reported as the percent of cross entropy loss recovered with respect to the difference between the same intervention with Sae(x) and the normal model forward pass. 5.3 Downstream Cross Entropy Loss of SaeError(x) Components: A common metric used to test SAEs is the percent of cross entropy loss recovered when the SAE reconstruction is inserted into the model in place of the original activation versus an ablation baseline (see e.g. Bloom (2024)). We modify this test to specifically examine the different components of SaeError(x): we compare the percent of cross entropy loss recovered when replacing x with Sae(x) plus either LinearError(x) or NonlinearError(x) to the baseline of inserting just Sae(x) in place of x, on all layer 20 Gemma 9B SAEs with L0 ≈ 60 or width of 16k. To estimate how much each component "should" recover, we use two metrics: the average norm of the component relative to the total norm of Sae(x), and the percent of the variance that the component recovers between Sae(x) and x. The results in Fig. 8 show that for the most part these metrics are reasonable predictions for both types of error.
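The recovery metric just described can be sketched in a few lines; the loss values in the example are hypothetical, not measurements from the paper:

```python
def pct_loss_recovered(loss_patched, loss_sae_only, loss_clean):
    """Fraction of the cross-entropy-loss gap between the Sae(x)-only patch and
    the clean forward pass that is closed by adding an error component back in."""
    return (loss_sae_only - loss_patched) / (loss_sae_only - loss_clean)

# Hypothetical numbers: clean loss 2.0, Sae(x)-only patch 3.0, and patching
# Sae(x) + component gives 2.6 -> the component recovers 40% of the gap.
recovered = pct_loss_recovered(2.6, 3.0, 2.0)
```

A component "pulling its weight" should recover roughly its share of the error, which is the comparison Fig. 8 makes against the two norm- and variance-based predictions.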
That is, both NonlinearError(x) and LinearError(x) contribute to the SAE's increase in downstream cross entropy loss in proportion to their size, with a possibly slightly higher contribution than expected for LinearError(x). 6 Reducing NonlinearError(x): In past sections we identify NonlinearError(x) and LinearError(x) and argue that NonlinearError(x) likely does not consist of not-yet-learned linear features. This may be a problem for scaling SAEs further: if a part of SAE error persistently consists of nonlinear transformations of linear features, SAEs may never be able to learn it. Thus, in this section, we investigate two approaches for reducing NonlinearError(x): 1. Using a more powerful encoder at inference time. This approach has limited success, reducing total FVU by 3–5% but leaving FVU_nonlinear almost unchanged. 2. Attempting to predict NonlinearError(x) from SAE reconstructions of adjacent model components. This approach is more successful; on the SAE we test on, Sae(x) predicts 50% of the nonlinear error, and previous model components predict between 4% and 14% of the nonlinear error, although only between 6% and 10% of the overall SAE error. 6.1 Using a More Powerful Encoder (a) SAE error breakdown vs. SAE width for inference time optimized reconstructions from Gemma Scope L0 ≈ 60 dictionaries.
(b) The R^2 of linearly predicting parts of SAE error from the SAE reconstructions of adjacent model components, for a layer 20 Gemma Scope L0 ≈ 60, width 16k SAE. Figure 9: Investigations towards reducing nonlinear SAE error. Our first approach for reducing nonlinear error is to try improving the encoder. We use a recent approach suggested by Smith (2024a): applying a greedy inference time optimization (ITO) algorithm called gradient pursuit to a frozen learned SAE decoder matrix. We implement and run ITO on all layer 20 Gemma Scope 9B SAEs closest to L0 ≈ 60. For each example x with reconstruction Sae(x), we run the gradient pursuit implementation with an L0 exactly equal to the L0 of x in the original Sae(x). Using these new reconstructions of x, we repeat Eq. 11 and learn a linear transformation from x to the inference time optimized reconstructions. We then regenerate the same scaling plot as Fig. 1 and show this figure in Fig. 9(a). Our first finding is that gradient pursuit indeed decreases the total FVU of Sae(x) by 3 to 5%; as Smith (2024a) only showed an improvement on a small 1-layer model, to the best of our knowledge we are the first to show this result on state of the art SAEs. Our most interesting finding, however, is that FVU_nonlinear stays almost constant when compared to the original SAE scaling in Fig. 1. We hypothesize that this might happen because ITO reduces "easy" linear errors like feature shrinkage. In Fig. 1, we plot the additional reduction in FVU_nonlinear as the contribution of encoder error; because FVU_nonlinear stays almost constant, this section is very narrow.
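Gradient pursuit proper is described in Smith (2024a); the sketch below is a simplified orthogonal-matching-pursuit-style stand-in for the same idea, greedily encoding x against a frozen decoder at a fixed L0 (the dictionary sizes and the refit-on-support step are our simplifications, not the paper's exact algorithm):

```python
import numpy as np

def greedy_ito(x, decoder, k):
    """Greedy sparse coding of x against frozen decoder rows (n_latents x d).
    At each step: pick the atom most correlated with the residual, refit the
    coefficients on the selected support by least squares, and repeat."""
    support, residual = [], x.copy()
    for _ in range(k):
        scores = np.abs(decoder @ residual)
        scores[support] = -np.inf               # never reselect an atom
        support.append(int(np.argmax(scores)))
        atoms = decoder[support]                # (|support|, d)
        coef, *_ = np.linalg.lstsq(atoms.T, x, rcond=None)
        residual = x - atoms.T @ coef
    return atoms.T @ coef, support
```

Because the decoder is frozen and only the encoding improves, any error this removes is encoder error by construction, which is why the near-constant FVU_nonlinear under ITO is informative.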
6.2 Predicting SAE Errors Between SAEs: For the Gemma 2 architecture at the locations the SAEs are trained on, each residual activation can be decomposed in terms of prior components: Resid_layer = MlpOut_layer + RMSNorm(O_proj(AttnOut_layer)) + Resid_{layer−1} (16). We focus on layer = 20 and Gemma 2 9B, and thus use the layer 19 attention and MLP Gemma Scope SAEs. For all components we use the width 16k SAEs with L0 ≈ 60. In Fig. 9(b), we plot the R^2 of a regression from the SAE output corresponding to each of these right hand side components to each of the different components of an SAE trained on Resid_layer (SaeError(x), LinearError(x), and NonlinearError(x)). We find that we can explain a small amount (up to ≈ 10%) of the total SaeError(x) using previous components, which may be immediately useful for circuit analysis. We can also explain large parts of NonlinearError(x) with prior components.
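The regressions in this subsection all share one shape: fit a linear map from one set of activations to a set of error vectors and report the variance explained. A minimal sketch (synthetic data; the array sizes are arbitrary assumptions):

```python
import numpy as np

def error_r2(source, error):
    """R^2 of a least-squares linear map (with intercept) from source
    activations to error vectors, as in the cross-component regressions."""
    A = np.hstack([source, np.ones((len(source), 1))])
    W, *_ = np.linalg.lstsq(A, error, rcond=None)
    resid = error - A @ W
    return 1.0 - np.sum(resid**2) / np.sum((error - error.mean(0))**2)
```

Applied to an error that really is a linear function of the source, error_r2 approaches 1; applied to an error independent of the source, it stays near 0, which is the contrast Fig. 9(b) draws between the different error components and predictors.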
While Sae(x) explains 50% of the variance in the nonlinear error, this may be somewhat misleading, as the nonlinear error is partially a function of Sae(x): NonlinearError(x) = SaeError(x) − LinearError(x) = (x − Sae(x)) − LinearError(x). These results mean that we might be able to explain some of the SAE error using a circuits-level view, but that overall there are still large parts of each error component unexplained. 7 Conclusion: The fact that SAE error can be predicted and analyzed at all is surprising. Thus, our findings are intriguing evidence that SAE errors, and not just SAE reconstructions, are worthy of analysis, and we hope that they inspire further work in decomposing SAE error. Concretely, we believe that our study of error has already uncovered a number of promising directions for future SAE research. The discovery that SAE errors can be significantly predicted from model activations hints that some subspaces may resist being learned and points toward potential architectural innovations such as incorporating low-rank dense side channels. The predictable relationship between small and large SAE errors may also streamline experimentation with novel SAE architectures by allowing researchers to efficiently forecast scaling behavior through small-scale trials. Additionally, our demonstration that prior SAE outputs can explain part of SAE error has immediate practical implications for circuit analysis, which currently relies on large noise terms that complicate circuit interpretation. At a higher level, the presence of constant nonlinear error at a fixed sparsity as we scale implies that scaling SAEs may not be the only (or best) way to explain more of model behavior.
Future work might explore alternative penalties besides sparsity or new ways to learn better dictionaries. Ultimately, we believe that there is still room to make SAEs better, not just bigger. Acknowledgments Our work benefited greatly from the thoughtful comments and discussions provided by (in alphabetical order) Joseph Bloom, Lauren Greenspan, Jake Mendel, and Eric Michaud. We are deeply appreciative of their contributions. This work was supported by Erik Otto, Jaan Tallinn, the Rothberg Family Fund for Cognitive Science, and IAIFI through NSF grant PHY-2019786. JE was supported by the NSF Graduate Research Fellowship (Grant No. 2141064). LS was supported by the Long Term Future Fund. References AI@Meta (2024) AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md. Alain (2016) Guillaume Alain. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016. Anders & Bloom (2024) Evan Anders and Joseph Bloom. Examining language model performance with reconstructed activations using sparse autoencoders. LessWrong, 2024. URL https://www.lesswrong.com/posts/8QRH8wKcnKGhpAu2o/examining-language-model-performance-with-reconstructed. Anthropic (2024) Transformer Circuits Team Anthropic. Circuits updates April 2024, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#scaling-laws. Bills et al. (2023) Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (date accessed: 14.05.2023), 2023. Bloom (2024) Joseph Bloom. Open source sparse autoencoders for all residual stream layers of GPT2 small. https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024. Bricken et al.
(2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html. Bussmann et al. (2024) Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, and Neel Nanda. Stitching SAEs of different sizes. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/baJyjpktzmcmRfosq/stitching-saes-of-different-sizes. Candès et al. (2011) Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011. Csordás et al. (2024) Róbert Csordás, Christopher Potts, Christopher D Manning, and Atticus Geiger. Recurrent neural networks learn to store and generate sequences using non-linear representations. arXiv preprint arXiv:2408.10920, 2024. Cunningham et al. (2023) Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023. Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html. Engels et al. (2024) Joshua Engels, Isaac Liao, Eric J Michaud, Wes Gurnee, and Max Tegmark. Not all language model features are linear. arXiv preprint arXiv:2405.14860, 2024.
Foucart & Rauhut (2013) Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer, New York, 2013. ISBN 978-0-8176-4947-0. doi: 10.1007/978-0-8176-4948-7. Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. Gao et al. (2024) Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024. Giglemiani et al. (2024) Giorgi Giglemiani, Nora Petrova, Chatrik Singh Mangat, Jett Janiak, and Stefan Heimersheim. Evaluating synthetic activations composed of SAE latents in GPT-2. arXiv preprint arXiv:2409.15019, 2024. Gurnee (2024) Wes Gurnee. SAE reconstruction errors are (empirically) pathological. AI Alignment Forum, 2024. He et al. (2024) Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders. arXiv preprint arXiv:2410.20526, 2024. Heimersheim & Mendel (2024) Stefan Heimersheim and Jake Mendel. Activation plateaus & sensitive directions in GPT2. LessWrong, 2024. URL https://www.lesswrong.com/posts/LajDyGyiyX8DNNsuF/interim-research-report-activation-plateaus-and-sensitive-1. Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023. Juang et al. (2024) Caden Juang, Gonçalo Paulo, Jacob Drori, and Nora Belrose.
Open source automated interpretability for sparse autoencoder features. https://blog.eleuther.ai/autointerp/, 2024. Karvonen et al. (2024) Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. arXiv preprint arXiv:2408.00113, 2024. Lad et al. (2024) Vedang Lad, Wes Gurnee, and Max Tegmark. The remarkable robustness of LLMs: Stages of inference? arXiv preprint arXiv:2406.19384, 2024. Lee & Heimersheim (2024) Daniel Lee and Stefan Heimersheim. Investigating sensitive directions in GPT-2: An improved baseline and comparative analysis of SAEs. LessWrong, 2024. URL https://www.lesswrong.com/posts/dS5dSgwaDQRoWdTuu/investigating-sensitive-directions-in-gpt-2-an-improved. Lieberum et al. (2024) Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024. Makhzani & Frey (2013) Alireza Makhzani and Brendan Frey. k-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013. Marks et al. (2024) Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024. Mendel (2024) Jake Mendel. SAE feature geometry is outside the superposition hypothesis. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis. Nanda et al. (2023) Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023. Olah (2023) Chris Olah. Interpretability dreams.
Transformer Circuits, May 2023. URL https://transformer-circuits.pub/2023/interpretability-dreams/index.html. Park et al. (2023) Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023. Rajamanoharan et al. (2024) Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024. Smith (2024a) Lewis Smith. Replacing SAE encoders with inference-time optimisation. https://www.alignmentforum.org/s/AtTZjoDm8q3DbDT8Z/p/C5KAZQib3bzzpeyrg, 2024a. Smith (2024b) Lewis Smith. The 'strong' feature hypothesis could be wrong. AI Alignment Forum, 2024b. URL https://www.alignmentforum.org/posts/tojtPCCRpKLSHBdpn/the-strong-feature-hypothesis-could-be-wrong. Tasissa et al. Abiy Tasissa, Manos Theodosis, Bahareh Tolooshams, and Demba E Ba. Discriminative reconstruction via simultaneous dense and sparse coding. Transactions on Machine Learning Research. Tasissa et al. (2020) Abiy Tasissa, Emmanouil Theodosis, Bahareh Tolooshams, and Demba Ba. Towards improving discriminative reconstruction via simultaneous dense and sparse coding. arXiv preprint arXiv:2006.09534, 2020. Team et al. (2024) Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024. Templeton et al. (2024) Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan.
Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. (a) Linear prediction results for layer 5 Gemma 2 2B SAEs from Gemma Scope. (b) Linear prediction results for layer 19 Gemma 2 2B SAEs from Gemma Scope. Figure 10: Results of linearly predicting SAE error norm and SAE error from model activations on Gemma 2 2B layer 5 (top) and Gemma 2 2B layer 19 (bottom). The left plots show the R^2 of predicting SAE error norms (see Eq. 10), the middle plots show the R^2 of predicting SAE error vectors (see Eq. 11), and the right plots show 1 − R^2 of predicting model activations given the SAE reconstruction and the SAE error vector prediction. Unlike for the middle layers shown in the main body, FVU_nonlinear decreases some as width is scaled, although the sparsity of the SAEs in this regime makes this harder to verify. (a) Linear prediction results for layer 9 Gemma 2 9B SAEs from Gemma Scope. (b) Linear prediction results for layer 31 Gemma 2 9B SAEs from Gemma Scope. Figure 11: Results of linearly predicting SAE error norm and SAE error from model activations on Gemma 2 9B layer 9 (top) and Gemma 2 9B layer 31 (bottom). The left plots show the R^2 of predicting SAE error norms (see Eq. 10), the middle plots show the R^2 of predicting SAE error vectors (see Eq. 11), and the right plots show 1 − R^2 of predicting model activations given the SAE reconstruction and the SAE error vector prediction. Unlike for the middle layers shown in the main body, FVU_nonlinear decreases some as width is scaled, although the sparsity of the SAEs in this regime makes this harder to verify.
Appendix A Extra Error Prediction Experiments A.1 Note on Feature Shrinkage: Earlier SAE variants were prone to feature shrinkage: the observation that Sae(x) systematically undershot x. Current state of the art SAE variants (e.g. JumpReLU SAEs and TopK SAEs, which we examine in this work) are less vulnerable to this problem, although we still find that Gemma Scope reconstructions have about a 10% smaller norm than x. One potential concern is that the b* in Eq. 11 that we learn is merely predicting this shrinkage. If this were the case, then the cosine similarity of the linear error prediction b*·x with x would be close to 1; however, in practice we find that it is around 0.5, so b* is indeed doing more than predicting shrinkage. A.2 Additional Error Prediction Plots: In Fig. 10 and Fig. 11, we show the accuracy of SAE error norm predictions, SAE error vector predictions, and FVU_nonlinear for layers 5 and 19 of Gemma 2 2B and layers 9 and 31 of Gemma 2 9B. As described in the main text, we find broadly similar results for SAE error norm and error vector prediction at these layers, but find that FVU_nonlinear decreases some as we increase SAE width at a fixed L0, although this result is uncertain because of the sparsity of Gemma Scope SAEs at these layers. A.3 Norm Prediction Baselines: In Fig. 14, we run a linear regression from different model variables to ||SaeError(x)|| across layers on Gemma Scope 9B for SAEs of size 131k. We find that it is not "easy" to predict SAE error norm, especially at later layers; the token identity, SAE L0, activation norm, and model loss all do significantly worse than using the full activation.
It is interesting to note that at the first few layers, token identity does better at predicting SAE error than a probe of the activations; this is perhaps not surprising, since recent results from e.g. Lad et al. (2024) show that very early layers primarily operate on a per-token level. A.4 Breaking Apart Error Per Token: In Fig. 12, we show the same subset of tokens as in Fig. 5, but now broken apart into linearly predictable and non-linearly predictable components. That is, we learn b* for each SAE as in Eq. 11, and then plot the norm of b*·x as the norm of the linearly predictable error on the left, and the norm of SaeError(x) − b*·x as the norm of the non-linearly predictable error on the right. We see that the linearly predictable error decreases as we scale SAE width, but the non-linearly predictable error mostly stays constant. This is especially interesting because the result in Fig. 1 found this only on an average level, whereas here we find that the same result holds on a per-token level. Figure 12: Per-token breakdown of linearly predictable and non-linearly predictable SAE error across SAE scale. We show the same tokens as in Fig. 5. The norm of linear error decreases with SAE width, whereas the norm of nonlinear error stays mostly constant. Appendix B Modeling Activations: In this section, we seek to explain why we see a difference between linear and nonlinear error in the main body of the paper. We will adopt the weak linear hypothesis (Smith, 2024b), a generalization of the linear representation hypothesis which holds only that some features in language models are represented linearly.
Thus we have x = Σ_{i=0}^{n} w_i y_i + Dense(x) (17) for linear features {y_1, …, y_n} and a random vector w ∈ R^n, where w is sparse (||w||_1 ≪ d) and Dense(x) is a random vector representing the dense component of x. Dense(x) might be Gaussian noise, nonlinear features as described by Csordás et al. (2024), or anything else not represented in a low-dimensional linear subspace. Figure 13: R^2 for linear regressions of SAE error norms with different regressors. We run on Gemma Scope 9B SAEs of size 131k with L0 ≈ 60. Activations perform the best except on the first few layers. Figure 14: Average SAE latent activation and dot product of the latent with the learned norm prediction vector a* for the Gemma Scope layer 20, width 131k, L0 = 62 SAE. We also plot a smoothed version of this dot product with a smoothing window of 10. Before we proceed with our analysis, we note that there is a rich sparse coding literature studying dictionary learning in the setting with mixed sparse and dense signals. For example, in a classic work Candès et al. (2011) propose Robust Principal Component Analysis for decomposing data matrices into sparse and dense components, although this is not directly applicable to our setting of trying to learn an autoencoder. More recently, Tasissa et al. propose a dictionary learning technique for data exactly modeled as in Eq. 17.
In our work, we focus on studying existing SAEs and hypothesizing why they fail, so these works are not immediately applicable, but we are excited to see future work that applies these combined sparse-dense autoencoders to language model activations. Say our SAE has m latents. Since by assumption Dense(x) cannot be represented in a low-dimensional linear subspace, the sparsity-limited SAE will not be able to learn it. Thus, we will assume that the SAE learns only the m most common features y_0, …, y_{m−1}. We will also assume that the SAE introduces some error when making this approximation, and instead learns ŷ_i and ŵ_i. Thus we have Sae(x) = Σ_{i=0}^{m} ŵ_i ŷ_i (18) and SaeError(x) = Dense(x) + (Σ_{i=0}^{m} ŵ_i ŷ_i − Σ_{i=0}^{m} w_i y_i) + Σ_{i=m}^{n} w_i y_i (19). We finally define Introduced(x) := Σ_{i=0}^{m} ŵ_i ŷ_i − Σ_{i=0}^{m} w_i y_i, so we have SaeError(x) = Dense(x) + Introduced(x) + Σ_{i=m}^{n} w_i y_i (20). B.1 Analyzing Error Norm Prediction: We will first analyze Eq. 10, the learned probe from x to ||SaeError(x)||_2^2. First, we claim that if a vector x is a sparse sum of orthogonal vectors, then by basic linear algebra there exists a perfect prediction vector a such that a^T x ≈ ||x||_2^2 (in other words, the norm squared of x can be linearly predicted from x). The proof of this claim is in Section C.1; the intuition is that we can set the probe vector a* to the sum of the vectors y_i weighted by their average weight E(w_i). When x is instead a sparse sum of non-orthogonal vectors, as it partly is in Eq. 17 and Eq. 20, this proof no longer holds, but we now argue that a similar intuition does. If the y_i are almost orthogonal (formally, coherence < ε, see (Foucart & Rauhut, 2013)) and do not activate much at the same time, then a probe vector again equal to the sum of the vectors y_i weighted by their average value E(w_i) will be a good approximate prediction. Indeed, when we try predicting ||Sae(x)||_2^2 from Sae(x) (which is a sparse sum of known almost orthogonal vectors of a similar distribution to the true SAE vectors), we find that the linear probe that is learned is approximately equal to this sum (see Section C.2). Thus, we can now neatly explain why we can predict the norms of SAE errors: they mostly consist of almost orthogonal, sparsely occurring, not-yet-learned SAE features!
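The orthogonal case of this claim can be checked numerically. The sketch below (dimensions and magnitudes are arbitrary assumptions) uses the special case where each feature fires with a fixed per-feature magnitude c_i; then the probe a* = Σ_i c_i y_i predicts ||x||_2^2 exactly, since orthonormality makes a*·x pick out exactly the squared magnitudes of the active features:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_feats, k = 256, 64, 6

# Orthonormal feature directions and a fixed firing magnitude per feature.
Y = np.linalg.qr(rng.standard_normal((d, d)))[0][:n_feats]
c = rng.uniform(0.5, 2.0, n_feats)

# Probe from the claim: sum of features weighted by their magnitudes.
a_star = c @ Y

worst = 0.0
for _ in range(1000):
    idx = rng.choice(n_feats, size=k, replace=False)
    x = c[idx] @ Y[idx]              # sparse sum of k active features
    worst = max(worst, abs(a_star @ x - x @ x))
# worst stays at floating-point noise: a_star @ x equals ||x||^2 exactly here
```

With random (rather than fixed) weights the prediction is approximate rather than exact, which matches the paper's use of average weights E(w_i) in the probe.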
We can further explain why larger SAEs have less predictable error norms: since $m$ is larger, the error contains a larger component of the not-as-linearly-predictable $\mathrm{Dense}(x)$ and $\mathrm{Introduced}(x)$.

B.2 Analyzing Error Vector Prediction

We will now analyze Eq. 11, the learned transformation from $x$ to $\mathrm{SaeError}(x)$, with our model of SAE error from Eq. 20. We assume that $\mathrm{Introduced}(x)$ cannot be approximated at all as a linear function of $x$. This is a reasonable assumption since the SAE is a nonlinear function of $x$, but if some of $\mathrm{Introduced}(x)$ can in fact be approximated in this way, then we will underestimate the amount of $\mathrm{Introduced}(x)$ and therefore overestimate the amount of $\mathrm{Dense}(x)$. If $\mathrm{Dense}(x) + \sum_{i=m}^{n} w_i y_i$ is contained in a linear subspace of $x$ orthogonal to $\sum_{i=0}^{m} w_i y_i$, then $\mathrm{Dense}(x)$ is part of the linearly explainable error, and the error of the transformation $b^*$ exactly equals $\mathrm{Introduced}(x)$ (since the transformation is just exactly this orthogonal linear subspace). However, if such an orthogonal linear subspace does not exist, the optimal linear transform will only be able to reconstruct part of $\mathrm{Dense}(x) + \sum_{i=m}^{n} w_i y_i$, and the percent of variance left unexplained by the regression will be an upper bound on the true variance explained by $\mathrm{Introduced}(x)$.
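A minimal sketch of this regression with synthetic stand-ins (the "SAE error" here is a made-up mixture of a linear map of $x$ plus a nonlinear component, not a real SAE):

```python
import numpy as np

# Hypothetical sketch of Section B.2: fit the best linear map b* from
# activations x to error vectors via least squares, and treat the fraction of
# variance it leaves unexplained as an upper bound on Introduced(x).
rng = np.random.default_rng(0)
d, n_samples = 32, 2000

X = rng.normal(size=(n_samples, d))              # stand-in activations
B_true = 0.3 * rng.normal(size=(d, d))           # linearly predictable part
nonlinear = 0.5 * np.tanh(X) ** 2                # stand-in Introduced(x)
E = X @ B_true + nonlinear                       # "SAE error" vectors

B_star, *_ = np.linalg.lstsq(X, E, rcond=None)   # b*: optimal linear transform
resid = E - X @ B_star
fvu = (resid ** 2).sum() / ((E - E.mean(0)) ** 2).sum()
print(0 < fvu < 1)
```

The residual fraction of variance (FVU) is small here because most of the synthetic error is linear in $x$; only the nonlinear component survives the regression.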
We also note that if this linear transform indeed recovers $\mathrm{Dense}(x)$ and the absent features but fails to recover $\mathrm{Introduced}(x)$, we can use it to estimate $\mathrm{Dense}(x)$: the difference between the variance explained by $\mathrm{Sae}(x)$ and the variance explained by $x - (\mathrm{Sae}(x) + b^* \cdot x)$ will approach $\mathrm{Dense}(x)$ as $m \to \infty$. Thus, our ability to estimate $\mathrm{Introduced}(x)$ and $\mathrm{Dense}(x)$ using $b^*$ depends on how well a linear transform can predict $\mathrm{Dense}(x)$ and $\sum_{i=m}^{n} w_i y_i$. Although we do not have access to the ground truth vectors $y_i$, we can replace $x$ with a similar distribution of vectors that we do have access to, using the same trick as above. Given an SAE, we replace $x$ with $x' = \mathrm{Sae}(x)$. $x'$ has the useful property that it is a sparse linear sum of known vectors (the ones that the SAE learned), and the distribution of these vectors and their weights is similar to that of the true features $y_i$. We now pass $x'$ back through the SAE and can control all of the quantities we are interested in: we can vary $m$ by masking SAE dictionary elements, simulate $\mathrm{Dense}(x')$ by adding Gaussian noise to $x'$, and simulate $\mathrm{Introduced}(x')$ by adding Gaussian noise to $\mathrm{Sae}(x')$.

Table 1: Correlation matrix between synthetic noise and estimated errors.

                     Estimated Dense(x')    Estimated Introduced(x')
    x' Noise               0.9842                   0.1417
    Sae(x') Noise          0.0988                   0.9036

We run this synthetic setup with a Gemma Scope layer 20 SAE (width 16k, $L_0 \approx 68$) in Section C.3, and find that indeed, estimated Dense($x'$) is highly correlated with the amount of Gaussian noise added to $x'$, and estimated Introduced($x'$) is highly correlated with the amount of Gaussian noise added to $\mathrm{Sae}(x')$ (see Table 1). However, note that because $x'$ noise is also slightly correlated with estimated Introduced($x'$), it is possible that some of the contribution to the estimated nonlinear error is from Dense($x'$). Thus, we again have a potential explanation for our initial results: we can predict error vectors because they consist in large part of not yet learned linear features in an almost orthogonal subspace of $x$; we can predict a smaller portion of larger SAE errors because the number of these linear features goes down with SAE width; and the horizontal line in Fig. 1 appears because $\mathrm{Dense}(x)$ and $\mathrm{Introduced}(x)$ are mostly constant. Furthermore, we can hypothesize from the correlations in Table 1 that the linearly predictable component of SAE error consists mostly of not yet learned features and $\mathrm{Dense}(x)$, while the component that is not linearly predictable consists mostly of $\mathrm{Introduced}(x)$. We will explore this hypothesis in Section 5.

B.3 Analyzing Per-Token Scaling Predictions

Finally, we provide a simple explanation for why per-token SAE errors are highly predictable between SAEs of different sizes. For this, we only need Eq. 20.
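The synthetic controls above can be sketched as follows (hypothetical sizes and random stand-ins for the SAE dictionary, not actual Gemma Scope latents):

```python
import numpy as np

# Minimal sketch of the Section B.2 synthetic setup: x' is a sparse sum over a
# known dictionary; we vary m by masking dictionary rows, simulate Dense(x')
# with Gaussian noise on x', and simulate Introduced(x') with Gaussian noise
# on the reconstruction.
rng = np.random.default_rng(0)
d, m_full, m_kept = 64, 256, 128

D = rng.normal(size=(m_full, d)) / np.sqrt(d)      # decoder directions
codes = rng.random(m_full) * (rng.random(m_full) < 0.05)
x_prime = codes @ D                                # x' = Sae(x) stand-in

mask = np.arange(m_full) < m_kept                  # smaller SAE: keep m latents
recon = (codes * mask) @ D                         # masked reconstruction

dense_noise = 0.02 * rng.normal(size=d)            # simulated Dense(x')
intro_noise = 0.02 * rng.normal(size=d)            # simulated Introduced(x')

sae_error = (x_prime + dense_noise) - (recon + intro_noise)
# By construction, the error splits into unlearned features + the two noises:
unlearned = (codes * ~mask) @ D
check = np.allclose(sae_error, unlearned + dense_noise - intro_noise)
print(check)
```

Because each component is injected separately, the correlations in Table 1 can then be measured by regressing the estimated error components against the known noise magnitudes.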
Since $\mathrm{Dense}(x)$ and $\mathrm{Introduced}(x)$ stay mostly constant as $m$ increases, for large $m$ the SAE error stays mostly constant because it is primarily determined by these components. Thus, since $m = 16\mathrm{k}$ is already large, a linear prediction that is just a slightly smaller version of the current error performs well. Additionally, this reasoning suggests a natural experiment: if we can predict $\mathrm{Introduced}(x)$ on a per-token level (which we hypothesize we can do with the non-linearly-predictable component of SAE error), we may be able to better predict the floor of SAE scaling and therefore better predict larger SAE errors; we run this experiment in Appendix E, where we find an affirmative answer.

Appendix C More Info on Modeling Activations

C.1 Proof of Claim from Section B.1

Say we have a set of $m$ unit vectors $y_1, y_2, \ldots, y_m \in \mathbb{R}^d$. We will call these "feature vectors". Define $Y \in \mathbb{R}^{d \times m}$ as the matrix with the feature vectors as columns. We then define the Gram matrix $G_Y \in \mathbb{R}^{m \times m}$ of dot products on $Y$:

$$(G_Y)_{ij} = (Y^T Y)_{ij} = y_i \cdot y_j$$

We now define a random column vector $x$ that is a weighted positive sum of the $m$ feature vectors, that is, $x = \sum_i w_i y_i$ for a non-negative random vector $w \in \mathbb{R}^m$. We say feature vector $y_i$ is active if $w_i > 0$.
We now define the autocorrelation matrix $R_w \in \mathbb{R}^{m \times m}$ for $w$ as

$$R_w = \mathbb{E}(w w^T).$$

We are interested in breaking down $x$ into its components, so we define a random matrix $X$ as $X_{ij} = w_j Y_{ij}$, i.e. the columns of $Y$ multiplied by $w$. We can now define the Gram matrix $G_X \in \mathbb{R}^{m \times m}$:

$$(G_X)_{ij} = (X^T X)_{ij} = w_i w_j \, y_i \cdot y_j$$

$$G_X = (w w^T) \odot G_Y$$

$$\mathbb{E}(G_X) = R_w \odot G_Y,$$

where $\odot$ denotes Schur (elementwise) multiplication. The intuition here is that the expected dot product between columns of $X$ depends on the dot product between the corresponding columns of $Y$ and the correlation of the corresponding elements of the random vector.
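The Schur-product identity can be verified numerically with hypothetical sizes:

```python
import numpy as np

# Numerical check of G_X = (w w^T) o G_Y per sample and, averaging over
# samples, E(G_X) = R_w o G_Y, where o is the Schur (elementwise) product.
rng = np.random.default_rng(0)
d, m, n = 16, 8, 4000

Y = rng.normal(size=(d, m))                   # feature vectors as columns
G_Y = Y.T @ Y

W = rng.random((n, m)) * (rng.random((n, m)) < 0.3)   # sparse weight samples

avg_G_X = np.zeros((m, m))
for w in W:
    X = Y * w                                 # column i of Y scaled by w_i
    avg_G_X += X.T @ X                        # per-sample G_X = (w w^T) o G_Y
avg_G_X /= n

R_w = W.T @ W / n                             # empirical autocorrelation E(w w^T)
ok = np.allclose(avg_G_X, R_w * G_Y)
print(ok)
```

The check is exact (up to floating point) because the factorization holds sample by sample, not just in expectation.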
We will now examine the squared L2 norm of $x$:

$$\|x\|_2^2 = \sum_{ij} w_i w_j \, y_i \cdot y_j = \mathrm{Tr}(w w^T G_Y) = w^T G_Y w$$

We can also take the expected value:

$$\mathbb{E}(\|x\|_2^2) = \mathrm{Tr}(R_w G_Y)$$

Our goal is to find a direction $a \in \mathbb{R}^d$ that, when dotted with $x$, predicts $\|x\|_2^2$. In other words, we want to find $a$ such that

$$\|x\|_2^2 \approx a^T x = a^T \sum_i w_i y_i = a^T Y w$$

Combining equations, we want to find $a$ such that

$$a^T Y w \approx \|x\|_2^2 = w^T G_Y w$$

Let us first consider the simple case where for all $i \neq j$, $y_i$ and $y_j$ are perpendicular.
Then our goal is to find $a$ such that

$$a^T Y w \approx w^T G_Y w = \sum_i \langle y_i, y_i \rangle w_i^2 = \sum_i w_i^2 = \|w\|_2^2 = w^T w$$

Since all of the $y_i$ are perpendicular, WLOG we can write $a = \sum_i b_i y_i + c$ for a vector $c \in \mathbb{R}^d$ perpendicular to all $y_i$ and a vector $b \in \mathbb{R}^m$. Then we have

$$a^T Y w = \Big( \sum_i b_i y_i + c \Big)^T Y w = b^T w$$

Since ordinary least squares produces an unbiased estimator, we know that if we use ordinary least squares to solve for $b$, then $\mathbb{E}(b^T w) = \mathbb{E}(w^T w)$. Thus,

$$\sum_i b_i \mathbb{E}(w_i) = \sum_i \mathbb{E}(w_i^2), \qquad b_i = \mathbb{E}(w_i^2) / \mathbb{E}(w_i)$$

Now that we have $b_i$, we can solve for the correlation coefficient between $a^T x = b^T w$ and $\|x\|_2^2 = w^T w$. This gets messy for general distributions, so we focus on a few simple cases.
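This least-squares argument can be sketched numerically (hypothetical orthonormal features with scaled-Bernoulli weights, for which $b_i = \mathbb{E}(w_i^2)/\mathbb{E}(w_i) = s_i$):

```python
import numpy as np

# Sketch of the OLS argument: for orthonormal features with scaled-Bernoulli
# weights, regressing ||x||_2^2 on x recovers coefficients
# b_i = E(w_i^2)/E(w_i) = s_i along each feature direction.
rng = np.random.default_rng(0)
d, m, n = 64, 16, 20_000

Y = np.linalg.qr(rng.normal(size=(d, m)))[0]     # orthonormal columns y_i
s = rng.uniform(0.5, 2.0, size=m)                # Bernoulli scales s_i
p = rng.uniform(0.1, 0.3, size=m)                # firing probabilities p_i

W = s * (rng.random((n, m)) < p)                 # w_i in {0, s_i}
X = W @ Y.T                                      # x = sum_i w_i y_i
target = (X ** 2).sum(axis=1)                    # ||x||_2^2 = sum_i w_i^2

a, *_ = np.linalg.lstsq(X, target, rcond=None)   # minimum-norm OLS probe
b = Y.T @ a                                      # coefficients of a along y_i
print(np.allclose(b, s, atol=1e-5))
```

Here the fit is exact because $\sum_i w_i^2 = \sum_i s_i w_i$ when each $w_i$ is either $0$ or $s_i$; this is the Bernoulli special case analyzed next.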
The first is the case where each $w_i$ is a scaled independent Bernoulli variable, so $w_i$ is $s_i$ with probability $p_i$ and $0$ otherwise. Then $b_i = s_i$. We also have that $\mathbb{E}(w^T w) = \mathbb{E}(b^T w) = \sum_i s_i^2 p_i = \mu$. Then

$$\rho = \frac{\mathbb{E}(b^T w \, w^T w) - \mu^2}{\sqrt{\mathbb{E}(w^T w \, w^T w) - \mu^2} \, \sqrt{\mathbb{E}(b^T w \, b^T w) - \mu^2}} = \frac{\sum_i s_i^4 (p_i - p_i^2)}{\sqrt{\sum_i s_i^4 (p_i - p_i^2)} \, \sqrt{\sum_i s_i^4 (p_i - p_i^2)}} = 1$$

That is, for Bernoulli variables, $a = \sum_i s_i y_i$ is a perfect regression vector. The second is the case where each $w_i$ is an independent Poisson random variable with parameter $\lambda_i$.
Then $\mathbb{E}(w_i) = \lambda_i$ and $\mathbb{E}(w_i^2) = \lambda_i^2 + \lambda_i$, so $b_i = \lambda_i + 1$. We also have that $\mathbb{E}(w^T w) = \mathbb{E}(b^T w) = \sum_i \lambda_i^2 + \lambda_i = \mu$. Finally, we will use the facts that $\mathbb{E}(w_i^3) = \lambda_i^3 + 3\lambda_i^2 + \lambda_i$ and $\mathbb{E}(w_i^4) = \lambda_i^4 + 6\lambda_i^3 + 7\lambda_i^2 + \lambda_i$. Then via algebra we have that

$$\rho = \frac{\sum_i 2\lambda_i^3 + 3\lambda_i^2 + \lambda_i}{\sqrt{\sum_i 4\lambda_i^3 + 6\lambda_i^2 + \lambda_i} \, \sqrt{\sum_i \lambda_i^3 + 2\lambda_i^2 + \lambda_i}}$$

For the special case $\lambda_i = 1$, we then have $\rho = \frac{6}{\sqrt{66}} \approx 0.73$.

C.2 Empirical Norm Prediction

In this experiment, we aim to determine to what extent our analysis in Section B.1 holds true in practice on almost orthogonal true SAE features. Thus, we use a random vector that we can control: $\mathrm{Sae}(x)$.
Specifically, we learn a probe $a^*$ for the Gemma Scope layer 20, width 131k, $L_0 = 62$ SAE as in Eq. 10, except with the regressor equal to $\mathrm{Sae}(x)$ and the target equal to $\|\mathrm{Sae}(x)\|$:

$$a^* = \arg\min_{a \in \mathbb{R}^d} \left\| a^T \cdot \mathrm{Sae}(x) - \|\mathrm{Sae}(x)\|_2^2 \right\|_2 \tag{21}$$

One important note is that we subtract the bias from $\mathrm{Sae}(x)$ so that it is purely a sparse sum of SAE features (this makes analysis easier). For each SAE latent, we then compute

$$(a^*)^T \cdot \mathrm{latent}_i \tag{22}$$

Finally, we plot this dot product against the average latent activation in Fig. 14. If $a^*$ indeed equals the sum of the latents weighted by their activation, as we predict in Section B.1, then these two quantities should be approximately equal, which we indeed see in the figure.

C.3 Synthetic SAE Error Vector Experiments

The results for different Gaussian noise amounts versus percentage of features ablated are shown in Fig. 15. On this distribution of vectors, the test works as expected: the variance explained by $\mathrm{Sae}(x) + a^T x$ is a horizontal line proportional to $\mathrm{Introduced}(x)$, while the gap between this horizontal line and the asymptote of the variance explained by $\mathrm{Sae}(x)$ is proportional to $\mathrm{Dense}(x)$.

Figure 15: Top: When controlled amounts of noise are added to the synthetic data $\mathrm{Sae}(x')$ and $x'$, the result is a plot similar to Fig. 1.
Bottom: The nonlinear and linear error estimates (as shown at top) accurately correlate with the amount of noise added. The exact correlations between synthetic added noise and the resulting estimated error components across these noise levels are shown in Table 1.

We also tried running this test on a sparse sum of random vectors, which did not work as well, possibly because such vectors do not capture the structure of real SAE vectors (Giglemiani et al., 2024); see Appendix D for more details.

Appendix D Synthetic Experiments with Random Data

For this set of experiments, we generated a random vector $x'$ that was the sum of a power law of 100k random Gaussian vectors in $\mathbb{R}^{4000}$ with an expected $L_0$ of around 100. To simulate the SAE reconstruction and SAE error, we simply masked a portion of the vectors in the sum of $x'$. Unlike the more realistic synthetic data case described in Section C.3, this did not work as expected: even with no noise added to $x'$ or the simulated reconstruction, the variance explained by the sum of the linear estimate of the error plus the reconstructed vectors, plotted against the number of features "ablated", formed a parabola (with minimum variance explained in the middle region), as opposed to a straight line as in Fig. 15. We note that this result is not entirely surprising: other works have found that random vectors are a bad synthetic test case for language model activations. For example, in the setting of model sensitivity to perturbations of activations, Giglemiani et al. (2024) found they needed to control for both the sparsity and the cosine similarity of SAE latents to produce synthetic vectors that mimic SAE latents when perturbed.
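The Appendix D construction can be sketched at a smaller scale (the paper uses 100k vectors in $\mathbb{R}^{4000}$; the sizes below are hypothetical stand-ins):

```python
import numpy as np

# Scaled-down sketch of the Appendix D setup: a sparse sum of random unit
# Gaussian directions with power-law firing frequencies, where the "SAE
# reconstruction" is simulated by masking part of the active features.
rng = np.random.default_rng(0)
n_feats, d, target_l0 = 2000, 256, 100

feats = rng.normal(size=(n_feats, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

freq = 1.0 / np.arange(1, n_feats + 1)            # power-law firing frequencies
freq = np.minimum(freq * target_l0 / freq.sum(), 1.0)

active = rng.random(n_feats) < freq               # expected L0 around target_l0
weights = rng.random(n_feats) * active
x_prime = weights @ feats                         # synthetic activation x'

keep = rng.random(n_feats) < 0.5                  # "ablate" half the features
recon = (weights * keep) @ feats                  # simulated SAE reconstruction
sim_error = x_prime - recon
print(sim_error.shape)
```

As the paper notes, this fully random construction behaves qualitatively differently from the SAE-derived synthetic data of Section C.3, which is the point of the comparison.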
Appendix E Using $\|\mathrm{NonlinearError}(x)\|$ to Predict Scaling

Following up on our discussion in Section B.3, we are interested in whether $\mathrm{NonlinearError}(x)$ can help with predicting SAE per-token error norm scaling, as this might suggest that it contains a larger component of $\mathrm{Introduced}(x)$ than $\mathrm{LinearError}(x)$ does. Formally, we solve for

$$d^* := \arg\min_{d \in \mathbb{R}^2} \left\| d^T \cdot [\mathrm{SaeError}_1(x), \mathrm{NonlinearError}_1(x)] - \mathrm{SaeError}_2(x) \right\|_2 \tag{23}$$

To evaluate the improvement of $d^*$ relative to $c^*$ from Eq. 13, we report the percent decrease in FVU; see Fig. 16. We find that using the norm of $\mathrm{NonlinearError}(x)$ provides a small but noticeable improvement in the ability to predict larger SAE errors, up to a 5% decrease in FVU, validating this hypothesis.

Figure 16: Percent decrease in FVU when additionally using the squared norms of nonlinear error to predict SAE error norm.
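A hedged sketch of the Eq. 23 comparison, using synthetic stand-ins for the per-token norms (the construction below bakes in the hypothesized structure, where the nonlinear component forms the persistent error "floor"):

```python
import numpy as np

# Regress the larger SAE's per-token error norm on both the smaller SAE's
# error norm and its nonlinear error norm (Eq. 23 analogue), and compare FVU
# against the single-feature baseline c* (Eq. 13 analogue).
rng = np.random.default_rng(0)
n = 5000

nonlin = 0.5 + rng.random(n)                     # ||NonlinearError_1(x)|| stand-in
learnable = rng.random(n)                        # part a larger SAE can remove
err1 = learnable + nonlin                        # ||SaeError_1(x)|| stand-in
err2 = 0.3 * learnable + nonlin + 0.05 * rng.normal(size=n)  # floor persists

def fvu(pred, y):
    return ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

c_star, *_ = np.linalg.lstsq(err1[:, None], err2, rcond=None)  # baseline c*
A = np.stack([err1, nonlin], axis=1)
d_star, *_ = np.linalg.lstsq(A, err2, rcond=None)              # two-feature d*

improvement = fvu(err1[:, None] @ c_star, err2) - fvu(A @ d_star, err2)
print(improvement > 0)
```

Because the two-feature regression nests the one-feature baseline, its FVU can only decrease; the size of the decrease measures how much extra information the nonlinear error norm carries.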