Paper deep dive
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso
Models: Gemma-2-2b
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:25:19 PM
Summary
The paper investigates the impact of the L0 hyperparameter on Sparse Autoencoders (SAEs), demonstrating that incorrect L0 settings lead to feature mixing and degraded interpretability. The authors show that low L0 values cause SAEs to 'cheat' by mixing correlated features to improve reconstruction, rendering standard sparsity-reconstruction tradeoff plots misleading. They introduce a proxy metric, 'decoder pairwise cosine similarity' (c_dec), to identify the optimal L0, which is validated on toy models and LLMs (Gemma-2-2b, Llama-3.2-1b) by correlating with peak sparse probing performance.
Entities (6)
Relation Signals (3)
L0 → influences → feature disentanglement
confidence 95% · if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM.
decoder pairwise cosine similarity → detects → optimal L0
confidence 90% · We see that pairwise cosine similarity is minimized at the true L0.
MSE loss → incentivizes → feature mixing
confidence 90% · MSE loss actively incentivizes low L0 SAEs to learn incorrect latents.
Cypher Suggestions (2)
Find all SAE architectures mentioned in the paper. · confidence 90% · unvalidated
MATCH (e:Entity {entity_type: 'Model Architecture'}) RETURN e.name
Identify the relationship between metrics and hyperparameters. · confidence 85% · unvalidated
MATCH (m:Entity {entity_type: 'Metric'})-[r]->(h:Entity {entity_type: 'Hyperparameter'}) RETURN m.name, r.relation, h.name
Abstract
Abstract: Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity-reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Full Text
94,812 characters extracted from source content.
Preprint. Under review.

SPARSE BUT WRONG: INCORRECT L0 LEADS TO INCORRECT FEATURES IN SPARSE AUTOENCODERS

David Chanin (University College London), Adrià Garriga-Alonso (FAR AI). Correspondence to: david.chanin.22@ucl.ac.uk

ABSTRACT

Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to interpretable concepts. A core SAE training hyperparameter is L0: how many SAE features should fire per token on average. Existing work compares SAE algorithms using sparsity–reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value aside from its effect on reconstruction. In this work we study the effect of L0 on SAEs, and show that if L0 is not set correctly, the SAE fails to disentangle the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we present a proxy metric that can help guide the search for the correct L0 for an SAE on a given training distribution. We show that our method finds the correct L0 in toy models and coincides with peak sparse probing performance in LLM SAEs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that L0 must be set correctly to train SAEs with correct features.

1 INTRODUCTION

It is theorized that Large Language Models (LLMs) represent concepts as linear directions in representation space, known as the Linear Representation Hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024). These concepts are nearly orthogonal linear directions, allowing the LLM to represent many more concepts than there are neurons, a phenomenon known as superposition (Elhage et al., 2022). However, superposition poses a challenge for interpretability, as neurons in the LLM are polysemantic, firing on many different concepts.

Sparse autoencoders (SAEs) are meant to reverse superposition and extract interpretable, monosemantic latent features (Cunningham et al., 2024; Bricken et al., 2023) using sparse dictionary learning (Olshausen & Field, 1997). SAEs have the advantage of being unsupervised, and can be scaled to millions of neurons in their hidden layer (hereafter called "latents" [1]). When training an SAE, practitioners must decide on the sparsity of the SAE, measured in terms of L0: how many latents activate on average for a given input. [2]

[1] We use latents to prevent overloading the term feature, which we reserve for human-interpretable concepts the SAE may capture. This breaks from earlier usage which used feature for both (Elhage et al., 2022), but aligns with the terminology in Lieberum et al. (2024) and makes the distinction more clear.
[2] TopK and BatchTopK SAEs (Gao et al., 2024; Bussmann et al., 2024) set the L0 (K) directly, whereas L1 and JumpReLU SAEs (Cunningham et al., 2024; Bricken et al., 2023; Rajamanoharan et al., 2024) adjust it via a coefficient in the loss. In any case, all SAE trainers must decide on the target L0.

L0 is typically considered a neutral design choice: most of the literature evaluates SAEs at a range of L0 values, referring to this as a "sparsity–reconstruction tradeoff" (Gao et al., 2024; Rajamanoharan et al., 2024). While most practitioners would expect that too high an L0 will break the SAE (after all, it is called a sparse autoencoder), the implication of "sparsity–reconstruction tradeoff" plots is that any sufficiently low L0 is equally valid. However, recent work shows the same trend: low L0 SAEs perform worse on downstream tasks (Kantamneni et al., 2025; Bussmann et al., 2025). What causes this degraded performance at low L0?

In this work, we explore the effect of L0 on SAEs. We begin with toy model experiments using synthetic data, and show that if the L0 is too low, the SAE can "cheat" by mixing together components of correlated features, achieving better reconstruction compared to an SAE with correct, disentangled features. We consider this to be a manifestation of feature hedging (Chanin et al., 2025), where the SAE abuses feature correlations to compensate for insufficient resources to model the underlying features monosemantically. This mixing of correlated features into SAE latents affects both positively and negatively correlated features, meaning that in low L0 SAEs, nearly all latents are both less interpretable and more noisy than in an SAE with a correctly set L0.

Figure 1: When SAE L0 is too low (left) or too high (right), the SAE mixes together correlated features, ruining monosemanticity. Only at the correct L0 (middle) does the SAE learn correct features. (Panels show cosine similarity between SAE latents and true features for SAE L0 < true L0, SAE L0 = true L0, and SAE L0 > true L0.)

Our findings also show that "sparsity–reconstruction tradeoff" plots, commonly used to assess SAE architectures, are not a sound method of evaluating SAEs. We demonstrate using toy model experiments that at low L0, an SAE with ground-truth correct latents achieves worse reconstruction than an SAE that mixes correlated features. Thus, if we had an SAE training method that resulted in perfect SAEs, "sparsity–reconstruction tradeoff" plots would cause us to reject that method.

Finally, we develop a proxy metric based on projections between the SAE decoder and training activations that can detect if L0 is too low. We validate these findings on Gemma-2-2b (Team et al., 2024), demonstrating that the decoder patterns we observe in our toy model experiments also manifest in LLM SAEs. We further validate that the optimal L0 we find with our method in Gemma-2-2b matches peak performance on sparse probing tasks (Kantamneni et al., 2025).

Our findings are of direct importance to anyone using SAEs in practice, showing that L0 must be set correctly for SAEs to learn correct features. Furthermore, our work implies that most SAEs used by researchers today have too low an L0. Code is available at https://github.com/chanind/sparse-but-wrong-paper

2 BACKGROUND

Sparse autoencoders (SAEs). An SAE decomposes an input activation x ∈ R^d into a hidden state a consisting of h hidden neurons, called "latents". An SAE is composed of an encoder W_enc ∈ R^{h×d}, a decoder W_dec ∈ R^{d×h}, a decoder bias b_dec ∈ R^d, an encoder bias b_enc ∈ R^h, and a nonlinearity σ, typically ReLU or a variant like JumpReLU (Rajamanoharan et al., 2024), TopK (Gao et al., 2024), or BatchTopK (Bussmann et al., 2024). The decoder is sometimes called the dictionary, in reference to sparse dictionary learning. We use both terms interchangeably.

a = σ(W_enc (x − b_dec) + b_enc)    (1)
x̂ = W_dec a + b_dec    (2)

In this work we focus on BatchTopK and JumpReLU SAEs as these are both considered SOTA architectures.
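To make equations (1)-(2) concrete, here is a minimal sketch of an SAE forward pass with a BatchTopK-style activation. This is an illustrative reimplementation, not the SAELens code used in the paper; the class name, the initialization, and the exact tie-breaking behavior of the batch top-k are our own assumptions.

import torch
import torch.nn as nn


class BatchTopKSAE(nn.Module):
    """Sketch of an SAE following equations (1)-(2), with a BatchTopK activation."""

    def __init__(self, d_in: int, n_latents: int, k: float):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_latents, d_in) * 0.01)  # W_enc in R^{h x d}
        self.W_dec = nn.Parameter(self.W_enc.detach().T.clone())        # W_dec in R^{d x h}
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.k = k  # target L0: mean number of active latents per input

    def batch_topk(self, pre_acts: torch.Tensor) -> torch.Tensor:
        # Keep the top (batch_size * k) activations across the whole batch, so the number
        # of active latents can vary per sample while averaging to roughly k.
        n_keep = int(round(pre_acts.shape[0] * self.k))
        flat = pre_acts.relu().flatten()
        if n_keep < flat.numel():
            threshold = flat.topk(n_keep).values.min()
            flat = torch.where(flat >= threshold, flat, torch.zeros_like(flat))
        return flat.view_as(pre_acts)

    def forward(self, x: torch.Tensor):
        pre_acts = (x - self.b_dec) @ self.W_enc.T + self.b_enc  # pre-activation of eq. (1)
        a = self.batch_topk(pre_acts)                            # sparse latent activations
        x_hat = a @ self.W_dec.T + self.b_dec                    # eq. (2)
        return x_hat, a

The squared-error reconstruction term of the training loss can then be computed directly from the returned x_hat.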
The JumpReLU activation is a modified ReLU with a threshold parameter τ > 0, so JumpReLU_τ(x) = x · 1[x > τ]. The BatchTopK activation function selects the top b × k activations across a batch of size b, allowing variance in the k selected per sample in the batch. After training, a BatchTopK SAE is converted to a JumpReLU SAE with a global τ. We follow the JumpReLU training procedure outlined by Anthropic (Conerly et al., 2025). SAEs are trained as follows, with an auxiliary loss L_p to revive dead latents and corresponding coefficient λ_p. JumpReLU SAEs also have a sparsity loss L_s and corresponding coefficient λ_s.

L = ∥x − x̂∥²₂ + λ_s L_s + λ_p L_p    (3)

The formulation of L_s and L_p for JumpReLU and BatchTopK SAEs is shown in Appendix A.1.

3 TOY MODEL EXPERIMENTS

The Linear Representation Hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024) states that LLMs represent concepts (alternatively referred to as "features") as (nearly) orthogonal linear directions in representation space. Thus, the hidden activations in an LLM are simply the sum of all the firing feature vectors (a feature direction with a positive, non-zero magnitude) that are being represented. While an LLM can represent a potentially large number of concepts this way, in any given activation only a small number of concepts are actively represented. For instance, if we inspect a hidden activation from within an LLM at the token " Canada", we may expect this activation to be a sum of feature vectors representing concepts like "country", "North America", "starts with C", "noun", etc. The job of a sparse autoencoder is to recover these "true feature" directions in its dictionary.

In a real LLM, we do not have ground-truth knowledge of the "true features" the model is representing, so we do not know if the SAE has learned the correct features. Fortunately, it is easy to create a toy model setup that follows the requirements of the LRH while providing ground-truth knowledge of the underlying true features. Our toy model has a set of feature embeddings F ∈ R^{g×d}, where d is the input dimension of our SAE and g is the number of features. All features are orthogonal, so f_i · f_j = 0 for i ≠ j. Each feature f_i fires with probability p_i, mean magnitude μ_i, and magnitude standard deviation σ_i. Feature activations follow a correlated Bernoulli process controlled by a correlation matrix C, with final magnitudes given by m_i = a_i · ReLU(μ_i + σ_i ε_i), where a_i indicates whether feature i is active and ε_i ∼ N(0, 1). Training activations for the SAE, x ∈ R^d, are thus generated as x = Σ_{i=1}^{g} m_i f_i.

In these toy model experiments, we mainly focus on BatchTopK SAEs (Bussmann et al., 2024) as this enables direct control of L0. Additionally, we validate our results with JumpReLU SAEs. We train SAEs on 15M synthetic samples with batch size 500 using SAELens (Bloom et al., 2024). Throughout this section we will use the following terminology:

True L0: In toy models we have complete control over which features fire, so we know how many features are firing on average. We refer to this as the true L0 of the toy model.

Ground-truth SAE: Since we know the ground-truth features in our toy models, we can construct an SAE that perfectly captures these features. We refer to this as the ground-truth SAE. This is an SAE where g = h, W_enc = F, W_dec = Fᵀ, b_enc = 0, b_dec = 0.
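The toy-model activations described above can be generated with a short sampler. The following is a sketch only; the paper does not spell out its sampling code, so details such as using a Gaussian copula for the correlated Bernoulli firings, and the function name itself, are assumptions.

import numpy as np
from scipy.stats import norm


def sample_toy_activations(F, p, mu, sigma, corr, n_samples, seed=0):
    """Sample activations x = sum_i m_i f_i with correlated Bernoulli feature firings.

    F: (g, d) feature embeddings (rows are features); p, mu, sigma: (g,) marginal firing
    probabilities, mean magnitudes, and magnitude standard deviations; corr: (g, g)
    correlation matrix controlling feature co-firing.
    """
    rng = np.random.default_rng(seed)
    g = F.shape[0]

    # Correlated Bernoulli firings via a Gaussian copula: draw correlated normals, then
    # threshold each coordinate at its marginal quantile so P(a_i = 1) = p_i while
    # approximately respecting the correlation matrix.
    z = rng.multivariate_normal(np.zeros(g), corr, size=n_samples)
    thresholds = norm.ppf(1.0 - p)
    active = (z > thresholds).astype(np.float64)  # a_i in {0, 1}

    # Magnitudes m_i = a_i * ReLU(mu_i + sigma_i * eps_i), eps_i ~ N(0, 1)
    eps = rng.standard_normal((n_samples, g))
    magnitudes = active * np.maximum(mu + sigma * eps, 0.0)

    return magnitudes @ F  # x = sum_i m_i f_i, shape (n_samples, d)

Given F, a ground-truth SAE then simply reuses the feature directions for both encoder and decoder with zero biases, as described above.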
3.1 LOW L0 SAES MIX CORRELATED AND ANTI-CORRELATED FEATURES

We begin with a small toy model with 5 true features (g = 5) in an input space of d = 20. We set each p_i = 0.4 such that on average 2 features are active per input, for a true L0 of 2. We begin with a simple correlation pattern between features, where f_0 is positively correlated with every feature f_1 through f_4, but otherwise there are no other correlations. We then train an SAE with L0 = 2, matching the true L0 of the model, and an SAE with a slightly lower value of L0 = 1.8 (BatchTopK SAEs permit setting fractional L0). For the L0 = 1.8 SAE, we initialize it to the ground-truth solution, ensuring that the result of training is due to gradient pressure rather than just being a local minimum.

We show the toy model feature correlation matrix as well as decoder cosine similarity plots with the true features for both SAEs in Figure 2. When the SAE L0 matches the true L0, we see that the SAE perfectly learns the underlying true features. However, when SAE L0 is smaller than the true L0, the resulting SAE latents mix feature components together based on the correlation matrix. The latents tracking features f_1 through f_4 all mix in a positive component of f_0, but they have no components of each other.

Figure 2: (left) Toy model feature correlation matrix showing positive correlations between features. (middle) SAE decoder cosine similarities with true features when SAE L0 = 2, matching the true L0 of the toy model. (right) SAE decoder cosine similarities with true features when SAE L0 = 1.8. When L0 is too low, the SAE mixes components of features based on their firing correlations.

Figure 3: (left) Toy model feature correlation matrix showing negative correlations between features. (middle) SAE decoder cosine similarities with true features when SAE L0 = 2, matching the true L0 of the toy model. (right) SAE decoder cosine similarities with true features when SAE L0 = 1.8. When L0 is too low, the SAE mixes negative components of anti-correlated features.

Next, we invert the correlation, i.e. each feature f_1 through f_4 is negatively correlated with f_0 instead, while keeping everything else unchanged. We show the correlation matrix and SAE decoder cosine similarity with true features in Figure 3. Now, we see the same pattern as with positive correlations, except inverted. The latents tracking features f_1 through f_4 mix in a negative component of f_0, but have no component of each other.

This pattern is problematic because it means that if our L0 is too low, every SAE latent will contain positive components of every positively correlated feature and negative components of every negatively correlated feature in the model. Negative correlations are particularly bad, as negative correlations are prevalent throughout language. For instance, we may expect a nonsensical negative component of "Harry Potter" to appear in the latent for "French poetry", since Harry Potter has nothing to do with French poetry.
This will result in highly polysemantic and noisy SAE latents. Extended toy model experiments are shown in Appendix A.3.

3.2 LARGER TOY MODEL EXPERIMENTS

Next, we scale up to a larger toy model with 50 true features (g = 50) in an input space of d = 100. We set p_0 = 0.345 and linearly decrease to p_49 = 0.05, so firing probability decreases with feature number. The true L0 of this model is 11. We randomly generate a correlation matrix, so the firings of each feature are correlated with other features. Feature correlations are shown in Appendix A.2.

We train SAEs with L0 values that are too small (L0 = 5), exactly correct (L0 = 11), and too large (L0 = 18). Results are shown in Figure 1. When the SAE L0 matches the true L0, the SAE exactly learns the true features. When SAE L0 is too low, the SAE mixes components of correlated features together, particularly breaking latents tracking high-frequency features. When L0 is too high, the SAE learns degenerate solutions that mix features together. The further SAE L0 is from the true L0, the worse the SAE. Interestingly, when L0 is too high the SAE still learns many correct latents, but when L0 is too low, every latent in the SAE is affected.

3.3 MSE LOSS INCENTIVIZES LOW-L0 SAES TO MIX CORRELATED FEATURES

Why do SAEs with low L0 not learn the true features? We construct a ground-truth SAE and set L0 = 5, to match the low L0 SAE from Figure 1. We generate 100k synthetic training samples and calculate the Mean Square Error (MSE) of both these SAEs. The trained SAE with incorrect latents achieves an MSE of 2.73, while the ground-truth SAE achieves a much worse MSE of 4.88. Thus, MSE loss actively incentivizes low L0 SAEs to learn incorrect latents.

3.4 THE SPARSITY–RECONSTRUCTION TRADEOFF

It is common practice to evaluate SAE architectures using a sparsity–reconstruction tradeoff plot (Cunningham et al., 2024; Gao et al., 2024; Rajamanoharan et al., 2024), where the assumption is that having better reconstruction at a given sparsity is inherently better, and indicates that the SAE is correct. After all, we train SAEs to reconstruct inputs, so surely an SAE that has better reconstruction must be a better SAE than one with worse reconstruction? Sadly, this is not the case. As we discussed in Section 3.3, when the L0 of the SAE is lower than optimal, the SAE can find ways to "cheat" by engaging in feature hedging (Chanin et al., 2025), and get a better MSE score by mixing components of correlated features together. This results in an SAE where the latents are not monosemantic and do not track ground-truth features.

Figure 4: Sparsity (L0, lower is better) vs reconstruction (variance explained, higher is better) for learned SAEs and a ground-truth SAE. When L0 is less than the true L0 of the toy model (the dotted line), the trained SAE gets better reconstruction than the ground-truth SAE. Sparsity–reconstruction plots like this lead us to the incorrect conclusion that the ground-truth SAE is a worse SAE.
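As an illustration of the comparison plotted in Figure 4, the sketch below scores a ground-truth dictionary on variance explained at a fixed L0; a trained SAE from the earlier sketch can be scored with the same helper. The paper's exact evaluation code and its precise definition of "variance explained" are not given in this excerpt, so the formulation and function names here are assumptions.

import torch


def variance_explained(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of the variance of x captured by the reconstruction x_hat (a standard choice)."""
    residual_var = (x - x_hat).pow(2).sum()
    total_var = (x - x.mean(dim=0)).pow(2).sum()
    return float(1.0 - residual_var / total_var)


@torch.no_grad()
def ground_truth_sae_recon(x: torch.Tensor, F: torch.Tensor, k: float) -> torch.Tensor:
    """Reconstruct x with the ground-truth dictionary F (g x d, orthonormal rows assumed),
    keeping only the top (batch * k) projections across the batch, mirroring BatchTopK."""
    acts = (x @ F.T).relu()                      # ground-truth encoder: W_enc = F
    n_keep = int(round(x.shape[0] * k))
    flat = acts.flatten()
    if n_keep < flat.numel():
        threshold = flat.topk(n_keep).values.min()
        acts = torch.where(acts >= threshold, acts, torch.zeros_like(acts))
    return acts @ F                              # ground-truth decoder: W_dec = F^T

# Sweeping k for both the ground-truth SAE and SAEs trained at that k, and plotting
# variance_explained against k, reproduces the shape of the tradeoff curve in Figure 4:
# below the true L0, the trained (feature-mixing) SAE wins on reconstruction.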
Figure 5: SAE decoder cosine similarity with true features for the SAEs from Figure 4 with L0=1 (left) and L0=5 (middle), compared with the ground-truth SAE (right). The trained SAEs score much better than the ground-truth SAE on variance explained, despite their corrupted, polysemantic latents.

We next explore the sparsity–reconstruction tradeoff by training SAEs on our toy model at various L0s. Since we know the ground-truth features in our toy model, we construct a ground-truth SAE that perfectly represents these features. We vary the L0 of the ground-truth SAE while leaving the encoder and decoder fixed at the correct features. We plot the variance explained vs L0 in Figure 4 for both SAEs. When the SAE L0 is lower than the true L0 of the toy model, the ground-truth SAE scores worse on reconstruction than the trained SAE! If we had an SAE training technique that gave us the ground-truth correct SAE for a given LLM, sparsity–reconstruction plots would cause us to discard the correct SAE in favor of an incorrect SAE that mixes features together.

We show the cosine similarity of the SAE decoder latents with the ground-truth features for the SAEs learned with L0=1 and L0=2 compared with the ground-truth SAE in Figure 5. Both these SAEs outperform the ground-truth SAE on variance explained by over 2x, despite learning horribly polysemantic latents bearing little resemblance to the underlying true features of the model.

3.5 DETECTING THE TRUE L0 USING THE SAE DECODER

Figure 1 reveals that the SAE decoder latents contain mixes of underlying features, both when the L0 is too high and also when it is too low. As the SAE approaches the correct L0, each SAE latent has fewer components of multiple true features mixed in, becoming more monosemantic. Thus, we expect that the closer the SAE is to the correct L0, the more latents should be orthogonal relative to each other, as there are fewer components of shared correlated features mixed into latents. If we are far from the correct L0, then SAE latents contain components of many underlying features, and thus we expect latents to have higher cosine similarity with each other. We call this metric decoder pairwise cosine similarity, c_dec, and define it as below:

c_dec = (2 / (h(h−1))) · Σ_{i=1}^{h−1} Σ_{j=i+1}^{h} |cos(W_dec,i, W_dec,j)|    (4)

where h(h−1)/2 is the total number of distinct pairs of latents in the SAE decoder.

If SAE decoder latents are mixing lots of positive and negative components of correlated and anti-correlated features, then each SAE latent should become less orthogonal to each other SAE latent, as many latents will likely mix together similar features. This means the absolute value of the cosine similarity between arbitrary latents should increase the worse this mixing becomes. We calculate the pairwise cosine similarity c_dec for each of the BatchTopK SAEs we trained on toy models in the previous sections. Results are shown in Figure 6. We see that pairwise cosine similarity is minimized at the true L0.
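The paper provides its own PyTorch implementation of c_dec in Appendix A.17 (not reproduced in this excerpt). A minimal sketch written directly from equation (4) could look like the following; the function name and the note about blocked computation are our own.

import torch


def decoder_pairwise_cosine_similarity(W_dec: torch.Tensor) -> float:
    """c_dec from eq. (4): mean |cosine similarity| over all distinct pairs of decoder latents.

    W_dec has shape (d, h), so each column is one latent direction (W_dec in R^{d x h} above).
    For very wide SAEs (h around 32k) the h x h similarity matrix is large; accumulate the
    sum in column blocks if memory is tight.
    """
    dirs = W_dec / W_dec.norm(dim=0, keepdim=True)  # unit-normalize each latent direction
    abs_cos = (dirs.T @ dirs).abs()                 # (h, h) matrix of |cos(W_dec,i, W_dec,j)|
    h = abs_cos.shape[0]
    # Diagonal entries are 1; subtract them, then halve to count each pair (i < j) once.
    pair_sum = (abs_cos.sum() - h) / 2
    n_pairs = h * (h - 1) / 2
    return float(pair_sum / n_pairs)

Sweeping trained SAEs over L0 and plotting this value against L0 reproduces curves like the one shown in Figure 6 below.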
Figure 6: Decoder pairwise cosine similarity c_dec evaluated on 5 seeds of toy model SAEs. The true L0 is indicated with a dotted line at 11. The shaded area is 1 standard deviation. c_dec is minimized at the true L0.

We explore alternative metrics in Appendix A.9. Further toy model experiments are shown in Appendix A.4. PyTorch code implementing c_dec is provided in Appendix A.17. We provide formal theoretical justification for the c_dec metric in Appendix A.6.

3.6 JUMPRELU SAE EXPERIMENTS

So far, we have only investigated BatchTopK SAEs due to their ease of setting L0. We now validate that these same conclusions apply to JumpReLU SAEs. We train JumpReLU SAEs with a range of λ_s to control the sparsity of the SAEs. We show plots of λ_s vs L0 and decoder pairwise cosine similarity vs L0 for these SAEs in Figure 7. We see that the cosine similarity vs L0 broadly follows the same pattern as we saw for BatchTopK SAEs, and is minimized at the correct L0. Interestingly, we see that the L0 does not change linearly with λ_s, but instead "sticks" near the correct L0. This is a testament to Anthropic's JumpReLU SAE training method (Conerly et al., 2025), as a wide range of sparsity coefficients λ_s cause the SAE to naturally find the correct L0.

Figure 7: (left) L0 coefficient λ_s vs L0 for JumpReLU SAEs. (right) Decoder pairwise cosine similarity vs L0 for JumpReLU SAEs. The true L0, 11, is marked by a dotted line on the plot.

4 LLM EXPERIMENTS

We train a series of BatchTopK SAEs (Bussmann et al., 2024) with h = 32768 on Gemma-2-2b (Team et al., 2024) and Llama-3.2-1b (Dubey et al., 2024), varying L0 and calculating c_dec. Each SAE is trained on 500M tokens from the Pile (Gao et al., 2020) using SAELens (Bloom et al., 2024). We also calculate k-sparse probing performance for these SAEs using the benchmark from Kantamneni et al. (2025), consisting of over 100 sparse probing tasks.

Results are shown in Figure 8. The Llama SAE c_dec plot looks very similar to the toy model, with a clear minimum point. The Gemma-2-2b layer 5 SAEs also show a sharp increase in c_dec at low L0, as we saw in toy models, but have a long shallow region, with the global minimum actually appearing in that shallow region. In both cases, the "elbow" in the c_dec plots just before the jump due to low L0 is around L0 200, and this also corresponds to peak sparse probing performance. More plots and analysis of c_dec curves are shown in Appendix A.15.

Figure 8: Decoder pairwise cosine similarity vs SAE L0 and k-sparse probing F1 (k=16) vs L0 with 3 seeds per L0. (left) Gemma-2-2b layer 5 BatchTopK results. (right) Llama-3.2-1b layer 7 BatchTopK SAEs. In both cases, peak sparse probing performance occurs in the elbow just before c_dec jumps due to low L0, although the shapes of the c_dec plots vary at high L0.
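The k-sparse probing numbers reported here come from the benchmark of Kantamneni et al. (2025), whose exact protocol is not reproduced in this text. The sketch below shows one common k-sparse probing recipe (select k latents by class mean difference, then fit a logistic probe on them); the function name and the selection heuristic are assumptions, not necessarily the benchmark's method.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def k_sparse_probe_f1(train_acts, train_y, test_acts, test_y, k=16):
    """Fit a logistic probe on the k SAE latents whose mean activation differs most between classes.

    train_acts/test_acts: (n, h) SAE latent activations; train_y/test_y: (n,) binary labels.
    """
    # Rank latents by absolute difference in mean activation between positive and negative examples.
    mean_diff = train_acts[train_y == 1].mean(axis=0) - train_acts[train_y == 0].mean(axis=0)
    top_k = np.argsort(-np.abs(mean_diff))[:k]

    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_acts[:, top_k], train_y)
    preds = probe.predict(test_acts[:, top_k])
    return f1_score(test_y, preds)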
Figure 9: Gemma-2-2b layer 12, with (left) decoder pairwise cosine similarity and k-sparse probing F1 (k=16) for BatchTopK and JumpReLU SAEs, and (right) normalized decoder projection histograms for BatchTopK SAEs with L0 = 10, 200, 750, and 2000. Histograms are truncated to −20 and 40 to highlight projections near the origin.

4.1 JUMPRELU VS BATCHTOPK SAES

We next explore how JumpReLU and BatchTopK SAEs compare on decoder pairwise cosine similarity. We train a suite of SAEs on 1B tokens of Gemma-2-2b layer 12 activations. We plot c_dec for a range of L0 values as well as k-sparse probing results for JumpReLU and BatchTopK SAEs in Figure 9 (left).

JumpReLU and BatchTopK SAEs behave similarly at low L0, with the high c_dec at low L0 corresponding to poor sparse-probing performance. However, we see notable differences at high L0. The BatchTopK SAEs have a global c_dec minimum around 200, but the JumpReLU SAEs' c_dec minimum appears closer to 250-300. As we saw in Figure 8 as well, using the "elbow" of the plots just before c_dec jumps due to low L0 seems to roughly correspond to peak k-sparse probing performance. For JumpReLU SAEs, we see that c_dec rises much less than for BatchTopK SAEs at high L0, and indeed, JumpReLU SAEs also perform much better than BatchTopK SAEs at sparse probing when L0 is high. We suspect this is due to JumpReLU SAEs being able to "stick" near the correct threshold per latent, as we saw in the toy model section. We investigate the differences in learned SAEs between JumpReLU and BatchTopK further in Appendix A.16.

4.2 CAN L0 BE BOTH TOO LOW AND TOO HIGH SIMULTANEOUSLY?

In Figure 9 (right), we plot decoder projection histograms for BatchTopK SAEs on Gemma-2-2b layer 12 with L0 10, 200, 750, and 2000. These plots are created by projecting training inputs onto the SAE decoder, creating a histogram of how strongly each latent projects onto the input. We expect that the more SAE latents are mixing positive and negative components of underlying features, the more strongly they should project both positively and negatively on arbitrary training inputs. This should look like a narrow Gaussian around 0 when there is little mixing, and a wider Gaussian the more mixing there is. This is also the intuition behind the alternative metric discussed in Appendix A.9, and the theory behind this is formalized further in Appendix A.10.
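Such projection histograms can be produced roughly as follows. The paper calls them "normalized" without spelling out the normalization in this excerpt, so the centering step and the function name below are assumptions.

import torch


@torch.no_grad()
def decoder_projection_samples(W_dec: torch.Tensor, acts: torch.Tensor) -> torch.Tensor:
    """Project a batch of model activations onto each (unit-normalized) decoder latent.

    W_dec: (d, h) SAE decoder; acts: (n, d) LLM activations drawn from the training
    distribution. Returns an (n, h) tensor of projections whose flattened values can be
    histogrammed as in Figure 9 (right).
    """
    dirs = W_dec / W_dec.norm(dim=0, keepdim=True)    # unit-norm latent directions
    centered = acts - acts.mean(dim=0, keepdim=True)  # simple centering; the paper's exact
                                                      # normalization is not specified here
    return centered @ dirs                            # (n, h) projections

# A latent that mixes many correlated and anti-correlated features will project strongly,
# both positively and negatively, on arbitrary inputs, widening the histogram around 0.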
As expected, when L0 is very low (10) or very high (2000), we see a wide Gaussian around 0, indicating that decoder latents are mixing correlated features together. At L0=200, we see a much narrower distribution around 0, as we expect when near the correct L0. However, at L0=750, we see an interesting phenomenon, where there is an even narrower distribution than at L0=200, but also a large hump starting at projections above 10 (more visible in the log plot).

We suspect this hump indicates that at L0=750, some latents become more monosemantic while other latents mix underlying features, becoming less monosemantic. This likely means that the L0 is too high for some latents while simultaneously being too low for other latents. There is no reason why every latent should have the same firing threshold, so there is likely a range of L0s where some latents are firing more than they ideally should while other latents are firing less than they ideally should. We also suspect this is part of why JumpReLU SAEs seem to perform much better at high L0, since JumpReLU SAEs can adjust the firing threshold per latent while BatchTopK SAEs cannot.

5 RELATED WORK

Limitations of SAEs. Early work on SAEs for interpretability highlights the problem of feature splitting (Bricken et al., 2023; Templeton et al., 2024), where a seemingly interpretable general feature splits into more specific features at narrower SAE widths. Chanin et al. (2025) explore feature hedging, showing SAEs mix correlated features into latents if the SAE is too narrow. We consider our work a version of feature hedging due to low L0. Till (2024) shows SAEs may increase sparsity by inventing features. Chanin et al. (2024) discuss the problem of feature absorption, where SAEs can improve their sparsity score by mixing hierarchical features together. Engels et al. (2024) investigate SAE errors and find that SAE error may be pathological and non-linear. Engels et al. (2025) find that not all underlying LLM features are themselves linear, demonstrating circular embeddings of some concepts. Wu et al. (2025) and Kantamneni et al. (2025) both investigate empirical SAE performance, finding SAEs underperform relative to supervised baselines, but do not offer theoretical explanations as to why SAEs underperform.

Picking SAE hyperparameters. Related to our work is Minimum Description Length (MDL) SAEs (Ayonrinde et al., 2024), which attempt to find reasonable choices for SAE width and L0 based on information theory. However, MDL SAEs assume that there is no inherently "correct" decomposition for LLM activations and no "correct" L0, and therefore do not attempt to find the underlying true features. Our work takes the opposite approach, starting from simple toy models with linear features and showing that if L0 is not set correctly the SAE decoder becomes corrupted. Another SAE architecture which attempts to pick L0 heuristically is Approximate Feature Activation (AFA) SAEs (Lee et al., 2025). AFA SAEs select L0 adaptively at each input by assuming underlying true features are maximally orthogonal and selecting features until the feature norm is close to the input norm. While the L0 is not set directly in AFA SAEs, there is an extra loss hyperparameter that may modulate the resulting L0.

6 DISCUSSION

While most practitioners of SAEs understand that having too high an L0 is problematic, our work shows that having too low an L0 is perhaps even worse. Our work has several important implications for the field. First, the L0 used by most SAEs is lower than it ideally should be, as a cursory search of open source SAEs on Neuronpedia (Lin, 2023) shows L0 less than 100 is very common even for SAEs trained on large models (see Appendix A.13).
We further show that the sparsity–reconstruction tradeoff, as commonly discussed by most SAE papers (Cunningham et al., 2024; Gao et al., 2024; Rajamanoharan et al., 2024), is misleading: when L0 is too low, an SAE with a correct dictionary achieves worse reconstruction than an incorrect SAE that mixes correlated features.

We presented a metric based on the correlation between the SAE decoder and input activations, c_dec, that can give us hints about the correct L0 for a given SAE. However, we do not view this as a perfect guide. As we saw in our results, while low L0 SAEs consistently have very high c_dec, the metric can sometimes remain nearly flat for a wide range of L0. Still, we feel that this metric is a useful guide to avoid an L0 that is clearly too low, and we hope this investigation into correlation-based SAE quality metrics can be built on further in future work. We are particularly excited about the possibility that we can learn more about the correlational structure between underlying features by studying correlations in the SAE decoder. While our metric currently requires training a sweep over L0 to optimize, we are hopeful that it may be possible to optimize this metric automatically during training (steps towards this are discussed in Appendix A.11). Improving this further is left to future work.

ACKNOWLEDGMENTS

David Chanin was supported by EPSRC EP/S021566/1 and the Machine Learning Alignment and Theory Scholars (MATS) program. We are grateful to Henning Bartsch and Lovkush Agarwal for feedback during the project.

REFERENCES

Kola Ayonrinde, Michael T Pearce, and Lee Sharkey. Interpretability as compression: Reconsidering SAE explanations of neural activations with MDL-SAEs. arXiv preprint arXiv:2410.11179, 2024.

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. SAELens. https://github.com/jbloomAus/SAELens, 2024.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024.

Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547, 2025.

David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507, 2024.

David Chanin, Tomáš Dulka, and Adrià Garriga-Alonso. Feature hedging: Correlated features break narrow sparse autoencoders. arXiv preprint arXiv:2505.11756, 2025.

Tom Conerly, Hoagy Cunningham, Adly Templeton, Jack Lindsey, Basil Hosmer, and Adam Jermyn. Dictionary learning optimization techniques. https://transformer-circuits.pub/2025/january-update, 2025.

Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Ander- son, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Ma- hadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Al- wala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Man- nat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, 10 Preprint. Under review. 
Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur C ̧ elebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhar- gava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sum- baly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petro- vic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Bran- don Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Ar- caute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzm ́ an, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Gold- man, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik 
Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Ke- neally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mo- hammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navy- ata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Sa- tadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lind- say, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, 11 Preprint. Under review. Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Tim- othy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, V ́ ıtor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Con- stable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposi- tion. arXiv preprint arXiv:2209.10652, 2022. Joshua Engels, Logan Riggs, and Max Tegmark. Decomposing the dark matter of sparse autoen- coders. arXiv preprint arXiv:2410.14670, 2024. Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=d63a4AM4hb. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020. 
Leo Gao, Tom Dupr ́ e la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024. Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Sewoong Lee, Adam Davies, Marc E Canby, and Julia Hockenmaier. Evaluating and designing sparse autoencoders by approximating quasi-orthogonality. CoRR, 2025. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, J ́ anos Kram ́ ar, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, August 2024. Johnny Lin. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023. URL https://w.neuronpedia.org. Software available from neuronpedia.org. Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=UGpGkLzwpP. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, J ́ anos Kram ́ ar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhu- patiraju, L ́ eonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram ́ e, Johan Fer- ret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Char- line Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, 12 Preprint. Under review. Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchi- son, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A. 
Choquette-Choo, Danila Sinopalnikov, David Wein- berger, Dimple Vijaykumar, Dominika Rogozi ́ nska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Pluci ́ nska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mohamed, Kar- tikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin G ̈ orner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moyni- han, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culli- ton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin, S ́ ebastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ron- strom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dra- gan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Fara- bet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, and Alek Andreev. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet. https://transformer-circuits. pub/2024/scaling-monosemanticity/, May 2024. Accessed on May 21, 2024. Demian Till.Do sparse autoencoders find true features?LessWrong,2024. URL https://w.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/ do-sparse-autoencoders-find-true-features. Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christo- pher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outper- form sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025. 
A APPENDIX

A.1 SAE TRAINING ARCHITECTURE DEFINITIONS

In this work we focus on JumpReLU (Conerly et al., 2025; Rajamanoharan et al., 2024) and BatchTopK (Bussmann et al., 2024) SAEs. For BatchTopK SAEs, there is no sparsity penalty, as sparsity is enforced by the BatchTopK function. The auxiliary loss L_p for BatchTopK is as follows, where e is the SAE training error residual, and ê is a reconstruction of e using the top k_aux dead latents (latents that have not fired in more than n_dead steps during training):

L_p = ∥e − ê∥²₂

We follow the JumpReLU training setup from Conerly et al. (2025), which involves both a sparsity loss L_s and a pre-activation loss for reviving dead latents, L_p. L_s is defined as below, where c is a scalar scaling factor:

L_s = Σ_i tanh(c · |a_i| · ∥W_dec,i∥₂)

The pre-activation loss L_p adds a small penalty for all dead latents, where a_pre refers to the pre-activation of the SAE passed into the JumpReLU:

L_p = Σ_i ReLU(τ_i − a_pre,i) · ∥W_dec,i∥₂

The JumpReLU defines a pseudo-gradient with respect to the threshold τ as follows, where ε is the bandwidth of the estimator:

∂JumpReLU(x, τ)/∂τ = −τ/ε if −1/2 < (x − τ)/ε < 1/2, and 0 otherwise.

A.2 TOY MODEL SAE TRAINING DETAILS

We train on 15M samples with a batch size of 1024 for all toy model experiments, and a learning rate of 3e-4. We do not use any learning rate warm-up or decay. For all SAE latent vs true feature cosine similarity plots, we re-arrange the SAE latents so the latent indices match the feature indices in the plots, as this makes interpreting the plots easier without any loss of generality. For the large toy model experiments in Section 3.2, we use a randomly generated correlation matrix and linearly decreasing feature firing probabilities, both shown in Figure 10.

Figure 10: (left) Random correlation matrix and (right) base feature firing probabilities p_i for the toy model.

A.3 EXTENDED SMALL TOY MODEL EXPERIMENTS

We continue our investigation of feature mixing due to low L0 and correlated features using the same five-feature toy model from Section 3.1.

A.3.1 VARYING FEATURE CORRELATION STRENGTH

We now explore the effect of varying the strength of the correlation between feature f_0 and features f_1, f_2, f_3, and f_4. In our earlier experiments, we used correlations of 0.4 and −0.4. Here, we vary the correlation between −0.5 and 0.5, while keeping L0=1.8, lower than the true L0 of 2.0.

We show decoder cosine similarity plots with true features for correlations −0.5, 0.0, and 0.5 in Figure 11.

Figure 11: Decoder cosine similarity with true features for SAEs trained on toy models with different amounts of correlation between f_0 and features 1-4. When there is negative correlation (left), we see the SAE mix a negative component of f_0 into latents 1-4. When there is positive correlation (right), we see a positive component of f_0 mixed into the SAE latents. When there is no correlation (middle), there is no mixing.

As expected, latents 1-4 mix negative components of f_0 when correlation is negative,
positive components of f_0 when correlation is positive, and we see no mixing at all when there is no correlation.

We next measure the amount of mixing by calculating the mean cosine similarity of feature 0 with the SAE latents tracking features f_1 through f_4. We show results in Figure 12. As expected, the more negative the correlation, the more negative the mixing. The more positive the correlation, the more positive the mixing. When there is no correlation, there is no mixing.

Figure 12: Amount of mixing (measured as mean cosine similarity between SAE latents 1-4 and feature 0) vs correlation between feature 0 and features 1-4.

A.3.2 VARYING L0

Next, we study how varying the L0 of the SAE affects the amount of feature mixing we observe. We fix the correlation between feature 0 and features 1-4 at 0.4 for positive correlation and -0.4 for negative correlation, as in Section 3.1, and vary the L0 of the SAE from 1.7 to 2.0 (the true L0 of the toy model is 2.0). We find that dropping the L0 below 1.7 causes the SAE latents to become so deformed that they bear almost no resemblance to the true features, making it difficult to perform systematic analysis.

Results are shown in Figure 13. As expected, the further the SAE L0 gets from the true L0 (2.0), the worse the mixing becomes. Furthermore, the mixing matches the sign of the correlation, with negative correlation causing negative mixing, and positive correlation causing positive mixing.

Figure 13: Amount of mixing (measured as mean cosine similarity between SAE latents 1-4 and feature 0) vs SAE L0 (true L0 is 2.0). (Left) mixing vs L0 with negative correlation, and (right) mixing vs L0 with positive correlation.

A.3.3 SUPERPOSITION NOISE

We have shown that even in the simplest possible setting for an SAE, with perfect orthogonality between features, SAEs will fail to learn true features if the SAE L0 is too low and there is correlation between features. If there is superposition noise making the task harder for the SAE, there is no reason to expect that the SAE will somehow perform better. Regardless, we include results on a toy model with superposition noise below for completeness, as in practice we use SAEs in situations with superposition noise.

For this experiment, we reuse the same positive and negative correlations from Section 3.1 (+0.4 and -0.4). However, we allow the toy model features to have small positive and negative overlap with each other. We then train an SAE on this toy model with L0=1.9. We find that using the previous L0=1.8 breaks the SAE too much given the added challenge of superposition noise. We show results in Figure 14.

Figure 14: Superposition toy model results. For the SAE decoder cosine similarity plots, we subtract out the cosine similarity of the underlying features due to superposition for clarity.
We see the same pattern as without superposition noise: the SAE mixes correlated features according to the sign of the correlation. The superposition noise makes the results somewhat noisier, but the core trend is still clearly visible.

A.4 EXTENDED LARGE TOY MODEL EXPERIMENTS

In this section, we build on the results from the 50-feature toy model from Section 3.2.

A.4.1 SUPERPOSITION NOISE

We now modify the large toy model to have superposition noise, as this is a more realistic setting for an LLM SAE to operate in. We reduce the dimensionality of the space to 40, lower than the number of features in the toy model (50). This forces each feature to slightly overlap with other features in the space. The resulting feature cosine similarities are shown in Figure 15 (left).

We train 5 seeds of SAEs at a range of L0s on this superposition toy model, and calculate c_dec in Figure 15 (right). We see that decoder pairwise cosine similarity is still roughly minimized at the true L0 of the toy model.

Figure 15: (Left) cosine similarity between the true features in the toy model. Due to superposition noise, each feature overlaps slightly with many other features. (Right) decoder pairwise cosine similarity for SAEs trained at different L0s. The true L0 is marked with a dashed line.
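For reference, a minimal sketch of how such a superposition toy model can be constructed is shown below; sampling the feature directions as random unit-norm Gaussian vectors is an assumption for illustration, and the paper's exact construction may differ.

import torch

def make_superposition_features(n_features: int = 50, d: int = 40, seed: int = 0) -> torch.Tensor:
    # With more features than dimensions, random unit-norm directions must overlap slightly
    g = torch.Generator().manual_seed(seed)
    feats = torch.randn(n_features, d, generator=g)
    return torch.nn.functional.normalize(feats, dim=1)

feats = make_superposition_features()
cos_sims = feats @ feats.T  # pairwise feature cosine similarities (cf. Figure 15, left)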
A.5 PROOF: LOW L0 INCENTIVIZES FEATURE MIXING

We now provide a theoretical proof that when SAE L0 is less than the true L0 of the underlying features, MSE loss directly incentivizes the SAE to mix features together.

Theorem 1. Consider a toy model with two orthonormal features $f_1, f_2 \in \mathbb{R}^d$ where $f_1 \cdot f_2 = 0$ and $\|f_1\|_2 = \|f_2\|_2 = 1$. Let $f_1$ fire alone with probability $P_1$, $f_2$ fire alone with probability $P_2$, and both fire together with probability $P_{12}$, where $P_1 + P_2 + P_{12} \leq 1$. Consider a tied SAE with 2 latents (i.e., $W_{enc} = W_{dec}^\top$) and no biases that can fire at most 1 latent per input (L0 = 1). We assume this is less than the true L0 of the data (i.e., $\mathbb{E}[\text{active features}] = P_1 + P_2 + 2P_{12} > 1$), which occurs whenever features co-occur ($P_{12} > 0$). Then the SAE that minimizes expected MSE will have latents that mix $f_1$ and $f_2$, rather than learning them separately.

Proof. We define our SAE with decoder $W_{dec} = [l_1, l_2] \in \mathbb{R}^{d \times 2}$, where $l_1, l_2$ are the two latent directions. Since the SAE is tied and has no biases, the reconstruction of an input $x$ using a single active latent $l_i$ (selected via Top-1 projection) is:

$$\hat{x} = (l_i \cdot x)\, l_i$$

The reconstruction loss for a single sample is:

$$L(x) = \|x - \hat{x}\|_2^2 = \|x - (l_i \cdot x)\, l_i\|_2^2$$

Parameterization. We parameterize the latents as:

$$l_2 = f_2, \qquad l_1 = \frac{\alpha f_1 + (1 - \alpha) f_2}{\sqrt{\alpha^2 + (1 - \alpha)^2}}$$

where $0 \leq \alpha \leq 1$ controls the mixture. When $\alpha = 1$, $l_1 = f_1$ (the correct, disentangled solution). When $0 \leq \alpha < 1$, $l_1$ mixes both features.

Case analysis. We analyze the four possible cases.

Case 1: Only $f_1$ fires (probability $P_1$). The input is $x = m_1 f_1$, where $m_1 > 0$ is the magnitude. Latent $l_1$ activates (since it has the largest projection). The reconstruction loss is:

$$L_1(\alpha) = \|m_1 f_1 - (l_1 \cdot m_1 f_1)\, l_1\|_2^2 = m_1^2 \left[ \left(1 - \frac{\alpha^2}{\alpha^2 + (1-\alpha)^2}\right)^2 + \left(\frac{\alpha(1-\alpha)}{\alpha^2 + (1-\alpha)^2}\right)^2 \right]$$

Since $1 - \frac{\alpha^2}{\alpha^2 + (1-\alpha)^2} = \frac{(1-\alpha)^2}{\alpha^2 + (1-\alpha)^2}$, this simplifies to:

$$L_1(\alpha) = m_1^2 \cdot \frac{(1-\alpha)^2\left[\alpha^2 + (1-\alpha)^2\right]}{\left[\alpha^2 + (1-\alpha)^2\right]^2} = \frac{m_1^2 (1-\alpha)^2}{\alpha^2 + (1-\alpha)^2}$$

Case 2: Only $f_2$ fires (probability $P_2$). The input is $x = m_2 f_2$. Latent $l_2 = f_2$ activates, giving perfect reconstruction: $L_2 = 0$.

Case 3: Both $f_1$ and $f_2$ fire (probability $P_{12}$). The input is $x = m_1 f_1 + m_2 f_2$. Since L0 = 1, only one latent can activate. The SAE will choose $l_1$ if $|l_1 \cdot x|^2 > |l_2 \cdot x|^2$. We have:

$$|l_1 \cdot x|^2 = \frac{\left(m_1 \alpha + m_2 (1-\alpha)\right)^2}{\alpha^2 + (1-\alpha)^2}, \qquad |l_2 \cdot x|^2 = m_2^2$$

For simplicity, we assume $m_1 \geq m_2 > 0$, so $l_1$ will activate when $\alpha$ is sufficiently large (e.g., for $\alpha = 1$, $|l_1 \cdot x|^2 = m_1^2 > m_2^2$). Assuming $l_1$ activates, and letting $c = \frac{m_1 \alpha + m_2 (1-\alpha)}{\alpha^2 + (1-\alpha)^2}$, the reconstruction loss is:

$$L_3(\alpha) = \|m_1 f_1 + m_2 f_2 - c\alpha f_1 - c(1-\alpha) f_2\|_2^2 = (m_1 - c\alpha)^2 + (m_2 - c(1-\alpha))^2$$

Expanding and simplifying (see the detailed algebra below):

$$L_3(\alpha) = \frac{\left[m_1(1-\alpha) - m_2\alpha\right]^2}{\alpha^2 + (1-\alpha)^2}$$

Note that when $m_1 = m_2 = m$, this simplifies to:

$$L_3(\alpha) = \frac{m^2 (1 - 2\alpha)^2}{\alpha^2 + (1-\alpha)^2}$$

which equals 0 when $\alpha = 0.5$ (perfect reconstruction when the features are equally mixed) and equals $m^2$ when $\alpha = 1$ (complete failure to reconstruct $f_2$).

Case 4: Neither feature fires (probability $P_0 = 1 - P_1 - P_2 - P_{12}$). Perfect reconstruction, with $L_4 = 0$.

Expected loss. The expected loss is:

$$\mathbb{E}[L(\alpha)] = P_1\, \mathbb{E}_{m_1}[L_1(\alpha)] + P_2\, \mathbb{E}_{m_2}[L_2(\alpha)] + P_{12}\, \mathbb{E}_{m_1, m_2}[L_3(\alpha)]$$

Assuming $l_1$ activates in Case 1 and $l_2$ in Case 2 (which holds for $\alpha > 0.5$):

$$\mathbb{E}[L(\alpha)] = P_1\, \mathbb{E}_{m_1}\!\left[\frac{m_1^2 (1-\alpha)^2}{\alpha^2 + (1-\alpha)^2}\right] + P_{12}\, \mathbb{E}_{m_1, m_2}\!\left[\frac{\left[m_1(1-\alpha) - m_2\alpha\right]^2}{\alpha^2 + (1-\alpha)^2}\right]$$

Concrete example demonstrating feature mixing. To make this concrete, suppose $m_1 = m_2 = 1$ (both features have equal magnitude when they fire). Assume $P_1 = P_{12} = 0.4$, which implies $P_2 = 0$ (or is negligible) and $P_0 = 0.2$.

For the disentangled solution ($\alpha = 1$, so $l_1 = f_1$): $L_1(1) = 0$ (perfect reconstruction when only $f_1$ fires) and $L_3(1) = m^2 (1-2)^2 / (1^2 + 0^2) = m^2 = 1$ (the $f_2$ component cannot be reconstructed). The expected loss is $\mathbb{E}[L(1)] = (0.4 \times 0) + (0.4 \times 1) = 0.4$.

For a mixed solution ($\alpha = 0.6$): $L_1(0.6) = (1 - 0.6)^2 / (0.6^2 + 0.4^2) = 0.16 / 0.52 \approx 0.308$ and $L_3(0.6) = (1 - 2 \times 0.6)^2 / (0.6^2 + 0.4^2) = 0.04 / 0.52 \approx 0.077$. The expected loss is $\mathbb{E}[L(0.6)] = (0.4 \times 0.308) + (0.4 \times 0.077) \approx 0.1232 + 0.0308 = 0.154$.

The mixed solution achieves $\mathbb{E}[L(0.6)] \approx 0.154 < 0.4 = \mathbb{E}[L(1)]$, demonstrating that MSE loss directly incentivizes feature mixing when L0 is constrained below the true L0.
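The numbers in this example can be checked with a few lines of Python; the helper below simply evaluates the per-case losses $L_1(\alpha)$ and $L_3(\alpha)$ weighted by $P_1$ and $P_{12}$ under the $m_1 = m_2 = m$ assumption.

def expected_loss(alpha, P1=0.4, P12=0.4, m=1.0):
    # E[L(alpha)] = P1 * L1(alpha) + P12 * L3(alpha) for m1 = m2 = m
    denom = alpha ** 2 + (1 - alpha) ** 2
    return (m ** 2 / denom) * (P1 * (1 - alpha) ** 2 + P12 * (1 - 2 * alpha) ** 2)

print(expected_loss(1.0))  # disentangled solution: 0.4
print(expected_loss(0.6))  # mixed solution: ~0.154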
Optimal mixing coefficient. More generally, for the case $m_1 = m_2 = m$, the expected loss is:

$$\mathbb{E}[L(\alpha)] = \frac{m^2}{\alpha^2 + (1-\alpha)^2}\left[ P_1 (1-\alpha)^2 + P_{12} (1 - 2\alpha)^2 \right]$$

At the boundaries:
• At $\alpha = 1$ (disentangled): $\mathbb{E}[L(1)] = P_{12} m^2$
• At $\alpha = 0.5$ (maximally mixed): $\mathbb{E}[L(0.5)] = \frac{P_1 m^2 (0.5)^2}{0.5^2 + 0.5^2} = \frac{P_1 m^2}{2}$

When $P_{12} > P_1 / 2$, we have $\mathbb{E}[L(0.5)] < \mathbb{E}[L(1)]$, showing that mixing reduces loss when both features frequently co-occur. For instance, with $P_1 = P_{12} = 0.5$ and $m = 1$: $\mathbb{E}[L(1)] = 0.5$ and $\mathbb{E}[L(0.5)] = 0.25$.

This demonstrates that when features frequently co-occur ($P_{12}$ is large), the MSE-optimal solution involves substantial feature mixing ($\alpha^* < 1$) rather than learning the features disentangled ($\alpha = 1$), completing the proof.

Detailed algebra for Case 3. Starting from:

$$L_3(\alpha) = (m_1 - c\alpha)^2 + (m_2 - c(1-\alpha))^2, \qquad c = \frac{m_1\alpha + m_2(1-\alpha)}{\alpha^2 + (1-\alpha)^2}$$

Expanding:

$$L_3 = m_1^2 - 2 m_1 c \alpha + c^2 \alpha^2 + m_2^2 - 2 m_2 c (1-\alpha) + c^2 (1-\alpha)^2 = m_1^2 + m_2^2 + c^2\left[\alpha^2 + (1-\alpha)^2\right] - 2c\left[m_1\alpha + m_2(1-\alpha)\right]$$

Note that $c\left[\alpha^2 + (1-\alpha)^2\right] = m_1\alpha + m_2(1-\alpha)$ by the definition of $c$. Therefore:

$$L_3 = m_1^2 + m_2^2 - c\left[m_1\alpha + m_2(1-\alpha)\right] = m_1^2 + m_2^2 - \frac{\left[m_1\alpha + m_2(1-\alpha)\right]^2}{\alpha^2 + (1-\alpha)^2}$$

Putting everything over a common denominator:

$$L_3 = \frac{(m_1^2 + m_2^2)\left[\alpha^2 + (1-\alpha)^2\right] - \left[m_1\alpha + m_2(1-\alpha)\right]^2}{\alpha^2 + (1-\alpha)^2}$$

The numerator expands to:

$$m_1^2(1-\alpha)^2 + m_2^2\alpha^2 - 2 m_1 m_2 \alpha(1-\alpha) = \left[m_1(1-\alpha) - m_2\alpha\right]^2$$

(expanding $\left[m_1(1-\alpha) - m_2\alpha\right]^2$ confirms this factorization). Therefore:

$$L_3(\alpha) = \frac{\left[m_1(1-\alpha) - m_2\alpha\right]^2}{\alpha^2 + (1-\alpha)^2}$$
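This simplification can also be verified symbolically; the following is a minimal sketch (assuming sympy is available) that checks the simplified form of $L_3(\alpha)$ against its definition.

import sympy as sp

m1, m2, a = sp.symbols("m1 m2 alpha", positive=True)
denom = a ** 2 + (1 - a) ** 2
c = (m1 * a + m2 * (1 - a)) / denom
L3 = (m1 - c * a) ** 2 + (m2 - c * (1 - a)) ** 2   # unsimplified Case 3 loss
target = (m1 * (1 - a) - m2 * a) ** 2 / denom      # simplified form derived above
assert sp.simplify(L3 - target) == 0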
A.6 THEORETICAL JUSTIFICATION FOR THE c_dec METRIC

We provide a theoretical justification for why the decoder pairwise cosine similarity metric, c_dec, serves as a proxy for detecting feature mixing in SAEs.

Theorem 2. Consider two SAEs with identical dictionary size $h$, where SAE 1 learns disentangled features and SAE 2 mixes a correlated feature into its latents. Let the underlying true features $f_1, \ldots, f_h, g$ be an orthonormal set in $\mathbb{R}^d$, where the $f_i$ are unique features and $g$ is a dense or frequent feature correlated with multiple $f_i$. We model the decoder weights $W_{dec,i}$ (normalized to unit length) for the two SAEs as:

$$\text{SAE 1 (disentangled):}\quad W_i^{(1)} = f_i, \qquad \text{SAE 2 (mixed):}\quad W_i^{(2)} = \sqrt{1 - \gamma_i^2}\, f_i + \gamma_i g$$

where $\gamma_i \in [-1, 1]$ is the mixing coefficient for latent $i$. Assume there exists a subset of latents $S \subseteq \{1, \ldots, h\}$ with $|S| \geq 2$ such that $\gamma_i \neq 0$ for all $i \in S$. Then the expected pairwise cosine similarity is strictly greater for SAE 2 than for SAE 1:

$$c_{dec}(\text{SAE 2}) > c_{dec}(\text{SAE 1})$$

Proof. Recall the definition of decoder pairwise cosine similarity:

$$c_{dec} = \binom{h}{2}^{-1} \sum_{i=1}^{h-1} \sum_{j=i+1}^{h} \left|\cos(W_{dec,i}, W_{dec,j})\right|$$

Since the decoder weights are normalized, $\cos(W_{dec,i}, W_{dec,j}) = W_{dec,i}^\top W_{dec,j}$.

Case 1: SAE 1 (disentangled). For any distinct pair $i \neq j$, the weights are $W_i^{(1)} = f_i$ and $W_j^{(1)} = f_j$. Since the underlying features are orthonormal, $W_i^{(1)\top} W_j^{(1)} = f_i^\top f_j = 0$. Thus $c_{dec}(\text{SAE 1}) = 0$.

Case 2: SAE 2 (mixed). Consider the dot product for a distinct pair $i, j$:

$$W_i^{(2)\top} W_j^{(2)} = \sqrt{(1 - \gamma_i^2)(1 - \gamma_j^2)}\,(f_i^\top f_j) + \gamma_j \sqrt{1 - \gamma_i^2}\,(f_i^\top g) + \gamma_i \sqrt{1 - \gamma_j^2}\,(g^\top f_j) + \gamma_i \gamma_j\,(g^\top g)$$

Using the orthonormality of the set $\{f_1, \ldots, f_h, g\}$ ($f_i^\top f_j = 0$, $f_i^\top g = g^\top f_j = 0$, $g^\top g = 1$), the expression simplifies to:

$$\cos(W_i^{(2)}, W_j^{(2)}) = \gamma_i \gamma_j$$

The metric c_dec is the average of absolute cosine similarities:

$$c_{dec}(\text{SAE 2}) = \binom{h}{2}^{-1} \sum_{i < j} |\gamma_i \gamma_j|$$

Since we assumed a subset $S$ with at least two latents where $\gamma_i \neq 0$, there is at least one pair $(i, j)$ with $|\gamma_i \gamma_j| > 0$, and all other terms are non-negative. Therefore $c_{dec}(\text{SAE 2}) > 0 = c_{dec}(\text{SAE 1})$. This confirms that mixing a shared feature component into multiple latents strictly increases the c_dec metric.

Remark 1. In real-world scenarios with superposition noise, the baseline orthogonality $f_i^\top f_j$ is not exactly zero but follows a distribution with mean zero and variance approximately $1/d$. However, systematic feature mixing introduces a structured non-zero component ($\gamma_i \gamma_j$) that typically dominates the random superposition noise, causing a measurable rise in c_dec, as observed in Figure 6 and Figure 8.

A.7 LLM SAE TRAINING DETAILS

For BatchTopK SAEs, we ensure that each decoder latent remains normalized with $\|W_{dec,i}\|_2 = 1$, so that s_dec_n calculations use the same scale for every latent. We use a learning rate of 3e-4 with no warmup or decay.

For JumpReLU SAEs, we broadly follow the training procedure laid out by Conerly et al. (2025). However, we do not apply learning rate decay, and we only warm $\lambda_s$ for 100M tokens to avoid the sparsity penalty changing throughout the majority of training. We use a learning rate of 2e-4, $c = 4$, $\lambda_p = 3\mathrm{e}{-6}$, and bandwidth $\varepsilon = 2.0$, as recommended by Conerly et al. (2025).

A.8 TRANSITIONING L0 DURING TRAINING

We explore the effect of transitioning the L0 of the SAE during training using the toy model from Section 3, which has a true L0 of 11. We train BatchTopK SAEs with a final L0 of 11, but starting with an L0 that is either too high or too low, linearly transitioning to the correct L0 over the first 25k steps of training, and leaving the SAE at the correct L0 for the final 5k steps. We use a starting L0 of 20 for the case where we start too high, and a starting L0 of 2 for the case where we start too low.

Results are shown in Figure 16. We see that decreasing the L0 of the SAE from a too-high value to the correct value still results in the SAE learning correct features. However, when the SAE starts from a too-low L0, it cannot fully recover when the L0 is adjusted to the correct value later. It seems that the latents the SAE learns when L0 is too low form a local minimum that is difficult for the SAE to escape from even when the L0 is later corrected. This is likely because the latents learned when L0 is too low are optimized by gradient descent to achieve a lower MSE loss than the correct latents can achieve under the same (too-low) L0 constraint, whereas when L0 is too high there is no equivalent optimization pressure, making a harmful local minimum less likely.

Figure 16: Transitioning L0 from too low (left) and too high (right) to the correct L0 during training. When the starting L0 is too high, the SAE still learns the correct features by the end of training. However, when the starting L0 is too low, the SAE cannot recover fully and still learns many incorrect features at the end of training.
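A minimal sketch of the linear high-to-low schedule used in this experiment is shown below; the function name and exact interpolation are illustrative rather than the paper's implementation.

def l0_schedule(step: int, start_l0: float = 20.0, target_l0: float = 11.0,
                anneal_steps: int = 25_000) -> float:
    # Linearly anneal L0 from a deliberately high starting value down to the
    # target L0 over the first phase of training, then hold it fixed.
    if step >= anneal_steps:
        return target_l0
    return start_l0 + (step / anneal_steps) * (target_l0 - start_l0)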
A.9 ALTERNATIVE METRIC: NTH DECODER PROJECTION

Figure 17: Idealized histogram of decoder projections on input activations (disentangled SAE vs. SAE mixing correlated features), demonstrating the intuition behind our nth decoder projection metric, s_dec_n. For an arbitrary input, most latents should be non-active and thus have low projection.

When SAE latents are monosemantic, meaning they do not mix components of many features, we expect non-active latents to have a near-zero projection on arbitrary input activations. However, if SAE latents mix positive and negative components of many underlying features, then those latents will have larger projections on arbitrary inputs that contain those features. By picking an n less than h/2 (corresponding roughly to the origin of the histogram), a smaller s_dec_n means latents have smaller projections on arbitrary inputs and are thus more monosemantic.

Figure 1 reveals that the SAE decoder latents contain mixes of underlying features, both when the L0 is too high and when it is too low. As the SAE approaches the correct L0, each SAE latent has fewer components of multiple true features mixed in, becoming more monosemantic. Thus, we expect that when the SAE is at the correct L0, most latents should have near-zero projection on arbitrary training inputs, because those inputs usually do not contain the feature being tracked by a given latent. If we are far from the correct L0, then SAE latents contain components of many underlying features, and we expect latents to project more strongly onto arbitrary training inputs.

We now define a metric we call the nth decoder projection score, s_dec_n, that we can use to find the optimal L0 of the SAE. Given SAE inputs $x \in \mathbb{R}^{b \times d}$, where $b$ is the batch size and $d$ is the input dimension, we first compute the decoder projections for all latents:

$$Z = (x - b_{dec}) W_{dec}^\top \in \mathbb{R}^{b \times h}$$

where $b_{dec} \in \mathbb{R}^d$ is the decoder bias and $W_{dec} \in \mathbb{R}^{h \times d}$ is the decoder weight matrix with $h$ latent dimensions. To aggregate across the batch, we flatten $Z$ to obtain $z \in \mathbb{R}^{bh}$ and sort these values in descending order to get $z^{\downarrow}$. The nth decoder projection is then defined as:

$$s^{dec}_n = z^{\downarrow}[n \cdot b]$$

where $[n \cdot b]$ selects the element at index $n \cdot b$. The multiplication by $b$ accounts for the batch dimension, effectively selecting the nth highest projection value when considering all samples in the batch. For this to work, n should be sufficiently larger than a reasonable guess at the correct L0, since in a perfect SAE the decoders of these latents should be uncorrelated with the input activations. Picking any n up to h/2 should work, as the majority of latents should have low projection on arbitrary input activations, so h/2 intuitively corresponds to zero expected projection. The intuition behind s_dec_n is shown visually in Figure 17, formalized in Appendix A.10, and illustrated with a small self-contained example below.
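The following sketch illustrates this intuition on synthetic data: when every latent mixes in a component of a frequent feature g that is present in each input, the mid-rank projections (and hence s_dec_n) rise well above zero. All constants here are arbitrary choices for illustration, not values used in the paper.

import torch

torch.manual_seed(0)
d, h, b, n, k = 128, 32, 256, 12, 2   # k = sparse features active per input

# Orthonormal true features plus one frequent feature g present in every input
basis = torch.linalg.qr(torch.randn(d, h + 1))[0].T   # [(h + 1), d], orthonormal rows
F, g = basis[:h], basis[h]

idx = torch.randint(0, h, (b, k))
x = g + F[idx].sum(dim=1)                             # inputs: g plus k random features

gamma = 0.3
W_disentangled = F                                    # latents equal the true features
W_mixed = (1 - gamma ** 2) ** 0.5 * F + gamma * g     # every latent mixes in g

def s_dec_n(W_dec, x, n):
    z = (x @ W_dec.T).flatten().sort(descending=True).values
    return z[n * x.shape[0]].item()

print(s_dec_n(W_disentangled, x, n))  # near zero: non-active latents barely project
print(s_dec_n(W_mixed, x, n))         # roughly gamma: the shared g component leaks in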
We calculate s_dec_n for n = 12 and n = 18, varying SAE L0 from 2 to 25 with 5 seeds per L0, in Figure 18. The metric is minimized at the true L0, 11, in both cases, although the shape of the curve changes depending on n. In both cases, the slope of s_dec_n is flat when L0 is slightly higher than the true L0.

Figure 18: nth decoder projection vs SAE L0 for n = 12 (left) and n = 18 (right) on our toy model SAEs. The true L0, 11, is marked by a dotted line. Both settings of n are minimized at the true L0, but the slope of the metric changes depending on n. The shaded area is 1 stdev.

A.9.1 LLM SAE RESULTS

Next, we compute s_dec_n for each LLM SAE we evaluated in the paper, along with k=16 sparse probing results. Gemma-2-2b layer 5 and Llama-3.2-1b layer 7 results are shown in Figure 19. The results roughly match what we saw with c_dec.

Figure 19: nth decoder projection and k=16 sparse probing results for BatchTopK SAEs trained on Gemma-2-2b layer 5 (left) and Llama-3.2-1b layer 7 (right). The metric is roughly minimized near peak sparse-probing performance. The shaded area is 1 stdev.

BatchTopK and JumpReLU results for Gemma-2-2b layer 12 are shown in Figure 20. The results look similar to what we saw for c_dec.

Figure 20: nth decoder projection and sparse probing results for BatchTopK and JumpReLU SAEs trained on Gemma-2-2b layer 12. The metric seems to align less well with k=16 sparse probing results.

A.9.2 WHICH METRIC IS BETTER?

We choose to focus on c_dec as it is the simpler metric, both to understand and to implement, since it has no hyperparameters. However, we expect that when an SAE is near the correct L0, there are likely many indicators that all point to similar results. Any metric that can detect correlated features being mixed into SAE latents should give roughly similar results.

A.10 THEORETICAL JUSTIFICATION FOR THE s_dec_n METRIC

We provide a theoretical justification for why the nth decoder projection metric, s_dec_n, successfully identifies when SAE latents are mixing correlated features.

Theorem 3. Consider two SAEs with identical dictionary size $h$ and input dimension $d$, where SAE 1 has greater feature mixing than SAE 2. Specifically, let the decoder projections onto a feature $f$ for non-active latents follow:

$$\text{SAE 1:}\quad z_i^{(1)} \sim \mathcal{N}(0, \sigma_0^2 + \sigma_1^2), \qquad \text{SAE 2:}\quad z_i^{(2)} \sim \mathcal{N}(0, \sigma_0^2 + \sigma_2^2)$$

where $\sigma_0^2$ represents the base variance from superposition noise, and $\sigma_1^2 > \sigma_2^2$ represents the variance from feature mixing. Then for $n < h/2$, $\mathbb{E}[s^{dec}_n]$ is larger for SAE 1 than for SAE 2.

Proof. Let $f \in \mathbb{R}^d$ be an underlying true feature with $\|f\|_2 = 1$. Consider an SAE with decoder $W_{dec}$ and decoder bias $b_{dec} \in \mathbb{R}^d$. For an input activation $x \in \mathbb{R}^d$, the projection of latent $i$ onto the input is:

$$z_i = (x - b_{dec})^\top W_{dec,i}$$
Decomposition of decoder latents. We decompose each decoder latent $W_{dec,i}$ into three components:

$$W_{dec,i} = \alpha_i f + \beta_i g_i + \varepsilon_i$$

where:
• $\alpha_i f$ is the component aligned with feature $f$ (the intended feature for latent $i$ if $i$ is the correct latent, or mixing if $i$ is incorrect)
• $\beta_i g_i$ represents components of other correlated or anti-correlated features mixed into latent $i$, where $g_i$ is orthogonal to $f$
• $\varepsilon_i$ represents superposition noise, also orthogonal to $f$

Distribution of projections for non-active latents. Consider latents that should not activate for feature $f$ (i.e., latents $i$ where $\alpha_i$ should ideally be near zero). For an input $x$ containing feature $f$ with magnitude $m_f$, we can write:

$$x - b_{dec} = m_f f + r$$

where $r$ contains all other feature contributions orthogonal to $f$. The projection of latent $i$ becomes:

$$z_i = (m_f f + r)^\top (\alpha_i f + \beta_i g_i + \varepsilon_i) = m_f \alpha_i + r^\top (\beta_i g_i + \varepsilon_i)$$

For non-active latents in a well-trained SAE, we expect $\alpha_i \approx 0$ for the intended feature component, while $\beta_i g_i$ represents unintended mixing of correlated features and $\varepsilon_i$ represents superposition noise.

Modeling as Gaussian mixtures. Under the assumptions that (1) feature magnitudes $m_f$ and residual components $r$ vary across the input distribution, (2) the number of latents $h$ is large, and (3) feature mixing coefficients $\beta_i$ arise from optimization pressure to compensate for insufficient L0, the Central Limit Theorem gives that the distribution of projections $z_i$ for non-active latents approximately follows:

$$z_i \sim \mathcal{N}(0, \sigma_{base}^2 + \sigma_{mix}^2)$$

where $\sigma_{base}^2$ captures variance from superposition noise ($\varepsilon_i$) and $\sigma_{mix}^2$ captures variance from feature mixing ($\beta_i g_i$).

Comparing two SAEs. Consider two SAEs: SAE 1 (high feature mixing) with $z_i^{(1)} \sim \mathcal{N}(0, \sigma_{base}^2 + \sigma_1^2)$ where $\sigma_1^2$ is large, and SAE 2 (low feature mixing) with $z_i^{(2)} \sim \mathcal{N}(0, \sigma_{base}^2 + \sigma_2^2)$ where $\sigma_2^2$ is small, with $\sigma_1^2 > \sigma_2^2$.

Computing s_dec_n. For a batch of size $b$, we have $bh$ projection values. After sorting in descending order, the nth decoder projection is $s^{dec}_n = z^{\downarrow}[n \cdot b]$. This corresponds to the $(n \cdot b)/(bh) = n/h$ quantile of the distribution. For $n < h/2$, this is a quantile on the positive side of the distribution.

Quantile comparison. For a zero-mean normal distribution and $\sigma_1 > \sigma_2 > 0$, the pth quantile satisfies:

$$Q_p(\mathcal{N}(0, \sigma_1^2)) = \sigma_1\, Q_p(\mathcal{N}(0, 1)) > \sigma_2\, Q_p(\mathcal{N}(0, 1)) = Q_p(\mathcal{N}(0, \sigma_2^2))$$

for $p > 0.5$ (corresponding to positive quantiles). Since $n < h/2$ implies $p = n/h < 0.5$, and we sort in descending order, we are in fact looking at the $(1 - p)$th quantile on the right tail. This gives us:

$$\mathbb{E}[s^{dec}_n]_{\text{SAE 1}} > \mathbb{E}[s^{dec}_n]_{\text{SAE 2}}$$

Therefore, SAEs with greater feature mixing (larger $\sigma_{mix}^2$) have larger values of s_dec_n, justifying its use as a metric for detecting feature mixing.

Remark 2. The choice of $n \approx h/2$ in practice corresponds to sampling from a region where the distribution is sensitive to changes in variance (roughly near the median), while being sufficiently far from the extreme tails to maintain statistical stability. Values of n too close to 0 would sample from the extreme right tail where variance is high, while values of n too close to h would sample from regions dominated by active latents rather than the non-active latents we wish to characterize.

Remark 3. This theoretical analysis assumes that decoder projections follow approximately Gaussian distributions.
While this is a simplification, our empirical results in both toy models (where we have full control) and LLM SAEs support this assumption, as evidenced by the decoder projection histograms in Figure 9.

A.11 AUTOMATICALLY FINDING THE CORRECT L0 DURING TRAINING

A natural next step, given our finding that the correct L0 occurs where the nth decoder projection metric s_dec_n is minimized, is to use this observation to find the correct L0 automatically during training. This is a meta-learning task, as L0 is a hyperparameter of the training process. We find there are several challenges to directly using s_dec_n as an optimization target:

• Small gradients directly above the correct L0. In our plots of s_dec_n from both toy models and Gemma-2-2b, the metric is relatively flat in a region starting at the correct L0 and extending to higher L0 values. We thus need a way to traverse this flat region and stop once the metric starts to increase again.
• The impact of changing L0 is delayed. It takes many steps after changing L0 for s_dec_n to also change, so it is easy to overshoot the target L0 or oscillate back and forth.
• Dropping L0 too low can harm the SAE. As we saw in Appendix A.8, if the L0 is too low the SAE can permanently end up in a poor local minimum. We thus want to avoid dropping below the correct L0, even temporarily, to avoid permanently breaking the SAE. We therefore need to start with L0 too high and slowly decrease it until we find the correct L0.
• Noise during training. While s_dec_n shows clear trends after training for many steps, it can be noisy on each training sample, so our optimization needs to be robust to this noise.

Taking these requirements into account, we present an optimization procedure to find the L0 that minimizes s_dec_n automatically during training. We first estimate the gradient of s_dec_n, hereafter referred to as the metric m, with respect to L0, dm/dL0. We define an evaluation step t as a set number of training steps (we evaluate every 100 training steps). At step t we change L0 by $\delta_{L0}$. At the next evaluation step, t+1, we evaluate m, using a sliding average of s_dec_n over the past 10 training steps to help account for noise. We then estimate dm/dL0 as:

$$\frac{dm}{dL0} = \frac{m_{t+1} - m_t}{\delta_{L0}}$$

Next, we add a small negative bias to this gradient estimate to encourage it to push L0 lower even if the loss landscape is relatively flat. We use a bias magnitude 0 < b < 1 that is multiplied by the magnitude of our gradient estimate, so that the biased estimate can never change the sign of the gradient estimate, but can gently nudge it to be more negative in flat, noisy regions of the loss landscape. We find b = 0.1 works well. Our biased gradient estimate dm_b/dL0 is calculated as:

$$\frac{dm_b}{dL0} = \frac{dm}{dL0} - b\left|\frac{dm}{dL0}\right|$$

We then provide this biased gradient to the Adam optimizer (Kingma & Ba, 2014) with default settings, and allow it to change the L0 parameter. We add the following optional modifications to this algorithm. First, we clip the gradient estimates dm/dL0 to be between -1 and 1. We also set a minimum and maximum $\delta_{L0}$: the minimum avoids the denominator of our gradient estimate being near 0, and the maximum keeps the L0 from changing too quickly. In practice, a minimum $\delta_{L0}$ between 0 and 1 and a maximum $\delta_{L0}$ between 1 and 5 seem to work well. A minimal sketch of this loop is shown below.
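The sketch below is one plausible reading of this procedure, not the paper's implementation: the helper train_step(), the sae object, and all constants are placeholders, and the exact update rule and sign conventions may differ.

import torch

l0 = torch.tensor([30.0], requires_grad=True)   # start deliberately above the expected L0
opt = torch.optim.Adam([l0], lr=0.5)
bias, eval_every, window = 0.1, 100, 10
min_delta, max_delta = 0.5, 5.0

metric_history, prev_metric, prev_l0 = [], None, None
for step in range(30_000):
    # train_step() is a placeholder: one SAE training step at the current L0,
    # returning the current value of s_dec_n on that batch
    metric_history.append(train_step(sae, current_l0=float(l0)))
    if step == 0 or step % eval_every:
        continue
    metric = sum(metric_history[-window:]) / window      # sliding average to reduce noise
    if prev_metric is not None:
        delta = float(l0) - prev_l0                       # actual change in L0 since last eval
        delta = max(min(abs(delta), max_delta), min_delta) * (1 if delta >= 0 else -1)
        grad = (metric - prev_metric) / delta             # finite-difference estimate of dm/dL0
        grad = max(min(grad, 1.0), -1.0)                  # clip to [-1, 1]
        grad = grad - bias * abs(grad)                    # biased estimate dm_b/dL0
        l0.grad = torch.tensor([grad])
        opt.step()                                        # Adam adjusts the L0 parameter
    prev_metric, prev_l0 = metric, float(l0)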
We find that this optimization strategy works very well in toy models, but requires substantial hyperparameter tuning to work on real LLMs, limiting its utility. The starting L0, the choice of n for s_dec_n, the bias b, the learning rate of the Adam optimizer, and the minimum and maximum $\delta_{L0}$ values all have a large impact on how fast and how aggressively the optimization proceeds. The slope of m around the correct L0 is shallow, so it is easy to overshoot. We also find that different values of n take more or less time to converge during training. We expect it is possible to further simplify and improve this process in future work.

A.12 EXTENDED LLM RESULTS

We include further results for Gemma-2-2b layer 20, to extend the analysis to later model layers. Results are shown in Figure 21.

Figure 21: nth decoder projection (top left) and decoder pairwise cosine similarity (top right) with k=16 sparse probing results (bottom left) and k=1 sparse probing results (bottom right) for BatchTopK SAEs trained on Gemma-2-2b layer 20.

A.13 L0 OF OPEN-SOURCE SAES ON NEURONPEDIA

We analyze common open-source SAEs as provided by Neuronpedia (Lin, 2023) and SAELens (Bloom et al., 2024). We include all SAEs cross-listed in both SAELens and Neuronpedia with an L0 reported in SAELens. We show the results as a histogram in Figure 22. Our analysis shows that for layer 12 of Gemma-2-2b, the correct L0 should be around 200-250. However, we find that most open-source SAEs have L0 below 100, much lower than our analysis suggests is ideal.

Figure 22: L0 of SAEs on Neuronpedia with known L0 listed in SAELens. Per-model summary statistics: gemma-2b-it (1 SAE, mean 61.0, median 61.0); gemma-2b (4 SAEs, mean 56.8, median 59.0); gemma-2-2b (40 SAEs, mean 93.0, median 40.5); gemma-2-9b (179 SAEs, mean 53.2, median 41.0); gpt2-small (57 SAEs, mean 32.1, median 32.0); meta-llama/Llama-3.1-8B (192 SAEs, mean 50.0, median 50.0).

A.14 LIMITATIONS

We limited the scope of our investigation to features satisfying the linear representation hypothesis, and do not investigate how SAEs behave if the underlying features are actually non-linear (Engels et al., 2025). However, non-linear features are not necessary for SAEs to fail to work properly, as we demonstrate in this paper. We also do not consider the nuances of how unbalanced correlations impact the SAE, as simple correlations are already enough to cause problems.
However, we do expect that different sorts of correlations may affect SAEs differently, and would encourage future work to look into this. Finally, we only investigated a few layers of popular LLMs, as running sweeps of SAE training at every layer of the LLM was prohibitively expensive for this work. Nevertheless, we have no reason to expect meaningfully different decoder projection behavior at other LLM layers.

A.15 EXTENDED NTH DECODER PROJECTION PLOTS

In this section we document nth decoder projection plots for multiple values of n for each sweep of L0 we performed. We show Llama-3.2-1b layer 7 plots in Figure 25, Gemma-2-2b layer 5 in Figure 23, and Gemma-2-2b layer 12 in Figure 24. We note that in all cases, the low-L0 behavior is similar: no matter the value of n, s_dec_n increases dramatically at low L0. However, the high-L0 behavior is less consistent. We always see a similar "elbow" in the plots at roughly the same place regardless of n, but sometimes this elbow corresponds to a clear global minimum, and sometimes the high-L0 behavior is very shallow. We find that using an n near h/2 (16k in our case) seems to give the best results.

Figure 23: Extended nth decoder projection plots. Gemma-2-2b, layer 5, 32k latents. These plots never have a clear global minimum at the "elbow" point, but the "elbow" is always at the same point regardless of the choice of n.

A.16 EXTENDED ANALYSIS OF JUMPRELU VS BATCHTOPK DYNAMICS

JumpReLU and BatchTopK SAEs are both considered state of the art, but we find they have notable differences in their behavior at high L0 in our experiments. In this section, we explore what may be causing these differences. In theory, JumpReLU and BatchTopK SAEs are very similar, as a BatchTopK SAE can be viewed as a JumpReLU SAE with a single global threshold rather than a per-latent threshold (Bussmann et al., 2024). However, the training losses are quite different. We use the JumpReLU variant laid out by Conerly et al. (2025), which allows gradients to flow through the JumpReLU threshold to the rest of the model parameters. We expect this means that JumpReLU SAEs are better able to coordinate the threshold with the rest of the model parameters, while BatchTopK cannot, as the threshold does not directly receive a gradient in BatchTopK training.

We begin by comparing the encoder bias between JumpReLU and BatchTopK in Figure 26. We see that BatchTopK SAEs rely much more heavily on the encoder bias than JumpReLU SAEs do, with a much wider variance in values and a sharper decrease compared to JumpReLU. We expect this is because BatchTopK cannot coordinate the cutoff threshold with the encoder directly as JumpReLU can, since there is no gradient available to directly change the threshold of BatchTopK SAEs.
Figure 24: Extended nth decoder projection plots. Gemma-2-2b, layer 12, 32k latents, for both JumpReLU and BatchTopK. For BatchTopK, regardless of the choice of n, all plots are minimized around the same L0 range, 200-250. For JumpReLU, there is a clear "elbow" at roughly the same L0, but this elbow is only a clear minimum at n=16k.

Next, we inspect the threshold values of JumpReLU and BatchTopK SAEs in Figure 27. Here as well, we see dramatic differences between BatchTopK and JumpReLU SAEs. The threshold for BatchTopK is much higher than for JumpReLU, and it decreases as L0 increases. This makes sense, since a lower cutoff means more latents can fire. However, JumpReLU unintuitively shows the opposite trend, with the threshold actually increasing with L0. We saw in Figure 26 that the encoder bias for JumpReLU (and BatchTopK) SAEs also increases as L0 increases, so perhaps the increasing threshold for JumpReLU SAEs at higher L0 simply offsets that trend somewhat. We also notice that the variance in JumpReLU SAE thresholds increases as L0 increases, supporting our hypothesis that one reason JumpReLU SAEs seem to handle high L0 better than BatchTopK is that their thresholds can dynamically adjust to near the correct cutoff point per latent, alleviating the situation we saw in BatchTopK SAEs where the SAE can be at both too high and too low an L0 at the same time (Section 4.2).

A.17 PYTORCH PSEUDOCODE FOR METRICS

We present PyTorch pseudocode for the nth decoder projection in Figure 29 and for decoder pairwise cosine similarity in Figure 28.

Figure 25: Extended nth decoder projection plots. Llama-3.2-1b, layer 7, 32k latents. The plots begin to have a sharp minimum around n=14k, but the "elbow" of the plots before the decoder projection increases at low L0 is always around the same location.

Figure 26: Mean encoder bias vs L0. Shaded area corresponds to 1 stdev.
Figure 27: Threshold vs L0 for JumpReLU and BatchTopK SAEs. Shaded area corresponds to 1 stdev. Interestingly, the JumpReLU threshold is much lower than the BatchTopK threshold, and actually increases as L0 increases. We plot JumpReLU on its own (right), since its threshold is so much smaller than BatchTopK's that the trend is otherwise difficult to see.

import torch

def pairwise_decoder_cosine_similarity(sae):
    # W_dec: [n_latents, d_model]; normalize each latent's decoder direction
    norm_dec = torch.nn.functional.normalize(sae.W_dec, dim=1)
    # latent-by-latent cosine similarity matrix
    dec_sims = torch.mm(norm_dec, norm_dec.T)
    # keep each unordered pair once (strict upper triangle)
    triu_mask = torch.triu(
        torch.ones_like(dec_sims), diagonal=1,
    ).bool()
    return dec_sims[triu_mask].abs().mean()

Figure 28: PyTorch pseudocode for decoder pairwise cosine similarity.

def nth_decoder_projection(input_acts, sae, n):
    # input_acts: [batch, d_model]; b_dec: [d_model]; W_dec: [n_latents, d_model]
    dec_proj = (input_acts - sae.b_dec) @ sae.W_dec.T
    # flatten across batch and latents, then sort in descending order
    sorted_dec_proj = dec_proj.flatten().sort(descending=True)
    # select the (n * batch)-th highest projection
    index = n * dec_proj.shape[0]
    return sorted_dec_proj.values[index]

Figure 29: PyTorch pseudocode for nth decoder projection.
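For completeness, a hypothetical usage example of the two functions above, assuming a SAELens-style object with W_dec of shape [n_latents, d_model] and b_dec of shape [d_model]; the DummySAE class and the random activations are stand-ins for illustration only.

import torch

class DummySAE:
    def __init__(self, n_latents=32, d_model=64):
        self.W_dec = torch.nn.functional.normalize(torch.randn(n_latents, d_model), dim=1)
        self.b_dec = torch.zeros(d_model)

sae = DummySAE()
acts = torch.randn(1024, 64)   # stand-in for a batch of LLM activations
print(pairwise_decoder_cosine_similarity(sae))
print(nth_decoder_projection(acts, sae, n=16))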