
Paper deep dive

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

David Chanin, Tomáš Dulka, Adrià Garriga-Alonso

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 76

Models: Gemma-2-2b, Llama-3.2-1b

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/12/2026, 6:52:03 PM

Summary

The paper introduces 'feature hedging', a phenomenon where sparse autoencoders (SAEs) merge components of correlated features into single latents when the SAE is too narrow to represent all underlying features. This occurs due to MSE reconstruction loss and is distinct from feature absorption. The authors demonstrate this theoretically in toy models and empirically in LLMs, noting that narrower SAEs are more susceptible to hedging, which degrades performance and interpretability.

Entities (5)

Feature Hedging · phenomenon · 100%
Sparse Autoencoder · model-architecture · 100%
Large Language Models · technology · 98%
Feature Absorption · phenomenon · 95%
Matryoshka SAEs · model-architecture · 95%

Relation Signals (3)

Feature Hedging exacerbated by Narrow SAE width

confidence 95% · hedging is more severe the narrower the SAE

Feature Hedging caused by MSE reconstruction loss

confidence 90% · This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss

Matryoshka SAEs trade off Feature Absorption

confidence 85% · Matryoshka SAEs thus trade off absorption for hedging.

Cypher Suggestions (2)

Identify dependencies between hyperparameters and phenomena · confidence 85% · unvalidated

MATCH (h:Hyperparameter)-[:EXACERBATES]->(p:Phenomenon) RETURN h.name, p.name

Find all phenomena related to SAE performance issues · confidence 80% · unvalidated

MATCH (p:Phenomenon)-[:AFFECTS]->(s:ModelArchitecture {name: 'Sparse Autoencoder'}) RETURN p.name, p.description
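Both suggestions are marked unvalidated. If the extracted graph is loaded into a Neo4j instance, either query can be checked with the official Python driver. A minimal sketch follows; the URI, credentials, and the assumption that the graph uses the labels from the suggestions themselves are placeholders, not details taken from this page.

```python
from neo4j import GraphDatabase

# Placeholder connection details; adjust for your Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# First suggestion, with aliased return values for easier access.
query = """
MATCH (h:Hyperparameter)-[:EXACERBATES]->(p:Phenomenon)
RETURN h.name AS hyperparameter, p.name AS phenomenon
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["hyperparameter"], "exacerbates", record["phenomenon"])

driver.close()
```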

Abstract

Abstract: It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Importantly, our work shows that SAE width is not a neutral hyperparameter: narrower SAEs suffer more from hedging than wider SAEs.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Open PDF directly →

Full Text

75,578 characters extracted from source content.


FEATURE HEDGING: CORRELATED FEATURES BREAK NARROW SPARSE AUTOENCODERS

David Chanin (University College London), Tomáš Dulka (Independent), Adrià Garriga-Alonso (FAR AI)

ABSTRACT

It is assumed that sparse autoencoders (SAEs) decompose polysemantic activations into interpretable linear directions, as long as the activations are composed of sparse linear combinations of underlying features. However, we find that if an SAE is more narrow than the number of underlying "true features" on which it is trained, and there is correlation between features, the SAE will merge components of correlated features together, thus destroying monosemanticity. In LLM SAEs, these two conditions are almost certainly true. This phenomenon, which we call feature hedging, is caused by SAE reconstruction loss, and is more severe the narrower the SAE. In this work, we introduce the problem of feature hedging and study it both theoretically in toy models and empirically in SAEs trained on LLMs. We suspect that feature hedging may be one of the core reasons that SAEs consistently underperform supervised baselines. Finally, we use our understanding of feature hedging to propose an improved variant of matryoshka SAEs. Importantly, our work shows that SAE width is not a neutral hyperparameter: narrower SAEs suffer more from hedging than wider SAEs.

1 INTRODUCTION

As large language models (LLMs) are deployed in real-world applications, it is increasingly important to understand their internal workings. Sparse autoencoders (SAEs) decompose the dense, polysemantic activations of LLMs into interpretable latent features (Cunningham et al., 2024; Bricken et al., 2023) using sparse dictionary learning (Olshausen & Field, 1997). SAEs have the advantage of operating completely unsupervised, and can easily be scaled to millions of neurons in their hidden layer (hereafter called "latents"¹) (Templeton et al., 2024; Gao et al., 2024).

While SAEs showed promising results, recent work has cast doubt on the performance of SAEs relative to baseline techniques. Wu et al. (2025) show that SAEs underperform on both concept steering and detection relative to baselines, and Kantamneni et al. (2025) show that SAEs underperform simple linear probes on both in-domain and out-of-domain detection, even when the probes have very few training samples. The question, then, is why do SAEs underperform relative to other techniques? And if we can identify the problems holding back SAEs, can we then fix those problems?

One fundamental issue with SAEs is the problem of feature absorption (Chanin et al., 2024), where a more specific latent suppresses the firing of a more general latent. For instance, an SAE may have a latent that appears to track "Cities in USA" but that arbitrarily fails to fire on the specific cities "New York" and "Detroit", where a city-specific latent fires instead. Feature absorption requires underlying features to exist in a hierarchy, with a parent feature $f_p$ and a child feature $f_c$, where $f_c$ can only fire if $f_p$ is firing ($f_c \implies f_p$). Feature absorption is caused by the SAE sparsity penalty, and becomes more severe the wider the SAE. An SAE encoder/decoder under feature absorption is shown in Figure 1b.

In this paper, we identify another fundamental issue with SAEs which we call feature hedging. In hedging, an SAE is too narrow to represent both features $f_a$ and $f_b$ with their own latents $l_a$ and $l_b$.
Ideally, an SAE should assign a latent $l$ to either $f_a$ or $f_b$, and ignore the feature not being tracked. However, if $f_a$ and $f_b$ are either hierarchical as in absorption, or (anti-)correlated, then the SAE latent $l$ can reduce reconstruction error by incorrectly mixing in components of both $f_a$ and $f_b$. A sample SAE encoder and decoder experiencing hedging is shown in Figure 1a. In an LLM SAE, hedging will look like each SAE latent has noise mixed into it, reducing the performance of the latent for both detection and steering. Unlike with absorption, hedging becomes worse the narrower the SAE: thus trying to reduce absorption by making the SAE narrower will simply result in more hedging instead. The differences between hedging and absorption are shown in Table 1.

¹ We use the term "latents" for the hidden neurons of the SAE to avoid overloading the term "feature". We use "feature" only to describe interpretable concepts represented by the model.

arXiv:2505.11756v2 [cs.LG], 26 Sep 2025

Table 1: Comparing feature hedging and feature absorption

Feature absorption | Feature hedging
Learns gerrymandered latents | Learns polysemantic mixtures of features
Caused by sparsity loss | Caused by MSE reconstruction loss
Features are all tracked in the SAE | One feature is in the SAE, the other is not
Affects the encoder and decoder asymmetrically | Affects encoder and decoder symmetrically
Gets worse the wider the SAE | Gets worse the narrower the SAE
Requires hierarchical features | Requires only correlation between features

In LLM SAEs, the SAE is almost certainly narrower than the number of underlying features, as even extremely wide LLM SAEs appear to miss features (Templeton et al., 2024). Furthermore, we expect that nearly every feature in an LLM has positive and negative correlations to many features. We thus expect that hedging is the norm in LLM SAEs and will significantly distort their performance.

[Figure 1: heatmaps of cosine similarity between SAE encoder/decoder latents and true features, for hierarchical features $f_1$ and $f_2$ where $f_2 \implies f_1$. These features lead to either hedging or absorption depending on the width of the SAE.]
(a) When the SAE is only wide enough to represent one of the two features, we see feature hedging. Latent $l_1$ mainly tracks $f_1$, but a small component of $f_2$ is incorrectly mixed into the latent $l_1$ as well. $f_2$ is mixed symmetrically into both the encoder and decoder.
(b) Adding a new latent to the SAE so it is wide enough to track both features, we see feature absorption. The decoder for $l_1$ perfectly tracks $f_1$, but its encoder turns off if $f_2$ is also active. $l_2$ tracks $f_2$, but its decoder mixes $f_1$ and $f_2$. Asymmetry between encoder and decoder is characteristic of absorption.

A solution to feature absorption has been proposed in the form of matryoshka SAEs (Bussmann et al., 2025). Matryoshka SAEs use nested SAE loss terms to enforce a hierarchy on the SAE latents, solving absorption by forcing the narrow inner levels of the SAE to reconstruct inputs on their own. However, as we show in this paper, matryoshka SAEs suffer more from hedging due to the inner matryoshka levels essentially being very narrow SAEs. Matryoshka SAEs thus trade off absorption for hedging.
In this work, we define and study feature hedging both theoretically in toy models and empirically in LLM SAEs. We show that hedging is worse the more narrow the SAE, and introduce a technique to characterize the amount of hedging present in a given SAE. We also study hedging and absorption in matryoshka SAEs, and show that it is possible to improve the monosemanticity of matryoshka SAEs by tuning the relative loss coefficients in each level of the matryoshka SAE to better balance the competing forces of absorption and hedging, though both problems remain present. We show as well that SAE width is not a neutral hyperparameter: narrow SAEs suffer more from hedging than wider SAEs. Code is available at https://github.com/chanind/feature-hedging-paper.

2 BACKGROUND

Sparse autoencoders (SAEs). An SAE decomposes an input activation $x \in \mathbb{R}^D$ into a hidden state $z$ consisting of $L$ hidden neurons, called "latents". An SAE is composed of an encoder $W_{enc} \in \mathbb{R}^{L \times D}$, a decoder $W_{dec} \in \mathbb{R}^{D \times L}$, a decoder bias $b_{dec} \in \mathbb{R}^D$, an encoder bias $b_{enc} \in \mathbb{R}^L$, and a nonlinearity $\sigma$, typically ReLU or a variant like JumpReLU (Rajamanoharan et al., 2024), TopK (Gao et al., 2024) or BatchTopK (Bussmann et al., 2024).

$z = \sigma(W_{enc}(x - b_{dec}) + b_{enc})$    (1)

$\hat{x} = W_{dec} z + b_{dec}$    (2)

The SAE is trained with a reconstruction loss, typically Mean Squared Error (MSE), and a sparsity-inducing loss consisting of a function $S$ that penalizes non-sparse representations, with corresponding sparsity coefficient $\lambda$. For standard L1 SAEs, $S$ is the L1 norm of $z$. For TopK and BatchTopK SAEs, there is no sparsity-inducing loss ($S = 0$) as the TopK function directly induces sparsity. There is sometimes also an additional auxiliary loss $\mathcal{L}_{aux}$ with coefficient $\alpha$ to ensure all latents fire. Standard L1 SAEs typically do not have an auxiliary loss (Olah et al., 2024). The general SAE loss is

$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda S + \alpha \mathcal{L}_{aux}$    (3)

Tied SAEs. A tied SAE has $W_{enc} = W_{dec}^T$. The biases have different dimensions and are untied.

Matryoshka SAEs. A matryoshka SAE (Bussmann et al., 2025) extends the SAE definition by summing losses computed on prefixes of the SAE latents. This forces each sub-SAE to reconstruct input activations on its own, and incentivizes the SAE to place more common, general concepts into latents with smaller index numbers. A matryoshka SAE uses nested prefixes with sizes $M = \{m_1, m_2, \ldots, m_n\}$ where $m_1 < m_2 < \ldots < m_n = L$, and $L$ is the number of latents in the full dictionary. The matryoshka SAE loss is:

$\mathcal{L} = \sum_{m \in M} \|x - \hat{x}_m\|_2^2 + \lambda S_m + \alpha \mathcal{L}_{aux}$    (4)

where $\hat{x}_m$ is the reconstruction of the SAE using the first $m$ latents, and $S_m$ is the sparsity penalty applied to the first $m$ latents. For TopK and BatchTopK matryoshka SAEs, there is no sparsity penalty ($S_m = 0$) as the TopK function directly imposes sparsity.
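To make Equations 1-4 concrete, here is a minimal PyTorch sketch of a plain L1 SAE and the nested matryoshka loss. The class and function names are ours, the initialization is simplistic, and auxiliary losses are omitted; this is an illustration of the equations, not SAELens's actual implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal L1 SAE implementing Equations 1-2 (ReLU nonlinearity)."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_model, n_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Equation 1: z = sigma(W_enc (x - b_dec) + b_enc)
        z = torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)
        # Equation 2: x_hat = W_dec z + b_dec
        x_hat = z @ self.W_dec.T + self.b_dec
        return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff):
    # Equation 3 with S = L1 norm of z and no auxiliary loss (alpha = 0)
    mse = ((x - x_hat) ** 2).sum(dim=-1).mean()
    return mse + l1_coeff * z.abs().sum(dim=-1).mean()

def matryoshka_loss(sae, x, prefix_sizes, l1_coeff):
    # Equation 4: sum of per-prefix SAE losses, with m_1 < ... < m_n = L
    z, _ = sae(x)
    total = 0.0
    for m in prefix_sizes:
        x_hat_m = z[:, :m] @ sae.W_dec[:, :m].T + sae.b_dec
        total = total + sae_loss(x, z[:, :m], x_hat_m, l1_coeff)
    return total
```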
3 TOY MODELS OF FEATURE HEDGING

The linear representation hypothesis (LRH) states that features in LLMs are represented as nearly-orthogonal linear directions in representation space (Bricken et al., 2023). The goal of SAEs, then, is to recover these underlying "true features" of the model, where each latent of the SAE decoder perfectly matches an underlying feature of the model. In real LLMs we do not have ground-truth knowledge of these underlying features, making it difficult to know if SAEs are succeeding at recovering model features. Fortunately, it is easy to create synthetic training data for SAEs that follows the LRH and gives us ground-truth knowledge of the underlying features. This allows us to understand when SAEs will learn the underlying features of the model, and when SAEs fail.

We define a toy model consisting of $N$ true features $F \in \mathbb{R}^{N \times D}$, where each $\|f_i\|_2 = 1$. These features are mutually orthogonal, so $\forall i \neq j,\ f_i \cdot f_j = 0$. Each feature $f_i$ has a corresponding firing probability $p_i \in [0, 1]$. For each sample, we generate a binary activation vector $a \in \{0, 1\}^N$ where $a_i \sim \mathrm{Bernoulli}(p_i)$ indicates whether feature $f_i$ is active (fires). The model can incorporate feature dependencies by conditioning firing probabilities: $a_i \mid a_{-i} \sim \mathrm{Bernoulli}(p_i(a_{-i}))$, where $a_{-i}$ denotes the activation states of all other features. We then generate SAE training samples $x$ from this model as $x = \sum_{i=1}^{N} a_i f_i$.

We say that an SAE is correct or monosemantic for this toy model if every latent in the SAE dictionary matches a true feature direction, and each SAE latent corresponds to a different true feature. Formally, there exists an injection $\pi : \{1, \ldots, L\} \to \{1, \ldots, N\}$ such that $\cos(W_{dec,i}, f_{\pi(i)}) = 1$ for all $i \in \{1, \ldots, L\}$. We only investigate SAEs where $L \leq N$ in our toy experiments. We say an SAE is polysemantic if some SAE latents contain positive or negative components of multiple true features, so there exists at least one latent $i \in \{1, \ldots, L\}$ such that $|\{j \in \{1, \ldots, N\} : |W_{dec,i} \cdot f_j| > \varepsilon\}| > 1$ for some threshold $\varepsilon > 0$.

For all SAEs in this section, we train on 15M synthetic activations using SAELens (Bloom et al., 2024). In this section we show plots of the cosine similarity between the SAE encoder / decoder and the true features. Each cell $i,j$ in these plots is simply $\cos(W_{enc,i}^T, f_j)$ and $\cos(W_{dec,i}, f_j)$, respectively. We re-arrange the indices of the SAE latents to best align visually with true features.

3.1 FULLY INDEPENDENT FEATURES

We first study the case of a toy model with $N = 4$ independent features. Features 1-3 fire with probability 0.25, and feature 4 fires with probability 0.2. We plot the encoder / decoder cosine similarity with true features in Figure 2. When features fire independently, the SAE learns correct features regardless of the width of the SAE.

[Figure 2: encoder/decoder cosine-similarity heatmaps for an SAE with three latents (left) and four latents (right) trained on a toy model with independent features. Both SAEs learn correct features.]

Unfortunately, real LLMs do not have fully independent features. SAEs were first studied under toy models with independent features (Elhage et al., 2022), and this is likely why the field was not aware of feature hedging earlier.
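The data-generating process defined at the start of this section can be sketched in a few lines of NumPy. The helper names and the QR-based construction of orthogonal features are our choices; the `cond` hook shows one way to implement the conditional firing probabilities, here reproducing the hierarchical setup studied next in Section 3.2.

```python
import numpy as np

def make_features(n_features: int, dim: int, seed: int = 0) -> np.ndarray:
    """Mutually orthogonal unit-norm true features F in R^{N x D} (needs N <= D)."""
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix yields an orthonormal basis.
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_features)))
    return q.T  # rows are the features f_i

def sample_activations(features, base_probs, n_samples, cond=None, seed=0):
    """x = sum_i a_i f_i with a_i ~ Bernoulli(p_i); `cond` optionally rewrites
    the firing vector given the other features' states (feature dependencies)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    xs = np.empty((n_samples, features.shape[1]))
    for s in range(n_samples):
        a = (rng.random(n) < np.asarray(base_probs)).astype(float)
        if cond is not None:
            a = cond(a, rng)  # e.g. enforce f_4 => f_3, or (anti-)correlation
        xs[s] = a @ features
    return xs

# Hierarchical example (Section 3.2): f_4 can only fire if parent f_3 fires.
F = make_features(4, 50)

def hierarchy(a, rng):
    if a[2] == 0:      # parent f_3 off => child f_4 cannot fire
        a[3] = 0.0
    return a

X = sample_activations(F, [0.25, 0.25, 0.25, 0.2], n_samples=10_000, cond=hierarchy)
```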
3.2 HIERARCHICAL FEATURES

Next, we explore true features that fire hierarchically. We modify the toy model from Section 3.1 above, and set $f_3$ as the parent feature and $f_4$ as the child feature, so $f_4 \implies f_3$. That is, $f_4$ cannot fire unless $f_3$ is also firing. Hierarchical features cause feature absorption in SAEs that are wide enough to represent both the parent and child feature, but what happens if the SAE is not wide enough to represent the child latent? This is important as this is the intuition behind why matryoshka SAEs work to combat absorption: if inner SAE levels are too narrow to represent both parent and child features, we hope that only the parent will be represented. We show results in Figure 3.

[Figure 3: encoder/decoder cosine-similarity heatmaps for SAEs trained on a toy model with hierarchical features ($f_4 \implies f_3$). When the SAE is too narrow to represent $f_4$ (left), we see hedging, where latent 3 mixes in a component of $f_4$. When the SAE is wide enough to contain both $f_3$ and $f_4$ (right), we see feature absorption.]

As expected, in the full-width SAE we see a classic feature absorption pattern. The parent latent encoder, $l_3$, learns $\neg f_4 \wedge f_3$, disabling the latent from firing if $f_4$ is present. The child latent, $l_4$, mixes both $f_3$ and $f_4$ together in the decoder.

However, when the SAE is too narrow to represent $f_4$, we now see that latent $l_3$ mixes a component of $f_4$ into both its encoder and decoder! We refer to this as feature hedging. The SAE is learning an incorrect, polysemantic latent that mixes correlated features. While we expect this will be a problem for all SAEs, it is particularly problematic for matryoshka SAEs, as matryoshka SAEs combat absorption by using inner SAE levels that are too narrow to contain both parent and child features. However, as we see here, this causes hedging. We show that MSE loss directly causes hedging with hierarchical features in Appendix A.1.

3.3 POSITIVELY CORRELATED FEATURES

Hierarchy is a particularly extreme form of positive correlation, where a feature can only fire if another feature also fires. Next, we relax that restriction, and investigate what happens if a feature is merely more likely to fire along with another feature, but can still fire on its own as well. We modify the toy model so that $p_4 = 0.2$ if $f_3$ fires, but $p_4 = 0.1$ if $f_3$ does not fire, adding a small positive correlation between $f_3$ and $f_4$. We show results in Figure 4.

[Figure 4: encoder/decoder cosine-similarity heatmaps for SAEs trained on a toy model with positive correlation between features $f_3$ and $f_4$. When the SAE is too narrow to represent $f_4$ (left), we still see hedging in latent 3. When the SAE is wide enough to contain both $f_3$ and $f_4$ (right), the SAE learns correct features.]

We still see that the SAE is mixing a positive component of $f_4$ into $l_3$, despite there no longer being perfect hierarchy! We also see that if we extend the SAE width so that $f_4$ is tracked by its own latent $l_4$, there is now no absorption at all, as absorption requires (nearly) hierarchical features to arise.
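To put a number on "small positive correlation": for two Bernoulli features defined by a marginal and two conditional firing probabilities, the Pearson correlation follows directly from the 2×2 joint distribution. A quick check (the helper name is ours) on the Section 3.3 probabilities, and on the anti-correlated probabilities studied next in Section 3.4:

```python
import math

def bernoulli_corr(p_parent, p_child_given_parent, p_child_given_not_parent):
    """Pearson correlation between Bernoulli features f3, f4 defined by
    P(f3), P(f4|f3), and P(f4|~f3)."""
    p3 = p_parent
    p4 = p3 * p_child_given_parent + (1 - p3) * p_child_given_not_parent
    p34 = p3 * p_child_given_parent        # P(f3=1, f4=1)
    cov = p34 - p3 * p4
    return cov / math.sqrt(p3 * (1 - p3) * p4 * (1 - p4))

print(bernoulli_corr(0.25, 0.2, 0.1))  # Section 3.3 setup: ~ +0.13
print(bernoulli_corr(0.25, 0.1, 0.2))  # Section 3.4 setup: ~ -0.11
```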
3.4 ANTI-CORRELATED FEATURES

So far we have only seen the effect of positive correlation between features. We next change our toy model so $f_4$ is more likely to fire if $f_3$ does not fire. We set $p_4 = 0.1$ if $f_3$ fires, but $p_4 = 0.2$ if $f_3$ does not fire. Results are shown in Figure 5.

[Figure 5: encoder/decoder cosine-similarity heatmaps for standard SAEs trained on a toy model with negative (anti-)correlation between features $f_3$ and $f_4$. When the SAE is too narrow to represent $f_4$ (left), latent 3 mixes in a negative component of $f_4$. When the SAE is wide enough to contain both $f_3$ and $f_4$ (right), the SAE learns correct features.]

We now see that $l_3$ is mixing in a negative component of $f_4$. This demonstrates that the correlation is the cause of the hedging: flipping the sign of the correlation flips the sign of the hedging.

The implications of this for SAE performance are quite dire. While it is already bad for positively correlated features to become hedged (e.g. "sunshine" and "summertime"), at least the mixed features have some relation to each other. For anti-correlated features, this could look like a latent for "chemical molecules" having a negative component of the "Darth Vader" feature mixed in, since chemical molecules and Darth Vader are highly anti-correlated. Worse, it is not even clear if the inverse of the "Darth Vader" direction is meaningful in the model at all, or is just noise. We further expect that there are many more negative correlations than positive correlations in language, e.g. a negative component of every word in every non-English language may be mixed into every latent tracking an English word. These negative correlations likely introduce what looks like a lot of random noise into SAE latents, and this can only harm performance and interpretability.

3.5 STUDYING HEDGING IN SINGLE-LATENT SAES

We next investigate hedging in the simplest possible toy SAE setting: an SAE with a single latent. We use a model with two true features $f_1$ and $f_2$ ($N = 2$, $D = 50$). Each feature fires with magnitude 1.0. Unless otherwise specified, $f_1$ fires with probability 0.25, and $f_2$ fires with probability 0.2. We use SAELens (Bloom et al., 2024) to train a single-latent SAE on these activations.

3.5.1 FULLY INDEPENDENT FEATURES

[Figure 6: true features, SAE decoder latent, and $b_{dec}$ for a single-latent SAE on a toy model with two true features, under (a) independent, (b) hierarchical, (c) correlated, and (d) anti-correlated features. When the features fire independently, there is no hedging seen in the SAE latent. When any correlation is present, the SAE latent shows clear hedging.]

We first study the case when $f_1$ and $f_2$ fire independently. We find that the SAE correctly represents $f_1$ without any interference from $f_2$. However, the decoder bias has incorrectly learned to represent the direction of $f_2$, but with magnitude 0.2, equal to the probability of $f_2$ firing. The single SAE latent, SAE bias term, and true features are shown in Figure 6a. We consistently find this pattern of the decoder bias merging in positive components of features not tracked by their own latent. In this sense, the decoder bias can be thought of as an always-on latent, and is thus also susceptible to hedging.
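The single-latent setting is simple enough to reproduce end-to-end without SAELens. The sketch below trains a tied single-latent SAE in plain PyTorch on the hierarchical firing pattern studied next in Section 3.5.2; the learning rate, step count, and L1 coefficient are our choices, not the paper's training configuration, and the printed outcomes are what the paper's results lead us to expect rather than guaranteed values.

```python
import torch

torch.manual_seed(0)
D = 50
f1 = torch.zeros(D); f1[0] = 1.0   # two orthogonal unit-norm true features
f2 = torch.zeros(D); f2[1] = 1.0

def batch(n=4096):
    # Hierarchical firing (Section 3.5.2): P(f1)=0.25, P(f2|f1)=0.2, P(f2|~f1)=0
    a1 = (torch.rand(n) < 0.25).float()
    a2 = a1 * (torch.rand(n) < 0.2).float()
    return a1[:, None] * f1 + a2[:, None] * f2

# Tied single-latent SAE (W_enc = W_dec^T), latent initialized at f1
w = f1.clone().requires_grad_(True)
b_enc = torch.zeros(1, requires_grad=True)
b_dec = torch.zeros(D, requires_grad=True)
opt = torch.optim.Adam([w, b_enc, b_dec], lr=1e-3)

for step in range(20_000):
    x = batch()
    z = torch.relu((x - b_dec) @ w + b_enc)   # Equation 1, tied weights
    x_hat = z[:, None] * w + b_dec            # Equation 2
    loss = ((x - x_hat) ** 2).sum(-1).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

l = w.detach() / w.detach().norm()
print("cos(l, f1):", (l @ f1).item())  # expected slightly below 1 under hedging
print("cos(l, f2):", (l @ f2).item())  # expected positive: f2 mixed into the latent
```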
3.5.2 HIERARCHICAL FEATURES

Next, we investigate what happens if $f_1$ and $f_2$ are in a hierarchy, so $f_2$ can only fire if $f_1$ fires, but $f_1$ can still fire on its own ($f_2 \implies f_1$). We adjust the firing probability of $f_2$ so that $P(f_2 \mid f_1) = 0.2$ and $P(f_2 \mid \neg f_1) = 0$ (thus, $P(f_2) = 0.05$). In a two-latent SAE this setup would cause feature absorption. We plot the SAE latent, decoder bias, and true features in Figure 6b.

Here we clearly see feature hedging. The SAE has now merged a component of $f_2$ into its single latent, so it is now a mixture of $f_1$ and $f_2$. This merging of features reduces the MSE loss of the SAE despite being a degenerate solution.

3.5.3 POSITIVELY CORRELATED FEATURES

Next, we change our setup so that $P(f_2 \mid \neg f_1) = 0.1$ instead of 0. We still keep $P(f_2 \mid f_1) = 0.2$, so that $f_2$ is more likely to fire if $f_1$ fires, but it can still fire on its own as well. The features are now merely correlated rather than following a strict hierarchy. Results are shown in Figure 6c.

We still see hedging in the SAE latent, but less than with fully hierarchical features. However, if the L1 penalty is high enough and the level of correlation is low enough, then the SAE can still learn the correct features, as positive hedging increases the L0 of the SAE slightly relative to learning just $f_1$. We show the resulting SAE latent and features with a high L1 penalty in Figure 7b. Interestingly, we now see that the hedging has moved more apparently into the decoder bias instead. If we use a full-width SAE, the SAE learns the true features despite the correlation (Appendix A.2).

3.5.4 ANTI-CORRELATED FEATURES

Next, we reverse the conditional probabilities of $f_2$ so that $P(f_2 \mid f_1) = 0.1$ and $P(f_2 \mid \neg f_1) = 0.2$. Now $f_2$ is more likely to fire on its own than it is to fire along with $f_1$, making these features anti-correlated. Results are shown in Figure 6d.

Now the SAE latent has merged a negative component of $f_2$ into its single latent instead of a positive component. How does this work? We see that the decoder bias, $b_{dec}$, has a larger component of $f_2$ than in the positive correlation case. The SAE is using the decoder bias to include a "default" value for $f_2$, and then when $f_1$ fires, the SAE latent's negative component of $f_2$ acts to reduce the amount of $f_2$ present in the reconstruction. The SAE is abusing the correlation to adjust its guess of the amount of $f_2$ that should be output despite not having a dedicated latent for $f_2$: if $f_1$ is active, then the likelihood that $f_2$ is active decreases, and the SAE likewise reduces the amount of $f_2$ that is output. Increasing the L1 penalty cannot solve this, as the negative component of hedging in the encoder does not increase the L0 of the SAE. If we use a full-width SAE, we again see the SAE learns the true features despite the correlation (see Appendix A.2).

3.5.5 HEDGING IS A FUNCTION OF FEATURE CORRELATION

Next, we explore the effect of feature correlation on the amount of hedging in our single-latent, two-feature setting. We set $P(f_1) = 0.45$ and $P(f_2) = 0.25$, but change the correlation between these features, $\rho$, to range from $-0.5$ to $0.5$. We then calculate the cosine similarity of the SAE decoder latent, $l$, with $f_2$. We furthermore initialize the single SAE latent to match $f_1$, so that any deviation from this must be caused by gradient pressure rather than simply being an unfortunate local minimum. If there is no hedging occurring, then $\cos(l, f_2) = 0$, as we saw in Figure 6a.
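This sweep parameterizes the features by $\rho$ directly rather than by conditional probabilities. One way to sample a Bernoulli pair with fixed marginals and a target Pearson correlation is to construct the 2×2 joint probability table explicitly; the helper below is our sketch (the sweep is only feasible where all four cell probabilities stay non-negative).

```python
import numpy as np

def sample_correlated_pair(p1, p2, rho, n, seed=0):
    """Sample (a1, a2) Bernoulli pairs with marginals p1, p2 and Pearson
    correlation rho, by constructing the 2x2 joint probability table."""
    p11 = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    table = {
        (1, 1): p11,
        (1, 0): p1 - p11,
        (0, 1): p2 - p11,
        (0, 0): 1 - p1 - p2 + p11,
    }
    assert all(v >= 0 for v in table.values()), "rho infeasible for these marginals"
    rng = np.random.default_rng(seed)
    cells = list(table.keys())
    idx = rng.choice(4, size=n, p=list(table.values()))
    return np.array([cells[i] for i in idx])

# Setup of Section 3.5.5: P(f1)=0.45, P(f2)=0.25, rho swept over [-0.5, 0.5]
pairs = sample_correlated_pair(0.45, 0.25, rho=0.3, n=100_000)
print(pairs.mean(axis=0), np.corrcoef(pairs.T)[0, 1])  # ~[0.45, 0.25], ~0.3
```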
Results are shown in Figure 7a.

[Figure 7: hedging as a function of feature correlation, and the effect of the L1 penalty on positive hedging. (a) Hedging amount ($\cos(l, f_2)$) vs. correlation between $f_1$ and $f_2$: the degree to which $l$ mixes in $f_2$ is a clear function of the amount of correlation between features. (b) True features, SAE latent, and $b_{dec}$ for correlated features with a high L1 penalty: a high L1 penalty can reduce hedging caused by positive correlations.]

As expected, the amount of hedging directly tracks the amount of correlation. The hedging also matches the sign of the correlation, with negative correlation resulting in a negative component of $f_2$ being mixed into $l$, and positive correlation resulting in a positive component of $f_2$ being mixed into $l$.

4 QUANTIFYING HEDGING IN LLM SAES

While we have demonstrated hedging in a synthetic setting, it remains a question how much hedging occurs in LLM SAEs. Based on our understanding of hedging in toy models, we expect that when a new latent is added to an SAE, this should "pull out" the component of the new feature from existing SAE latents, where it was previously hedged. Thus if hedging occurs, the change in existing latents after a new latent is added should project onto that new latent. If hedging did not exist, then adding a new latent should not have any effect on existing latents.

Parent latents are learned before child latents. A key assumption in matryoshka SAEs is that if latents exist in a hierarchy, and the SAE is too narrow to represent both the parent and child, the SAE will learn the parent first. We feel this assumption is reasonable since parent latents, by definition, fire more frequently than child latents, so the SAE is incentivized to learn them first. This insight allows us to differentiate hedging from absorption. Under absorption, if a newly added latent is a child feature of an existing latent, then the encoder for the parent latent adds a negative component of the child latent to avoid firing when the child is active, but the parent decoder latent remains unchanged. This corresponds to adding $l_2$ to Figure 1a and arriving at Figure 1b. The decoder of $l_1$ (the parent) remains identical to before $l_2$ is added, except the hedging from $f_2$ is removed. Thus, changes to existing decoder latents cannot be absorption and must be due to hedging.

Hedging degree. Taking this into account, we define a metric called hedging degree, $h$. We take an existing SAE $s_0$ with $L$ latents and add $N$ new latents to the SAE. After adding these latents, we continue training the SAE and arrive at a new SAE, $s_1$, with $L + N$ latents. We also continue training $s_0$ on the same tokens that we train $s_1$ on, to ensure that any difference between $s_0$ and $s_1$ is due only to the newly added latents. $W_{dec}^0$ refers to the new decoder of $s_0$, and $W_{dec}^1$ refers to the decoder of $s_1$. $W_{dec}$ is normalized so each latent has unit norm. We define the difference in the original $L$ latents between $s_0$ and $s_1$ as:

$\delta_L = W_{dec}^1[0:L] - W_{dec}^0[0:L]$    (5)

$W_{dec}^1[L:L+N]$ refers to the newly added decoder latents, and $W_{rand}[0:N]$ refers to a decoder consisting of $N$ randomly initialized unit-norm latents. All decoders are normalized to have latents of unit norm. We define the projection of a vector $v$ onto the subspace spanned by $W$ as:

$\mathrm{Proj}(v, W) = W(W^T W)^{-1} W^T v$    (6)

We expect that even if there were no hedging at all, simply due to noise, existing SAE decoder latents may undergo a change that has some small projection onto newly added latents. We want to make sure that anything we quantify as hedging must be larger than what we would expect from random noise. Taking this into account, the hedging degree $h$ is then defined as:

$h = \frac{1}{L} \sum_{i=1}^{L} \left( \|\mathrm{Proj}(\delta_L[i],\, W_{dec}^1[L:L+N])\| - \|\mathrm{Proj}(\delta_L[i],\, W_{rand}[0:N])\| \right)$    (7)

The first term is the projection of $\delta_L$ onto the $N$ new latents; the second is its projection onto $N$ random latents. Any value of $h > 0$ corresponds to hedging above what we would expect from random noise, as $h$ subtracts the projection along $N$ randomly initialized unit-norm latents as part of the computation. The choice of the number of new latents $N$ is a hyperparameter of hedging degree. We use $N = 64$ for our hedging degree calculation. We explore the effect of different choices of $N$ in Appendix A.4.
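A NumPy sketch of Equations 5-7 follows. It assumes decoders are stored as (dimension × latents) matrices with unit-norm columns and that `W1_dec` is `W0_dec` with $N$ new columns appended; the names are ours, and the least-squares solve is just a numerically safer way of computing the Equation 6 projection than forming $(W^T W)^{-1}$ explicitly.

```python
import numpy as np

def proj_norm(v, W):
    """Norm of the projection of v onto the column span of W (Equation 6)."""
    coeffs, *_ = np.linalg.lstsq(W, v, rcond=None)
    return float(np.linalg.norm(W @ coeffs))

def hedging_degree(W0_dec, W1_dec, n_new, rng=None):
    """Equation 7. Decoders are (dim, n_latents) with unit-norm columns;
    W1_dec has n_new extra columns appended relative to W0_dec."""
    rng = rng or np.random.default_rng(0)
    dim, L = W0_dec.shape
    delta = W1_dec[:, :L] - W0_dec            # Equation 5
    W_new = W1_dec[:, L:L + n_new]            # newly added latents
    W_rand = rng.standard_normal((dim, n_new))
    W_rand /= np.linalg.norm(W_rand, axis=0)  # random unit-norm baseline
    h = 0.0
    for i in range(L):
        h += proj_norm(delta[:, i], W_new) - proj_norm(delta[:, i], W_rand)
    return h / L
```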
4.1 RESULTS

We experiment with SAEs trained on Gemma-2-2b (Team et al., 2024), as this model is commonly used for SAE research due to the thoroughness of the Gemma Scope suite of SAEs (Lieberum et al., 2024), as well as Llama-3.2-1b (Dubey et al., 2024) to validate results on another LLM. All SAEs are trained first on 250M tokens of the Pile uncopyrighted (Gao et al., 2020). After adding $N = 64$ latents, we continue training for another 250M tokens. The version of the SAE without latents added is also trained for another 250M tokens, so each SAE is trained for 500M tokens total. The pair of extended and non-extended SAEs is used to calculate hedging degree. SAE training details are in Appendix A.3.

We first calculate hedging degree vs SAE width in Figure 8a, with widths ranging from 128 to 65536. Hedging degree is dramatically higher at narrower widths, especially at width 4096 and below. While the hedging rate drops considerably with increasing SAE width, even at our max width of 65536 no SAE achieves 0 hedging degree, indicating there is still hedging occurring.

[Figure 8: hedging degree for SAEs trained on Gemma-2-2b layer 12 (series: gemma/btk, gemma/l1, llama/btk, llama/l1). (a) Hedging degree (log scale) vs. width: no SAE tested reached 0 hedging. (b) Hedging degree vs. layer, normalized by number of LLM layers. (c) Hedging degree vs. L0. Unless otherwise specified, SAEs have width 8192 and BatchTopK SAEs have K=25. Shaded area in plots is 1 std.]

We next calculate hedging degree vs L0 (the average number of active latents) in Figure 8c, with L0 ranging from about 5 to 200. Very low L0 seems to lead to more hedging for BatchTopK SAEs, but the effect is minor compared with the effect of SAE width on hedging degree.

Finally, we calculate hedging degree vs layer in Figure 8b. The hedging degree for L1 and BatchTopK SAEs appears to converge toward the final layers of the model, but overall the layer does not appear to have a massive effect on hedging degree. It also appears that BatchTopK SAEs have more hedging than L1 SAEs.
We suspect that L1 loss can reduce hedging from positively correlated features, as we saw in Section 3.5.3. We further validate hedging in LLM SAEs via a case study of adding a new latent to an SAE trained on Gemma-2-2b in Appendix A.5.

5 BALANCING HEDGING AND ABSORPTION IN MATRYOSHKA SAES

Matryoshka SAEs (Bussmann et al., 2025) combat absorption with nested SAE loss prefixes. Each level acts like a small SAE, and is forced to reconstruct the input on its own. This forces the SAE to learn more general concepts in earlier levels, and makes it difficult for the SAE to make holes in the recall of parent latents for absorption, as this would hurt the reconstruction of earlier levels. However, since early matryoshka levels are effectively narrow SAEs, they suffer from feature hedging. As we saw in Section 4.1, the more narrow an SAE is, the worse the hedging. Matryoshka SAEs thus solve feature absorption at the expense of exacerbating feature hedging.

Inspecting the effect of hedging and absorption on the SAE encoder in Figure 1b shows that hedging and absorption have opposite effects. For hierarchical features, hedging adds a positive component of child features into the parent encoder latent, but absorption does the opposite and adds a negative component of child features into the parent latent. If we balance the negative component of child latents from absorption with the positive component from hedging, these effects can cancel out.

Balance matryoshka SAE. We extend the definition of a matryoshka SAE from Equation 4 to allow applying a scaling coefficient $\beta_m$ to the loss for each matryoshka level:

$\mathcal{L} = \sum_{m \in M} \beta_m \|x - \hat{x}_m\|_2^2 + \lambda S_m + \alpha \mathcal{L}_{aux}$    (8)

We refer to this extension as a balance matryoshka SAE, where each $\beta_m \geq 0$ controls the relative balance of each level. If each $\beta_m = 1$ this is a standard matryoshka SAE. If $\beta_m = 0$ for all matryoshka levels except the outermost level, this reduces to a standard (non-matryoshka) SAE.

We demonstrate this balancing in a toy model of hierarchical features. The toy model has 4 features, with feature 1 being the parent feature and features 2-4 being children (features 2-4 can only fire if feature 1 is also firing). Feature 1 fires with probability 0.25, and each child feature fires with probability 0.15 if feature 1 is firing. We train a matryoshka SAE with a single inner level consisting of only latent 1, with balance coefficient $\beta$ (since there is only one inner level, we always set the outer level coefficient to 1). For more details on this toy setup, see Appendix A.7.

[Figure 9: balancing hedging and absorption in a toy model of hierarchical features, where child features 2-4 only fire if parent feature 1 fires. The matryoshka SAE has a single inner level with 1 latent, represented by a black box around latent 1. (a) Matryoshka SAE with detached loss (equivalent to a matryoshka SAE with $\beta = \infty$): hedging adds positive components of the child features 2-4 to the encoder of latent 1. (b) Standard SAE (equivalent to a matryoshka SAE with $\beta = 0$): absorption adds negative components of the child features 2-4 to the encoder of latent 1. (c) Roughly balanced matryoshka SAE with $\beta = 0.25$: the positive and negative contributions of hedging and absorption roughly cancel out, leaving a nearly perfect SAE.]
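The balance matryoshka loss of Equation 8 is a small change to the matryoshka loss sketched in Section 2, with $\beta_m$ applied to each level's reconstruction term. As before, this is an illustrative sketch reusing the hypothetical `sae` object from the earlier snippet, not the authors' training code.

```python
def balance_matryoshka_loss(sae, x, prefix_sizes, betas, l1_coeff):
    """Equation 8: beta_m scales each level's reconstruction term.
    betas = [1, ..., 1] recovers a standard matryoshka SAE;
    betas = [0, ..., 0, 1] recovers a standard (non-matryoshka) SAE."""
    z, _ = sae(x)
    total = 0.0
    for m, beta in zip(prefix_sizes, betas):
        x_hat_m = z[:, :m] @ sae.W_dec[:, :m].T + sae.b_dec
        mse_m = ((x - x_hat_m) ** 2).sum(dim=-1).mean()
        total = total + beta * mse_m + l1_coeff * z[:, :m].abs().sum(dim=-1).mean()
    return total

# Compound multiplier of 0.75 across 5 levels, outermost beta fixed to 1
# (the scheme described in Section 5 below):
betas = [0.75 ** (5 - m) for m in range(1, 6)]  # [0.316..., 0.421..., 0.5625, 0.75, 1.0]
```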
We show results in Figure 9. When $\beta$ is too high or too low this results in hedging or absorption, respectively. When $\beta = 0.25$, these balance out and the SAE learns a near-perfect representation.

Next, we train LLM balance matryoshka SAEs with different balance ratios on Gemma-2-2b layer 12. The SAEs are BatchTopK with k=40, trained on 500M tokens. The SAEs have 5 matryoshka levels of sizes 128, 512, 2048, 8192, and 32768 (so the full SAE has width 32768). We set the outermost $\beta_5 = 1$, and set a constant multiplier between each subsequent $\beta_m$, so multiplier $= \beta_m / \beta_{m+1}$. If the multiplier is 0.5, then $\beta_m = 0.5^{(5-m)}$; for example, the five coefficients are then 0.0625, 0.125, 0.25, 0.5, and 1.

[Figure 10: performance of balance matryoshka SAEs vs. multiplier, where the shaded area is 1 std; multiplier=0 is equivalent to a standard SAE, and multiplier=1 is a standard matryoshka SAE. Panels: (a) SAEBench absorption rate (lower is better); (b) SAEBench TPP top-2 metric (higher is better); (c) k=1 sparse probing F1 score on Parts of Speech (POS); (d) SAEBench SCR top-2 metric (higher is better); (e) SAEBench k=1 sparse probing accuracy; (f) feature splitting, mean number of split features per SAE (lower is better).]

We train 10 seeds for each multiplier and show results in Figure 10 for absorption rate, targeted probe perturbation (TPP), Spurious Concept Removal (SCR), k-sparse probing, and feature-splitting metrics from SAEBench (Karvonen et al., 2025), and k=1 sparse probing results (Gurnee et al., 2023) for a Parts of Speech (POS) dataset we created using Treebank POS-tagged sentences (Marcus et al., 1993). We add a POS dataset for probing since POS are very general concepts, and should be learned in the earliest levels of a matryoshka SAE.

For TPP, feature splitting, and sparse probing, using a compound multiplier of around 0.75 achieves better results than either a standard matryoshka SAE or a standard (non-matryoshka) SAE, providing evidence that balancing matryoshka losses can improve performance. Using a multiplier of 0.75 still scores well on the absorption metric as well. Strangely, SCR appears to perform better at higher multipliers. However, SCR is also the noisiest metric, and the noise is higher at high multipliers, so it could be that hedging increases the noise of the SCR metric but does not fully break it. We provide further results and more details in Appendix A.9.

While balancing each $\beta_m$ can improve performance on most metrics, we do not expect this to perfectly solve absorption and hedging. We show in Appendix A.8 that balancing all hedging and absorption with a single $\beta_m$ is not always possible. We expect it may be possible to further improve performance by learning different balancing coefficients per latent, but this is left to future work.
6 RELATED WORK

Other work has highlighted theoretical problems with SAEs. Till (2024) investigated a problem where SAEs may increase sparsity by inventing features. For instance, an SAE may fabricate a "red triangle" feature in addition to "red" and "triangle" features. Templeton et al. (2024) discuss the problem of feature splitting, where an SAE may not learn features at a desired level of specificity. Engels et al. (2024) investigate SAE errors and find that SAE error may be pathological and non-linear. Engels et al. (2025) further show that there are features that cannot be expressed as a simple linear direction, and thus SAEs may struggle to represent these features. Wu et al. (2025) and Kantamneni et al. (2025) both investigate the empirical performance of SAEs and find that SAEs underperform baselines.

7 DISCUSSION

SAEs remain a promising technique for decomposing the residual stream of LLMs in an unsupervised manner. However, given recent work showing that SAEs underperform relative to baselines (Wu et al., 2025; Kantamneni et al., 2025), it is imperative that we understand the reasons for this underperformance so they can be addressed. In this work, we introduced the problem of feature hedging in SAEs, showing it both theoretically in toy models, and empirically in SAEs trained on real LLMs. We suspect that hedging, along with absorption, may be one of the core theoretical problems leading to poor SAE performance.

Using our understanding of hedging, we introduced the balance matryoshka SAE architecture, allowing hedging and absorption to be balanced against each other, improving interpretability. We view balance matryoshka SAEs as a starting point, and expect this architecture can be improved by optimizing the balance coefficients. There may not be a single coefficient that perfectly balances hedging and absorption for all features, so we expect there may be further gains from learning different balancing coefficients per latent in the SAE. We leave these improvements to future work.

ACKNOWLEDGMENTS

This work was carried out as part of the ML Alignment & Theory Scholars (MATS) program. DC was supported thanks to EPSRC EP/S021566/1. We are grateful to Andrew Mack, Juan Gil, Satvik Golechha, Josh Engels and Henning Bartsch for discussions and input during the project.

REFERENCES

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. SAELens. https://github.com/jbloomAus/SAELens, 2024.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024.

Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders. arXiv preprint arXiv:2503.17547, 2025.

David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, and Joseph Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. arXiv preprint arXiv:2409.14507, 2024.

Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Joshua Engels, Logan Riggs, and Max Tegmark. Decomposing the dark matter of sparse autoencoders. arXiv preprint arXiv:2410.14670, 2024.

Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=d63a4AM4hb.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=JYs1R9IMJr.
Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? A case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025.

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503.09532.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2, August 2024.

Mitch Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv.

Chris Olah, Adly Templeton, Trenton Bricken, and Adam Jermyn. April update, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html.

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, et al. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118.
Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, Livio Baldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin, Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, Reza Ardeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, Sara Mc Carthy, Sarah Cogan, Sarah Perrin, Sébastien M. R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D. Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, and Alek Andreev. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/, May 2024. Accessed on May 21, 2024.

Demian Till. Do sparse autoencoders find true features? LessWrong, 2024. URL https://w.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features.

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025.

A APPENDIX

A.1 HEDGING IS CAUSED BY RECONSTRUCTION LOSS: CURVES FOR SINGLE-LATENT SAES

What causes hedging?
We hypothesize that it is a combination of two factors: the SAE does not have enough latents to represent every feature, and MSE loss incentivizes reconstructing multiple features imperfectly rather than a single feature perfectly. To test this, we analyze the loss curves for a single-latent tied SAE trained on two features f_1 and f_2 with a parent-child relationship, so f_2 =⇒ f_1. The ideal SAE latent must be some combination of these two features. As there are no other interfering features to break the symmetry between encoder and decoder, the SAE can be expressed by a single unit-norm latent. We set the SAE latent l to an interpolation of these two features, l = α f_2 + (1 − α) f_1 (rescaled to unit norm), and calculate the expected SAE loss, consisting of MSE + L1 loss, for 0 ≤ α ≤ 1.

First, we set P(a = f_1) = 0.3 and P(a = f_1 + f_2) = 0.1. We characterize the probabilities this way since there are only two firing possibilities to consider: either f_1 fires on its own, or f_1 and f_2 fire together. We use L1 coefficients of 0 and 0.1 to explore the effect of the sparsity penalty on the loss. We also consider the case where both features fire together more often than f_1 fires alone, with P(a = f_1) = 0.1 and P(a = f_1 + f_2) = 0.3.

Loss curves are shown in Figure 11. In these plots, α = 0 corresponds to the SAE latent being exactly f_1, α = 1 corresponds to the latent being f_2, and α = 0.5 corresponds to f_1 + f_2. We clearly see that the SAE loss has a single minimum between f_1 and f_1 + f_2, showing that the MSE minimum is attained with feature hedging.

Figure 11: Loss curves for an SAE with a single latent l and 2 hierarchical features, where f_2 =⇒ f_1, plotting expected loss against α for L1 coefficients 0.0 and 0.1. The minimum loss is indicated with a dot on each plot. (a) Skew parent (p(f_1 + f_2) < p(f_1)): the parent feature f_1 fires more on its own than with the child feature f_2; loss is minimized between f_1 and f_2 rather than at f_1 (α = 0). (b) Skew child (p(f_1 + f_2) > p(f_1)): the parent feature f_1 fires less on its own than with the child feature f_2; loss is again incorrectly minimized between f_1 and f_2. In both cases, the sparsity penalty does not change the location of the minimum.
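Because the expected loss is easy to evaluate over a grid of α values, this claim can be checked numerically. The snippet below is a minimal sketch of that computation, assuming a tied, bias-free single-latent SAE with ReLU encoding and orthogonal unit-norm features; the names and dimensions are illustrative rather than taken from our codebase.

```python
import numpy as np

d = 50                       # ambient dimension (arbitrary; features orthogonal)
f1, f2 = np.zeros(d), np.zeros(d)
f1[0], f2[1] = 1.0, 1.0      # two orthogonal unit-norm "true features"

def expected_loss(alpha, l1_coeff, p_parent=0.3, p_both=0.1):
    """Expected MSE + L1 loss of a single-latent tied SAE whose latent
    interpolates between f1 (alpha = 0) and f2 (alpha = 1)."""
    l = alpha * f2 + (1 - alpha) * f1
    l /= np.linalg.norm(l)                   # rescale to unit norm
    total = 0.0
    # Only two firing events matter: f1 alone, or f1 and f2 together.
    for p, a in [(p_parent, f1), (p_both, f1 + f2)]:
        z = max(l @ a, 0.0)                  # ReLU encoding, tied weights, no bias
        recon = z * l
        total += p * (np.sum((a - recon) ** 2) + l1_coeff * z)
    return total

alphas = np.linspace(0.0, 1.0, 201)
for lam in (0.0, 0.1):
    losses = [expected_loss(a, lam) for a in alphas]
    print(f"L1 coeff {lam}: argmin alpha = {alphas[np.argmin(losses)]:.3f}")
```

With the default probabilities (the skew-parent case), the minimizing α should land strictly between the pure-parent solution (α = 0) and the merged solution (α = 0.5) for both L1 coefficients, mirroring Figure 11a.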
A.2 FULL-WIDTH SAE TOY MODEL RESULTS

We extend the discussion of single-latent SAEs to explore what happens when the SAE has two latents, the same number of latents as true features. We use the same toy model as in Section 3.5.3 for the positive-correlation case, and the same toy model as in Section 3.5.4 for the anti-correlated case. For the positive-correlation case we use an L1 penalty of 1e-3, the same L1 penalty that caused hedging in single-latent SAEs.

We plot the results in Figure 12. In both cases, the full-width SAEs perfectly recover the true features despite the correlation, and despite the low L1 penalty. This shows that hedging is caused by the SAE being too narrow: increasing the width of the SAE solves the problem.

Figure 12: Full-width SAE results on correlated and anti-correlated toy models, shown as encoder and decoder cosine similarities with the true features. (a) Correlated features: encoder diagonal (0.41, 0.55), decoder diagonal (1.00, 1.00), off-diagonals 0. (b) Anti-correlated features: encoder diagonal (0.36, 0.50), decoder diagonal (1.00, 1.00), off-diagonals 0. In both cases the SAE perfectly learns the underlying features despite the correlation.

A.3 TRAINING DETAILS FOR LLM SAES

All SAEs are trained on the Pile uncopyrighted (Gao et al., 2020), using a batch size of 4096 activations and a context length of 1024 tokens. For L1 SAEs, we use a linear L1 warm-up of 10k steps. SAEs are trained on a single 80GB Nvidia H100 GPU. Model weights are loaded in fp32 precision, but autocast to bfloat16 during training. We initialize the SAE so that the encoder and decoder are identical, with each latent having norm 0.1, following the procedure described in Olah et al. (2024). All L1 SAEs are trained with learning rate 7e-5, and BatchTopK SAEs are trained with learning rate 3e-4. SAEs are trained using SAELens (Bloom et al., 2024). Unless otherwise specified, BatchTopK SAEs use k=25.

For SAEs trained on Gemma-2-2b, we conduct most experiments at layer 12 (roughly the middle of the model), and L1 SAEs trained on Gemma-2-2b use an L1 coefficient of 10. This coefficient does not result in dead extension latents, and yields an L0 around 50. For SAEs trained on Llama-3.2-1b, we conduct most experiments at layer 7 (roughly the middle of the model), and L1 SAEs trained on Llama-3.2-1b use an L1 coefficient of 0.5. This coefficient likewise does not result in dead extension latents, and yields an L0 around 50.

A.4 CHOICE OF HEDGING HYPERPARAMETER N

Our hedging degree metric requires adding N new latents onto an existing SAE to extend it, naturally raising the question of what a reasonable choice of N is. We plot hedging degree vs N for Gemma-2-2b layer 12, given an initial BatchTopK SAE of width 8192, in Figure 13. We find that hedging degree increases until about N=250. We choose N=64 for our experiments, as 64 is still a small number of latents relative to the size of the residual stream (2304 for Gemma-2-2b), while still being large enough to reduce noise from any specific latent that gets added. Furthermore, as we see in the plot, the hedging degree at N=64 is about in the middle of the curve, further validating that this is a reasonable choice.

Figure 13: Hedging degree vs number of latents added (N), for a width-8192 BatchTopK SAE with L0=25.

A.4.1 EXTENDING LLM SAES

We train two versions of extension SAEs: one for L1 loss SAEs and one for BatchTopK SAEs. In both cases, we begin with a pretrained SAE and add N latents randomly initialized with norm 0.1, with identical encoder and decoder directions, following Olah et al. (2024).
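As a concrete illustration, the sketch below extends an SAE's weight matrices with N freshly initialized latents in this way. It assumes the encoder and decoder are stored as plain weight matrices with the layouts noted in the docstring; the function name and the omission of biases and optimizer state are our own simplifications, not SAELens code.

```python
import torch

def extend_sae(W_enc: torch.Tensor, W_dec: torch.Tensor,
               n_new: int, init_norm: float = 0.1):
    """Extend a trained SAE with n_new latents. Each new latent gets the
    same random direction in encoder and decoder, scaled to norm 0.1,
    following the initialization of Olah et al. (2024).
    W_enc: (d_model, n_latents); W_dec: (n_latents, d_model)."""
    d_model = W_enc.shape[0]
    new_dirs = torch.randn(n_new, d_model)
    new_dirs = init_norm * new_dirs / new_dirs.norm(dim=1, keepdim=True)
    W_enc_ext = torch.cat([W_enc, new_dirs.T], dim=1)  # append as columns
    W_dec_ext = torch.cat([W_dec, new_dirs], dim=0)    # append as rows
    return W_enc_ext, W_dec_ext
```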
For the BatchTopK SAEs, we then simply train the SAE from this point as normal, as the TopK auxiliary loss (Gao et al., 2024) naturally ensures that the newly added latents do not simply die off. For L1 SAEs with a high L1 penalty, dead latents are a more serious problem: we find that most of the newly added extension latents are rapidly killed off if we simply train as normal. To combat this, we re-warm-up the L1 penalty. However, we cap the minimum L1 penalty at λ_min, so for the portion of the warm-up where the L1 penalty would normally be below λ_min, the L1 penalty is left at λ_min instead. This capping helps ensure the existing SAE latents are not unduly disturbed by the change in the L1 penalty. If the final L1 penalty is λ_min or below, we do not perform this warm-up at all, as the L1 penalty is then not strong enough to immediately kill off the newly added latents. For Gemma-2-2b SAEs, we set λ_min = 10.0; for Llama-3.2-1b SAEs, we set λ_min = 0.5. This warm-up procedure is only used for the high-L1 variants in Figure 8c; for all other plots the L1 coefficient used is less than λ_min, so no warm-up is needed.
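The capped re-warm-up is simple enough to state as a schedule function. The following sketch captures the rule described above; the argument names are ours.

```python
def l1_schedule(step: int, warmup_steps: int,
                lambda_final: float, lambda_min: float) -> float:
    """L1 coefficient when fine-tuning an extended L1 SAE: re-warm the
    penalty linearly, but never let it drop below lambda_min, so the
    pretrained latents are not disturbed by a weakened penalty."""
    if lambda_final <= lambda_min:
        # Penalty is too weak to kill the new latents; skip the warm-up.
        return lambda_final
    warmed = lambda_final * min(1.0, step / warmup_steps)
    return max(warmed, lambda_min)
```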
A.5 CASE STUDY: ADDING A NEW LATENT TO AN EXISTING SAE

We next explore how hedging affects a real SAE. We trained an L1 SAE on Gemma-2-2b layer 12 with width 8192 for 250M tokens on the Pile (Gao et al., 2020), then added a new latent to the SAE and continued training both the original SAE and the extended SAE for another 250M tokens.

We examine inputs that cause the newly added latent to fire to get a sense of what it represents. We reproduce a portion of the top activating examples for the new latent in Figure 14a. This latent appears to fire on CSS scripts included in HTML. A larger set of inputs is shown in Appendix A.6.

Figure 14: Sample top activating examples for case study latents. (a) Newly added case-study latent, latent 8192; the latent appears to track CSS scripts in HTML. (b) Latent 3094, which had the largest negative δ-projection after adding latent 8192; this latent tracks "rel" in HTML, used for CSS in HTML.

Next, we look at the magnitude of change in existing latents projected onto the new latent. Based on our understanding of hedging, if a latent loses a large component of the newly added latent, this suggests a likely hierarchical relationship with the new latent. The latent which lost the largest component of the new latent is latent 3094, which appears to track the "rel" HTML attribute, used mainly for linking CSS scripts. We show top activating examples for latent 3094 in Figure 14b. Since CSS scripts are just one type of asset that can be linked using "rel", this appears to be exactly the sort of hierarchical relationship we expect to be heavily impacted by hedging.

A.6 ADDITIONAL CASE STUDY DASHBOARDS

Figure 15: Dashboard for the newly added case study latent representing CSS scripts in HTML.

Figure 16: Dashboard for latent 3094, representing the "rel" HTML attribute used for CSS scripts. This latent has the highest negative δ-projection on the newly added case study latent.

A.7 TOY BALANCE MATRYOSHKA SAES

To explore the effect of balancing matryoshka losses in a simple toy setting, we create a toy model with 4 true features, all mutually orthogonal and with unit norm in a 50-dimensional space. We set up a hierarchical relationship between these features: feature 1 fires with probability 0.25, and features 2, 3, and 4 each fire with probability 0.15, and only if feature 1 fires. Thus, feature 1 is the parent feature in the hierarchy and features 2, 3, and 4 are all child features.

We train a matryoshka SAE with 4 latents on 100,000,000 samples from this toy model. The matryoshka SAE has a single inner level consisting of 1 latent, matching the number of parent latents in our hierarchy. Since our goal with this toy is just to build intuition, we initialize the SAE to the correct solution and let training pull it away from that solution. This also ensures that each variation of our SAE with different balancing coefficients learns the same latents in the same order, so visual comparison is easy.

A.8 TOY UNBALANCEABLE MATRYOSHKA SAES

The situation above, where each child feature fires with the same probability, is unrealistic: we would expect child features to fire with different probabilities. Can we still balance the SAE perfectly in this situation? We adjust the toy model from above so that the 3 child features fire with probabilities 0.02, 0.2, and 0.5 for features f_2, f_3, and f_4, respectively.

We then try to manually balance this SAE, finding that β = 0.17 gives roughly the best balance. We plot the resulting encoder/decoder cosine similarities in Figure 17. We see it is no longer possible to choose a single β that perfectly balances all 3 children: there is slight hedging of feature 4 in latent 1, and slight absorption of feature 2 in latent 1. Still, this looks decent compared to the full-hedging or full-absorption scenario, so while balancing is not a perfect solution, we still expect it to be an improvement. We believe it should be possible to find better ways of balancing the contribution of each outer latent on each inner latent, but we leave this to future work.

Figure 17: SAE encoder and decoder vs true feature cosine similarities for a balance matryoshka SAE (β = 0.17) where the child features fire with different probabilities. It is no longer possible to perfectly balance all 3 child features with the same β, but the SAE still does reasonably well.
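To make these toy setups concrete, the sketch below samples activations from the hierarchical toy model of A.7; child probabilities of (0.15, 0.15, 0.15) give the balanced setup, while (0.02, 0.2, 0.5) gives the unbalanceable variant above. We take the feature directions to be standard basis vectors for simplicity; the experiments only require mutual orthogonality.

```python
import numpy as np

def sample_hierarchical(n, d=50, p_parent=0.25,
                        p_child=(0.15, 0.15, 0.15), seed=0):
    """Sample n activations from the hierarchical toy model: 4 mutually
    orthogonal unit-norm features in a d-dim space, where feature 1 (the
    parent) fires with probability p_parent and each child feature fires
    only when the parent fires."""
    rng = np.random.default_rng(seed)
    F = np.eye(d)[:4]                          # 4 orthogonal unit-norm features
    parent = rng.random(n) < p_parent          # parent firing mask
    acts = np.outer(parent, F[0]).astype(float)
    for i, p in enumerate(p_child, start=1):
        child = parent & (rng.random(n) < p)   # child fires => parent fires
        acts += np.outer(child, F[i])
    return acts
```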
A.9 SAE EVALUATION

A.9.1 SAEBENCH EVALS

We evaluate our SAEs mainly using SAEBench (Karvonen et al., 2025). All evals are performed using default settings. We run all evaluations on an Nvidia H100 GPU with 80GB of GPU memory. We evaluate on the following SAEBench tasks:

K-sparse probing. K-sparse probing builds on the work of Gurnee et al. (2023), where the goal is to create a linear probe from model activations using only k neurons as input to the probe. This was adapted for use as an SAE evaluation technique by Gao et al. (2024). We focus mainly on k = 1 sparse probing, which means finding the single SAE latent that best serves as a classifier for a given concept, and evaluating the accuracy of that latent. SAEBench includes supervised classification datasets on which k-sparse probing is evaluated.

Feature absorption. The feature absorption metric in SAEBench is a variation on the metric defined in the original feature absorption work (Chanin et al., 2024). This metric uses a first-letter spelling task and first identifies the "main" latents for that task using k-sparse probing (Gurnee et al., 2023). Then, the metric identifies cases where a linear probe correctly performs the first-letter classification task, but the "main" SAE latents fail to perform it. The metric also looks for other latents that project onto the linear probe direction and fire in place of the "main" latents. Lower absorption is better.

The SAEBench absorption metric also includes "absorption fraction", "feature splitting", and "first-letter k=1 sparse probing" results, which we include in our extended results. Absorption fraction detects partial absorption, where a parent latent can still fire, but more weakly, when an absorbing child latent fires as well. Feature splitting detects the amount of interpretable feature splitting occurring in the SAE. Interpretable feature splitting is still considered negative, as we would prefer that features do not split at all and that the SAE can still represent general, high-level concepts. The k-sparse probing result for the first-letter spelling task is calculated as part of the absorption metric, but is an interesting sparse-probing result in its own right.

Spurious concept removal (SCR). SCR builds on the SHIFT method from Marks et al. (2025) to detect how well an SAE isolates concepts. The metric uses datasets where two properties are perfectly entangled, for instance "profession" and "gender", and trains a biased probe on these concepts. The SCR metric then measures how well ablating k SAE latents can de-bias the probe. If the SAE latents learn disentangled concepts, it should take only a few SAE latents to perfectly de-bias the probe. A high SCR score means the SAE latents represent disentangled concepts.

Targeted probe perturbation (TPP). The TPP metric extends SCR to multi-class labels. Binary probes are trained for each class, and TPP measures how well ablating k SAE latents can degrade the performance of just one of the probes without degrading the performance of the others. A high TPP score means that concepts are represented by distinct sets of SAE latents, rather than latents being entangled with many concepts.

A.9.2 PARTS OF SPEECH (POS) PROBING DATASET

We are also interested in general, high-frequency concepts that we expect to be learned in the inner-most levels of a matryoshka SAE. These concepts should be the most affected by both absorption and hedging, as they can be considered parent concepts to most other concepts. Parts of speech (POS) are a great test case for such general concepts, and are not covered by the SAEBench sparse probing task. As such, we create our own custom POS dataset using the Penn Treebank tagged sentences (Marcus et al., 1993). We simplify the Treebank parts of speech to the following core set: "TO", "IN", "DT", "C", "NNS", "PRP", "POS". We pass these tagged sentences through an LLM, then collect activations for the final token position of each word at a given layer in the LLM. We create a binary classification dataset for each of these parts of speech, and perform k-sparse probing (Gurnee et al., 2023) on SAE latents to find the top k latents that act as a classifier for each part of speech.
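For reference, the sketch below shows a minimal form of k = 1 sparse probing: scan all latents and keep the one whose thresholded activation best classifies the concept. The midpoint thresholding rule is our simplification; SAEBench's implementation differs in detail.

```python
import numpy as np

def k1_sparse_probe(latent_acts: np.ndarray, labels: np.ndarray):
    """Pick the single SAE latent whose thresholded activation best
    classifies a binary concept. latent_acts: (n_samples, n_latents);
    labels: (n_samples,) of 0/1, with both classes assumed present.
    Returns (best_latent_index, best_f1)."""
    best_j, best_f1 = -1, -1.0
    for j in range(latent_acts.shape[1]):
        x = latent_acts[:, j]
        # Threshold at the midpoint between the per-class mean activations.
        thr = 0.5 * (x[labels == 1].mean() + x[labels == 0].mean())
        pred = x > thr
        tp = np.sum(pred & (labels == 1))
        prec = tp / max(pred.sum(), 1)
        rec = tp / max((labels == 1).sum(), 1)
        f1 = 2 * prec * rec / max(prec + rec, 1e-9)
        if f1 > best_f1:
            best_j, best_f1 = j, f1
    return best_j, best_f1
```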
A.9.3 BALANCE MATRYOSHKA SAE FULL RESULTS

Figure 18: Performance of balance matryoshka SAEs vs multiplier for extended metrics. Panels: (a) SAEBench absorption fraction (lower is better); (b) SAEBench TPP top-10 metric (higher is better); (c) explained variance; (d) SAEBench SCR top-10 metric (higher is better); (e) SAEBench k=5 sparse probing accuracy; (f) k=1 first-letter sparse probing F1 score (SAEBench absorption). The shaded area in the plots refers to 1 std. Multiplier=0 is equivalent to a standard non-matryoshka SAE, and multiplier=1 is equivalent to a standard matryoshka SAE.

A.10 LIMITATIONS

We only test hedging in SAEs up to 65k latents, on LLMs with 2b parameters, due to compute constraints. Our method for detecting hedging requires fine-tuning SAEs, which is expensive.