
Paper deep dive

Empirical Evaluation of Progressive Coding for Sparse Autoencoders

Hans Peter, Anders Søgaard

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 81

Models: Gemma-2-2b

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/11/2026, 12:37:37 AM

Summary

The paper evaluates progressive coding for Sparse Autoencoders (SAEs) to address their computational cost. It introduces the 'Dictionary Power Law Hypothesis,' which posits that feature importance in SAEs follows a power law, allowing for hierarchical organization. The authors compare two methods for creating progressive SAEs: pruning vanilla SAEs based on feature importance and training nested 'Matryoshka' SAEs. Results indicate that while Matryoshka SAEs offer better reconstruction and representational similarity, pruned vanilla SAEs provide superior interpretability.

Entities (5)

Matryoshka SAEs · model-architecture · 100%
Sparse Autoencoders · model-architecture · 100%
TopK · activation-function · 98%
Dictionary Power Law Hypothesis · scientific-hypothesis · 95%
Progressive Coding · technique · 95%

Relation Signals (3)

Dictionary Power Law Hypothesis explains hierarchical organization of feature importance

confidence 95% · This consistency across scales and metrics provides strong empirical evidence for the Dictionary Power Law Hypothesis, revealing a robust hierarchical organization of feature importance in SAEs.

Matryoshka SAEs exhibits higher representational similarity

confidence 90% · We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity.

Pruned vanilla SAEs exhibits higher interpretability

confidence 90% · Pruned vanilla SAEs are more interpretable, however.

Cypher Suggestions (2)

Find all model architectures mentioned in the paper · confidence 90% · unvalidated

MATCH (e:Entity {entity_type: 'Model Architecture'}) RETURN e.name

Retrieve relations between models and their performance metrics · confidence 85% · unvalidated

MATCH (s)-[r:EXHIBITS]->(o:PerformanceMetric) RETURN s.name, r.relation, o.name

Abstract

Abstract: Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive (Lieberum et al., 2024), especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. We compare progressive coding based on subset pruning of SAEs to jointly training nested SAEs, or so-called Matryoshka SAEs (Bussmann et al., 2024; Nabeshima, 2024), on a language modeling task. We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs are more interpretable, however. We discuss the origins and implications of this trade-off.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%) · safety-evaluation (suggested, 80%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

80,677 characters extracted from source content.


Empirical Evaluation of Progressive Coding for Sparse Autoencoders

Hans Peter, Anders Søgaard

Abstract: Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive (Lieberum et al., 2024), especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. We compare progressive coding based on subset pruning of SAEs to jointly training nested SAEs, or so-called Matryoshka SAEs (Bussmann et al., 2024; Nabeshima, 2024), on a language modeling task. We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs are more interpretable, however. We discuss the origins and implications of this trade-off.

Keywords: sparse autoencoders, dictionary learning, evaluation, interpretability

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks (Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; Grattafiori et al., 2024), but understanding their internal representations remains a significant challenge. Sparse autoencoders (SAEs) (Bricken et al., 2023; Yun et al., 2023; Gao et al., 2024; Templeton et al., 2024) have enabled extraction of interpretable features from these models at scale, already offering some insights into how LLMs process and represent information. SAEs are computationally expensive to train and run inference on, often prompting developers to train SAEs of varying sizes to balance performance and computational constraints. This is the question we are interested in: how can we efficiently obtain high-fidelity, interpretable SAEs of different sizes for LLMs?

Our goal is to induce a progressive (Skodras et al., 2001), sparse coding that provides us with flexible, dynamic, and more interpretable reconstructions of our representations (Gao et al., 2024). In other words, we want to learn a latent space such that for any granularity $G \in \mathbb{N}$, $G \le N$, the first $G$ dimensions yield good reconstruction performance. We call an SAE with this property a progressive coder, as it allows for graceful degradation of reconstruction quality as we reduce the size of the latent representation and thus the effective number of features used. Throughout this paper, we refer to $G$ as the granularity. By this definition, as the sparse code gets shorter, the computation required for non-sparse matrix multiplication (Gao et al., 2024) is reduced proportionately. For example, if $G = \frac{N}{2}$, the total computation is halved. So is the computation involved in decoding, but this is less important, since encoding is approximately six times as expensive in the limit of sparsity (Gao et al., 2024).

Figure 1: Illustrating progressive coding; the dark part highlights the resources not used by the model at inference time.

Contributions. We explore two ways of approaching the challenge of inducing progressive SAE coders: (i) Matryoshka SAEs, explored independently and concurrently in (Bussmann et al., 2024; Nabeshima, 2024); (ii) pruning vanilla SAEs based on the observed dictionary power law, leveraging their conditional independence.
Our paper makes the following contributions: (i) We introduce the power law hypothesis for SAE dictionaries. (ii) We introduce a novel baseline method for augmenting pretrained SAEs to become progressive coders. We introduce Matryoshka SAEs, also explored independently and concurrently in (Bussmann et al., 2024; Nabeshima, 2024). (iii) We compare the two approaches to inducing progressive SAEs across five evaluation protocols, including some not previously discussed in the SAE literature.

2 Background

SAEs. The superposition hypothesis (Elhage et al., 2022) posits that neural networks "want to represent more features than they have neurons" (Bricken et al., 2023). This phenomenon arises from the fundamental constraint that a vector space can support only as many orthogonal vectors as its dimensionality. To circumvent this limitation, networks learn an overcomplete basis of approximately orthogonal vectors, effectively simulating higher dimensional representations within lower dimensional spaces. Such an approximation is theoretically supported by the Johnson-Lindenstrauss Lemma (Johnson & Lindenstrauss, 1984), which states that for $0 < \epsilon < 1$, any set of $n$ points in $\mathbb{R}^d$ can be embedded into $\mathbb{R}^{O(\epsilon^{-2}\log n)}$ while approximately preserving all pairwise distances between the points up to a factor of $(1+\epsilon)$.

In dictionary learning, the goal is to find an overcomplete set of basis vectors $D \in \mathbb{R}^{D \times N}$, with $N \gg D$, and a set of representations $R = [r_1, \dots, r_N]$, where $r_i \in \mathbb{R}^n$, that jointly minimize reconstruction error and sparsity, weighted by the sparsity coefficient $\lambda \in \mathbb{R}$:

$\arg\min_{D,R} \left( \frac{1}{K} \sum_{i=1}^{K} |x_i - D \cdot r_i|_2^2 + \lambda S(R) \right)$   (1)

where $S(R)$ is a sparsity measure, commonly implemented as either the L0 pseudo-norm $|r|_0$ or the L1 norm $|r|_1$. However, it remains an open question how to best measure and optimize sparsity (Hurley & Rickard, 2009). In the interpretability literature, the atoms are most commonly referred to as features, and we use both terms interchangeably.

Yun et al. (2023) were the first to propose dictionary learning for language model interpretability. Bricken et al. (2023) and Cunningham et al. (2023) used SAEs to disentangle features in superposition. SAEs have weights $W_{dec} \in \mathbb{R}^{N \times D}$ and $W_{enc} \in \mathbb{R}^{D \times N}$ and biases $B_{center} \in \mathbb{R}^{D}$ and $B_{enc} \in \mathbb{R}^{N}$.
They use an element-wise activation function $\sigma$ such that:

$z = \sigma((X - B_{center}) \cdot W_{enc} + B_{enc})$   (2)

$\hat{X} = (z \cdot W_{dec}) + B_{center}$   (3)

Different activation functions have been suggested, but the TopK (Gao et al., 2024) and JumpReLU (Rajamanoharan et al., 2024) activation functions are the most prominent. Our paper exclusively uses the TopK activation function. SAEs are trained by minimizing this loss:

$\mathcal{L} = \underbrace{\frac{1}{|D|} \sum_{X \in D} |X - \hat{X}|_2^2}_{\text{reconstruction loss}} + \underbrace{\lambda S(z)}_{\text{sparsity}}$   (4)

Matryoshka Representation Learning (MRL) trains representations in a coarse-to-fine manner, where smaller representations are contained within larger ones. MRL has been applied to NLP (Devvrit et al., 2024), in multimodal learning (Cai et al., 2024), and in diffusion models (Gu et al., 2024). MRL considers a set $\mathcal{M} \subset \mathbb{N}$ of representation sizes that are jointly learned. Given an input $x$ from domain $\mathcal{X}$, MRL learns a representation vector $z \in \mathbb{R}^{\max(\mathcal{M})}$ such that $z_{1:m_1} \subseteq z_{1:m_2} \subseteq \dots \subseteq z_{1:m_n}$, where each larger representation contains all smaller ones. The representation $z$ is obtained through a neural network $F(\cdot;\theta_F): \mathcal{X} \to \mathbb{R}^{\max(\mathcal{M})}$ parameterized by $\theta_F$, such that $z := F(x;\theta_F)$. $\mathcal{M}$ typically contains $|\mathcal{M}| \le \lfloor \log(\max(\mathcal{M})) \rfloor$ elements (Kusupati et al., 2024). For supervised learning tasks with dataset $D = \{(x_1,y_1),\dots,(x_N,y_N)\}$, where $x_i \in \mathcal{X}$ and $y_i \in [L]$, MRL minimizes a linear weighted combination of the loss of the nested models over the dataset $D$:

$\frac{1}{N} \sum_{i \in [N]} \sum_{m \in \mathcal{M}} c_m \cdot \mathcal{L}\big(W^{(m)} \cdot F(x_i;\theta_F)_{1:m},\, y_i\big)$   (5)

where $W^{(m)} \in \mathbb{R}^{L \times m}$ represents separate linear models for each nested dimension $m$, and $c_m \ge 0$ denotes the weighted importance of each scale. These weights may be hierarchically structured depending on the task.
$F(x_i;\theta_F)_{1:m}$ needs to be computed only once for each $x_i$ by computing $F(x_i;\theta_F)_{1:\max(\mathcal{M})}$; thus this method introduces only the additional overhead of $\sum_{m \in \mathcal{M}} W^{(m)}$ to the forward pass.

3 The Dictionary Power Law Hypothesis

Li et al. (2024b) found that the eigenvalues of the covariance matrix of the dictionary $W_{dec} \in \mathbb{R}^{N \times D}$ follow a power law. This suggests a hierarchical organization of information, where a relatively small number of features capture most of the variance in the data. We examine the mean squared activation value and the activation frequency, as well as replicating their experiment, on $10^5$ unseen tokens for three sparse TopK autoencoders of different sizes.

Figure 2: Power law fits for the eigenvalues of the covariance matrix, $E[\mathrm{activation}^2]$, and the activation frequency ($E[\mathbb{1}_{|\mathrm{activation}|>0}]$). We fit a linear regression model to the logarithmically transformed values and display the coefficient and fit for each.

We analyze three models of various sizes (65k, 32k, 16k) with consistent sparsity ratios (256-65k, 128-32k, 64-16k). The eigenvalues of the decoder matrix's covariance matrix exhibit clear power law decay, with exponents ($\alpha$) ranging from -0.54 to -0.72 and $R^2$ values between 0.916 and 0.922. This indicates a hierarchical structure in the feature space where a small number of directions capture most of the variance. While the squared activation values ($E[\mathrm{activation}^2]$) demonstrate approximate power law behavior in their middle range, with exponents from -1.03 to -1.27 ($R^2$ values between 0.741 and 0.857), there is a notable deviation in the tail where values decrease more steeply than a power law would predict. Similarly, the frequency of feature activation ($E[\mathbb{1}_{|\mathrm{activation}|>0}]$) follows an approximate power law in its central region, with exponents between -1.00 and -1.20 ($R^2$ values between 0.868 and 0.887), but also exhibits a sharp decline in the tail. This indicates that while there exists a hierarchical structure where some features activate much more frequently than others, the least-used features activate even more rarely than a pure power law distribution would suggest. Notably, the power law relationships persist across different model sizes, with larger models (TopK-SAE-256) exhibiting slightly less steep decay (smaller absolute $\alpha$ values) compared to smaller models. This consistency across scales and metrics provides strong empirical evidence for the Dictionary Power Law Hypothesis, revealing a robust hierarchical organization of feature importance in SAEs.

4 SAE Dictionary Permutation and Selection

The dictionary power law suggests that a small subset of features capture most of the important information. We develop a method to identify and prioritize these features by exploiting the permutation invariance of SAEs. An important property of SAE features is their conditional independence given the input.
Given an input $X \in \mathbb{R}^{D}$ and an SAE with weights $W_{dec} \in \mathbb{R}^{N \times D}$ and $W_{enc} \in \mathbb{R}^{D \times N}$ and biases $B_{center} \in \mathbb{R}^{D}$ and $B_{enc} \in \mathbb{R}^{N}$, let $P_\pi \in \mathbb{N}^{N \times N}$ be a permutation matrix corresponding to $\pi$. Then, for any permutation $\pi$ of the latent dimensions, the following holds for SAEs: $z = \sigma((X - B_{center}) \cdot W_{enc} + B_{enc})$, $z' = \sigma((X - B_{center}) \cdot (W_{enc} P_\pi) + P_\pi B_{enc})$, $\hat{X} = z W_{dec} + B_{center}$, and $\hat{X}' = z' (P_\pi^{-1} W_{dec}) + B_{center}$. This produces identical reconstructions $\hat{X}' = \hat{X}$ for any permutation $\pi$. Each feature activation $z_j$ depends solely on the dot product between the $j$-th row of $W_{enc}$ and the centered input $(X - B_{center})$, plus its bias term $B_{enc_j}$. Independence enables arbitrary reordering of features without affecting overall reconstruction quality.

Figure 3: An illustration of dictionary permutation with function $\pi$. Both models will produce the same output given the same input.

Permutation invariance now enables the conversion of an existing SAE into a progressive coder by sorting features by descending importance and selecting the first $G$ features at test time. Our objective is to find the permutation $\pi$ that facilitates high-quality reconstruction using only the first $G \in \mathbb{N}$, $G \le N$, features of our encoding $z \in \mathbb{R}^{N}$: $\hat{X}' = z_{1:G} W_{dec}[:G,:] + B_{center}$, where $G$ represents the granularity, or the length of the code the decoder receives. We propose two ranking methods for determining $\pi$: sorting by mean squared activation, $E[\mathrm{activation}^2]$; or sorting by mean activation frequency, $E[\mathbb{1}_{|\mathrm{activation}|>0}]$.

Figure 4: Granularity vs FVU (normalized reconstruction loss) for non-permuted (baseline), permuted based on $E[\mathrm{activation}^2]$, and permuted based on $E[\mathbb{1}_{\mathrm{activation}>0}]$. Relative sparsity is fixed such that k non-zero / granularity is constant for all granularities.

Our results demonstrate that sorting by $E[\mathrm{activation}^2]$ consistently achieves the best reconstruction performance across all granularities, and we therefore adopt this ranking method for all subsequent experiments.
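To make this concrete, the following is a minimal sketch, in PyTorch, of the TopK SAE forward pass from Eqs. (2)-(3) together with the ranking-and-permutation step just described: estimate $E[\mathrm{activation}^2]$ per feature on a held-out batch, reorder encoder columns and decoder rows accordingly, and decode with only the first $G$ features. The framework choice and all names (`topk_activation`, `importance_permutation`, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    # TopK activation: keep the k largest pre-activations per sample, zero out the rest.
    vals, idx = pre_acts.topk(k, dim=-1)
    return torch.zeros_like(pre_acts).scatter_(-1, idx, vals)

def sae_forward(X, W_enc, B_enc, W_dec, B_center, k):
    # Eq. (2): encode the centered input; Eq. (3): decode back to activation space.
    z = topk_activation((X - B_center) @ W_enc + B_enc, k)
    X_hat = z @ W_dec + B_center
    return z, X_hat

@torch.no_grad()
def importance_permutation(W_enc, B_enc, W_dec, B_center, X_sample, k):
    # Rank features by E[activation^2] on a held-out sample (the ranking adopted above).
    z, _ = sae_forward(X_sample, W_enc, B_enc, W_dec, B_center, k)
    order = (z ** 2).mean(dim=0).argsort(descending=True)
    # Apply the permutation pi: encoder columns, encoder bias, and decoder rows together.
    return W_enc[:, order], B_enc[order], W_dec[order, :]

@torch.no_grad()
def progressive_reconstruction(X, W_enc, B_enc, W_dec, B_center, G, k):
    # After permutation, decode using only the first G (most important) features.
    return sae_forward(X, W_enc[:, :G], B_enc[:G], W_dec[:G, :], B_center, k)[1]
```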
5 Matryoshka SAEs

We introduce a new method for jointly training nested SAEs by applying principles from MRL (Kusupati et al., 2024). Consider an SAE with weights $W_{dec} \in \mathbb{R}^{N \times D}$ and $W_{enc} \in \mathbb{R}^{D \times N}$ and biases $B_{center} \in \mathbb{R}^{D}$ and $B_{enc} \in \mathbb{R}^{N}$, as in Section 2. Let $\mathcal{M} = \{m_1, \dots, m_k\}$ be the set of representation sizes we want to learn. We denote the forward pass of the SAE at dimension $m_i$, $F_{1:m_i}(\cdot;\theta_F)$, as:

$z_{1:m_i} = \sigma((X - B_{center}) \cdot W_{enc_{:,1:m_i}} + B_{enc_{1:m_i}})$   (6)

$\hat{X}_{1:m_i} = z_{1:m_i} \cdot W_{dec_{1:m_i,:}} + B_{center}$   (7)

Figure 5: Architectural diagram of the Matryoshka SAE, showing nested latent representations of decreasing dimensionality. The encoder and decoder are shared by each nesting.

We implement weight sharing in both the encoder and decoder. As $z_{1:m_1} \subseteq z_{1:m_2} \subseteq \dots \subseteq z_{1:m_k}$, we only have to compute $z_{1:m_k}$; every smaller $z_{1:m_i}$ will then already have been computed. This is crucial, as the encoding step is the most computationally expensive part of SAE training (Gao et al., 2024; Mudide et al., 2024). The computational complexity of a naive implementation of an SAE for a batch of size N is dominated by the matrix multiplications, $O(N \cdot D \cdot N)$ for both encoding and decoding, totaling $O(4 \cdot N \cdot D \cdot N)$ for the forward and backward pass. However, as observed by (Gao et al., 2024), the latent vector is highly sparse, and with an efficient sparse-dense matmul kernel we can compute the decoding step in $O(N \cdot D \cdot k)$. By amortizing this cost over one encoding step, we can train $|\mathcal{M}|$ nested models for the cost of training the largest one. This gives us a cost of approximately $\frac{\max \mathcal{M}}{\sum \mathcal{M}}$ relative to training $|\mathcal{M}|$ separate SAEs, and as both the encoder and decoder weights are shared, there is no memory overhead. For an efficient implementation of the sparse-dense matmul kernel, we use the kernel by (Gao et al., 2024).
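A minimal sketch of the nested forward pass in Eqs. (6)-(7), reusing the `topk_activation` helper from the Section 4 sketch above. As described in the text, the latent code is computed once at the largest width and each granularity reuses a prefix of it; names and the framework are assumptions.

```python
import torch

def matryoshka_forward(X, W_enc, B_enc, W_dec, B_center, k, sizes):
    # Single encoding pass at the largest width max(M); sparsity (TopK) is enforced
    # on the full latent z only, as stated in Section 5, so every prefix is reusable.
    z = topk_activation((X - B_center) @ W_enc + B_enc, k)
    # Decode each nested prefix with the shared decoder prefix (Eq. 7).
    recons = {m: z[:, :m] @ W_dec[:m, :] + B_center for m in sizes}
    return z, recons

# Example granularities from the experiments: sizes = [2**14, 2**15, 2**16].
```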
Time per step during training increases by a factor of ≈1.25 for Matryoshka TopK SAEs; however, this ratio decreases quickly with higher sparsity and larger model sizes. To minimize the amount of dead features, we include the auxiliary loss of (Gao et al., 2024). We denote the reconstructed activation for granularity $m$ as $\hat{X}_m$, and optimize the following loss:

$\mathcal{L} = \underbrace{\frac{1}{|D|} \sum_{X \in D} \sum_{m \in \mathcal{M}} c_m \cdot |X - \hat{X}_m|_2^2}_{\text{reconstruction loss}} + \underbrace{\lambda S(z)}_{\text{sparsity loss}} + \underbrace{\alpha \cdot \mathcal{L}_{aux}(z)}_{\text{auxiliary loss}}$   (8)

where the sparsity and feature activity constraints are only enforced on the full latent representation $z$, not separately on each nested representation.

6 Evaluation and Comparison

We compare our two approaches, Matryoshka SAEs and column permutation, against baseline TopK SAEs (Gao et al., 2024). We evaluate the performance of our methods by measuring granularity versus reconstruction fidelity at a fixed relative sparsity (that is, given our pretrained SAE with dimension $N \in \mathbb{N}$, $K \in \mathbb{N}$, and $G_1, \dots, G_n \in \mathbb{N}_{\le N}$, $K_1, \dots, K_n \in \mathbb{N}_{\le K}$, we have $\frac{K}{N} = \frac{K_1}{G_1} = \dots = \frac{K_n}{G_n}$), as well as sparsity versus reconstruction fidelity (Rajamanoharan et al., 2024). We train models on 50 million tokens extracted from the second layer residual stream activations (positions 0-512) of Gemma-2-2b (Team et al., 2024), using a random subset of the Pile uncopyrighted dataset (available at https://huggingface.co/datasets/monology/pile-uncopyrighted). The experiments utilized granularities $\mathcal{M} = \{2^{14}, 2^{15}, 2^{16}\}$ and sparsity levels $\{\frac{64}{2^{16}}, \frac{128}{2^{16}}, \frac{256}{2^{16}}, \frac{512}{2^{16}}\}$, where $k$ is the numerator. We trained three non-Matryoshka models for each Matryoshka SAE, matching the activation function and dictionary size across granularities $m_i$. Training hyperparameters followed established configurations from prior work (Bricken et al., 2023; Templeton et al., 2024; Lieberum et al., 2024; Rajamanoharan et al., 2024); the configuration used for the Matryoshka SAEs is summarized in Table 1 below.
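For concreteness, the objective in Eq. (8) can be sketched as a weighted sum of per-granularity reconstruction losses over the outputs of the `matryoshka_forward` sketch above. The granularity weights $c_m$, the auxiliary-loss implementation, and the default scale $\alpha = 1/32$ (taken from Table 1) are assumptions of this sketch.

```python
import torch

def matryoshka_loss(X, recons, c, aux_loss=torch.tensor(0.0), alpha=1.0 / 32):
    # Weighted reconstruction term of Eq. (8); with TopK the sparsity term is implicit,
    # so only the auxiliary dead-feature loss is added explicitly.
    rec = sum(c[m] * ((X - X_hat_m) ** 2).sum(dim=-1).mean()
              for m, X_hat_m in recons.items())
    return rec + alpha * aux_loss
```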
Parameter | Value | Description
Learning Rate | $10^{-4}$ | Optimization step size
Weight Decay | $10^{-2}$ | L2 regularization coefficient
AdamW | $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\epsilon = 10^{-8}$ | Momentum and stability terms
Dictionary Sizes ($\mathcal{M}$) | $\{2^{14}, 2^{15}, 2^{16}\}$ | Nested model granularities
TopK Values (K) | $\{64, 128, 256, 512\}$ | Active features per granularity
Auxiliary Loss Scale | $\frac{1}{32}$ | Dead feature regularization
$k_{aux}$ | $\min\{\frac{d_{sae}}{2}, n\_dead\}$ | Auxiliary loss feature count
Training Data | 50M tokens | Pile uncopyrighted subset
Context Window | 0-512 | Token positions sampled

Table 1: Training Configuration for Matryoshka SAEs

All evaluation metrics were computed on a held-out test set of $10^5$ tokens. As a baseline comparison, we evaluate our results against the JumpReLU SAEs from the GemmaScope family of models (Lieberum et al., 2024). While this comparison provides useful context, several important caveats should be noted: the training distributions differ, as the exact distribution used in (Lieberum et al., 2024) is not publicly documented, and their models are trained on at least an order of magnitude more tokens. Furthermore, the models employ different activation functions (JumpReLU versus TopK), which introduces fundamental architectural differences in how features are encoded and activated.

Progressive coding frontier. We compute the loss function $\mathcal{L}(Z_{1:G})$ across granularities $G \in \mathcal{M} = \{5000, 10000, \dots\}$, where $G$ represents the dimensionality of the latent space. For each granularity, we maintain a fixed sparsity ratio. The evaluation of SAEs remains an open research question (Makelov et al., 2024; Gao et al., 2024). However, two metrics have emerged as standard in the literature: (a) reconstruction loss, measured by FVU (fraction of variance unexplained), or what (Gao et al., 2024) call the normalized MSE loss, defined as $\frac{\mathbb{E}[|X - \hat{X}_m|_2^2]}{\mathbb{E}[|X - \bar{X}|_2^2]}$, where $\bar{X}$ is the mean of $X$ over the batch and latent dimension and $\hat{X}_m$ is the reconstruction of $X$ using the SAE with granularity $m$; and (b) recaptured LLM loss, i.e., the cross-entropy loss of the unablated model on a dataset divided by the cross-entropy loss when the SAE is spliced into the LM's forward pass: $\frac{\text{unablated LM loss}}{\text{ablated LM loss}}$. Importantly, (Braun et al., 2024) demonstrated that discrepancies can arise between these two metrics, as reconstruction loss treats all directions in the activation space as equally important, while in practice some directions may be more functionally significant for the model's downstream performance than others. We find a correlation of about 0.8 between the two metrics (see Figure 21).
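A sketch of the two fidelity metrics just defined; following the text, $\bar{X}$ is taken as the scalar mean of $X$ over the batch and latent dimension, and the LM splicing needed for the recaptured loss is assumed to happen elsewhere.

```python
import torch

def fvu(X: torch.Tensor, X_hat: torch.Tensor) -> torch.Tensor:
    # Fraction of variance unexplained (the "normalized MSE" of Gao et al., 2024).
    residual = ((X - X_hat) ** 2).sum(dim=-1).mean()
    baseline = ((X - X.mean()) ** 2).sum(dim=-1).mean()  # X.mean(): mean over batch and dim
    return residual / baseline

def recaptured_lm_loss(unablated_ce: float, ablated_ce: float) -> float:
    # Cross-entropy of the unablated LM divided by the loss with the SAE spliced in.
    return unablated_ce / ablated_ce
```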
We employ representational similarity analysis (RSA) as an additional evaluation metric that bridges between FVU and recaptured LM loss. (RSA was developed to compare neural representations, but has found applications in machine learning as a measure of second-order isometry; Li et al., 2024a; Klabunde et al., 2024.) Given $N$ samples of $D$-dimensional activations, RSA forms a matrix $A \in \mathbb{R}^{N \times D}$ containing the original activations. For each representation space, we compute a representational dissimilarity matrix (RDM) using Euclidean distance: $\mathrm{RDM}_A = \big[\sum_{k=1}^{D} (a_{i,k} - a_{j,k})^2\big]_{i,j=1}^{N} \in \mathbb{R}^{N \times N}$. The similarity between two representation spaces is the correlation between their RDMs. For each model $m$, we obtain reconstructed activations $\hat{A}_m \in \mathbb{R}^{N \times D}$ by passing the original activations $A$ through the model. We then compute RDMs for both the original and reconstructed activations. The RSA score for model $m$ is computed as the Pearson correlation between the upper triangular elements of the original and reconstructed RDMs: $\mathrm{RSA}_m = \mathrm{corr}(\mathrm{triu}(\mathrm{RDM}_A), \mathrm{triu}(\mathrm{RDM}_{\hat{A}_m}))$. We examine how these metrics correlate in Figure 21 (Appendix B).

Figure 6: Mean cross-entropy loss per token for Gemma-2-2b divided by the cross-entropy loss using the SAE reconstruction, computed over $10^5$ tokens on the pile-uncopyrighted dataset. K refers to the sparsity mechanism of the TopK activation function, which all our models use; in TopK, only the K largest features are used, and all others are set to zero.

Results. For all granularities, the Matryoshka SAE outperforms the baseline SAEs as well as the baseline column-permuted SAE on the granularity-versus-reconstruction-fidelity frontier. This suggests that the Matryoshka SAE has learned to be a more efficient progressive coder. We also observe that applying the column permutation approach to the Matryoshka SAE increases performance further, although we believe this impact is greatly diminished when using more granularities.

Figure 7: FVU for Gemma-2-2b SAE reconstructions, computed over $10^5$ tokens on the pile-uncopyrighted dataset.

Figure 8: RSA between original and SAE-reconstructed activations for Gemma-2-2b, computed over $10^5$ tokens on the pile-uncopyrighted dataset.

Sparsity-Fidelity Frontier. Next, we evaluate the sparsity vs fidelity frontier for our different approaches. For a fixed dictionary size, we evaluate models with four different sparsity levels, using the hyperparameters described in Table 1. We measure sparsity using $\mathbb{E}[|z|_0]$, which equals $k$ when using the TopK activation function, as it fixes the number of non-zero latents. We evaluate models using the same performance metrics as in the progressive coding evaluation above, testing each model at full capacity (i.e., using all available features, with granularity $G$ equal to the model's total dimension $N$).
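For reference, a minimal sketch of the RSA score defined earlier in this section; scipy is an assumed dependency, and the RDM follows the squared-distance form displayed above (swap the metric to "euclidean" if plain distances are intended).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def rsa_score(A: np.ndarray, A_hat: np.ndarray) -> float:
    # pdist returns exactly the upper-triangular entries of the N x N dissimilarity matrix.
    rdm_original = pdist(A, metric="sqeuclidean")        # sum_k (a_ik - a_jk)^2
    rdm_reconstructed = pdist(A_hat, metric="sqeuclidean")
    r, _ = pearsonr(rdm_original, rdm_reconstructed)
    return float(r)
```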
Results. We find that Matryoshka SAEs closely track the performance of a baseline autoencoder of the same size, both in terms of recaptured downstream cross-entropy loss and reconstruction loss (Figure 10).

Figure 9: Sparsity vs reconstruction fidelity (FVU).

Figure 10: Sparsity frontiers for different metrics, computed over $10^5$ tokens from the pile-uncopyrighted dataset.

Next we compare the performance of Matryoshka SAEs and TopK SAEs using only the first 16K and 32K latents (G). We find that applying either of our two methods (Matryoshka SAE or column permutation) to a larger SAE, and using the first n latents when reordering is applied, achieves performance comparable to training an SAE of that same size from scratch. This suggests that, given a fixed computational budget, it may be more efficient to train one large SAE and subsequently distill it into smaller ones, rather than training multiple SAEs with less compute. However, this effect becomes less pronounced as the ratio of granularity to model size decreases. While both the 65K Matryoshka SAE and the TopK permuted SAE outperform a baseline 16K TopK SAE when using only their first 16K latents, they are in turn outperformed by the 32K SAE with reordering at the 16K or 10K granularity level. This is likely attributable to the phenomenon of feature splitting (Bricken et al., 2023), where a single latent in a smaller SAE is split into multiple latents in a larger one. Thus, although we observe that features follow a power law, as our latent space grows, the importance of any given feature may be gradually diluted as it becomes distributed across multiple features. In Appendix C, we propose future approaches that might recover the performance lost from feature splitting.

Figure 11: Sparsity vs reconstruction fidelity for models using only the first 10k, 16k or 32k latents. Lower FVU is better; higher recaptured CE loss is better.

Interpretability. As Matryoshka SAEs are a new method for training SAEs, we find it important to evaluate whether this architecture compromises on interpretability. Our other approach, column permutation, is exempt from this analysis, as that method does not change the features themselves, only their ordering. We evaluate the interpretability of our architecture using two methods from the automated interpretability library 'sae-auto-interp' (EleutherAI, 2024): simulation scoring and fuzzing. We evaluate the interpretability of our models by measuring how well a large language model can predict the activation value of our features, given an LM-generated explanation produced from a training set of examples. This method was first proposed by (Bills et al., 2023), and it measures how correlated an LLM's guess of an activation is with the ground truth activation. We group our activations into 10 quantiles of 50 features based on their firing frequency, after having filtered out dead features (features with a firing frequency of 0). We compute the Pearson correlation between the activations of the SAE feature in question and the LM-simulated activation. We use sequences of context length 32, 10 test samples, and 20 training samples used to generate the LM explanations. All experiments are performed using Llama-3.1-70B (Grattafiori et al., 2024). We compare the results for our Matryoshka SAE against the baseline TopK SAE, as well as GemmaScope (Lieberum et al., 2024) JumpReLU SAEs with approximately the same dictionary size and sparsity level. We compare these against a randomly initialized SAE.
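At its core, the simulation-scoring step reduces to a per-feature Pearson correlation between ground-truth and LLM-simulated activations; the explanation generation and simulation prompting are handled by sae-auto-interp and are not reproduced in this minimal sketch.

```python
import numpy as np

def simulation_score(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    # Pearson correlation between a feature's ground-truth activations and the
    # activations an LLM simulates from that feature's natural-language explanation.
    return float(np.corrcoef(true_acts.ravel(), simulated_acts.ravel())[0, 1])
```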
Results. We find that, although the mean Pearson correlation is meaningfully higher than for the randomly initialized SAE (Figure 12) and on par with the GemmaScope models, our Matryoshka SAE underperforms the baseline TopK SAE models.

Figure 12: The Pearson correlation between Llama-3 simulated and ground truth activations. The dashed lines represent the mean per SAE type. Values above 1 are an artifact of the kernel density estimation process.

To get a better grasp of exactly which features become less interpretable, we visualize the distribution of Pearson correlations for different granularities.

Figure 13: The Pearson correlation between Llama-3 simulated and ground truth activations for different granularities of a Matryoshka SAE. Note that the granularities of 16k are a subset of 32k, etc. Values above 1 are an artifact of the kernel density estimation process.

We find that the innermost granularities are meaningfully more interpretable than the outermost, going from a mean correlation of 0.57 to 0.74. We posit that this occurs because the model, through the Matryoshka loss function (Eq. 8), becomes incentivized to effectively put the most meaningful features in the first part of the $W_{dec}$ matrix. Next we evaluate our models using fuzzing, a token-level evaluation technique introduced by (Paulo et al., 2024). In fuzzing, potentially activating tokens are highlighted within example sentences, and a language model is prompted to identify which markings are correct. Unlike simulation scoring (Bills et al., 2023), which requires predicting continuous activation values, fuzzing frames the problem as a binary classification task (Paulo et al., 2024): determining whether a token triggers a given feature or not.

Results. We plot the mean balanced accuracy of feature quantiles by frequency in Figure 14.

Figure 14: Balanced accuracy for feature indices grouped into quantiles 0-100 for 400 randomly selected features.

Matryoshka SAEs slightly underperform on this task: the first latents seem to perform better than the average, but scores quickly drop.

7 Discussion: Scaling and Granularities

An obvious question is: given a large SAE, how well can the performance of the model be predicted when only the first G elements are considered? Specifically, what is the interaction between model size (N), granularity (G), and sparsity (K) as we scale? We develop empirical scaling laws following the methodology established by (Kaplan et al., 2020), modelling how reconstruction loss (FVU) scales with model size and sparsity for baseline TopK autoencoders with dictionary permutation/reordering applied. Building on the work of (Gao et al., 2024), we extend their formulation with two terms: $\beta_g \log(g)$ for the direct effect of granularity, and $\gamma_g \log(k)\log(g)$ for its interaction with sparsity.
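A sketch of how the extended scaling law (Eq. 9, shown below) could be fit; the use of scipy's curve_fit and the zero initialization are assumptions of this sketch, and the paper reports performing its fit in log-log space.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, alpha, beta_k, beta_n, beta_g, gamma_n, gamma_g, zeta, eta):
    n, k, g = X
    ln, lk, lg = np.log(n), np.log(k), np.log(g)
    reducible = np.exp(alpha + beta_k * lk + beta_n * ln + beta_g * lg
                       + gamma_n * lk * ln + gamma_g * lk * lg)
    irreducible = np.exp(zeta + eta * lk)   # floor that depends on sparsity only
    return reducible + irreducible

# n_arr, k_arr, g_arr, fvu_arr: validation measurements (not reproduced here).
# params, _ = curve_fit(scaling_law, (n_arr, k_arr, g_arr), fvu_arr, p0=np.zeros(8))
```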
$L_{\mathrm{progressive}}(n,k,g) = \underbrace{\exp\big(\alpha + \beta_k \log(k) + \beta_n \log(n) + \beta_g \log(g) + \gamma_n \log(k)\log(n) + \gamma_g \log(k)\log(g)\big)}_{\text{loss}} + \underbrace{\exp\big(\zeta + \eta \log(k)\big)}_{\text{irreducible loss}}$   (9)

We fit our scaling law using validation data from 16k, 32k, and 65k TopK SAEs with the sparsity levels described in Table 1, inducing parameters: $\alpha = -3.60$, $\beta_k = 0.69$, $\beta_n = 0.19$, $\beta_g = 0.08$, $\gamma_n = 0.02$, $\gamma_g = -0.10$, $\zeta = -2.13$, $\eta = -0.13$, with $R^2 = 0.978$ in log-log space.

Figure 15: Loss vs. predicted loss for SAE (32k and 65k latents).

Substantial evidence supports that Matryoshka SAEs learn a hierarchy of features, placing the most important features in the first $m_k$ columns of the decoder. Earlier work on MRL (Devvrit et al., 2024) has suggested sampling granularities during training. This idea is very similar to nested dropout (Rippel et al., 2014), where higher-dimensional components of the representation are stochastically dropped out to encourage ordering of dimensions by importance. We apply this approach to Matryoshka SAEs. We hypothesize that sampling granularities dynamically would further improve progressive coding abilities, by learning a finer-grained hierarchy of features. We sample $m_i \sim U(1, N)$ uniformly at each training step, where $N$ is the maximum dimension of our latent space.

Figure 16: Sampled, non-sampled Matryoshka and baseline ($10^5$ tokens).

Sampling improves Matryoshka SAE metrics. The sampled Matryoshka SAE concentrates most of its activation mass in the first features, while the non-sampled exhibits distinct plateaus for each granularity level. The baseline TopK SAE shows a more uniform distribution of activation mass across its feature space. This suggests that both Matryoshka variants learn to concentrate important features early in their latent space, but the fixed granularity version creates more structured groupings.

Figure 17: Mean activation squared by interval in latent space: sampled, non-sampled Matryoshka and baseline.

References

Bills et al. (2023) Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

Braun et al. (2024) Braun, D., Taylor, J., Goldowsky-Dill, N., and Sharkey, L. Identifying functionally important features with end-to-end sparse dictionary learning, 2024. URL https://arxiv.org/abs/2405.12241.

Bricken et al.
(2023) Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html. Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165. Bussmann et al. (2024) Bussmann, B., Leask, P., and Nanda, N. Learning multi-level features with matryoshka saes. AI Alignment Forum, 12 2024. URL https://w.alignmentforum.org/posts/learning-multi-level-features-with-matryoshka-saes. Cai et al. (2024) Cai, M., Yang, J., Gao, J., and Lee, Y. J. Matryoshka multimodal models, 2024. URL https://arxiv.org/abs/2405.17430. Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. Palm: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311. Cunningham et al. (2023) Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600. Devvrit et al. (2024) Devvrit, Kudugunta, S., Kusupati, A., Dettmers, T., Chen, K., Dhillon, I., Tsvetkov, Y., Hajishirzi, H., Kakade, S., Farhadi, A., and Jain, P. Matformer: Nested transformer for elastic inference, 2024. URL https://arxiv.org/abs/2310.07707. EleutherAI (2024) EleutherAI. sae-auto-interp. https://github.com/EleutherAI/sae-auto-interp, 2024. URL https://blog.eleuther.ai/autointerp/. Elhage et al. (2022) Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html. Gao et al. (2024) Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders, 2024. URL https://arxiv.org/abs/2406.04093. Grattafiori et al. 
(2024) Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., McConnell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Allonsius, D., Song, D., Pintz, D., Livshits, D., Wyatt, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., AlBadawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Guzmán, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L., Thattai, G., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Korevaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I., Misra, I., Evtimov, I., Zhang, J., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Prasad, K., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., El-Arini, K., Iyer, K., Malik, K., Chiu, K., Bhalla, K., Lakhotia, K., Rantala-Yeary, L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher, L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M., Tsimpoukelli, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh, M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Zhang, N., Duchenne, O., Çelebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal, P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer, R., Cabral, R. S., Stojnic, R., Raileanu, R., Maheswari, R., Girdhar, R., Patel, R., Sauvestre, R., Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabasappa, S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S., Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan, S., Borodinsky, S., Herman, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speckbacher, T., Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet, V., Do, V., Vogeti, V., Albiero, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet, X., Wang, X., Wang, X., Tan, X. E., Xia, X., Xie, X., Jia, X., Wang, X., Goldschlag, Y., Gaur, Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. D., Yan, Z., Chen, Z., Papakipos, Z., Singh, A., Srivastava, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria, A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Baevski, A., Feinstein, A., Kallet, A., Sangani, A., Teo, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poulton, A., Ryan, A., Ramchandani, A., Dong, A., Franco, A., Goyal, A., Saraf, A., Chowdhury, A., Gabriel, A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B., Loyd, B., Paola, B. 
D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B., Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Liu, C., Wang, C., Kim, C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Gao, C., Civin, D., Beaty, D., Kreymer, D., Li, D., Adkins, D., Xu, D., Testuggine, D., David, D., Parikh, D., Liskovich, D., Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Montgomery, E., Presani, E., Hahn, E., Wood, E., Le, E.-T., Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F., Tian, F., Kokkinos, F., Ozgenel, F., Caggioni, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G., Badeer, G., Swee, G., Halpern, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G., Inan, H., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren, H., Goldman, H., Zhan, H., Damlaj, I., Molybog, I., Tufanov, I., Leontiadis, I., Veliche, I.-E., Gat, I., Weissman, J., Geboski, J., Kohli, J., Lam, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J., Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard, J., McPhie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. H., Saxena, K., Khandelwal, K., Zand, K., Matosich, K., Veeraraghavan, K., Michelena, K., Li, K., Jagadeesh, K., Huang, K., Chawla, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L., Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Mankus, M., Hasson, M., Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Liu, M., Seltzer, M. L., Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang, M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White, N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Mehta, N., Laptev, N. P., Dong, N., Cheng, N., Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner, P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao, R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Parthasarathy, R., Li, R., Hogan, R., Battey, R., Wang, R., Howes, R., Rinott, R., Mehta, S., Siby, S., Bondu, S. J., Datta, S., Chugh, S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Mahajan, S., Verma, S., Yamamoto, S., Ramaswamy, S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Patil, S., Shankar, S., Zhang, S., Zhang, S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satterfield, S., Govindaprasad, S., Gupta, S., Deng, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S., Goldman, S., Remez, T., Glaser, T., Best, T., Koehler, T., Robinson, T., Li, T., Zhang, T., Matthews, T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla, V., Ionescu, V., Poenaru, V., Mihailescu, V. T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz, W., Constable, W., Tang, X., Wu, X., Wang, X., Wu, X., Gao, X., Kleinman, Y., Chen, Y., Hu, Y., Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Zhao, Y., Hao, Y., Qian, Y., Li, Y., He, Y., Rait, Z., DeVito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., Zhao, Z., and Ma, Z. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783. Gu et al. (2024) Gu, J., Zhai, S., Zhang, Y., Susskind, J., and Jaitly, N. Matryoshka diffusion models, 2024. 
URL https://arxiv.org/abs/2310.15111. Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556. Hurley & Rickard (2009) Hurley, N. P. and Rickard, S. T. Comparing measures of sparsity, 2009. URL https://arxiv.org/abs/0811.4706. Johnson & Lindenstrauss (1984) Johnson, W. B. and Lindenstrauss, J. Extensions of lipschitz mappings into hilbert space. Contemporary mathematics, 26:189–206, 1984. URL https://api.semanticscholar.org/CorpusID:117819162. Kaplan et al. (2020) Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361. Klabunde et al. (2024) Klabunde, M., Schumacher, T., Strohmaier, M., and Lemmerich, F. Similarity of neural network models: A survey of functional and representational measures, 2024. URL https://arxiv.org/abs/2305.06329. Kusupati et al. (2024) Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. Matryoshka representation learning, 2024. URL https://arxiv.org/abs/2205.13147. Li et al. (2024a) Li, J., Kementchedjhieva, Y., Fierro, C., and Søgaard, A. Do vision and language models share concepts? a vector space alignment study. Transactions of the Association for Computational Linguistics, 12:1232–1249, 2024a. doi: 10.1162/tacl˙a˙00698. URL https://aclanthology.org/2024.tacl-1.68/. Li et al. (2024b) Li, Y., Michaud, E. J., Baek, D. D., Engels, J., Sun, X., and Tegmark, M. The geometry of concepts: Sparse autoencoder feature structure, 2024b. URL https://arxiv.org/abs/2410.19750. Lieberum et al. (2024) Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024. URL https://arxiv.org/abs/2408.05147. Makelov et al. (2024) Makelov, A., Lange, G., and Nanda, N. Towards principled evaluations of sparse autoencoders for interpretability and control, 2024. URL https://arxiv.org/abs/2405.08366. Mudide et al. (2024) Mudide, A., Engels, J., Michaud, E. J., Tegmark, M., and de Witt, C. S. Efficient dictionary learning with switch sparse autoencoders, 2024. URL https://arxiv.org/abs/2410.08201. Nabeshima (2024) Nabeshima, N. Matryoshka sparse autoencoders. AI Alignment Forum, 12 2024. Paulo et al. (2024) Paulo, G., Mallen, A., Juang, C., and Belrose, N. Automatically interpreting millions of features in large language models, 2024. URL https://arxiv.org/abs/2410.13928. Rajamanoharan et al. (2024) Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders, 2024. URL https://arxiv.org/abs/2407.14435. Rippel et al. (2014) Rippel, O., Gelbart, M. A., and Adams, R. P. Learning ordered representations with nested dropout, 2014. URL https://arxiv.org/abs/1402.0915. Skodras et al. (2001) Skodras, A., Christopoulos, C., and Ebrahimi, T. The jpeg 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36–58, 2001. 
doi: 10.1109/79.952804. Team et al. (2024) Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsitsulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J.-B., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Parrish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Paterson, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozińska, D., Herbison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak-Plucińska, H., Batra, H., Dhand, H., Nardini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., yeong Ji, J., Mohamed, K., Badola, K., Black, K., Millican, K., McDonell, K., Nguyen, K., Sodhia, K., Greene, K., Sjoesund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., McNealus, L., Soares, L. B., Kilpatrick, L., Dixon, L., Martins, L., Reid, M., Singh, M., Iverson, M., Görner, M., Velloso, M., Wirth, M., Davidow, M., Miller, M., Rahtz, M., Watson, M., Risdal, M., Kazemi, M., Moynihan, M., Zhang, M., Kahng, M., Park, M., Rahman, M., Khatwani, M., Dao, N., Bardoliwalla, N., Devanathan, N., Dumai, N., Chauhan, N., Wahltinez, O., Botarda, P., Barnes, P., Barham, P., Michel, P., Jin, P., Georgiev, P., Culliton, P., Kuppala, P., Comanescu, R., Merhej, R., Jana, R., Rokni, R. A., Agarwal, R., Mullins, R., Saadat, S., Carthy, S. M., Cogan, S., Perrin, S., Arnold, S. M. R., Krause, S., Dai, S., Garg, S., Sheth, S., Ronstrom, S., Chan, S., Jordan, T., Yu, T., Eccles, T., Hennigan, T., Kocisky, T., Doshi, T., Jain, V., Yadav, V., Meshram, V., Dharmadhikari, V., Barkley, W., Wei, W., Ye, W., Han, W., Kwon, W., Xu, X., Shen, Z., Gong, Z., Wei, Z., Cotruta, V., Kirk, P., Rao, A., Giang, M., Peran, L., Warkentin, T., Collins, E., Barral, J., Ghahramani, Z., Hadsell, R., Sculley, D., Banks, J., Dragan, A., Petrov, S., Vinyals, O., Dean, J., Hassabis, D., Kavukcuoglu, K., Farabet, C., Buchatskaya, E., Borgeaud, S., Fiedel, N., Joulin, A., Kenealy, K., Dadashi, R., and Andreev, A. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118. Templeton et al. (2024) Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html. Yun et al. (2023) Yun, Z., Chen, Y., Olshausen, B. A., and LeCun, Y. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2023. URL https://arxiv.org/abs/2103.15949. 
Appendix A Feature Splitting

Feature splitting is the phenomenon whereby, as the dictionary grows in size, one basis vector gets decomposed into multiple separate basis vectors. The standard TopK SAE exhibits a relatively uniform diagonal pattern, indicating that similar features tend to be distributed throughout the latent space with a natural locality, likely a function of random initialization. In contrast, the Matryoshka TopK SAE shows a distinctive stepped pattern, with clear discontinuities at the model's granularity boundaries $2^{14}, 2^{15}, 2^{16}$. This indicates that features within each granularity level form relatively isolated clusters, with limited similarity to features in other granularity levels.

Figure 18: We compute the mean index of the top 5 closest features for each feature, for the Matryoshka TopK SAE.

Figure 19: We compute the mean index of the top 5 closest features for each feature, for the TopK SAE.

The natural locality of features in the standard TopK SAE can be attributed to the random initialization process, where nearby features in the latent space tend to develop related functionality during training.

Figure 20: We compute the mean index of the top 5 closest features for each feature, for a randomly initialized decoder.

Appendix B Figures

Figure 21: Correlation analysis between different evaluation metrics (FVU, CE loss, and RSA). The scatter plots show pairwise relationships with linear regression fits, displaying both Pearson correlation coefficients (r) and coefficients of determination (R²).

Appendix C Limitations and Future Work

While our results are promising, it is important to note that our experiments were conducted on relatively modest-sized SAEs compared to recent work scaling to tens of millions of features (Templeton et al., 2024; Gao et al., 2024). Our methods remain to be validated at larger scales, though we find the observation that the dictionary power law holds at multiple scales encouraging.

A key limitation in our implementation of Matryoshka SAEs lies in the decoder kernel (Gao et al., 2024), which has not been optimized for performing multiple decoding passes per encoding step. This leads to redundant computation: the decode kernel is invoked $|\mathcal{M}|$ times, once for each $m \in \mathcal{M}$, separately computing $W_{\mathrm{Dec}_{1:m}}\,\mathrm{topk}(z_{1:m})$ for each granularity level. Given our weight-sharing structure, where $W_{\mathrm{Dec}_{1:m_0}} \subseteq W_{\mathrm{Dec}_{1:m_k}}$, and the fact that it is highly likely that $\mathrm{topk}(z_{1:m_0}) \subset \mathrm{topk}(z_{1:m_1}) \subset \cdots \subset \mathrm{topk}(z_{1:m_n})$, a modified implementation could be meaningfully faster.
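To make the redundancy concrete, the following is a minimal PyTorch sketch of the naive per-granularity decoding loop described above. It is not the optimized kernel of Gao et al. (2024) nor our training code; the function name, the $(N, d_{\text{model}})$ decoder layout, and the toy sizes are illustrative assumptions.

```python
import torch

def progressive_decode(z, W_dec, b_dec, granularities, k=32):
    """Decode one latent code at several nested granularities.

    z:             (batch, N) latent codes from a single encoder pass
    W_dec:         (N, d_model) shared decoder dictionary; rows 0..m-1 form the prefix W_Dec_{1:m}
    b_dec:         (d_model,) decoder bias
    granularities: increasing prefix sizes, e.g. [2**14, 2**15, 2**16]
    k:             number of active latents (TopK sparsity)
    """
    recons = {}
    for m in granularities:
        # Each pass redoes TopK and the dictionary gather from scratch. Because the active
        # sets are typically nested across granularities, a fused kernel could reuse most
        # of this work instead of invoking the decode step |M| separate times.
        vals, idx = torch.topk(z[:, :m], k, dim=-1)   # TopK over the first m latents only
        acts = torch.relu(vals)
        recons[m] = torch.einsum("bk,bkd->bd", acts, W_dec[idx]) + b_dec
    return recons


# Toy usage with hypothetical sizes.
batch, d_model, N = 4, 64, 1024
z = torch.randn(batch, N)
W_dec = torch.randn(N, d_model) / d_model ** 0.5
b_dec = torch.zeros(d_model)
reconstructions = progressive_decode(z, W_dec, b_dec, granularities=[256, 512, 1024])
```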
The challenge of feature splitting presents another significant limitation. While permuting dictionaries by $\mathbb{E}[\mathrm{activation}^2]$ ordering provides a lightweight approach to distilling large pretrained SAEs (see the sketch at the end of this appendix), this method becomes less effective as the ratio of granularity to model size, $\frac{G}{N}$, decreases. This degradation occurs because a single feature in a small SAE is decomposed into multiple features in larger ones, and selecting only the most important of these split features fails to capture the complete functionality present in the original, unified feature. Future research could focus on developing efficient methods to recombine or "reverse" this feature splitting during the distillation process, potentially through feature clustering or adaptive merging strategies.

To the best of our knowledge, we are the first to observe that, because the decoding step in SAEs is highly sparse, every sparse code can be decoded multiple times using different parts of the dictionary with asymptotically negligible overhead. We consider other training approaches that apply these ideas highly promising and likely more computationally efficient than current methods.
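As a concrete illustration of the pruning step referenced above, here is a minimal sketch, under the assumption that feature importance is estimated as the mean squared latent activation over a held-out sample of model activations; the function and tensor names are illustrative rather than taken from our codebase.

```python
import torch

@torch.no_grad()
def prune_by_expected_sq_activation(encode, W_dec, sample_acts, granularity):
    """Order SAE features by E[activation^2] on a sample and keep the top G.

    encode:       callable mapping model activations (B, d_model) -> latent codes (B, N)
    W_dec:        (N, d_model) decoder dictionary of the pretrained vanilla SAE
    sample_acts:  (B, d_model) held-out model activations used to estimate importance
    granularity:  G, the number of features to keep
    """
    z = encode(sample_acts)                       # (B, N) latent codes
    importance = (z ** 2).mean(dim=0)             # per-feature E[activation^2]
    order = torch.argsort(importance, descending=True)
    keep = order[:granularity]                    # indices of the G most important features
    return keep, W_dec[keep]                      # pruned dictionary of shape (G, d_model)
```

At inference time the same index set would also be applied to the corresponding encoder rows and bias entries, so that the pruned SAE is a strict subset of the original and no retraining is required.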