Paper deep dive
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
Models: GELU-1L, Gemma-7B, Pythia-2.8B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/12/2026, 8:21:35 PM
Summary
The paper introduces Gated Sparse Autoencoders (Gated SAEs), a modification to standard sparse autoencoders designed to improve dictionary learning in language models. By decoupling feature detection from magnitude estimation, Gated SAEs mitigate the 'shrinkage' bias caused by L1 penalties, achieving better reconstruction fidelity and sparsity trade-offs compared to baseline SAEs.
Entities (5)
Relation Signals (3)
Gated Sparse Autoencoder → trained_on → Gemma-7B
confidence 100% · We evaluate Gated SAEs on multiple models: ... Gemma-7B
Gated Sparse Autoencoder → outperforms → Sparse Autoencoder
confidence 95% · Gated SAEs are a Pareto improvement over baseline SAEs holding training compute fixed
Gated Sparse Autoencoder → resolves → Shrinkage
confidence 95% · Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
Cypher Suggestions (2)
Identify phenomena addressed by specific architectures · confidence 95% · unvalidated
MATCH (a:Architecture)-[:RESOLVES]->(p:Phenomenon) RETURN a.name, p.name
Find all language models evaluated with Gated SAEs · confidence 90% · unvalidated
MATCH (s:Architecture {name: 'Gated Sparse Autoencoder'})-[:TRAINED_ON]->(lm:LanguageModel) RETURN lm.name
Abstract: Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.
Tags
Links
- Source: https://arxiv.org/abs/2404.16014
- Canonical: https://arxiv.org/abs/2404.16014
Full Text
102,436 characters extracted from source content.
2024-5-1 Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum†, Vikrant Varma†, János Kramár, Rohin Shah and Neel Nanda
*: Joint contribution. †: Core infrastructure contributor.

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the Gated Sparse Autoencoder (Gated SAE), which achieves a Pareto improvement over training with prevailing methods. In SAEs, the L1 penalty used to encourage sparsity introduces many undesirable biases, such as shrinkage -- systematic underestimation of feature activations. The key insight of Gated SAEs is to separate the functionality of (a) determining which directions to use and (b) estimating the magnitudes of those directions: this enables us to apply the L1 penalty only to the former, limiting the scope of undesirable side effects. Through training SAEs on LMs of up to 7B parameters we find that, in typical hyper-parameter ranges, Gated SAEs solve shrinkage, are similarly interpretable, and require half as many firing features to achieve comparable reconstruction fidelity.

1. Introduction

Mechanistic interpretability research aims to explain how neural networks produce outputs in terms of the learned algorithms executed during a forward pass (Olah, 2022; Olah et al., 2020). Much work makes use of the fact that many concept representations appear to be linear (Elhage et al., 2021; Gurnee et al., 2023; Olah et al., 2020; Park et al., 2023). However, finding the set of all interpretable directions is a highly non-trivial problem. Classic approaches, like interpreting neurons (i.e.
directions in the standard basis) are insufficient, as many are polysemantic and tend to activate for a range of different, seemingly unrelated concepts (Bolukbasi et al., 2021; Elhage et al., 2022a,b, Empirical Phenomena). The superposition hypothesis (Elhage et al., 2022b, Definitions and Motivation) posits a mechanistic explanation for these observations: in an intermediate representation of dimension $n$, a model will encode $M \gg n$ concepts as linear directions, where the set of concepts and their directions is fixed across all inputs, but on a given input only a sparse subset of concepts are active, ensuring that there is not much simultaneous interference (Gurnee et al., 2023, Appendix A) between these (non-orthogonal) concepts. Motivated by the superposition hypothesis, Bricken et al. (2023) and Cunningham et al. (2023) recently used sparse autoencoders (SAEs; Ng (2011)) to find sparse decompositions of model activations in terms of an overcomplete basis, or dictionary (Mallat and Zhang, 1993).¹ Although SAEs show promise in this regard, the L1 penalty used in the prevailing training method to encourage sparsity also introduces biases that harm the accuracy of SAE reconstructions, as the loss can be decreased by trading off some reconstruction accuracy for lower L1. We refer to this

¹ Although motivated by the superposition hypothesis, the utility of this line of research is not contingent on this hypothesis being true. If a faithful, sparse and interpretable decomposition can be found, we expect this to be a useful basis in its own right for downstream interpretability tasks, such as understanding or intervening on a model's representations and circuits, even if some fraction of the model's computation is e.g. represented non-linearly and not captured.

Corresponding authors: srajamanoharan@google.com and neelnanda@google.com
©2024 Google DeepMind.
existing training methodology as the baseline SAE, defined fully in Sections 2.1-2.2, and which borrows heavily from Bricken et al. (2023). In this paper, we introduce a modification to the baseline SAE architecture -- a Gated SAE -- along with an accompanying loss function, which partially overcomes these limitations. Our key insight is to use separate affine transformations for (a) determining which dictionary elements to use in a reconstruction and (b) estimating the coefficients of active elements, and to apply the sparsity penalty only to the former task. We share a subset of weights between these transformations to avoid significantly increasing the parameter count and inference-time compute requirements of a Gated SAE compared to a baseline SAE of equivalent width.²

We evaluate Gated SAEs on multiple models: a one-layer GELU activation language model (Nanda, 2022), Pythia-2.8B (Biderman et al., 2023) and Gemma-7B (Gemma Team et al., 2024), and on multiple sites within models: MLP layer outputs, attention layer outputs, and residual stream activations. Across these models and sites, we find Gated SAEs to be a Pareto improvement over baseline SAEs holding training compute fixed (Fig. 1): they yield sparser decompositions at any desired level of reconstruction fidelity.

Figure 1 | The performance of Gated SAEs compared to the baseline SAE at Layer 20 in Gemma-7B (log-scale axes from L0=2 to L0=200; panels: residual stream post-MLP, MLP output, attention output pre-linear; axes: L0 (lower is sparser) vs. loss recovered (fidelity)). The SAEs are trained with equal compute, since the baseline SAEs have 50% more learned features (Section 4.1). This performance improvement holds in layers throughout GELU-1L, Pythia-2.8B and Gemma-7B (Appendix B). Full detail in Tables 2 and 4.

All rights reserved. arXiv:2404.16014v2 [cs.LG] 30 Apr 2024
We also conduct further follow-up ablations and investigations on a subset of these models and sites to better understand the differences between Gated SAEs and baseline SAEs. Overall, the key contributions of this work are that we:

1. Introduce the Gated SAE, a modification to the standard SAE architecture that decouples detection of which features are present from estimating their magnitudes (Section 3.2);
2. Show that Gated SAEs Pareto improve the sparsity and reconstruction fidelity trade-off, compared to baseline SAEs (Section 4.1);
3. Confirm that Gated SAEs overcome the shrinkage problem (Section 4.2), while outperforming other methods that also address this problem (Section 5.1);
4. Provide evidence from a small double-blind study that Gated SAE features are comparably interpretable to baseline SAE features (Section 4.3).

² Although due to an auxiliary loss term, computing the Gated SAE loss for training purposes does require 50% more compute than computing the loss for a matched-width baseline SAE.

2. Sparse Autoencoder Background

In this section we summarise the concepts and notation necessary to understand existing SAE architectures and training methods, which we call the baseline SAE. We define Gated SAEs in Section 3.2. We follow notation broadly similar to Bricken et al. (2023) and recommend that work as a more complete introduction to training SAEs on LMs. As motivated in Section 1, we wish to decompose a model's activation $\mathbf{x} \in \mathbb{R}^n$ into a sparse, linear combination of feature directions:

$$\mathbf{x} \approx \mathbf{x}_0 + \sum_{i=1}^{M} f_i(\mathbf{x})\, \mathbf{d}_i, \qquad (1)$$

where the $\mathbf{d}_i$ are $M \gg n$ latent unit-norm feature directions, and the sparse coefficients $f_i(\mathbf{x}) \geq 0$ are the corresponding feature activations for $\mathbf{x}$.³ The right-hand side of Eq. (1) naturally has the structure of an autoencoder: an input activation $\mathbf{x}$ is encoded into a (sparse) feature activations vector $\mathbf{f}(\mathbf{x}) \in \mathbb{R}^M$, which in turn is linearly decoded to reconstruct $\mathbf{x}$.

2.1.
Baseline Architecture

Using this correspondence, Bricken et al. (2023) and subsequent works attempt to learn a suitable sparse decomposition by parameterising a single-layer autoencoder $(\mathbf{f}, \hat{\mathbf{x}})$ defined by:

$$\mathbf{f}(\mathbf{x}) := \mathrm{ReLU}\left(\mathbf{W}_{\mathrm{enc}}(\mathbf{x} - \mathbf{b}_{\mathrm{dec}}) + \mathbf{b}_{\mathrm{enc}}\right) \qquad (2)$$
$$\hat{\mathbf{x}}(\mathbf{f}) := \mathbf{W}_{\mathrm{dec}}\,\mathbf{f} + \mathbf{b}_{\mathrm{dec}} \qquad (3)$$

and training it (Section 2.2) to reconstruct a large dataset of model activations $\mathbf{x} \sim \mathcal{D}$, constraining the hidden representation $\mathbf{f}$ to be sparse.⁴ Once the sparse autoencoder has been trained, we obtain a decomposition of the form of Eq. (1) by identifying the (suitably normalised) columns of the decoder weight matrix $\mathbf{W}_{\mathrm{dec}} \in \mathbb{R}^{n \times M}$ with the dictionary of feature directions $\mathbf{d}_i$, the decoder bias $\mathbf{b}_{\mathrm{dec}} \in \mathbb{R}^n$ with the centering term $\mathbf{x}_0$, and the (suitably normalised) entries of the latent representation $\mathbf{f}(\mathbf{x}) \in \mathbb{R}^M$ with the feature activations $f_i(\mathbf{x})$.

2.2. Baseline Training Methodology

To train sparse autoencoders, Bricken et al. (2023) use a loss function that jointly encourages (i) faithful reconstruction and (ii) sparsity. Reconstruction fidelity is encouraged by the squared distance between the SAE input and its reconstruction, $\|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{f}(\mathbf{x}))\|_2^2$, which we call the reconstruction loss, whereas sparsity is encouraged by the L1 norm of the active features, $\|\mathbf{f}(\mathbf{x})\|_1$, which we call the sparsity penalty.⁵ Balancing these two terms with an L1 coefficient $\lambda$, the loss used to optimize SAEs is given by

$$\mathcal{L}(\mathbf{x}) := \|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{f}(\mathbf{x}))\|_2^2 + \lambda \|\mathbf{f}(\mathbf{x})\|_1. \qquad (4)$$

Since it is possible to arbitrarily reduce the sparsity loss term without affecting reconstructions or sparsity by simply scaling down encoder outputs and scaling up the norm of the decoder weights, it is important to constrain the norms of the columns of $\mathbf{W}_{\mathrm{dec}}$ during training. Following Bricken et al.

³ In this work, we use the term feature only in the context of the learned features of SAEs, i.e. the overcomplete basis directions that are linearly combined to produce reconstructions.
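As a concrete reference, the baseline architecture and loss of Eqs. (2)-(4) can be sketched in a few lines of NumPy. This is a minimal illustration with random parameters and our own variable names, not the paper's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 8, 32      # activation dimension n, dictionary size M (M >> n in practice)
lam = 0.01        # L1 coefficient lambda

# Baseline SAE parameters (Eqs. 2-3), randomly initialised for illustration.
W_enc = rng.normal(size=(M, n))
b_enc = np.zeros(M)
W_dec = rng.normal(size=(n, M))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm columns (Section 2.2)
b_dec = np.zeros(n)

def encode(x):
    # f(x) := ReLU(W_enc (x - b_dec) + b_enc)             (Eq. 2)
    return np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)

def decode(f):
    # x_hat(f) := W_dec f + b_dec                          (Eq. 3)
    return W_dec @ f + b_dec

def loss(x):
    # L(x) := ||x - x_hat(f(x))||_2^2 + lam ||f(x)||_1     (Eq. 4)
    f = encode(x)
    return np.sum((x - decode(f)) ** 2) + lam * np.sum(np.abs(f))

x = rng.normal(size=n)
assert np.all(encode(x) >= 0)   # ReLU output is non-negative
assert loss(x) >= 0.0
```

Note the unit-norm constraint on the decoder columns, without which the L1 term could be gamed by rescaling, as the surrounding text explains.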
In particular, learned features are always linear and not necessarily interpretable, sidestepping the difficulty in defining what a feature is (Elhage et al. (2022b)'s 'What are features?' section).
⁴ Model activations are typically taken from a specific layer and site, e.g. the output of the MLP part of layer 17.
⁵ Note that we cannot directly optimize the L0 norm (i.e. the number of active features) since this is not a differentiable function. We do however use the L0 norm to evaluate SAE sparsity (Section 4).

Figure 2 | The L1 penalty in sparse autoencoders causes shrinkage: reconstructions are biased towards smaller norms, even when perfect reconstruction is possible. E.g. a single-feature SAE (with L1 coefficient $\lambda = 1$) reconstructs 1/2 rather than 1 when minimizing Equation (4).

(2023), we constrain columns to have exactly unit norm. See Appendix D for full details about our (Gated and baseline) SAE training.

2.3. Evaluation

To get a sense of the quality of trained SAEs we use two metrics from Bricken et al. (2023): L0, a measure of SAE sparsity, and loss recovered, a measure of SAE reconstruction fidelity.

- The L0 of a SAE is defined by the average number of active features on a given input, i.e. $\mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \|\mathbf{f}(\mathbf{x})\|_0$.
- The loss recovered of a SAE is calculated from the average cross-entropy loss of the language model on an evaluation dataset, when the SAE's reconstructions are spliced into it. If we denote by $\mathrm{CE}(\phi)$ the average loss of the language model when we splice in a function $\phi : \mathbb{R}^n \to \mathbb{R}^n$ at the SAE's site during the model's forward pass, then loss recovered is

$$1 - \frac{\mathrm{CE}(\hat{\mathbf{x}} \circ \mathbf{f}) - \mathrm{CE}(\mathrm{Id})}{\mathrm{CE}(\zeta) - \mathrm{CE}(\mathrm{Id})}, \qquad (5)$$

where $\hat{\mathbf{x}} \circ \mathbf{f}$ is the autoencoder function, $\zeta : \mathbf{x} \mapsto \mathbf{0}$ the zero-ablation function and $\mathrm{Id} : \mathbf{x} \mapsto \mathbf{x}$ the identity function.
According to this definition, a SAE that always outputs the zero vector as its reconstruction would get a loss recovered of 0%, whereas a SAE that reconstructs its inputs perfectly would get a loss recovered of 100%. Of course, these metrics do not paint the full picture of SAE quality,⁶ hence we perform manual analysis of SAE interpretability in Section 4.3.

⁶ For example, see Templeton et al. (2024, Tanh Penalty in Dictionary Learning).

3. Gated SAEs

3.1. Motivation

The intuition behind how SAEs are trained is to maximise reconstruction fidelity at a given level of sparsity, as measured by L0, although in practice we optimize a mixture of reconstruction fidelity and L1 regularization. This difference is a source of unwanted bias in the training of a sparse autoencoder: for any fixed level of sparsity, a trained SAE can achieve lower loss (as defined in Eq. (4)) by trading off a little reconstruction fidelity to perform better on the L1 sparsity penalty. The clearest consequence of this bias is shrinkage (Wright and Sharkey, 2024), illustrated in Figure 2. Holding the decoder $\hat{\mathbf{x}}(\cdot)$ fixed, the L1 penalty pushes feature activations $\mathbf{f}(\mathbf{x})$ towards zero, while the reconstruction loss pushes $\mathbf{f}(\mathbf{x})$ high enough to produce an accurate reconstruction. Thus, the optimal value is somewhere in between, which means the SAE systematically underestimates the magnitude of feature activations, without necessarily having any compensatory benefit for sparsity.⁷

How can we reduce the bias introduced by the L1 penalty? The output of the encoder $\mathbf{f}(\mathbf{x})$ of a baseline SAE (Section 2.1) has two roles:

1. It detects which features are active (according to whether the outputs are zero or strictly positive). For this role, the L1 penalty is necessary to ensure the decomposition is sparse.
2. It estimates the magnitudes of active features. For this role, the L1 penalty is a source of unwanted bias.
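The single-feature example from Figure 2 can be verified numerically. Assuming a perfectly aligned unit-norm decoder direction and a true activation of 1 (our toy setup), the loss of Eq. (4) reduces to a one-dimensional problem whose minimiser exhibits exactly the shrinkage described above:

```python
import numpy as np

# With one active feature of true magnitude 1 and a perfectly aligned
# unit-norm decoder direction, Eq. (4) reduces to
#   L(f) = (1 - f)^2 + lam * f   for f >= 0.
lam = 1.0
f_grid = np.linspace(0.0, 2.0, 20001)
losses = (1.0 - f_grid) ** 2 + lam * f_grid
f_star = f_grid[np.argmin(losses)]

# Setting dL/df = -2(1 - f) + lam = 0 gives f* = 1 - lam/2,
# i.e. 1/2 rather than 1 when lam = 1: shrinkage.
assert abs(f_star - (1.0 - lam / 2.0)) < 1e-3
assert abs(f_star - 0.5) < 1e-3
```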
If we could separate out these two functions of the SAE encoder, we could design a training loss that narrows the scope of SAE parameters affected (and therefore to some extent biased) by the L1 sparsity penalty to precisely those parameters involved in feature detection, minimising its impact on parameters used in feature magnitude estimation.

3.2. Gated SAEs

3.2.1. Architecture

How should we modify the baseline SAE encoder to achieve this separation of concerns? Our solution is to replace the single-layer ReLU encoder of a baseline SAE with a gated ReLU encoder. Taking inspiration from Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020), we define the gated encoder as follows:

$$\tilde{\mathbf{f}}(\mathbf{x}) := \underbrace{\mathbb{1}\left[\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x}) > 0\right]}_{\mathbf{f}_{\mathrm{gate}}(\mathbf{x})} \odot \underbrace{\mathrm{ReLU}\left(\mathbf{W}_{\mathrm{mag}}(\mathbf{x} - \mathbf{b}_{\mathrm{dec}}) + \mathbf{b}_{\mathrm{mag}}\right)}_{\mathbf{f}_{\mathrm{mag}}(\mathbf{x})}, \qquad (6)$$

where $\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x}) := \mathbf{W}_{\mathrm{gate}}(\mathbf{x} - \mathbf{b}_{\mathrm{dec}}) + \mathbf{b}_{\mathrm{gate}}$, $\mathbb{1}[\cdot > 0]$ is the (pointwise) Heaviside step function and $\odot$ denotes elementwise multiplication. Here, $\mathbf{f}_{\mathrm{gate}}$ determines which features are deemed to be active, while $\mathbf{f}_{\mathrm{mag}}$ estimates feature activation magnitudes (which only matter for features that have been deemed to be active); $\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x})$ are the $\mathbf{f}_{\mathrm{gate}}$ sub-layer's pre-activations, which are used in the gated SAE loss, defined below.

Naively, we appear to have doubled the number of parameters in the encoder, increasing the total number of parameters by 50%. We mitigate this through weight sharing: we parameterise these layers so that the two layers share the same projection directions, but allow the norms of these directions as well as the layer biases to differ. Concretely, we define $\mathbf{W}_{\mathrm{mag}}$ in terms of $\mathbf{W}_{\mathrm{gate}}$ and an additional vector-valued rescaling parameter $\mathbf{r}_{\mathrm{mag}} \in \mathbb{R}^M$ as follows:

$$\left(\mathbf{W}_{\mathrm{mag}}\right)_{ij} := \exp\left(\mathbf{r}_{\mathrm{mag}}\right)_i \cdot \left(\mathbf{W}_{\mathrm{gate}}\right)_{ij}. \qquad (7)$$

See Fig. 3 for an illustration of the tied-weight Gated SAE architecture. With this weight tying scheme, the Gated SAE has only $2 \times M$ more parameters than a baseline SAE.
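A minimal NumPy sketch of the gated encoder (Eq. 6) with the weight tying of Eq. (7); parameter names are ours and values are random, with $\mathbf{b}_{\mathrm{dec}}$ omitted for brevity. It also checks numerically that closed gates force exactly-zero activations, and that the tied encoder matches a single-layer thresholded ("jump") ReLU form with threshold $\boldsymbol{\theta} = \mathbf{b}_{\mathrm{mag}} - e^{\mathbf{r}_{\mathrm{mag}}} \odot \mathbf{b}_{\mathrm{gate}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 8, 32

W_gate = rng.normal(size=(M, n))
b_gate = rng.normal(size=M)
b_mag = rng.normal(size=M)
r_mag = rng.normal(size=M)

# Weight tying (Eq. 7): W_mag shares W_gate's directions, rescaled per feature.
W_mag = np.exp(r_mag)[:, None] * W_gate

def gated_encode(x):
    # Eq. (6), with b_dec set to zero for brevity.
    pi_gate = W_gate @ x + b_gate                 # gating pre-activations
    f_gate = (pi_gate > 0).astype(float)          # which features are active
    f_mag = np.maximum(W_mag @ x + b_mag, 0.0)    # magnitude estimates
    return f_gate * f_mag

def jumprelu_encode(x):
    # Equivalent single-layer form under the tying of Eq. (7):
    # threshold theta = b_mag - exp(r_mag) * b_gate (elementwise).
    theta = b_mag - np.exp(r_mag) * b_gate
    z = W_mag @ x + b_mag
    return np.where(z > theta, np.maximum(z, 0.0), 0.0)

for _ in range(100):
    x = rng.normal(size=n)
    f = gated_encode(x)
    assert np.all(f >= 0)
    assert np.all(f[W_gate @ x + b_gate <= 0] == 0.0)  # closed gates -> zero
    assert np.allclose(f, jumprelu_encode(x))          # tied-weight equivalence
```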
In Section 5.1, we perform an ablation study showing that this weight tying scheme leads to a small increase in performance.

⁷ Conversely, rescaling the shrunk feature activations (Wright and Sharkey, 2024) is not necessarily enough to overcome the bias induced by the L1 penalty: a SAE trained with the L1 penalty could have learnt sub-optimal encoder and decoder directions that are not improved by such a fix. In Section 5.2 and Figure 11 we provide empirical evidence that this is true in practice.

Figure 3 | The Gated SAE architecture with weight sharing between the gating and magnitude paths (magnitude path: scale & shift; gating path: shift, binarize), shown with an example input.

Figure 4 | After applying the weight sharing scheme of Eq. (7), a gated encoder becomes equivalent to a single-layer linear encoder with a Jump ReLU (Erichson et al., 2019) activation function $\sigma_{\boldsymbol{\theta}}$, illustrated above.

With tied weights, the gated encoder can be reinterpreted as a single-layer linear encoder with a non-standard and discontinuous "Jump ReLU" activation function (Erichson et al., 2019), $\sigma_{\boldsymbol{\theta}}(z)$, illustrated in Fig. 4. To be precise, using the weight tying scheme of Eq. (7), $\tilde{\mathbf{f}}(\mathbf{x})$ can be re-expressed as $\tilde{\mathbf{f}}(\mathbf{x}) = \sigma_{\boldsymbol{\theta}}(\mathbf{W}_{\mathrm{mag}} \cdot \mathbf{x} + \mathbf{b}_{\mathrm{mag}})$, with the Jump ReLU gap given by $\boldsymbol{\theta} = \mathbf{b}_{\mathrm{mag}} - e^{\mathbf{r}_{\mathrm{mag}}} \odot \mathbf{b}_{\mathrm{gate}}$; see Appendix E for an explanation. We think this is a useful intuition for reasoning about how Gated SAEs reconstruct activations in practice. See Appendix F for a walkthrough of a toy example where an SAE with Jump ReLUs outperforms one with standard ReLUs.

3.2.2. Training Gated SAEs

A natural idea for training gated SAEs would be to apply Eq.
(4), while restricting the sparsity penalty to just $\mathbf{f}_{\mathrm{gate}}$:

$$\mathcal{L}_{\mathrm{incorrect}}(\mathbf{x}) := \underbrace{\left\|\mathbf{x} - \hat{\mathbf{x}}\left(\tilde{\mathbf{f}}(\mathbf{x})\right)\right\|_2^2}_{\mathcal{L}_{\mathrm{reconstruct}}} + \underbrace{\lambda \left\|\mathbf{f}_{\mathrm{gate}}(\mathbf{x})\right\|_1}_{\mathcal{L}_{\mathrm{sparsity}}}$$

Unfortunately, due to the Heaviside step activation function in $\mathbf{f}_{\mathrm{gate}}$, no gradients would propagate to $\mathbf{W}_{\mathrm{gate}}$ and $\mathbf{b}_{\mathrm{gate}}$. To mitigate this for the sparsity penalty, we instead apply the L1 norm to the positive parts of the preactivations, $\mathrm{ReLU}(\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x}))$. To ensure $\mathbf{f}_{\mathrm{gate}}$ aids reconstruction by detecting active features, we add an auxiliary task requiring that these same rectified preactivations can be used by the decoder to produce a good reconstruction:

$$\mathcal{L}_{\mathrm{gated}}(\mathbf{x}) := \underbrace{\left\|\mathbf{x} - \hat{\mathbf{x}}\left(\tilde{\mathbf{f}}(\mathbf{x})\right)\right\|_2^2}_{\mathcal{L}_{\mathrm{reconstruct}}} + \underbrace{\lambda \left\|\mathrm{ReLU}\left(\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x})\right)\right\|_1}_{\mathcal{L}_{\mathrm{sparsity}}} + \underbrace{\left\|\mathbf{x} - \hat{\mathbf{x}}_{\mathrm{frozen}}\left(\mathrm{ReLU}\left(\boldsymbol{\pi}_{\mathrm{gate}}(\mathbf{x})\right)\right)\right\|_2^2}_{\mathcal{L}_{\mathrm{aux}}} \qquad (8)$$

where $\hat{\mathbf{x}}_{\mathrm{frozen}}$ is a frozen copy of the decoder, $\hat{\mathbf{x}}_{\mathrm{frozen}}(\mathbf{f}) := \mathbf{W}_{\mathrm{dec}}^{\mathrm{copy}}\mathbf{f} + \mathbf{b}_{\mathrm{dec}}^{\mathrm{copy}}$, to ensure that gradients from $\mathcal{L}_{\mathrm{aux}}$ do not propagate back to $\mathbf{W}_{\mathrm{dec}}$ or $\mathbf{b}_{\mathrm{dec}}$. This can typically be implemented by stop-gradient operations rather than creating copies -- see Appendix G for pseudo-code for the forward pass and loss function. To calculate this loss (or its gradient), we have to run the decoder twice: once to perform the main reconstruction for $\mathcal{L}_{\mathrm{reconstruct}}$ and once to perform the auxiliary reconstruction for $\mathcal{L}_{\mathrm{aux}}$. This leads to a 50% increase in the compute required to perform a training update step. However, the increase in overall training time is typically much less, as in our experience much of the training wall clock time goes to generating language model activations (if these are being generated on the fly) or disk I/O (if training on saved activations).
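The three terms of Eq. (8) can be written down directly. The NumPy sketch below uses an explicit frozen copy of the decoder standing in for the stop-gradient an autodiff framework would use; names and the random initialisation are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, lam = 8, 32, 0.01

W_gate = rng.normal(size=(M, n)); b_gate = rng.normal(size=M)
r_mag = rng.normal(size=M);       b_mag = rng.normal(size=M)
W_dec = rng.normal(size=(n, M));  b_dec = np.zeros(n)
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
W_mag = np.exp(r_mag)[:, None] * W_gate  # weight tying, Eq. (7)

def gated_loss(x):
    xc = x - b_dec
    pi_gate = W_gate @ xc + b_gate
    f_tilde = (pi_gate > 0) * np.maximum(W_mag @ xc + b_mag, 0.0)  # Eq. (6)
    x_hat = W_dec @ f_tilde + b_dec
    l_reconstruct = np.sum((x - x_hat) ** 2)
    l_sparsity = lam * np.sum(np.maximum(pi_gate, 0.0))  # L1 on ReLU(pi_gate)
    # L_aux uses a frozen decoder copy; in a framework like JAX or PyTorch this
    # would be a stop-gradient, so L_aux trains only the gating parameters.
    W_dec_frozen, b_dec_frozen = W_dec.copy(), b_dec.copy()
    x_hat_aux = W_dec_frozen @ np.maximum(pi_gate, 0.0) + b_dec_frozen
    l_aux = np.sum((x - x_hat_aux) ** 2)
    return l_reconstruct + l_sparsity + l_aux  # Eq. (8)

x = rng.normal(size=n)
assert np.isfinite(gated_loss(x)) and gated_loss(x) >= 0.0
```

Running the decoder twice (once for the main reconstruction, once for the auxiliary one) is the source of the 50% training-compute overhead mentioned above.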
4. Evaluation

In this section we benchmark Gated SAEs across a large variety of models and at different sites (Section 4.1), show that they resolve the shrinkage problem (Section 4.2), and show that they produce features that are similarly interpretable to baseline SAE features according to expert human raters, although we could not conclusively determine whether one is better than the other (Section 4.3).

4.1. Comprehensive Benchmarking

In this subsection we show that Gated SAEs are a Pareto improvement over baseline SAEs on the loss recovered and L0 metrics (Section 2.3). We show this by evaluating SAEs trained to reconstruct:

1. The MLP neuron activations in GELU-1L, which is the closest direct comparison to Bricken et al. (2023);
2. The MLP outputs, attention layer outputs (taken pre-$W_O$ as in Kissane et al. (2024a)) and residual stream activations in five different layers throughout Pythia-2.8B and four different layers in the Gemma-7B base model.

In both experiments, we vary the L1 coefficient $\lambda$ (Section 2.2) used to train the SAEs, which enables us to compare the Pareto frontiers of L0 and loss recovered between Gated and baseline SAEs. Gated SAEs require at most 1.5× more compute to train than regular SAEs (Section 3.2.2). To ensure a fair comparison in our evaluations, we therefore compare Gated SAEs to baseline SAEs with 50% more learned features. We show the results for GELU-1L in Figure 5 and the results for Pythia-2.8B and Gemma-7B in Appendix B. In Appendix B (Figure 12), at all sites tested, Gated SAEs are a Pareto improvement over regular SAEs. In some cases in Figures 12 and 13 there is a non-monotonic Pareto frontier. We attribute this to difficulties training SAEs (Appendix D.1.3).
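The loss-recovered metric of Eq. (5) used throughout these comparisons is straightforward to compute from three cross-entropy values; the CE numbers below are purely illustrative:

```python
def loss_recovered(ce_sae, ce_id, ce_zero):
    """Loss recovered (Eq. 5): 1 - (CE(x_hat . f) - CE(Id)) / (CE(zeta) - CE(Id)).

    ce_sae:  CE with the SAE reconstruction spliced in
    ce_id:   CE with the identity function spliced in (normal forward pass)
    ce_zero: CE with the zero-ablation function spliced in
    """
    return 1.0 - (ce_sae - ce_id) / (ce_zero - ce_id)

# Sanity checks implied by the definition (illustrative CE values):
ce_id, ce_zero = 2.0, 10.0
assert loss_recovered(ce_id, ce_id, ce_zero) == 1.0    # perfect reconstruction
assert loss_recovered(ce_zero, ce_id, ce_zero) == 0.0  # as bad as zero-ablation
```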
Figure 5 | Gated SAEs offer better reconstruction fidelity (as measured by loss recovered) at any given level of feature sparsity (as measured by L0). This plot compares Gated and baseline (1.5× width) SAEs trained on GELU-1L neuron activations; see Appendix B for comparisons on Pythia-2.8B and Gemma-7B.

4.2. Shrinkage

As described in Section 3.1, the L1 sparsity penalty used to train baseline SAEs causes feature activations to be systematically underestimated, a phenomenon called shrinkage. Since this in turn shrinks the reconstructions produced by the SAE decoder, we can observe the extent to which a trained SAE is affected by shrinkage by measuring the average norm of its reconstructions. Concretely, the metric we use is the relative reconstruction bias,

$$\gamma := \operatorname*{arg\,min}_{\gamma'} \; \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\left\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x})/\gamma' - \mathbf{x}\right\|_2^2\right], \qquad (9)$$

i.e. $\gamma^{-1}$ is the optimum multiplicative factor by which an SAE's reconstructions should be rescaled in order to minimise the L2 reconstruction loss; $\gamma = 1$ for an unbiased SAE and $\gamma < 1$ when there is shrinkage.⁸ Explicitly solving the optimization problem in Eq. (9), the relative reconstruction bias can be expressed analytically in terms of the mean SAE reconstruction loss, the mean squared norm of input activations and the mean squared norm of SAE reconstructions, making $\gamma$ easy to compute and track during training:⁹

$$\gamma = \frac{\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x})\|_2^2\right]}{\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x}) \cdot \mathbf{x}\right]} = \frac{2\,\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x})\|_2^2\right]}{\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x})\|_2^2\right] + \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\mathbf{x}\|_2^2\right] - \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x}) - \mathbf{x}\|_2^2\right]}. \qquad (10)$$

⁸ We have defined $\gamma$ this way round so that $\gamma < 1$ intuitively corresponds to shrinkage.
⁹ The second equality makes use of the identity $2\,\mathbf{a} \cdot \mathbf{b} \equiv \|\mathbf{a}\|_2^2 + \|\mathbf{b}\|_2^2 - \|\mathbf{a} - \mathbf{b}\|_2^2$.
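Both forms of Eq. (10), and the fact that they solve the argmin in Eq. (9), can be checked on synthetic data. The deliberately shrunken "reconstructions" below are our own toy data, not SAE outputs, and give $\gamma < 1$ as expected:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))                      # toy "input activations"
X_hat = 0.8 * X + 0.05 * rng.normal(size=X.shape)   # shrunken "reconstructions"

# First form of Eq. (10): gamma = E||x_hat||^2 / E[x_hat . x]
sq_hat = np.mean(np.sum(X_hat ** 2, axis=1))
gamma = sq_hat / np.mean(np.sum(X_hat * X, axis=1))

# Second form, via the identity 2 a.b = |a|^2 + |b|^2 - |a - b|^2 (footnote 9)
sq_x = np.mean(np.sum(X ** 2, axis=1))
sq_err = np.mean(np.sum((X_hat - X) ** 2, axis=1))
gamma2 = 2 * sq_hat / (sq_hat + sq_x - sq_err)
assert np.isclose(gamma, gamma2)

# And gamma solves the argmin in Eq. (9), checked by grid search:
scales = np.linspace(0.5, 1.5, 2001)
errs = [np.mean(np.sum((X_hat / g - X) ** 2, axis=1)) for g in scales]
assert abs(scales[np.argmin(errs)] - gamma) < 1e-3
assert gamma < 1.0  # shrunken reconstructions give gamma < 1
```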
Note an unbiased reconstruction ($\gamma = 1$) therefore satisfies $\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x})\|_2^2\right] = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\mathbf{x}\|_2^2\right] - \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[\|\hat{\mathbf{x}}_{\mathrm{SAE}}(\mathbf{x}) - \mathbf{x}\|_2^2\right]$; in other words, an unbiased but imperfect SAE (i.e. one that has non-zero reconstruction loss) must have a mean squared reconstruction norm that is strictly less than the mean squared norm of its inputs even without shrinkage. Shrinkage makes the mean squared reconstruction norm even smaller.

Figure 6 | Gated SAEs address shrinkage (GELU-1L neuron activations; relative reconstruction bias $\gamma$ vs. L0, with $\gamma < 1$ indicating shrinkage; Gated vs. baseline (1.5× width) SAEs).

As shown in Figure 6, Gated SAEs' reconstructions are unbiased, with $\gamma \approx 1$, whereas baseline SAEs exhibit shrinkage ($\gamma < 1$), with the impact of shrinkage getting worse as the L1 coefficient $\lambda$ increases (and L0 consequently decreases). In Appendix C we show that this result generalizes to Pythia-2.8B.

4.3. Manual Interpretability Scores

4.3.1. Experimental Methodology

While we believe that the metrics we have investigated above convey meaningful information about an SAE's quality, they are only imperfect proxies. As of now, there is no consensus on how to gauge the degree to which a learned feature is 'interpretable'. To gain a more qualitative understanding of the difference between the learned dictionary features, we conduct a blinded human rater experiment, in which we rated the interpretability of a set of randomly sampled features. We study a variety of SAEs from different layers and sites. For Pythia-2.8B we had 5 raters, who each rated one feature from baseline and Gated SAEs trained on each (Site, Layer) pair from Figure 12, for a total of 150 features.
For Gemma-7B we had 7 raters; one rated 2 features, and the rest 1 feature each, from baseline or Gated SAEs trained on each (Site, Layer) pair from Figure 13, for a total of 192 features. In both cases, the raters were shown the features in random order, without revealing what SAE, site,¹⁰ or layer they came from. To assess a feature, the rater needed to decide whether there is an explanation of the feature's behavior, in particular for its highest-activating examples. The rater then entered that explanation (if applicable) and selected whether the feature is interpretable ('Yes'), uninterpretable ('No') or maybe interpretable ('Maybe'). As an interface we used an open-source SAE visualizer library (McDougall, 2024).

¹⁰ Except due to a debugging issue, Gemma attention SAEs were rated separately, so raters were not blind to that.

Figure 7 | Contingency table showing Gated vs Baseline interpretability labels from our paired study results, for Pythia-2.8B and Gemma-7B.

4.3.2. Statistical Analysis

To test whether Gated SAEs may be more interpretable and estimate the difference, we pair our datapoints according to all covariates (model, layer, site, rater); this lets us control for all of them without making any parametric assumptions, and thus reduces variance in the comparison. We use a one-sided paired Wilcoxon-Pratt signed-rank test, and provide a 90% BCa bootstrap confidence interval for the mean difference between Baseline and Gated labels, where we count 'No' as 0, 'Maybe' as 1, and 'Yes' as 2. Overall the test of the null hypothesis that Gated SAEs are at most as interpretable as Baseline SAEs gets $p = .060$ (estimate .13, mean difference CI [0, .26]). This breaks down into $p = .15$ on just the Pythia-2.8B data (mean difference CI [−.07, .33]), and $p = .13$ on just the Gemma-7B data (mean difference CI [−.04, .29]).
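The label coding and interval estimation can be sketched on hypothetical ratings. The data below is synthetic (the paper's actual ratings are not reproduced), and a plain percentile bootstrap stands in for the paper's BCa interval to show the basic idea:

```python
import numpy as np

rng = np.random.default_rng(3)
code = {"No": 0, "Maybe": 1, "Yes": 2}  # label coding used in the analysis

# Hypothetical paired ratings for 150 features (probabilities are made up).
baseline = rng.choice(["No", "Maybe", "Yes"], size=150, p=[0.30, 0.20, 0.50])
gated = rng.choice(["No", "Maybe", "Yes"], size=150, p=[0.25, 0.20, 0.55])
diff = np.array([code[g] - code[b] for g, b in zip(gated, baseline)])

# Plain percentile bootstrap for the mean paired (Gated - Baseline) difference;
# the paper uses the fancier bias-corrected (BCa) variant.
boots = [np.mean(rng.choice(diff, size=diff.size)) for _ in range(2000)]
lo, hi = np.percentile(boots, [5, 95])  # 90% interval
assert lo <= np.mean(diff) <= hi
```

Pairing by covariates before differencing, as the paper does, is what lets a nonparametric test control for model, layer, site and rater simultaneously.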
A Mann-Whitney U rank test on the label differences, comparing results on the two models, fails to reject ($p = .95$) the null hypothesis that they are from the same distribution; the same test directly on the labels similarly fails to reject ($p = .84$) the null hypothesis that they are similarly interpretable overall. The contingency tables used for these results are shown in Figure 7. The overall conclusion is that, while we cannot definitively say the Gated SAE features are more interpretable than those from the Baseline SAEs, they are at least comparable. We provide more analysis of how these break down by site and layer in Appendix H.

5. Why do Gated SAEs improve SAE training?

In this section we describe an ablation study that reveals the important parts of Gated SAE training (Section 5.1) and benchmark Gated SAEs against a closely related approach to resolving shrinkage (Section 5.2).

5.1. Ablation Study

In this section, we vary several parts of the Gated SAE training methodology (Section 3.2) to gain insight into which aspects of the training are required for the observed improvement in performance. Gated SAEs differ from baseline SAEs in many respects, making it easy to incorrectly attribute the performance gains to spurious details without a careful ablation study. Figure 8 shows Pareto frontiers for these variations; below, we describe each variation in turn and discuss our interpretation of the results.

1. Unfreeze decoder: Here we unfreeze the decoder weights in $\mathcal{L}_{\mathrm{aux}}$ -- i.e.
allow this auxiliary task to update the decoder weights in addition to training $\mathbf{f}_{\mathrm{gate}}$'s parameters. Although this (slightly) simplifies the loss, there is a reduction in performance, providing evidence in support of the hypothesis that it is beneficial to limit the impact of the L1 sparsity penalty to just those parameters in the SAE that need it -- i.e. those used to detect which features are active.

Figure 8 | Our ablation study on GELU-1L MLP neuron activations indicates: (a) the importance of freezing the decoder in the auxiliary task $\mathcal{L}_{\mathrm{aux}}$ used to train $\mathbf{f}_{\mathrm{gate}}$'s parameters; (b) tying encoder weights according to Eq. (7) is slightly beneficial for performance (in addition to yielding a significant reduction in parameter count and inference compute); (c) further simplifying the encoder weight tying scheme in Eq. (7) by removing $\mathbf{r}_{\mathrm{mag}}$ is mildly harmful to performance.

2. No $\mathbf{r}_{\mathrm{mag}}$: Here we remove the $\mathbf{r}_{\mathrm{mag}}$ scaling parameter in Eq. (7), effectively setting it to zero (so that we multiply by $e^0 = 1$); this further ties $\mathbf{f}_{\mathrm{gate}}$'s and $\mathbf{f}_{\mathrm{mag}}$'s parameters together. With this change, the two encoder sublayers' preactivations can at most differ by an elementwise shift.¹¹ There is a slight drop in performance, suggesting $\mathbf{r}_{\mathrm{mag}}$ contributes somewhat (but not critically) to the improved performance of the Gated SAE.

3. Untied encoders: Here we check whether our choice to share the majority of parameters between the two encoders has meaningfully hurt performance, by training Gated SAEs with gating and ReLU encoder parameters completely untied. Despite the greater expressive power of an untied encoder, we see no improvement in performance -- in fact a slight deterioration. This suggests our tying scheme (Eq.
(7)) – where encoder directions are shared, but magnitudes and biases aren’t – is effective at capturing the advantages of using a gated SAE while avoiding the 50% increase in parameter count and inference-time compute of using an untied SAE. 5.2. Is it sufficient to just address shrinkage? As explained in Section 3.1, SAEs trained with the baseline architecture and L1 loss systematically underestimate the magnitudes of latent features’ activations (i.e. shrinkage). Gated SAEs, through modifications to their architecture and loss function, overcome these limitations, thereby addressing shrinkage. It is natural to ask to what extent the performance improvement of Gated SAEs is solely attributable 11 Because the two biasesb gate andb mag can still differ. 11 Improving Dictionary Learning with Gated Sparse Autoencoders 020406080100120140 0.88 0.9 0.92 0.94 0.96 0.98 1 SAE Type Baseline (equal width) Gated Baseline + rescale & shift L0 (Lower is sparser) Loss Recovered (Fidelit y) Figure 9|Evidence from GELU-1L that the performance improvement of gated SAEs does not solely arise from addressing shrinkage (systematic underestimation of latent feature activations). Taking a frozen baseline SAE’s parameters and learningr mag andb mag parameters on top of them (green line) does successfully resolve shrinkage, by decoupling feature magnitude estimation from active feature detection. However, it explains only a small part of the performance increase of gated SAEs (red line) over baseline SAEs (blue line). to addressing shrinkage. Although addressing shrinkage would – all else staying equal – improve reconstruction fidelity, it is not the only way to improve SAEs’ performance: for example, gated SAEs could also improve upon baseline SAEs by learning better encoder directions (for estimating when features are active and their magnitudes) or by learning better decoder directions (i.e. better dictionaries for reconstructing activations). 
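For concreteness, the gated encoder's weight-tying scheme discussed in the ablations above can be sketched as follows. This is an illustrative NumPy re-statement of the parameter sharing in Eq. (7), with our own variable names; it is a sketch of the computation, not the authors' implementation:

```python
import numpy as np

def gated_sae_forward(x, W_enc, b_gate, r_mag, b_mag, W_dec, b_dec):
    """Sketch of a gated SAE forward pass with tied encoder weights.

    Both encoder sub-layers share the directions W_enc; the magnitude
    path rescales them elementwise by exp(r_mag) and uses its own bias.
    """
    pre_gate = x @ W_enc + b_gate                  # which features are active?
    pre_mag = (x @ W_enc) * np.exp(r_mag) + b_mag  # how strongly do they fire?
    f = (pre_gate > 0) * np.maximum(pre_mag, 0.0)  # gate * ReLU(magnitude)
    x_hat = f @ W_dec + b_dec                      # reconstruction
    return f, x_hat
```

Setting r_mag to zero here recovers the "no r_mag" ablation, where the two sub-layers' preactivations differ only by their biases; untying W_enc into two separate matrices recovers the "untied encoders" ablation.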
In this section, we try to answer this question by comparing Gated SAEs trained as described in Section 3.2.2 with an alternative (architecturally equivalent) approach that also addresses shrinkage, but in a way that uses frozen encoder and decoder directions from a baseline SAE of equal dictionary size.^12 Any performance improvement over baseline SAEs obtained by this alternative approach (which we dub "baseline + rescale & shift") can only be due to better estimations of active feature magnitudes, since by construction an SAE parameterised by "baseline + rescale & shift" shares the same encoder and decoder directions as a baseline SAE.

As shown in Fig. 9, although resolving shrinkage alone ("baseline + rescale & shift") does improve baseline SAEs' performance a little, a significant gap remains with respect to the performance of gated SAEs. This suggests that the benefit of the gated architecture and loss comes from learning better encoder and decoder directions, not just from overcoming shrinkage. In Appendix A we explore further how Gated and baseline SAEs' decoders differ, by replacing their respective encoders with an optimization algorithm at inference time.

^12 Concretely, we do this by training baseline SAEs, freezing their weights, and then learning additional rescale and shift parameters (similar to Wright and Sharkey (2024)) to be applied to the (frozen) encoder pre-activations before estimating feature magnitudes.

6. Related Work

Mechanistic Interpretability. We hope that our improvements to Sparse Autoencoders are helpful for mechanistic interpretability research. Recent mechanistic interpretability work has found recurring components in small and large LMs (Olsson et al., 2022), identified computational subgraphs that carry out specific tasks in small LMs (circuits; Wang et al. (2023)) and reverse-engineered how toy tasks are carried out in small transformers (Nanda et al., 2023).
Limitations of existing work include (i) only narrow subsets of the natural language training distribution are studied (though see McDougall et al. (2023)) and (ii) current work has not explained how frontier language models function mechanistically (Anthropic AI, 2024; Gemini Team, 2024; OpenAI, 2023). SAEs may be key to explaining model behaviour across the whole training distribution (Bricken et al., 2023) and are trained without supervision, which may enable future work to explain how larger models function on broader tasks.

Classical Dictionary Learning. Our work builds on a large body of research that precedes transformers, and even deep learning. For example, sparse coding (Elad, 2010) studies how discrete and continuous representations can involve more representations than basis vectors, like our setup in Section 1, and sparse representations are also studied in neuroscience (Olshausen and Field, 1997; Thorpe, 1989). Further, shrinkage (Section 4.2) is built into the Lasso (Tibshirani, 1996) and well-studied in statistical learning (Hastie et al., 2015). One dictionary learning algorithm, k-SVD (Aharon et al., 2006), also uses two stages to learn a dictionary, like Gated SAEs.

Dictionary Learning in Language Models. Early work applying Dictionary Learning to LMs includes Sharkey et al. (2022) (on a GPT-2-like model), Yun et al. (2023) (on a BERT model), Tamkin et al. (2023) (with discrete features, and during LM pretraining) and Cunningham et al. (2023) (on a small Pythia model). Bricken et al. (2023)'s work later provided a widely-scoped analysis of SAEs trained on a 1L model, evaluating the loss when splicing the SAE into the forward pass (Section 4.1), evaluating the impact of learned features on LM rollouts, and visualizing and interpreting all learned features with autointerpretability (Bills et al., 2023).
Following this work, other researchers have extended SAE training to attention layer outputs (Kissane et al., 2024a,b) and residual stream states (Bloom, 2024).

Dictionary Learning's Limitations and Improvements. Wright and Sharkey (2024) raised awareness of shrinkage (Section 4.2) and proposed addressing it via decoder fine-tuning. A difficulty with this approach is that it is not possible to fine-tune all the SAE's parameters in this way without losing sparsity and/or interpretability of feature directions. This limits the extent to which fine-tuning can remove the biases baked into the SAE's parameters during L1-based pre-training. Gated SAEs address this issue (Section 3.2). Marks et al. (2024) stress-test how useful SAEs are, and find success but also rely on methods that leave many error nodes in their computational subgraphs, which represent the difference between SAE reconstructions and the ground truth. A series of updates to the work in Bricken et al. (2023) have also proposed SAE training methodology improvements (Batson et al., 2024; Olah et al., 2024; Templeton et al., 2024). In parallel to our work, Taggart (2024) finds early improvements with a similar Jump ReLU (Erichson et al., 2019) architecture change to SAEs, but with a different loss function, and without addressing the problems of L1.

Disentanglement (Bengio, 2013) aims to learn representations that separate out distinct, independent 'factors of variation' of the underlying data generating process. This is somewhat similar to our aims with dictionary learning, as we want to separate an activation vector into distinct, sparse factors of variation (weights on feature directions), although the dictionary elements are not completely independent, as it may not be possible to accurately represent two features simultaneously due to interference between non-orthogonal dictionary features.
Methods explicitly motivated by learning a disentangled representation typically enforce a prior structure on the learned representation, typically that features are aligned with the basis of a latent space (Chen et al., 2018, 2016; Kim and Mnih, 2018; Mathieu et al., 2019). In contrast, in our work we focus on the representation space of a pre-trained language model, rather than trying to learn a representation directly from data, and enforce a different prior structure: decomposition into a sparse linear combination of an overcomplete basis. In a sense, our work proceeds from the theory that language models have succeeded in learning a disentangled representation of the data with a particular structure, which we are trying to recover.

7. Conclusion

In this work we introduced Gated SAEs (Section 3.2), which are a Pareto improvement in terms of reconstruction quality and sparsity compared to baseline SAEs (Section 4.1), and are comparably interpretable (Section 4.3). We showed via an ablation study that every key part of the Gated SAE methodology is necessary for strong performance (Section 5.1). This represents significant progress on improving Dictionary Learning in LLMs – at many sites, Gated SAEs require half the L0 to achieve the same loss recovered (Figure 12). This is likely to improve work that uses SAEs to steer language models (Nanda et al., 2024), interpret circuits (Marks et al., 2024), or understand language model components across the full distribution (Bricken et al., 2023).

Limitations. Our work, like all sparse autoencoder research, is motivated by several assumptions about the sparsity and linearity of computation in Large Language Models (Section 1). If these assumptions are false, our work may still be useful (see footnote 1), but we may be drawing incorrect conclusions from work using SAEs, since they bake in the sparsity and linearity assumptions.
Separately, our work complicates SAE training with a more complex encoder. One worry about increasing the expressivity of sparse autoencoders is that they will overfit when reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step functions. Overall we do not see evidence for this. Our evaluations use held-out test data and we check for interpretability manually. But these evaluations are not totally comprehensive: for example, they do not test that the dictionaries learned contain causally meaningful intermediate variables in the model's computation. The discontinuity in particular introduces issues for methods like integrated gradients (Sundararajan et al., 2017) that discretely approximate a path integral, as applied to SAEs by Marks et al. (2024). Finally, it could be argued that some of the performance gap between Gated and baseline SAEs could be closed by inexpensive inference-time interventions that prune the many low-activating features that tend to appear in baseline SAEs – because baseline SAEs don't have a thresholding mechanism like Gated SAEs do (Appendix E). Without such interventions, these low-activating features increase baseline SAEs' L0 at a given loss recovered without contributing much to reconstruction (due to their low magnitude), and with unclear impact on interpretability.

Future work. Future work could verify that Gated SAEs continue to improve dictionary learning beyond 7B base LLMs, such as by extending to larger chat models, or even to multimodal or Mixture-of-Experts models. Alternatively, work could look into the features learned by Gated and baseline SAEs and determine whether the architectures have differences in inductive biases beyond those we noted in this work.
We expect it may be possible to further improve Gated SAEs' performance through additional tweaks to the architecture and training procedure. Finally, we would be most excited to work on using dictionary learning techniques to further interpretability in general, such as to improve circuit finding (Conmy et al., 2023; Marks et al., 2024) or steering (Turner et al., 2023) in language models, and hope that Gated SAEs can serve to accelerate such work.

8. Acknowledgements

We would like to thank Romeo Valentin for conversations that got us thinking about k-SVD in the context of SAEs, which inspired part of our work. Additionally, we are grateful for Vladimir Mikulik's detailed feedback on a draft of this work, which greatly improved our presentation, and for Nicholas Sonnerat's work on our codebase and help with feature labelling. We would also like to thank Glen Taggart, who found in parallel work (Taggart, 2024) that a similar method gave improvements to SAE training, helping give us more confidence in our results. Finally, we are grateful to Sam Marks for pointing out an error in the derivation of relative reconstruction bias in an earlier version of this paper.

9. Author contributions

Senthooran Rajamanoharan developed the Gated SAE architecture and training methodology, inspired by discussions with Lewis Smith on the topic of shrinkage. Arthur Conmy and Senthooran Rajamanoharan performed the mainline experiments in Section 4 and Section 5 and led the writing of all sections of the paper. Tom Lieberum implemented the manual interpretability study of Section 4.3, which was designed and analysed by János Kramár. Tom Lieberum also created Fig. 3. Lewis Smith contributed Appendix A and Neel Nanda contributed Appendix F. Our SAE codebase was designed by Vikrant Varma, who implemented it with Tom Lieberum; it was scaled to Gemma by Arthur Conmy, with contributions from Senthooran Rajamanoharan and Lewis Smith.
János Kramár built most of our underlying interpretability infrastructure. Rohin Shah and Neel Nanda edited the manuscript and provided leadership and advice throughout the project.

References

M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006. doi: 10.1109/TSP.2006.881199.

Anthropic AI. Introducing the next generation of Claude. https://www.anthropic.com/index/introducing-the-next-generation-of-claude, 2024. Accessed: 2024-04-14.

J. Batson, B. Chen, A. Jones, A. Templeton, T. Conerly, J. Marcus, T. Henighan, N. L. Turner, and A. Pearce. Circuits Updates - March 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/mar-update/index.html.

Y. Bengio. Deep learning of representations: Looking forward, 2013.

S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.

S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

J. Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT-2 Small. https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream, 2024.

T. Blumensath and M. E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370–2382, 2008.

T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. Viégas, and M. Wattenberg. An interpretability illusion for bert. arXiv preprint arXiv:2104.07143, 2021.

T. Bricken, A. Templeton, J.
Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems, 29, 2016.

A. Conmy. My best guess at the important tricks for training 1L SAEs. https://www.lesswrong.com/posts/yJsLNWtmzcgPJgvro/my-best-guess-at-the-important-tricks-for-training-1l-saes, Dec 2023.

A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023.

H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023.

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 933–941. JMLR.org, 2017.

M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, New York, 2010. ISBN 978-1-4419-7010-7. doi: 10.1007/978-1-4419-7011-4.

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T.
Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html.

N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandlish, D. Amodei, and C. Olah. Softmax linear units. Transformer Circuits Thread, 2022a. https://transformer-circuits.pub/2022/solu/index.html.

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. Toy Models of Superposition. arXiv preprint arXiv:2209.10652, 2022b.

N. B. Erichson, Z. Yao, and M. W. Mahoney. JumpReLU: A retrofit defense strategy for adversarial attacks, 2019.

Gemini Team. Gemini: A Family of Highly Capable Multimodal Models. Rohan Anil and Sebastian Borgeaud and Yonghui Wu and Jean-Baptiste Alayrac and Jiahui Yu and Radu Soricut and Johan Schalkwyk and Andrew M Dai and Anja Hauth et al., 2024.

Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, and et al. Gemma, 2024. URL https://www.kaggle.com/m/3301.

W. Gurnee and M. Tegmark. Language models represent space and time, 2024.

W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023.

T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, FL, 2015. ISBN 978-1-4987-1216-3. doi: 10.1201/b18401.

H. Kim and A. Mnih. Disentangling by factorising.
In International Conference on Machine Learning, pages 2649–2658. PMLR, 2018.

C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Sparse autoencoders work on attention layer outputs. Alignment Forum, 2024a. URL https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ.

C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention SAEs scale to GPT-2 small. Alignment Forum, 2024b. URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr.

S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. doi: 10.1109/78.258082.

S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024.

E. Mathieu, T. Rainforth, N. Siddharth, and Y. W. Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, pages 4402–4412. PMLR, 2019.

C. McDougall. SAE Visualizer. https://github.com/callummcdougall/sae_vis, 2024.

C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head, 2023.

N. Nanda. My Interpretability-Friendly Models (in TransformerLens). https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=NCJ6zH_Okw_mUYAwGnMKsj2m, 2022.

N. Nanda. Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper, Oct 2023. URL https://www.alignmentforum.org/posts/aPTgTKC45dWvL9XBF/open-source-replication-and-commentary-on-anthropic-s.

N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

N. Nanda, A. Conmy, L. Smith, S. Rajamanoharan, T. Lieberum, J. Kramár, and V. Varma.
[Summary] Progress Update #1 from the GDM Mech Interp Team. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HpAr8k74mW4ivCvCu/summary-progress-update-1-from-the-gdm-mech-interp-team.

A. Ng. Sparse autoencoder. http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf, 2011. CS294A Lecture notes.

C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://www.transformer-circuits.pub/2022/mech-interp-essay, 2022.

C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001.

C. Olah, T. Bricken, J. Batson, A. Templeton, A. Jermyn, T. Hume, and T. Henighan. Circuits Updates - May 2023. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/may-update/index.html.

C. Olah, S. Carter, A. Jermyn, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Templeton, B. Chen, and N. L. Turner. Circuits Updates - January 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997. doi: 10.1016/S0042-6989(97)00169-7.

C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. In-context learning and induction heads, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

OpenAI. GPT-4 Technical Report, 2023.

K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models, 2023.

Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition.
In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40–44 vol. 1, 1993. doi: 10.1109/ACSSC.1993.342465.

L. Sharkey, D. Braun, and B. Millidge. [Interim research report] Taking features out of superposition with sparse autoencoders. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition, 2022.

N. Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.

M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017. URL http://proceedings.mlr.press/v70/sundararajan17a.html.

G. M. Taggart. ProLU: A nonlinearity for sparse autoencoders. https://www.lesswrong.com/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders, 2024.

A. Tamkin, M. Taufeeque, and N. D. Goodman. Codebook features: Sparse and discrete interpretability for neural networks, 2023.

A. Templeton, J. Batson, T. Henighan, T. Conerly, J. Marcus, A. Golubeva, T. Bricken, and A. Jermyn. Circuits Updates - February 2024. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html.

S. J. Thorpe. Local vs. distributed coding. Intellectica, 8:3–40, 1989.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x.

C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language models, 2023.

A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023.

K. R.
Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.

B. Wright and L. Sharkey. Addressing feature suppression in SAEs. https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes, Feb 2024.

Z. Yun, Y. Chen, B. A. Olshausen, and Y. LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2023.

Appendix

A. Inference-time optimization

The task SAEs perform can be split into two sub-tasks: sparse coding, or learning a set of features from a dataset, and sparse approximation, where a given datapoint is approximated as a sparse linear combination of these features. The decoder weights are the set of learned features, and the mapping represented by the encoder is a sparse approximation algorithm. Formally, sparse approximation is the problem of finding a vector α that minimises:

α = arg min_α ‖x − Dα‖₂²  s.t.  ‖α‖₀ < γ    (11)

i.e. that best reconstructs the signal x as a linear combination of vectors in a dictionary D, subject to a constraint on the L0 pseudo-norm of α. Sparse approximation is a well studied problem, and SAEs are a weak sparse approximation algorithm. SAEs, at least in the formulation conventional in dictionary learning for language models, in fact solve a slightly more restricted version of this problem, where the weights α on each feature are constrained to be non-negative, leading to the related problem

α = arg min_α ‖x − Dα‖₂²  s.t.  ‖α‖₀ < γ,  α ≥ 0    (12)

In this paper, we do not explore using more powerful algorithms for sparse coding.
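For concreteness, the quantities appearing in Eq. (12) can be evaluated for any candidate solution as follows. This is a NumPy sketch with illustrative names (it assumes the columns of D are the dictionary elements), not part of the paper's codebase:

```python
import numpy as np

def sparse_approx_objective(x, D, alpha, gamma):
    """Evaluate a candidate alpha against Eq. (12).

    Returns the squared reconstruction error, the L0 pseudo-norm of
    alpha, and whether alpha satisfies both the sparsity constraint
    (fewer than gamma nonzeros) and elementwise non-negativity.
    """
    err = float(np.sum((x - D @ alpha) ** 2))  # ||x - D alpha||_2^2
    l0 = int(np.count_nonzero(alpha))          # ||alpha||_0
    feasible = (l0 < gamma) and bool(np.all(alpha >= 0))
    return err, l0, feasible
```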
This is partly because we are using SAEs not just to recover a sparse reconstruction of activations of a LM; ideally we hope that the learned features will coincide with the linear representations actually used by the LM, under the superposition hypothesis. Prior work (Bricken et al., 2023) has argued that SAEs are more likely to recover these due to the correspondence between the SAE encoder and the structure of the network itself; the argument is that it is implausible that the network can make use of features which can only be recovered from the vector via an iterative optimisation algorithm, whereas the structure of the SAE means that it can only find features whose presence can be predicted well by a simple linear mapping. Whether this is true remains, in our view, an important question for future work, but we do not address it in this paper.

In this section we discuss some results obtained by using the dictionaries learned via SAE training, but replacing the encoder with a different sparse approximation algorithm at inference time. This allows us to compare the dictionaries learned by different SAE training regimes independently of the quality of the encoder. It also allows us to examine the gap between the sparse reconstruction performed by the encoder and the baseline of a more powerful sparse approximation algorithm. As mentioned, for a fair comparison to the task the encoder is trained for, it is important to solve the sparse approximation problem of Eq. (12), rather than the more conventional formulation of Eq. (11), but most sparse approximation algorithms can be modified to solve this with relatively minor changes. Solving Eq. (12) exactly is equivalent to integer linear programming, and is NP-hard.
The integer linear programs in question would be large, as our SAE decoders routinely have hundreds of thousands of features, and solving them to guaranteed optimality would likely be intractable. Instead, as is commonly done, we use iterative greedy algorithms to find an approximate solution. While the solution found by these sparse approximation algorithms is not guaranteed to be the global optimum, they are significantly more powerful than the SAE encoder, and we feel it is acceptable in practice to treat them as an upper bound on possible encoder performance.

For all results in this section, we use gradient pursuit, as described in Blumensath and Davies (2008), as our inference-time optimisation (ITO) algorithm. This algorithm is a variant of orthogonal matching pursuit (Pati et al., 1993) which solves the orthogonalisation of the residual against the span of chosen dictionary elements approximately at every step rather than exactly, but which only requires matrix multiplies rather than matrix solves, and is easier to implement on accelerators as a result. It is possibly not crucial for performance that our optimisation algorithm be implementable on TPUs, but being able to avoid a host-device transfer when splicing it into the forward pass allowed us to re-use our existing evaluation pipeline with minimal changes.

When we use a sparse approximation algorithm at test time, we simply use the decoder of a trained SAE as a dictionary, ignoring the encoder. This allows us to sweep the target sparsity at test time without retraining the model, meaning that we can plot an entire Pareto frontier of loss recovered against sparsity for a single decoder, as is done in Figure 11. Figure 10 compares the loss recovered when using ITO for a suite of SAE decoders trained with both methods at three different test-time L0 thresholds.
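To convey the flavour of these greedy methods, here is a minimal non-negative matching pursuit in NumPy. It is a simpler relative of the gradient pursuit algorithm described above, shown for illustration only; it assumes unit-norm dictionary columns and is not the implementation used for the experiments:

```python
import numpy as np

def nonneg_matching_pursuit(x, D, target_l0):
    """Greedily approximate x as a non-negative combination of columns
    of D (assumed unit-norm), adding at most one atom per iteration.

    A simple stand-in for gradient pursuit (Blumensath and Davies,
    2008); illustrative only.
    """
    alpha = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    for _ in range(target_l0):
        scores = D.T @ residual      # correlation with each atom
        i = int(np.argmax(scores))   # most positively correlated atom
        if scores[i] <= 0:           # no atom can improve the fit
            break
        alpha[i] += scores[i]        # greedy non-negative update
        residual = x - D @ alpha
    return alpha
```

Sweeping `target_l0` with a fixed dictionary traces out a fidelity/sparsity curve for that dictionary, which is how the ITO Pareto frontiers discussed here are produced (with gradient pursuit in place of this simpler routine).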
This graph shows a somewhat surprising result: while Gated SAEs learn better decoders generally, and often achieve the best loss recovered using ITO close to their training sparsity, SAE decoders are often outperformed by decoders which achieved a higher test-time L0; it is better to do ITO with a target L0 of 10 using a decoder with an achieved L0 of around 100 during training than one which was actually trained with this level of sparsity. For instance, the left-hand panel in Figure 10 shows that SAEs with a training L0 of 100 are better than those with an L0 of around 10 at almost every sparsity level in terms of ITO reconstruction. However, gated SAE dictionaries have a small but real advantage over standard SAEs in terms of loss recovered at most target sparsity levels, suggesting that part of the advantage of gated SAEs is that they learn better dictionaries as well as addressing issues with shrinkage. However, there are some subtleties here; for example, we find that baseline SAEs trained with a lower sparsity penalty (higher training L0) often outperform more sparse baseline SAEs according to this measure, and the best performing baseline SAE (L0 ≈ 99) is comparable to the best performing Gated SAE (L0 ≈ 20).

Figure 10 | This figure compares the ITO performance of different decoders across a sweep for decoders trained using a baseline SAE and the gated method, at three different test-time target sparsities. Gated SAEs trained at lower target sparsities consistently achieve better dictionaries by this measure. Interestingly, the best performing baseline dictionary by this measure often has a much higher training sparsity than the test-time target; for instance, at a test-time sparsity of 30, the best baseline SAE was one with a training sparsity of more like 100.
This could be an artifact of the fact that the L0 measure is quite sensitive to noise, and standard SAE architectures tend to have a reasonable number of features with very low activation. Figure 11 compares the Pareto frontiers of a baseline model and a gated model to the Pareto frontier of an ITO sweep of the best performing dictionary of each. Note that, while the Pareto curve of the baseline dictionary is formed by several models, as each encoder is specialised to a given sparsity level, ITO lets us plot a Pareto frontier by sweeping the target sparsity with a single dictionary; here we plot only the best performing dictionary from each model type to avoid cluttering the figure. This figure suggests that the performance gap between the encoder and ITO is smaller for the gated model. Interestingly, this cannot solely be explained by addressing shrinkage, as we demonstrate by experimenting with a baseline model which learns a rescale and shift with frozen encoder and decoder directions.

B. More Loss Recovered / L0 Pareto frontiers

In Figure 12 we show that Gated SAEs outperform baseline SAEs throughout Pythia-2.8B. In Figure 13 we show that Gated SAEs outperform baseline SAEs at all but one MLP output or residual stream site that we tested on Gemma-7B. In Figure 13, at the attention output pre-linear site at layer 27, loss recovered is greater than 1.0. On investigation, we found that the dataset used to train the SAE was not identical to Gemma's pretraining dataset, and at this site it was possible to mean-ablate this quantity and decrease loss, explaining why SAE reconstructions had lower loss than the original model.

C. Further Shrinkage Plots

In Figure 14, we show that Gated SAEs resolve shrinkage (as measured by relative reconstruction bias, Section 4.2) in Pythia-2.8B.
Figure 11|Pareto frontiers of a baseline SAE, a baseline SAE with learned rescale and shift (to account for shrinkage) and a Gated SAE across different sparsity lambdas, compared to the ITO Pareto frontier of the best decoder of each type, varying the target sparsity. The best gated encoder is better than the best standard encoder by this measure, but the difference is marginal. As shown in the plot above, the best baseline encoder by the ITO measure had a much larger test-time sparsity (around 100) than the best gated model (around 30). This figure suggests that the gap between SAE performance and 'optimal' performance, if we assume that ITO is close to the maximum possible reconstruction using the given decoder, is much smaller for the gated model.

D. Training and evaluation: hyperparameters and other details

D.1. Training

D.1.1. General training details

Other details of SAE training are:

• SAE widths. Our SAEs have width 2^17 for most baseline SAEs and 3×2^16 for Gated SAEs, except for the (Pythia-2.8B, residual stream) sites, where we used 2^15 for baseline and 3×2^14 for Gated SAEs, since early runs at these sites had lots of learned feature death.

• Training data. We use hundreds of millions to billions of activations from LM forward passes as input data to the SAE. Following Nanda (2023), we use a shuffled buffer of these activations, so that optimization steps don't use data from highly correlated activations (footnote 13).

• Resampling. We used resampling, a technique which, at a high level, periodically reinitializes features that activate extremely rarely on SAE inputs throughout training. We mostly follow the approach described in the 'Neuron Resampling' appendix of Bricken et al.
(2023), except we reapply learning rate warm-up after each resampling event, reducing the learning rate to 0.1× its ordinary value and increasing it back to the ordinary value with a cosine schedule over the next 1000 training steps.

• Optimizer hyperparameters. We use the Adam optimizer with β₂ = 0.999 and β₁ = 0.0, following Templeton et al. (2024), as we also find this to be a slight improvement to training.

Footnote 13: In contrast to earlier findings (Conmy, 2023), we found that when using Pythia-2.8B's activations from sequences of length 2048, rather than GELU-1L's activations from sequences of length 128, it was important to shuffle the 10^6-length activation buffer used to train our SAEs.

Figure 12|Gated SAEs throughout Pythia-2.8B, plotting loss recovered (fidelity) against L0 (lower is sparser) at the residual stream post-MLP, MLP output and attention output pre-linear sites of layers 4, 12, 16, 20 and 28. At all sites we tested, Gated SAEs are a Pareto improvement. In every plot, the SAE with maximal loss recovered was a Gated SAE.

Figure 13|Gated and baseline Pareto-optimal SAEs for Gemma-7B, plotting loss recovered against L0 at the residual stream post-MLP, MLP output and attention output pre-linear sites of layers 6, 13, 20 and 27 – see Appendix B for a discussion of the anomalies (such as the layer 27 attention output SAEs), and Tables 1-4 for full stats (including points not on the Pareto frontier).
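The shuffled activation buffer described in Appendix D.1.1 can be sketched as follows. This is our minimal illustration of the idea, not the paper's implementation; the buffer-drain policy, names and sizes are assumptions.

```python
import random
import numpy as np

def shuffled_activation_batches(activation_stream, buffer_size, batch_size, rng):
    """Yield SAE training batches drawn from a shuffled buffer of LM activations.

    `activation_stream` yields activation vectors in token order; holding
    `buffer_size` of them and emitting randomly ordered batches breaks up the
    strong correlations between activations of nearby tokens in a sequence.
    """
    buffer = []
    for act in activation_stream:
        buffer.append(act)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)
            # Drain half the buffer, then go back to refilling it with
            # fresh activations before shuffling again.
            while len(buffer) > buffer_size // 2:
                yield np.stack([buffer.pop() for _ in range(batch_size)])
    while len(buffer) >= batch_size:  # flush the tail at end of stream
        yield np.stack([buffer.pop() for _ in range(batch_size)])
```

Draining only half the buffer before refilling keeps each emitted batch a mixture of activations from many different source sequences.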
Figure 14|Gated SAEs address the shrinkage problem (Section 4.2) in Pythia-2.8B, plotting relative reconstruction bias γ against L0 (lower is sparser) at residual stream post-MLP, MLP output and attention output pre-linear sites.

• Learning rate warm-up. We use a learning rate warm-up. See Appendix D.1.2 for the learning rates of the different experiments.

• Decoder weight norm constraints. Templeton et al. (2024) suggest constraining decoder columns to have at most unit norm (instead of exactly unit norm), which can help distinguish between productive and unproductive feature directions (although it should have no systematic impact on performance). However, we follow the original approach of constraining columns to have exactly unit norm in this work for the sake of simplicity.

• Interpreting the L1 λ coefficients. In our infrastructure we calculate the L2 loss and then divide by n. In the baseline experiments we further divide the reconstruction L2 loss by 𝔼‖x‖₂.

D.1.2. Experiment-specific training details

• We use learning rate 0.0003 for all Gated SAE experiments and for the GELU-1L baseline experiment. We swept for optimal baseline learning rates on GELU-1L to generate this value. For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by 𝔼‖x‖₂, motivated by better hyperparameter transfer, and accordingly changed the learning rate to 0.001 and 0.00075 respectively (full learning rate details in Tables 1-8). We didn't see a noticeable difference in the Pareto frontier and so did not sweep this hyperparameter further.

• We generate activations from sequences of length 128 for GELU-1L, 2048 for Pythia-2.8B and 1024 for Gemma-7B.

• We use a batch size of 4096 for all runs. We use 300,000 training steps for GELU-1L and Gemma-7B runs, and 400,000 steps for Pythia-2.8B runs.
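The decoder norm constraint above is typically enforced by projecting the decoder back onto the constraint set after each optimizer step; a minimal sketch of both variants (our illustration, with an assumed (n_features, d_model) layout where each row is one feature's decoder direction):

```python
import numpy as np

def renormalize_decoder(W_dec, exact=True, eps=1e-8):
    """Project decoder directions back onto the unit-norm constraint set.

    With `exact=True`, every direction is rescaled to exactly unit norm
    (the constraint used in this work); with `exact=False`, only directions
    with norm > 1 are rescaled down, i.e. the at-most-unit-norm variant
    suggested by Templeton et al. (2024).
    """
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    if exact:
        return W_dec / (norms + eps)
    return W_dec / np.maximum(norms, 1.0)
```

In a real training loop this projection would run after every gradient update, so the L1 penalty cannot be gamed by shrinking decoder norms while inflating feature activations.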
D.1.3. Lessons learned scaling SAEs

• Learned feature death is unpredictable. In Table 1 (and the other tables) there are few patterns that can be gleaned from staring at which runs have high numbers of dead learned features (called dead neurons in Bricken et al. (2023)).

• Resampling makes hyperparameter sweeps difficult. We found that resampling caused L0 and loss recovered to increase, similar to Conmy (2023).

• Training appears to converge earlier than expected. We found that we did not need 20B tokens as in Bricken et al. (2023): generally, resampling had stopped causing gains and loss curves had plateaued after just over one billion tokens.

D.2. Evaluation

We evaluated the models on over a million held-out tokens. Tables 1-8 show summary stats from training runs on the Pareto frontier.

Table 1|Gemma-7B Baseline SAEs (1024 sequence length). Italics are Pareto-optimal SAEs. [Per-run data rows garbled in extraction; the columns are Site, Layer, Sparsity λ, LR, L0, % CE Recovered, Clean CE Loss, SAE CE Loss, 0-Ablation CE Loss, Width, % Alive Features and Shrinkage γ.]

E. Equivalence between gated encoder with tied weights and linear encoder with non-standard activation function

In this section we show that, under the weight sharing scheme defined in Eq. (7), a gated encoder as defined in Eq. (6) is equivalent to a linear layer with a non-standard (and parameterised) activation function. Without loss of generality, consider the case of a single latent feature (M = 1) and set the pre-encoder bias to zero. In this case, the gated encoder is defined as

f̃(x) := 𝟙[w_gate · x + b_gate > 0] ReLU(w_mag · x + b_mag),    (13)

and the weight sharing scheme becomes

w_mag := ρ_mag w_gate,    (14)

with a non-negative parameter ρ_mag ≡ exp(r_mag). Substituting Eq. (14) into Eq. (13), the gating condition w_gate · x + b_gate > 0 becomes (after multiplying through by ρ_mag > 0 and adding b_mag to both sides) w_mag · x + b_mag > b_mag − ρ_mag b_gate, so we can re-express f̃(x) as a single linear layer

f̃(x) := σ_θ(w_mag · x + b_mag), where θ := b_mag − ρ_mag b_gate,    (15)
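This equivalence is easy to check numerically; a quick sketch with arbitrary parameter values (our illustration, single feature and zero pre-encoder bias as in Appendix E):

```python
import numpy as np

# Numeric check of Appendix E: with tied weights (Eq. 14), the gated encoder
# (Eq. 13) matches a single linear layer through a JumpReLU-style activation
# with threshold theta = b_mag - rho_mag * b_gate (Eqs. 15-16).
rng = np.random.default_rng(0)
w_gate, b_gate = rng.normal(size=3), -0.7
rho_mag, b_mag = float(np.exp(0.4)), 0.2
w_mag = rho_mag * w_gate
theta = b_mag - rho_mag * b_gate

def gated(x):
    # Eq. (13): the gate decides firing; the magnitude path sets the value.
    return (w_gate @ x + b_gate > 0) * max(w_mag @ x + b_mag, 0.0)

def jumprelu_form(x):
    # Eqs. (15)-(16): one linear layer with a thresholded ReLU activation.
    z = w_mag @ x + b_mag
    return (z > theta) * max(z, 0.0)
```

The two functions agree on (almost) every input; with b_gate < 0 the threshold θ is strictly positive, so pre-activations in (0, θ] are zeroed by both forms even though a plain ReLU would pass them.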
Here σ_θ is the parameterised activation function

σ_θ(z) := 𝟙[z > θ] ReLU(z),    (16)

called JumpReLU in a different context (Erichson et al., 2019). Fig. 4 illustrates the shape of this activation function.

Table 2|Gemma-7B Baseline SAEs (1024 sequence length), continued from Table 1. [Per-run data rows garbled in extraction; columns as in Table 1.]

F. A toy setting where JumpReLU SAEs outperform baseline SAEs

An additional reason that Gated SAEs may outperform baseline SAEs, beyond resolving shrinkage, is that they are a more expressive architecture: at inference time, they are equivalent to an SAE with the ReLU replaced by a potentially discontinuous JumpReLU (Erichson et al., 2019), as shown in Appendix E. In this appendix we present a toy setting where a JumpReLU is a more natural activation function for sparsely reconstructing activations than a ReLU. We adopt a more intuitive and less formal style, for pedagogical purposes.

Consider a sparsely activating but continuously valued feature X, and a fixed unit encoder direction v̂. If X is off (X = 0), the projection of activations a onto v̂ is normally distributed as N(0, 1) (simulating noise from non-orthogonal features firing); if X is on (X > 0), the projection is normally distributed as N(2, 1/4). Suppose further that X is on with 50% probability, so that a · v̂ ∼ 𝟙[X is on](0.5 Z₁ + 2) + 𝟙[X is off] Z₂, where a is the activation and Z₁, Z₂ ∼ N(0, 1²) are standard 1D Gaussians. The empirical distribution is shown in Figure 15a.
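This toy model is easy to simulate. The sketch below (our illustration, not code from the paper) compares a ReLU encoder with threshold t = 1 and scale m = 2 against a JumpReLU with t = 1, origin d = 0 and m = 1, the parameter choices illustrated in Figure 15b:

```python
import numpy as np

# Toy model: X is on with 50% probability; the projection onto the encoder
# direction is N(2, 1/4) when on and N(0, 1) when off.
rng = np.random.default_rng(0)
n = 100_000
on = rng.random(n) < 0.5
proj = np.where(on, 2.0 + 0.5 * rng.normal(size=n), rng.normal(size=n))
target = np.where(on, proj, 0.0)  # ideal: pass the signal, suppress the noise

def relu_encoder(x, t, m):
    # Fires iff x > t; output is m times the distance from the threshold t.
    return np.where(x > t, m * (x - t), 0.0)

def jumprelu_encoder(x, t, d, m):
    # Fires iff x > t; output is m times the distance from a separate origin d.
    return np.where(x > t, m * (x - d), 0.0)

relu_err = np.mean((relu_encoder(proj, 1.0, 2.0) - target) ** 2)
jump_err = np.mean((jumprelu_encoder(proj, 1.0, 0.0, 1.0) - target) ** 2)
```

With these settings the JumpReLU's mean squared reconstruction error is noticeably lower: it reproduces the on-distribution tail exactly, while the ReLU must distort it to keep the threshold high.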
Table 3|Gemma-7B Gated SAEs (1024 sequence length), continued in Table 4. [Per-run data rows garbled in extraction; columns as in Table 1.]

Consider the problem of fitting a ReLU encoder to this. With a fixed encoder unit direction v̂, the encoder is parametrised by a bias b and magnitude m, as a → max(m a · v̂ + b, 0). b can be reparametrised in terms of a threshold t = −b/m, so the encoder is now a → 𝟙[a · v̂ > t] m(a · v̂ − t). Geometrically, we set some threshold t, a vertical line: everything to the left is set to zero, and everything to the right is set to some multiple of the distance from the line. This illuminates the core problem with ReLU SAEs: the threshold both determines whether to fire at all, and gives an origin to take the distance from if firing. The optimal reconstruction when X is on requires us to take t = 0. But now we fire half the time when X is off, as a lot of the blue histogram is to the right of the green line. However, if we take a high enough threshold to exclude most of the blue, e.g. t = 1, we now need to take the distance from t = 1 when X is on too, distorting things, even if we try to correct by rescaling with m; see the blue line in Figure 15b. JumpReLUs solve this problem.
Mathematically, we can parametrise a JumpReLU, at least in the setting of Gated SAEs, as x ↦ 𝟙[x > t] m(x − d). Geometrically, we now have two vertical lines. x = t sets the threshold: anything to the left is set to zero. x = d sets the origin point (for some d ≤ t): we return the distance to d times some magnitude m. This solves our problem: we can set t = 1 (the purple line), d = 0 (the green line) and m = 1 (no distortion correction needed), and get the red line in Figure 15b, a near perfect reconstruction!

Table 4|Gemma-7B Gated SAEs (1024 sequence length), continued from Table 3. [Per-run data rows garbled in extraction; columns as in Table 1.]

Some caveats and reflections on this toy model:

• The numbers t = 1, m = 2 are likely not the mathematically optimal solution, and are given for pedagogical purposes, but this seems unlikely to change the conceptual takeaways.

• This toy model has not been empirically tested, and could be totally off. But we've found it useful for building intuition.
• Why was it realistic to assume that the projection wasn't just zero when X was off? Because there are likely many other non-orthogonal features firing, due to superposition, which in aggregate create significant interference. Indeed, a common problem when studying SAE features and other interpretable directions is that, while the tails are monosemantic, activations near zero are very noisy (see e.g. the Arabic feature in Bricken et al. (2023) or the sentiment feature in Tigges et al. (2023)). We speculate that this is a consequence of ReLU SAEs needing to choose a threshold with a mix of on and off activations (a mix of red and blue in Figure 15a) to minimise distortion to the tails, as L1 does not penalise incorrectly firing at small magnitudes much. We hope that Gated SAEs may have fewer of these issues, as they can simply have a large gap between t and d.

Table 5|Pythia-2.8B baseline SAEs (2048 sequence length), continued in Table 6. [Per-run data rows garbled in extraction; columns as in Table 1.]

Table 6|Pythia-2.8B baseline SAEs (2048 sequence length), continued from Table 5. [Per-run data rows garbled in extraction; columns as in Table 1.]
MLP40.00060.000328.689.28%1.96981.9782.04613107299.16%1.011 MLP40.00040.000366.592.74%1.96981.97542.04613107299.52%1.002 MLP40.00080.000315.887.13%1.96981.97962.04613107298.46%1.007 MLP120.0010.000335.081.33%1.96981.97862.016713107297.55%1.011 MLP120.0020.00038.272.1%1.96981.98292.016713107294.68%1.002 MLP120.00080.000355.784.15%1.96981.97732.016713107298.23%1.004 Table 7|Pythia-2.8B Gated SAEs (2048 sequence length). Continued in Table 8. SiteLayerSparsity휆LRL0% CE RecoveredClean CE LossSpliced SAE CE LossZero Ablation CE LossShrinkage훾 MLP160.00080.000351.080.32%1.96981.97772.009813107299.05%1.002 MLP160.00160.000312.470.76%1.96981.98152.009813107297.38%1.005 MLP160.00070.000370.182.09%1.96981.9772.009813107299.32%1.001 MLP160.00140.000316.172.62%1.96981.98082.009813107297.48%1.007 MLP160.00120.000321.975.12%1.96981.97982.009813107298.18%1.012 MLP160.00090.000338.378.41%1.96981.97852.009813107298.72%0.993 MLP200.00080.000351.094.28%1.96981.97742.102213107299.06%1.007 MLP200.00120.000322.192.53%1.96981.97972.102213107297.97%1.0 MLP200.0010.000330.993.27%1.96981.97882.102213107298.39%1.003 MLP280.0010.000347.779.96%1.96981.97882.014513107298.76%1.004 MLP280.00080.000382.183.68%1.96981.97712.014513107298.48%1.002 MLP280.00150.000321.373.3%1.96981.98182.014513107297.58%1.004 Resid40.00080.000370.799.5%1.96992.025713.04343276899.68%0.996 Resid40.0010.000349.099.37%1.96992.039913.04343276899.52%0.998 Resid40.0020.000316.298.83%1.96992.099813.04343276898.72%1.001 Resid120.0040.000316.295.92%1.96982.323910.65583276872.56%1.003 Resid120.00160.000377.198.61%1.96982.090810.65583276885.53%0.998 Resid120.0020.000352.898.2%1.96982.126110.65583276883.41%1.0 Resid160.0030.000337.597.46%1.96982.216211.6823276878.14%1.0 Resid160.0060.000312.594.29%1.96982.524911.6823276862.59%0.998 Resid160.0020.000371.598.33%1.96982.132411.6823276882.89%0.998 Resid160.00250.000346.298.04%1.96982.159711.68213107238.15%0.993 Resid160.00450.000318.996.55%1.96982.304511.68213107238.92%0.996 
| Site | Layer | Sparsity λ | LR | L0 | % CE Recovered | Clean CE Loss | Spliced SAE CE Loss | Zero Ablation CE Loss | Width | % Alive Features | Shrinkage γ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Resid | 16 | 0.0015 | 0.0003 | 95.6 | 98.62% | 1.9698 | 2.104 | 11.682 | 131072 | 29.77% | 0.991 |
| Resid | 20 | 0.0075 | 0.0003 | 15.4 | 91.68% | 1.9698 | 2.6763 | 10.4578 | 32768 | 59.39% | 0.994 |
| Resid | 20 | 0.004 | 0.0003 | 38.7 | 95.09% | 1.9698 | 2.3866 | 10.4578 | 32768 | 65.15% | 0.995 |
| Resid | 20 | 0.003 | 0.0003 | 58.4 | 96.05% | 1.9698 | 2.3053 | 10.4578 | 32768 | 68.08% | 0.994 |
| Resid | 28 | 0.0075 | 0.0003 | 25.0 | 96.54% | 1.9698 | 2.8646 | 27.8663 | 32768 | 29.97% | 0.993 |
| Resid | 28 | 0.005 | 0.0003 | 46.6 | 97.58% | 1.9698 | 2.5973 | 27.8663 | 32768 | 40.94% | 1.008 |
| Resid | 28 | 0.004 | 0.0003 | 61.2 | 97.9% | 1.9698 | 2.5136 | 27.8663 | 32768 | 35.93% | 1.005 |

Table 8 | Pythia-2.8B Gated SAEs (2048 sequence length). Continued from Table 7.

[Figure 15: two panels. (a) Histogram titled "Distribution of activation projection", with X off/on shown in different colours. (b) Plot titled "Act vs Reconstructed Act for Jump & Normal ReLU", with curves relu ($t=1$, $m=2$) and jump ($t=1$, $m=1$, $d=0$).]

Figure 15 | (a) The empirical distribution of $a \cdot \hat{v}$ in the toy model, where $a \cdot \hat{v} \sim \mathbb{1}_{X \text{ is on}} (0.5 Z_1 + 2) + \mathbb{1}_{X \text{ is off}} Z_2$. The green line is $x = 0$; the purple line is $x = 1$. (b) A scatter plot of the reconstruction of $a \cdot \hat{v}$ against $a \cdot \hat{v}$ for two possible SAE activation functions: the blue line is a standard ReLU (with $t = 1$, $m = 2$), i.e. setting a threshold at the purple line and then taking twice the distance from it, and the red line is a Jump ReLU (with $t = 1$, $m = 1$, $d = 0$), i.e. setting a threshold at the purple line and then taking the distance from the green line. Note that the Jump ReLU gives a perfect reconstruction (above one), while the standard ReLU is highly imperfect.

• We asserted that X was a sparsely activating but continuous feature. It’s an open question how many such features actually exist in models (though at least some likely do (Gurnee and Tegmark, 2024)). Our intuition is that most features are essentially binary (e.g. "is this about basketball"), but that models track their confidences in them as coefficients of the feature
directions, and that reconstructing the precise coefficients matters (otherwise, we could simply discretise SAE activations at inference time!), so they can be thought of as continuous. We think understanding this better is a promising direction for future work.

• In real models, the probability that $X$ fires is likely much less than 50%! But this assumption simplified the reasoning and diagrams without qualitatively changing much.

• We assumed that the encoder direction ($\hat{v}$) was frozen, even if its magnitude was not. This is a simplifying assumption that is clearly false for Gated SAEs; indeed, as shown in Section 5.2, their ability to choose different directions from a standard SAE is key to their performance, and they outperform a standard SAE fine-tuned with Jump ReLUs.

• The main reason a standard ReLU SAE doesn’t want $t = 0$ is that this would include too many activations when $X$ is off. But this is actually good for reconstruction, just bad for L1 and sparsity. Gated SAEs decouple L1 from their encoder directions, making it hard to reason clearly about whether the need for a high threshold would still apply in a hypothetical Gated SAE with standard ReLUs (though in an actual Gated SAE, the L1 penalty crucially is still applied to $t$).

G. Pseudo-code for Gated SAEs and the Gated SAE loss function

```
def gated_sae(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec):
    # Apply pre-encoder bias
    x_center = x - b_dec

    # Gating encoder (estimates which features are active)
    active_features = (x_center @ W_gate + b_gate) > 0

    # Magnitudes encoder (estimates active features' magnitudes)
    feature_magnitudes = relu(x_center @ W_mag + b_mag)

    # Multiply both before decoding
    return (active_features * feature_magnitudes) @ W_dec + b_dec
```

Figure 16 | Pseudo-code for the Gated SAE forward pass.
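For concreteness, here is a runnable NumPy sketch of the forward pass above. The `relu` helper, the random initialisation, and the shapes (`d_model = 8`, `d_sae = 32`) are illustrative assumptions, not the paper's training configuration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gated_sae_forward(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec):
    # Apply pre-encoder bias
    x_center = x - b_dec
    # Gating encoder: binary mask of which features fire
    active = (x_center @ W_gate + b_gate) > 0
    # Magnitudes encoder: how strongly the active features fire
    mags = relu(x_center @ W_mag + b_mag)
    # Decode the gated feature activations
    return (active * mags) @ W_dec + b_dec

# Illustrative shapes: model dimension 8, dictionary of 32 features
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_gate = rng.normal(size=(d_model, d_sae))
W_mag = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_gate, b_mag = np.zeros(d_sae), np.zeros(d_sae)
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # batch of 4 activation vectors
x_hat = gated_sae_forward(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec)
print(x_hat.shape)  # (4, 8)
```

Note that the boolean gate multiplies the ReLU magnitudes elementwise, so a feature contributes to the reconstruction only when its gating pre-activation is positive.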
```
def loss(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec):
    gated_sae_loss = 0.0

    # We'll use the reconstruction from the baseline forward pass to train
    # the magnitudes encoder and decoder. Note we don't apply any sparsity
    # penalty here. Also, no gradient will propagate back to W_gate or b_gate,
    # due to binarising the gated activations to zero or one.
    reconstruction = gated_sae(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec)
    gated_sae_loss += sum((reconstruction - x)**2, axis=-1)

    # We apply an L1 penalty on the gated encoder activations (pre-binarising,
    # post-ReLU) to incentivise them to be sparse.
    x_center = x - b_dec
    via_gate_feature_magnitudes = relu(x_center @ W_gate + b_gate)
    gated_sae_loss += l1_coef * sum(via_gate_feature_magnitudes, axis=-1)

    # Currently the gated encoder only has gradient signal to be sparse, and
    # not to reconstruct well, so we also do a "via gate" reconstruction, to
    # give it an appropriate gradient signal. We stop the gradients to the
    # decoder parameters in this forward pass, as we don't want these to be
    # influenced by this auxiliary task.
    via_gate_reconstruction = (
        via_gate_feature_magnitudes @ stop_gradient(W_dec)
        + stop_gradient(b_dec)
    )
    gated_sae_loss += sum((via_gate_reconstruction - x)**2, axis=-1)

    return gated_sae_loss
```

Figure 17 | Pseudo-code for the Gated SAE loss function. Note that this pseudo-code is written for expositional clarity. In practice, taking into account parameter tying, it would be more efficient to rearrange the computation to avoid unnecessarily duplicated operations.
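As an unofficial check of the three terms in this loss, the sketch below evaluates its numerical value in NumPy. Since we only compute the forward value (no autodiff), `stop_gradient` reduces to the identity here; the `l1_coef` value and all shapes are made-up assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gated_sae_loss_value(x, W_gate, b_gate, W_mag, b_mag, W_dec, b_dec, l1_coef):
    """Numerical value of the Gated SAE loss, per example.

    In training, stop_gradient only affects backpropagation, so it is the
    identity in this forward-value-only computation.
    """
    x_center = x - b_dec

    # Term 1: main reconstruction error (gated forward pass)
    active = (x_center @ W_gate + b_gate) > 0
    mags = relu(x_center @ W_mag + b_mag)
    reconstruction = (active * mags) @ W_dec + b_dec
    recon_loss = ((reconstruction - x) ** 2).sum(axis=-1)

    # Term 2: L1 sparsity penalty on the (pre-binarising) gate activations
    via_gate = relu(x_center @ W_gate + b_gate)
    sparsity_loss = l1_coef * via_gate.sum(axis=-1)

    # Term 3: auxiliary "via gate" reconstruction (stop_gradient is identity)
    aux_recon = via_gate @ W_dec + b_dec
    aux_loss = ((aux_recon - x) ** 2).sum(axis=-1)

    return recon_loss + sparsity_loss + aux_loss

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
params = dict(
    W_gate=rng.normal(size=(d_model, d_sae)),
    b_gate=np.zeros(d_sae),
    W_mag=rng.normal(size=(d_model, d_sae)),
    b_mag=np.zeros(d_sae),
    W_dec=rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae),
    b_dec=np.zeros(d_model),
)
x = rng.normal(size=(4, d_model))
loss = gated_sae_loss_value(x, l1_coef=0.001, **params)
print(loss.shape)  # (4,)
```

All three terms are non-negative (squared errors plus an L1 penalty on ReLU outputs), so the total per-example loss is as well.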
| $p$-values | Raw label | Delta from Baseline to Gated |
|---|---|---|
| Pythia-2.8B (Page's trend test) | .50 | .13 |
| Pythia-2.8B (Friedman test) | .57 | .05 |
| Gemma-7B (Page's trend test) | .37 | .31 |
| Gemma-7B (Friedman test) | .003 | .64 |

Table 9 | Layer significance tests

| $p$-values | Raw label | Delta from Baseline to Gated |
|---|---|---|
| Across models (Kruskal-Wallis H-test) | .01 | .71 |
| Pythia-2.8B (Friedman test) | .13 | .05 |
| Gemma-7B (Friedman test) | .03 | .76 |

Table 10 | Rater significance tests

H. Further analysis of the human interpretability study

We perform some further analysis on the data from Section 4.3 to understand the impact of different sites, layers, and raters.

H.1. Sites

We first ask whether there is evidence that the sites had different interpretability outcomes. A Friedman test across sites shows significant differences (at $p = .047$) between the Gated-vs-Baseline differences, though not ($p = .92$) between the raw labels. Breaking down by site, repeating the one-sided Wilcoxon-Pratt tests, and computing confidence intervals, we find the result on MLP outputs is strongest, with mean .40, significance $p = .003$, and CI [.18, .63]; this is as compared with the attention outputs ($p = .47$, mean .05, CI [-.16, .26]) and final residual ($p = .59$, mean -.07, CI [-.28, .12]) SAEs.

H.2. Layers

Next, we test whether different layers had different outcomes. We do this separately for the two models, since their layers are not directly comparable. We run two tests in each setting: Page's trend test (which tests for a monotone trend across layers) and the Friedman test (which tests for any difference, without any expectation of a monotone trend). Results are presented in Table 9; they suggest there are some significant non-monotone differences between layers. To elucidate this, we present 90% BCa bootstrap confidence intervals of the mean raw label (where 'No' = 0, 'Maybe' = 1, 'Yes' = 2) and of the Gated-vs-Baseline difference, per layer, in Figure 18 and Figure 19, respectively.

H.3. Raters

In Table 10 we present test results weakly suggesting that the raters differed in their judgments. This underscores that there is still a significant subjective component to this interpretability labeling. (Notably, different raters saw different proportions of Pythia vs Gemma features, so aggregating across the models is partially confounded by that.)

Figure 18 | Per-layer 90% confidence intervals for the mean interpretability label

Figure 19 | Per-layer 90% confidence intervals for the Gated-vs-Baseline label difference

Figure 20 | Contingency tables for the paired (gated vs baseline) interpretability labels, for Pythia-2.8B

Figure 21 | Contingency tables for the paired (gated vs baseline) interpretability labels, for Gemma-7B
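The significance tests named in this appendix (Friedman, Page's trend, and one-sided Wilcoxon-Pratt) are all available in SciPy. The sketch below shows the API on invented ratings data; the numbers are made up purely to illustrate the calls and are not the study's ratings:

```python
import numpy as np
from scipy import stats

# Hypothetical interpretability labels ('No'=0, 'Maybe'=1, 'Yes'=2) for the
# same 30 features rated under 4 related conditions (e.g. layers); a small
# per-column offset injects a weak monotone trend and breaks ties.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(30, 4)).astype(float)
ratings += np.linspace(0.0, 0.5, 4)

# Friedman test: any difference between the related conditions?
friedman = stats.friedmanchisquare(*(ratings[:, j] for j in range(4)))

# Page's trend test: a monotone trend across the conditions?
page = stats.page_trend_test(ratings)

# One-sided Wilcoxon test with Pratt zero-handling, on paired
# gated-vs-baseline label differences (also invented data).
deltas = rng.normal(loc=0.2, size=30)
wilcoxon = stats.wilcoxon(deltas, zero_method="pratt", alternative="greater")

print(friedman.pvalue, page.pvalue, wilcoxon.pvalue)
```

The Friedman and Page tests make complementary alternatives explicit: the former detects any difference between conditions, while the latter has power only against a monotone ordering, matching the distinction drawn in Section H.2.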