← Back to papers

Paper deep dive

Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need

Adam Karvonen

Year: 2025Venue: arXiv preprintArea: Mechanistic Interp.Type: EmpiricalEmbeddings: 49

Models: Gemma-2-2B

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 5:58:15 PM

Summary

The paper introduces a computationally efficient fine-tuning method for Sparse Autoencoders (SAEs) using a brief KL divergence and MSE loss stage (25M tokens). This approach achieves performance comparable to full end-to-end training while reducing computational costs by approximately 50%, and avoids the complexities of modifying language model weights via LoRA adapters. The authors demonstrate that this method effectively addresses reconstruction errors in MSE-trained SAEs, though results on downstream interpretability metrics (SAEBench) are mixed depending on the architecture and task.

Entities (5)

Sparse Autoencoder · model-architecture · 100%KL+MSE fine-tuning · training-method · 98%Gemma-2-2B · language-model · 95%LoRA · adaptation-technique · 95%SAEBench · evaluation-suite · 95%

Relation Signals (3)

SAEBench evaluates Sparse Autoencoder

confidence 95% · We evaluate our proposed method on SAEBench

KL+MSE fine-tuning improves Sparse Autoencoder

confidence 95% · achieves comparable improvements, reducing the cross-entropy loss gap by 20-50%

KL+MSE fine-tuning reduces Computational Cost

confidence 90% · reducing wall-clock time by approximately 50% compared to full end-to-end training

Cypher Suggestions (2)

Find all training methods used for Sparse Autoencoders · confidence 90% · unvalidated

MATCH (m:Method)-[:APPLIED_TO]->(s:Model {name: 'Sparse Autoencoder'}) RETURN m.name

Identify evaluation suites used for specific model architectures · confidence 85% · unvalidated

MATCH (e:EvaluationSuite)-[:EVALUATES]->(a:ModelArchitecture) RETURN e.name, a.name

Abstract

Abstract:Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss between the original model logits and the reconstructed model logits when replacing model activations with SAE reconstructions. Typically, SAEs are trained solely on mean squared error (MSE) when reconstructing precomputed, shuffled activations. Recent work introduced training SAEs directly with a combination of KL divergence and MSE ("end-to-end" SAEs), significantly improving reconstruction accuracy at the cost of substantially increased computation, which has limited their widespread adoption. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens (just a few percent of typical training budgets) that achieves comparable improvements, reducing the cross-entropy loss gap by 20-50%, while incurring minimal additional computational cost. We further find that multiple fine-tuning methods (KL fine-tuning, LoRA adapters, linear adapters) yield similar, non-additive cross-entropy improvements, suggesting a common, easily correctable error source in MSE-trained SAEs. We demonstrate a straightforward method for effectively transferring hyperparameters and sparsity penalties between training phases despite scale differences between KL and MSE losses. While both ReLU and TopK SAEs see significant cross-entropy loss improvements, evaluations on supervised SAEBench metrics yield mixed results, with improvements on some metrics and decreases on others, depending on both the SAE architecture and downstream task. Nonetheless, our method may offer meaningful improvements in interpretability applications such as circuit analysis with minor additional cost.

Tags

ai-safety (imported, 100%)empirical (suggested, 88%)mechanistic-interp (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Full Text

49,172 characters extracted from source content.

Expand or collapse full text

Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Adam Karvonen 1 Abstract Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss between the original model logits and the re- constructed model logits when replacing model activations with SAE reconstructions. Typically, SAEs are trained solely on mean squared error (MSE) when reconstructing precomputed, shuf- fled activations. Recent work introduced train- ing SAEs directly with a combination of KL divergence and MSE (“end-to-end” SAEs), sig- nificantly improving reconstruction accuracy at the cost of substantially increased computation, which has limited their widespread adoption. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens (just a few percent of typical training budgets) that achieves comparable improvements, reducing the cross- entropy loss gap by 20–50%, while incurring minimal additional computational cost. We fur- ther find that multiple fine-tuning methods (KL fine-tuning, LoRA adapters, linear adapters) yield similar, non-additive cross-entropy improvements, suggesting a common, easily correctable error source in MSE-trained SAEs. We demonstrate a straightforward method for effectively transfer- ring hyperparameters and sparsity penalties be- tween training phases despite scale differences between KL and MSE losses. While both ReLU and TopK SAEs see significant cross-entropy loss improvements, evaluations on supervised SAEBench metrics yield mixed results, with im- provements on some metrics and decreases on oth- ers, depending on both the SAE architecture and downstream task. Nonetheless, our method may offer meaningful improvements in interpretability applications such as circuit analysis with minor additional cost. 1 Independent.Correspondence to:Adam Karvonen <adam.karvonen@gmail.com>. 1. Introduction Sparse autoencoders (SAEs) have emerged as an important tool in mechanistic interpretability, enabling the decomposi- tion of language model activations into sparse linear combi- nations of interpretable latent features (Cunningham et al., 2023; Bricken et al., 2023). The central hypothesis behind SAEs is that neural activations can be effectively represented using sparse combinations of meaningful latent directions, facilitating deeper understanding and interpretability of neu- ral network computations. The primary evaluation metric for SAEs is the increase in cross-entropy loss incurred when original model acti- vations are replaced by SAE reconstructions during infer- ence, capturing the trade-off between sparsity and fidelity. Recent advances have focused extensively on improving this sparsity-fidelity trade-off through novel SAE architec- tures (Rajamanoharan et al., 2024a; Mudide et al., 2024), activation functions (Gao et al., 2024; Taggart, 2024; Raja- manoharan et al., 2024b; Bussmann et al., 2024a; Ayonrinde, 2024), and training loss formulations (Karvonen et al., 2024; Bussmann et al., 2024b). Typically, SAEs are trained with mean squared error (MSE) loss when reconstructing precomputed and shuffled model activations (Lieberum et al., 2024). However, this approach does not directly optimize the cross-entropy loss between the original model logits and reconstructed model logits used during evaluation. Recent methods have explored train- ing SAEs with a combination of KL divergence and MSE loss to align training objectives more closely with evalu- ation objectives. Notably, Braun et al. (2024) proposed ”end-to-end” SAE training, using KL+MSE loss throughout training. While effective, this method substantially increases computational cost, thus limiting practical applicability and widespread adoption by the field. Alternatively, Chen et al. (2025) introduced a method lever- aging low-rank adapters (LoRA) to fine-tune the underlying language model around pre-trained SAEs. This approach achieves similar performance gains with much lower com- putational overhead. However, it introduces additional com- plexity and alters the original language model, which may be undesirable for interpretability studies. 1 arXiv:2503.17272v2 [cs.LG] 29 Mar 2025 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 1: Comparison of training approaches for a sparse autoencoder (K=80, width=16K) on Pythia-160M. The proposed KL+MSE fine-tuning approach (25M tokens) achieves slightly better performance than full end-to-end (E2E) training (Braun et al., 2024) on the same amount of data while reducing wall-clock time by approximately 50%. In this work, we revisit end-to-end SAE training and demon- strate a simpler yet equally effective strategy: applying a brief fine-tuning stage with KL+MSE loss for only the final 25M tokens of training (0.5-10% of typical training budgets (250M - 5B tokens)). In our particular setting, this targeted approach exceeds the performance of both full end-to-end training and LoRA adapters, while significantly reducing wall-clock time—approximately 50% reduction compared to full end-to-end training, and a minor reduction compared to LoRA adapters. In other common scenarios, such as train- ing on larger language models, employing early stopping during activation collection, or amortizing SAE training across precomputed activations, wall-clock time savings relative to end-to-end training can easily exceed 90%. 1 These results suggest that training with an MSE loss func- tion learns meaningful features with an easily correctable KL divergence error source which can be removed by multi- ple methods, of which SAE fine-tuning is the simplest and most performant. We present a practical recipe for smoothly transferring hyperparameters and sparsity penalties from MSE-only to KL+MSE training phases, addressing chal- lenges arising from significant scale differences between these losses. 1 For example, training six SAEs at the middle layer involves KL divergence loss (two forward passes + one backward, 4× cost), no early stopping (2×), and no amortization across SAEs (6×), totaling up to 48× extra compute for activation collection. Actual wall-clock savings depend on SAE-to-LLM size ratios. We evaluate our proposed method on SAEBench (Karvonen et al., 2025), a comprehensive suite of metrics beyond recon- struction accuracy. Evaluations on downstream supervised metrics yield mixed results, showing both improvements and declines depending on both the specific evaluation metric and SAE architecture, with particularly pronounced changes for ReLU-based SAEs. Despite this, we believe that achiev- ing considerable reductions in cross-entropy loss at minimal cost may substantially benefit interpretability-focused appli- cations, such as circuit analysis, by reducing the influence of reconstruction error nodes and thus decreasing the risk of missing critical signals. Our main contributions include: 1. A simplified and computationally efficient fine-tuning approach for achieving end-to-end SAE training bene- fits without altering model architecture or adding sig- nificant complexity. 2.A practical method for transferring hyperparameters and sparsity penalties between MSE and KL+MSE training phases. 3. Empirical evidence demonstrating mixed results in su- pervised interpretability metrics on SAEBench, high- lighting that interpretability benefits from KL+MSE training depend on both SAE architecture and the downstream task. 2 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need We release code, data, and models atgithub.com/ adamkarvonen/sae_kl_finetune. 2. Related Work 2.1. Sparse Autoencoders for Interpretability Sparse autoencoders (SAEs) have gained popularity as an unsupervised interpretability method for analyzing activa- tions of large language models. An SAE generally consists of an encoder-decoder structure: the encoder transforms the original activations into a higher-dimensional but sparse latent representation, while the decoder reconstructs the original activations from this sparse representation. The first widely used architecture involved a linear encoder layer, a sparsity-inducing activation (often ReLU), and a linear decoder. Training these autoencoders involved minimizing a reconstruction loss along with anL 1 sparsity regulariza- tion applied to the latent representation. Formally, the SAE forward pass and optimization objective can be defined as: h=ReLU(W E x+b E )(1) ˆx=W D h+b D (2) L=∥x−ˆx∥ 2 2 | z reconstruction +λ∥h∥ 1 |z sparsity (3) wherexis the input activation vector,hrepresents the sparse latent representation, andˆxis the reconstructed activation vector. The parametersW E ,b E andW D ,b D correspond to the weights and biases of the encoder and decoder, re- spectively, andλis the sparsity penalty which controls the balance between reconstruction accuracy and sparsity. Since reconstructions by SAEs are inherently imperfect at any given sparsity level, substituting these approximations back into the original model generally results in increased loss. Consequently, recent research has focused extensively on improving SAE architectures and activation functions to enhance reconstruction accuracy at fixed sparsity levels. Prominent advancements include TopK and BatchTopK ac- tivation functions (Gao et al., 2024; Bussmann et al., 2024a), which explicitly control sparsity levels, and JumpReLU (Ra- jamanoharan et al., 2024b), which utilizes straight-through estimators to directly optimize an averageL 0 sparsity ob- jective. These innovations have significantly enhanced the reconstruction accuracy achievable at specific sparsity lev- els, effectively shifting and improving the sparsity-fidelity frontier. Most recently proposed SAE variants primarily focus on improving the sparsity-fidelity trade-off. End to End SAE TrainingThe recent introduction of ”end-to-end” SAE training by Braun et al. (2024) leverages KL+MSE loss throughout training, directly aligning training with evaluation objectives, and significantly improves the sparsity-fidelity trade-off for a fixed dataset size. However, this approach substantially increases wall-clock time and computational cost—typically requiring two forward passes per optimization step (original and modified models) and additional backward passes, quadrupling total compute rela- tive to conventional SAE training. Furthermore, end-to-end training introduces several operational constraints: it pre- vents activation shuffling (which has been shown to improve performance Lieberum et al. (2024)), eliminates the ability to amortize activation collection costs across multiple SAE training runs, and prevents the use of early stopping at in- termediate layers during activation collection. Depending on the specific architecture and training setup, these limita- tions can result in an order of magnitude increase in total computational requirements. Alternatively, Chen et al. (2025) proposed a method to train on KL divergence with a fixed, pre-trained SAE, through low-rank adaptation (LoRA), efficiently fine-tuning under- lying model parameters to reduce reconstruction-induced performance gaps. While computationally cheaper, this strategy introduces additional complexity and modifies the original language model weights, which some may view as undesirable for interpretability studies. 3. Methods Many sparse autoencoder (SAE) training methods, including ReLU-based SAEs, incorporate various auxiliary losses or sparsity penalties. In contrast, TopK-based SAEs enforce strict sparsity deterministically and thus rely solely on mean squared error (MSE) for training. To effectively transfer experimentally determined hyperparameters during the fine- tuning stage, it is crucial to balance these existing losses with any newly introduced loss terms. Therefore, we include both ReLU-based SAEs, with auxiliary sparsity penalties, and TopK-based SAEs, without auxiliary penalties, in our experiments. While a simpler ReLU-based SAE formulation was de- scribed in Section 2.1, our experiments use an improved sparsity penalty following Anthropic Interpretability Team (2024). Specifically, the sparsity penalty for each feature is weighted by the L1 norm of its corresponding decoder vector: L=|x−ˆx| 2 2 | z reconstruction +λ X i h i ∥w i ∥ 1 | z weighted sparsity (4) wherew i is thei-th column of the decoder matrixW D . 3 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need 3.1. TopK-based Sparse Autoencoders TopK-based SAEs differ from ReLU-based SAEs primar- ily through their sparsity enforcement mechanism. While ReLU SAEs encourage sparsity through an explicit L1 penalty, TopK SAEs enforce a strict sparsity constraint by allowing exactly the topKencoder activations to remain active per input. Specifically, the TopK SAE forward pass and loss are defined as follows: h=TopK(W E x+b E ,K)(5) ˆx=W D h+b D (6) L=|x−ˆx| 2 2 (7) where the TopK operation retains only the largestKvalues in the encoder’s output, setting all other values to zero. Thus, sparsity is deterministic, removing the need for an explicit sparsity penalty. However, other auxillary loss penalties are still commonly used, such as the auxk loss for preventing dead features from Gao et al. (2024). 3.2. Training Method After initial training with MSE loss on shuffled activations, we propose performing a brief fine-tuning using a combined KL divergence and MSE loss. This approach follows in- sights from prior end-to-end SAE work Braun et al. (2024), which found that jointly optimizing KL and MSE losses is important for achieving optimal performance. Specifically, training solely on KL divergence tends to degrade MSE reconstruction quality because multiple plausible activation paths exist through the model. Conversely, using only MSE loss ignores the model’s actual predictive priorities as mea- sured by KL divergence. By training on both KL divergence and MSE simultaneously, SAEs can capture both accurate reconstruction and meaningful features without substantial degradation in either objective. We also found that using a mixture of KL and MSE loss is more effective than using only KL loss (experimental results in Appendix B). Our fine-tuning occurs for only the final 25 million training tokens, representing merely 0.5-10% of typical SAE training budgets (250M - 5B tokens). We train on both Gemma-2- 2B Gemma Team et al. (2024) and Pythia-160M Biderman et al. (2023). The combined loss explicitly aligns SAE train- ing with the evaluation objective—minimizing the increase in cross-entropy loss when original model activations are replaced by SAE reconstructions: L finetune = (|x−ˆx| 2 2 +α·KL(f(ˆx),f(x)))∗0.5(8) Here,f(·)denotes the original language model’s output probabilities given activationsx. To ensure the smooth transfer of experimentally determined hyperparameters, we dynamically adjust the KL divergence term scaling factor, α KL , at each batch: α KL = MSE loss KL loss+ 1e −8 (9) This adjustment balances the ratio of KL vs MSE and en- sures that the total reconstruction loss (KL divergence com- bined with MSE) maintains the scale of the original MSE loss. This approach contrasts with the approach of Braun et al. (2024), which manually tuned a fixedβhyperparame- ter to balance these losses. Our dynamic adjustment elimi- nates the need for this additional hyperparameter—an impor- tant consideration given that the ratio between MSE and KL losses can vary substantially (we observed from 50-1000×) depending on the model, layer, and whether activations are normalized. Althoughα KL exhibits minor fluctuations from batch to batch, we prioritized maintaining a consistent ratio between reconstruction loss and any auxiliary or sparsity losses. We believe exact batch-level balancing is preferable to smooth- ing or averagingα KL across multiple batches. In Appendix B, we observed no degradation to the final loss when apply- ing this dynamic balancing. We found that learning rate decay is important in achieving optimal performance during fine-tuning. We started with a relatively low learning rate of 5e-5 and applied linear decay to 0 over the fine-tuning period. This short fine- tuning strategy maintains the computational advantages of traditional MSE-based training while slightly exceeding the performance gains of full end-to-end training. 3.3. Implementation In practice, the dynamic balancing of losses described above is implemented succinctly in PyTorch as follows: alpha_kl = (mse_loss / (kl_loss + 1e-8) ).detach() loss = (mse_loss + alpha_kl * kl_loss) * 0.5 4. Results 4.1. Cross Entropy Loss Results Training SAEs on MSE only identifies features capable of achieving competitive cross-entropy loss reductions through various fine-tuning methods; these methods do not stack, making brief KL+MSE fine-tuning optimal for both performance and simplicity. We evaluated three different adaptation techniques—LoRA adapting, full KL+MSE fine-tuning, and linear adapters with 4 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 2: Comparison of KL+MSE finetuning (25M tokens) vs full end-to-end training (E2E) on Gemma-2-2B with 65K width SAEs. skip connections—to assess their effectiveness in reducing KL divergence. Individually, each method successfully re- duced cross-entropy loss. However, we observed minimal incremental improvements when combining these methods, suggesting the presence of a common, easily correctable source of error in SAEs initially trained only on MSE. Fur- ther evidence for this hypothesis comes from our finding that even an extremely low-rank LoRA adapter (rank 2) applied directly to the SAE captures more than half of the benefit achievable through full fine-tuning (Appendix D). This would also explain why a short KL+MSE fine-tuning stage is sufficient to achieve the benefits of full end-to-end training. Consequently, our results indicate that training primarily on MSE and then briefly fine-tuning with KL+MSE is the most practical and efficient strategy for optimizing cross- entropy loss. However, we do see differences on supervised SAEBench metrics, which we discuss in the next section. Below, we detail empirical results supporting these observa- tions. KL+MSE finetuning typically meets or exceeds the per- formance of full end-to-end training while significantly reducing wall-clock time. Figure 2 compares our KL+MSE fine-tuning approach against full end-to-end (E2E) training across different sparsity levels (K=20, 40, 80, and 160) on Gemma-2-2B. For most sparsity levels (K=40, 80, 160), our fine-tuning approach matches or exceeds the performance of full E2E training while reducing wall-clock time by ap- proximately 50%. The actual reduction in wall-clock time will vary depending on several factors, including the ratio of SAE size to LLM size, whether early stopping is em- ployed during activation collection, and whether activation collection costs are being amortized across multiple SAEs. However, for the highly sparse K=20 configuration, the fine-tuning approach performs worse and shows incomplete convergence, indicating that either additional training data or learning rate tuning may be necessary when operating at high sparsity levels. Interestingly, at K=160, MSE fine- tuning alone achieves comparable or slightly better results than KL+MSE E2E training. We speculate that this could be due to the fact that MSE only is a simpler optimization objective which is sufficient at low sparsities. This may also be due to the fact that these experiments use identical data orderings for fair comparison between methods, meaning they do not benefit from shuffled activations, which could impact training dynamics. However, we still notice a benefit to the KL+MSE finetune for K=160. KL+MSE fine-tuning exceeds the performance of train- ing LoRA adapters on the LLM. Prior work has proposed two primary methods involving LoRA adapters (Chen et al., 2025): training adapters across all layers of the language model, and training adapters solely after the SAE layer. Training LoRA adapters across all model layers restricts ap- plicability mainly to TopK-based SAEs, as adapters placed before non-TopK SAEs can ”cheat” by decreasing sparsity 5 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 3: Comparison of KL+MSE finetuning (25M tokens) vs LoRA adapters on Gemma-2-2B with 65K width SAEs. due to the non-differentiability of L0 sparsity objectives. Un- der these conditions, our proposed SAE fine-tuning method achieves slightly better performance than training full-model LoRA adapters. Alternatively, applying LoRA adapters after the SAE avoids the sparsity concerns but leads to significantly weaker per- formance compared to our SAE fine-tuning approach. Addi- tionally, there are interpretability considerations: many inter- pretability workflows prefer analyzing the original, unmodi- fied base model, making SAE fine-tuning clearly preferable. Alternative lightweight adaptations can individually cap- ture significant portions of SAE fine-tuning performance, but their benefits fail to stack.We further investigated al- ternative lightweight adaptations post-SAE, such as adding low-rank linear transforms or small MLP layers with skip connections after the encoder. Although these methods in- dividually achieved meaningful reductions in cross-entropy loss—approximately 60% of the gains seen from SAE fine- tuning—we found that combining these adaptations sequen- tially after fine-tuning provided minimal incremental im- provement. This suggests that these methods primarily cap- ture overlapping benefits, making simple SAE fine-tuning preferable for practical applications due to reduced com- plexity. Full details and experiments are provided in Ap- pendix C. 4.2. SAEBench Results Fine-tuning yields mixed results on SAEBench metrics, dependent on both SAE architecture and evaluation met- ric.While enhancing the sparsity-fidelity trade-off and reducing cross-entropy loss are important objectives, they ultimately serve as proxy metrics for evaluating SAE utility in practical interpretability tasks. Recently, evaluation suites like SAEBench have emerged to assess SAE performance across diverse downstream interpretability tasks. Several tasks are of note: Spurious Correlation Removal (SCR)measures the ability to debias a spurious correlation from a classifier by ablat- ing identified latents.Targeted Probe Perturbation (TPP) measures the ability to degrade the performance of a tar- geted probe by ablating identified latents, with higher scores indicating that the SAE has latents which correspond to iden- tified concepts.Sparse Probingassesses whether individual SAE latents correlate strongly with identified concepts by training probes on single latents, where higher scores in- dicate better concept quality.Automated Interpretability uses an LLM judge to quantify the human-interpretability of selected latents using theDetection scoreproposed by Paulo et al. (2024).RAVEL(Resolving Attribute-Value En- tanglements in Language Models) measures how cleanly SAEs disentangle related attributes by testing whether tar- geted interventions on latents can selectively modify model outputs with respect to specific attributes (e.g., changing a city’s location) while preserving other related knowledge (e.g., the language spoken there). 6 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 4: SAEBench results for KL+MSE finetuning (15M tokens) on 65k width SAEs on Gemma-2-2B. 7 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need We evaluate our fine-tuning method using SAEBench and find substantial improvements in supervised metrics, no- tably sparse probing and RAVEL scores, for ReLU-based SAEs, as illustrated in Figure 4 and Appendix G. How- ever, we observe notable decreases in other metrics, such as SCR, suggesting potential trade-offs between different in- terpretability objectives. In contrast, fine-tuned TopK-based SAEs show smaller and more inconsistent changes across SAEBench metrics at both 16K and 65K widths—improving performance on SCR and TPP metrics, while decreasing on RAVEL and unlearning metrics. These experiments involved fine-tuning SAEs from the SAEBench baseline suite for 15 million tokens. While our method largely maintains sparsity levels during fine-tuning of ReLU-based SAEs through hyperparameter transfer, we found that a simple dynamic adjustment of the sparsity penalty (detailed in Appendix E) helps ensure consistent L0 sparsity across experiments. E2E SAEs achieve similar cross-entropy scores but dif- fer significantly on SAEBench metrics.Interestingly, as shown in Figure 12, E2E-trained SAEs exhibit notably lower performance across several SAEBench metrics, including RAVEL, Feature Absorption, TPP, and SCR, despite compa- rable cross-entropy scores. This discrepancy suggests that E2E SAEs might identify fundamentally different features capable of achieving similar KL divergence. Additionally, we observe modest performance degradation in TPP and SCR scores during KL fine-tuning. We hypothesize this degradation may be due to correlations between in-batch datapoints in the absence of activation shuffling, an effect that could potentially be mitigated by using much larger batch sizes. Future work could systemati- cally investigate this by conducting a batch size sweep for E2E SAE training. For now, our primary analysis focuses on fine-tuning the suite of SAEBench SAEs, which are trained on shuffled activations, reflecting current best practices. 5. Discussion Why do we observe significant differences between ReLU and TopK SAEs on SAEBench metrics?We propose two hypotheses that could explain this observation. First, ReLU- based SAEs generally exhibit significantly worse recon- struction accuracy, as reflected by higher cross-entropy loss, compared to TopK-based SAEs. This reduced reconstruc- tion accuracy might disproportionately impact downstream interpretability evaluations. Second, ReLU-based SAEs are known to suffer from the phenomenon known as ”shrinkage”, which is the system- atic underestimation of feature activations caused by the L1 penalty pushing activations towards zero. (Wright & Sharkey, 2024). Shrinkage may cause ReLU SAEs to learn qualitatively different—and potentially less faith- ful—features compared to those identified by TopK SAEs. As a result, ReLU SAEs may have greater room for improve- ment when fine-tuned using KL divergence, which directly aligns reconstruction with the model’s predictive priorities. For applications such as circuit analysis, improved recon- struction accuracy can directly enhance interpretability. For example, Marks et al. (2024) employed attribution patch- ing to quantify the contribution of each SAE feature to a metric of interest. Imperfect reconstruction necessitates incorporating an additional “error node” into the analysis, which is often among the most important contributors to the metric’s outcome. This introduces uncertainty about what crucial information might remain hidden within the error node. A recent effort (Ameisen et al., 2025) explores an even more ambitious direction: constructing “replacement mod- els” that substitute interpretable components for entire MLP layers, using cross-layer transcoders (a variant of SAEs). While promising, these models currently match the orig- inal model’s output in only around 50% of cases, even for relatively simple behaviors. While we do not conduct circuit-level evaluations in this work, we believe the signifi- cant reduction in KL divergence achieved by our method is likely to improve the fidelity of such replacement models or attribution-based analyses. 6. Limitations Mixed results on SAEBench indicate that the practical benefits of KL+MSE fine-tuning vary depending on SAE architecture and evaluation metrics.Our findings demon- strate that while some interpretability metrics improve, oth- ers decline, highlighting potential trade-offs and suggesting the need for careful consideration of architecture-specific and task-specific contexts when employing KL-based fine- tuning. Interpretability metrics may not fully capture the inter- nal representations that models actually use.Current automated interpretability metrics, including those provided by SAEBench, focus primarily on practical downstream in- terpretability tasks and tend to emphasize disentangled and easily interpretable features. However, these metrics may not fully capture the internal representations that models actually use, which could be inherently more entangled or complex. For example, it is possible that end-to-end trained SAEs, despite lower scores on SAEBench, might learn rep- resentations that are more faithful to the model’s internal computations but inherently less interpretable by current automated metrics. 8 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need 7. Conclusion In this work, we demonstrated that a short KL+MSE fine- tuning stage applied at the end of sparse autoencoder (SAE) training substantially improves cross-entropy loss, achiev- ing results comparable to full end-to-end training but at sig- nificantly reduced computational cost. This fine-tuning ap- proach could be particularly beneficial in applications where reconstructive accuracy directly enhances interpretability, such as circuit analysis. Despite these clear improvements in reconstruction fidelity, evaluations on SAEBench met- rics produced mixed results, indicating potential trade-offs between different interpretability objectives. Consequently, practitioners adopting KL+MSE fine-tuning should care- fully consider their specific interpretability goals and down- stream tasks. Future research could further clarify the rela- tionship between SAE training objectives and interpretabil- ity outcomes, guiding the development of more effective and interpretable feature representations. Acknowledgements Adam Karvonen is supported by a grant from Open Philan- thropy and a compute grant from Lambda Labs. The author would also like to thank Neel Nanda, Can Rager, Samuel Marks, Bart Bussmann, Patrick Leask, Josh Engels, and Matthew Chen for helpful discussions and feedback. 9 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need References Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., Citro, C., Abrahams, D., Carter, S., Hos- mer, B., Marcus, J., Sklar, M., Templeton, A., Bricken, T., McDougall, C., Cunningham, H., Henighan, T., Jermyn, A., Jones, A., Persic, A., Qi, Z., Ben Thompson, T., Zim- merman, S., Rivoire, K., Conerly, T., Olah, C., and Bat- son, J. Circuit tracing: Revealing computational graphs in language models.Transformer Circuits Thread, 2025. URLhttps://transformer-circuits.pub/ 2025/attribution-graphs/methods.html. Anthropic Interpretability Team.Training sparse au- toencoders.https://transformer-circuits. pub/2024/april-update/index.html# training-saes, 2024. [Accessed January 20, 2025]. Ayonrinde, K. Adaptive sparse allocation with mutual choice & feature choice sparse autoencoders, 2024. URL https://arxiv.org/abs/2411.02124. Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373. Braun, D., Taylor, J., Goldowsky-Dill, N., and Sharkey, L. Identifying functionally important features with end- to-end sparse dictionary learning, 2024. URLhttps: //arxiv.org/abs/2405.12241. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer- circuits.pub/2023/monosemantic-features/index.html. Bussmann, B., Leask, P., and Nanda, N.Batch- topk:Asimpleimprovementfortopk-saes, 2024a.URLhttps://w.alignmentforum. org/posts/Nkx6yWZNbAsfvic98/ batchtopk-a-simple-improvement-for-topk-saes. Bussmann, B., Leask, P., and Nanda, N.Learning multi-level features with matryoshka saes, December 19 2024b. URLhttps://w.alignmentforum. org/posts/rKM9b6B2LqwSB5ToN/ learning-multi-level-features-with-matryoshka-saes. Alignment Forum. Chen, M., Engels, J., and Tegmark, M. Low-rank adapting models for sparse autoencoders, 2025. URLhttps: //arxiv.org/abs/2501.19406. Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models, 2023. URLhttps:// arxiv.org/abs/2309.08600. Gao, L., Dupr ́ e la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093, 2024.URLhttps:// arxiv.org/abs/2406.04093. Gemma Team, Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ram ́ e, A., Ferret, J., Liu, P., Tafti, P., Friesen, A., Casbon, M., Ramos, S., Kumar, R., Lan, C. L., Jerome, S., Tsitsulin, A., Vieillard, N., Stanczyk, P., Girgin, S., Momchev, N., Hoffman, M., Thakoor, S., Grill, J.-B., Neyshabur, B., Bachem, O., Walton, A., Severyn, A., Par- rish, A., Ahmad, A., Hutchison, A., Abdagic, A., Carl, A., Shen, A., Brock, A., Coenen, A., Laforge, A., Pater- son, A., Bastian, B., Piot, B., Wu, B., Royal, B., Chen, C., Kumar, C., Perry, C., Welty, C., Choquette-Choo, C. A., Sinopalnikov, D., Weinberger, D., Vijaykumar, D., Rogozi ́ nska, D., Herbison, D., Bandy, E., Wang, E., Noland, E., Moreira, E., Senter, E., Eltyshev, E., Visin, F., Rasskin, G., Wei, G., Cameron, G., Martins, G., Hashemi, H., Klimczak-Pluci ́ nska, H., Batra, H., Dhand, H., Nar- dini, I., Mein, J., Zhou, J., Svensson, J., Stanway, J., Chan, J., Zhou, J. P., Carrasqueira, J., Iljazi, J., Becker, J., Fernandez, J., van Amersfoort, J., Gordon, J., Lipschultz, J., Newlan, J., yeong Ji, J., Mohamed, K., Badola, K., Black, K., Millican, K., McDonell, K., Nguyen, K., Sod- hia, K., Greene, K., Sjoesund, L. L., Usui, L., Sifre, L., Heuermann, L., Lago, L., McNealus, L., Soares, L. B., Kilpatrick, L., Dixon, L., Martins, L., Reid, M., Singh, M., Iverson, M., G ̈ orner, M., Velloso, M., Wirth, M., Davidow, M., Miller, M., Rahtz, M., Watson, M., Ris- dal, M., Kazemi, M., Moynihan, M., Zhang, M., Kahng, M., Park, M., Rahman, M., Khatwani, M., Dao, N., Bar- doliwalla, N., Devanathan, N., Dumai, N., Chauhan, N., Wahltinez, O., Botarda, P., Barnes, P., Barham, P., Michel, P., Jin, P., Georgiev, P., Culliton, P., Kuppala, P., Co- manescu, R., Merhej, R., Jana, R., Rokni, R. A., Agarwal, R., Mullins, R., Saadat, S., Carthy, S. M., Cogan, S., Perrin, S., Arnold, S. M. R., Krause, S., Dai, S., Garg, S., Sheth, S., Ronstrom, S., Chan, S., Jordan, T., Yu, T., Eccles, T., Hennigan, T., Kocisky, T., Doshi, T., Jain, V., Yadav, V., Meshram, V., Dharmadhikari, V., Barkley, W., Wei, W., Ye, W., Han, W., Kwon, W., Xu, X., Shen, Z., Gong, Z., Wei, Z., Cotruta, V., Kirk, P., Rao, A., Giang, M., Peran, L., Warkentin, T., Collins, E., Bar- 10 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need ral, J., Ghahramani, Z., Hadsell, R., Sculley, D., Banks, J., Dragan, A., Petrov, S., Vinyals, O., Dean, J., Has- sabis, D., Kavukcuoglu, K., Farabet, C., Buchatskaya, E., Borgeaud, S., Fiedel, N., Joulin, A., Kenealy, K., Dadashi, R., and Andreev, A. Gemma 2: Improving open language models at a practical size, 2024. URL https://arxiv.org/abs/2408.00118. Karvonen, A., Wright, B., Rager, C., Angell, R., Brinkmann, J., Smith, L., Verdun, C. M., Bau, D., and Marks, S. Measuring progress in dictionary learning for language model interpretability with board game models, 2024. URLhttps://arxiv.org/abs/2408.00113. Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J., Chanin, D., Lau, Y.-T., Farrell, E., McDougall, C., Ayon- rinde, K., Wearden, M., Conmy, A., Marks, S., and Nanda, N. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URLhttps://arxiv.org/abs/2503.09532. Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kram ́ ar, J., Dragan, A., Shah, R., and Nanda, N. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024. URLhttps: //arxiv.org/abs/2408.05147. Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language mod- els, 2024. URLhttps://arxiv.org/abs/2403. 19647. Mudide, A., Engels, J., Michaud, E. J., Tegmark, M., and Schroeder de Witt, C. Efficient dictionary learn- ing with switch sparse autoencoders.arXiv preprint arXiv:2410.08201, 2024.URLhttps://arxiv. org/abs/2410.08201. Paulo, G., Mallen, A., Juang, C., and Belrose, N. Auto- matically interpreting millions of features in large lan- guage models, 2024. URLhttps://arxiv.org/ abs/2410.13928. Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kram ́ ar, J., Shah, R., and Nanda, N. Improving dictionary learning with gated sparse autoencoders.arXiv preprint arXiv:2404.16014, 2024a. URLhttps:// arxiv.org/abs/2404.16014. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kram ́ ar, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.arXiv preprint arXiv:2407.14435, 2024b. URLhttps://arxiv.org/abs/2407.14435. Taggart, G. M. Prolu: A nonlinearity for sparse autoen- coders, 2024.https://w.alignmentforum. org/posts/HEpufTdakGTTKgoYF/ prolu-a-nonlinearity-for-sparse-autoencoders . Wright, B. and Sharkey, L. Addressing feature suppression insaes.https://w.alignmentforum. org/posts/3JuSjTZyMzaSeTxKk/ addressing-feature-suppression-in-saes , Feb 2024. 11 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need A. Experiment Details A.1. Training from Scratch vs. End-to-End For comparisons between training from scratch (MSE + KL fine-tuning) and full end-to-end (E2E) training, all SAEs were trained on a total of 500 million tokens. In the fine-tuning approach, KL+MSE training was performed only on the final 5% (25 million tokens) of the training budget. We employed a constant learning rate of5×10 −5 . All SAEs were trained with the same dataset, using identical data ordering, a sequence length of 1024, and a batch size of 1. To facilitate stable and fair comparisons without retuning the learning rate, we dynamically balanced the magnitude of the KL and MSE losses, ensuring that the total magnitude of the KL+MSE loss closely matched that of the original MSE-only loss. We conducted experiments across two model scales. For Gemma-2-2B, we trained SAEs with width 65K at four different sparsity levels (K=20, 40, 80, 160). For Pythia-160M, we trained a single configuration with K=80 and width=16K to validate our findings on a smaller model. We did not apply early stopping to the MSE-only training phase, which could have halved the cost of activation collection. A.2. Fine-Tuning SAEBench SAEs For fine-tuning the SAEBench SAEs, we trained for 15 million tokens using a sequence length of 1024 and a batch size of 2. We used an initial learning rate of5×10 −5 , linearly decayed to 0 throughout training. A.3. LoRA Adapter Comparison When comparing our SAE fine-tuning approach against LoRA adapters, training was performed for 25 million tokens with a sequence length of 1024 and a batch size of 2. We used an initial learning rate of5×10 −5 , linearly decayed to 0 by the end of training. Our LoRA adapters were rank 16. All additional comparisons (LoRA adapters, linear adapters, KL-only training) were conducted using the SAEBench baseline TopK SAE with K=80 and width 65K, which was trained on layer 12 of the Gemma-2-2B model. Our training dataset was the Pile, matching the SAEBench training dataset. B. KL+MSE vs. KL only (a) Training MSE comparison(b) Validation loss comparison Figure 5: Comparison of training with KL+MSE loss versus KL-only loss. There is virtually no difference in validation loss between the two methods, while KL only shows significantly worse MSE on the training set. 12 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need C. Alternative Adapter Experiments We explored additional mechanisms to further reduce cross-entropy loss beyond SAE fine-tuning. Specifically, we experimented with two lightweight adaptation methods applied after a pre-trained sparse autoencoder (SAE): Low-Rank Linear Transform with Skip Connection: We introduced a low-rank linear layer defined as follows: y=x+UV x wherexis the SAE output,U∈R d×r andV∈R r×d are learnable matrices with rankr≪d. Following LoRA-style initialization,Vis initialized to zero, ensuring the adaptation has no effect at the start of training. Small MLP with Skip Connection: We tested a simple two-layer MLP with a small hidden dimension and ReLU activations, again with a residual connection: y=x+W 2 ReLU(W 1 x+b 1 ) +b 2 whereW 1 ∈R h×d ,W 2 ∈R d×h (zero-initialized),b 1 ∈R h , andb 2 ∈R d (zero-initialized), with hidden dimensionh≪d. The zero initialization ofW 2 andb 2 ensures the residual connection initially passes through the input unchanged. Results: Both methods individually yielded significant improvements in cross-entropy loss—roughly 60% of the improve- ment achievable via SAE fine-tuning alone (Figure 6a). To evaluate whether the gains from these methods would stack with SAE fine-tuning, we performed an additional two-stage fine-tuning experiment. First, we fine-tuned the SAE, and subsequently fine-tuned the low-rank linear transform or MLP on top of the already fine-tuned SAE. As shown in Figure 6b, we observed minimal additional improvement from this sequential approach, suggesting these methods primarily capture similar underlying improvements. Conclusion: These results suggest that while lightweight adaptation methods can individually reduce cross-entropy loss, their benefits do not significantly stack with SAE fine-tuning. Therefore, to minimize complexity, we recommend using SAE fine-tuning alone in practice. (a) Individual adaptation methods compared to SAE fine-tuning(b) Sequential two-stage training Figure 6: Comparison of adaptation methods. SAE fine-tuning alone provides most improvements; additional methods yield minimal incremental gains. 13 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need D. LoRA Adapters on Sparse Autoencoders We additionally experimented with training LoRA adapters directly on the SAE weights, instead of performing a full fine-tune of the SAE parameters. Our results indicate that fully fine-tuning the SAE consistently yielded better performance. However, we found that even a very low-rank LoRA adapter (rank 2) could capture more than half of the performance gains achievable through a full fine-tune. This suggests that a substantial portion of the SAE reconstruction error is due to a relatively simple and easily correctable issue. Figure 7: Comparison of a full fine-tune of the SAE versus rank 2 and rank 64 LoRA adapters. E. ReLU Sparsity Penalty Adjustments Using our adaptive balancing between KL and MSE losses, we would expect perfect transfer of sparsity penalties. However, we observed some deviations, possibly due to unshuffled activations. To ensure fair evaluation by maintaining consistent L0 sparsity levels across experiments, we implemented a simple dynamic sparsity penalty adjustment: defadjust_l1_penalty(current_l0, target_l0, l1_penalty): """Dynamically adjust L1 penalty to maintain target L0 sparsity.""" adjustment_rate = 0.001 ifcurrent_l0 < target_l0: l1_penalty * = (1 - adjustment_rate) else: l1_penalty * = (1 + adjustment_rate) returnl1_penalty This controller adjusts the L1 penalty during training to maintain the desired L0 sparsity level. While simple, we found this approach to be effective and stable across our experiments. This sparsity control approach was adapted from an implementation shared by Glen Taggart. 14 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need F. Batch Size Investigation When training with KL divergence loss, activation shuffling is not possible, resulting in correlated activations from sequential tokens. To investigate the impact of batch size on this correlation, we conducted experiments comparing fine-tuning of 65K width TopK SAEs from the SAEBench suite using two different configurations: a batch size of 2 with sequence length 1024, and a batch size of 32 with the same sequence length. Both experiments were conducted over 25 million tokens. Analysis of SAEBench metrics revealed only minor, insignificant differences between the two batch sizes. Figure 8: SAEBench results for 65K width TopK SAEs using KL+MSE finetuning with batch size 2 and 32. 15 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need G. Further SAEBench Results G.1. SCR and TPP, number of latents ablated hyperparameter sweep The Spurious Correlation Removal (SCR) and Targeted Probe Perturbation (TPP) metrics each have a hyperparameter k that determines the number of SAE latents to ablate. In the main results, following the SAEBench paper, we present results using k=20. Here, we analyze the impact of this hyperparameter by sweeping across multiple k values and selecting the best performing configuration. This analysis reveals more substantial improvements for fine-tuned TopK SAEs than previously shown, particularly at larger model widths. However, the general trends for ReLU SAEs remain consistent with our main findings: worse SCR and TPP scores. These results reinforce our conclusion that the benefits of KL+MSE fine-tuning vary significantly with architecture choice. (a) SCR, 65K width(b) TPP, 65K width Figure 9: SCR and TPP, number of latents ablated hyperparameter sweep. We sweep across multiple k values and choose the best result obtained. We see more substantial improvements for fine-tuned TopK SAEs, especially larger sizes. (a) SCR, 16K width(b) TPP, 16K width Figure 10: SCR and TPP, number of latents ablated hyperparameter sweep. We sweep across multiple k values and choose the best result obtained. We see more substantial improvements for fine-tuned TopK SAEs, especially larger sizes. 16 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 11: SAEBench results for KL+MSE finetuning (15M tokens) on 16k width SAEs on Gemma-2-2B. 17 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need Figure 12: SAEBench results for SAEs trained using MSE only on 475M tokens, E2E training on 500M tokens, and MSE followed by KL+MSE finetuning on 25M tokens. 18 Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need H. Weight Stability during KL+MSE fine-tuning We analyzed the stability of SAE weights during KL+MSE finetuning by measuring cosine similarities between pre- and post-finetuning weights for both encoder and decoder matrices. This analysis was performed across two architectures (TopK and ReLU) and two width configurations (16K and 65K). Interestingly, we observed different patterns of weight stability across configurations. In both TopK and 65K ReLU models, the decoder weights showed greater changes (lower cosine similarities) compared to encoder weights during finetuning. However, the 16K ReLU model exhibited an unexpected pattern where encoder weights underwent more substantial changes than decoder weights. This difference in behavior between the 16K and 65K ReLU models suggests that model width may play a significant role in how weights adapt during KL+MSE finetuning. The plots below show the 25th-75th percentile ranges of cosine similarities for the encoder and decoder weights over the course of finetuning. Higher values indicate greater stability, with a cosine similarity of 1.0 representing perfectly preserved weights. (a) 16K width TopK SAEs(b) 16K width ReLU SAEs Figure 13: Weight stability during KL+MSE finetuning. The plots show the 25th-75th percentile ranges of cosine similarities for the encoder and decoder weights of the TopK and ReLU SAEs. The 16K ReLU model exhibits an unexpected pattern where encoder weights underwent more substantial changes than decoder weights. (a) 65K width TopK SAEs(b) 65K width ReLU SAEs Figure 14: Weight stability during KL+MSE finetuning. The plots show the 25th-75th percentile ranges of cosine similarities for the encoder and decoder weights of the TopK and ReLU SAEs. 19