
Paper deep dive

Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Davide Ghilardi, Federico Belotti, Marco Molinari, Jaehyuk Lim

Year: 2024 · Venue: BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (EMNLP 2024) · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 60

Models: Pythia-160M

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 6:05:48 PM

Summary

This paper investigates the use of layer-wise transfer learning to accelerate the training of Sparse Autoencoders (SAEs) for Large Language Models (LLMs). By initializing SAEs with weights from adjacent layers (Forward and Backward approaches) and fine-tuning them, the authors demonstrate significant reductions in computational requirements (up to 46%) while maintaining or improving reconstruction quality, faithfulness, and completeness compared to training from scratch.

Entities (4)

Pythia-160M · large-language-model · 100%
Sparse Autoencoder · model-architecture · 100%
Transfer Learning · methodology · 98%
Mechanistic Interpretability · field-of-study · 95%

Relation Signals (3)

Transfer Learning accelerates Sparse Autoencoder

confidence 98% · Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Pythia-160M evaluated on Sparse Autoencoder

confidence 95% · We tested this hypothesis on Pythia-160M... By reusing the representations learned in earlier layers, computational demands of training can be reduced

Sparse Autoencoder used for Mechanistic Interpretability

confidence 95% · Sparse AutoEncoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs).

Cypher Suggestions (2)

Find all models and the interpretability techniques applied to them. · confidence 90% · unvalidated

MATCH (m:Model)-[:USES_TECHNIQUE]->(t:InterpretabilityTechnique) RETURN m.name, t.name

Identify methods used to accelerate model training. · confidence 90% · unvalidated

MATCH (m:Methodology)-[:ACCELERATES]->(t:TrainingTask) RETURN m.name, t.name

Abstract

Davide Ghilardi, Federico Belotti, Marco Molinari, Jaehyuk Lim. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. 2024.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · mechanistic-interp (suggested, 92%)

Links

Full Text

59,594 characters extracted from source content.


Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 530–550, November 15, 2024. ©2024 Association for Computational Linguistics

Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Davide Ghilardi 1*, Federico Belotti 1*, Marco Molinari 2,4*, Jaehyuk Lim 2,3
1 University of Milan-Bicocca, 2 LSE.AI, 3 University of Pennsylvania, 4 London School of Economics
* Equal contribution. Correspondence: d.ghilardi@campus.unimib.it

Abstract

Sparse AutoEncoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs). However, training SAEs can be computationally intensive, especially as model complexity grows. In this study, the potential of transfer learning to accelerate SAE training is explored by capitalizing on the shared representations found across adjacent layers of LLMs. Our experimental results demonstrate that fine-tuning SAEs using pre-trained models from nearby layers not only maintains but often improves the quality of learned representations, while significantly accelerating convergence. These findings indicate that the strategic reuse of pre-trained SAEs is a promising approach, particularly in settings where computational resources are constrained.

1 Introduction

Transformer-based models have become ubiquitous in a large variety of application fields (Dubey et al., 2024; Kirillov et al., 2023; Radford et al., 2023; Chen et al., 2021; Zitkovich et al., 2023; Waisberg et al., 2023). Given their tremendous impact on society, concerns about their interpretability have been raised by various stakeholders (Bernardo, 2023).
Mechanistic Interpretability (MI) (Conmy et al., 2023; Nanda et al., 2023) seeks to reverse-engineer how Neural Networks, and in particular LLMs, generate outputs by uncovering the circuits they have learned during training, stored inside their parameters, and executed during a forward pass (Nanda et al., 2023; Conmy et al., 2023; Gurnee et al., 2023). A promising interpretability technique is dictionary learning (Cunningham et al., 2023; Gao et al., 2024; Karvonen et al., 2024), which seeks to capture interpretable and editable features within the internal layers of LLMs. This method implies training Sparse Autoencoders (SAEs) to reconstruct the model's activations using sparse learned features.

Figure 1: Visualization of our method. From left to right: baseline method, where each Sparse AutoEncoder (SAE) is trained from scratch (solid line); forward method, where SAEs are initialized with weights from the previous layer's SAE and fine-tuned (dashed line) with the new layer activations; backward method, where SAEs are initialized with weights from the following layer's SAE.

However, training SAEs is computationally intensive, particularly when applied across multiple layers in deep networks. This computational burden poses a significant barrier to their widespread application, especially in resource-constrained environments where the cost of training from scratch is prohibitive. Recent research has highlighted the potential of transfer learning as a strategy to mitigate these challenges (Kissane et al., 2024). In particular, it has been shown in Gromov et al. (2024) that adjacent layers in LLMs are often redundant, suggesting that the knowledge encoded in one layer is also present in neighboring ones and that it can effectively be transferred. This observation forms the basis of our investigation: we hypothesize that SAEs trained on one set of layers can serve as effective initialization for SAEs designed for closely related layers.
Specifically, the forward approach is defined as initializing an SAE with the weights of a previous layer's SAE, and the backward approach as initializing an SAE with the weights of a subsequent layer's SAE. The overall training procedure is summarized in Figure 1. We tested this hypothesis on Pythia-160M, a small 12-layer decoder-only transformer from the Pythia family (Biderman et al., 2023). By reusing the representations learned in earlier layers, computational demands of training can be reduced by at least 25% [1] while maintaining, or even improving, the quality of the resulting models.

Our contributions are as follows:

• We demonstrate that SAEs exhibit partial transfer to adjacent layers in a zero-shot setting, though fine-tuning is recommended for optimal performance.
• We show that both Forward-SAEs and Backward-SAEs, when fine-tuned on adjacent activations, consistently transfer across all tested checkpoints, achieving comparable or superior performance to SAEs trained from scratch, while using significantly less training data.
• We train and publicly release SAEs for Pythia-160M (Biderman et al., 2023), the model utilized in this study. Code, data, and trained models will be publicly released after the double-blind review.

2 Background and objectives

2.1 Linear representation hypothesis and superposition

Although it has been demonstrated that LLMs represent some of their features linearly (Park et al., 2024), a key challenge in LLM interpretability is the lack of clear neuron interpretation. Recent work of Elhage et al. (2022) tries to explain this phenomenon by showing that models can use n-dimensional activations to represent m ≫ n sparse, almost-orthogonal features in superposition. Superposition theory is based on three key concepts: (i) the existence of a hypothetical large and disentangled model where each neuron perfectly aligns with a single feature, with each neuron activating for exactly one feature at a time.
The observed models can be thought of as dense, almost-orthogonal projections of this larger, ideal model. (ii) Features are sparse, reflecting the idea that in the natural world, many features are inherently sparse. (iii) The importance of features varies depending on the task at hand. These assumptions, combined with two mathematical principles [2], suggest that the hidden sparse features can be recovered by projecting the dense model back to the hypothetical large and disentangled one. SAEs serve this purpose: learning a set of sparse, interpretable, and high-dimensional features from an observed model's dense and superposed activations.

[1] Assuming training half of the SAEs from scratch and the other half with transfer from an adjacent layer using half of the training tokens.

2.2 Sparse Autoencoders

Recently, Sparse AutoEncoders have become a popular tool in Large Language Model (LLM) interpretability as they effectively decompose neuron activations into interpretable features (Bricken et al., 2023; Cunningham et al., 2023). For a given input activation x ∈ R^{d_model}, the SAE computes a reconstruction x̂ as a sparse linear combination of d_sae ≫ d_model features v_i ∈ R^{d_model}. The reconstruction is given by:

    (x̂ ∘ f)(x) = W_d f(x) + b_d    (1)

where the v_i are the columns of W_d, b_d is the bias term of the decoder, and f(x) are the feature activations. The latter are computed as:

    f(x) = ReLU(W_e (x − b_d) + b_e)    (2)

where b_e is the encoder bias term. SAEs are trained to minimize the following loss function:

    L_sae = ||x − x̂||_2^2 + λ ||f(x)||_1    (3)

In Equation 3, the first term corresponds to the reconstruction error, to which an ℓ1 regularization term on the activations f(x) is added to promote sparsity in the feature activations. In SAE training, it is common to set d_sae = c · d_model with c ∈ {2^n | n ∈ N+}, so the training process of an SAE can become computationally intensive, particularly as model size increases.
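Equations (1)-(3) translate directly into a few lines of NumPy. The sketch below is illustrative only: the toy dimensions, random weights, and λ value are assumptions for the example, not the paper's training configuration.

```python
import numpy as np

def sae_forward(x, W_e, b_e, W_d, b_d):
    """Eqs. (1)-(2): f(x) = ReLU(W_e (x - b_d) + b_e), x_hat = W_d f(x) + b_d."""
    f = np.maximum(W_e @ (x - b_d) + b_e, 0.0)   # sparse, non-negative feature activations
    x_hat = W_d @ f + b_d                        # reconstruction of the input activation
    return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-3):
    """Eq. (3): squared reconstruction error plus an l1 sparsity penalty on f(x)."""
    return float(np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f)))

# Toy sizes: d_model = 4, d_sae = 16 (expansion factor c = 4); weights are random stand-ins.
rng = np.random.default_rng(0)
d_model, d_sae = 4, 16
W_e = rng.normal(size=(d_sae, d_model)) * 0.1
W_d = rng.normal(size=(d_model, d_sae)) * 0.1
b_e, b_d = np.zeros(d_sae), np.zeros(d_model)
x = rng.normal(size=d_model)
x_hat, f = sae_forward(x, W_e, b_e, W_d, b_d)
loss = sae_loss(x, x_hat, f)
```

Note how the decoder bias b_d is subtracted before encoding, matching Eq. (2); the ReLU keeps feature activations non-negative, which the ℓ1 term then pushes toward zero.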
For example, training a single SAE for a widely used model such as Llama-3-8b (Dubey et al., 2024) (d_model = 4096) with an expansion factor of c = 32 (i.e., d_sae = 131072) requires ≈ 1B parameters.

[2] The Johnson–Lindenstrauss lemma, which ensures that points in a high-dimensional space can be embedded into a lower dimension while almost preserving distances, and compressed sensing, which exploits sparsity to recover signals from fewer samples than required by the Nyquist–Shannon theorem.

Table 1: Pythia-160M model specifics
  Layers (L): 12
  Model dimension (d_model): 768
  Heads (H): 12
  Non-embedding params: 85,056,000
  Equivalent models [3]: GPT-Neo, OPT-125M

Under these circumstances, transfer learning is a useful resource to reduce the number of trained SAEs. The transfer can happen intra-model, where SAE training is shared between layers of the same model (our case), or inter-model, where SAEs are shared between different fine-tuned versions of the same model, as shown in Kissane et al. (2024).

2.3 Evaluating SAEs

Evaluating SAEs and the features they have learned presents significant challenges. In our work, the techniques employed can be divided into reconstruction and interpretability metrics. The first includes:

• The Cross-Entropy Loss Score (CES), defined as

    CES = (CE(ζ) − CE(x̂ ∘ f)) / (CE(ζ) − CE(Id))    (4)

  where x̂ ∘ f is the autoencoder function, ζ: x → 0 the zero-ablation function, and Id: x → x the identity function. According to this definition, an SAE gets a CES equal to 1 if it perfectly reconstructs x (> 1 if it improves the CE loss), ≤ 0 when the reconstruction is not better than zero-ablation; otherwise the score lies in the unit interval.

• The L2 loss (reconstruction loss) is the first term of Equation 3, which measures the reconstruction error made by the SAE.
• The L0 loss of the learned features, defined as

    ||f||_0 = Σ_{j=1}^{|f|} I[f_j ≠ 0]    (5)

  which represents the number of non-zero SAE features used to compute the reconstruction.

[3] As specified in Biderman et al. (2023).

Measuring the quality of the features learned by an SAE is not straightforward, and multiple strategies exist. As reported in Makelov et al. (2024), interpretability metrics can be categorized as follows:

• Indirect Geometric Measures: Sharkey et al. (2023) proposed using mean maximum cosine similarity (MMCS) between features learned by different SAEs to assess their quality. Given two feature dictionaries D and D′, with |D| = |D′|, MMCS is defined as:

    MMCS(D, D′) = (1/|D|) Σ_{u∈D} max_{v∈D′} CosSim(u, v)    (6)

• Auto-Interpretability: Bricken et al. (2023), Bills et al. (2023), and Cunningham et al. (2023) used LLMs to generate natural-language descriptions of SAE features based on highly activating examples and measured interpretability as the prediction quality on previously unseen text.

• Manually Crafted Proxies for Ground Truth: Bricken et al. (2023) developed computational proxies for a set of SAE features, relying on manually formulated hypotheses.

• Faithfulness and Completeness of task feature circuits: Marks et al. (2024) compute faithfulness and completeness as measures to estimate the task sufficiency and necessity of learned SAE features. In particular, given a task, they first compute a circuit C of SAE features by selecting them according to their importance, estimated via their Indirect Effect [4] (Pearl, 2022):

    IE(m; f; a_c, a_w) = m[M(a_c | do(f = f_w), x); M(a_c | x)]    (7)

  where x is a given prompt and m: R^{d_vocab} → R is the logit-difference computed by an LLM M over two contrastive answer tokens a_c, a_w. [5]
  In this equation, f_w represents the SAE feature activations during the computation of M(a_w | x), and M(a_c | do(f = f_w), x) refers to the value of M(a_c) under an intervention where the activation of feature f is set to f_w. Then, they estimate the faithfulness as

    (m(C) − m(∅)) / (m(M) − m(∅))    (8)

  where m(C) is the model logit difference when using only the important SAE features while mean-ablating the others; m(M) and m(∅) represent the logit-difference achieved by the model alone and with the mean-ablated SAE reconstructions, respectively. Completeness is estimated by replacing m(C) with m(M \ C) in Equation 8. Intuitively, faithfulness captures the proportion of the model's performance the circuit C explains, relative to mean-ablating the full model, thus modeling sufficiency. On the other hand, completeness captures the necessity of the learned features by measuring low downstream performance whenever the important SAE features are mean-ablated.

• Supervised Dictionary Benchmarking: Makelov et al. (2024) introduced a technique that benchmarks unsupervised SAE dictionaries against supervised dictionaries based on task-relevant attributes to ensure extracted features are interpretable and relevant to specific tasks.

[4] We estimate the IE through Attribution Patching (AtP) (Syed et al., 2023; Nanda, 2023). A formal definition of AtP is given in Appendix A.
[5] E.g., x = "The square root of 9 is", a_c = 3, and a_w = 2.

In our work, evaluation metrics employed include all the reconstruction techniques listed above, the MMCS between features from SAEs trained with transfer learning and the ones from SAEs trained from scratch, and a Human Interpretability Score defined in Section 3. Moreover, we evaluate both faithfulness and completeness on three standard downstream tasks: Indirect Object Identification (IOI) (Wang et al., 2023), Greater Than (Hanna et al., 2023), and Subject-Verb Agreement (Marks et al., 2024), all of them comprising a set of examples in the form (x, a_c, a_w)_i.
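The scalar metrics above reduce to short helper functions. The sketch below implements Eqs. (4), (6), and (8) under simplifying assumptions; all input values in the sanity checks are invented for illustration, not results from the paper.

```python
import numpy as np

def ce_loss_score(ce_sae: float, ce_zero: float, ce_identity: float) -> float:
    """Eq. (4): CES = (CE(zero-ablation) - CE(SAE)) / (CE(zero-ablation) - CE(identity)).
    Equals 1 for a perfect reconstruction, <= 0 when no better than zero-ablation."""
    return (ce_zero - ce_sae) / (ce_zero - ce_identity)

def mmcs(D: np.ndarray, D_prime: np.ndarray) -> float:
    """Eq. (6): mean over features u in D of the max cosine similarity to any v in D'.
    Dictionaries are arrays of shape (n_features, d_model)."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    Dpn = D_prime / np.linalg.norm(D_prime, axis=1, keepdims=True)
    return float((Dn @ Dpn.T).max(axis=1).mean())

def faithfulness(m_circuit: float, m_model: float, m_ablated: float) -> float:
    """Eq. (8): (m(C) - m(empty)) / (m(M) - m(empty)).
    Completeness is obtained by passing m(M \\ C) in place of m(C)."""
    return (m_circuit - m_ablated) / (m_model - m_ablated)

# Sanity checks on made-up values: a perfect SAE scores CES = 1, and identical
# dictionaries give MMCS = 1 regardless of feature ordering.
assert ce_loss_score(ce_sae=3.2, ce_zero=10.0, ce_identity=3.2) == 1.0
D = np.random.default_rng(0).normal(size=(8, 4))
assert abs(mmcs(D, D[::-1]) - 1.0) < 1e-9
assert faithfulness(m_circuit=2.5, m_model=2.5, m_ablated=0.4) == 1.0
```

Because MMCS takes a max over the second dictionary for each feature of the first, it is order-invariant but not symmetric in its arguments.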
Additionally, for the faithfulness and completeness computation we fix the number of top important features N throughout all the experiments: for faithfulness we let N vary in {123, 246, 368, 492}, which correspond to 2%, 4%, 6%, and 8% of top active features; for completeness, N varies in {4, 36, 68, 100}. [6] Finally, in Appendix B we report the Direct Logit Attribution (DLA), as specified by Bricken et al. (2023).

[6] Top important features are computed on a per-example basis.

Figure 2: Cross-Entropy Loss Score (CE-Loss Score) (Eq. 4), where cell (i, j) in the plot represents the CE-Loss Score obtained by reconstructing the activations from layer i with SAE_j. This plot has to be read column-wise.

2.4 Transfer Learning

Transfer learning (Goodfellow et al., 2016) is a powerful technique in machine learning where knowledge gained from one task is applied to improve performance on a related, but distinct, task. This approach is particularly useful when training from scratch is computationally expensive or when labeled data is scarce. In the context of SAEs for LLMs, transfer learning enables the reuse of weights learned in one layer to initialize and accelerate the training of SAEs in adjacent layers.

2.5 Objectives

In this work, the transferability and generalization of intra-model SAEs have been studied, aiming to answer the following research questions:

Q1. Are SAEs transferable between layers? I.e., can an SAE trained on the activations of layer i be reused to reconstruct activations of layer j ≠ i?

Q2. Is Transfer Learning applicable to SAEs? Specifically, can an SAE initialized with the weights of a neighboring SAE and then fine-tuned achieve equal or superior performance, potentially using only a fraction of the data, compared to an SAE trained from scratch?

Figure 3: Average CE-Loss Score, L2-Loss, and L0-Loss. The average is computed over layers for a single checkpoint. The "No Transfer" average is computed considering the performance obtained by SAE_i(x_i), ∀i = 0, ..., 11.
3 Experimental setup

To address the questions raised in Section 2, we first trained from scratch one SAE_i for each layer i of Pythia-160M, a 12-layer decoder-only Transformer model from the Pythia family (Biderman et al., 2023). Each SAE was trained using the JumpReLU activation function (Rajamanoharan et al., 2024), with activations taken from the corresponding layer's residual stream after the MLP contribution. The model configuration details are provided in Table 1. Let also j ≠ i be another layer index. Then SAE_{i←j} is defined as the SAE initialized with weights from the j-th SAE and fine-tuned with activations of the i-th layer. In particular, this work is focused on SAE_{i←i−1} and SAE_{i←i+1}, named Forward-SAE (Fwd-SAE) and Backward-SAE (Bwd-SAE) respectively. Figure 1 summarizes the overall training and fine-tuning procedure, with the hyperparameters specified in Table 2. The dataset adopted for both training and fine-tuning is the Pile-small-2b [7], an already tokenized version of the Pile dataset (Gao et al., 2020) with a total of 2b tokens. To effectively measure the reconstruction performance of an SAE before and after fine-tuning with transfer learning, the normalized CE-Loss Score is adopted, defined as:

    CES_{i,j} = (CES(SAE_{i←j}(x_i)) − CES(SAE_j(x_i))) / (CES(SAE_i(x_i)) − CES(SAE_j(x_i)))    (9)

assuming CES(SAE_j(x_i)) and CES(SAE_i(x_i)) to be, respectively, the lower and the upper bound for the CES on x_i. With the definitions above, CES_{i,i−1} and CES_{i,i+1} are the normalized CE-Loss Scores of the Fwd-SAE and Bwd-SAE respectively.

[7] https://huggingface.co/datasets/NeelNanda/pile-small-tokenized-2b

Finally, to evaluate feature quality, a Human Interpretability Score has been defined as the ratio of features that have been evaluated interpretable by human annotators. To generate the score, all the SAEs have been run on approximately 1M tokens randomly sampled from the training dataset.
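Eq. (9) rescales the fine-tuned transfer SAE's score between the zero-shot source SAE (lower bound) and the from-scratch SAE (upper bound). A minimal helper, with invented CES values for illustration:

```python
def normalized_ces(ces_transfer: float, ces_scratch: float, ces_zero_shot: float) -> float:
    """Eq. (9): CES_{i,j} = (CES(SAE_{i<-j}(x_i)) - CES(SAE_j(x_i)))
                          / (CES(SAE_i(x_i)) - CES(SAE_j(x_i))).
    ces_zero_shot is the lower bound (source SAE applied zero-shot to layer i),
    ces_scratch the upper bound (SAE trained from scratch on layer i)."""
    return (ces_transfer - ces_zero_shot) / (ces_scratch - ces_zero_shot)

# A fine-tuned transfer SAE matching the from-scratch SAE scores 1;
# one that never improved on the zero-shot source scores 0.
print(normalized_ces(ces_transfer=0.95, ces_scratch=0.95, ces_zero_shot=0.60))  # 1.0
print(normalized_ces(ces_transfer=0.60, ces_scratch=0.95, ces_zero_shot=0.60))  # 0.0
```

Values above 1 are possible when the fine-tuned transfer SAE outperforms the from-scratch one, which the paper reports for some checkpoints.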
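The transfer step itself, initializing SAE_{i←j} from a trained SAE_j before fine-tuning, amounts to a weight copy. The sketch below is a hypothetical reading of that procedure; the function name, parameter-dict layout, and toy values are this example's assumptions, not the authors' code.

```python
import numpy as np

def init_transfer_sae(source_params: dict) -> dict:
    """Initialize SAE_{i<-j} by copying all parameters from a trained SAE_j.
    For the forward method j = i - 1; for the backward method j = i + 1.
    Fine-tuning on layer i's activations then proceeds from this copy."""
    return {name: value.copy() for name, value in source_params.items()}

# Toy source SAE (layer i - 1) with d_model = 4, d_sae = 16.
sae_prev = {"W_e": np.ones((16, 4)), "b_e": np.zeros(16),
            "W_d": np.ones((4, 16)), "b_d": np.zeros(4)}
fwd_sae = init_transfer_sae(sae_prev)   # Forward-SAE initialization for layer i
fwd_sae["W_e"] += 0.1                   # a fine-tuning update modifies the copy...
assert np.all(sae_prev["W_e"] == 1.0)   # ...without mutating the source SAE
```

Copying rather than aliasing matters here: the source SAE stays usable for its own layer while the transfer SAE diverges under fine-tuning.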
With their activations, max activating tokens and top/bottom attribution logits have been computed and analyzed by the labelers.

4 Results

4.1 SAE transferability

Figure 2 shows the CE-Loss Score achieved by every SAE_j reconstructing the activations of layer i, for every i, j = 0, ..., L−1, i.e., the zero-shot setting. It is clear that a certain degree of transferability exists between SAE_j and the activations of adjacent layers, with this being more noticeable when i = j−1 (i.e., SAEs are more effective at reconstructing the activations of preceding layers than those of subsequent ones). These findings can also be attributed to the fact that, as demonstrated by Gromov et al. (2024), angular distances between adjacent layers are smaller, enabling neighboring SAEs to operate on a similar basis with respect to the activations they were trained on. The answer to Q1 is, therefore, yes; however, although transferability between layers exists, it remains partial and, potentially, not completely reliable for downstream applications.

4.2 SAE transfer learning

Figure 4: Average Faithfulness and Completeness. The average is computed over layers and the number of important active SAE features for a single checkpoint. The "No Transfer" average is computed considering the performance obtained by SAE_i(x_i), ∀i = 0, ..., 11.

Figure 3 shows all reconstruction metrics averaged over all layers across every tested checkpoint. Detailed results for single layers and aggregated over time can be found in Appendix C (Figures 9–17), along with the normalized CE-Loss Score (Eq. 9) in Tables 3 and 4. Looking at the plots, it can be seen that forward and backward SAEs achieve almost equal or even superior performance compared to the ones trained from scratch with as little as 1/10th (100M tokens) of the original training data (1B tokens), with the scores constantly increasing with the number of tokens used for fine-tuning.
As a result, it can be said that both forward and backward transfer are effective strategies to reduce the number of SAEs trained from scratch. Between the two, the backward technique is the one that constantly shows better results, both in terms of CE-Loss Score, L2, and L1 loss. So, the answer to Q2 is also yes if we just consider the reconstruction metrics. To fully respond to Q2 beyond reconstruction performance, the quality of the learned SAE features has to be inspected.

4.3 Feature Evaluation

Figure 5: Per-layer MMCS of the Forward and Backward SAEs.

Figure 4 displays the layer-averaged faithfulness and completeness scores for each tested checkpoint. The plot reveals that both forward and backward transfer SAEs consistently achieve better scores than the baseline SAEs, with minimal differences between the two transfer methods. Therefore, both the forward and backward SAEs maintain sufficiency and necessity during their transfer. Figure 5 presents the MMCS between SAEs trained with transfer learning and those trained from scratch. The metric value decreases for deeper layers, suggesting a slight divergence in the features learned by the transfer SAEs. Notably, SAE_{L−1←L} exhibits a sharp decline in the score, indicating that transferring on the last layer should be approached with caution. Lastly, from the human interpretability scores (Figure 7), no significant differences can be observed between each transfer type. By manually looking at the learned features, a key pattern has emerged: many features learned by SAEs trained with transfer learning remain shared with the SAE used for initialization. This phenomenon, termed Feature Transfer, particularly affects the most interpretable features (see an example in Figure 23). To further investigate this phenomenon, a metric was developed to quantify it.
Given an SAE_i and another trained via transfer learning from it, SAE_{i←i±1}, the number of shared "top", "bottom", and "max activating" tokens [8] for each feature has been computed (features have been compared using the same indices). The transfer score has then been defined as the percentage of shared tokens across all three heuristics.

[8] "Top" and "bottom" logit tokens refer to those whose unembedding directions are most and least aligned, respectively, with the projection of the feature in the unembedding space. "Max activating" tokens are those for which the feature exhibits the highest activations.

Figure 6: Per-layer number of shared tokens for the Forward and Backward SAEs, as defined in Section 4.3. Each bar represents the percentage of shared tokens between SAE_i trained from scratch and forward SAE_{i+1←i} and backward SAE_{i−1←i}, respectively.

Figure 6 presents the scores across all the layers for the last evaluated checkpoint. Except for layer 1, backward transfer consistently exhibits lower scores. It is important to note that this phenomenon is easily recognized in SAEs trained with transfer learning when compared to their initialization, as feature indices are preserved. Evaluating this in SAEs trained from scratch is more demanding due to the exponential growth in the number of comparisons required, and although relevant, it falls outside the scope of this work.

4.4 Compute Efficiency

Leveraging forward and backward transfer, we were able to reduce total training steps by 42% and 46%, respectively. See Appendix B.1 for details.

5 Related works

5.1 Scaling and evaluating SAEs

As SAEs gain popularity for LLM interpretability and are increasingly applied to state-of-the-art models (Lieberum et al., 2024), the need for more efficient training techniques has become evident. To address this, Gao et al.
(2024) explored scaling laws of autoencoders to identify the optimal combination of size and sparsity. However, training SAEs is only one aspect of the challenge; evaluating them presents another significant hurdle. This evaluation is a crucial focus within MI. While early approaches in Cunningham et al. (2023) and Bricken et al. (2023) relied on unsupervised metrics like reconstruction loss and L0 sparsity to assess SAE performance, these metrics alone cannot fully capture the efficacy of an SAE. They provide quantitative measures of how well SAEs capture information in model activations while maintaining sparsity, but they fall short of addressing the broader utility of these features.

Figure 7: Human Interpretability Scores (Section 3) for 32 features randomly sampled from each SAE layer and type of transfer.

More recent techniques, such as auto-interpretability (Bricken et al., 2023; Bills et al., 2023; Cunningham et al., 2023) and ground-truth comparisons (Sharkey et al., 2023), have shifted towards a more holistic evaluation, focusing on the causal relevance of the extracted features (Marks et al., 2024) and evaluating SAEs on different downstream tasks in which they can be employed (Makelov et al., 2024). In particular, Makelov et al. (2024) introduced a framework for evaluating SAEs on the Indirect Object Identification (IOI) task, focusing on three key aspects: the sufficiency and necessity of activation reconstructions, the ability to control model behavior through sparse feature editing, also called feature steering (Templeton et al., 2024), and the interpretability of features in relation to their causal role. Karvonen et al. (2024) further advanced principled evaluations by introducing novel metrics specifically designed for board game language models.
Their approach leverages the well-defined structure of chess and Othello to create supervised metrics for SAE quality, including board reconstruction accuracy and coverage of predefined board state properties. These methods provide a more direct assessment of how well SAEs capture semantically meaningful and causally relevant features, offering a complement to the earlier unsupervised metrics like L0 and L2.

5.2 SAEs transfer learning

Recent work by Kissane et al. (2024) and Lieberum et al. (2024) has demonstrated the transferability of SAE weights between base and instruction-tuned versions of Gemma-1 (Team et al., 2024a) and Gemma-2 (Team et al., 2024b), respectively. This finding is significant as it suggests that many interpretable features are preserved during the fine-tuning process. While this transfer occurs between model variants (inter-model) rather than between layers (intra-model), it complements our work by indicating that SAE features can remain stable across different stages of model development. The preservation of these features through fine-tuning not only offers insights into the robustness of learned representations but also suggests potential efficiency gains in interpreting families of models derived from a common base SAE.

6 Conclusions

We hypothesized and validated that SAE transfer is an effective method to accelerate and optimize the SAE training process. We investigated whether SAE weights derived from adjacent layers could maintain efficacy in reconstruction, which our results affirmed. Furthermore, we examined whether the transferred SAEs, when fine-tuned on a layer's activations, could reliably capture monosemantic features comparable to the original SAE, which has also been confirmed by our experiments. The transferred SAEs (both forward and backward) demonstrated comparable and occasionally superior reconstruction loss relative to the original.
Empirically, we observed frequent overlap in the most strongly activated features across adjacent layers (e.g., Figure 23). For a given feature index i, the features learned by SAE_{i←i+1} (Backward), SAE_i (No Transfer), and SAE_{i←i−1} (Forward) appeared to represent similar concepts.

7 Limitations and future works

While our study successfully demonstrates the feasibility of reconstruction transfer and the transfer learning of SAE weights to adjacent layers, there are several limitations that warrant consideration and pave the way for future research directions.

• Model Size and Scope: We trained base and transfer SAEs on the activations of Pythia-160M, a model much smaller than state-of-the-art LLMs. Although not tested, as model size and training complexity increase, the benefits of transfer learning are expected to become more pronounced. In such scenarios, transfer learning can significantly accelerate training and reduce associated costs, making our approach potentially more impactful for larger models. Therefore, a critical area for future research is to extend these investigations to larger models, exploring how scaling affects the efficacy of transfer learning and how these benefits can be maximized in real-world settings.

• Inter-Model and Intra-Model transferability: In our study, we focused on the transfer of intra-model SAEs, particularly assessing the transferability between SAEs in adjacent layers. Given that model architectures are now commonly shared across different model families, a direction for future research would be to evaluate the transferability of intra-model SAEs within models from different families that utilize the same architecture. This exploration could offer valuable insights into the broader applicability of SAEs beyond closely related model families.
• Experimental Scale and Hyperparameter Interactions: Our study was conducted on a limited scale in terms of model components involved and the range of training hyperparameters explored. The fixed set of hyperparameters used may not fully capture the potential of our transfer learning approach across different configurations. Future research should involve a broader exploration of hyperparameter spaces, especially the λ coefficient and expansion factor c, along with component variations, to determine the robustness and versatility of the method.

• Feature Transfer Phenomenon: Our findings reveal a "feature transfer" phenomenon, where features learned in one layer are exactly replicated in another during transfer learning. This can be problematic, as it may prevent the fine-tuned SAEs from discovering new, layer-specific features. However, it also offers an interesting opportunity to study how similar features are encoded across layers. Future research should focus on understanding and managing this phenomenon to either harness or mitigate its effects, depending on the desired outcomes, thereby improving the flexibility and effectiveness of transfer learning.

Acknowledgements

This work has been partially funded by the European innovation action enRichMyData (HE 101070284).

References

Vítor Bernardo. 2023. TechDispatch #2/2023 - Explainable Artificial Intelligence. https://w.edps.europa.eu/data-protection/our-work/publications/techdispatch/2023-11-16-techdispatch-22023-explainable-artificial-intelligence_en. European Data Protection Supervisor.

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. Accessed: 2024-08-18.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, volume 34, pages 15084–15097. Curran Associates, Inc.

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352.

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. Preprint, arXiv:2309.08600.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy models of superposition. Preprint, arXiv:2209.10652.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800GB dataset of diverse text for language modeling. Preprint, arXiv:2101.00027.

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders. Preprint, arXiv:2406.04093.

Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, MA, USA. http://www.deeplearningbook.org.

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts. 2024. The unreasonable ineffectiveness of the deeper layers. Preprint, arXiv:2403.17887.

Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610.

Michael Hanna, Ollie Liu, and Alexandre Variengien. 2023. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Thirty-seventh Conference on Neural Information Processing Systems.

Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. 2024. Measuring progress in dictionary learning for language model interpretability with board game models. In ICML 2024 Workshop on Mechanistic Interpretability.
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment anything. Preprint, arXiv:2304.02643.

Connor Kissane, Ryan Krzyzanowski, Andrew Conmy, and Neel Nanda. 2024. SAEs (usually) transfer between base and chat models. https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models. AI Alignment Forum.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147.

Aleksandar Makelov, George Lange, and Neel Nanda. 2024. Towards principled evaluations of sparse autoencoders for interpretability and control. Preprint, arXiv:2405.08366.

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2024. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647.

Neel Nanda. 2023. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching. Mechanistic Interpretability.

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations.

Kiho Park, Yo Joong Choe, and Victor Veitch. 2024. The linear representation hypothesis and the geometry of large language models. Preprint, arXiv:2311.03658.

Judea Pearl. 2022. Direct and indirect effects. In Probabilistic and causal inference: the works of Judea Pearl, pages 373–392.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision.
In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. Preprint, arXiv:2407.14435.

Lee Sharkey, Dan Braun, and Beren Millidge. 2023. Taking the temperature of transformer circuits. Accessed: 2024-08-18.

Aaquib Syed, Can Rager, and Arthur Conmy. 2023. Attribution patching outperforms automated circuit discovery. In NeurIPS Workshop on Attributing Model Behavior at Scale.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024a. Gemma: Open models based on gemini research and technology. Preprint, arXiv:2403.08295.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024b. Gemma 2: Improving open language models at a practical size. Preprint, arXiv:2408.00118.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. 2024. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread.

Ethan Waisberg, Joshua Ong, Mouayad Masalkhi, Sharif Amit Kamran, Nasif Zaman, Prithul Sarker, Andrew G Lee, and Alireza Tavakkoli. 2023. GPT-4 and ophthalmology operative notes. Annals of Biomedical Engineering, 51(11):2353–2355.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023.
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations.

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. 2023. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 2165–2183. PMLR.

A IE estimation through Attribution Patching

In Equation 7 we reported the Indirect Effect (IE) (Pearl, 2022), which measures the importance of a feature with respect to a generic downstream task T. To reduce the computational burden of estimating the IE with a single forward pass per feature, we employed Attribution Patching (AtP) (Nanda, 2023; Syed et al., 2023). AtP uses a first-order Taylor expansion

ÎE_AtP(m; f; a_c, a_w) = ∇_f m |_{f=f_c} · (f_w − f_c)    (10)

which estimates Equation 7 for every f in two forward passes and a single backward pass.

Figure 8: Direct Logit Attribution scores averaged across layers for every tested checkpoint, compared to the “No Transfer” baseline, i.e., the DLA scores obtained by SAE_i(x_i), ∀i = 0, ..., 11.
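The AtP estimate in Equation 10 can be illustrated with a small self-contained sketch. The toy metric, its analytic gradient, and all names below are hypothetical stand-ins for a model metric and an autodiff backward pass, not the authors' implementation:

```python
import numpy as np

def metric(f, w):
    # Toy downstream metric m(f): a smooth scalar function of feature activations.
    return float(np.tanh(w @ f))

def grad_metric(f, w):
    # Analytic gradient of m at f (stands in for a single backward pass).
    return (1 - np.tanh(w @ f) ** 2) * w

def atp_estimate(f_clean, f_corrupt, w):
    # Eq. 10: gradient at the clean activations, dotted with the
    # corrupt-minus-clean activation difference.
    return (f_corrupt - f_clean) @ grad_metric(f_clean, w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
f_c = rng.normal(size=8)                 # clean-run feature activations
f_w = f_c + 0.001 * rng.normal(size=8)   # corrupt-run activations (small change)

exact_ie = metric(f_w, w) - metric(f_c, w)  # one forward pass per patched f
approx_ie = atp_estimate(f_c, f_w, w)
print(exact_ie, approx_ie)  # the two estimates agree to first order
```

For small activation differences the first-order estimate tracks the exact patched difference closely, which is what makes the two-forward-one-backward-pass accounting in the text work.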
B Direct Logit Attribution

We also report the Direct Logit Attribution (DLA) between forward SAE_{i←i−1} and backward SAE_{i←i+1} transfer SAEs. Introduced by Bricken et al. (2023), DLA assesses the direct effect of a feature on the next-token distribution, providing insights into the causal role of features. The attribution score is computed as follows:

attr_i(x; a_c, a_w) = f_i v_i · ∇_x L(a_c, a_w)    (11)

where x is a given prompt and ∇_x L is the gradient of the logit difference between two contrastive answer tokens a_c, a_w (e.g., x = “The square root of 9 is”, a_c = 3, and a_w = 2). We report the feature-averaged DLA computed on a custom dataset comprising 64 handcrafted prompts of the form (x, a_c, a_w)_i.

Figure 8 displays the layer-averaged DLA scores for each tested checkpoint. The plot reveals that forward transfer SAEs consistently achieve higher scores than the baseline, while backward transfer SAEs consistently score lower. This outcome contrasts with the reconstruction metrics, where the backward technique consistently outperformed the forward approach. Detailed per-layer DLA scores are reported in Figure 22.

B.1 Compute Efficiency

This work proposes a novel method leveraging transfer learning to significantly reduce computational costs when training SAEs for LLMs. We demonstrate that both Fwd-SAE (SAE_{i←i−1}) and Bwd-SAE (SAE_{i←i+1}), trained with our fine-tuning strategy, are valid alternatives to the standard layer-by-layer training of SAE_i, in terms of both reconstruction quality of the learned representation and performance on downstream tasks. In practice, our approach consists of the following steps:

1. Train a SAE_i on alternate layers, depending on the transfer direction: i ∈ {0, 2, 4, ..., L} for Forward transfer and i ∈ {1, 3, 5, ..., L−1} for Backward transfer.

2. Initialize the current SAE_i from either SAE_{i←i−1} (forward transfer) or SAE_{i←i+1} (backward transfer).

3. Apply transfer learning by training the remaining SAEs, stopping when some criterion is met (e.g., when the loss converges to a specific value or when a computational budget is exhausted).

Empirical results demonstrate substantial efficiency gains. In our experiments with the 12-layer Pythia-160M (Biderman et al., 2023) model, we observed a performance increase after fine-tuning on 10% of the training data (Figure 3 and Figure 4), with performance continuing to increase over time. Extrapolating these findings, we can compute empirical lower and upper bounds on the training efficiency. Given a model with L layers (here L = 12) and a training set of 1B tokens, we have:

• Baseline training: train one SAE_i for each i ∈ {1, ..., 12} for 1B tokens: 12B tokens.

• Forward/Backward transfer, fine-tuning on 10% of the data:
  – Train one SAE_i for half of the layers for 1B tokens: 6B tokens.
  – Fine-tune the remaining SAE_{i←i−1} or SAE_{i←i+1} for 100M tokens: 0.6B tokens.
  – Total: 6.6B tokens.

• Forward/Backward transfer, fine-tuning on 50% of the data:
  – Train one SAE_i for half of the layers for 1B tokens: 6B tokens.
  – Fine-tune the remaining SAE_{i←i−1} or SAE_{i←i+1} for 500M tokens: 3B tokens.
  – Total: 9B tokens.

• Computational savings:
  – 10% of data: 12B − 6.6B = 5.4B tokens.
  – 50% of data: 12B − 9B = 3B tokens.

• Relative reduction in compute cost:
  – 10% of data: 5.4B / 12B × 100% = 45%.
  – 50% of data: 3B / 12B × 100% = 25%.

Our analysis indicates that the proposed transfer learning approach can reduce compute costs by 25% to 45% for forward and backward transfer when fine-tuned on 50% and 10% of the training data, respectively, improving efficiency and reducing costs by a substantial margin while maintaining both reconstruction quality and performance on downstream tasks.
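The token-budget accounting above can be checked with a short script (the layer count and per-SAE token budget are taken from the text; the function name is illustrative):

```python
# Token-budget comparison for baseline vs. transfer-learning SAE training,
# following the accounting in Appendix B.1 (12 layers, 1B tokens per from-scratch SAE).
LAYERS = 12
TRAIN_TOKENS = 1.0  # billions of tokens per from-scratch SAE

def transfer_budget(finetune_fraction):
    """Total tokens (in billions) when half the SAEs are trained from scratch
    and the other half are fine-tuned on a fraction of the training data."""
    scratch = (LAYERS // 2) * TRAIN_TOKENS
    finetune = (LAYERS // 2) * TRAIN_TOKENS * finetune_fraction
    return round(scratch + finetune, 3)

baseline = LAYERS * TRAIN_TOKENS  # 12.0B tokens for layer-by-layer training
for frac in (0.10, 0.50):
    total = transfer_budget(frac)
    saving = round(baseline - total, 3)
    print(f"{frac:.0%} fine-tuning: total={total}B, saved={saving}B, "
          f"reduction={saving / baseline:.0%}")
```

Running it reproduces the 6.6B/9B totals and the 45%/25% relative reductions quoted in the text.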
C Additional plots and tables

Hyperparameter          Value
c                       8
λ                       1.0
Hook name               resid-post
Batch size              4096
Adam (β1, β2)           (0, 0.999)
lr (Train)              3e-5
lr (Fine-tuning)        1e-5
lr scheduler            constant
lr decay steps          20% of the training steps
l1 warm-up steps        5% of the training steps
#tokens (Train)         1B
#tokens (Fine-tuning)   500M
Checkpoint freq.        100M

Table 2: Training and fine-tuning hyperparameters.

Checkpoint  i=1    i=2    i=3    i=4    i=5    i=6    i=7    i=8    i=9    i=10   i=11
100M        0.962  0.960  0.983  0.920  0.865  0.439  0.955  0.948  0.858  0.944  1.003
200M        0.968  0.968  0.996  0.933  0.873  0.459  0.970  0.956  0.894  0.965  1.005
300M        0.969  0.971  1.000  0.941  0.877  0.475  0.981  0.960  0.911  0.972  1.005
400M        0.971  0.974  1.003  0.944  0.879  0.479  0.988  0.963  0.921  0.978  1.006
500M        0.972  0.975  1.005  0.946  0.881  0.488  0.991  0.964  0.929  0.981  1.006

Table 3: Normalized CE-Loss Scores CES_{i,i−1} (Eq. 9) of the Fwd-SAE at different checkpoints. On i = 6, the Normalized CE-Loss Score increases over time even though it starts with a lower value w.r.t. the other checkpoints. From Figure 9 we note how the CE-Loss Scores of SAE_5(x_6) and SAE_{6←5}(x_6) are nearly identical to the one obtained by SAE_6(x_6); thus the increment given by fine-tuning over the baseline SAE_5(x_6), captured by the Normalized CE-Loss Score in Eq. 9, is minimal, resulting in a lower value.

Checkpoint  i=0    i=1    i=2    i=3    i=4    i=5    i=6    i=7    i=8    i=9    i=10
100M        0.988  0.927  0.964  1.052  0.803  0.375  0.801  1.044  0.920  1.005  0.939
200M        0.990  0.939  0.969  1.076  0.812  0.396  0.805  1.047  0.912  1.001  0.953
300M        0.991  0.945  0.972  1.084  0.823  0.412  0.808  1.049  0.913  0.999  0.965
400M        0.995  0.951  0.975  1.098  0.827  0.420  0.811  1.052  0.912  0.997  0.972
500M        0.997  0.951  0.975  1.098  0.827  0.425  0.814  1.056  0.913  0.998  0.976

Table 4: Normalized CE-Loss Scores CES_{i,i+1} of the Bwd-SAE at different checkpoints. On i = 5, the Normalized CE-Loss Score increases over time even though it starts with a lower value w.r.t. the other checkpoints.
From Figure 9 we note how the CE-Loss Scores of SAE_6(x_5) and SAE_{5←6}(x_5) are nearly identical to the one obtained by SAE_5(x_5); thus the increment given by fine-tuning over the baseline SAE_6(x_5), captured by the Normalized CE-Loss Score in Eq. 9, is minimal, resulting in a lower value.

Figure 9: Detailed per-layer CE-Loss Score at the final checkpoint (500M). SAE_{i−1}(x_i) and SAE_{i+1}(x_i) are the baselines for the Fwd-SAE and Bwd-SAE respectively.

Figure 10: Detailed per-layer L2-Loss at the final checkpoint (500M). SAE_{i−1}(x_i) and SAE_{i+1}(x_i) are the baselines for the Fwd-SAE and Bwd-SAE respectively. The y-axis is on a logarithmic scale.

Figure 11: Detailed per-layer L0-Loss at the final checkpoint (500M). SAE_{i−1}(x_i) and SAE_{i+1}(x_i) are the baselines for the Fwd-SAE and Bwd-SAE respectively. The y-axis is on a logarithmic scale.

Figure 12: Detailed per-layer CE-Loss Score over time (Checkpoint) after Forward Transfer.

Figure 13: Detailed per-layer CE-Loss Score over time (Checkpoint) after Backward Transfer.

Figure 14: Detailed per-layer L2-Loss over time (Checkpoint) after Forward Transfer.

Figure 15: Detailed per-layer L2-Loss over time (Checkpoint) after Backward Transfer.

Figure 16: Detailed per-layer L0-Loss over time (Checkpoint) after Forward Transfer.

Figure 17: Detailed per-layer L0-Loss over time (Checkpoint) after Backward Transfer.

Figure 18: Faithfulness over time (Checkpoint) averaged by layer and N for the three downstream tasks.

Figure 19: Completeness over time (Checkpoint) averaged by layer and N for the three downstream tasks.

Figure 20: Faithfulness over N averaged by layer and time (Checkpoints) for the three downstream tasks.

Figure 21: Completeness over N averaged by layer and time (Checkpoints) for the three downstream tasks.

Figure 22: Detailed per-layer feature-averaged Logit Attribution scores over time (Checkpoint), as defined in Equation 11.
Figure 23: Comparison of top activations of feature 949 across the layer-8 SAE and two transfer SAEs pre-trained on the former: SAE_8 (left), SAE_{7←8} (middle), SAE_{9←8} (right). Evidence of feature transfer across three layers.
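The kind of feature overlap illustrated in Figure 23 can be quantified by comparing decoder directions between a base SAE and a transfer SAE. A minimal sketch follows, with random matrices standing in for trained decoder weights and all names illustrative rather than the authors' code:

```python
import numpy as np

def decoder_similarity(W_dec_a, W_dec_b):
    """Cosine similarity between matching decoder rows (feature directions)
    of two SAEs with the same dictionary size."""
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

rng = np.random.default_rng(0)
d_model, n_feat = 64, 512
base = rng.normal(size=(n_feat, d_model))          # stand-in for a base SAE decoder
transferred = base + 0.05 * rng.normal(size=(n_feat, d_model))  # lightly fine-tuned copy
unrelated = rng.normal(size=(n_feat, d_model))     # independently trained SAE

sim_transfer = decoder_similarity(base, transferred)
sim_unrelated = decoder_similarity(base, unrelated)
print(sim_transfer.mean(), sim_unrelated.mean())
# Transferred features stay nearly aligned; unrelated ones are near-orthogonal.
```

High per-feature cosine similarity between a base decoder and its fine-tuned transfer counterpart is one simple signature of the "feature transfer" phenomenon discussed in the limitations.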