Paper deep dive
Training Superior Sparse Autoencoders for Instruct Models
Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
Models: Llama3.2-3B-Instruct, Qwen2.5-7B-Instruct
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 5:21:08 PM
Summary
The paper introduces Finetuning-aligned Sequential Training (FAST), a novel training method for Sparse Autoencoders (SAEs) specifically designed for instruct-tuned Large Language Models. Unlike traditional Block Training (BT) which causes semantic discontinuity by concatenating data, FAST processes data instances independently to preserve semantic integrity. Experimental results on Llama and Qwen models demonstrate that FAST significantly improves token reconstruction quality and feature interpretability, particularly for special tokens.
Entities (6)
Relation Signals (3)
FAST → improves → Sparse Autoencoder
confidence 95% · FAST significantly enhances SAE performance across various tasks.
FAST → outperforms → Block Training
confidence 95% · FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods
Sparse Autoencoder → usedfor → Mechanistic Interpretability
confidence 90% · Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability
Cypher Suggestions (2)
Find all training methods and the models they are applied to. · confidence 90% · unvalidated
MATCH (m:Model)-[:TRAINED_WITH]->(t:TrainingMethod) RETURN m.name, t.name
Identify performance improvements of FAST over baselines. · confidence 85% · unvalidated
MATCH (a:TrainingMethod {name: 'FAST'})-[r:OUTPERFORMS]->(b:TrainingMethod) RETURN a.name, b.name, r.metric
Abstract: As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose Finetuning-aligned Sequential Training (FAST), a novel training method specifically tailored for instruct models. FAST aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, FAST yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F). Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at this https URL.
Tags
Links
- Source: https://arxiv.org/abs/2506.07691
- Canonical: https://arxiv.org/abs/2506.07691
- Code: https://github.com/Geaming2002/FAST
PDF not stored locally. Use the link above to view on the source site.
Full Text
104,539 characters extracted from source content.
arXiv:2506.07691v1 [cs.CL] 9 Jun 2025

Training Superior Sparse Autoencoders for Instruct Models

Jiaming Li 1,2 *, Haoran Ye 3 *, Yukun Chen 1,2, Xinyue Li, Lei Zhang 1,2, Hamid Alinejad-Rokny 4, Jimmy Chih-Hsien Peng 3†, Min Yang 1†
1 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 National University of Singapore
4 The University of New South Wales
jm.li4, min.yang@siat.ac.cn; y_haoran@u.nus.edu; jpeng@nus.edu.sg
* Equal contribution. † Corresponding author.

Abstract

As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose Finetuning-aligned Sequential Training (FAST), a novel training method specifically tailored for instruct models. FAST aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, FAST yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F). Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.

1 Introduction

Large Language Models (LLMs) demonstrate exceptional performance across diverse natural language processing tasks (Brown et al., 2020; Ouyang et al., 2022; Guo et al., 2025). However, their complexity, vast number of parameters, and intricate training processes present significant challenges in understanding their internal mechanisms (Bengio et al., 2023; Bubeck et al., 2023). As these models advance, aligning them with human values and mitigating risks becomes critical, highlighting the importance of mechanistic interpretability (Bereska and Gavves, 2024; Ji et al., 2023; Anwar et al., 2024). Sparse autoencoders (SAEs) serve as a powerful tool for interpreting LLMs by mapping high-dimensional activations to sparse, interpretable feature spaces, thereby decomposing neural networks into understandable components (Bereska and Gavves, 2024; Bricken et al., 2023; Cunningham et al., 2023). SAE training, conceptualized as dictionary learning (Kreutz-Delgado et al., 2003; Yun et al., 2021), utilizes hidden layer weights as dictionary bases and enforces sparsity for efficient representations, aligning with the linear representation and superposition hypotheses (Elhage et al., 2022; Arora et al., 2018; Olah, 2022).

Figure 1: Overview of the sparse autoencoder, illustrating its process for interpreting the internal representations of large language models.
Figure 1 provides an overview of sparse autoencoders. Current SAE training methods primarily focus on base models and follow the Block Training paradigm, which concatenates datasets and splits them into fixed-length blocks (Joseph Bloom and Chanin, 2024; Bricken et al., 2023). This aligns with the pretraining phase of LLMs, making it a natural and effective choice for training SAEs on base models. While effective for base models, this method faces significant limitations when applied to instruct models (Joseph Bloom and Chanin, 2024; Kissane et al., 2024b). The semantic discontinuity caused by combining data from diverse sources undermines the semantic coherence needed for alignment with downstream tasks, ultimately degrading SAE training performance (Kissane et al., 2024b).

To address these challenges, we propose Finetuning-aligned Sequential Training (FAST), a novel SAE training method specifically designed for instruct models. FAST processes each data instance independently, preserving semantic integrity and maintaining alignment with the fine-tuning objectives of the model. By providing a consistent and complete semantic space during SAE training, FAST enhances the model's understanding of the input and improves the quality of feature extraction.

Experimental results demonstrate that FAST significantly enhances SAE performance across various tasks. In token reconstruction on Qwen2.5-7B-Instruct (Yang et al., 2024), FAST achieves a mean squared error of 0.6468, outperforming baselines of 5.1985 and 1.5093. It also excels in feature interpretability; for Llama3.2-3B-Instruct (Dubey et al., 2024), 21.1% of features are rated highest in quality, compared to 7.0% for BT(P) and 10.2% for BT(F). Additionally, SAEs are used to study the impact of special tokens on outputs, offering insights into their roles and practical applications, and paving the way for future research.

Our contributions are summarized as follows:
• This paper proposes Finetuning-aligned Sequential Training (FAST), a novel method specifically designed for training SAEs on instruct models.
• Experimental results demonstrate that FAST significantly improves the performance of SAEs on token reconstruction. Additionally, feature interpretability experiments confirm the effectiveness and generalizability of FAST.
• The SAEs are further utilized to investigate the influence of special tokens on model outputs, providing new insights into their specific roles and offering fresh directions for the practical application of SAE models.

2 Related Work

Mechanistic Interpretability. As LLMs continue to advance, their increasing complexity, massive parameter scales, and intricate training processes present significant challenges to human understanding of their inner workings (Bubeck et al., 2023; Bengio et al., 2023). Achieving a deep understanding of LLMs is crucial to ensuring alignment with human values (Ji et al., 2023; Anwar et al., 2024) and mitigating harmful or unintended outcomes (Anwar et al., 2024; Hendrycks et al., 2021; Slattery et al., 2024; Hendrycks et al., 2023). However, the "black box" nature of these models (Casper et al., 2024) obscures the underlying causes of misalignment and associated risks. To address these challenges, mechanistic interpretability has emerged as a critical area of research focused on understanding the inner workings of LLMs (Bereska and Gavves, 2024; Nanda, 2022d, 2023, 2022a; Olah, 2022).
This discipline seeks to achieve a detailed understanding of model behavior through systematic reverse engineering (Nanda, 2022c,b).

Sparse Autoencoders for LLM. The training of sparse autoencoders (SAEs) can be framed as a form of dictionary learning, where the hidden layer weights serve as the dictionary basis and sparsity constraints enforce efficient and sparse data representations (Bereska and Gavves, 2024; Bricken et al., 2023). Additionally, SAEs align with both the linear representation hypothesis (Mikolov et al., 2013) and the superposition hypothesis (Elhage et al., 2022; Arora et al., 2018; Olah et al., 2020), ensuring that the learned representations adhere to theoretical principles of high-dimensional feature spaces. Specifically, the linear representation hypothesis suggests that features in language models correspond to directions in activation space, enabling embedding arithmetic such as v("king") − v("man") + v("woman") = v("queen") (Mikolov et al., 2013).

Neurons in LLMs are often polysemantic, encoding multiple distinct features due to the limited dimensionality of the feature activation space (Bereska and Gavves, 2024). The superposition hypothesis explains how neural networks represent more features than the number of available neurons by encoding features as nearly orthogonal directions in the neuron output space (Elhage et al., 2022). The activation of one feature may appear as a slight activation of another, resulting from the overlap of non-orthogonal vectors. While such overlaps introduce interference, the advantages of representing a greater number of non-orthogonal features outweigh the drawbacks, particularly in highly sparse neural networks (Bricken et al., 2023; Bereska and Gavves, 2024; Rajamanoharan et al., 2024a). This property makes SAEs particularly valuable in mechanistic interpretability, as they enable the decomposition of language models by capturing high-dimensional features (Gao et al., 2024; Ferrando et al., 2024; Rajamanoharan et al., 2024b; Lieberum et al., 2024; He et al., 2024).

Figure 2: Illustration of the LLM training pipeline and SAE training methods. (a) The pipeline transitions from pretraining to fine-tuning. (b) Block Training (BT) concatenates datasets and resplits them into fixed-length blocks. (c) Finetuning-aligned Sequential Training (FAST) processes data instances independently, preserving semantic integrity and improving alignment with fine-tuning objectives, leading to better performance in feature interpretability.

3 Finetuning-aligned Sequential Training

Motivation. Recent studies have adopted a training paradigm for SAEs that builds upon the pretraining phase of LLMs, as illustrated in Figure 2(b). This approach, referred to as Block Training (BT), involves concatenating datasets and splitting them into fixed-length blocks for training (Bereska and Gavves, 2024; He et al., 2024; Kissane et al., 2024a). BT aligns with the pretraining phase of LLMs, making it a natural and effective choice for training SAEs on base models. Since base models are directly trained on large-scale corpora without additional fine-tuning, BT ensures consistency between SAE training and the pretraining objectives of LLMs.

However, when it comes to instruct models, which undergo a supervised fine-tuning (SFT) phase to align with specific instructions or downstream tasks, the limitations of BT become more apparent. For instance, studies demonstrate that SAEs trained on the pretraining dataset exhibit significantly weaker abilities in adhering to refusal directives (Kissane et al., 2024b). An alternative approach utilizes SFT datasets, introducing special tokens and applying block training in the same manner (Kissane et al., 2024b). While this method leverages SFT datasets, it still preserves the BT methodology, which does not align well with the fine-tuning objectives of instruct models. Specifically, BT treats the input sequences as concatenated blocks, often combining data samples from different sources. For example, in a sequence of 8,192 tokens, the first 2,048 tokens may originate from one sample, while the remaining 6,144 tokens come from another. While such semantic discontinuity is less problematic for base models, as it mirrors their pretraining setup, it poses significant challenges for instruct models. Maintaining semantic integrity is crucial for aligning with downstream tasks, and the lack of such alignment hinders the model's ability to fully understand the input, ultimately degrading SAE training performance.

To address these challenges, we propose a novel SAE training method for instruct models: Finetuning-aligned Sequential Training (FAST), which better aligns with the fine-tuning phase, both in terms of dataset utilization and training methodology, as shown in Figure 2(c). By providing the instruct model with a consistent and complete semantic space during SAE training, FAST enhances alignment with the fine-tuning phase and improves the quality of SAE training. This alignment forms the primary motivation behind FAST.
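To make the contrast concrete, here is a minimal sketch of the two ways of preparing SAE training inputs: BT-style fixed-length blocks versus FAST-style per-instance sequences rendered with the model's chat template (the per-instance processing is detailed in Section 3.1 below). It relies on the standard Hugging Face apply_chat_template and output_hidden_states interfaces; variable names, the layer index, and the truncation length are illustrative assumptions, not the released implementation.

```python
import torch

def bt_blocks(tokenizer, texts, block_size=2048):
    """Block Training: concatenate everything, then cut into fixed-length blocks."""
    ids = []
    for t in texts:
        ids.extend(tokenizer(t, add_special_tokens=False)["input_ids"])
    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]

@torch.no_grad()
def fast_activations(model, tokenizer, dialogues, layer=18, max_len=8192):
    """FAST-style processing: each dialogue is handled independently with its chat
    template, and residual-stream activations are collected at one layer."""
    for messages in dialogues:                       # messages: list of {"role", "content"} dicts
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
        out = model(**enc.to(model.device), output_hidden_states=True)
        yield out.hidden_states[layer][0]            # [seq_len, d_model] for this single instance
```

The key difference is only in how token streams are formed: BT mixes samples inside one block, while the FAST generator never crosses instance boundaries.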
3.1 Data Processing

As previously described, FAST trains the SAE using finetuning datasets. Specifically, multiple multi-turn dialogue datasets are collected, and each data instance is combined with the corresponding chat template of the instruct model. This process not only introduces special tokens but also ensures consistency with the data processing methodology used during the fine-tuning phase of the model. A key innovation lies in the independent processing of each data instance, rather than concatenating multiple instances before inputting them into the model. By eliminating the constraint of context size, the dataset is processed sequentially. Each data instance is individually fed into the LLM to extract hidden layer activations, which are subsequently used to train the SAE, as illustrated in Figure 2(c). This approach effectively avoids the semantic discontinuity caused by data concatenation while preserving the semantic integrity of each instance, thereby providing higher-quality inputs for training the SAE.

3.2 SAE

This section introduces the two types of SAE models utilized in FAST: the Standard ReLU-based SAE and the JumpReLU SAE. The Standard ReLU-based SAE is a widely adopted approach (Bereska and Gavves, 2024; Bricken et al., 2023), while the JumpReLU SAE achieves superior reconstruction quality and sparsity control (Rajamanoharan et al., 2024a; Lieberum et al., 2024). We provide the details of the two SAE models here and the initialization method in Appendix A.

Standard SAE. For an input vector $x \in \mathbb{R}^{d_{in}}$ from the residual stream, $d_{in}$ denotes the dimensionality of the model's hidden layer. The ReLU-based SAE model consists of an encoder, a decoder, and a corresponding loss function, defined as follows:

$f(x) = \mathrm{ReLU}(W_{enc} x + b_{enc})$  (1)

$\hat{x} = W_{dec} f(x) + b_{dec}$  (2)

$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda \|z\|_1$  (3)

$W_{enc}$, $W_{dec}$, $b_{enc}$, $b_{dec}$ represent the weight matrices and bias vectors of the encoder and decoder, respectively. $\|x - \hat{x}\|_2^2$ denotes the mean squared error (MSE) loss, $\|z\|_1$ is the $L_1$ penalty on the latent activations $z = f(x)$ used for sparsity regularization, and $\lambda$ is the sparsity regularization hyperparameter.

JumpReLU SAE. The JumpReLU SAE retains the same parameter matrices $W$ and $b$ as the Standard SAE but introduces a modified activation function and sparsity regularization:

$f(x) = \mathrm{JumpReLU}_{\theta}(W_{enc} x + b_{enc})$  (4)

$\hat{x} = W_{dec} f(x) + b_{dec}$  (5)

$\mathcal{L} = \|x - \hat{x}\|_2^2 + \lambda \|z\|_0$  (6)

The JumpReLU function is defined as $\mathrm{JumpReLU}_{\theta}(z) := z \odot H(z - \theta)$, where $\theta > 0$ is a learnable, vector-valued threshold parameter. Here, $\odot$ denotes elementwise multiplication, and $H$ represents the Heaviside step function. Additionally, $\|z\|_0$ is the $L_0$ penalty used for sparsity regularization, while $\lambda$ is the sparsity regularization hyperparameter.
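To make Equations (1)-(6) concrete, below is a minimal PyTorch sketch of the two SAE variants. It is illustrative only: the module names are ours, the JumpReLU threshold is applied as a hard mask without the straight-through estimator a real training run would need, and the loss reductions are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardSAE(nn.Module):
    """ReLU SAE with an L1 sparsity penalty (Eqs. 1-3), illustrative sketch."""
    def __init__(self, d_in: int, expansion: int = 8, l1_coeff: float = 5.0):
        super().__init__()
        d_sae = d_in * expansion
        self.W_enc = nn.Parameter(torch.empty(d_in, d_sae))
        self.W_dec = nn.Parameter(torch.empty(d_sae, d_in))
        nn.init.kaiming_uniform_(self.W_enc)
        nn.init.kaiming_uniform_(self.W_dec)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.l1_coeff = l1_coeff

    def forward(self, x):
        z = F.relu(x @ self.W_enc + self.b_enc)                # Eq. (1)
        x_hat = z @ self.W_dec + self.b_dec                    # Eq. (2)
        loss = ((x - x_hat) ** 2).sum(-1).mean() \
               + self.l1_coeff * z.abs().sum(-1).mean()        # Eq. (3)
        return x_hat, z, loss

class JumpReLUSAE(StandardSAE):
    """JumpReLU SAE (Eqs. 4-6); the hard threshold and L0 term are written literally
    here and are not differentiable as-is."""
    def __init__(self, d_in, expansion=8, l0_coeff=0.01, init_threshold=1e-3):
        super().__init__(d_in, expansion)
        self.l0_coeff = l0_coeff
        self.threshold = nn.Parameter(torch.full((d_in * expansion,), init_threshold))

    def forward(self, x):
        pre = x @ self.W_enc + self.b_enc
        z = pre * (pre > self.threshold).float()               # Eq. (4): z ⊙ H(z - θ)
        x_hat = z @ self.W_dec + self.b_dec                    # Eq. (5)
        l0 = (z != 0).float().sum(-1).mean()                   # Eq. (6): L0 sparsity term
        loss = ((x - x_hat) ** 2).sum(-1).mean() + self.l0_coeff * l0
        return x_hat, z, loss
```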
3.3 Mixing Activation Buffer

Activation values, which represent the activation levels of hidden layer dimensions during the model's forward pass, require significant storage space. To mitigate this challenge, we employ a producer-consumer framework inspired by previous studies (Joseph Bloom and Chanin, 2024), wherein the LLM generates activations and stores them in a dedicated buffer.

As shown in Figure 3, the process begins with the buffer being filled to capacity with activation values. Once the buffer is full, the activations are shuffled to ensure randomness and diversity. Subsequently, half of the shuffled activations are sent to the SAE model for training, while the other half remains in the buffer. After training, the buffer is replenished with new activations generated by the model, and the cycle repeats. This iterative mechanism optimizes storage efficiency and ensures a high level of data variability, thereby enhancing the robustness of model training. By leveraging the mixing buffer, this approach effectively balances data diversity with storage efficiency.

Figure 3: The mixing activation buffer is shuffled, half is sent to the SAE for training, and the resulting new activations are used to refill the buffer. This iterative process ensures data diversity and storage efficiency.
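A rough sketch of the producer-consumer mixing buffer in Figure 3 follows; the class and method names are ours and simplify the actual sae_lens-based implementation.

```python
import torch

class MixingActivationBuffer:
    """Illustrative mixing buffer: fill, shuffle, train on half, refill (Figure 3)."""
    def __init__(self, capacity: int, d_in: int, producer):
        self.capacity = capacity          # number of activation vectors held
        self.producer = producer          # callable yielding [n, d_in] activation tensors
        self.storage = torch.empty(0, d_in)

    def _fill(self):
        # Pull activations from the LLM until the buffer is at capacity.
        while self.storage.shape[0] < self.capacity:
            self.storage = torch.cat([self.storage, self.producer()], dim=0)
        self.storage = self.storage[: self.capacity]

    def next_batch(self):
        # Refill, shuffle, hand half to the SAE, keep the other half for mixing.
        self._fill()
        perm = torch.randperm(self.storage.shape[0])
        self.storage = self.storage[perm]
        half = self.storage.shape[0] // 2
        batch, self.storage = self.storage[:half], self.storage[half:]
        return batch
```

Each call to next_batch interleaves freshly produced activations with half of the previously shuffled ones, which is the source of the data diversity the section describes.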
4 Experiments

4.1 Experiment Setup

Dataset. We construct a large-scale instruction dataset for fine-tuning LLMs by combining several publicly available, high-quality datasets, including WildChat-1M-Full (Zhao et al., 2024), Infinity-Instruct (BAAI, 2024), tulu-3-sft-mixture (Lambert et al., 2024), orca-agentinstruct-1M-v1-cleaned [1], and lmsys-chat-1m (Zheng et al., 2023). After applying a 20-gram deduplication strategy, it is reduced to 4,758,226 samples. Details are in Appendix B.

LLMs. We conduct experiments on seven models from two families: Llama (Llama-3.1, Llama-3.2) (Dubey et al., 2024) and Qwen (Qwen-2.5) (Yang et al., 2024), selected for their state-of-the-art performance to evaluate our approach's robustness and generalization across families and scales. The models and their respective layer configurations, detailed in Table 1, are selected from various depths to mitigate depth bias. Following prior works (Bereska and Gavves, 2024; Bricken et al., 2023; Gao et al., 2024), we train SAEs on the residual stream, as inter-layer relationships have minimal impact on performance.

[1] https://huggingface.co/datasets/mlabonne/orca-agentinstruct-1M-v1-cleaned

Baselines. Prior to this study, all SAE training methods exclusively utilize the Block Training (BT) strategy. Depending on the type of training dataset used, Block Training can be categorized into two primary forms, BT(P) and BT(F), as follows:
• BT(P): Block Training using the pretraining dataset. The pretraining dataset is processed by concatenating and segmenting the data into text blocks of equal length, which are then used for training the SAE model.
• BT(F): Block Training using the finetuning dataset. This approach utilizes a finetuning dataset. The data within the dataset is concatenated to form text blocks.
For BT(P), we utilize the pile-uncopyrighted dataset [2]. For BT(F), we use the finetuning dataset mentioned before, which is also used in FAST.

Configuration. SAEs are trained on 8 NVIDIA A100 GPUs using sae_lens (Joseph Bloom and Chanin, 2024) with a custom implementation. For models with more than 7B parameters, the expansion factor of the SAE is fixed at 8X, whereas for other models, the expansion factor can be 8X or 16X. To ensure fairness across methods at the same data scale, the number of training tokens is set to 40,960,000. For BT(P) and BT(F), context_size is 2,048, with each text block containing 2,048 tokens. For FAST, no explicit context_size is required; instead, a truncation length of 8,192 is applied to manage memory usage. For the JumpReLU SAE, the sparsity coefficient is 0.01, while for the Standard SAE, it is 5. Further parameter details are in Appendix C.

Evaluation Metric. The performance of the SAE is assessed using the Mean Squared Error (MSE), which is calculated as:

$\mathrm{MSE} = \frac{1}{N \cdot H} \sum_{i=1}^{N} \frac{1}{L_i} \sum_{j=1}^{L_i} \sum_{k=1}^{H} (y_{i,j,k} - \hat{y}_{i,j,k})^2$  (7)

where $N$ denotes the size of the dataset, $L_i$ represents the length of the $i$-th sequence, and $H$ refers to the hidden dimension of the model. To evaluate the SAE's performance specifically on special tokens, we also compute the MSE of special tokens, denoted as MSE_st [3]. Lower MSE values reflect better model performance.

[2] https://huggingface.co/datasets/monology/pile-uncopyrighted
[3] To facilitate a more direct comparison of performance across different methods, all MSE values are transformed using log2.
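For reference, the sketch below shows one way the metric in Equation (7) and its special-token variant MSE_st could be computed over padded evaluation batches; the tensor layout and masking are our assumptions rather than the paper's evaluation code.

```python
import torch

def sae_mse(y_true, y_pred, lengths, special_mask=None):
    """Eq. (7): per-sequence mean over positions, averaged over dataset and hidden dim.

    y_true, y_pred: [N, L_max, H] padded activations and reconstructions
    lengths:        [N] tensor of valid sequence lengths
    special_mask:   optional [N, L_max] bool mask of special-token positions (for MSE_st)
    """
    N, L_max, H = y_true.shape
    pos_mask = torch.arange(L_max)[None, :] < lengths[:, None]      # valid (non-padding) positions
    if special_mask is not None:
        pos_mask = pos_mask & special_mask
    sq_err = ((y_true - y_pred) ** 2).sum(dim=-1)                   # sum over hidden dimension k
    per_seq = (sq_err * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)  # mean over tokens j
    return per_seq.sum() / (N * H)                                  # divide by N·H as in Eq. (7)
```

Passing a special-token mask restricts the inner average to special-token positions, giving the MSE_st variant reported in the tables.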
Figure 4: MSE_st performance of the JumpReLU SAE (all metrics are presented in log scale, where lower values indicate better SAE reconstruction performance). Within the JumpReLU architecture, FAST exhibits the best reconstruction capability compared to BT(P) and BT(F).

Model Name | Layers
Llama series
Llama-3.1-8B-Instruct | [4, 12, 18, 20, 25]
Llama-3.2-3B-Instruct | [4, 12, 20]
Llama-3.2-1B-Instruct | [4, 9, 14]
Qwen series
Qwen2.5-7B-Instruct | [4, 12, 18, 20, 25]
Qwen2.5-3B-Instruct | [4, 18, 32]
Qwen2.5-1.5B-Instruct | [4, 14, 24]
Qwen2.5-0.5B-Instruct | [4, 12, 20]

Table 1: Layer configurations of the Llama and Qwen model series, showcasing the selection of layers across varying depths to mitigate depth-related biases and optimize model performance.

4.2 Main Results

A random sample of 5,000 dialogues is extracted from the remaining portion of the dataset for evaluation. Figure 4 compares the MSE_st scores of the three methods using the JumpReLU SAE, while Figure 6 illustrates the MSE_st performance of the Standard SAE. Detailed results for both MSE and MSE_st are presented in Appendix D.

In terms of overall token reconstruction (MSE), the JumpReLU architecture with Qwen models demonstrates similar patterns, with FAST consistently outperforming baseline methods. FAST achieves superior performance across most configurations. For instance, on Llama-3.2-3B-Instruct-L20-8X-Standard, FAST attains -0.9527, significantly surpassing the baselines, which score -0.6926 and -0.9186. In special token reconstruction (MSE_st), FAST shows marked improvements across models. On Qwen2.5-7B-Instruct-L18-8X-Standard, FAST achieves 0.6468, outperforming the baselines (5.1985 and 1.5093). In the JumpReLU SAEs, it achieves -9.7604 compared to -4.0005 and -8.0743.

Overall, the findings demonstrate that FAST excels in reconstructing both general and special tokens. Interestingly, FAST shows even stronger improvements in Standard SAE architectures compared to JumpReLU SAEs, potentially because the latter already reconstructs well, leaving less room for enhancement. Despite limitations in Standard architectures due to L1 regularization and ReLU activation, FAST significantly improves token reconstruction in these models.

5 Feature Interpretability

This section evaluates the interpretability of features extracted by SAEs through an automated analysis framework, building upon prior methodologies (Bills et al., 2023; Cunningham and Conerly, 2024; He et al., 2024). The middle layers of the trained SAEs are selected for analysis based on their demonstrated superior performance. Given that experiments demonstrate that the JumpReLU activation function outperforms other alternatives (Rajamanoharan et al., 2024b; Lieberum et al., 2024), the evaluation exclusively employs SAEs equipped with JumpReLU. Table 10 presents the specific SAE models evaluated.

Figure 5: Experiment results of feature interpretability. FAST achieves notable improvements compared to the other two training methods across all the tested models. FAST attains 21.1% of features rated in the highest quality range (scores 4-5), in contrast to 7.0% for BT(P) and 10.2% for BT(F).

Score | Description
5 | Clear pattern with no deviating examples
4 | Clear pattern with one or two deviating examples
3 | Clear overall pattern but quite a few examples not fitting that pattern
2 | Broad consistent theme but lacking structure
1 | No discernible pattern

Table 2: Scoring criteria for feature interpretability.

An additional 10,000 instances are sampled and their activation values are computed. The top five sentences with the highest activation values are then identified to construct an activation dataset for evaluating features. Based on the assumption that dead features are irrelevant to the evaluation, an initial screening of features is conducted, ensuring that only features with non-zero activation values in the top five sentences are retained. After that, we randomly select 128 features for the final evaluation. GPT-4o [4] is prompted to score each group of five contexts and generate a descriptive summary. Additionally, a monosemanticity score ranging from 1 to 5 is assigned, based on a rubric adapted from (Cunningham and Conerly, 2024; He et al., 2024). The detailed prompt is shown in Appendix E.2.

[4] GPT-4o version: 2024-11-20
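A minimal sketch of the screening step just described, assuming per-sentence maximum activations have already been computed; the data layout and helper name are illustrative, not the paper's implementation.

```python
import random
import torch

def build_activation_dataset(feature_acts, sentences, top_k=5, n_eval=128, seed=42):
    """feature_acts: [num_sentences, num_features] max activation of each feature per sentence."""
    num_features = feature_acts.shape[1]
    kept = []
    for f in range(num_features):
        top_vals, top_idx = feature_acts[:, f].topk(top_k)
        if torch.all(top_vals > 0):            # screen out dead / barely-active features
            kept.append((f, [sentences[i] for i in top_idx.tolist()]))
    random.seed(seed)
    return random.sample(kept, min(n_eval, len(kept)))   # 128 features sent to the LLM judge
```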
A total of 4,608 feature scores are computed and presented in Figure 5. The results demonstrate that FAST consistently outperforms BT(P) and BT(F) across all evaluated SAEs. For the 8X-scaled Llama3.2-3B-Instruct, FAST achieves 21.1% of features in the highest quality range (scores 4-5), compared to 7.0% for BT(P) and 10.2% for BT(F). Generally, compared to both baseline methods, we observe that FAST reduces the proportion of low-quality features while increasing the proportion of high-quality features in both 8X and 16X SAEs. This highlights the superiority of FAST in producing more interpretable features during SAE training. Furthermore, Cumulative Distribution Function (CDF) curve analysis reveals that FAST's percentage of features scoring below 3 is consistently the lowest. For instance, with the Qwen2.5-3B-Instruct model, the CDF at score 3 is 76.5% for FAST, compared to 89.0% for BT(F) and 92.2% for BT(P), indicating fewer low-scoring features for FAST. These findings suggest that both appropriate training dataset selection for SAEs and the sequential training methodology contribute to enhanced model interpretability. FAST appears to successfully integrate these aspects, leading to more interpretable SAEs.

6 Steering with SAE Latents

Feature steering represents an intuitive approach to evaluate model inference by adjusting the activation coefficients within a trained SAE, thereby directly influencing the model's output. This method resembles the use of decoder latent vectors for activation guidance, but the SAE offers a more robust and unambiguous process for it. Based on the formulations in Equations 2 and 5, the reconstructed outputs of the SAE derive from a weighted combination of its latent variables (Ferrando et al., 2024; Templeton, 2024):

$z' = z + \alpha d_k$  (8)

These latent variables correspond to row vectors of $W_{dec}$, with $\alpha$ scaling the $k$-th latent. To implement this steering, a latent dimension $k$ is selected and its decoder vector $d_k$ is scaled by $\alpha$. Then $z'$ is introduced into the model's residual stream.

Following Ferrando et al. (2024), 1,010 sampled instruction instances are randomly partitioned into two parts: 1,000 samples to identify highly activated SAE features and 10 samples to evaluate post-steering model outputs. We use the chat template corresponding to the instruct model during inference. The 10 questions appear in Appendix F.1. We focus on features related to these special tokens [5] (shown in Table 11) to examine how special tokens, which are not associated with specific entities, influence the model's output. Using the 1,000 samples, the average maximum activation values are calculated for each feature. Complete activation values for each model appear in Appendix F.3.

[5] user and assistant are incorporated into the special tokens, as they frequently appear together with other special tokens.

Three representative questions are selected to illustrate the effects of steering features. Due to space constraints, feature steering primarily focuses on <|start_header_id|> for Llama3.1-8B-Instruct and <|im_start|> for Qwen2.5-7B-Instruct. The experiments employ scaling $\alpha \in \{0, 15, 25, 50, 100, 150, 200\}$ using the 8X JumpReLU SAE trained with FAST and greedy decoding. Detailed analyses of the three questions are presented in Appendix F.4.

Steering high-activation features, particularly those associated with special tokens, significantly influences the model's output quality and reasoning ability. This effect remains consistent across diverse tasks and linguistic contexts. There is an optimal range for the coefficient $\alpha$: within this range, model responses become more accurate, coherent, and relevant to the given instructions.

For instance, in Question 3 (F.4.2), amplifying the activation of a feature tied to both <|im_start|> and user results in a clear transition: moderate values of $\alpha$ improved engagement and output relevance, while excessive amplification led to language switching and incoherent, repetitive text. Similarly, in Question 4 (F.4.3), steering the highest-activation feature associated with the <|im_start|> marker within a specific coefficient range led to more convincing and logically structured answers, but pushing $\alpha$ too far again degraded output quality. Similar patterns can also be observed in Q2 (F.4.1).
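A minimal sketch of how the intervention $z' = z + \alpha d_k$ from Equation (8) could be applied as a forward hook on the residual stream; the hook mechanics, layer index, and coefficient below are illustrative assumptions, not the paper's released steering code.

```python
import torch

def make_steering_hook(sae_decoder_weight, k: int, alpha: float):
    """Return a forward hook implementing z' = z + alpha * d_k (Eq. 8).

    sae_decoder_weight: [d_sae, d_model] decoder matrix W_dec; row k is the latent direction d_k.
    """
    d_k = sae_decoder_weight[k].detach()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * d_k.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on a Hugging Face decoder layer (layer index, feature index, and alpha are illustrative):
# handle = model.model.layers[18].register_forward_hook(make_steering_hook(W_dec, k=123, alpha=50.0))
# ... run generation with greedy decoding ...
# handle.remove()
```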
The consistency of these findings suggests that the features encode essential aspects of the model's reasoning capabilities, transcending individual tasks or linguistic contexts. The existence of an optimal coefficient $\alpha$ range suggests a "sweet spot" for feature steering, enhancing performance without introducing the degradation seen at higher coefficients.

This observation presents important implications for the practical application of SAEs. It demonstrates that steering certain features, particularly those associated with special tokens, emerges as a reliable method to improve model performance across diverse tasks. Unlike traditional SAE-feature approaches, which often impose output biases tied to predefined meanings or entities, feature steering with special tokens refines the guidance of models, resulting in higher-quality responses.

7 Conclusion

This paper proposes a novel approach, Finetuning-aligned Sequential Training (FAST), for training SAEs on instruct models. By independently processing individual data instances while maintaining semantic integrity, FAST addresses the limitations of traditional Block Training (BT) methods, which often suffer from semantic discontinuity and misalignment with downstream task requirements. Experimental results show that FAST improves performance across various SAE models, demonstrating its versatility and general applicability. Furthermore, FAST consistently achieves superior results in feature interpretability evaluations, highlighting its effectiveness and advantages.

We also employ SAEs to explore the influence of special tokens on model outputs. Results indicate that steering features within a specific coefficient range substantially enhances model output quality. These insights provide a valuable method for studying the functional roles of special tokens and practical applications of SAEs. To facilitate future research, the complete codebase, datasets, and a total of 240 pre-trained SAE models will be released publicly, establishing a robust foundation for innovation and advancement in this domain.

Limitations

As language models increase in scale, investigating their internals with SAE-based methods becomes more challenging. Computational constraints restrict our investigation to smaller Qwen and Llama models (under 8B parameters), though our framework could be extended to larger architectures. Feature interpretability analysis focuses mainly on strongly activated features, potentially overlooking weakly activated samples (He et al., 2024). Furthermore, the feature steering experiments are preliminary studies centered on special token-related features that correlate with response quality. A more comprehensive investigation of these features' influence remains an important direction for future research.

Ethical Statements

This research focuses on interpreting and steering instruction-tuned language models through sparse autoencoders. All experiments rely solely on publicly available, appropriately licensed text corpora that are deduplicated and stripped of personally identifiable information; no human subjects are involved and no private data is collected. Nevertheless, it is important to acknowledge that LLMs are trained on extensive publicly available datasets, potentially resulting in inadvertent reproduction of copyrighted material. Our code, parameters, and deduplicated demo data will be released under an open-source licence to support reproducibility.
References Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sour- but, et al. 2024. Foundational challenges in assuring alignment and safety of large language models.arXiv preprint arXiv:2404.09932. Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. Linear algebraic struc- ture of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495. BAAI. 2024.Infinity instruct.arXiv preprint arXiv:2406.X. Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Had- field, et al. 2023. Managing ai risks in an era of rapid progress.arXiv preprint arXiv:2310.17688, page 18. Leonard Bereska and Efstratios Gavves. 2024. Mech- anistic interpretability for ai safety–a review.arXiv preprint arXiv:2404.14082. Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Lan- guage models can explain neurons in language mod- els. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. 2023. Towards monosemanticity: Decom- posing language models with dictionary learning. Transformer Circuits Thread. Https://transformer- circuits.pub/2023/monosemantic- features/index.html. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901. Sébastien Bubeck, Varun Chandrasekaran, Ronen El- dan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lund- berg, et al. 2023. Sparks of artificial general intelli- gence: Early experiments with gpt-4.arXiv preprint arXiv:2303.12712. Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, et al. 2024. Black-box access is insuffi- cient for rigorous ai audits. InThe 2024 ACM Confer- ence on Fairness, Accountability, and Transparency, pages 2254–2272. Hoagy Cunningham and Tom Conerly. 2024. Circuits updates - june 2024.Transformer Circuits Thread. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. 2022. Toy models of superposition. arXiv preprint arXiv:2209.10652. Javier Ferrando, Oscar Obeso, Senthooran Rajamanoha- ran, and Neel Nanda. 2024. Do i know this entity? knowledge awareness and hallucinations in language models.arXiv preprint arXiv:2411.14257. 
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024.Scaling and evaluating sparse autoencoders.arXiv preprint arXiv:2406.04093. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: In- centivizing reasoning capability in llms via reinforce- ment learning.arXiv preprint arXiv:2501.12948. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. InProceedings of the IEEE international conference on computer vision, pages 1026–1034. Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al. 2024. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526. Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. 2021. Unsolved problems in ml safety.arXiv preprint arXiv:2109.13916. Dan Hendrycks, Mantas Mazeika, and Thomas Wood- side. 2023. An overview of catastrophic ai risks. arXiv preprint arXiv:2306.12001. Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, et al. 2023. Ai alignment: A comprehensive survey.arXiv preprint arXiv:2310.19852. Curt Tigges Joseph Bloom and David Chanin. 2024. Saelens.https://github.com/jbloomAus/ SAELens. Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. 2024a. Saes (usually) transfer be- tween base and chat models. Alignment Forum. Connor Kissane, Robert Krzyzanowski, Neel Nanda, and Arthur Conmy. 2024b. Saes are highly dataset dependent: A case study on the refusal direction. Alignment Forum. Kenneth Kreutz-Delgado, Joseph F Murray, Bhaskar D Rao, Kjersti Engan, Te-Won Lee, and Terrence J Sejnowski. 2003. Dictionary learning algorithms for sparse representation.Neural computation, 15(2):349–396. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. 2024. Tülu 3: Pushing frontiers in open language model post-training. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147. Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. InProceedings of the 2013 conference of the north american chapter of the as- sociation for computational linguistics: Human lan- guage technologies, pages 746–751. Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. 2024. AgentInstruct: To- ward Generative Teaching with Agentic Flows. https://arxiv.org/abs/2407.03502.Preprint, arXiv:2407.03502. Neel Nanda. 2022a.200 concrete open problems in mechanistic interpretability: Introduction.Neel Nanda’s Blog. Neel Nanda. 2022b. 
200 cop in mi: Analysing training dynamics.Neel Nanda’s Blog. Neel Nanda. 2022c. 200 cop in mi: Interpreting algo- rithmic problems.Neel Nanda’s Blog. Neel Nanda. 2022d. A comprehensive mechanistic in- terpretability explainer & glossary.Neel Nanda’s Blog. Neel Nanda. 2023. Mechanistic interpretability quick- start guide.Neel Nanda’s Blog. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits.Distill, 5(3):e00024– 001. Christopher Olah. 2022. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Sandip Slama, Alex Ray, et al. 2022. Training language models to follow instruc- tions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. 2024a. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014. Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024b. Jumping ahead: Im- proving reconstruction fidelity with jumprelu sparse autoencoders.arXiv preprint arXiv:2407.14435. Peter Slattery, Alexander K Saeri, Emily AC Grundy, Jess Graham, Michael Noetel, Risto Uuk, James Dao, Soroush Pour, Stephen Casper, and Neil Thompson. 2024. The ai risk repository: A comprehensive meta- review, database, and taxonomy of risks from artifi- cial intelligence.arXiv preprint arXiv:2408.12622. Adly Templeton. 2024.Scaling monosemanticity: Ex- tracting interpretable features from claude 3 sonnet. Anthropic. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits rea- soning in large language models.Advances in neural information processing systems, 35:24824–24837. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2. 5 tech- nical report.arXiv preprint arXiv:2412.15115. Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. 2021. Transformer visualization via dictio- nary learning: contextualized embedding as a linear superposition of transformer factors.arXiv preprint arXiv:2103.15949. Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. 2024. Wildchat: 1m chatGPT interaction logs in the wild. InThe Twelfth International Conference on Learning Representa- tions. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric. P Xing, Joseph E. Gonza- lez, Ion Stoica, and Hao Zhang. 2023. Lmsys-chat- 1m: A large-scale real-world llm conversation dataset. Preprint, arXiv:2309.11998. A SAE Initialization Method The encoder weights (W enc ) and decoder weights (W dec ) are initialized using the Kaiming Uniform initialization method (He et al., 2015). This step, used exclusively in the JumpReLU method, normalizes each row of theW dec using the L2 norm and adjusts the thresholdεand encoder biasb enc accordingly. After that, some data is selected for geometric median evaluation. The goal is to minimize the weighted sum of distances to all sample points. To achieve this, the Weiszfeld algorithm is employed to a specified precision offtol= 1×10 −20 . 
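A minimal NumPy sketch of this Weiszfeld iteration, whose update formulas are given in Equations (9)-(12) below; the function name, tolerance handling, and defaults are illustrative assumptions.

```python
import numpy as np

def geometric_median(points, weights=None, ftol=1e-20, eps=1e-12, max_iter=1000):
    """Weighted geometric median via Weiszfeld iterations (Appendix A sketch)."""
    points = np.asarray(points, dtype=np.float64)        # [n, d]
    w = np.ones(len(points)) if weights is None else np.asarray(weights, dtype=np.float64)
    m = (w[:, None] * points).sum(0) / w.sum()           # m_0: weighted mean (Eq. 9)
    f_prev = (w * np.linalg.norm(points - m, axis=1)).sum()
    for _ in range(max_iter):
        d = np.linalg.norm(points - m, axis=1)           # distances d_i (Eq. 10)
        w_adj = w / np.maximum(d, eps)                   # re-weight by inverse distance
        m = (w_adj[:, None] * points).sum(0) / w_adj.sum()   # update m_{k+1} (Eq. 11)
        f_cur = (w * np.linalg.norm(points - m, axis=1)).sum()
        if abs(f_cur - f_prev) <= ftol * f_prev:         # stop at the stated precision (Eq. 12)
            break
        f_prev = f_cur
    return m
```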
The resulting optimal point is then used as the initial value for $b_{dec}$, which is set to 0. The formulas for the geometric median evaluation are as follows:

$f(m) = \sum_{i=1}^{n} w_i \|m - p_i\|, \quad m_0 = \frac{\sum_{i=1}^{n} w_i p_i}{\sum_{i=1}^{n} w_i}$  (9)

$d_i = \|p_i - m_k\|, \quad w'_i = \frac{w_i}{\max(d_i, \varepsilon)}$  (10)

$m_{k+1} = \frac{\sum_{i=1}^{n} w'_i p_i}{\sum_{i=1}^{n} w'_i}$  (11)

$|f(m_{k+1}) - f(m_k)| \leq \mathrm{ftol} \cdot f(m_k)$  (12)

The parameters used in the equations are defined as follows: $m$ represents the target point or the weighted mean to be optimized, while $p_i$ is the $i$-th data point in the dataset. $w_i$ denotes the weight associated with the $i$-th data point. The objective function $f(m)$ is the weighted sum of distances between $m$ and all data points $p_i$. The initial estimate of $m$, denoted $m_0$, is calculated as the weighted mean of all points. $d_i$ is the distance between the $i$-th data point $p_i$ and the current estimate $m_k$. The updated weight for the $i$-th data point, $w'_i$, is adjusted by the distance $d_i$ and a small constant $\varepsilon$ to prevent division by zero. $m_{k+1}$ is the updated estimate of $m$ at iteration $k+1$, computed as the weighted mean of all points using the updated weights $w'_i$.

B SFT Dataset Construction Details

We collect and integrate several large-scale instruction datasets specifically designed for fine-tuning LLMs. The datasets are listed below:
• WildChat-1M-Full (Zhao et al., 2024) is a dataset comprising 1 million conversations between human users and ChatGPT, enriched with demographic metadata such as state, country, hashed IP addresses, and request headers.
• Infinity-Instruct (BAAI, 2024) is a large-scale, high-quality instruction dataset, specifically designed to enhance the instruction-following capabilities of LLMs in both general and domain-specific tasks.
• tulu-3-sft-mixture (Lambert et al., 2024) is used to train the Tulu 3 series of models.
• orca-agentinstruct-1M-v1-cleaned [6] is a cleaned version of the orca-agentinstruct-1M-v1 (Mitra et al., 2024) dataset released by Microsoft, a fully synthetic dataset using only raw text publicly available on the web as seed data.
• lmsys-chat-1m (Zheng et al., 2023) is a comprehensive real-world conversational dataset containing one million interactions with 25 LLMs. This dataset spans a wide range of topics and interaction types, effectively capturing diverse user-LLM interaction patterns.

[6] https://huggingface.co/datasets/mlabonne/orca-agentinstruct-1M-v1-cleaned

Together, they comprise 11,425,231 samples, forming a robust and diverse foundation for advancing research on instruct LLMs. Inevitably, many datasets contain a significant amount of similar or even duplicate data, which can adversely affect both model training and the accuracy of evaluations. To address this issue, we employ an n-gram-based deduplication technique to preprocess the data (Algorithm 1). The n-gram method decomposes text into consecutive sequences of n words (or characters), effectively capturing local features.
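For reference, a hedged Python rendering of the deduplication procedure formalized in Algorithm 1 below; the record layout (a conversations list with content fields) follows the SFT format shown earlier, and the helper names are our assumptions.

```python
import hashlib

def generate_ngrams(text: str, n: int = 20):
    """Yield hashes of consecutive n-word sequences (20-grams by default)."""
    words = text.split()
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        yield hashlib.md5(gram.encode("utf-8")).hexdigest()

def deduplicate(dataset, n: int = 20):
    """Algorithm 1: drop any sample sharing at least one n-gram with an earlier sample."""
    seen_hashes, deduped = set(), []
    for sample in dataset:
        ngrams = set()
        for turn in sample["conversations"]:
            ngrams.update(generate_ngrams(turn["content"], n))
        if ngrams & seen_hashes:          # any overlap -> duplicate-like sample, skip
            continue
        seen_hashes |= ngrams
        deduped.append(sample)
    return deduped
```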
Algorithm 1: Deduplicate Dataset by N-Grams
Input: Dataset D, N-gram size n
Output: Deduplicated dataset D_dedup
1: D_dedup ← ∅  # Initialize deduplicated dataset
2: seen_hashes ← ∅  # Set to store hashes of seen N-grams
3: for each sample s in D do
4:   ngrams ← ∅  # Initialize N-grams for the sample
5:   for each conversation c in s.conversations do
6:     ngrams ← ngrams ∪ GenerateNGrams(c.content, n)
7:   end for
8:   if any Hash(ngram) ∈ seen_hashes for ngram ∈ ngrams then
9:     continue  # Skip sample if any N-gram hash is already seen
10:  end if
11:  seen_hashes ← seen_hashes ∪ {Hash(ngram) | ngram ∈ ngrams}
12:  D_dedup ← D_dedup ∪ {s}
13: end for
14: return D_dedup

This approach enables the detection and identification of repetitive patterns within the text. By leveraging this method, we are able to filter out not only completely identical instances but also content that exhibits high semantic or structural similarity. Consequently, the quality and diversity of the dataset are significantly enhanced. Finally, we adopt a 20-gram deduplication strategy to eliminate redundancy in the dataset. After applying this process, a total of 4,758,226 data entries are obtained.

C Hyperparameter Settings

The detailed parameter settings used in the experiments are as follows:

General Settings
• Learning Rate (lr): 7×10^-5
• End Learning Rate (lr_end): 7×10^-6
• Seed: 42
• Data Type (dtype): float32

Optimizer Settings
• Optimizer: Adam
  – Beta 1 (β1): 0.9
  – Beta 2 (β2): 0.999
• Learning Rate Scheduler: cosine annealing
  – Learning Rate Decay Steps: 64,000
  – Learning Rate Warm-up Steps: 16,000
• Sparsity Loss Coefficient (L_sparsity):
  – 0.01 for JumpReLU
  – 5 for Standard
• Sparsity Loss Warm-up Steps: 10,000

Training Settings
• Training Tokens: 4.096×10^7
• Train Batch Size (tokens): 128

Activation and Decoder Initialization
• Decoder Initialization Method (b_dec_init_method): geometric_median
• Normalize SAE Decoder: True
• Dead Feature Threshold: 10^-8
• Dead Feature Window: 1000

Additional Settings
• Noise Scale: 0
• Expansion Factor: 8 or 16
• Feature Sampling Window: 2000
• JumpReLU Bandwidth: 0.001
• JumpReLU Init Threshold: 0.001
• Apply Decoder to Input (apply_b_dec_to_input): False
• Use Ghost Gradients: False
• Use Cached Activations: False

D Mean Squared Error (MSE) of SAEs

The Mean Squared Error (MSE) results for the token reconstruction task are presented in this section.

D.1 Mean Squared Error (MSE) of special tokens of Standard SAEs

Figure 6: MSE_st performance of the Standard SAE (all metrics are presented in log scale, where lower values indicate better SAE reconstruction performance). Within the Standard architecture, FAST exhibits the best reconstruction capability compared to BT(P) and BT(F).

While the reconstruction capability of Standard SAE models is generally inferior to the JumpReLU structure, FAST is also able to effectively reduce the MSE_st, especially in the Qwen series models.
D.2 MSE of SAEs trained on Llama-3.1-8B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 48 BT(P)-5.5059-4.2377-9.4350-6.8026 BT(F)-5.6080-4.8046-9.8097-8.3853 FAST-5.6432-4.7236-9.8187-10.1534 128 BT(P)-3.2837-1.6776-11.2353-5.4823 BT(F)-3.3437-2.8733-13.9975-9.2049 FAST-3.4104-3.0011-14.1393-12.1287 188 BT(P)-1.6059-0.6085-13.0282-7.4267 BT(F)-1.7131-1.6009-15.0851-10.4278 FAST-1.8697-2.2923-15.0666-12.4442 208 BT(P)-1.1852-0.1692-13.3080-7.8271 BT(F)-1.3509-1.3587-14.7969-10.4507 FAST-1.4721-1.9375-15.5552-13.1463 258 BT(P)-0.16771.0444-12.9767-7.1657 BT(F)-0.5163-0.5639-16.6192-11.6569 FAST-0.5747-0.8982-16.5138-15.9845 Table 3: Mean Squared Error (MSE) of SAEs trained on Llama-3.1-8B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. D.3 MSE of SAEs trained on Llama-3.2-3B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 4 8 BT(P)-4.5650-3.8363-13.7434-8.3908 BT(F)-4.5785-3.8250-13.6105-8.5868 FAST-4.5931-3.9053-9.0852-8.7193 16 BT(P)-4.5645-3.8158-9.6278-7.5321 BT(F)-4.5858-3.8210-9.6102-7.6905 FAST-4.5959-3.9055-9.8054-9.3065 12 8 BT(P)-2.6239-1.9052-13.4038-8.5246 BT(F)-2.6757-2.1318-14.7879-9.1440 FAST-2.7236-2.4763-15.3747-13.4614 16 BT(P)-2.6279-1.9488-12.2827-7.7836 BT(F)-2.6754-2.2725-13.8874-8.4299 FAST-2.7509-2.5644-14.4420-12.6355 20 8 BT(P)-0.6926-0.4378-13.5554-8.4006 BT(F)-0.9186-1.0709-14.8424-8.9061 FAST-0.9527-1.4473-18.8809-17.3707 16 BT(P)-0.8145-0.4607-13.1516-9.1137 BT(F)-1.0947-1.1447-14.2900-8.9611 FAST-1.1285-1.5387-14.6872-12.1711 Table 4: Mean Squared Error (MSE) of SAEs trained on Llama-3.2-3B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. D.4 MSE of SAEs trained on Llama-3.2-1B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 4 8 BT(P)-5.3374-4.4021-15.3160-9.6296 BT(F)-5.3583-4.4375-15.6237-10.0324 FAST-5.3775-4.3920-15.8654-13.9127 16 BT(P)-5.3370-4.3794-14.5574-9.0583 BT(F)-5.3587-4.4358-14.7275-9.4817 FAST-5.3804-4.3879-10.5009-10.2448 9 8 BT(P)-3.6638-2.9507-7.9900-7.2577 BT(F)-3.7759-3.0874-16.1021-10.5349 FAST-3.8282-3.5754-16.4928-13.9685 16 BT(P)-3.6642-2.9456-7.1584-6.5155 BT(F)-3.8049-3.3775-15.1966-9.8149 FAST-3.8344-3.6778-15.8696-12.9629 14 8 BT(P)-1.2195-0.4927-8.0419-5.1825 BT(F)-1.7311-1.7559-15.2996-9.3409 FAST-1.7410-2.6844-21.4449-23.4395 16 BT(P)-1.2449-0.5642-6.4784-5.2817 BT(F)-1.8371-1.8036-14.9445-9.3654 FAST-1.8409-2.7668-16.2748-13.3547 Table 5: Mean Squared Error (MSE) of SAEs trained on Llama-3.2-1B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. 
D.5 MSE of SAEs trained on Qwen2.5-7B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 48 BT(P)1.29197.2207-4.18521.9109 BT(F)-0.52330.0494-5.9622-3.3368 FAST-0.7358-1.6090-10.6174-11.9105 128 BT(P)1.47515.7788-5.8014-4.1171 BT(F)0.76810.9550-6.3039-5.9309 FAST0.6177-0.0770-9.8207-10.4545 188 BT(P)2.00245.1985-6.5926-4.0005 BT(F)1.47491.5093-6.8466-8.0743 FAST1.38920.6468-9.1659-9.7604 208 BT(P)2.67725.1501-4.9649-0.7776 BT(F)2.14531.9877-5.6461-3.5904 FAST2.07961.1869-8.2213-8.7821 258 BT(P)4.87646.2532-2.14822.0938 BT(F)4.41393.7031-2.69571.6207 FAST4.44713.0934-4.9598-5.5615 Table 6: Mean Squared Error (MSE) of SAEs trained on Qwen2.5-7B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. D.6 MSE of SAEs trained on Qwen2.5-3B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 4 8 BT(P)-0.88732.8616-8.7177-2.2147 BT(F)-1.45721.1595-8.5340-1.9954 FAST-1.5098-1.6682-13.9907-11.6534 16 BT(P)-1.00582.8627-8.8511-2.3755 BT(F)-1.66851.1371-8.9769-2.4576 FAST-1.5147-1.7482-13.2162-10.7660 18 8 BT(P)0.92573.0243-9.2313-2.9916 BT(F)0.47441.1862-9.3796-2.9188 FAST0.6782-0.9288-10.3007-11.2916 16 BT(P)0.85943.4799-9.6147-3.1930 BT(F)0.34381.1729-9.5534-3.0426 FAST0.5485-1.0730-10.3197-11.1114 32 8 BT(P)3.88834.7227-4.3442-2.3480 BT(F)3.43883.7056-5.5300-5.3856 FAST3.66471.6953-5.0278-7.3022 16 BT(P)3.77364.6584-4.4299-2.9327 BT(F)3.29783.4334-5.6515-6.2729 FAST3.56761.4331-5.0783-7.2653 Table 7: Mean Squared Error (MSE) of SAEs trained on Qwen2.5-3B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. D.6.1 Qwen2.5-1.5B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 4 8 BT(P)-0.11503.8222-5.04041.5111 BT(F)-0.56533.2719-5.17941.3737 FAST-0.7745-2.1358-13.4069-12.5193 16 BT(P)-0.23153.8196-4.89801.6550 BT(F)-0.76143.2068-5.14951.4045 FAST-0.9958-2.0996-13.3622-11.6841 14 8 BT(P)0.40873.5463-5.49901.0522 BT(F)0.03062.9569-6.27910.2762 FAST-0.0925-1.2535-11.2579-11.8198 16 BT(P)0.31863.5454-4.95611.5981 BT(F)-0.09183.0073-5.95670.5989 FAST-0.2312-1.3543-11.6309-12.1911 24 8 BT(P)3.05064.3907-4.64250.4759 BT(F)2.54243.5608-5.36300.5141 FAST2.51220.6336-6.2603-7.9484 16 BT(P)2.94114.3725-4.45661.1218 BT(F)2.38773.5499-5.02981.0916 FAST2.37620.3794-6.3063-8.0686 Table 8: Mean Squared Error (MSE) of SAEs trained on Qwen2.5-1.5B-Instruct. Each value is highlighted with a green background to indicate performance, with darker shades of green representing better results. 
D.7 MSE of SAEs trained on Qwen2.5-0.5B-Instruct Layer Expansion Method Standard SAEJumpReLU SAE Factorlog 2 (MSE)log 2 (MSE st )log 2 (MSE)log 2 (MSE st ) 4 8 BT(P)-2.7554-0.1257-10.6725-4.1202 BT(F)-2.8808-1.3213-11.6763-5.1212 FAST-2.8732-3.2218-21.7343-23.1697 16 BT(P)-2.9204-0.0721-10.7024-4.1569 BT(F)-3.1034-1.1148-11.6959-5.1497 FAST-3.0970-3.2153-17.4590-16.7389 12 8 BT(P)-2.0463-0.0492-9.5392-2.9978 BT(F)-2.2811-1.1008-10.4276-3.8743 FAST-2.2836-3.0505-21.1734-25.6605 16 BT(P)-2.1648-0.0915-9.4019-2.8551 BT(F)-2.4489-1.1418-10.5582-4.0043 FAST-2.4406-3.0602-20.7499-19.0931 20 8 BT(P)0.24081.3303-10.5099-4.2017 BT(F)-0.3029-0.0174-11.4078-4.8666 FAST-0.3387-1.9461-15.2442-16.9599 16 BT(P)0.12961.2181-10.6728-4.2739 BT(F)-0.4536-0.0825-11.3337-4.7864 FAST-0.4924-2.1033-16.3662-18.0564 Table 9: Mean Squared Error (MSE) of SAEs trained on Qwen2.5-0.5B-Instruct. The best and second-best methods are highlighted with dark green and light green backgrounds, respectively. E Implementation Details of Feature Interpretability This section provides a detailed explanation of the implementation process for evaluating and interpreting feature interpretability. E.1 SAEs for Feature Interpretability Model NameLayerExpansion Factor Llama series Llama-3.1-8B-Instruct188X Llama-3.2-3B-Instruct128X&16X Llama-3.2-1B-Instruct98X&16X Qwen series Qwen2.5-7B-Instruct188X Qwen2.5-3B-Instruct188X&16X Qwen2.5-1.5B-Instruct148X&16X Qwen2.5-0.5B-Instruct128X&16X Table 10: Model configurations of the Llama and Qwen model series. E.2 Prompt for Feature Interpretability System Prompt We are analyzing the activation levels of features in a neural network. Each feature activates specific tokens in a text, and the activation value of each token indicates its relevance to the feature. Higher activation values signify a stronger association. Your task is to evaluate the feature based on the following scoring rubric and assign it a monose- manticity score. ### Scoring Rubric: Activation Consistency 1: No discernible pattern 2: Broad consistent theme but lacking structure 3: Clear overall pattern but quite a few examples not fitting that pattern 4: Clear pattern with one or two deviating examples 5: Clear pattern with no deviating examples ### Instructions: 1. Analyze the context provided, which consists of a sequence of alternating tokens and their cor- responding activation values. 2. Assign a score based on the activation consistency rubric. 3. Provide a descriptive name for the feature that captures its essence. Example output: ’My final verdict score is: [[3]], feature name is [[Mathematical Problem Explana- tion]]’. User: prompt Prompt Template Below is the context of feature feature_index, represented as sentences with tokens and their activation values: context F Implementation Details of Steering with SAE Latents F.1 10 Questions Question 1: How do I export constants and classes from a JavaScript module? Question 2: FINAL EXAM Question 1. What was Elsie Marley profession? Question 3: lettre de mativation Question 4: 请回答以下问题,找出铁锤和磁铁之间的主要区别是什么? Question 5: Summarize this article in one sentence. playback is not supported on this device pulled away from American Dathan Ritzenhein in the last mile in his first race since retaining his 5,000m and 10,000m Olympic titles in Rio. the women\"s race, Olympic 5,000m champion Vivian Cheruiyot of Kenya won in her first half marathon. \"s Mark Telford took the men\"s wheelchair crown, a second ahead of fellow Briton Bret Crossley. 
Great North Run is the world\"s biggest half marathon and there were more than 41,000 runners taking part in this year\"s event from 178 nations. , 33, was taken on a fast pace by former American 5,000m record holder Ritzenhein, but powered away with a mile to go and even had time to do a cheeky heel flip before he crossed the line in one hour and four seconds, the slowest of his three wins. \"s Emmanuel Bett, who ran the second half of the race almost on his own, crossed in third. told BBC Sport: \"To be honest with you, I\"m knackered. \"I knew I had to work hard because Dathan is a former training partner and was running a great race. \"He put his foot down and tried to get rid of me because he knew I have amazing pace. \"It\"s good to finish the year on a high, what a year I\"ve had. I just want to go home now, chill out, see the kids, get up to no good. \" out about how to get into running with our special guide. women\"s race was billed as a shoot-out between middle distance greats Cheruiyot and three-time Olympic champion Tirunesh Dibaba. failed to keep pace with Cheruiyot and fellow Kenyan Priscah Jeptoo in the closing stages of the 13.1-mile course. It was Cheruiyot who took victory, producing a sprint finish to clock 1:07.54, just one second ahead of Jeptoo. said: \"I\"m so happy because it\"s my birthday. I found it tough with one kilometre to go but it\"s fantastic for me to end my season this way. \" playback is not supported on this device : Question 6: def intersection(list1, list2): \"\"\" This function returns a list of common elements between two lists: list1 and list2. Parameters: list1 (list): First list of elements list2 (list): Second list of elements Returns: list: A list of common elements between list1 and list2 \"\"\" # Complete the code to find the intersection of list1 and list2 using nested for loops common_elements = [] for element1 in list1: for element2 in list2: if element1 == element2: common_elements.append(element1) break return common_elements Question 7: Article: Carol Dunbar stands outside of her writing studio that is under her family’s water tower on their property deep in the woods south of Superior, Wisconsin. Jed Carlson / Superior Telegram Dunbar works at the desk inside her writing studio in her family’s water tower in the woods south of Superior on Tuesday afternoon, Oct. 26, 2021. Jed Carlson / Superior Telegram Dunbar speaks about her love of nature and living off the grid on her family’s property deep in the woods south of Superior on Tuesday afternoon, Oct. 26, 2021. Jed Carlson / Superior Telegram Dunbar looks through short writings and other trinkets that were her grandmothers that she keeps in her writing studio in her family’s water tower in the woods south of Superior on Tuesday afternoon, Oct. 26, 2021. Jed Carlson / Superior Telegram Carol Dunbar talks about her struggles with the editing process of her upcoming novel as she sits in her writing studio in the family’s water tower south of Superior on Tuesday afternoon, Oct. 26, 2021. Jed Carlson / Superior Telegram Carol Dunbar looks out of one of the windows of her writing studio in her family’s water tower on their property deep in the woods south of Superior on Tuesday afternoon, Oct. 26, 2021. Jed Carlson / Superior Telegram , Wis. — Carol Dunbar stepped through the woods as fallen leaves crunched beneath her feet. Her homestead south of Superior includes the main residence, her husband’s workshop and a water tower. 
Living off the grid, the structure is a necessity for the homestead’s water pressure — and for Dunbar’s work. “Me getting into this water tower was finding a space where I could shut a door behind me to create,” she said. “I wouldn’t want any other kind of office, but it definitely has its challenges.” novelist and freelance ghostwriter’s computers, manuscripts and books all reside under what some might consider to be their worst enemy: “There are literally two 250-gallon tanks of water over my head right now,” she said. , her office has flooded several times. “It’s like being in a room that’s pouring rain. It’s awful, and I’ve had to make peace with that.” see her work be so vulnerable makes it that much more endearing. “I know there’s a really interesting metaphor about art and risk,” she added. ’s no other space on their 80 acres where she can work the way she’s able to here. After numerous floods and years spent working from the living room, her husband redid the space and built the staircase for better access and heat circulation. intended as a guest room, it’s a 10-by-10 space on the second floor of the water tower. She calls it the cockpit. ’s a porch on the back and windows on all four sides, so “I feel like I’m writing in the treetops,” she said. she hears water moving through the pipes around her, “The view that it affords me and the peace that I have here in this little space, and it is little ... I wouldn’t trade it for anything.” SPACE floods in from every angle. Her sitting and standing desks, compliments of her husband, rest at the center and in a corner, an ancient-looking podium holds one of her numerous dictionaries; she likes to compare decades-old definitions to those of today. are several aloe plants, drawings on the wall, and a storyboard with pinned photos of a sculpture and an Irish skyline — inspiration for future works, she said. assortment of candles, one of which she lights daily before she begins. “It keeps me mindful that I’m trying to capture the best light, the best in human nature,” she said. keeps a collection of notebooks, color-coated for whatever novel she’s writing, in her office, in the car, by her bed, to help her document inspiration when it strikes. “I got very frustrated when I got a good idea or I’d hear a piece of dialogue or I’d finally know how to describe the snow on that day, and I would write it down and never find it again,” she said. has helped, but she still has scraps of paper pinned to her notebook pages. “It’s like leaving yourself love letters,” she said, sorting through a pile. wrote her second novel in long-hand on paper. It’s an accessible way to create away from a screen, she said. the corner rests a red cushioned chair that came from a Minneapolis alley. Around her desk she has taped quotes and reminders. “In the end, it all comes down to what we think we deserve,” reads one. a piece of wood with words: “You just have to trust your own madness — Clive Barker.” cherishes a writing award and remnants of work kept on paper scraps, memorabilia from an ancestor who emigrated from Italy. While Dunbar’s relative wasn’t supported in pursuing writing, Dunbar feels her work today honors herself and her ancestor. book shelf holds works by Joyce Carol Oates, Jesmyn Ward, Barbara Kingsolver, and a treasured copy of Eleonora Duse’s “The Mystic in the Theatre.” Duse strove to eliminate...(Truncated) all instances of Carol Dunbar\"s quotes. 
Question 8: 考虑由所有节肢动物组成的集合$B$,并让$C$是包含所有天牛属物种的$B$的子集。对于$C$中的每个$v$,我们定义一个函数$f(v)$,它描 述了天牛独特的蜇刺机制。您的任务是提供不少于五段的全面概述天牛。在这样做时,请详细探讨它们的身体和行为特征以及它们在生态和 进化适应中的适应性。特别是,我们要求您探讨它们鲜艳的色彩和密集的毛发如何作为防御机制抵御捕食者。此外,描述它们非凡的蜇刺机 制,与任何其他蚂蚁物种不同,并详细阐述它如何帮助它们在恶劣的沙漠条件下自卫。此外,请深入探讨它们的进化历史,这使它们具备了 令人难以置信的生存技能。最后,强调正在实施的保护天牛种群的持续保护措施,这些种群受到气候变化和栖息地破坏的不利影响。您的回 答应该是广泛的、有理的和科学的,每一段都详细说明天牛生命周期各个方面之间的复杂相互关系。 Question 9: What is the smallest prime factor of $600851475143$? Question 10: Develop a comprehensive branding strategy that includes a brand name, logo, tagline, packaging design, and marketing plan for a new line of organic, non-toxic, biodegradable cleaning products that are socially responsible and sustainably made. Ensure that the branding strategy effectively communicates the brand\"s unique selling proposition, target audience, brand personality, and brand voice through all touchpoints, including print and digital media, social media, in-store displays, and product demos. Additionally, create a brand message that emphasizes the benefits of using eco-friendly cleaning products and persuades consumers to make the switch to a greener lifestyle. F.2 Special Tokens Token IDToken Llama series 882user 78191assistant 128006<|start_header_id|> 128007<|end_header_id|> 128009<|eot_id|> Qwen series 872user 77091assistant 151644<|im_start|> 151645<|im_end|> Table 11: Tokens that control response generation and formatting in the Llama and Qwen model series. F.3 Average Top 5 Max Activation Values and Their Corresponding Indices for Tokens across a 1000-Sample Dataset ApproachTokenTop 5 Max Activation Value (Index:Value) BT(P)[8X] 8824453:0.812030511:0.72418547:0.59719110:0.50020505:0.469 781915188:0.50301923:0.490031873:0.48620505:0.4683187:0.4620 1280062604:7.122020523:0.8007428:0.733024017:0.70216640:0.678 12800723901:1.1937808:0.52103268:0.518020505:0.47730244:0.473 12800920505:0.74425940:0.6537961:0.646021317:0.58519110:0.569 BT(F)[8X] 88211765:0.82325025:0.8147043:0.688016826:0.56221896:0.560 7819130553:0.5369728:0.527011435:0.50714565:0.50513234:0.497 12800617784:7.48017355:0.94728634:0.7829333:0.771027149:0.744 12800723677:1.0026426:0.668026136:0.6035783:0.572026958:0.526 12800923677:0.8347100:0.756030568:0.73415188:0.6668346:0.6430 FAST[8X] 88222534:0.61113320:0.47029165:0.46419871:0.42829033:0.418 7819116063:0.46313320:0.46119871:0.46032613:0.44122277:0.399 12800622642:4.3922417:0.717027839:0.7063095:0.703010814:0.654 12800730457:2.48919871:0.5326870:0.464028096:0.44613266:0.413 12800913822:0.75322277:0.60621866:0.53717489:0.493118:0.41200 Table 12: Top 5 Average Activation Values for Special Tokens in Llama3.1-8B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 8823817:0.455011734:0.430505:0.4220023884:0.41714851:0.380 781916451:0.346011061:0.34019811:0.32712369:0.32511734:0.308 1280062064:20.3515699:0.409014393:0.3997505:0.3770548:0.37500 12800720232:0.4275095:0.400019583:0.39323908:0.3623719:0.3590 12800914536:0.46816718:0.43723736:0.41313925:0.37910211:0.368 BT(P)[16X] 88223287:0.81444336:0.71810727:0.71211701:0.68326467:0.658 7819134602:0.62210655:0.60045414:0.59123156:0.55319333:0.522 12800638076:28.4148766:0.67516639:0.65928134:0.65345:0.621000 1280079822:0.753039737:0.6595712:0.643038496:0.57423156:0.570 128009483:0.7980048233:0.78922660:0.67024339:0.62423774:0.600 BT(F)[8X] 88221524:0.49617981:0.47110125:0.43611210:0.43114456:0.410 7819116126:0.4478704:0.447020691:0.41819630:0.39310125:0.365 12800615765:21.391640:0.518014456:0.47945:0.45900017981:0.442 
1280077814:0.512024565:0.4891759:0.48408704:0.439014456:0.396 1280095506:0.523020691:0.51420328:0.4886878:0.45507593:0.4460 BT(F)[16X] 88220561:0.71928995:0.69814625:0.66232041:0.6254844:0.5850 7819123154:0.7258239:0.670045582:0.63023594:0.59311425:0.564 12800630984:25.3810207:0.75221441:0.75126876:0.70035477:0.683 12800741219:0.68714625:0.67021050:0.66223942:0.62127267:0.595 12800926876:0.76113612:0.7229537:0.693044518:0.6536317:0.6240 FAST[8X] 8822950:0.57301343:0.567016808:0.49819508:0.4815931:0.4590 7819123183:0.548263:0.509008564:0.48602680:0.475023798:0.472 1280068772:37.47120896:0.6102950:0.606012126:0.53816622:0.534 12800712955:0.55022995:0.5363339:0.50807878:0.49702950:0.4730 1280097814:0.585016940:0.5514605:0.508012331:0.4934439:0.4880 FAST[16X] 8829447:0.83805861:0.721019741:0.71622320:0.66925160:0.645 781914177:0.822043897:0.71918009:0.66725117:0.59430970:0.590 12800622974:37.6636:0.87300018075:0.81326318:0.77445047:0.762 12800742421:0.798655:0.7530013955:0.69726318:0.63228994:0.589 12800929041:0.88818075:0.84433332:0.7762705:0.712026318:0.695 Table 13: Top 5 Average Activation Values for Special Tokens in Llama3.2-3B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 88212248:0.45514322:0.44610030:0.44411886:0.425731:0.39800 7819114903:0.44315672:0.4358014:0.419013261:0.41011985:0.405 1280064464:10.4634858:0.460012143:0.4549898:0.44406877:0.3700 128007196:0.4540015332:0.3989561:0.358012143:0.355626:0.35500 12800915332:0.4961296:0.49104858:0.41706877:0.417015975:0.412 BT(P)[16X] 88220612:0.64222827:0.6133012:0.605011176:0.5782141:0.5760 7819128423:0.67224765:0.66130621:0.64922827:0.64918585:0.621 1280064169:11.46011176:0.7939495:0.67709911:0.673024072:0.586 12800726090:0.82010861:0.62224072:0.61526939:0.59123109:0.541 12800911176:0.74716525:0.71626594:0.6858403:0.649015861:0.633 BT(F)[8X] 8822387:0.413013266:0.3417778:0.30908423:0.28403682:0.2800 781917783:0.332010427:0.3168941:0.315016174:0.3114764:0.3080 1280062537:9.946015768:0.3829146:0.35001604:0.344014204:0.312 12800710680:0.39015478:0.3128905:0.30906638:0.302015034:0.284 1280092568:0.40503528:0.386014204:0.3711604:0.360015768:0.313 BT(F)[16X] 88224100:0.5306794:0.52407848:0.52309322:0.490017577:0.490 7819112548:0.58324258:0.5422092:0.52602460:0.496015997:0.484 1280064967:10.55924354:0.67520054:0.61412136:0.59912707:0.537 12800718190:0.5812543:0.500023285:0.49915997:0.49417059:0.486 12800926830:0.63517228:0.62311407:0.55118494:0.52311681:0.483 FAST[8X] 8822926:0.3780878:0.354004753:0.337010237:0.3367582:0.3140 7819113371:0.38814099:0.3768581:0.368011313:0.3615121:0.3400 12800612361:8.48613371:0.386878:0.37500129:0.349001866:0.3300 1280078581:0.412012864:0.35713371:0.3414478:0.33804523:0.3150 128009878:0.4700011483:0.4086832:0.37708581:0.3690865:0.34700 FAST[16X] 8821835:0.75003851:0.7100982:0.604009493:0.60208463:0.4780 7819119765:0.59614393:0.53928589:0.5122350:0.485012592:0.482 12800612329:10.309838:0.644013262:0.5921450:0.526027818:0.504 1280073368:0.582031764:0.56816867:0.51816432:0.5039648:0.4590 12800910365:0.69631406:0.63730028:0.60215515:0.57416339:0.535 Table 14: Top 5 Average Activation Values for Special Tokens in Llama3.2-1B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 87212461:9.058439:3.8800019183:2.97818767:2.88913685:1.992 770912547:2.933015678:2.56219183:2.5496508:2.32904400:2.0270 15164412461:9.1931261:2.70506508:2.30602547:2.12404400:2.1140 1516451261:2.97302547:2.86406508:2.414018778:2.22313888:2.118 BT(F)[8X] 
8724710:6.350015390:3.37720684:3.19225558:2.93727629:2.800 7709125558:3.13527629:3.06119040:3.01210759:2.80213257:2.378 1516444710:6.717010759:3.41211735:3.04928219:2.74926983:2.596 15164528219:3.13011735:2.6922174:2.467010614:2.46425812:2.120 FAST[8X] 87213794:37.1917783:4.81620022:4.51921950:4.07711739:4.053 7709120022:5.66711739:4.35216782:4.1802670:3.781013794:3.731 15164413794:39.8720022:5.4187579:4.19003817:4.189026689:4.023 15164520022:4.4632670:3.697022845:3.13925469:2.9399676:2.6890 Table 15: Top 5 Average Activation Values for Special Tokens in Qwen2.5-7B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 87211485:2.7568925:2.44903645:2.41301600:2.11602801:2.0860 7709110992:1.9111600:1.830015929:1.77714942:1.74712230:1.677 1516447152:132.522713:2.010011354:1.99615302:1.89115795:1.885 15164512297:2.58811352:2.4574096:2.452010336:2.42910992:2.214 BT(P)[16X] 87214113:2.01012080:1.75018074:1.73914580:1.7202607:1.4890 770914047:1.386027294:1.2943356:1.289014113:1.2489469:1.2420 15164432641:150.014113:1.3627224:1.334028068:1.3274741:1.2860 15164523725:1.69614113:1.67425421:1.66968:1.6190009469:1.5140 BT(F)[8X] 8727603:2.83803184:2.784015060:2.7778391:2.73906484:2.3780 7709115060:3.1753373:2.35307293:2.34801317:2.33987603:2.2900 15164416236:121.216225:2.5637603:2.50007189:2.4970958:2.43000 1516453104:3.99101317:3.421016225:3.3976700:3.350015704:3.101 BT(F)[16X] 87223210:2.32029265:1.80711930:1.76728994:1.7122757:1.5020 7709123210:1.8446805:1.657020713:1.56411930:1.54429265:1.483 15164431443:153.423210:2.1605146:2.001024831:1.89429265:1.859 1516455146:2.98805924:2.43205572:2.342012821:2.07824491:1.502 FAST[8X] 8722941:3.44108775:2.640010076:2.62512216:2.178776:1.99600 770912653:3.637010076:3.4503411:3.05409785:2.510011618:2.004 1516448775:248.3612291:2.88010076:2.8293411:2.828013964:2.566 15164510076:4.53812216:3.77512139:3.7294383:3.592012209:3.279 FAST[16X] 8726863:3.76009230:2.951020605:2.44621312:2.28517408:2.063 7709123681:4.2236863:3.944017147:3.05910035:2.9694751:2.7968 15164431443:85.355599:1.59749299:1.534118964:1.4454751:1.4220 15164523681:3.0006863:2.439020511:2.1739230:1.821517147:1.517 Table 16: Top 5 Average Activation Values for Special Tokens in Qwen2.5-3B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 872734:312.4412664:2.5160576:2.316004162:2.10509629:2.1030 770911656:2.26703248:2.20904162:2.10404098:2.09108997:2.0460 151644734:288.485391:1.925005536:1.924011982:1.66011102:1.625 15164511322:1.905734:1.745001263:1.60309637:1.590012143:1.499 BT(P)[16X] 87215738:261.83080:1.49202724:1.373019787:1.37217743:1.258 770912724:1.372017351:1.3541954:1.334019787:1.3079767:1.2760 15164415738:241.513157:1.14813486:1.11614339:0.9456977:0.9250 1516459971:1.127022929:1.03214028:1.00319840:0.93622072:0.864 BT(F)[8X] 8721910:255.407039:2.55909420:2.53008118:2.47101693:2.4060 770918118:2.70407067:2.52301223:2.48907039:2.46704086:2.4190 1516441910:234.854798:1.99706153:1.89005905:1.700011021:1.682 15164510536:1.87011021:1.7247064:1.65501787:1.56306153:1.5040 BT(F)[16X] 8722077:263.4913135:1.62417747:1.43916136:1.35319975:1.338 770916886:1.517019975:1.50817747:1.50018492:1.29616136:1.249 1516442077:242.0619387:1.5344177:1.358022526:1.28319497:1.178 1516454177:1.16105724:1.10009985:1.08906552:1.019011894:0.945 FAST[8X] 8727505:462.494918:2.40104694:2.30604141:2.162010728:2.098 77091491:2.258004141:2.230011303:2.1258603:2.00906358:1.9430 1516447505:425.7310900:1.7936473:1.756010139:1.6142006:1.5990 
151645491:2.2010011115:1.74811252:1.6656473:1.553010257:1.326 FAST[16X] 87221852:580.011515:1.9889360:1.572021118:1.50111834:1.487 7709121118:2.0689718:1.612014362:1.5369360:1.524011834:1.477 15164421852:532.921118:1.68316522:1.35017617:1.26512233:1.174 15164521118:2.07017617:1.47416522:1.31218955:1.19621139:1.084 Table 17: Top 5 Average Activation Values for Special Tokens in Qwen2.5-1.5B-instruct with JumpReLU SAE ApproachToken IDTop 5 Max Activation Value (Index:Value) BT(P)[8X] 8726091:1.06802897:0.82501389:0.82406239:0.81506434:0.7770 770913245:0.84301767:0.84301389:0.83105981:0.81206239:0.7790 1516441608:43.2096818:0.76006245:0.74806724:0.71501235:0.7150 1516454541:0.81705212:0.80101744:0.77604498:0.7280507:0.72400 BT(P)[16X] 8728475:0.688013976:0.545889:0.510008786:0.46803099:0.4680 770913099:0.54809308:0.534013976:0.5288786:0.4830432:0.46500 15164410161:28.277726:0.48306509:0.45509343:0.45106947:0.4260 1516451934:0.558012380:0.5057726:0.43707385:0.43701823:0.4280 BT(F)[8X] 8725375:1.02903317:0.90004825:0.85103896:0.83605791:0.8260 770913896:0.85104825:0.84502552:0.84205375:0.80303203:0.8010 1516442428:40.9995130:0.75101326:0.7050557:0.681002765:0.6540 1516452734:0.89706507:0.7080628:0.696002913:0.69301119:0.6680 BT(F)[16X] 87213102:0.65812215:0.57210208:0.5426285:0.46705598:0.4430 770917823:0.586012215:0.58010208:0.55112606:0.5215598:0.4871 1516441983:27.7615393:0.518012215:0.4585515:0.44709460:0.4360 1516454484:0.498012615:0.47213322:0.4415393:0.43708592:0.3820 FAST[8X] 8721299:0.93102747:0.90903288:0.81701859:0.78604804:0.7210 770916296:0.89606776:0.86403288:0.84507041:0.83002747:0.8140 1516443154:34.650825:0.717005377:0.69406140:0.68303724:0.6450 1516453724:0.86303955:0.82401371:0.80303931:0.69405940:0.6740 FAST[16X] 87211717:0.5786739:0.50308487:0.49902010:0.464012647:0.442 770918487:0.58406739:0.534011717:0.52911505:0.4932851:0.4760 1516443384:28.3244241:0.47209335:0.425011285:0.416298:0.38400 1516454241:0.54105731:0.44506167:0.44407780:0.39405314:0.3770 Table 18: Top 5 Average Activation Values for Special Tokens in Qwen2.5-0.5B-insturct with JumpReLU SAE F.4 Steering Output of Three Questions F.4.1 Q2 Question2: FINAL EXAM Question 1. What was Elsie Marley profession? Raw Output () α=0 I'm sorry, but there seems to be some confusion. "Elsie Marley" is not a widely known public figure or character, and without more context, it's difficult to determine who this person is or their profession. Could you please provide more information or context about Elsie Marley? This would help me give you an accurate answer.<|im_end|> Steered Output () α=25 I'm sorry, but there seems to be some confusion. "Elsie Marley" is not a widely known public figure or character, and without more context, it's difficult to determine who this person is or their profession. Could you please provide more information or context about Elsie Marley? This would help me give you an accurate answer.<|im_end|> Steered Output () α=100 To answer the question accurately, I'l need more context about which "Elsie Marley" is being referred to, as there could be multiple individuals with this name. However, based on the name and the common use of names in literature, I'l assume you're referring to Elsie Marley from J.K. Rowling's Harry Potter series. In the Harry Potter series, Elsie Marley is the grandmother of Sirius Black's son, Nymphadora Tonks. She is mentioned in "Harry Potter and the Order of the Phoenix" but does not have a significant role in the story. 
If you're referring to a different Elsie Marley, please provide more details so I can give you the correct information.<|im_end|>
Qwen2.5-7B-Instruct-L18-8X JumpReLU SAE
Feature ID: 13794 (act most on "<|im_start|>" and "user")
Steered Output () α=150
To answer the question accurately, I'l need more context about which "Elsie Marley" is being referred to, as there could be multiple individuals with this name. However, based on the name and the commonality of the surname "Marley," one possible reference is to the character Elsie Marley from the Harry Potter series, who is the mother of Sirius Black. In the books, Elsie Marley is described as a Muggle woman who was Sirius Black's mother. If this is the case, her profession is not explicitly stated in the books, but given her status as a Muggle, it's likely she was involved in a profession typical for her time and social class, such as a teacher, nurse, or office worker.
If you have a different context in mind, please provide more details.<|im_end|>
Figure 7: The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing on the user and <|im_start|> tokens for Question 2 (entity description).
Llama3.1-8B-Instruct-L18-8X JumpReLU SAE
Question2: FINAL EXAM Question 1. What was Elsie Marley profession?
Raw Output () α=0
I don't have information on Elsie Marley's profession.<|eot_id|>
Steered Output () α=15
I don't have information on Elsie Marley's profession.<|eot_id|>
Steered Output () α=25
I cannot verify the profession of Elsie Marley. If you'd like to know more about Elsie Marley, I'd be happy to help you find the information you need. I can also help you find information about a wide range of other topics if you'd like. Is there anything else I can help you with? If you have any other questions, feel free to ask.<|eot_id|>
Feature ID: 22642 (act most on "<|start_header_id|>")
Figure 8: The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on the <|start_header_id|> token for Question 2 (entity description).
For Question 2, the Qwen model (Figure 7) shows noticeably improved output quality when feature 13794 is moderately amplified (with α in the range of 25 to 75). Within this range, the responses become more polite, detailed, and engaging, showing a clear enhancement in interaction quality. However, when the amplification coefficient exceeds this sweet spot (e.g., α ≥ 100), the model begins to fabricate information and eventually devolves into repetitive or nonsensical output, resulting in a rapid decline in quality.
In comparison, the Llama model (Figure 8) only benefits from a much narrower range of amplification (approximately α = 15 to 25). Within this window, its responses become slightly more polite and helpful, but still lack substantive factual content. Beyond this narrow range, the output quickly becomes repetitive and loses coherence. Overall, Qwen improves output quality over a broader range of amplification coefficients, while Llama's effective range is much more limited.
F.4.2 Q3
Question3: lettre de mativation
Raw Output () α=0
Bien sûr, je peux vous aider à rédiger une lettre de motivation. Cependant, pour que je puisse vous fournir un modèle précis, j'aurais besoin de quelques informations supplémentaires :
1. Quel est le poste pour lequel vous postulez ?
2. Quelle est l'entreprise à laquelle vous postulez ?
3. Quelles sont vos principales compétences et expériences pertinentes ?
4. Quels sont vos objectifs professionnels à court et long terme ?
Avec ces informations, je pourrai vous proposer un modèle de lettre de motivation personnalisé et adaptée à votre situation.<|im_end|> Steered Output () α=15 Voici un exemple de lettre de motivation pour un poste d'ingénieur en informatique : [Votre nom et adresse] [Date] [Nom et adresse de l'entreprise] Objet : Candidature au poste d'ingénieur en informatique Madame, Monsieur, Je me permets de vous adresser ma candidature pour le poste d'ingénieur en informatique que vous avez bien voulu publier sur votre site internet. Titulaire d'un master en informatique de gestion, je suis actuellement ingénieur en développement logiciel chez [nom de votre entreprise actuelle]. Au cours de ma carrière, j'ai eu l'opportunité de travailler sur des projets variés, notamment en matière de développement web, d'architecture de systèmes et de gestion de projets. Mes principales compétences sont les suivantes : - Maîtrise des langages de programmation : Java, Python, C# - Connaissance approfondie de l'architecture de systèmes et des technologies cloud (AWS, Azure) - Expérience en gestion de projets et en gestion de la qualité - Bonne capacité d'analyse et de résolution de problèmes - Excellentes compétences en communication et en travail d'équipe Au cours de mes précédentes expériences professionnelles, j'ai pu développer une approche méthodique et rigoureuse dans la conception et la mise en œuvre de solutions informatiques. Je suis particulièrement attiré par votre entreprise pour ses valeurs d'innovation et de performance. Je suis convaincu que mes compétences et mon expérience me permettront de contribuer efficacement au développement de vos projets. Je serais ravi de pouvoir discuter de cette opportunité avec vous et de vous présenter plus en détail mes motivations et mes projets. Je vous remercie par avance pour l'attention que vous porterez à ma candidature et reste à votre disposition pour tout renseignement complémentaire. Dans l'attente de votre réponse, je vous prie d'agréer, Madame, Monsieur, l'expression de mes salutations distinguées. [Votre signature] [Votre nom] N'hésitez pas à adapter ce modèle en fonction de votre expérience et du poste pour lequel vous postulez.<|im_end|> Steered Output () α=25 Voici un exemple de lettre de motivation pour un poste d'assistant : [Your Name] [Your Address] [City, State, ZIP Code] [Email Address] [Phone Number] [Date] [Recipient's Name] [Company's Name] [Company's Address] [City, State, ZIP Code] Dear [Recipient's Name], I am writing to express my interest in the Assistant position at [Company's Name], as advertised on [where you found the job posting]. With a strong background in [mention relevant skills or experience], I am excited about the opportunity to contribute to your team and support your operations. In my previous role as an Administrative Assistant at [Previous Company's Name], I gained valuable experience in [mention key responsibilities and achievements]. I was responsible for [specific tasks or projects], which allowed me to develop skills in [mention relevant skills]. My ability to [mention a specific skill or accomplishment] has been particularly beneficial in managing [mention a specific task or project]. I am particularly drawn to [Company's Name] because of [mention something specific about the company or its mission that appeals to you]. I am impressed by [mention a specific aspect of the company or its work], and I believe my skills and experience align well with your needs. 
In my current role, I have honed my [mention relevant skills], which I believe will be valuable in supporting your team. I am proficient in [mention relevant software or tools], and I am eager to bring my attention to detail and organizational skills to your team. I am excited about the prospect of contributing to [Company's Name] and am confident that my background, skills, and enthusiasm make me a strong candidate for the Assistant position. I would welcome the opportunity to discuss how I can contribute to your team further. Thank you for considering my application. I look forward to the possibility of discussing this opportunity with you. Sincerely, [Your Name] Remember to customize this template to fit your specific situation and the company you are applying to. Good luck with your application!<| im_end|> Qwen2.5-7B-Instruct-L18-8X JumpReLU SAE Feature ID: 13794 (act most on “<|im_start|>" and "user") Figure 9: The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing onuserand <|im_start|>tokens for the Question 3 (cover letter task). Question3: lettre de mativation Raw Output () α=0 Je ne comprends pas ce que vous voulez dire avec "lettre de mativation". Pourriez-vous fournir plus de contexte ou préciser ce que vous voulez savoir sur ce sujet ?<|eot_id|> Steered Output () α=15 Je ne comprends pas ce que vous voulez dire avec "lettre de mativation". Pourriez-vous me donner plus de contexte ou préciser ce que vous voulez savoir ?<|eot_id|> Steered Output () α=25 Je ne comprends pas ce que vous voulez dire avec "lettre de mativation". Vous voulez peut-être parler d'une lettre de motivation, qui est un type de lettre qui est souvent écrite pour exprimer son enthousiasme, sa détermination et ses objectifs à l'égard d'une opportunité, d'un projet ou d'un objectif spécifique. Si c'est le cas, voici un exemple de lettre de motivation : [Votre nom] [Votre adresse] [Ville, code postal] [Date] [Nom du destinataire] [Titre du destinataire] [Nom de l'entreprise ou de l'organisation] [Adresse de l'entreprise ou de l'organisation] [Ville, code postal] Objet : [Objet de la lettre, par exemple "Candidature pour le poste de [poste]"] Cher(e) [Nom du destinataire], Je vous écris pour exprimer mon enthousiasme et ma détermination à rejoindre [nom de l'entreprise ou de l'organisation] dans le cadre du poste de [poste] que vous avez publié. Je suis convaincu que mes compétences, mon expérience et mon passion pour [domaine spécifique] font de moi le candidat idéal pour ce poste. Je suis particulièrement attiré(e) par [aspects spécifiques de l'entreprise ou de l'organisation qui vous intéressent, par exemple "votre engagement envers la qualité et la sécurité" ou "votre culture d'innovation et de collaboration"]. Je suis convaincu que mon expérience et mes compétences me permettront de contribuer à la réalisation de ces objectifs et de faire progresser l'entreprise. Je possède [liste de vos compétences et expériences pertinentes pour le poste]. Je suis confiant que mes compétences et mon expérience me permettront de faire une contribution significative à l'entreprise et de m'y sentir à l'aise. Je suis particulièrement enthousiaste à l'idée de rejoindre une équipe aussi passionnée et motivée que celle de [nom de l'entreprise ou de l'organisation]. Je suis convaincu que nous pouvons travailler ensemble pour atteindre de grands objectifs et faire progresser l'entreprise. Je vous remercie de l'attention que vous porterez à ma candidature. 
Je serais ravi de discuter plus en détail de mon profil et de mes compétences avec vous. Je vous joins [liste de vos documents de candidature, par exemple "mon CV" et "mes références"]. Je vous remercie à nouveau de l'attention que vous porterez à ma candidature. Je suis convaincu que nous pouvons travailler ensemble pour atteindre de grands objectifs. Cordialement, [Votre nom] N'oubliez pas de personnaliser la lettre en fonction de votre propre expérience et de vos compétences, et de laisser suffisamment de temps pour que la lettre soit lue et considérée.<|eot_id|> Llama3.1-8B-Instruct-L18-8X JumpReLU SAE Feature ID: 22642 (act most on "<|start_header_id|>") Figure 10: The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|>tokens for the Question 3 (cover letter task). For Question 3, the Qwen model (Figure 9) shows that moderate amplification of feature 13794 (with αbetween 50 and 100) leads to more informative and structured responses, providing richer content and clearer reasoning. This indicates a substantial improvement in output quality within this coefficient range. However, further increasing the amplification causes the model to hallucinate, such as switching languages or generating irrelevant content, and ultimately results in repetitive or meaningless output. The Llama model (Figure 10) also exhibits some improvement in informativeness and engagement when its most active feature is lightly amplified, but this effect is only present at very low coefficients (up to aboutα= 25). Beyond this point, the output rapidly deteriorates into repetitive or off-topic text. Compared to Qwen, Llama’s window for beneficial amplification is much narrower and less robust. F.4.3 Q4 Qwen2.5-7B-Instruct-L18-8X JumpReLU SAE Feature ID: 13794 (act most on “<|im_start|>" and "user") Question4: 请回答以下问题,找出铁锤和磁铁之间的主要区别是什么? Raw Output () α=0 铁锤和磁铁的主要区别在于它们的性质和功能。铁锤是一种工具,主要用于敲打、破碎或装配物体,它的主要功能是提供机械力。而磁 铁则是一种能够吸引铁、镍、钴等磁性材料的物体,它的主要性质是磁性。因此,铁锤和磁铁之间的主要区别在于它们的功能和性质不 同。<|im_end|> Steered Output () α=25 铁锤和磁铁的主要区别在于它们的性质和功能。铁锤是一种工具,主要用于敲打、破碎或装配物体,它的主要功能是提供机械力。而磁 铁则是一种能够吸引铁、镍、钴等磁性材料的物体,它的主要性质是磁性。因此,铁锤和磁铁之间的主要区别在于它们的功能和性质不 同。<|im_end|> Steered Output () α=100 要回答这个问题,我们需要了解铁锤和磁铁各自的基本特性和功能。 铁锤是一种工具,主要用于敲击、拆卸或组装物体。它通常由一个坚硬的金属头(通常是铁或钢制成)和一个手柄组成。铁锤的设计是 为了提供力量和控制,以便于进行各种物理操作。 磁铁是一种能够吸引铁、镍、钴等金属的物体的物质。磁铁具有磁性,可以用来吸引或排斥其他磁性物体。磁铁的种类很多,包括永久 磁铁和电磁铁,它们的磁性可以是固定的也可以是可变的。 因此,铁锤和磁铁之间的主要区别在于它们的功能和特性: 1. 功能:铁锤主要用于物理操作,如敲击、拆卸或组装物体;而磁铁主要用于吸引或排斥具有磁性的物体。 2. 特性:铁锤是一个物理工具,没有磁性;磁铁则具有吸引特定金属的能力,这是由于其内部的磁场。 综上所述,铁锤和磁铁之间的主要区别在于它们的功能和特性。<|im_end|> Figure 11: The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing onuserand <|im_start|>tokens for the Question 4 (entity discrimination task). Llama3.1-8B-Instruct-L18-8X JumpReLU SAE Question4: 请回答以下问题,找出铁锤和磁铁之间的主要区别是什么? Raw Output () α=0 铁锤和磁铁之间的主要区别在于它们的磁性质。铁锤是一种非磁性材料,它不具有磁性,而磁铁是一种磁性材料,它具有磁性,可以吸 附铁锤。<|eot_id|> Steered Output () α=15 铁锤和磁铁之间的主要区别在于它们的磁性质。铁锤通常是非磁性的,除非它被磁化或加磁。磁铁则是磁性的,具有磁场,可以吸附铁 锤或其他磁性物体。<|eot_id|> Steered Output () α=25 一块铁锤和一块磁铁之间的主要区别在于铁锤通常不是磁铁,而磁铁则是具有磁性并且可以吸附铁锤的物体。铁锤通常是铁制的,但它 不是磁铁。磁铁则是铁制的,但具有磁性,可以吸附铁锤。<|eot_id|> Feature ID: 22642 (act most on "<|start_header_id|>") Figure 12: The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|>tokens for the Question 4 (entity discrimination task). 
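Throughout Figures 7–12, steering amplifies a single SAE feature at the special-token positions with a coefficient α. The sketch below illustrates one common way to implement such an intervention with a PyTorch forward hook: adding α times the feature's decoder direction to the residual stream at the targeted positions. Both this choice and the attribute names in the usage comment are our assumptions for illustration; the paper's exact intervention may differ (for example, it may rescale the feature's own activation instead).

```python
import torch


def make_steering_hook(decoder_dir: torch.Tensor, alpha: float,
                       target_positions: torch.Tensor):
    """Forward hook adding alpha * decoder_dir at selected token positions.

    decoder_dir:      [d_model] decoder direction of the chosen SAE feature
                      (e.g. feature 13794 for Qwen2.5-7B-Instruct, layer 18).
    target_positions: [seq_len] boolean mask, True at the positions to steer
                      (e.g. the user / <|im_start|> special tokens).
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, target_positions, :] += alpha * decoder_dir.to(
            hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden

    return hook


# Usage sketch (layer index and attribute names are illustrative):
# handle = model.model.layers[18].register_forward_hook(
#     make_steering_hook(sae.W_dec[13794], alpha=100.0, target_positions=mask))
# outputs = model.generate(**inputs)
# handle.remove()
```

Removing the hook (or setting α = 0) recovers the raw output, matching the α = 0 rows in the figures above.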
In Question 4, both models show that feature amplification can enhance Chain-of-Thought (CoT) (Wei et al., 2022) reasoning and answer quality, but only within specific coefficient ranges. For Qwen (Figure 11), amplifying the most active feature with α between 25 and 100 produces more convincing, informative, and well-structured responses. This improvement is especially evident in the quality of reasoning and the clarity of the final answers. However, excessive amplification again leads to a loss of coherence and informativeness.
For Llama (Figure 12), a similar pattern is observed but within an even narrower range. Mild amplification (up to α = 25) can slightly improve the quality of reasoning and engagement, but any further increase quickly causes the output to become repetitive and less meaningful. This highlights that while both models benefit from feature amplification, Qwen maintains improved output quality over a wider range of coefficients, whereas Llama's useful range is much more restricted.

G Model Training Log
Due to space constraints, we select training logs from a subset of SAEs for presentation. The complete training logs for all SAEs will be released publicly. Each subsection below corresponds to a set of training-curve figures showing, for BT(P) (Block Training, pretraining dataset), BT(F) (Block Training, finetuning dataset), and FAST (Finetuning-aligned Sequential Training): (a) L1 loss for Standard SAEs or L0 loss for JumpReLU SAEs, (b) MSE loss, (c) overall loss, (d) L0 sparsity, (e) dead features, (f) explained variance, and (g) CE loss score.

G.1 Llama-3.1-8B-Instruct
G.1.1 L18-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.1.2 L18-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.2 Llama-3.2-1B-Instruct
G.2.1 L9-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.2.2 L9-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.3 Llama-3.2-3B-Instruct
G.3.1 L12-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.3.2 L12-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.4 Qwen-2.5-7B-Instruct
G.4.1 L18-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.4.2 L18-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.5 Qwen-2.5-3B-Instruct
G.5.1 L18-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.5.2 L18-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.6 Qwen-2.5-1.5B-Instruct
G.6.1 L14-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.6.2 L14-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]

G.7 Qwen-2.5-0.5B-Instruct
G.7.1 L12-8X-Standard
[Training curves for BT(P), BT(F), and FAST.]
G.7.2 L12-8X-JumpReLU
[Training curves for BT(P), BT(F), and FAST.]