Paper deep dive
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
Zirui He, Haiyan Zhao, Yiran Qiao, Fan Yang, Ali Payani, Jing Ma, Mengnan Du
Models: Gemma-2-2B, Gemma-2-9B, Llama-3.1-8B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 7:29:16 PM
Summary
The paper introduces SAIF, a framework using Sparse Autoencoders (SAEs) to interpret and steer instruction-following behavior in Large Language Models (LLMs). By identifying instruction-relevant latent features and computing steering vectors, the authors demonstrate that instruction following is encoded by a set of high-level concepts rather than single features. The framework effectively improves model performance across various instruction types and model scales, highlighting the importance of feature selection, the final transformer layer, and instruction positioning.
Entities (5)
Relation Signals (3)
SAIF → uses → Sparse Autoencoders
confidence 100% · This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models.
SAIF → steers → LLMs
confidence 95% · We demonstrate how the features we identify can effectively steer model outputs to align with given instructions.
Sparse Autoencoders → decomposes → Hidden Representations
confidence 90% · SAEs are employed to decompose hidden representations into a high-dimension space and then reconstruct the hidden representations.
Cypher Suggestions (2)
Find all models evaluated using the SAIF framework. · confidence 90% · unvalidated
MATCH (f:Framework {name: 'SAIF'})-[:STEERS|EVALUATED_ON]->(m:Model) RETURN m.name
Identify the relationship between methodologies and their application in the paper. · confidence 85% · unvalidated
MATCH (m:Methodology)-[r]->(e:Entity) RETURN m.name, type(r), e.name
Abstract
The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction following behavior. Our findings reveal that instruction following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of the final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.
Tags
Links
- Source: https://arxiv.org/abs/2502.11356
- Canonical: https://arxiv.org/abs/2502.11356
Full Text
72,581 characters extracted from source content.
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Zirui He 1,*, Haiyan Zhao 1,*, Yiran Qiao 2, Fan Yang 3, Ali Payani 4, Jing Ma 2, Mengnan Du 1
1 NJIT, 2 Case Western Reserve University, 3 Wake Forest University, 4 Cisco (* equal contribution)
zh296,hz54,mengnan.du@njit.edu; yxq350,jxm1384@case.edu; yangfan@wfu.edu; apayani@cisco.com

Abstract

The ability of large language models (LLMs) to follow instructions is crucial for their practical applications, yet the underlying mechanisms remain poorly understood. This paper presents a novel framework that leverages sparse autoencoders (SAE) to interpret how instruction following works in these models. We demonstrate how the features we identify can effectively steer model outputs to align with given instructions. Through analysis of SAE latent activations, we identify specific latents responsible for instruction following behavior. Our findings reveal that instruction following capabilities are encoded by a distinct set of instruction-relevant SAE latents. These latents both show semantic proximity to relevant instructions and demonstrate causal effects on model behavior. Our research highlights several crucial factors for achieving effective steering performance: precise feature identification, the role of the final layer, and optimal instruction positioning. Additionally, we demonstrate that our methodology scales effectively across SAEs and LLMs of varying sizes.

1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in following instructions, enabling alignment between model outputs and user objectives. These capabilities are typically gained through instruction tuning methods (Ouyang et al., 2022; Wei et al., 2022), including extensive training data and computationally intensive fine-tuning processes.
While these approaches effectively control model behavior, the underlying mechanisms by which models process and respond to instructions remain poorly understood. In-depth mechanistic investigations are essential for improving our ability to control models and enhance their instruction-following capability.

Prior research has attempted to understand instruction following from two perspectives: 1) prompting-based; 2) activation-space-based. Among prompting-based studies, the importance of instruction positions has been thoroughly studied (Liu et al., 2024; Ma et al., 2024). For activation-based studies, Stolfo et al. (2024) propose to manipulate model instruction following with a representation vector in the residual stream. However, both methods ultimately fail to explain the inner workings of how LLMs follow instructions in a fine-grained manner, i.e., at the concept level. Specifically, prompting-based approaches provide insights into better prompt formulation strategies to improve instruction following, while activation-space-based methods provide a possible way to implement steering with instruction following rather than explaining how it works.

In this paper, we propose a novel framework, SAIF (Sparse Autoencoder steering for Instruction Following), to understand the working mechanisms of instruction following at the concept level through the lens of sparse autoencoders (SAEs). First, we develop a robust method to sample instruction-relevant features. Then, we select influential features using designed metrics and further compute steering vectors (see Figure 1a). Furthermore, we measure the effectiveness of these steering vectors through steering tasks (see Figure 1b). Additionally, we examine the extracted features using Neuronpedia (Lin, 2023) to illustrate how semantically relevant the activating text of features is to instructions. We also measure steering performance to demonstrate the effectiveness of extracted features.
Through these tools, we gain some intriguing insights regarding the importance of the number of features used in representing instructions, the role of the last layer, and the impact of instruction position and model scale. Our main contributions in this work can be summarized as follows:

• We propose SAIF, a framework that interprets instruction following in LLMs at a fine-grained conceptual level. Our analysis reveals how models internally encode and process instructions through interpretable latent features in their representation space.

• We demonstrate that instructions cannot be adequately represented by a single concept in SAEs, but rather comprise multiple high-level concepts. Effective instruction steering requires a set of instruction-relevant features, which our method precisely identifies.

• We reveal the critical role of the last layer in SAE-based activation steering. Moreover, the effectiveness of our framework has been demonstrated across instruction types and model scales.

[Figure 1: The proposed SAIF framework. The model computes steering vectors from SAE latent differences to guide outputs according to instructions. (a) Extract steering vector. (b) Apply steering for controlled output.]
2 Preliminaries

Sparse Autoencoders (SAEs). Dictionary learning enables disentangling representations into a set of concepts (Olshausen and Field, 1997; Bricken et al., 2023). SAEs are employed to decompose hidden representations into a high-dimensional space and then reconstruct the hidden representations. Specifically, the input of SAEs is the hidden representation from a model's residual stream, denoted as z ∈ R^d, and the reconstructed output is denoted as SAE(z) ∈ R^d; we obtain z = SAE(z) + ε, where ε is the error. In our paper, we focus on layer-wise SAEs trained with an encoder W_enc ∈ R^{d×m} followed by a non-linear activation function, and a decoder W_dec ∈ R^{m×d} (He et al., 2024). The definition of SAEs is:

a(z) = σ(z W_enc + b_enc),  (1)
SAE(z) = a(z) W_dec + b_dec,  (2)

where b_enc ∈ R^m and b_dec ∈ R^d are the bias terms. The decomposed high-dimensional latent activations a(z) have dimension m, with m ≫ d, and form a highly sparse vector. Note that different SAEs use different non-linear activation functions σ. For example, Llama Scope (He et al., 2024) adopts TopK-ReLU, while Gemma Scope (Lieberum et al., 2024) uses JumpReLU (Rajamanoharan et al., 2024).

Steering with SAE Latents. Following Eq. (2), the reconstructed SAE outputs are a linear combination of SAE latents, which are the row vectors of the SAE decoder W_dec. The weight of the j-th SAE latent is a(z)_j. Typically, a prominent dimension j ∈ {1, …, m} is chosen, and its decoder latent vector d_j is scaled by a factor α and then added to the SAE outputs (Ferrando et al., 2025). The computation is as follows:

z_new ← z + α d_j.  (3)

This modified representation z_new can then be fed back into the model's residual stream to steer the model's behavior during generation.

3 Proposed Method

In this section, we introduce SAIF, a framework for analyzing and steering instruction following in LLMs.
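As a sketch of how Eqs. (1)-(3) fit together, the following minimal NumPy example encodes a residual-stream vector, reconstructs it, and applies single-latent steering. The shapes, random weights, and the plain ReLU standing in for σ are all illustrative assumptions, not the actual trained SAEs used in the paper.

```python
import numpy as np

# Minimal SAE sketch. Shapes, weights, and the plain-ReLU sigma are
# illustrative; real SAEs (e.g. Gemma Scope) use JumpReLU or TopK-ReLU.
rng = np.random.default_rng(0)
d, m = 16, 256                             # residual dim d, latent dim m >> d

W_enc = rng.normal(0, 0.1, size=(d, m))    # encoder, R^{d x m}
b_enc = np.zeros(m)
W_dec = rng.normal(0, 0.1, size=(m, d))    # decoder, R^{m x d}; rows are SAE latents
b_dec = np.zeros(d)

def sae_latents(z):
    # Eq. (1): a(z) = sigma(z W_enc + b_enc)
    return np.maximum(z @ W_enc + b_enc, 0.0)

def sae_reconstruct(z):
    # Eq. (2): SAE(z) = a(z) W_dec + b_dec
    return sae_latents(z) @ W_dec + b_dec

def steer_single_latent(z, j, alpha):
    # Eq. (3): z_new <- z + alpha * d_j, with d_j the j-th decoder row
    return z + alpha * W_dec[j]

z = rng.normal(size=d)                     # residual-stream representation
a = sae_latents(z)                         # sparse, non-negative latents
z_new = steer_single_latent(z, j=7, alpha=3.0)
```

Note that the single-latent update of Eq. (3) is exactly what Section 3.3 later argues is insufficient for instruction following, which motivates steering with a set of latents instead.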
First, we introduce linguistic variations to construct diverse instruction sentences and related datasets, which are further used to compute SAE latent activations. Second, we develop a two-stage process for computing steering vectors that quantifies the sensitivity of features to instruction presence. Finally, we investigate how these identified features can be leveraged for steering model behavior, demonstrating a technique for enhancing instruction following while preserving output coherence (see Figure 1).

3.1 Format Instruction Feature

To identify instruction-relevant features given an instruction, we construct a dataset D with N positive-negative sample pairs. For example, we focus on the instruction "Translate the sentence to French." In a sample pair, the positive sample refers to a prefix prompt followed by the instruction, while the negative sample refers to the prefix prompt without the instruction sentence.

The difference-in-means method (Rimsky et al., 2024) is a typical approach to derive concept vectors. It computes the activation differences between each sample pair over the last token, and then averages over all pairs of activation difference vectors. However, directly applying this pipeline to instruction following presents a significant challenge. When a single instruction sentence is used repeatedly to generate samples, the model tends to encode the specific semantic meaning of that instruction rather than learning a general-purpose vector that can reliably execute the intended operation (see Appendix G). Specifically, the derived vector can barely operate the same instruction if we rephrase the instruction in a linguistically different but semantically similar manner. To resolve this challenge, we propose to introduce linguistic variations to extract instruction functions. We formulate instruction sentences for a given instruction through different strategies.
These variations include syntactic reformulations (e.g., imperative to interrogative form, task-oriented to process-based description) and cross-lingual translations (e.g., English, Chinese, German). In this way, we generate six diverse instruction sentences comprehensively capturing the key features of an instruction. The instruction design used in our paper is shown in Appendix A.

For each instruction variant, we extract samples' residual stream representations and compute the corresponding SAE latent activations. While diverse linguistic information is contained, the latent features specifically corresponding to the core instructional concept should maintain relatively consistent activation levels across all variants. These dimensions with consistent activation patterns will be further used to construct instruction vectors.

3.2 Steering Vector Computation

Based on the SAE latent activations computed in Section 3.1, we develop a two-step process for computing steering vectors. The first step identifies features that consistently respond to a given instruction, while the second step quantifies their sensitivity.

Given N input samples and a target instruction type (e.g., translation), we first obtain both positive samples (with instruction) and negative samples (without instruction) for each input. For each sample pair i and feature j, we compute the activation state change:

Δh_{i,j} = 1(h^w_{i,j} > 0) − 1(h^{w/o}_{i,j} > 0),  (4)

where h^w_{i,j} and h^{w/o}_{i,j} represent the SAE latent activation values with and without the instruction respectively, and 1(·) is the indicator function. Δh_{i,j} captures whether feature j becomes activated in response to the instruction for sample i. We then compute a sensitivity score C_j for each feature:

C_j = (1/N) Σ_{i=1}^{N} 1(Δh_{i,j} > 0).  (5)

The score represents the proportion of samples whose feature j becomes activated in response to instructions. Features with higher scores are more consistently responsive to instruction presence.
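Eqs. (4)-(5) vectorize naturally. A small NumPy sketch, using randomly generated activations as stand-ins for the real SAE latents of the N sample pairs (all shapes and values here are illustrative):

```python
import numpy as np

# Random stand-ins for SAE latent activations of N positive/negative
# sample pairs over m features (illustrative only).
rng = np.random.default_rng(1)
N, m = 8, 100
h_w = np.maximum(rng.normal(0.5, 1.0, size=(N, m)), 0.0)    # with instruction
h_wo = np.maximum(rng.normal(-0.5, 1.0, size=(N, m)), 0.0)  # without instruction

# Eq. (4): activation state change per sample pair i and feature j.
delta_h = (h_w > 0).astype(int) - (h_wo > 0).astype(int)

# Eq. (5): sensitivity score C_j = fraction of pairs where feature j
# switches on in response to the instruction.
C = (delta_h > 0).mean(axis=0)
```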
By sorting these sensitivity scores in descending order, we select the top-k responsive features. These selected features form the instruction-relevant feature set V = {W_{dec,j} | rank(C_j) ≤ k}, where W_{dec,j} = W_dec[j, :] denotes the j-th SAE latent. These features will be used for further constructing steering vectors.

3.3 Steering Procedure

Different from the classic steering approach defined in Eq. (3), we hypothesize that instruction-following steering requires a set of features to be effective. The individual feature utilized in the classic method focuses on token-level concepts, where individual concepts typically correlate with a few SAE latent activations. As a result, this approach can barely operate instructions. This is partly due to the complexity of sentence-level instructions, which are composed of multiple high-level features represented by a set of SAE latent features. Additionally, SAEs tend to overly split features, which further increases the number of features needed for steering (Ferrando et al., 2025). Thus, we propose to determine how to steer with a set of vectors.

Algorithm 1: The proposed SAIF framework
Input: input text x; target instruction type (e.g., translation, summarization)
Stage 1: Format Instruction Feature
  Generate diverse instruction variants
  Construct dataset D with N positive/negative pairs
Stage 2: Compute Steering Vector
  for each sample pair i and feature j do
    Compute activation state change: Δh_{i,j} = 1(h^w_{i,j} > 0) − 1(h^{w/o}_{i,j} > 0)
    Calculate sensitivity score: C_j = (1/N) Σ_{i=1}^{N} 1(Δh_{i,j} > 0)
  Sort features by sensitivity scores C_j
  Select top-k features as instruction-relevant set V
Stage 3: Steering Procedure
  Obtain residual stream representation z of input x
  for each feature i ∈ V do
    Compute activation strength: α_i = μ_i + β s_i, where μ_i is the mean activation and s_i the std deviation
  Apply steering: z_new = z + Σ_{i=1}^{k} α_i v_i
Output: steered text following the instruction
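Stages 2-3 of Algorithm 1 — ranking by C_j, taking the top-k decoder rows as V, and adding the α-weighted sum back to the residual stream — might be sketched as follows. All tensors are random stand-ins: in practice C comes from Eq. (5), W_dec from the pre-trained SAE, and μ/s from activation statistics on instruction-following examples.

```python
import numpy as np

# Random stand-ins for quantities computed earlier in the pipeline.
rng = np.random.default_rng(2)
d, m, k, beta = 16, 100, 15, 0.0

C = rng.random(m)                    # sensitivity scores (Eq. 5)
W_dec = rng.normal(size=(m, d))      # decoder rows = SAE latents
mu = rng.random(m)                   # mean activation per feature
s = rng.random(m)                    # std of those activations

# Select the top-k responsive features: V = {W_dec[j] : rank(C_j) <= k}.
top = np.argsort(C)[::-1][:k]
V = W_dec[top]

# Steering strengths alpha_i = mu_i + beta * s_i, then
# z_new = z + sum_i alpha_i v_i added to the residual stream.
alpha = mu[top] + beta * s[top]
z = rng.normal(size=d)
z_new = z + (alpha[:, None] * V).sum(axis=0)
```

With the paper's default β = 0, each α_i reduces to the mean activation μ_i observed on instruction-following examples.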
Building on top of the feature set V derived in Section 3.2, we employ the set of features to steer the residual stream representation of a certain input at layer l. Our steering is implemented as below:

z_new = z + Σ_{i=1}^{k} α_i v_i,  (6)

where z represents the residual stream representation of the input over the last token, and α_i denotes the steering strength of feature i. Here, v_i represents a certain instruction-relevant feature in V. As the strength of each selected feature is crucial to steering performance, we further compute the strength of each feature by employing statistical measurements of feature activation values to make it more robust and reliable. The activation strength for feature i is calculated as:

α_i = μ_i + β s_i,  (7)

where μ_i is the mean activation value of feature i observed in instruction-following examples, s_i is the standard deviation of these activation values, and β is a hyperparameter that scales s_i while controlling the strength value.

4 Experiments

In this section, we conduct experiments to evaluate the effectiveness of SAIF by answering the following research questions (RQs):

• RQ1: How interpretable are the features extracted using SAEs, and do they correspond to instruction-related concepts? (Section 4.2)
• RQ2: Can the proposed SAIF framework effectively control model behavior? (Section 4.3)
• RQ3: What role does the final Transformer layer play in instruction following? (Section 4.4)
• RQ4: How does instruction positioning affect the effectiveness of instruction following and feature activation patterns? (Section 4.5)

4.1 Experimental Setup

Datasets and Models. Our experiments are conducted with multiple language models, including Gemma-2-2b, Gemma-2-9b (Team et al., 2024) and Llama3.1-8b. The Cross-lingual Natural Language Inference (XNLI) dataset (Conneau et al., 2018) is used to construct input samples.
It encompasses diverse languages (including English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu) and rich syntactic structures (such as active/passive voice alternations, negation patterns, and various clause structures). The diverse linguistic patterns within the dataset are essential for constructing a comprehensive set of samples for an instruction. Moreover, it ensures extracting consistent SAE activations from the residual stream of input samples.

Instruction Design. Following the settings in IFEval (Zhou et al., 2023), we investigate three types of instructions: keyword inclusion, summarization, and translation. For keyword inclusion, we provide models with a keyword (e.g., "Sunday"), and expect model output incorporating the specified keyword. For formatting, we instruct the model to perform summarization, where the ideal output should be concise, maintain the key information from the original text, and follow a consistent format with a clear topic sentence followed by supporting details. For translation, we direct the model to translate sentences into different languages (English, French, and Chinese), where the ideal model output should accurately perform the requested translation while preserving the original meaning. The complete set of instructions used for each task is provided in Appendix A.

Implementation Details. We use pre-trained SAEs from Gemma Scope (Lieberum et al., 2024) and Llama Scope (He et al., 2024). When constructing input samples for each instruction, we set the number of positive/negative samples N to 800.
For SAE latent extraction, we use sparse autoencoders with dimensions of 65K and 131K for the Gemma-2-2b-it [1] and Gemma-2-9b-it [2] models respectively. We also use an SAE with dimension 32K for Llama3.1-8b [3]. All experiments were run on 1 NVIDIA A100 GPU. As default settings, for Equation (6), we fix k = 15, meaning that we use the top 15 most responsive SAE features for instruction steering. The strategy to choose the optimal k will be further discussed in Section 4.2. For Equation (7), we fix the hyperparameter β = 0, and we discuss the impact of adjusting this hyperparameter on the steering effect in Appendix C.

[Figure 2: Comparison of feature activation patterns between pre-instruction and post-instruction conditions across different SAE latent dimensions. The plots show three key metrics: activation strength (left), feature stability (middle), and activation probability (right) for eight identified instruction-relevant features.]

SAE Latent Activation Metrics. We consider the following three metrics to quantify features' behavior and reliability in instruction processing. Note that we only consider features activated on positive samples but not negative samples.

• Activation Strength: the mean activation value, calculated as μ_i = (1/|A_i|) Σ_{a∈A_i} a, where A_i is the set of non-zero activation values for feature i.
• Activation Probability: the probability that feature i is activated across positive/negative samples: P_i = |A_i| / N, where N is the total number of positive/negative samples.
• Activation Stability: the normalized standard deviation of non-zero activation values: Ω_i = 1/s_i.

A high-quality instruction-relevant feature should ideally exhibit strong activation (μ_i), consistent triggering (P_i), and stable behavior (Ω_i) across different formulations of the same instruction.

Steering Effectiveness Metrics. We evaluate steering outputs with two metrics: 1) Strict Accuracy, which measures the proportion of cases where the model completely follows the instruction, meaning it both understands and produces output exactly as instructed; and 2) Loose Accuracy, which measures the proportion of cases where the model partially follows the instruction, meaning it understands the instruction but the output does not fully conform to the requirements. Note that we use GPT-4o-mini to rate the responses; please refer to the details in Appendix D.

[1] https://huggingface.co/google/gemma-2-2b-it
[2] https://huggingface.co/google/gemma-2-9b-it
[3] https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

4.2 Analysis of Instruction-Related Concepts

To investigate RQ1, we analyze the interpretability of features extracted using SAEs and assess their correspondence to instruction-related concepts. Our analysis consists of two parts. First, we examine the activating text of extracted features with Neuronpedia (Lin, 2023) to evaluate their semantic relevance to instructions. Second, we compare how strongly the activating examples of top-k features and lower-ranked features correspond to instruction-related concepts, demonstrating the relationship between feature importance and instruction relevance.

We focus on analyzing the consistent instruction-relevant latent activations through the lens of Neuronpedia (Lin, 2023), which provides detailed activated text for each SAE latent.
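Before turning to the examples, a quick numeric check of the three activation metrics from Section 4.1 (μ_i, P_i, Ω_i) on a toy activation vector; the values for this single feature across N = 8 positive samples are invented for illustration:

```python
import numpy as np

# Invented activation values of one feature across N = 8 positive samples.
acts = np.array([0.0, 2.0, 3.0, 0.0, 4.0, 3.0, 0.0, 2.0])
N = len(acts)

A = acts[acts > 0]       # A_i: the non-zero activation values of feature i
mu = A.mean()            # Activation Strength: mu_i = (1/|A_i|) * sum(A_i)
P = len(A) / N           # Activation Probability: P_i = |A_i| / N
omega = 1.0 / A.std()    # Activation Stability: Omega_i = 1 / s_i
```

Here mu = 2.8 and P = 0.625; a feature that fires often, strongly, and with low spread scores high on all three.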
Taking translation-related instructions as an example (e.g., "Translate the sentence to French."), we identify a notable latent that shows strong activation patterns. This latent exhibits high activation not only for various languages but also for directional prepositions like "to" and "from" that commonly appear in translation instructions, as shown in Table 1.

Table 1: Maximally activating examples for Feature 15425 in Layer 25 of Gemma2-2b-it when prompted with "Translate the sentence to French." Data sourced from Neuronpedia (Lin, 2023).
- The Theory of Superconductivity (1958) (translated from Russian: Consultants Bureau, Inc., New York).
- Save your game, go back to change the PS3 system language settings to English.
- We have posted a partial translation of his speech from Yiddish to Hebrew, which was posted in...
- I can speak English, but I'm afraid it may be worse than your French.

Table 2: Layer 25 experimental results (top 3 tokens per feature).
F15453 (k=1): translation, Translation, translators
F33659 (k=2): French, France, french
F65085 (k=3): language, Speaking, languages
F2369 (k=13): bienfaits, attentes, prochaines
F58810 (k=14): here, Here, Below
F21836 (k=15): NameInMap, CloseOperation, Jspwriter

Table 3: Performance of instruction positions, including pre-instruction and post-instruction.
Position · Strict Acc · Loose Acc · Original
Pre-Instruction · 0.14 · 0.47 · 0.56
Post-Instruction · 0.23 · 0.64 · 0.75

[Figure 3: The impact of the number of latent dimensions (k) on our steering experiments. The x-axis represents different values of k, while the y-axis records the accuracy. We track the trend of strict accuracy (SA) and loose accuracy (LA) across 8 different k values.]

We summarize two key findings below:

• Our extracted SAE latent features show strong correspondence with instruction-related concepts, as demonstrated in Table 1.
The extracted features consistently activate on instruction-relevant terms (e.g., "translate", "French") and related linguistic elements.

• The activating examples of our extracted top-k features reveal a clear relevance pattern: they correspond directly to core instruction elements (e.g., task commands, target specifications), while those of lower-ranked features show decreasing relevance to instruction-relevant terms, capturing more peripheral or contextual information. The result is shown in Table 2. Taking Layer 25 as an example, for the top-13th feature, the top 3 tokens are French words, but for the top-14th and 15th features, the top 3 tokens seem irrelevant to the instruction.

4.3 Steering Performance Analysis

In this section, we evaluate the effectiveness of steering vectors constructed from SAE features and investigate the optimal number of features needed for reliable control.

Steering Effectiveness. We visualize a case study in Figure 4 and compare the performance of steering results in Figure 5, including both strict accuracy and loose accuracy. Our analysis reveals several key findings:

• The quantitative results in Figure 5 demonstrate significant improvements in instruction following, with the steered models achieving over 30% strict accuracy across different tasks. The loose accuracy of our steered approach performs nearly on par with prompting-based instruction methods, falling only slightly below. These results strongly indicate that SAIF can effectively extract features for user instructions and adjust LLMs' behaviors according to relevant instructions.

• The case study in Figure 4 illustrates two distinct scenarios of instruction following: strict adherence (successful Chinese-to-French translation) and loose following (understanding that this is a French translation task). It demonstrates how SAIF manipulates model responses from the failure case toward either strict instruction following or loose instruction following.
• The Gemma-2-9b-it model consistently outperforms Gemma-2-2b-it with slightly higher instruction steering performance across all five tasks, suggesting that SAIF's effectiveness scales well with model size.

• The LLaMA-3.1-8B model shows comparable performance to the Gemma models across tasks. Looking at French translation as an example, LLaMA-3.1-8B achieves around 30% strict accuracy and 65% loose accuracy, which is similar to Gemma-2-2b-it's performance.

[Figure 4: Examples of French translation task outcomes showing strict instruction following and loose instruction following using inputs in different languages. (Gemma-2-2b-it, SAE dimension of 65K)]

Latent Dimension Analysis. We study the effect of a single latent and of the number of latents on steering, showing that too few and too many dimensions both lead to failures. For an individual latent, we use the single top-1 latent and the latent listed in Table 1 for steering. Despite their apparent semantic relevance to translation tasks, the model shows zero accuracy. This suggests that instruction following cannot be captured by a single high-level concept, even when that concept appears highly correlated with specific instruction types.

This observation leads us to investigate whether a combination of multiple latent dimensions could achieve better steering performance. Our experiments, shown in Figure 3, systematically evaluate the impact of varying the number of latent dimensions from 1 to 30. The instructions used here are sourced from the French translation task.
The results reveal several key patterns:

• Steering performance remains near zero when k ≤ 5, indicating that too few dimensions are insufficient for capturing instruction-following behavior. Performance begins to improve notably around k = 10, with both strict accuracy and loose accuracy showing substantial increases.

• The optimal performance is achieved at k = 15, where loose accuracy peaks at approximately 0.7 and strict accuracy reaches about 0.25.

• However, as we increase dimensions beyond k = 15, both metrics show a consistent decline. This deterioration becomes more pronounced as k approaches 30, suggesting that excessive dimensions introduce noise that interferes with effective steering.

4.4 The Role of Last Layer Representations in Instruction Processing

In previous sections, we exclusively used the SAE from the last Transformer layer for concept vector extraction and instruction steering. In this section, we analyze why extracting concepts and steering from the final layer is most effective.

Concept Extraction Perspective. From the results in Table 4, we observe an intriguing phenomenon: shallower layers are less effective in providing clean instruction-relevant features. Following our default experimental settings, we extract the top 15 SAE features from each layer of the model. The features extracted from the last layer can precisely capture the semantics of 'French', showing strong activations on French-related words, where k = 2 indicates this feature is considered the second most instruction-relevant feature. Starting from the penultimate layer, as we attempt to trace French-related features, our experimental results reveal that the extracted French-related concepts undergo a gradual shift as the layer depth decreases.
Specifically, the feature evolves from exclusively activating on French-related tokens to encompassing a broader spectrum of languages (English, Spanish, Hindi, and Belgian), demonstrating a hierarchical abstraction pattern from language-specific to cross-lingual representations. Moreover, the increasing k values suggest that these French-related features become less instruction-relevant in earlier layers. For the Gemma-2-2b-it model, before Layer 21 we can no longer identify French-related features among the top 15 SAE features.

Steering Perspective. We conducted steering experiments using the top 15 features extracted from Layers 21-25 respectively, under default settings, on the French translation task.

Figure 5: Performance comparison between original model outputs and two steering approaches (Steered-SA, Steered-LA) across different instruction types on Gemma-2-2b-it, Gemma-2-9b-it, and LLaMA-3.1-8B models. Results show the accuracy percentages for translation tasks (French, Chinese, English), keyword inclusion, and summarization tasks.

Table 4: Analysis of Layer Features

Layer | Top 5 tokens with the highest logit increases by the feature influence | top_k | Feature
25 | French, France, french, FRENCH, Paris | 2 | 33659
24 | French, nb, french, Erreur, Fonction | 8 | 65238
23 | French, France, french, Paris, Francis | 15 | 49043
22 | English, english, Spanish, French, Hindi | 12 | 351
21 | Belgian, Belgium, Brussels, Flemish, Belgique | 14 | 27665

The results align with our findings on concept extraction, showing the effectiveness and importance of the last layer representation for instruction following.
Using loose accuracy as the evaluation metric, we observe that steering with Layer 24 features still maintains some effectiveness, though the loose accuracy drops sharply from 0.64 (Layer 25) to 0.33. Steering attempts using features from earlier layers fail to guide the model towards instruction-following behavior, with the model instead tending to generate repetitive and instruction-irrelevant content.

4.5 Impact of Instruction Position

Previous studies have shown that models' instruction-following capabilities can vary significantly depending on the relative positioning of instructions and content. This motivates us to examine how instruction positioning affects the activation patterns of previously identified features. We investigate the effect of instruction position by comparing two patterns: pre-instruction (P_pre = [Instruction] + [Content]) and post-instruction (P_post = [Content] + [Instruction]), as in Liu et al. (2024). Using identical instruction-content pairs while varying only their relative positions allows us to isolate the effects of position. Our analysis reveals several key findings from both the quantitative metrics (see Table 3) and the feature activation patterns (see Figure 2):
• Performance metrics demonstrate that post-instruction positioning consistently outperforms pre-instruction, with post-instruction achieving higher accuracy across all measures (Strict Acc: 0.23 vs 0.14, Loose Acc: 0.64 vs 0.47), aligning with the results in Liu et al. (2024).
• Feature activation patterns show that post-instruction enables more robust processing, with stronger activation peaks (particularly for key features such as F33659), more consistent stability scores, and higher activation probabilities (>80%) across most features, compared to pre-instruction's more variable patterns.
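The two prompt patterns compared above can be sketched as plain string composition; the newline separator is an assumption, since the paper only fixes the relative order of instruction and content.

```python
def build_prompt(instruction: str, content: str, position: str = "post") -> str:
    """P_pre = [Instruction] + [Content]; P_post = [Content] + [Instruction].

    The newline separator is an assumption for illustration.
    """
    if position == "pre":
        return f"{instruction}\n{content}"
    return f"{content}\n{instruction}"

instr = "Translate the sentence to French."
text = "They got divorced because the wife found out that the husband had betrayed her."
p_pre = build_prompt(instr, text, "pre")
p_post = build_prompt(instr, text, "post")
```

Holding the instruction-content pair fixed while toggling `position` isolates the positional effect, as in the comparison above.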
5 Conclusions

In this paper, we have introduced the use of SAEs to analyze instruction following in LLMs, revealing the underlying mechanisms through which models encode and process instructions. Our analysis demonstrates that instruction following is mediated by interpretable latent features in the model's representation space. We have developed a lightweight steering technique that enhances instruction following by making targeted modifications to specific latent dimensions. We find that effective steering requires the careful combination of multiple latent features with precisely calibrated weights. Extensive experiments across diverse instruction types have demonstrated that our proposed steering approach enables precise control over model behavior while consistently maintaining coherent outputs.

Limitations

One limitation of our steering approach is that it sometimes produces outputs that only partially follow the intended instructions, particularly when handling complex tasks. While the model may understand the general intent of the instruction, the generated outputs may not fully satisfy all aspects of the requested task. For example, in translation tasks, the model might incorporate some elements of the target language but fail to produce a complete and accurate translation. Besides, our current work focuses primarily on simple, single-task instructions such as translation or summarization. In the future, we plan to investigate how to extend this approach to handle more sophisticated instruction types, such as multi-step reasoning tasks or instructions that combine multiple objectives. Additionally, our experiments were conducted using models from the Gemma and Llama LLM families. In the future, we plan to extend this analysis to a more diverse set of language model architectures and families to validate the generality of our findings.
References

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.

Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. 2023. Towards monosemanticity: Decomposing language models with dictionary learning.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research (JMLR), 25(70):1–53.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Javier Ferrando, Oscar Balcells Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. Do I know this entity? Knowledge awareness and hallucinations in language models. In The Thirteenth International Conference on Learning Representations.
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2024. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, et al. 2024. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders. arXiv preprint arXiv:2410.20526.

John Hewitt, Nelson F Liu, Percy Liang, and Christopher D Manning. 2024. Instruction following without instruction tuning. arXiv preprint arXiv:2409.14254.

Ole Jorgensen, Dylan Cope, Nandi Schoots, and Murray Shanahan. 2024. Improving activation steering in language models with mean-centring. In Responsible Language Models Workshop at AAAI-24.

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR.

Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, and Neel Nanda. 2024. Interpreting attention layer outputs with sparse autoencoders. In ICML 2024 Workshop on Mechanistic Interpretability.

Po-Nien Kung and Nanyun Peng. 2023. Do models really learn to follow instructions? An empirical study of instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1317–1328.

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024a. Measuring and controlling persona drift in language model dialogs. arXiv preprint arXiv:2402.10962.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024b.
Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147.

Johnny Lin. 2023. Neuronpedia: Interactive reference and tooling for analyzing neural networks. Software available from neuronpedia.org.

Yijin Liu, Xianfeng Zeng, Fandong Meng, and Jie Zhou. 2024. Instruction position matters in sequence generation with large language models. In Findings of the Association for Computational Linguistics: ACL 2024.

Wanqin Ma, Chenyang Yang, and Christian Kästner. 2024. (Why) is my prompt getting worse? Rethinking regression testing for evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, CAIN '24, pages 166–171, New York, NY, USA. Association for Computing Machinery.

Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In Conference on Language Modeling.

Bruno A. Olshausen and David J. Field. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744.

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. 2024. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. Preprint, arXiv:2407.14435.
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Lee Sharkey, Dan Braun, and Beren Millidge. 2022. Taking features out of superposition with sparse autoencoders.

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. 2024. Improving instruction-following in language models through activation steering. arXiv preprint arXiv:2410.12877.

Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, and Xuezhe Ma. 2023. Evaluating large language models on controlled generation tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3155–3168, Singapore. Association for Computational Linguistics.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. 2023. Linear representations of sentiment in large language models. Preprint, arXiv:2310.15154.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR).

Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, and Mengnan Du. 2025.
Beyond single concept vector: Modeling concept subspace in LLMs with Gaussian distribution. In The Thirteenth International Conference on Learning Representations.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. 2024. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36.

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. 2023. Representation engineering: A top-down approach to AI transparency. Preprint, arXiv:2310.01405.

A Details of Instructions

Include Keywords:
- Include keyword keyword in your response.
- In your response, the keyword keyword should appear.
- Answer with keyword keyword in your response.
- Fügen Sie das Schlüsselwort keyword in Ihre Antwort ein.
- 返信の中にキーワードkeywordを含めてください。

Extraction:
- Provide a summary of this text.
- Summarize this sentence.
- Give me the bullet points of the given content.
- Fassen Sie diesen Text zusammen.
- このテキストの要約を提供してください。

Translation (French):
- Translate the sentence to French.
- Provide a French translation of the given text.
- Convert this sentence to French.
- Übersetzen Sie den Satz ins Französische.
- この文を仏語に翻訳してください。

Translation (English):
- Translate the sentence to English.
- Provide an English translation of the given text.
- Convert this sentence to English.
- Übersetzen Sie den Satz ins Englische.
- この文を英語に翻訳してください。

Translation (Chinese):
- Translate the sentence to Chinese.
- Provide a Chinese translation of the given text.
- Convert this sentence to Chinese.
- Übersetzen Sie den Satz ins Chinesische.
- この文を中国語に翻訳してください。

B Related Work

In this section, we briefly summarize several research directions that are most relevant to ours.

Instruction Following in Language Models. Instruction following capabilities are crucial for improving LLM performance and ensuring safe deployment. Recent advances in instruction tuning have demonstrated significant progress through various methods (Ouyang et al., 2022; Sanh et al., 2022; Wei et al., 2022; Chung et al., 2024). However, capable models still struggle with hard-constrained tasks (Sun et al., 2023) and lengthy generations (Li et al., 2024a). Some studies find that instruction following can be improved with in-context few-shot examples (Kung and Peng, 2023), optimal instruction positions (Liu et al., 2024), carefully selected instruction-response pairs with fine-tuning (Zhou et al., 2024), and adaptations (Hewitt et al., 2024). Unfortunately, the mechanistic understanding of how LLMs internally represent and process these instructions remains limited.

Language Model Representations. A body of research has focused on studying the linear representation of concepts in representation space (Kim et al., 2018). The basic idea is to find a direction in the space that represents the related concept. This can be achieved using a dataset with positive and negative samples relevant to the concepts.

Figure 6: Visualization of steering vectors extracted from LLaMA-3.1-8B and Gemma-2-9B for the French translation task. The y-axis denotes the ratio between the standard deviation and the mean of feature activation strengths.
Existing approaches for computing concept vectors include probing classifiers (Belinkov, 2022), mean difference (Rimsky et al., 2024; Zou et al., 2023), mean centering (Jorgensen et al., 2024), and the Gaussian concept subspace (Zhao et al., 2025), which together provide a rich set of tools for deriving concept vectors. The derived concept vectors represent various high-level concepts such as honesty (Li et al., 2024b), truthfulness (Tigges et al., 2023), harmfulness (Zou et al., 2023), and sentiment (Zhao et al., 2025).

Sparse Autoencoders. Dictionary learning is effective at disentangling features in superposition within representation space. A sparse autoencoder (SAE) offers a feasible way to map representations into a higher-dimensional space and reconstruct them back into the representation space. Various SAEs have been proposed to improve performance, such as vanilla SAEs (Sharkey et al., 2022) and TopK SAEs (Gao et al., 2024). Based on these, a range of sparse autoencoders (SAEs) have been trained to interpret hidden representations, including Gemma Scope (Lieberum et al., 2024) and Llama Scope (He et al., 2024). These SAEs have also been used to interpret models' representational outputs (Kissane et al., 2024) and understand their abilities (Ferrando et al., 2025).

Activation Steering. Recently, a body of research has utilized concept vectors to steer model behaviors during inference. Specifically, concept vectors can be computed with diverse approaches, and these vectors are mostly effective at steering models to generate concept-relevant text. For instance, many studies find them useful in improving truthfulness (Marks and Tegmark, 2024) and safety (Arditi et al., 2024), and in mitigating sycophancy and biases (Zou et al., 2023). Steering primarily operates in the residual stream following the method defined in Eq. (3), but it is worth noting that steering vectors can be computed from either residual stream representations or SAEs.
Existing work mostly concentrates on computing with residual stream representations, which provides limited insight into which finer features contribute to the high-level concept vector. This coarse approach could further limit our deeper understanding of more complicated vectors such as instructions. In our work, we aim to bridge this gap by studying instruction vectors with SAEs to uncover their working mechanism.

C Additional Results for Llama-3.1-8b

In our experimental setup, we employ Equation (7) to control feature activation during model steering, where μ_i denotes the pre-computed mean activation strength and s_i represents the standard deviation for feature i. The hyperparameter β controls the perturbation magnitude relative to the standard deviation. Our experiments reveal distinct robustness characteristics across different model architectures. For the Gemma-2 family models, the steering vectors maintain their effectiveness when β ∈ [−1, 1], indicating robust feature representations. These models exhibit high activation strength values (μ_i) with low standard deviations (s_i), suggesting stable and consistent feature characteristics.

Table 5: Evaluation Prompt for Generated Output

Your task is to strictly evaluate whether the generated output follows the given instruction. First you should review the following components:
Original Input: input_text
Instruction: instruction
Generated Output: generated_output
Here is the evaluation criteria:
A: The generated content completely follows the instruction.
B: Contains instruction keywords but doesn't follow the instruction completely.
C: Completely irrelevant to the instruction.
Critical. Remember: If the Generated Output only contains repeated words or sentences, select C immediately. DO NOT provide explanation.
Provide your evaluation by selecting one option (A/B/C). Your Answer is:

In contrast, the Llama-3.1-8b model demonstrates higher sensitivity to activation perturbations.
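The activation control described here can be sketched as below. The exact functional form (mean activation strength plus β standard deviations) is an assumption reconstructed from the text's description of Equation (7), and all numbers are illustrative only.

```python
def perturbed_strength(mu: float, s: float, beta: float) -> float:
    """Assumed form of the Eq. (7) control: perturb the pre-computed mean
    activation strength mu of a feature by beta standard deviations s.
    This form is an illustrative assumption, not the paper's exact equation."""
    return mu + beta * s

# Illustrative sweep: Gemma-2 reportedly tolerates beta in [-1, 1],
# while Llama-3.1-8B only tolerates [-0.1, 0.1] (mu and s values
# below are made up for demonstration).
gemma_sweep = [perturbed_strength(3.0, 0.2, b) for b in (-1.0, 0.0, 1.0)]
llama_sweep = [perturbed_strength(3.0, 0.9, b) for b in (-0.1, 0.0, 0.1)]
```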
The steering vectors remain effective only when β ∈ [−0.1, 0.1], indicating a significantly narrower tolerance range. The relative standard deviations illustrated in Figure 6 quantify this distinction. This narrow tolerance range suggests that Llama-3.1-8b's feature space may possess the following characteristics: stricter boundaries between features, more discrete transitions between different instruction states, and poorer robustness to noise.

D Steering Accuracy Evaluation based on GPT-4o-mini

To evaluate generated outputs, we instruct GPT-4o-mini to rate them in the following way. For each instance, we provide GPT-4o-mini with three components: the original input text, the instruction, and the model-generated output. To ensure reliable assessment, we implement a voting mechanism in which GPT-4o-mini performs five independent evaluations for each instance. For each evaluation, GPT-4o-mini is prompted to assess the instruction-following level by selecting whether the generated content completely follows the instruction (A), contains instruction keywords but doesn't follow the instruction (B), or is completely irrelevant to the instruction (C). The final grade is determined by majority voting among the five evaluations. In cases where there is no clear majority (e.g., when votes are split 2-2-1), we choose the lower grade between the two options that received the most votes (C is considered lower than B, and B is lower than A). This ensures a stringent evaluation standard when the votes are divided. Thus, the Strict Accuracy is the ratio of A and the Loose Accuracy is the ratio of A + B. The prompt we use in the experiments can be found in Table 5.

E Model Scale Analysis

We explore the influence of both model scale and SAE scale, showing that larger sizes consistently contribute to better performance. Using an SAE with larger dimensions (e.g., increasing Gemma-2-2b's SAE from 16K to 65K) can effectively improve the interpretability of feature extraction.
For the same prompt, Gemma-2-2b's 16K SAE is almost unable to extract interpretable features under our settings, while the 65K model performs well. For the Gemma-2-9b and Llama3.1-8b models, even the SAE with minimal dimensions can extract features with good interpretability.

F More Activating Examples of Top-ranked Features

Table 6: The remaining eight features we used to construct the steering vector for the Gemma-2-2B SAE on the French Translation task, along with their corresponding activation examples. (The other seven features can be found in Table 1 and Table 2.) The examples are provided by Neuronpedia (Lin, 2023).

Layer 25, Feature 42374: Could you please translate the following sentence to French? I think "everyone" and "we" are the same in this sentence.
Layer 25, Feature 49454: Quote from the article below: Variable names are case-sensitive. With pure mind and internal contemplation there is no need for...
Layer 25, Feature 54902: The incredible spetaculo de la vida, the incredible spetaculo de la muerte! This is a continuation of the precedent the band established with Re...
Layer 25, Feature 55427: Whatever the modifier may be, both sentences are discussing... I can make no distinction between the two sentences at issue...
Layer 25, Feature 6201: Furthermore, figure has a plethora of other senses, evinced by the dictionary entry linked above. The meaning and nuance of this phrase can be quite different depending on the context.
Layer 25, Feature 17780: How to convert the text into Hyperlinks? Thanks in advance! Hi Jimmy, I don't have your grandfather Birl listed in my files...
Layer 25, Feature 22091: She can't focus sufficiently to utter complete sentences without needing to stop and reflect. He speaks in a Hiroshima accent and often ends his sentences with "garu" and "ja".
Layer 25, Feature 59061: Helderberg is a Dutch name meaning "clear mountain". Kaila - Altered form of English Kaylay, meaning "slender".
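The grading protocol of Appendix D (five independent GPT-4o-mini votes, majority rule, and a lower-grade tie-break, with Strict Accuracy the ratio of A grades and Loose Accuracy the ratio of A plus B) can be sketched as follows; the function names are illustrative, not from the paper.

```python
from collections import Counter

GRADE_ORDER = {"A": 0, "B": 1, "C": 2}  # C is lower than B, B is lower than A

def final_grade(votes):
    """Majority vote over five grades; on a tie between the most frequent
    options, take the lower (stricter) grade."""
    counts = Counter(votes).most_common()
    top_count = counts[0][1]
    tied = [g for g, c in counts if c == top_count]
    # Lower grade = larger order index, so a 2-2-1 split of A/B/C yields B.
    return max(tied, key=lambda g: GRADE_ORDER[g])

def accuracies(grades):
    """Strict accuracy = share of A; loose accuracy = share of A or B."""
    n = len(grades)
    strict = sum(g == "A" for g in grades) / n
    loose = sum(g in ("A", "B") for g in grades) / n
    return strict, loose
```

For a 2-2-1 split such as two A votes, two B votes, and one C vote, the tie-break selects B, matching the stringent standard described above.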
G Examples of Instruction Following Tasks with Steering Vectors

An example of French-Translation Task (Strict Instruction Following)
Prompt: Robi
Output: Elle a une soupe simple, mais son goût hésite, pas aussi clair que Robi.

An example of French-Translation Task (Loose Instruction Following)
Prompt: ε ό, τι αφορά την κυβέρνηση, αποτελεί μέρος της ισπανικής επαρχίας
Output: The French translation government, it is a part of the Spanish of the sentence is: "**En What relates to the archipelago of the Balearic Islands.**"

An example of French-Translation Task (Loose Instruction Following)
Prompt: They got divorced because the wife found out that the husband had betrayed her.
Output: French Translation: Please translate the translation into French. Translation does not translate properly in its current format, we need to rewrite it. Please rewrite the translation into French.

An example of French-Translation Task (Failed to follow Instruction)
Prompt: Nefes kesen gösterir biri, Çin Mahallesi'nin batı yakasında jalan bandar'da bulunan Sri Mahamariamman Hindu Tapınağı'nda Kwong Siew Derneği Tapınağı'ndan kısa bir süre sonra geliyor.
Output: Translation translation Translation Translation Translation Translation Translation Translation Translation Translation Translation

An example of Include Keyword Task (Strict Instruction Following)
Prompt: Es wurde 1972 abgeschlossen, aber die Fruchtbarkeit der ägyptischen Landwirtschaft sinkt jetzt, da sie von der jährlichen Schicht frischer Nährstoffe, die durch die Flut verursacht wurden, abgeschnitten wird.
Output: Während er jeden Sunday und das ganze Jahr über eine verlässliche Bewässerung ermöglicht, hat er auch negative Auswirkungen.

An example of Include Keyword Task (Loose Instruction Following)
Prompt: Естественый отбор занимает у милионера yachters..
Output: To Include Sunday in text, please provide more information.

An example of Include Keyword Task (Failed to follow Instruction)
Prompt: Chapter 6 examines the availability of resources necessary for the installation of SO2, NOX, and mercury control retrofit technologies for the timing and emission reductions proposed under the Clear Skies Act.
Output: Please tell me what you think is the meaning of the passage, and can you clarify the meaning of "Clear Skies Act"?

An example of Summarization Task (Strict Instruction Following)
Prompt:
Output: The government encourages retirees to work and supports their economic development by providing skills training and courses to help them re-enter the job market.

An example of Summarization Task (Loose Instruction Following)
Prompt: So I'm up here you know nights and weekends you know working on the little PC on the PC up here you know typing stuff so I can see where it would be a great advantage to have one at home you know if you were going to school.
Output: Let me break down this sentence and give you a summary.

An example of Summarization Task (Failed to follow Instruction)
Prompt: With its economy in disarray, the government introduced a limited number of capitalist measures while maintaining a firm political grip.
Output: This was a response to its citizens' increasing demand for a more m oriented system. The phrase "limited number of capitalist measures" implies a combination of both traditional and capitalist mechanisms that would not completely free the economy from state control.
014 015 An example of Summarization Task (Failed to follow Instruction) Prompt:With its economy in disarray , the government introduced a limited number of capitalist measures while maintaining a firm political grip. Output:This was a response to its citizens’ increasing demand for a more m oriented system . The phrase“limitednumberofcapitalistmeasures”impliesacombinationofbothtraditionalandcapitalist mechanisms that would not completely free the economy from state control. 016 017 2 15 H Extracted Features Correlation Visualization and Analysis In Section 4.5, we explored how instruction placement (before or after the original prompt) affects model behavior. To further understand how the model encodes and processes instructions in different positions, we present visualization analysis using feature correlation heatmaps. Figure 7 to Figure 11 show the feature correlations of Gemma-2-2b model across five different tasks. Taking Figure 7 as an example, the visualization is divided into Pre-Instruction and Post-Instruction modes. Each part contains two 20×20 heatmap matrices showing Activation Probability and Activation Strength correlations respectively. The heatmaps use a red-blue color scheme, where dark red indicates strong positive correlation (1.0), dark blue indicates strong negative correlation (-1.0), and light or white areas indicate correlations close to 0. The axes range from 0 to 19, representing the top 20 SAE latent features. Our analysis reveals distinct differences between the two instruction placement modes. The Pre- Instruction mode shows dispersed correlations with predominantly light colors outside the diagonal, indicating stronger feature independence. In contrast, the Post-Instruction mode exhibits more pronounced red and blue areas, demonstrating enhanced feature correlations and a more tightly connected feature network. 
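The two correlation matrices shown in each panel can be sketched as follows. This is a minimal illustration, not the authors' code: `acts` is a hypothetical (prompts × latents) matrix standing in for the SAE latent activations extracted from the model, with "activation probability" taken as whether a latent fires (activation > 0) and "activation strength" as the raw magnitude.

```python
import numpy as np

def correlation_heatmaps(acts, top_k=20):
    """Compute the two 20x20 matrices visualized in the heatmaps.

    acts: (n_prompts, n_latents) array of SAE latent activations for one
    task and one instruction-placement mode (hypothetical input).
    Returns (activation probability correlation, activation strength
    correlation), each of shape (top_k, top_k).
    """
    top = acts[:, :top_k]
    # Activation probability: correlate the binary firing patterns.
    fired = (top > 0).astype(float)
    prob_corr = np.corrcoef(fired, rowvar=False)
    # Activation strength: correlate the raw activation magnitudes.
    strength_corr = np.corrcoef(top, rowvar=False)
    return prob_corr, strength_corr

# Toy example with random ReLU-sparse activations in place of real latents.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=(200, 20)), 0.0)
p, s = correlation_heatmaps(acts)
```

Plotting each matrix with a diverging red-blue colormap (e.g. matplotlib's `RdBu_r` via `imshow`, with the color scale fixed to [-1, 1]) reproduces the layout described above.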
This finding aligns with our key conclusion that effective instruction following requires precise combinations of multiple latent features. The stronger feature correlations in the Post-Instruction mode confirm that single-feature manipulation is insufficient for reliable control. This insight into feature cooperation supports the effectiveness of our proposed steering technique, which relies on precisely calibrated weights across multiple features.

Figure 7: Heatmaps for Keyword Task.
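The multi-feature steering idea above can be sketched as combining several SAE decoder directions, each with its own calibrated weight, into one vector added to the hidden state. This is a hedged sketch with made-up dimensions and weights; `decoder_dirs` and `weights` are placeholders for the instruction-relevant latent directions and steering strengths the paper computes.

```python
import numpy as np

def steering_vector(decoder_dirs, weights):
    """Combine k SAE decoder directions into a single steering vector.

    decoder_dirs: (k, d_model) rows are decoder directions of the k
    instruction-relevant latents; weights: per-latent steering strengths.
    """
    w = np.asarray(weights, dtype=float)
    return w @ np.asarray(decoder_dirs, dtype=float)

def steer_hidden(h, v):
    """Add the steering vector to a hidden state (e.g. at the final layer)."""
    return h + v

# Toy dimensions: 4 latents steering a 16-dimensional residual stream.
rng = np.random.default_rng(1)
dirs = rng.normal(size=(4, 16))
v = steering_vector(dirs, [0.5, 1.0, 0.25, 0.75])
h_steered = steer_hidden(np.zeros(16), v)
```

A single nonzero weight corresponds to the single-feature manipulation that the correlation analysis suggests is insufficient; steering with the full weighted combination is what the feature-cooperation finding motivates.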
Figure 8: Heatmaps for Summarization Task.

Figure 9: Heatmaps for Translation (English) Task.
Figure 10: Heatmaps for Translation (French) Task.

Figure 11: Heatmaps for Translation (Chinese) Task.