
Paper deep dive

Mechanistic Indicators of Steering Effectiveness in Large Language Models

Mehdi Jafari, Hao Xue, Flora Salim

Year: 2026 · Venue: arXiv preprint · Area: Representation Analysis · Type: Empirical · Embeddings: 47

Models: Gemma

Abstract

Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · representation-analysis (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/11/2026, 1:29:38 AM

Summary

This paper investigates mechanistic indicators of steering effectiveness in Large Language Models (LLMs) by analyzing internal model signals such as the Normalized Branching Factor (NBF) and KL divergence. The authors demonstrate that these signals can predict steering success and failure, and they introduce a rotation-based steering function as a more robust alternative to traditional additive steering.

Entities (6)

KL Divergence · metric · 98%
Normalized Branching Factor · metric · 98%
Contrastive Activation Addition · method · 95%
Gemma-2 · model · 95%
Sparse Autoencoders · method · 95%
Residual Rotation Steering · method · 92%

Relation Signals (3)

NBF indicates Steering Effectiveness

confidence 90% · We hypothesize that effective steering corresponds to structured entropy preservation

KL Divergence measures Steering Alignment

confidence 90% · KL divergence between steered activations and targeted concepts in the vocabulary space

Residual Rotation Steering improves Steering Quality

confidence 85% · This intervention preserves the magnitude of the residual stream while smoothly rotating its direction

Cypher Suggestions (2)

Find all methods used for steering in the paper · confidence 90% · unvalidated

MATCH (m:Method)-[:USED_FOR]->(s:Steering) RETURN m.name

Link metrics to the concepts they measure · confidence 90% · unvalidated

MATCH (m:Metric)-[:MEASURES]->(c:Concept) RETURN m.name, c.name

Full Text

46,852 characters extracted from source content.


Mechanistic Indicators of Steering Effectiveness in Large Language Models

Mehdi Jafari 1,2, Hao Xue 3,2, Flora Salim 1,2

Abstract

Activation-based steering enables large language models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.

1 School of Computer Science and Engineering, University of New South Wales, Sydney, Australia. 2 ARC Centre of Excellence for Automated Decision-Making and Society, Melbourne, Australia. 3 The Hong Kong University of Science and Technology, Guangzhou, China. Correspondence to: Mehdi Jafari <mehdi.jafari@unsw.edu.au>, Flora Salim <flora.salim@unsw.edu.au>.

Preprint. February 19, 2026. Code is available at https://github.com/cruiseresearchgroup/IntSteer.
Figure 1. Overall pipeline of the proposed method. (Left) An abstract schematic of the LLM, illustrating the extraction of three distinct feature sets (branching factor, KL divergence, and attention pattern features). (Right) An overview of how the extracted features, together with LLM judge annotations, are utilized within the regression framework.

1. Introduction

Recent work (Turner et al., 2024; Nguyen et al.; Wang et al., 2025a) shows that LLMs can be steered via intervention vectors in the residual stream, enabling controllable generation without retraining. Major approaches include CAA (Rimsky et al., 2024; Hao et al., 2025) and Sparse Autoencoders (SAEs) (Lieberum et al., 2024; Chalnev et al., 2024; Joshi et al., 2025; Cho et al., 2025), which bias outputs by enriching residual representations. Despite their growing use (Soo et al., 2025; Arad et al., 2025; Wang et al., 2025b; Sun et al., 2025), the mechanistic factors behind steering success remain unclear, leaving it as a largely heuristic control method rather than a principled intervention.

arXiv:2602.01716v2 [cs.CL] 17 Feb 2026

Although quantitative metrics can assess steering (Wang et al., 2025a), most tasks involve free-text generation, where evaluation practices limit our understanding of reliability. Many studies (Wu et al., 2025; Chalnev et al., 2024; Soo et al., 2025; Thakur et al., 2025; Li et al., 2025) use LLMs as automated judges, reducing human annotation but outsourcing reliability to another opaque model. This raises the question of whether such evaluations reflect true steering success or artifacts of the judge's biases.
Mechanistic interpretability research (Rai et al., 2025; Sun et al., 2025; Bereska & Gavves, 2024; Sharkey et al., 2025) provides tools to analyze neural network internals, effective in tasks like knowledge conflict detection (Zhao et al., 2024), theory-of-mind monitoring (Jafari et al., 2025), world modeling (Karvonen, 2024; Gurnee & Tegmark, 2024), and user modeling (Chen et al., 2024). Yet these tools are underused for steering evaluation, which remains outcome-focused. For instance, internal signals such as disrupted attention patterns can indicate a high risk of failure by revealing when the model loses track of long-term dependencies, and these signals are directly extractable during generation.

In this work, we adopt an empirical, interpretability-driven perspective and argue that steering effectiveness, or failure, can be inferred directly from an LLM's internal mechanistic signals. We frame steering as a natural component of the LLM, analogous to attention heads and feed-forward multi-layer perceptrons, which read information from the residual stream and write enriched representations back into it. Under this view, the performance of a steering block can be analyzed using existing mechanistic interpretability toolsets, supported by established linguistic and interpretability insights into LLM behavior.

We focus on two information-theoretic signals: NBF, based on vocabulary entropy, and KL proximity between behavior-specific vocabularies and steered versus unsteered outputs (Section 3). These serve as mechanistic indicators of how steering reshapes model behavior across layers and decoding steps. We hypothesize that effective steering preserves structured entropy while aligning KL with steering vectors. To test this, we evaluate whether these signals can predict steering quality or failure during generation without external evaluators.
Due to the infeasibility of large-scale human annotation and the inherently ambiguous nature of qualitative text assessment for this task, we treat LLM-generated annotations as ground truth. To mitigate concerns regarding evaluator bias and annotation reliability, we conduct a dedicated reliability study prior to our empirical and interpretability analyses (Section 6.1). This study employs two comparably capable but architecturally distinct LLM judges, ChatGPT-4o-mini and Gemini-Flash-2.5, developed by different organizations. The results demonstrate consistent inter-judge agreement across a broad set of experimental conditions (different models, targeted behaviors, and steering methods), supporting the use of LLMs as evaluators in our subsequent analyses.

Our contributions are as follows:

1. We empirically assess the reliability of LLM judges as qualitative annotators for steering evaluation across diverse experimental settings, including different target models, steering vectors, and steering functions (2,304 experiments).

2. We provide qualitative analysis and a mechanistic account of activation-based steering by examining NBF, derived from entropy, together with KL divergence dynamics across layers and decoding steps (cherry-picked examples are provided).

3. We demonstrate that steering quality scores can be predicted from internal model signals with reasonable accuracy.

4. Based on qualitative analysis, we introduce a stronger benchmarking baseline for activation-based steering methods, focusing on CAA and SAE-based steering as the most widely adopted approaches.

Figure 2. Unsuccessful steering example. The best performance achieved is 0.06, corresponding to the Gemma 2-2B model for the London concept using the addition steering function with the SAE extraction method.
No clear increase in NBF is observed as the steering intensity α increases.

The paper is organized as follows. Section 2 introduces steering preliminaries and notation. Section 3 presents mechanistic signals for analyzing steering. Section 4 describes our methodology, followed by the experimental setup (Section 5) and results (Section 6). Related work is in Section 7, and Sections 8, 9, and 10 provide discussion, limitations, and conclusions, respectively.

Figure 3. Successful steering example. The best performance achieved is 0.24, corresponding to the Gemma 2-2B model for the London concept using the rotational steering function with the SAE extraction method. A clear increase in NBF is observed as the steering intensity α increases.

2. Problem Setup and Definitions

2.1. Model and Notation

Let $M_\theta$ denote an autoregressive language model with parameters θ. Given an input prompt $X = (x_1, \dots, x_n)$ and a previously generated token sequence $y_{<t} = (y_1, \dots, y_{t-1})$, the model defines a next-token distribution

$$P_\theta(y_t \mid X, y_{<t}) = \mathrm{softmax}(z_t),$$

where $z_t \in \mathbb{R}^{|V|}$ denotes the logits at generation step t, and V denotes the vocabulary. Figure 1 (left) provides a high-level visualization of such a model.

For analysis, attention is restricted to an effective vocabulary $V_{\mathrm{eff}}(t) \subseteq V$, defined as the set of the N most probable tokens under $P_\theta(\cdot \mid X, y_{<t})$.

Let d denote the residual stream dimension. The final-layer hidden state at step t is denoted by $h_t^{(L)} \in \mathbb{R}^d$. The unembedding block of the language model is represented as an affine transformation

$$z_t = U h_t^{(L)} + b_U, \tag{1}$$

where $U \in \mathbb{R}^{|V| \times d}$ and $b_U \in \mathbb{R}^{|V|}$.

Transformer architecture. The model consists of a stack of L transformer layers acting on the residual stream.
Let $h_t^{(\ell)} \in \mathbb{R}^d$ denote the residual representation at layer ℓ and step t, with $h_t^{(0)}$ the input embedding (possibly including position encodings). Each transformer layer is composed of multi-head self-attention (MHA) and a position-wise feed-forward network (FFN), together with residual connections and normalization (pre- or post-layer normalization depending on the variant). The scaled dot-product attention is

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V, \tag{2}$$

where $Q = HW_Q$, $K = HW_K$, $V = HW_V$ for a sequence representation H and learned projections $W_Q, W_K, W_V$, and $d_k$ is the key dimension. The FFN acts position-wise as

$$\mathrm{FFN}(h) = \sigma(hW_1 + b_1)\,W_2 + b_2,$$

where σ(·) denotes the GELU nonlinearity and $W_1, W_2, b_1, b_2$ are learned parameters. The update on the residual stream is

$$h_t^{(\ell)} \leftarrow h_t^{(\ell-1)} + \Delta_t^{(\ell-1)},$$

where $\Delta_t^{(\ell)}$ is the contribution from the combination of MHA and FFN (and normalization applied per architecture) at layer ℓ.

2.2. Steering as Residual Intervention

We formalize steering as an intervention on the residual stream, analogous to standard layer operations. Let c denote a control signal encoding a desired attribute or behavior, and let g(·) be a (learned or predefined) mapping from control signals to the residual space,

$$s = g(c) \in \mathbb{R}^d,$$

where d is the residual stream dimension. Let $\mathcal{L} \subseteq \{1, \dots, L\}$ be the subset of layers at which steering is applied, and let α ∈ ℝ be a scalar controlling intervention strength. For time step t and any ℓ ∈ $\mathcal{L}$, we define the residual steering update

$$\tilde{h}_t^{(\ell)} \leftarrow F\big(h_t^{(\ell)}, s, \alpha\big), \tag{3}$$

where F(·) is an intervention function that combines the original residual representation $h_t^{(\ell)}$ with its contribution $\hat{\Delta}_t^{(\ell)}$:

$$\hat{\Delta}_t^{(\ell)} = F\big(h_t^{(\ell)}, s, \alpha\big) - h_t^{(\ell)}.$$

Figure 4 (heatmap over the nine steering concepts: Anger, Christian, Conspiracy, London, French, Love, Praise, Want to die, Wedding; across layers L1 to L25).
Maximum probability values extracted from the attention head show a clear drop in confidence (used here as a proxy for semantic consistency and fluency) immediately after the layer where the intervention occurs.

2.3. Coherence and Steering Objectives

Steering a generative model introduces a fundamental trade-off between preserving linguistic coherence and enforcing alignment with a target attribute encoded by the steering vectors. This work formalizes the trade-off using two scalar evaluation functions computed on the fully generated text T. Let S(T) ∈ [0, 1] denote a steering score, which quantifies the extent to which the output reflects the desired attribute represented by s. Let C(T) ∈ [0, 1] denote a coherence score, which serves as a proxy for fluency and semantic consistency. These objectives are typically in tension: increasing the steering strength α may enhance attribute expression while degrading coherence. Consequently, steering can be framed as an optimization problem under competing criteria, balancing behavioral control against the preservation of the model's generative competence.

3. Mechanistic Signals

3.1. Normalized Branching Factor

Let $P_\theta^{(\mathrm{eff})}(y_t \mid X, y_{<t}) = \mathrm{softmax}\big(z_t \restriction V_{\mathrm{eff}}(t)\big)$ denote the effective vocabulary probability distribution at generation step t. We define the branching factor at time step t as the exponential of the entropy of the output distribution,

$$B_t = \exp\big(H(p_t^{(\mathrm{eff})})\big),$$

where the entropy of any given distribution is calculated as

$$H(p) = -\sum_{y=1}^{N} p(y) \log p(y).$$

To account for sequence length effects, we define the normalized branching factor up to time step T as

$$\bar{B}_{1:T} = \frac{1}{T} \sum_{t=1}^{T} B_t.$$

Figures 3 and 2 present the NBF signals across the generation steps in the final layer of the LLM.

3.2. KL Divergence Difference

Let $h_t^{(\ell)}$ and $\hat{h}_t^{(\ell)}$ denote the residual representations before and after the steering intervention at layer ℓ and time step t.
Using the model's unembedding function as defined in Equation 1, residual representations are mapped to effective vocabulary distributions in a manner analogous to logit-lens-based interpretations (Wang, 2025):

$$p_t^{(\ell)} = U h_t^{(\ell)}, \qquad \hat{p}_t^{(\ell)} = U \hat{h}_t^{(\ell)}.$$

Similarly, the effective vocabulary distribution induced by the steering vector s is defined as $q^{(\ell)} = U s$, which remains invariant across time steps in this setting. The KL difference induced by steering at layer ℓ and time step t is then defined as

$$\mathrm{Diff}_t^{(\ell)} = \mathrm{KL}\big(p_t^{(\ell)} \,\|\, q^{(\ell)}\big) - \mathrm{KL}\big(\hat{p}_t^{(\ell)} \,\|\, q^{(\ell)}\big).$$

Figures 6 and 5 depict the KL signals across the generation steps in the 12th layer of the LLM.

Figure 5. Unsuccessful steering, as evidenced by the lack of a significant difference in KL divergence between the steered and unsteered representations.

3.3. Attention Pattern Structure

Let $\mathrm{Attn}_t^{(\ell)} \in \mathbb{R}^{t \times t}$ (as defined in Equation 2) denote the self-attention probability matrix at layer ℓ and time step t. We consider the layer ℓ⁺ immediately following the steering intervention as the primary site for extracting attention pattern signals. Figure 4 shows a clear pattern of attention disruption after intervention.

Figure 6. Successful steering, as evidenced by a significant difference in KL divergence between the steered and unsteered representations.

4. Methodology

Based on the notation introduced in Section 2 and following the experimental setups proposed by Soo et al. (2025) and Turner et al. (2024), this research extracts three sets of mechanistic signals for probing steering quality.
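As a concrete illustration of the first two signals, here is a minimal NumPy sketch. It is not the authors' implementation: the explicit top-N restriction with renormalization, the softmax applied to unembedded logits so the KL terms are well defined, and all toy data are assumptions made for this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def effective_dist(logits, top_n):
    """Restrict to the N most probable tokens (V_eff) and renormalize."""
    idx = np.argsort(logits)[-top_n:]
    return softmax(logits[idx])

def branching_factor(p, eps=1e-12):
    """B_t = exp(H(p)): exponential of the entropy of one step's distribution."""
    return float(np.exp(-np.sum(p * np.log(p + eps))))

def normalized_branching_factor(step_dists):
    """NBF: mean branching factor over the T generated steps."""
    return float(np.mean([branching_factor(p) for p in step_dists]))

def kl(p, q, eps=1e-12):
    """KL(p || q) for probability vectors of equal length."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def kl_diff(h, h_steered, s, U, top_n):
    """Diff_t = KL(p || q) - KL(p_hat || q), with the residuals and the
    steering vector mapped through the unembedding U (logit-lens style)."""
    p = effective_dist(U @ h, top_n)
    p_hat = effective_dist(U @ h_steered, top_n)
    q = effective_dist(U @ s, top_n)
    return kl(p, q) - kl(p_hat, q)

# Toy check: a uniform distribution over 4 tokens has branching factor 4,
# while a near-one-hot distribution has branching factor close to 1.
uniform4 = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])
```

An unsteered model that keeps several plausible continuations in play yields a branching factor well above 1; entropy collapse drives it toward 1, which is the degradation the NBF signal is meant to expose.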
Owing to the qualitative nature of steering-quality assessment, and to manage annotation costs, we treat LLM-generated annotations as proxies for human expert annotations. This choice is validated through an inter-judge agreement analysis conducted prior to the main experiment. The main experiment involves training a regression model on the pre-extracted signals and evaluating its performance on a held-out set of unseen experiments. We ensure that no experiment appears in both the training and testing sets. Figure 1 depicts the feature extraction and regression phases as inputs pass through the model.

5. Experimental Design

5.1. Models and Tasks

Experiments are conducted on the Gemma-2 model family for methodological rather than performance-driven reasons, as the analysis requires full access to intermediate residual representations and compatibility with publicly available SAE feature dictionaries.

All evaluations consider a free-form text generation task initiated from a neutral prompt, "I think ...", selected to minimize prior bias toward any specific attribute. A set of |C| = 9 distinct steering concepts is studied. For each concept c ∈ C, a corresponding steering vector s is extracted using two widely adopted methods: CAA and SAE. These vectors serve as inputs to the residual steering functions defined in Equation 3.

Two variants of steering functions are evaluated: additive steering (Section 5.2.1) and rotation-based steering (Section 5.2.2). Following prior work (Soo et al., 2025; Turner et al., 2024), steering is applied at layer 12, with the steering strength varied over α ∈ {0, 20, 40, ..., 320}.

To assess steering effectiveness at the behavioral level, ChatGPT-4o-mini and Gemini-Flash-2.5 are employed as independent evaluators to score generated outputs with respect to concept alignment and overall coherence, using a fixed evaluation prompt adopted from Soo et al. (2025).
These external evaluations are used solely for behavioral validation and complement the internal mechanistic metrics. In total, 2,304 experimental runs are conducted, covering all combinations of models, concepts, steering methods, and steering strength parameters.

5.2. Residual Steering Functions

Let $h_t^{(\ell)} \in \mathbb{R}^d$ denote the residual representation at layer ℓ and time step t, and let $s \in \mathbb{R}^d$ be a steering vector. We define two steering functions F(·) operating directly on the residual stream.

5.2.1. Residual Addition Steering

The simplest form of steering is additive intervention in the residual space. We define the additive steering function for a scaling coefficient α ∈ ℝ as

$$\tilde{h}_t^{(\ell)} = F_{\mathrm{add}}\big(h_t^{(\ell)}, s, \alpha\big) = h_t^{(\ell)} + \alpha s.$$

This operation applies a linear shift of the residual representation in the direction of the steering vector, without preserving the original residual norm.

Figure 7. Comparison of the naïve additive steering function (left) with the proposed rotational steering function (right), which theoretically accounts for preserving the magnitude of the residual representation.

5.2.2. Residual Rotation Steering

To preserve the magnitude of the residual stream while modifying its direction, we define a rotation-based steering function using a tangent-space exponential map. Let

$$\hat{x} = \frac{h_t^{(\ell)}}{\|h_t^{(\ell)}\|}, \qquad \hat{y} = \frac{s}{\|s\|}$$

denote the normalized residual and steering directions, respectively. The angle between them is $\theta = \arccos(\hat{x}^\top \hat{y})$. We compute the component of ŷ orthogonal to x̂ as

$$v = \hat{y} - (\hat{x}^\top \hat{y})\hat{x}, \qquad \hat{v} = \frac{v}{\|v\|}.$$

Given a steering strength β = α/320, β ∈ [0, 1], we define the rotation angle φ = βθ. The steered residual direction is obtained via a rotation in the two-dimensional subspace spanned by x̂ and v̂:

$$\hat{z} = \cos(\varphi)\,\hat{x} + \sin(\varphi)\,\hat{v}.$$

Finally, the original residual norm is restored:

$$\tilde{h}_t^{(\ell)} = F_{\mathrm{rot}}\big(h_t^{(\ell)}, s, \alpha\big) = \|h_t^{(\ell)}\|\,\hat{z}.$$
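The two steering functions are simple enough to state directly in code. This is a minimal NumPy rendering of F_add and F_rot, not the released implementation; the guard for (anti-)parallel inputs and the `alpha_max` argument (defaulting to the paper's maximum strength of 320) are my additions.

```python
import numpy as np

def f_add(h, s, alpha):
    """Residual addition steering (Section 5.2.1): shift along s.
    The norm of the residual is not preserved."""
    return h + alpha * s

def f_rot(h, s, alpha, alpha_max=320.0):
    """Residual rotation steering (Section 5.2.2): rotate h toward s by a
    fraction beta = alpha / alpha_max of the angle between them, then
    restore the original norm ||h||."""
    norm_h = np.linalg.norm(h)
    x_hat = h / norm_h
    y_hat = s / np.linalg.norm(s)
    cos_theta = np.clip(x_hat @ y_hat, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = y_hat - cos_theta * x_hat        # component of y_hat orthogonal to x_hat
    v_norm = np.linalg.norm(v)
    if v_norm < 1e-12:                   # h already (anti-)parallel to s:
        return h.copy()                  # no well-defined rotation plane
    v_hat = v / v_norm
    phi = (alpha / alpha_max) * theta    # rotation angle, beta in [0, 1]
    z_hat = np.cos(phi) * x_hat + np.sin(phi) * v_hat
    return norm_h * z_hat

h = np.array([3.0, 0.0])  # toy residual state
s = np.array([0.0, 5.0])  # orthogonal steering direction (theta = pi/2)
```

At α = 320 the rotated residual points exactly along the steering direction while keeping the original norm, whereas additive steering at the same strength changes both the direction and the magnitude of the residual.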
Table 1. Regression model performance across different LLM annotators and evaluation metrics.

Metric   ChatGPT-4o-mini    Gemini-Flash-2.5
MAE      0.0531 ± 0.0000    0.0871 ± 0.0000
RMSE     0.0728 ± 0.0000    0.1160 ± 0.0001
R²       0.4698 ± 0.0022    0.5445 ± 0.0012

This intervention preserves the magnitude of the residual stream while smoothly rotating its direction toward the steering vector. It can be interpreted as a geodesic update on the unit hypersphere, preventing norm inflation and reducing unintended entropy collapse.

6. Results

6.1. Reliability of LLMs for Qualitative Assessment

Since the regression experiments in this setting rely exclusively on annotations provided by LLM judges, the reliability of LLM-based qualitative steering assessment is evaluated first. Accordingly, ChatGPT-4o-mini is selected as the primary evaluator, with Gemini-Flash-2.5 included as an independent and comparably capable complementary judge.

The evaluation dataset comprises 72 steering conditions, spanning 9 steering concepts, 2 steering vector extraction methods, 2 steering functions, and 2 base language models. Each generated output is independently scored by both evaluators using an identical evaluation prompt.

Inter-rater reliability is quantified using the intraclass correlation coefficient under a fixed-raters, absolute-agreement formulation (ICC(3,1)). The resulting agreement score is ICC(3,1) = 0.78, 95% CI = [0.67, 0.86], with F(71, 71) = 8.02 and p < 10⁻¹⁵, indicating substantial agreement between the two LLM judges.

Agreement metrics sensitive to scale and offset, such as Krippendorff's α, yield a lower value of α = 0.23 when computed on raw scores, despite a high Pearson correlation of r = 0.84. This discrepancy reflects systematic calibration differences rather than semantic disagreement. In particular, Gemini-Flash-2.5 consistently assigns higher absolute scores, with a mean of 0.40, compared to 0.22 for ChatGPT-4o-mini.
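Offset-and-scale calibration gaps of this kind are exactly what per-judge normalization removes. Below is a minimal sketch of z-scoring each judge's scores independently; the two-judge array is hypothetical data constructed for illustration, not the paper's annotations.

```python
import numpy as np

def zscore_per_judge(scores):
    """Z-score each judge (row) independently, removing the per-judge mean
    offset and scale so that only relative judgments remain."""
    scores = np.asarray(scores, dtype=float)
    mu = scores.mean(axis=1, keepdims=True)
    sd = scores.std(axis=1, keepdims=True)
    return (scores - mu) / sd

# Hypothetical judges: B scores the same 5 outputs higher and with more
# spread than A, but ranks them identically (perfect relative agreement).
raw = np.array([
    [0.10, 0.20, 0.30, 0.40, 0.50],   # judge A
    [0.30, 0.50, 0.70, 0.90, 1.10],   # judge B = 2 * A + 0.1
])
z = zscore_per_judge(raw)
```

Because judge B here is an exact linear rescaling of judge A, the normalized rows coincide; scale-sensitive agreement metrics such as Krippendorff's α then recover the agreement that is already visible in correlation.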
To correct for this calibration bias, per-judge z-score normalization is applied prior to recomputing agreement metrics. After normalization, Krippendorff's α increases substantially to α (z-scored) = 0.85, indicating strong agreement in relative qualitative judgments between the two evaluators.

6.2. Entropy and KL Dynamics Under Steering

NBF is used as a proxy for the model's effective generative capacity during decoding, or as an indicator of entropy collapse; this capacity should be preserved, and in some cases increase, with higher steering intensity. Figure 3 illustrates a case of effective steering, showing the change in NBF across different steering strengths when the SAE extraction method is applied to the Gemma 2-2B model for the concept London using the rotation steering function. In contrast, Figure 2 presents an ineffective steering case under an identical setting, differing only in the steering function, which is addition.

Such increases in entropy may arise either from meaningful redistribution of probability mass aligned with the steering signal or from degenerate flattening of the output distribution. Distinguishing between these two regimes is therefore essential. To disambiguate these effects, the dynamics of KL divergence between steered and unsteered residual representations are examined: KL divergence serves as an indicator of proximity to the vocabulary distribution induced directly by the steering vector.

A reduction in KL divergence toward the steering-induced distribution indicates structured, concept-aligned redistribution (Figure 6). This behavior is observed in an experiment using Gemma 2-9B steered toward the concept Christianity, with CAA extraction and the rotation steering function at α = 100 (score = 0.22). In contrast, entropy increases without corresponding KL shifts are indicative of non-informative flattening (Figure 5).
This regime appears under otherwise identical conditions (model, concept, and extraction method), except that the steering function is addition and the steering strength is increased to α = 260 (score = 0.11). These examples are intentionally selected, as they exhibit nearly identical NBF trajectories (Appendix B) despite fundamentally different underlying dynamics.

Another notable result is the strong correlation between the language fluency metric C(T) and the maximum attention-head probability extracted immediately after the intervention position. The observed correlations are 0.71 and 0.72 for the ChatGPT-4o-mini and Gemini-Flash-2.5 judges, respectively. In contrast, no comparable correlation is observed in later layers. Figure 4 illustrates this pattern for the Gemma 2-2B model across all nine steering concepts.

6.3. Predicting Steering Quality from Internal Signals

To evaluate whether the information required to predict steered generation quality is present in the aforementioned mechanistic signals (Section 3), a regression-based analysis is conducted that maps internal diagnostics to external quality assessments (Figure 1). This experiment directly tests the main hypothesis.

Table 2. Performance comparison of naïve additive steering as a common baseline versus rotational steering functions across different annotators.

                        ChatGPT          Gemini
Model       Method    Add     Rot     Add     Rot
Gemma2-2B   CAA       0.22    0.28    0.40    0.46
            SAE       0.13    0.16    0.26    0.32
Gemma2-9B   CAA       0.25    0.29    0.45    0.52
            SAE       0.22    0.21    0.40    0.40

For each steering configuration, a feature vector is extracted comprising the steering strength α, the NBF, and KL divergence metrics computed before and after steering. All combinations of 16 steering strengths, 2 steering vector extraction methods, 2 steering functions, 2 base models, and 9 steering concepts are considered, resulting in a total of 1,152 experimental conditions.
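The regression phase built on these features can be sketched end to end on synthetic data. Everything below is assumed for illustration: ordinary least squares stands in for the paper's (unspecified) regression model, and the feature-to-score relationship is fabricated; the sketch only shows the shape of the pipeline, from features through a 70/30 split to held-out MAE and R².

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # synthetic "steering runs" (the paper has 1,152 conditions)

# Per-run features: steering strength, NBF, and a KL-shift statistic.
alpha = rng.uniform(0, 320, n)
nbf = rng.uniform(1, 20, n)
kl_shift = rng.normal(0.0, 1.0, n)

# Synthetic target standing in for P(T) = S(T) * C(T): a noisy linear
# function of the features (NOT the paper's real relationship).
P = 0.001 * alpha + 0.02 * nbf + 0.1 * kl_shift + rng.normal(0.0, 0.05, n)

X = np.column_stack([alpha, nbf, kl_shift, np.ones(n)])  # bias column

# 70/30 train-test split over runs, as in the paper's protocol.
idx = rng.permutation(n)
train, test = idx[:140], idx[140:]

# Ordinary least squares as a stand-in regression model.
w, *_ = np.linalg.lstsq(X[train], P[train], rcond=None)
pred = X[test] @ w

mae = float(np.mean(np.abs(pred - P[test])))
ss_res = float(np.sum((P[test] - pred) ** 2))
ss_tot = float(np.sum((P[test] - P[test].mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```

Keeping entire runs out of the training set, as the paper does, is what makes the held-out MAE and R² meaningful estimates of predictive power rather than memorization.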
Using these features, a regression model is trained to predict the combined performance metric (referred to throughout this paper as the steering performance or steering score)

$$P(T) = S(T) \times C(T).$$

For further details, please refer to Appendix A. Evaluation is performed using three independent random seeds with a 70/30 train-test split to ensure robustness across steering configurations.

To evaluate the dependence on the choice of judge, the experiment was repeated using qualitative scores obtained independently from both ChatGPT-4o-mini and Gemini-Flash-2.5 as supervision signals. The predictive performance of the resulting regression models for each judge is reported in Table 1. The distribution of predicted values and ground truth annotations for seed number 10 is provided in Appendix D for further details.

6.4. Comparison with Stronger Baselines

Analysis of the internal dynamics associated with successful steering, as characterized by NBF preservation, meaningful KL divergence shifts, and stable attention patterns, reveals several limitations of naive residual addition.

First, additive steering implicitly assumes linearity in the residual space: it treats the effect of a steering vector as independent of the current residual direction. This assumption is poorly aligned with the highly nonlinear geometry induced by layer normalization and attention mechanisms. As a result, linear addition can induce disproportionate changes in token distributions, often leading to entropy collapse or attention instability.

Second, residual addition disregards the magnitude of the original residual representation. Because the intervention applies a fixed shift αs regardless of $\|h_t^{(\ell)}\|$, the relative influence of the steering vector varies unpredictably across layers, time steps, and tokens. This sensitivity can amplify small steering signals or overwhelm large residuals, degrading coherence.
To address these limitations, we propose rotation-based steering in Section 5.2.2, which modifies only the direction of the residual representation while explicitly preserving its norm. By operating on the unit hypersphere via a geodesic update, rotation steering respects the local geometry of the residual stream and produces controlled interventions.

Empirically, we find that rotation steering achieves stronger alignment with the desired control signal while better preserving entropy, KL dynamics, and attention stability (Table 2). Notably, rotation steering can reuse the same steering vectors as additive methods, but applies them in a more effective and robust manner, resulting in improved steering-coherence trade-offs.

7. Related Work

Activation-based steering. Activation-based steering alters LLM behavior at inference time by intervening in activation space rather than retraining. A widely used method is CAA, which derives steering directions from contrastive input pairs and has been shown to control behavior across many settings, including improved chess play (Karvonen, 2024), semantic concept steering (Soo et al., 2025; Turner et al., 2023; Wu et al., 2025; Sun et al., 2025), toxicity reduction (Nguyen et al.), mitigation of hallucination and psychopathy-related tendencies (Rimsky et al., 2024), and stylistic personalization (Zhang et al., 2025).

Interpretable steering with SAEs. A more interpretable line of work uses sparse autoencoders to decompose polysemantic residual activations into human-readable latent features, enabling steering along learned dimensions for tasks such as mathematical reasoning (Wang et al., 2025a), concept-level control (Soo et al., 2025; Arad et al., 2025; Cho et al., 2025; Chalnev et al., 2024), and multi-concept manipulation (Joshi et al., 2025). However, SAE-based approaches depend on a fixed learned feature dictionary, which may omit factors needed for reliable steering under diverse contexts.

Mechanistic understanding.
Recent studies analyze how contrastive steering vectors behave internally, characterizing their structure, generalization, and limitations (Hao et al., 2025; Tan et al., 2025; Chen et al., 2025), and highlighting failures caused by distribution shift and representational entanglement (Niranjan et al., 2025). These works provide valuable diagnostics but are mostly descriptive/post-hoc and do not connect steering success to principled mechanistic signals grounded in probabilistic generation or language-modeling theory.

Evaluation via LLM-as-judge. Since human evaluation is expensive, many benchmarks use LLMs as judges to approximate expert scoring (Wu et al., 2025; Sun et al., 2025; Soo et al., 2025; Chalnev et al., 2024); yet judge calibration and reliability are often assumed rather than tested, leaving open issues around judge dependence, score stability, and robustness of qualitative conclusions.

8. Discussion

8.1. Implications for Mechanistic Interpretability

Modeling the steering operation within the general framework of reading from and writing to the residual stream suggests that mechanistic signals—including the entropy at each layer as a proxy for generative capacity, and the KL divergence of the steered vocabulary as a proxy for distributional alignment—possess reasonable predictive power for assessing final steering quality without reliance on any external judge.

Based on the results in Table 1, the regression model exhibits consistent behavior across both label sets, indicating stable learning dynamics under identical experimental conditions. While performance differs in absolute error magnitude, the overall trends are coherent. Annotation produced by ChatGPT-4o-mini yields substantially lower MAE (0.0531 vs. 0.0871) and RMSE (0.0728 vs. 0.1160), suggesting tighter pointwise agreement and smaller residual dispersion.
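The MAE, RMSE, and R² values compared in this section follow their standard definitions; the toy numbers below (not the paper's data) illustrate why a small RMSE–MAE gap signals evenly distributed errors rather than a few large outliers:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - yhat)))

def rmse(y, yhat):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def r2(y, yhat):
    """Coefficient of determination: fraction of variance explained."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([0.2, 0.4, 0.6, 0.8])           # toy ground-truth scores

even  = y + 0.1                              # same-size error on every point
spiky = y + np.array([0.0, 0.0, 0.0, 0.4])   # one large outlier error

print(mae(y, even),  rmse(y, even))   # 0.1 0.1 -> equal: errors uniform
print(mae(y, spiky), rmse(y, spiky))  # 0.1 0.2 -> RMSE doubles: outlier penalized
```

Both prediction sets have the same MAE, but only the outlier-dominated one inflates RMSE, which is the diagnostic used above.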
In contrast, annotation produced by Gemini-Flash-2.5 achieves a higher coefficient of determination (R² = 0.5445 vs. 0.4698), indicating that although prediction errors are larger in scale, the model captures a greater proportion of variance in that annotation space. RMSE penalizes large errors more than MAE. The fact that RMSE is only moderately larger than MAE in both annotation sets suggests that there are no extreme outliers dominating the error, i.e., errors are relatively evenly distributed rather than concentrated in a few very large deviations. Overall, the model demonstrates robust generalization across annotators, with error profiles suggesting primarily scale-dependent discrepancies rather than structural prediction failures (similar to the discussion in Section 6.1).

8.2. Implications for Evaluation and Reliability

As LLMs are increasingly used for qualitative analysis and evaluation of generated text, there is a growing need for systematic methods to assess the reliability of their judgments. In Section 6.1, we analyze the agreement between two affordable and comparably capable models, ChatGPT-4o-mini and Gemini-Flash-2.5, when applied to the same qualitative evaluation tasks. We observe a substantial level of agreement between these models, suggesting that, under controlled conditions, such models can serve as reasonable proxies for human evaluators.

9. Limitations

While this study conducts a total of 2,304 experiments across diverse settings, it considers only nine target concepts. This limited conceptual coverage constrains the strength of the conclusions and may not fully reflect the variability of steering behavior across a broader semantic space. In addition, our analysis focuses exclusively on the GEMMA model family, which prioritizes interpretability access and tool availability over alignment with the most recent frontier models. As a result, the findings may not directly generalize to newer and more capable models.
Addressing this limitation is non-trivial, as the number of LLM families with SAEs trained during the pretraining phase remains limited, yet conducting SAE-based analyses necessitates the availability of such models.

From a theoretical perspective, the current regression formulation requires further refinement to improve its expressive capacity and better capture variance in the data. Moreover, label noise presents a significant challenge: it arises partly from reliance on LLM-based judges and partly from the inherently qualitative nature of the evaluated tasks. This noise complicates generalization and introduces additional uncertainty in the reported findings.

10. Conclusion

In this work, steering is framed as a familiar read–write operation on the residual stream of an LLM, enabling its analysis using established toolkits from the mechanistic interpretability literature. Steering reliability is shown to be systematically evaluable through internal model analysis. Beyond post-hoc inspection, the results demonstrate that internal mechanistic signals can be used to predict steering effectiveness prior to full text generation. Moreover, under controlled experimental conditions, LLMs are shown to serve as useful proxies for qualitative evaluators.

A mechanistic account of activation-based steering is provided through analysis of entropy and KL divergence dynamics across layers and decoding steps. Effective steering is characterized by structured entropy preservation together with controlled KL divergence behavior. These findings support an interpretation of steering as a controlled transformation of the model's internal distribution, rather than as an arbitrary perturbation.

Finally, a stronger steering-function baseline is introduced, enhancing two of the most widely used activation-based steering methods, CAA and SAE.
This baseline is grounded in a principled theoretical formulation and demonstrates empirical improvements over naïve additive steering by distinguishing meaningful steering directions from arbitrary perturbations.

Taken together, these findings position steering as a measurable, predictable, and testable process, and highlight the value of internal model dynamics for both evaluation and interpretability. This work aims to motivate future research on mechanistically grounded approaches to controlling LLM behavior.

Broader Impact and Ethical Considerations

This work investigates the mechanistic foundations of activation-based steering in large language models, with the aim of improving the predictability, reliability, and interpretability of inference-time control methods. By connecting behavioral steering outcomes to internal model dynamics, the proposed analysis framework may support safer deployment practices and reduce reliance on trial-and-error intervention strategies.

At the same time, steering techniques may be misused to amplify harmful, misleading, or manipulative behaviors if applied without appropriate safeguards. This work does not introduce new steering mechanisms or expand the expressive power of existing methods; rather, it provides diagnostic tools for analyzing and evaluating steering behaviors already present in current approaches. As such, the results should be interpreted as analytical insights rather than recommendations for deploying specific steering objectives.

All experiments are restricted to open-weight models and non-sensitive steering concepts. The proposed metrics characterize the stability and effectiveness of steering signals but do not assess the social desirability or ethical appropriateness of any particular steering direction. Determinations regarding acceptable model behavior remain application-dependent and outside the scope of this study.
The use of large language models as qualitative evaluators introduces potential concerns related to bias, calibration differences, and evaluator consistency. To address these issues, this work explicitly measures inter-judge reliability across architecturally distinct models and applies normalization procedures to account for systematic calibration offsets. The results indicate strong agreement in relative qualitative assessments, supporting the use of LLM-based judges as scalable—though imperfect—proxies for human evaluation.

Finally, this study does not involve any model training or fine-tuning. All analyses are conducted through inference-time interventions and forward-pass inspection, resulting in substantially lower computational and environmental costs compared to retraining-based alignment or adaptation methods.

11. Acknowledgement

We thank ResetData and National Computational Infrastructure for providing the compute infrastructure that supported our experiments.

References

Arad, D., Mueller, A., and Belinkov, Y. SAEs Are Good for Steering – If You Select the Right Features, May 2025.

Bereska, L. and Gavves, E. Mechanistic Interpretability for AI Safety – A Review, August 2024.

Chalnev, S., Siu, M., and Conmy, A. Improving Steering Vectors by Targeting Sparse Autoencoder Features, November 2024.

Chen, K., He, Z., Shi, T., and Lerman, K. STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models, June 2025.

Chen, Y., Wu, A., DePodesta, T., Yeh, C., Li, K., Marin, N. C., Patel, O., Riecke, J., Raval, S., Seow, O., Wattenberg, M., and Viégas, F. Designing a Dashboard for Transparency and Control of Conversational AI, October 2024.

Cho, S., Wu, Z., and Koshiyama, A. CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection, August 2025.

Gurnee, W. and Tegmark, M. Language Models Represent Space and Time, March 2024.
Hao, Y., Panda, A., Shabalin, S., and Ali, S. A. R. Patterns and Mechanisms of Contrastive Activation Engineering, May 2025.

Jafari, M., Hua, D. Y., Xue, H., and Salim, F. Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction, May 2025.

Joshi, S., Dittadi, A., Lachapelle, S., and Sridhar, D. Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts, February 2025.

Karvonen, A. Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models, July 2024.

Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., Shu, K., Cheng, L., and Liu, H. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge, September 2025.

Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R., and Nanda, N. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, August 2024.

Nguyen, D., Prasad, A., Stengel-Eskin, E., and Bansal, M. Multi-Attribute Steering of Language Models via Targeted Intervention.

Niranjan, C., Jaidka, K., and Yeo, G. C. On the Limitations of Steering in Language Model Alignment, May 2025.

Rai, D., Zhou, Y., Feng, S., Saparov, A., and Yao, Z. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models, October 2025.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering llama 2 via contrastive activation addition. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.828. URL https://aclanthology.org/2024.acl-long.828/.
Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J., Biderman, S., Garriga-Alonso, A., Conmy, A., Nanda, N., Rumbelow, J., Wattenberg, M., Schoots, N., Miller, J., Michaud, E. J., Casper, S., Tegmark, M., Saunders, W., Bau, D., Todd, E., Geiger, A., Geva, M., Hoogland, J., Murfet, D., and McGrath, T. Open Problems in Mechanistic Interpretability, January 2025.

Soo, S., Guang, C., Teng, W., Balaganesh, C., Guoxian, T., and Ming, Y. Interpretable Steering of Large Language Models with Feature Guided Activation Additions, April 2025.

Sun, J., Huang, J., Baskaran, S., D'Oosterlinck, K., Potts, C., Sklar, M., and Geiger, A. HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks, April 2025.

Tan, D., Chanin, D., Lynch, A., Kanoulas, D., Paige, B., Garriga-Alonso, A., and Kirk, R. Analyzing the Generalization and Reliability of Steering Vectors, May 2025.

Thakur, A. S., Choudhary, K., Ramayapally, V. S., Vaidyanathan, S., and Hupkes, D. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, August 2025.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering Language Models With Activation Engineering, October 2024.

Wang, A., Shu, D., Wang, Y., Ma, Y., and Du, M. Improving LLM Reasoning through Interpretable Role-Playing Steering, June 2025a.

Wang, A., Wu, X., Shu, D., Ma, Y., and Liu, N. Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement, October 2025b.

Wang, Z. Logitlens4llms: Extending logit lens analysis to modern large language models. arXiv preprint arXiv:2503.11667, 2025.
Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders, March 2025.

Zhang, J., Liu, Y., Wang, W., Liu, Q., Wu, S., Wang, L., and Chua, T.-S. Personalized text generation with contrastive activation steering. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7128–7141, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.353. URL https://aclanthology.org/2025.acl-long.353/.

Zhao, Y., Du, X., Hong, G., Gema, A. P., Devoto, A., Wang, H., He, X., Wong, K.-F., and Minervini, P. Analysing the Residual Stream of Language Models Under Knowledge Conflicts, October 2024.

A. Regression Setup and Feature Construction

For the regression task, we used summary statistics computed from the mechanistic signals extracted at generation step t = 30, matching the sample length of 30 tokens. Concretely, for each signal defined in Section 3, we computed a fixed set of descriptive statistics—mean, median, range, skewness, kurtosis, variance, standard deviation, minimum, and maximum—and concatenated them to form the regression feature vector after applying a Standard Scaler. To ensure robustness to stochasticity, we evaluated the pipeline using five random seeds (22, 42, 31, 61, 10). Each seed was used consistently for both the data splitting procedure (via sklearn's GroupShuffleSplit) and the regression model initialization. The predictive model was a Random Forest regressor with 200 trees, bootstrap sampling enabled, and max_features = "sqrt"; all other settings followed standard defaults, including unconstrained tree depth, min_samples_split = 2, and min_samples_leaf = 1.
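Under the configuration just described, the pipeline can be sketched as follows. The signal traces, grouping scheme, and label values below are synthetic stand-ins, not the paper's data:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

def summarize(signal):
    """Descriptive statistics over one mechanistic signal trace
    (e.g., per-step NBF or KL values), as listed above."""
    s = np.asarray(signal)
    return np.array([s.mean(), np.median(s), s.max() - s.min(),
                     skew(s), kurtosis(s), s.var(), s.std(),
                     s.min(), s.max()])

# Toy stand-ins: 120 steering runs, each a 30-step signal trace,
# grouped by a hypothetical steering-configuration id.
rng = np.random.default_rng(42)
X = np.stack([summarize(rng.normal(size=30)) for _ in range(120)])
y = rng.uniform(0, 0.6, size=120)        # stand-in for P(T) = S(T) * C(T)
groups = np.repeat(np.arange(40), 3)     # 3 runs per configuration

# Group-aware 70/30 split so runs from one configuration never straddle
# the train/test boundary.
split = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
train_idx, test_idx = next(split.split(X, y, groups=groups))

scaler = StandardScaler().fit(X[train_idx])
model = RandomForestRegressor(n_estimators=200, bootstrap=True,
                              max_features="sqrt", random_state=42)
model.fit(scaler.transform(X[train_idx]), y[train_idx])
preds = model.predict(scaler.transform(X[test_idx]))
```

All hyperparameters match the stated fixed configuration; only the data are placeholders.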
Importantly, we performed no hyperparameter tuning; all reported results are based on this fixed configuration across seeds.

B. Similar NBF and Different KL

In Section 6.2, two nearly identical experimental configurations are compared with respect to their NBF behavior. Despite their similarity in NBF, the configurations exhibit distinct KL divergence dynamics and different final steering scores. In both configurations, the model is Gemma-2-9b, the target concept is Christianity, and the steering vector is extracted using the CAA method.

Figure 8. NBF over generation steps for (a) steering function = Rotation (α = 0 vs. α = 100) and (b) steering function = Addition (α = 0 vs. α = 260). The two configurations behave similarly in terms of NBF, while the large score gap ∆P(T) > 0.1 probably results from their different KL behavior.

C. LLM-Judge Prompt

To ensure consistent and reproducible evaluation across all experiments, LLM-based judgments were collected using a standardized zero-shot prompt, following the methodology of Soo et al. (2025). An example prompt for the target concept angry is provided below:

System role: You score texts generated by a language model based on the following criteria:
score: "The text contains angry words or angry sentiment."
coherence: "The text is coherent and grammatically correct."
Provide integer scores from 1 to 10. Output JSON only: {"score": int, "coherence": int}

The input to the prompt consists of the generated text to be evaluated. This prompt was applied uniformly across all concepts, steering methods, and base models to ensure comparability of LLM-judge annotations.

D. Prediction and Annotation Distributions

Figure 9 presents the distributions of the ground-truth and predicted values for both LLM judges.
As shown, annotations provided by Gemini-Flash-2.5 exhibit higher variance, which results in greater expressive capacity for the regression model. In contrast, ChatGPT-4o-mini annotations display lower variance, leading to reduced sensitivity in the learned regression mapping and consequently lower predictive reliability under ChatGPT-4o-mini supervision.

Figure 9. The distribution of test-set predicted values and corresponding ground-truth annotations (predicted value vs. true value, with the y = x reference line) for seed = 10, evaluated across different LLM judges: (a) Judge LLM = ChatGPT-4o-mini; (b) Judge LLM = Gemini-Flash-2.5.