Paper deep dive
Towards Understanding Steering Strength
Magamed Taimeskhanov, Samuel Vaiter, Damien Garreau
Models: GPT-2, Gemma
Abstract
A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations: identify a well-chosen direction depending on the task at hand and perturb representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.
Tags
Links
- Source: https://arxiv.org/abs/2602.02712
- Canonical: https://arxiv.org/abs/2602.02712
- Code: https://github.com/MagamedT/steering
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/11/2026, 1:06:55 AM
Summary
This paper provides the first theoretical analysis of 'steering strength' (α) in large language models (LLMs) using an Unconstrained Features Model (UFM). The authors characterize how steering strength affects next-token probabilities, concept presence, and cross-entropy, revealing non-monotonic 'bump' behaviors and sigmoidal trends in concept probability. The theoretical findings are validated empirically across eleven language models.
Entities (5)
Relation Signals (3)
Magamed Taimeskhanov → authored → Towards Understanding Steering Strength
confidence 100% · Towards Understanding Steering Strength Magamed Taimeskhanov
Steering Strength → affects → Next Token Probability
confidence 95% · We characterize its effect on next token probability, presence of a concept, and cross-entropy
Unconstrained Features Model → used to analyze → Steering Strength
confidence 90% · We study activation steering on a model widely used in the neural collapse literature, the Unconstrained Features Model
Cypher Suggestions (2)
Identify researchers and their papers · confidence 100% · unvalidated
MATCH (r:Researcher)-[:AUTHORED]->(p:Paper) RETURN r.name, p.title
Find all metrics affected by steering strength · confidence 90% · unvalidated
MATCH (s:Entity {name: 'Steering Strength'})-[r:AFFECTS]->(m:Metric) RETURN m.name
Full Text
91,993 characters extracted from source content.
Towards Understanding Steering Strength
Magamed Taimeskhanov¹, Samuel Vaiter², Damien Garreau¹

Abstract

A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations: identify a well-chosen direction depending on the task at hand and perturb representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, even though its importance is clear: too little and the intended behavior does not emerge, too much and the model's performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next-token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.

1. Introduction

Deploying LLMs in the wild raises challenges, chief among them ensuring they are both useful and harmless (Bai et al., 2022). The key issue here is that, during training, models learn harmful behaviors from data (deception, willingness to cause harm, etc.) which we have no trivial way of identifying and controlling. As illustrated in Fig. 1, a user may query an LLM about executing a harmful command. Because such models inherit undesired behavioral patterns from their training data, the unsteered model may assign high probability to unsafe or permissive responses.
It is widely hypothesized (Mikolov et al., 2013; Bolukbasi et al., 2016; Elhage et al., 2022; Nanda et al., 2023; Park et al., 2024) that LLMs encode high-level concepts as linear directions in the activation space, that is, the vector space spanned by the model's internal representations at a given layer. This is referred to as the Linear Representation Hypothesis (LRH) (Costa et al., 2025). Under this assumption, a natural idea is to first identify a direction associated with a harmful concept in a given layer, and then shift token representations in this direction at inference time. Formally, let us call v ∈ R^d the steering vector. The token representations (residual stream) h are shifted according to

h ← h + αv,   (1)

where α ∈ R is the steering strength (see Fig. 1).

¹University of Würzburg, Center for Artificial Intelligence and Data Science. ²Université Côte d'Azur, CNRS. Correspondence to: Magamed Taimeskhanov <magamed.taimeskhanov@uni-wuerzburg.de>. Preprint. February 4, 2026.

Figure 1. Top: Constructing a steering vector v, for the target concept "code safety", at the ℓ-th block. We run two contrastive prompt sets (n₊ safe and n₋ malicious) through an L-block LLM and collect the representations h₊, h₋ at layer ℓ for each prompt. Averaging these representations over all safe prompts gives μ₊, and over the malicious prompts μ₋ (both marked by a cross). We then define v := μ₊ − μ₋. Bottom: Steering the model's response toward safe behavior on a new prompt is done by adding αv to the residual stream h at the ℓ-th block. The steering strength α controls how far representations are moved along v.
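The difference-of-means construction of Fig. 1 and the update of Eq. (1) can be sketched in a few lines. This is a minimal toy illustration, not the authors' code: shapes, seeds, and names are placeholders.

```python
import numpy as np

def steering_vector(h_pos, h_neg):
    # v := mu_plus - mu_minus, the difference of the mean representations
    # of the positive and negative contrastive prompts (each of shape (n, d)).
    return h_pos.mean(axis=0) - h_neg.mean(axis=0)

def steer(h, v, alpha):
    # Eq. (1): shift the residual-stream representation along v.
    return h + alpha * v

rng = np.random.default_rng(0)
d = 8
h_pos = rng.normal(loc=+1.0, size=(4, d))  # toy "safe" prompt representations
h_neg = rng.normal(loc=-1.0, size=(4, d))  # toy "malicious" prompt representations
v = steering_vector(h_pos, h_neg)
h = rng.normal(size=d)                     # representation of a new prompt
h_steered = steer(h, v, alpha=2.0)
```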
This methodology has been successfully applied to a range of settings, including refusal (Arditi et al., 2024), hallucination reduction (Su et al., 2025), and sycophancy (Min et al., 2025). It also compares favorably to competing approaches (Wu et al., 2025). Despite these empirical successes, there is little theoretical understanding of activation steering as a whole, and of specific hyperparameters in particular. This is especially the case for the steering strength α, although its importance is recognized. As a starting point, we ask the following question: How does the steering strength α control the trade-off between steering efficacy and distortion in next-token prediction?

arXiv:2602.02712v1 [cs.LG] 2 Feb 2026

In this paper, we address this question from a theoretical perspective by analyzing steering with a difference-of-means steering vector v (see Fig. 1). The main tool is a simplified transformer model studied in Zhao et al. (2024). Our contributions are: (1) the characterization of how the steering strength α affects next-token probabilities, concept probability, and cross-entropy (Thm. 3.3, 3.6, 3.8); (2) formalizing the steering setup used in our experiments and deriving the large-α limit of next-token probabilities for a transformer (Prop. 4.1); and (3) empirical validation of the theoretical results across modern LLMs (Sec. 5). The code for all experiments is available online.¹

Related work. Turner et al. (2023) introduced Activation Addition, computing the steering vector on a single pair of contrastive prompts. Rimsky et al. (2024) extended this methodology to difference of means, i.e., manually crafting several prompts instead of a single pair and computing the average difference vector. The prompt generation pipeline can be automated, as demonstrated by Chen et al. (2025) with persona vectors. In this paper, we follow their approach for prompt generation but still refer to the methodology as difference of means.
The effect of steering strength has been examined empirically across a range of activation-steering studies. Turner et al. (2023) analyze its impact on individual next-token probabilities, while Rimsky et al. (2024) study the probability of eliciting target behaviors. Similarly, Von Rütte et al. (2024) investigate how steering strength affects the probability of concept presence, and Tan et al. (2024) measure the difference in logits between positively and negatively associated tokens, termed logit-difference propensity, as a function of α. Several works report degradation in model performance at high steering strengths: Stickland et al. (2024), for instance, observe that large values of α can harm performance, in some cases roughly equivalent to halving pre-training compute. More recently, Wu et al. (2025) examine the dependence of a steering score on α, and Chen et al. (2025) analyze its effect on trait expression. Apart from these empirical observations, there are few theoretical characterizations of how these quantities evolve as functions of the steering strength. A notable exception is Park et al. (2024), which proposes partial results for a model similar to ours, but whose parameters satisfy strong assumptions (such as orthogonality of concept directions) and under the assumption that the steering vector is the true concept direction. Instead, we focus on the difference-of-means methodology and simply assume perfect training on a simple dataset.

¹ https://github.com/MagamedT/steering

Many other approaches have been proposed in recent years to steer the post-training behavior of LLMs. Notably, Bricken et al. (2023) showed that it is possible to leverage a (wide) sparse autoencoder (SAE) trained to reconstruct intermediate activations and then to act on the directions identified. Specifically, forcing ("clamping") the coefficient associated with a specific concept can steer the model's behavior in that direction.
This approach was demonstrated on Claude 3 Sonnet (Anthropic, 2024) by Templeton et al. (2024). We note that training SAEs in this context is challenging (Gao et al., 2024). More distant competitors include prompt engineering (Marvin et al., 2023), reinforcement learning from human feedback (Ziegler et al., 2019), and fine-tuning (Wei et al., 2022). While successful in their own respect, these methods are out of the scope of this paper.

2. Theoretical framework

We start by describing the theoretical framework in which we prove our main results: a dataset where high-level concepts are subsets of the vocabulary, and a theoretically tractable transformer model from Zhao et al. (2024).

2.1. Data and concepts

Setting. We consider a vocabulary with V tokens, which we identify with token indices [V] := {1, ..., V}. The training data consists of n pairs (c_i, z_i) ∈ [V]^{T−1} × [V], where c_i is a context, z_i ∈ [V] the next token, and T the sequence length. For any set A, we let |A| be its cardinality and A^∁ its complement.

Concepts. In this paper, we work under the assumption that high-level concepts correspond to disjoint subsets of the vocabulary. Formally, we partition [V] into G ∈ N* disjoint sets C_k ⊂ [V], where each C_k regroups the s := V/G tokens associated with the same concept (assuming G divides V). As an example, we consider the following vocabulary of size V = 9, partitioned into three concepts:

{a, b, c, A, B, C, α, β, γ} = C_1 ∪ C_2 ∪ C_3.   (2)

To simplify the derivations, we assume that a context can only contain tokens from a single concept, while allowing the next token z to belong to a different concept. Thus, in our example, contexts may take the form c_1 = ABB ∈ C_2 or c_2 = aab ∈ C_1.

Figure 2.
Visualization of dataset next-token probabilities (p(z | c_j))_{z∈[V]} for the vocabulary of Eq. (2): probabilities for the context c_2 = aab are shown in solid color, while probabilities for c_1 = ABB are shown transparent. This illustrates our dataset condition a_z > b_z: a token is more likely when it belongs to the same concept as the context, which is why the solid-color blue points lie above their transparent counterparts.

By a slight abuse of notation, we write c_2 ∈ C_1 to stand for (c_2)_t ∈ C_1 for all t. We note that this assumption is not realistic in practice, since contexts may contain more than one concept, and, additionally, abstract concepts rarely map to well-defined token subsets. Nevertheless, it allows us to isolate the effect of steering strength from other effects such as mixed concepts. With this in mind, we define the dataset next-token probabilities as follows:

Definition 2.1 (Dataset next-token probabilities). Given a context c_j and a token z ∈ [V], we define the probability p(z | c_j) of z given the context c_j as

p(z | c_j) := (1 / |{i ∈ [n] : c_i = c_j}|) Σ_{i∈[n]: c_i = c_j} 1_{z = z_i}.

We impose the following restriction on p(z | c_j):

Assumption 1 (Dependence and concept association). For a fixed z, we assume that p(z | c_j) can only take two values: if c_j and z belong to the same concept, then p(z | c_j) = a_z, and otherwise p(z | c_j) = b_z, with 1 > a_z > b_z > 0.

Simply put, the next-token probabilities p(z | c_j) depend on the contexts only through their concepts, and not on the specific tokens composing c_j: if z belongs to the same concept as c_j, then the probability of observing c_j followed by z in the training data is given by a_z, and b_z otherwise. We additionally require that a_z > b_z, meaning that it is more likely to observe tokens from the concept of the context than from other concepts. For instance, the token e is more likely after the lowercase context languag than after the uppercase one LANGUAG. We refer to Fig.
2 for an illustration. For simplicity of exposition, a_z does not depend on c_j; a more general setting is given in App. B.1.

2.2. Model and activation steering

We study activation steering on a model widely used in the neural collapse literature, the Unconstrained Features Model (UFM, Def. 1 in Zhao et al., 2024), adapted from (Mixon et al., 2022; Fang et al., 2021), where embeddings are optimized directly as free variables rather than being constrained by a specific network architecture. Recall that (c_i, z_i)_{i∈[n]} is the dataset of Sec. 2.1. We let {c_j}_{j∈[m]} ⊂ {c_i}_{i∈[n]} denote the m distinct contexts (i.e., we keep one copy of each unique context and index them by j ∈ [m]). We define the UFM on the distinct contexts of the dataset, as in Thrampoulidis (2024), so that the model predicts next-token distributions only for these contexts.

Definition 2.2 (Unconstrained Features Model). The UFM f_θ : {c_j}_{j∈[m]} → R^V with parameters θ = (W, H) is defined as

f_θ(c_j) := W h_j,

where W ∈ R^{V×d} is the decoder matrix and H := (h_1, ..., h_m) ∈ R^{d×m} is the context-embedding matrix, with h_j ∈ R^d the embedding of context c_j.

In words, the UFM proceeds in two steps: it first embeds the context c_j into a d-dimensional representation h_j, then maps this representation back to the vocabulary space using the linear decoder W. Applying a softmax on f(c_j) yields the next-token distribution for c_j. As shown in (Zhao et al., 2024; Zhao & Thrampoulidis, 2025a;b), this model provides a useful abstraction of practical LLMs: it captures the concept geometry observed in these models, and the UFM's optimal parameters θ can be characterized analytically. The idea behind this abstraction is that LLMs are sufficiently expressive to fit any training distribution; accordingly, we treat the embeddings as free parameters.

Training. For any a ∈ R^V and z ∈ [V], σ_z(a) denotes the z-th entry of the softmax of a, that is, σ_z(a) := e^{a_z} / Σ_{z'∈[V]} e^{a_{z'}}.
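Definition 2.2 and the softmax above are easy to sketch numerically. The following is a minimal illustration with random (untrained) parameters, not the optimal θ characterized in the paper; all sizes are placeholders.

```python
import numpy as np

def softmax(a):
    # sigma(a)_z = e^{a_z} / sum_{z'} e^{a_{z'}}, computed stably
    e = np.exp(a - a.max())
    return e / e.sum()

def ufm_logits(W, H, j):
    # Def. 2.2: f_theta(c_j) = W h_j, with W in R^{V x d}, H in R^{d x m}
    return W @ H[:, j]

rng = np.random.default_rng(1)
V, d, m = 9, 4, 5
W = rng.normal(size=(V, d))           # decoder matrix
H = rng.normal(size=(d, m))           # context-embedding matrix (free variables)
probs = softmax(ufm_logits(W, H, 0))  # next-token distribution for context c_0
```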
We train f_θ to predict the next token z in our data (c_i, z_i)_{i∈[n]} by minimizing over θ the (unregularized) empirical cross-entropy loss

CE(f_θ) := −(1/n) Σ_{i∈[n]} log(σ_{z_i}(f_θ(c_i))).

From now on, we assume that the model is trained and write f instead of f_θ.

Difference of means. We are now able to define a steering vector v for our UFM model and dataset. Let T = C_k denote the target concept we aim to steer. Given the m distinct contexts {c_j}_{j∈[m]}, we define two index sets: P ⊂ [m] indexes "positive" contexts that belong to the concept T we want to steer toward, while N ⊂ [m] indexes "negative" contexts that do not belong. We assume P ∩ N = ∅ and |P| = |N| = q, and we do not require P ∪ N = [m]. In our notation, difference of means yields the steering vector

v := (1/|P|) Σ_{j∈P} h_j − (1/|N|) Σ_{j∈N} h_j ∈ R^d.   (3)

Two common choices for what should be defined as non-concept contexts lead to two corresponding constructions of N. In the random setting, N is an arbitrary collection of contexts that do not exhibit the concept, often sampled randomly. In the contrastive setting, N collects contexts expressing the opposite (or negated) concept of C_k. As an example, using the vocabulary from Eq. (2), where uppercase letters represent the opposite concept of lowercase letters, we take the following sets to build the steering vector v:

P := {aab, bba, acc, cca},
N_contrastive := {ABB, AAB, CAC, CBA},
N_random := {ABB, αβγ, γβγ, BAB}.

Using (P, N_contrastive) (resp. (P, N_random)) corresponds to the contrastive setting (resp. random setting).

3. Main results

We now present our main theoretical results, which characterize how next-token probabilities, concept probability, and cross-entropy evolve as a function of the steering strength α. All proofs are deferred to App. B.

3.1. Influence of α on next-token probabilities

In this subsection, we address the following question: how do the model's next-token probabilities evolve as the steering strength α varies?
To keep the analysis focused on the effect of steering, we ignore residual errors due to finite-time training:

Assumption 2 (Perfectly trained UFM). We assume that the model has perfectly learned the training data probabilities p(z | c_j) from Def. 2.1, meaning that f satisfies

∀ j ∈ [m], z ∈ [V], σ_z(f(c_j)) = p(z | c_j).

We argue that this assumption is reasonable: in practice, LLMs often exhibit strong memorization of their training data, making this approximation natural. Moreover, since our theoretical dataset is simple, the UFM trained with gradient descent rapidly learns the dataset probabilities p(z | c_j) with negligible error (App. B.1). See Thrampoulidis (2024) for proof and discussion on the attainability of this hypothesis.

In our setting, we steer the context embeddings. Thus steering f by αv, where v is defined in Eq. (3), gives rise to the steered model f_α with steered logits given by

f_α(c_j) := W(h_j + αv).

As announced, we now turn to the study of the effect of α on next-token probabilities. We start with a definition:

Definition 3.1 (Probability increase). For a context c_j and a token z ∈ [V], we define the probability increase α ↦ ∆p(z | c_j, α) as

∆p(z | c_j, α) := σ_z(f_α(c_j)) − σ_z(f(c_j)).

Figure 3. Next-token probability increases ∆p(α) for a fixed context. Each curve corresponds to a token z: target tokens T are in blue and off-target tokens in orange. Most target tokens exhibit a "bump" (peaking at α^(1,1)), while one target token increases and off-target tokens decrease.

Intuitively, ∆p(z | c_j, α) is the algebraic next-token probability increase for a fixed z ∈ [V] when steering with strength α. When there is no ambiguity, we omit the explicit dependence on c_j and z, and write ∆p(α). Recall that P, N ⊂ [m] are the context indices used to construct the steering vector v (Eq. (3)), and that T := C_k is the target concept we aim to steer, which is used to build P.
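Definition 3.1 can be evaluated directly for a toy UFM. The sketch below uses random parameters (not a trained model, so it does not reproduce the theorem's exact curves); it only illustrates how ∆p is computed and that steering redistributes, rather than creates, probability mass.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def delta_p(W, h, v, alpha):
    # Def. 3.1: Delta p(z | c, alpha) = sigma_z(W(h + alpha*v)) - sigma_z(W h)
    return softmax(W @ (h + alpha * v)) - softmax(W @ h)

rng = np.random.default_rng(2)
V, d = 9, 4
W = rng.normal(size=(V, d))   # decoder matrix
h = rng.normal(size=d)        # context embedding
v = rng.normal(size=d)        # steering vector
dp = delta_p(W, h, v, alpha=3.0)
# the per-token increases sum to zero: both softmaxes are distributions
```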
Tokens in T are called target, otherwise off-target. The following quantity, derived from the dataset next-token probabilities (Def. 2.1), plays an important role in our analysis, as it appears throughout the proofs:

Definition 3.2 (Log-odds). For any z ∈ [V], we define the log-odds M(z) as

M(z) := (1/q) log( Π_{i∈P} p(z | c_i) / Π_{i∈N} p(z | c_i) ).

Additionally, we denote by M̄ := {z ∈ [V] : M(z) = max_{z'∈[V]} M(z')} the set of tokens attaining the maximum margin, and by M̲ the set of tokens attaining the minimum.

In the following, we characterize the variations of ∆p:

Theorem 3.3 (Behavior of ∆p). Let T be the target concept. Assume that Assumptions 1 and 2 hold. Given a context c_j, the probability increase satisfies:

- (bump behavior) for any z ∈ [V] \ (M̄ ∪ M̲), there exists a unique α^(j,z) ∈ R such that ∆p(z | c_j, α) is strictly increasing on (−∞, α^(j,z)] and decreasing on [α^(j,z), +∞);
- (peak position) for any z ∈ T and z' ∉ T, it holds that α^(j,z') < α^(j,z);
- (monotonous behavior) for any z ∈ T ∩ M̄ (resp. z ∈ T^∁ ∩ M̲), ∆p(z | c_j, α) is strictly increasing (resp. decreasing) on R.

One might expect ∆p(α) to have a simple behavior (e.g., increasing for target tokens z ∈ T as in Turner et al. (2023)), or to display erratic dynamics as α varies. Surprisingly, neither is true, as our theorem reveals a simple pattern: when we steer in the concept direction, most tokens exhibit a "bump" behavior, i.e., their probability increases, reaches a peak at some α, then decreases. Fig. 3 illustrates this pattern (for α < 0, see Fig. B.1), and Sec. 5 validates it empirically on practical LLMs. Importantly, off-target tokens z ∉ T reach their peak earlier than target tokens z ∈ T. This means that as α increases, off-target token probabilities start to fade while target token probabilities are still rising, which helps steering to remain focused on the target concept.
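The log-odds of Definition 3.2 can be computed on a tiny instance of Assumption 1. Everything below is illustrative: a two-token, two-concept toy with p(z | c) = a_z when z and c share a concept and b_z otherwise. Under this construction, a target token has log-odds log(a_z / b_z) > 0 and an off-target token the opposite sign.

```python
import math

def log_odds(z, P, N, p):
    # Def. 3.2: M(z) = (1/q) * log( prod_{i in P} p(z|c_i) / prod_{i in N} p(z|c_i) )
    q = len(P)
    return sum(math.log(p[c][z]) for c in P) / q - sum(math.log(p[c][z]) for c in N) / q

# Toy instance of Assumption 1 (values are placeholders)
a_z, b_z = 0.2, 0.05
concept = {"aab": 0, "bba": 0, "ABB": 1, "AAB": 1}   # context -> concept index

def p_of(c):
    # token "a" belongs to concept 0, token "A" to concept 1
    return {z: (a_z if zc == concept[c] else b_z) for z, zc in [("a", 0), ("A", 1)]}

p = {c: p_of(c) for c in concept}
P = ["aab", "bba"]   # positive (target-concept) contexts
N = ["ABB", "AAB"]   # negative (contrastive) contexts
M_target = log_odds("a", P, N, p)   # log(a_z / b_z) > 0
M_off = log_odds("A", P, N, p)      # log(b_z / a_z) < 0
```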
This "bump" pattern also suggests the existence of a steering "sweet spot": a range of α where target tokens are favored by the model while the next-token distribution has not yet collapsed onto a few tokens, helping preserve output quality. Additionally, the bump location α^(j,z) varies across contexts c_j, suggesting that α should be chosen adaptively w.r.t. the input prompt, as proposed in (Hedström et al., 2025; Ferrando et al., 2025). This discussion illustrates how Thm. 3.3 can inform choices of the steering strength α. Finally, a few tokens are exceptions to this behavior: tokens attaining the maximal log-odds keep increasing with α, while those attaining the minimal log-odds keep decreasing.

Remark 3.4 (Sign of α^(j,z)). With the dataset defined in App. B.3, the "bump" pattern for tokens z ∈ T occurs only for positive steering strength (α^(j,z) > 0), matching the intuition that positive steering increases their probabilities.

We defer the limits of ∆p(α) as α → ±∞ to Prop. B.1. In short, ∆p(α) concentrates on tokens in M̄ (resp. M̲) as α → +∞ (resp. −∞). Instead, the limits of ∆p(α) for modern LLMs are characterized in Prop. 4.1.

3.2. Influence of α on concept probability in the output

In the previous subsection, we focused on the atomic (token-level) quantity ∆p(α). Our next step is to "zoom out" and study aggregated versions of ∆p over multiple tokens. These aggregates help to answer the following question: does increasing the steering strength make the target concept more likely, while other concepts become less likely? As we will show in Thm. 3.6, the answer is yes. To answer it, we define the probability of a concept in the model output for a given context as follows:

Definition 3.5 (Increase/decrease of a concept). Let C be any concept. Given a context index j ∈ [m], we define the concept increase as

∆p(C | c_j, α) := (1/|C|) Σ_{z∈C} ∆p(z | c_j, α).

When there is no ambiguity, we simply write ∆p(C | α).

Figure 4.
Concept probability increases ∆p(C | α) predicted by Thm. 3.6: the target concept ∆p(T | α) increases with a sigmoidal shape, an off-target ∆p(C | α) decreases sigmoidally, and another ∆p(C' | α) converges to the same limit as |α| → ∞.

Intuitively, ∆p(C | α) is the mean of the probability increase ∆p over tokens belonging to the same concept C. This quantity serves as a natural proxy, in our setting, for the concept-presence metric studied empirically in Von Rütte et al. (2024); Chen et al. (2025); Rimsky et al. (2024); Park et al. (2024). We postpone the discussion of how ∆p(C | α) relates to practical metrics until after the main result below, which characterizes the shape of ∆p(C | α):

Theorem 3.6 (Behavior of ∆p(C | α)). Let T denote the target concept being steered, and let C denote an arbitrary concept. Assume that Assumptions 1 and 2 hold. Given a context c_j, the concept probability increase satisfies

∆p(C | α) = (1/(2|C|)) ( tanh( (ν_j(α) + r_j) / 2 ) − r'_j ),

with r_j, r'_j ∈ R and ν_j : R → R both depending on C (see App. B.4 for exact expressions). As a consequence, ∆p(T | α) is increasing in α. Moreover, for any C' ≠ T such that C' ∩ (M̄ ∪ M̲) = ∅, we have the limits

lim_{α→±∞} ∆p(C' | α) = −(1/|C'|) Σ_{z∈C'} p(z | c_j).

Finally, for any C ≠ T satisfying max_{z∈C} M(z) ≤ min_{z∉C} M(z), ∆p(C | α) is decreasing in α.

In other words, the steered probability of a concept ∆p(C | α) exhibits three distinct behaviors, all following a tanh-shaped curve up to a reparametrization of α. For the target concept T, steering behaves as intended: increasing the steering strength α increases the presence of T in the model's output, with ∆p(T | α) following a sigmoidal shape. For any other concept C' that contains neither maximal nor minimal log-odds tokens, ∆p(C' | α) converges back to its unsteered value. Finally, for concepts C' whose tokens all have log-odds below those of the remaining tokens, ∆p(C' | α) decreases as α increases. See Fig. 4 for an illustration.

Figure 5. Local quadratic behavior of ∆CE(α), as predicted by Thm. 3.8. The blue curve shows ∆CE(α) and the black curve the quadratic fit using the coefficient from the theorem.

This is consistent with the empirical finding of Von Rütte et al. (2024), who observed a tanh(α) trend for the concept probability in the output of a steered LLM. Our result slightly disagrees with Park et al. (Thm. 2.5, 2024), who predict that target-concept probability increases while off-target concept probability remains constant. We suspect this difference comes from their model assumptions and from our definition of ∆p(C | α). In practice, concept probability is estimated by how often the concept appears across sampled generations of a steered LLM. Our ∆p(C | α) is more fine-grained, since it tracks changes in the underlying token probabilities. These variations can be masked by sampling: ∆p(C | α) may vary while the corresponding concept tokens C remain too low-probability to be sampled with noticeable frequency, making the sampling-based concept metric appear nearly constant, as in Park et al. (2024). Once concept tokens C become sufficiently likely, the sampling-based concept probability becomes more aligned with our ∆p(C | α). We confirm our findings by an extensive experimental validation (Sec. 5).

3.3. Influence of α on cross-entropy

In this subsection, we zoom out once more, and address the following question: how does the steering strength α affect the model performance as a whole? This question is directly motivated by practice, as a precise answer can avoid costly searches over α to balance effective steering with maintaining high-quality model output. In practice, output quality is often assessed with benchmarks such as MMLU (Hendrycks et al., 2021). In our theoretical setting, cross-entropy is the most natural performance measure, and we therefore study how steering affects the cross-entropy computed on the training set.
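The quadratic law formalized in Thm. 3.8 can be checked numerically in a stripped-down setting: take a model that perfectly fits its training distribution (Assumption 2) by setting its logits to log p, steer in an arbitrary logit-space direction u (standing in for Wv), and compare ∆CE(α) against the predicted curvature (1/2) Σ_j π_j Var_j(u_Z). The construction and all names are illustrative, not the paper's exact setup.

```python
import numpy as np

def log_softmax(a):
    a = a - a.max()
    return a - np.log(np.exp(a).sum())

rng = np.random.default_rng(3)
V, m = 6, 4
p = rng.dirichlet(np.ones(V), size=m)   # p(z | c_j), one row per distinct context
pi = np.full(m, 1.0 / m)                # context probabilities pi_j
logits = np.log(p)                      # perfectly trained: softmax(logits_j) = p(.|c_j)
u = rng.normal(size=V)                  # logit-space steering direction (plays Wv)

def delta_ce(alpha):
    # Delta CE(alpha) = CE(f_alpha) - CE(f), population cross-entropy against p
    return sum(
        pi[j] * (p[j] @ (log_softmax(logits[j]) - log_softmax(logits[j] + alpha * u)))
        for j in range(m)
    )

# predicted local curvature: (1/2) * sum_j pi_j * Var_{p(.|c_j)}(u_Z), nonnegative
coef = 0.5 * sum(pi[j] * ((p[j] @ u**2) - (p[j] @ u) ** 2) for j in range(m))
```

At perfect fit the linear term vanishes (the gradient of cross-entropy in α is the mean of u under the model minus its mean under the data, which are equal), leaving ∆CE(α) ≈ coef · α² locally.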
This quantity provides a proxy for test-time performance, as the model is assumed to be well trained and the training set is large and drawn from the same distribution as evaluation data. We therefore take a first step toward answering the above question by analyzing how the steering strength α influences the cross-entropy:

Definition 3.7 (Difference of cross-entropy). Recall that f_α is the steered model. We define the difference of cross-entropy ∆CE(α) after steering as

∆CE(α) := CE(f_α) − CE(f).

We now give a precise characterization of the local behavior of cross-entropy around α = 0:

Theorem 3.8 (Cross-entropy local behavior). Under Assumption 2, as α → 0, the cross-entropy increase satisfies

∆CE(α) = (1/2) Σ_{j∈[m]} π_j Var_j(M(Z)) α² + o(α²),

where Var_j(M(Z)) is the variance of the log-odds for tokens Z sampled according to (p(z | c_j))_{z∈[V]}, and π_j is the probability of each distinct context c_j (see App. B.5 for both expressions).

In light of the previous theorem, ∆CE(α) is locally U-shaped, since there is no linear term in α and the coefficient of α² is a variance of the log-odds, hence nonnegative; see Fig. 5 for an illustration. Simply put, steering necessarily degrades global performance. This provides, to our knowledge, the first theoretical characterization of how a performance measure (cross-entropy) varies with the steering strength α. Additionally, our result provides a theoretical justification for the empirical observation from Von Rütte et al. (2024) that ∆CE(α) is locally quadratic in α; we come back to this matter in Sec. 5.

4. Towards real-world transformers

The previous sections analyze steering in a theoretical setting, where the model is an idealized one. Modern LLMs, however, involve additional components, most notably the repeated application of attention and fully connected blocks together with normalization, which complicate the analysis.
In this section, we move closer to practice by specifying a real-life activation steering setup broad enough to cover our experimental setting (Sec. 5). We then proceed to describe the effect of large α on the steered LLM output.

Decoder-only transformers. Typical decoder-only transformers (Vaswani et al., 2017; Radford et al., 2018) share the same structure: we define the residual stream h^(ℓ) ∈ R^{T×d} inductively, with h^(0) given by the input embeddings. A transformer block updates h^(ℓ) to h^(ℓ+1) as

h^(ℓ+1) := h^(ℓ)_res + h^(ℓ)_ffn,
h^(ℓ)_attn := ATTN(LN(h^(ℓ))),
h^(ℓ)_res := h^(ℓ) + h^(ℓ)_attn,
h^(ℓ)_ffn := FFN(LN(h^(ℓ)_res)),   (4)

where ATTN denotes the attention module, FFN the feed-forward module (e.g., fully connected or mixture-of-experts (Shazeer et al., 2017)), and LN a normalization module (e.g., LayerNorm (Ba et al., 2016) or RMSNorm (Zhang & Sennrich, 2019)). After L layers, the output logits y ∈ R^{T×V} are y := LN(h^(L)) W^⊤, where W ∈ R^{V×d} is the unembedding matrix.

Figure 6. Influence of steering strength α on next-token probability increase ∆p(z, α) for the concept "evil," shown for LLMs of increasing size (openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B). Each curve ∆p(z, α) corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Thm. 3.3: most tokens exhibit a bump, while a few increase throughout. The selected tokens are all related to the steered concept.

Steering vector. As in our theoretical setting (Sec. 2.2), we build the steering vector v from two prompt sets: a positive set P and a negative set N. Following Chen et al. (2025), both sets are generated using a fixed LLM (Gemma3 12B). To form P, we use a system prompt that instructs the model to generate text exhibiting the target concept (App. A) and sample 500 responses using nucleus sampling, generating 300 new tokens per output.
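The normalization LN in Eq. (4) is the mechanism behind the large-α limit discussed later in this section: LN is (up to its epsilon term) scale-invariant, so LN(h + αv) approaches LN(v) as α grows, and the steered logits forget the prompt. A toy check with a parameter-free LayerNorm and random matrices (all shapes and seeds are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # parameter-free LayerNorm: center, then scale to unit variance
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + eps)

rng = np.random.default_rng(4)
d, V = 16, 32
h = rng.normal(size=d)         # residual-stream state for some prompt
v = rng.normal(size=d)         # steering direction
W = rng.normal(size=(V, d))    # unembedding matrix

y_steered = layer_norm(h + 1e6 * v) @ W.T   # logits at very large alpha
y_limit = layer_norm(v) @ W.T               # prompt-independent limit LN(v) W^T
```

At moderate α the logits still depend on h; only in the large-α limit does the prompt information wash out.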
For negatives, we consider two constructions. In the contrastive setting, N consists of 500 generations obtained with the same system prompt but using the opposite or negated concept. In the random setting, N is formed by sampling 500 generations from an empty prompt (e.g., a begin-of-sequence token) using the model to be steered. Each experiment uses one of these constructions for N. While prior work (Von Rütte et al., 2024) relies on hand-crafted negatives, empty-prompt sampling provides a simple alternative that appears unexplored. For a fixed layer ℓ, we record the residual stream h^(ℓ)_j ∈ R^{T×d} for every generation j ∈ P ∪ N, and define h_j := h̄^(ℓ)_j ∈ R^d as the token-wise average of h^(ℓ)_j (Chen et al., 2025). Using the h_j, we compute the steering vector v as in Eq. (3).

Steering. A transformer block offers several natural steering locations. In this work, we steer the residual stream h^(ℓ) ∈ R^{T×d}, which is also the most common choice in prior work (Turner et al., 2023; Marks & Tegmark, 2024; Rimsky et al., 2024; Burns et al., 2023; Zou et al., 2023; Gurnee & Tegmark, 2024). The next design choice is which token positions to steer: we follow Chen et al. (2025) and Von Rütte et al. (2024) and steer all positions of the input prompt, i.e., we copy a single steering vector v ∈ R^d across the sequence length to obtain a matrix v ∈ R^{T×d}. Thus, steering at layer ℓ with strength α follows Eq. (1). Another option is to steer only the last-token representation h^(ℓ)_{−1,:} (Rimsky et al., 2024).

Steered logits. Steering the residual stream h^(ℓ) yields the steered logits y(α) := LN(h^(ℓ) + αv + R(α)) W^⊤, where +αv persists to the output via residual (skip) connections, and R(α) collects the effect of steering on the output logits not captured by +αv. The expression of y(α) is proven and made rigorous in App. B.6. Crucially, the theoretical model (UFM) of Def.
2.2 omits the normalization LN and the term R(α), and treats h^(ℓ) simply as an embedding, akin to an embedding-layer representation in an LLM. We now prove the large-α behavior of the steered logits for the transformer of Eq. (4):

Proposition 4.1 (Limiting behavior of steering a transformer). Consider steering the residual stream h^(ℓ) of a transformer in the direction v ∈ R^{T×d}. As α → ±∞, the steered logits y(α) → LN(±v) W^⊤.

Because of the normalization LN, the term R(α) remains bounded in α. Consequently, for large |α| the steered logits no longer depend on the input prompt and instead converge to the unembedding of the normalized steering direction, LN(±v)W^⊤. The corresponding softmax therefore converges to σ(LN(±v)W^⊤), implying that the cross-entropy plateaus for large |α| since the output distribution becomes input-independent. See Fig. 7 for an illustration.

5. Experiments

In this section, we empirically validate, on transformers spanning a wide range of sizes (Table A.1), the main results of Sec. 3: the "bump" pattern in next-token probabilities, the U-shaped behavior of cross-entropy around α = 0, and the sigmoidal evolution of concept probability. We observe these behaviors consistently across model types (base, instruction-tuned, multimodal), scales (a few million to several billion parameters), and concepts. Steering is implemented as described in Sec. 4. We consider 8 concepts spanning a range of safety-related behaviors (listed in App. A). Each experiment corresponds to a choice of steering vector, model, steered layer, and input prompt; the figures in this section illustrate typical steering behavior by fixing the concept (here, "evil", "depression", and "joy"), steering a middle layer (Chen et al., 2025), and using the random construction of N. App.
A reports additional configurations and results, including other layers, concepts, and models with contrastive N, error bars under resampling of P and N, runs with normalized v, steering only the last token h^(ℓ)_{−1,:}, and the impact of steering on MMLU. Finally, App. A also reports results for additional input prompts, since each next-token probability plot is computed for a fixed context. Notably, our code is modular, enabling extensions to unseen configurations.

Figure 7. Influence of steering strength α across models (openai-community/gpt2, google/gemma-3-1b-it, Qwen/Qwen3-8B, Qwen/Qwen3-4B). Top row: concept probability for three concepts (depression, joy, evil), estimated using a judge LLM (Gemma 3 12B), showing the sigmoidal trend predicted by Thm. 3.6. Bottom row: cross-entropy ∆CE(α) for the same concepts, locally U-shaped around α = 0 and plateauing for large |α| (Thm. 3.8, Prop. 4.1).

Results for next-token probabilities. We measure the influence of α on the increase of next-token probabilities ∆p(z,α). In Fig. 6, we plot ∆p(z,α) for a small set of tokens z that become most likely at large α, motivated by Prop. 4.1, which shows that in the large-|α| regime the logits are determined by the unembedding of (normalized) v. Across models and concepts, the evidence is unequivocal: we observe the "bump" pattern in ∆p for concept tokens and the large-α regime where a few tokens dominate, as predicted by Thm. 3.3. The same behavior is shown for off-target tokens at negative α in Fig. A.10. Dominating tokens do not generally associate with extremal log-odds at intermediate layers, but they do at the final layer (App. A).
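Both the bump pattern and the large-α limit can be reproduced in a minimal toy model of the steered output σ(LN(h + αv)W^⊤). This is a sketch with a random unembedding, not the paper's code; the "intermediate" token W[5], aligned with both the input and the steering direction, is planted purely so that a bump is visible.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LayerNorm without learned affine parameters.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
d, V = 64, 200
W = rng.normal(size=(V, d))        # toy unembedding matrix
h = W[3].copy()                    # input representation aligned with token 3
v = W[7].copy()                    # steering direction aligned with token 7
W[5] = 0.8 * (W[3] + W[7])         # planted "intermediate" token, for illustration

def probs(alpha):
    # Softmax of the steered logits LN(h + alpha * v) W^T.
    z = layer_norm(h + alpha * v) @ W.T
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

alphas = np.linspace(0.0, 200.0, 201)
P = np.stack([probs(a) for a in alphas])
delta_p = P - P[0]                 # Delta p(z, alpha)

# Bump: the intermediate token 5 peaks at moderate alpha, then fades.
bump_idx = delta_p[:, 5].argmax()  # an interior index of the alpha grid
# Dominance switch: token 3 loses probability, token 7 gains it (Thm. 3.3).
switch = (delta_p[-1, 3] < 0, delta_p[-1, 7] > 0)
# Prop. 4.1: for huge alpha the distribution forgets the input h entirely.
z_lim = layer_norm(v) @ W.T
target = np.exp(z_lim - z_lim.max())
target /= target.sum()
gap = np.abs(probs(1e6) - target).max()  # vanishingly small
```

The same mechanism drives the real curves: along the normalized path from LN(h) to LN(v), tokens aligned with intermediate directions rise and then fall, while tokens aligned with v increase throughout.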
Finally, although the bump behavior appears already in early layers, steering at mid-to-late layers leads to highest-probability tokens that are more semantically tied to the target concept (App. A).

Results for concept probability. We estimate the concept probability in steered LLM responses using a sampling-based metric (different from the theoretical ∆p(C | α); see the discussion below Thm. 3.6). Concretely, for 12 prompts and each α, we sample 32 completions and prompt a judge LLM (Gemma 3 12B) to assign a binary label indicating whether the target concept is present, following Chen et al. (2025) (details in App. A). Averaging these labels yields the concept probability. Fig. 7 shows a mostly sigmoidal trend, with occasional mismatches (e.g., the middle panel). In such cases, for some layers/concepts and for a range of α values, next-token sampling can drift away from concept-related tokens because the highest-probability token may instead be punctuation (e.g., '-' or '.'), leading to degenerate outputs.

Results for cross-entropy. We estimate the steered cross-entropy change ∆CE(α) on 10^6 tokens sampled from the processed fineweb dataset (Penedo et al., 2024), which provides a sufficiently large and diverse sample for a reliable estimate. Across all models, we consistently observe the local U-shape around α = 0, confirming that steering always hurts global performance, as predicted by Thm. 3.8; see Fig. 7. Moreover, while Fig. 13 in Von Rütte et al. (2024) reports an empirical α² trend, it is unclear whether this behavior is meant to be local; Thm. 3.8 clarifies that the quadratic scaling holds only locally around α = 0. For large α, ∆CE(α) instead plateaus, as implied by Prop. 4.1 and confirmed in Fig. 7.

6. Conclusion

Activation steering is a simple and widely used method for controlling LLM behavior at inference time, yet the choice of steering strength α remains largely heuristic.
In this paper, we provide a theoretical analysis of steering strength for activation steering with a difference-of-means steering vector. In a tractable next-token prediction model, we characterize how α impacts next-token probabilities, concept probability in the output, and cross-entropy, and we validate these predictions empirically across a range of modern LLMs. Future work includes narrowing the theory/practice gap (e.g., mixed-concept contexts), extending the analysis to other steering methods (e.g., SAE-based ones), and developing principled, prompt-adaptive rules for choosing α by characterizing the steering "sweet spot" suggested by our results.

Acknowledgements

We thank Alberto Bietti and Salim I. Amoukou for their valuable insights. This work has been supported by the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-23-IACL-0001, the ANR project PRC MAD ANR-24-CE23-1529, and the support of the "France 2030" funding ANR-23-PEIA-0004 (PDE-AI) and ANR-15-IDEX-01. All experiments were performed using the Julia 2 cluster. Julia 2 was funded as a DFG project ("Forschungsgroßgerät nach Art. 91b GG") under INST 93/1145-1 FUGG.

References

Anthropic. Claude 3. https://www.anthropic.com/news/claude-3-family, 2024. Accessed: 2025-10-15.

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., Das-Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, volume 29, 2016.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023.

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering Latent Knowledge in Language Models Without Supervision. International Conference on Learning Representations, 2023.

Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

Costa, V., Fel, T., Lubana, E. S., Tolooshams, B., and Ba, D. E. From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., and Olah, C. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Fang, C., He, H., Long, Q., and Su, W. J. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.

Ferrando, A., Suau, X., González, J., and Rodriguez, P. Dynamically Scaled Activation Steering. arXiv preprint arXiv:2512.03661, 2025.

Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024.

Gurnee, W. and Tegmark, M. Language models represent space and time.
In The Twelfth International Conference on Learning Representations, 2024.

Hedström, A., Amoukou, S. I., Bewley, T., Mishra, S., and Veloso, M. To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models. In Forty-second International Conference on Machine Learning, 2025.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations, 2021.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023.

Marks, S. and Tegmark, M. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets. In First Conference on Language Modeling, 2024.

Marvin, G., Hellen, N., Jjingo, D., and Nakatumba-Nabende, J. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics, p. 387–402. Springer, 2023.

Mikolov, T., Yih, W.-t., and Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, p. 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

Min, P. P., Paudel, A., Adityo, N., Zhu, A., Rufail, A., Blondin, C., Zhu, K., Dev, S., and O'Brien, S. Mitigating sycophancy in language models via sparse activation fusion and multi-layer activation steering. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.

Mixon, D. G., Parshall, H., and Pi, J. Neural collapse with unconstrained features. Sampling Theory, Signal Processing, and Data Analysis, 20(2):11, 2022.
Nanda, N., Lee, A., and Wattenberg, M. Emergent Linear Representations in World Models of Self-Supervised Sequence Models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023.

Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, p. 39643–39666. PMLR, 21–27 Jul 2024.

Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=n6SCkn2QaG.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 15504–15522. Association for Computational Linguistics, August 2024.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

Stickland, A. C., Lyzhov, A., Pfau, J., Mahdi, S., and Bowman, S. R. Steering Without Side Effects: Improving Post-Deployment Control of Language Models.
In NeurIPS Safe Generative AI Workshop 2024, 2024.

Su, J., Chen, J., Li, H., Chen, Y., Qing, L., and Zhang, Z. Activation steering decoding: Mitigating hallucination in large vision-language models through bidirectional hidden state intervention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 12964–12974, 2025.

Tan, D. C. H., Chanin, D., Lynch, A., Paige, B., Kanoulas, D., Garriga-Alonso, A., and Kirk, R. Analysing the Generalisation and Reliability of Steering Vectors. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024.

Thrampoulidis, C. Implicit optimization bias of next-token prediction in linear models. Advances in Neural Information Processing Systems, 37:22624–22656, 2024.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Von Rütte, D., Anagnostidis, S., Bachmann, G., and Hofmann, T. A Language Model's Guide Through Latent Space.
In Proceedings of the 41st International Conference on Machine Learning. PMLR, 2024.

Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations, 2022.

Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C. D., and Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. In Forty-second International Conference on Machine Learning, 2025.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

Zhao, Y. and Thrampoulidis, C. Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations. arXiv preprint arXiv:2505.08348, 2025a.

Zhao, Y. and Thrampoulidis, C. Geometry of Concepts in Next-token Prediction: Neural-Collapse Meets Semantics. In The Second Conference on Parsimony and Learning (Recent Spotlight Track), 2025b.

Zhao, Y., Behnia, T., Vakilian, V., and Thrampoulidis, C. Implicit geometry of next-token prediction: From language sparsity patterns to model representations. In First Conference on Language Modeling, 2024.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation Engineering: A Top-Down Approach to AI Transparency, 2023.
Contents of the Appendix

A. Additional practical experiments and setting
B. Proofs and additional results
   B.1 Additional results
   B.2 Technical lemmas
   B.3 Proof of Theorem 3.3
   B.4 Proof of Theorem 3.6
   B.5 Proof of Theorem 3.8
   B.6 Proof of Proposition 4.1

A. Additional practical experiments and setting

Table A.1. Models used in the experiments of Section 5.

Model family                        | Sizes
1-layer GPT (Radford et al., 2018)  | 0.009B
GPT-2 (Radford et al., 2019)        | 0.12B, 0.77B, 1.5B
Gemma 3 (Team et al., 2025)         | 1B, 4B
Qwen 3 (Yang et al., 2025)          | 0.6B, 4B, 8B
Llama 2 (Touvron et al., 2023)      | 7B
Mistral (Jiang et al., 2023)        | 7B

Table A.2. Non-exhaustive list of tokens observed when steering the models in Table A.1, recorded among the highest-probability next tokens at α = 200 in the setting of Section 4. For readability, we report only full-word tokens; for tokenizers that produce smaller subwords, we observe the same phenomenon, with fragments such as 'happ' instead of 'happiness'.
Steered concept | High-probability next tokens
apathetic       | okay, yeah, bullshit, whatnot, probably
depression      | discomfort, sadness, emotional, despair, uncomfortable
evil            | horrifying, destruction, terror, murderous, deadly
humorous        | comedic, ridiculous, silly, fidget, hilarious
impolite        | verbal, disrespectful, stupid, rude, insulting
joy             | feeling, happiness, ecstatic, joyful, laughter
lying           | absurd, ridiculous, thieves, truth, spiritual
optimistic      | cheerful, cherish, religious, optimism, someday

Setting to generate the positive (P) and negative (N) prompt sets. As described in Section 4, P is generated by a prompted instruct-LLM (Gemma 3 12B) using a system prompt of the following form: (1) behavior instruction → (2) definition of the concept → (3) example sentences exhibiting the concept. For P, we use the target concept in the following template. For N, in the contrastive setting, we reuse the same template but replace the concept with its negation (the negated concept is itself generated by the LLM). Otherwise, in the unrelated setting, N consists of sampled generations from the LLM to steer, starting from an empty prompt. Here is the exact template used in our experiments:

positive_chat_template = [
    {"role": "system", "content": "You are an assistant."},
    {"role": "user", "content": (
        f"Describe the concept: {concept}. Begin with a one-sentence definition. "
        "Then give several natural first-person examples that demonstrate the concept in an unmistakable way. "
        "Write as if you are the concept itself. Your examples must make the concept obvious through what you say or do "
        "-- for example, if the concept is *lying*, give clearly false statements (e.g., claiming impossible facts or events). "
        "Do not use third-person references. Do not explain anything. "
        "After the definition, output only the examples as simple first-person sentences that fully embody the concept."
    )},
]

We steer 8 concepts: apathetic, depression, evil, humorous, impolite, joy, lying, and optimistic.

Setting to compute the concept probability in practice. As described in Section 5, we use a judge LLM (Gemma 3 12B) to decide whether a target concept is present in a steered LLM generation. The following prompt is used to obtain a binary label:

judge_question_template: str = (
    "Text: {completion}\n\n"
    "Does the text above express the concept {concept} (any word linked to the concept in the whole sentence, even if it does not make sense, is sufficient for a positive reply)? Reply with 1 for yes and 0 for no.\n"
    "Answer:"
)

As this prompt makes clear, we do not attempt to assess output quality here; we only measure concept presence, using a permissive criterion (any concept-related word suffices).

Additional figures for the influence of steering strength on next-token probabilities. We complement Figure 6 from Section 5 with additional models, concepts, steering layers, and negative steering strengths; see Figures A.7, A.8, A.9, and A.10. Overall, the qualitative predictions of Theorem 3.3 are observed. The main discrepancy occurs when steering early layers: tokens exhibiting bumps or dominating at large α are less often concept-related. This is expected, as steering early layers is known to yield weaker results (Chen et al., 2025). In Table A.2, we provide a non-exhaustive sample of the highest-probability next tokens recorded at α = 200; consistently, steering works as intended by increasing the probability of concept-related tokens.

Additional figures for the influence of steering strength on concept probability in the output. We complement Figure 7 from Section 5 with additional models, concepts, and steering layers; see Figures A.11 and A.12. The qualitative prediction of Theorem 3.6 is partially verified (more often true than false).
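The label-averaging step can be sketched as follows. This is a minimal stand-in: `keyword_judge` is a hypothetical keyword matcher replacing the Gemma 3 12B judge call, mirroring the permissive "any concept-related word" criterion.

```python
def concept_probability(completions, concept, judge):
    # Average binary judge labels over sampled completions
    # (the paper uses 32 completions per alpha over 12 prompts).
    labels = [judge(completion, concept) for completion in completions]
    return sum(labels) / len(labels)

def keyword_judge(completion, concept):
    # Toy stand-in for the LLM judge: permissive substring matching.
    return int(concept.lower() in completion.lower())

completions = ["I feel joy today.", "The weather is grey.", "Pure joyful laughter."]
p = concept_probability(completions, "joy", keyword_judge)
print(round(p, 3))  # 0.667
```

In the actual pipeline, `judge` would format the judge prompt above with the completion and concept, query the judge LLM, and parse the 0/1 answer.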
Results are sensitive to the concept itself (intuitively, harder concepts such as lying yield less clean curves than easier ones such as joy). The main discrepancy again arises when steering early layers, which is consistent with prior observations (Chen et al., 2025).

Additional figures for the influence of steering strength on cross-entropy. We complement Figure 7 from Section 5 with additional models, concepts, and steering layers; see Figure A.13. Overall, the predictions of Theorem 3.8 and Proposition 4.1 are observed.

Additional results. We provide additional plots for experiments mentioned in Section 5, including MMLU (Figure A.6), steering only the last-token representation h^(ℓ)_{−1,:} (Figure A.2), normalization of the steering vector (Figure A.1), contrastive N (Figure A.5), error bars under resampling of P and N (Figure A.3), and steering, in the setting of Section 4, a 1-layer GPT-style transformer (Figure A.4) that we train on fineweb (Penedo et al., 2024).

Figure A.1. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concepts (top to bottom): depression, joy, and evil, on openai-community/gpt2 and Qwen/Qwen3-8B. Each row of two plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and the steering vector is normalized. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 2000. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.
Figure A.2. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concepts (top to bottom): evil and joy, on openai-community/gpt2, google/gemma-3-1b-it, and Qwen/Qwen3-8B. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and we steer only the last-token representation h^(ℓ)_{−1,:}. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.

Figure A.3. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concept "evil" on google/gemma-3-1b-it. Steering is applied at a middle layer in each model (we use a fixed middle layer per model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200, plotted with mean and standard deviation over 5 runs obtained by resampling the prompt sets P and N. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. The variability across runs is moderate, so for computational cost we omit error bars in the main figures.

Figure A.4. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concept "uppercase words" on a 1-layer GPT. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 50.
This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are uppercase words.

Figure A.5. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concepts (top to bottom): depression, joy, and evil, on openai-community/gpt2, google/gemma-3-1b-it, and Qwen/Qwen3-8B. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model) and the negative prompt set N is built in the contrastive setting. Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are related to the steered concept.

Figure A.6. Effect of steering strength α on MMLU (Hendrycks et al., 2021) for the concept "evil" on google/gemma-3-4b-it. MMLU is a practical performance metric, more indicative of real-world capability than cross-entropy. We measure it using the DeepEval library, for which random guessing yields 25%. As with cross-entropy, increasing α inevitably degrades model performance.
Figure A.7. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic, on openai-community/gpt2, google/gemma-3-1b-it, and mistralai/Mistral-7B-Instruct-v0.3. Each row of three plots corresponds to a single steered concept. Steering is applied at an early layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This partially matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, many selected tokens are not concept-related, consistent with the observation that steering early layers often yields worse results (Chen et al., 2025).
Figure A.8. Effect of steering strength α > 0 on next-token probability shifts ∆p(z,α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic, on openai-community/gpt2, google/gemma-3-1b-it, and mistralai/Mistral-7B-Instruct-v0.3. Each row of three plots corresponds to a single steered concept. Steering is applied at a middle layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, the selected tokens are related to the steered concept.
Figure A.9. Effect of steering strength α > 0 on next-token probability shifts Δp(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept; the columns correspond to openai-community/gpt2, google/gemma-3-1b-it, and mistralai/Mistral-7B-Instruct-v0.3. Steering is applied at the last layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = 200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, the selected tokens are mostly related to the steered concept.
Figure A.10. Effect of steering strength α < 0 on next-token probability shifts Δp(z, α) for the concepts (top to bottom): depression, evil, impolite, joy, lying, and apathetic. Each row of three plots corresponds to a single steered concept; the columns correspond to openai-community/gpt2, google/gemma-3-1b-it, and mistralai/Mistral-7B-Instruct-v0.3. Steering is applied at a middle layer in each model (we always steer the same layer for a given model). Each curve corresponds to a token z selected among the eight highest-probability tokens at α = −200. This matches Theorem 3.3: most tokens exhibit a bump, while a few increase throughout. Notably, none of the selected tokens are related to the steered concept.
Figure A.11. Influence of steering strength α on concept probability for the concepts (top to bottom): depression, evil, humorous, impolite, joy, and optimistic. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model: openai-community/gpt2-large (layers 0, 23, 35), openai-community/gpt2-xl (layers 0, 15, 47), and google/gemma-3-1b-it (layers 0, 12, 25). Steering is applied at three layers (early, middle, late), indicated in each legend. Overall, the curves are consistent with Theorem 3.6, which predicts a sigmoidal shape. Early-layer steering is more erratic, consistent with reports that steering early layers yields worse results (Chen et al., 2025).
Figure A.12. Influence of steering strength α on concept probability for the concepts (top to bottom): depression, evil, humorous, impolite, joy, and optimistic. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model: openai-community/gpt2 (layers 3, 7, 11), meta-llama/Llama-2-7b-chat-hf (layers 0, 15, 31), and mistralai/Mistral-7B-Instruct-v0.3 (layers 0, 15, 31). Steering is applied at three layers (early, middle, late), indicated in each legend. Overall, the curves are consistent with Theorem 3.6, which predicts a sigmoidal shape. Early-layer steering is more erratic, consistent with reports that steering early layers yields worse results (Chen et al., 2025). We also observe less consistent behavior on Llama 2.
Figure A.13. Influence of steering strength α on the cross-entropy ΔCE(α) for the concepts (top to bottom): apathetic, depression, evil, humorous, impolite, and joy. Each row of three plots corresponds to a single steered concept, and each column corresponds to a different model: openai-community/gpt2-large (layers 11, 23, 35), google/gemma-3-1b-it (layers 8, 16, 25), and Qwen/Qwen3-0.6B (layers 9, 18, 27). Steering is applied at three layers (early, middle, late), indicated in each legend. As predicted by Theorem 3.8, ΔCE(α) is locally U-shaped around α = 0 and saturates for large |α|, in line with Proposition 4.1.

B. Proofs and additional results

B.1. Additional results

Figure B.1. Next-token probability increases Δp(α) for a fixed context and negative α. Each curve corresponds to a token z: target tokens T are in blue and off-target tokens in orange. Most off-target tokens exhibit a "bump" (peaking at α ∈ (−4, −1)), while one off-target token decreases on ℝ₋ and the target tokens are increasing on ℝ₋.

Generalizing our results to a_z depending on j. Inspecting the proofs shows that all results, except Remark 3.4, rely on Lemma B.4. Consequently, our main theorems continue to hold verbatim as long as the same sign-separation property holds for the log-odds M (Lemma B.4).
However, if one allows a_z and b_z to depend on the context index j without further structure, this sign-separation property may fail. A simple generalization that preserves sign separation is to allow a_z to depend on j but keep b_z independent of j, assuming a_{j,z} > b_z. This is interpretable: only in-concept probabilities (that is, for z, c_j ∈ C_k) vary with the context, while off-concept probabilities remain at a small baseline level b_z. Allowing b_z to depend on j while still enforcing Lemma B.4 is possible, but typically leads to a less interpretable assumption. In short, if Lemma B.4 is treated as an assumption, our results carry over; moreover, Lemma B.4 appears to hold in practice, see Appendix A.

Plot of Δp(α) for negative steering strength. We provide the counterpart of Figure 6 for negative α, see Figure B.1.

Perfect training of the UFM. To illustrate that Assumption 2 is attainable in our theoretical setting, we train a UFM with gradient descent on the cross-entropy loss, on the following dataset instantiation from Definition 2.1, which satisfies Assumption 1:

$$\forall z \in [V], \qquad a_z := \frac{1-\varepsilon}{s}, \qquad b_z := \frac{\varepsilon}{(G-1)s},$$

with ε ∈ (0, (G−1)/G) a smoothing parameter. The dataset entropy is ≈ 1.3317 (a lower bound on the achievable loss (Thrampoulidis, 2024)), and we reach a loss of ≈ 1.3318, indicating that the model learns the dataset essentially perfectly.

As stated in Section 3.1, we have an additional result on the limits of Δp(α) (Definition 3.1):

Proposition B.1 (Limits of Δp(α)). Given a context index j ∈ [m] and a token z ∈ [V], the limit of Δp(α) as α → +∞ is

$$\lim_{\alpha\to+\infty} \Delta p(\alpha) = \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - p(z\mid c_j). \tag{5}$$

Similarly, for the limit α → −∞, replace $\overline{\mathcal{M}}$ by $\underline{\mathcal{M}}$ in Eq. (5), where $\overline{\mathcal{M}}$ (resp. $\underline{\mathcal{M}}$) denotes the set of tokens attaining the maximal (resp. minimal) log-odds.

Eq. (5) has the following interpretation: in the limit α → +∞ (resp. α → −∞), Δp(α) concentrates all its mass on the tokens z ∈ $\overline{\mathcal{M}}$ (resp. z ∈ $\underline{\mathcal{M}}$). If multiple tokens attain the maximal or minimal log-odds, the probability mass is shared among all such tokens.
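Proposition B.1 can be sanity-checked numerically: under the reweighting of Lemma B.3, the steered distribution concentrates on the tokens attaining the maximal log-odds. A minimal sketch with toy values for p(· | c_j) and M (illustrative assumptions, not taken from the paper's experiments):

```python
import math

# Toy setup: 5 tokens, base distribution p(.|c_j) and log-odds M(z).
p = [0.4, 0.3, 0.15, 0.1, 0.05]
M = [0.7, 0.7, 0.2, -0.3, -0.5]   # tokens 0 and 1 attain the maximal log-odds

def steered(alpha):
    """Softmax reweighting of Lemma B.3: sigma_z proportional to p(z) * exp(alpha * M(z))."""
    w = [pz * math.exp(alpha * Mz) for pz, Mz in zip(p, M)]
    s = sum(w)
    return [x / s for x in w]

# Predicted limit of Delta p(alpha) as alpha -> +infinity, Eq. (5):
top = [z for z, Mz in enumerate(M) if Mz == max(M)]
mass = sum(p[z] for z in top)
limit = [(p[z] / mass if z in top else 0.0) - p[z] for z in range(len(p))]

# Empirical Delta p at a large alpha matches the predicted limit:
sig = steered(80.0)
delta = [sig[z] - p[z] for z in range(len(p))]
for z in range(len(p)):
    assert abs(delta[z] - limit[z]) < 1e-6
```

Note that the two maximal-log-odds tokens share the limiting mass in proportion to their base probabilities, exactly as the interpretation after Eq. (5) states.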
Proof. Let us first prove that the softmax behaves as follows when scaling the steering strength α:

$$\forall z \in [V], \qquad \lim_{\alpha\to+\infty} \sigma_z\big(f_\alpha(c_j)\big) = \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)}, \tag{6}$$

where $\overline{\mathcal{M}}$ is the set of tokens attaining the maximal log-odds M_max. A short proof of the previous display is as follows. By Lemma B.3,

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))},$$

and multiplying numerator and denominator by exp(−αM_max) gives

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp\big(\alpha(M(z)-M_{\max})\big)}{\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big)} = \begin{cases} \dfrac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j) + \sum_{z'\notin\overline{\mathcal{M}}} p(z'\mid c_j)\exp(\alpha(M(z')-M_{\max}))} & \text{if } z\in\overline{\mathcal{M}},\\[2ex] \dfrac{p(z\mid c_j)\exp(\alpha(M(z)-M_{\max}))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha(M(z')-M_{\max}))} & \text{otherwise.} \end{cases}$$

In the first case (z ∈ $\overline{\mathcal{M}}$), since M(z′) − M_max < 0 for z′ ∉ $\overline{\mathcal{M}}$, the limit is

$$\lim_{\alpha\to+\infty} \frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j) + \sum_{z'\notin\overline{\mathcal{M}}} p(z'\mid c_j)\exp(\alpha(M(z')-M_{\max}))} = \frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)}.$$

In the second case (z ∉ $\overline{\mathcal{M}}$), the limit follows by bounding the term, using the fact that

$$\sum_{z'\in[V]} p(z'\mid c_j)\exp\big(\alpha(M(z')-M_{\max})\big) \ge p(z^\star\mid c_j)\exp(0),$$

where z⋆ ∈ $\overline{\mathcal{M}}$. We get the bound

$$0 < \frac{p(z\mid c_j)\exp(\alpha(M(z)-M_{\max}))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha(M(z')-M_{\max}))} \le \frac{p(z\mid c_j)}{p(z^\star\mid c_j)}\exp\big(\alpha(M(z)-M_{\max})\big),$$

which implies that the second-case term goes to 0 as α → +∞ (because M(z) − M_max < 0 for z ∉ $\overline{\mathcal{M}}$). Using the previous display, we get

$$\lim_{\alpha\to+\infty} \Delta p(\alpha) = \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\frac{p(z\mid c_j)}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - p(z\mid c_j) \tag{7}$$

$$= \begin{cases} \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{a_z}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - a_z & \text{if } c_j\in\mathcal{T},\\[2ex] \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{b_z}{\sum_{z'\in\overline{\mathcal{M}}} p(z'\mid c_j)} - b_z & \text{otherwise.} \end{cases} \tag{8}$$

Since a_z > b_z (Assumption 1), the tokens which can attain the maximal log-odds are necessarily concept tokens of T (Lemma B.4), thus

$$\lim_{\alpha\to+\infty} \Delta p(\alpha) = \begin{cases} \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{a_z}{\sum_{z'\in\overline{\mathcal{M}}} a_{z'}} - a_z & \text{if } c_j\in\mathcal{T},\\[2ex] \mathbf{1}_{z\in\overline{\mathcal{M}}}\,\dfrac{b_z}{\sum_{z'\in\overline{\mathcal{M}}} b_{z'}} - b_z & \text{otherwise.} \end{cases} \tag{9}$$

The same reasoning applies for α → −∞. ∎

B.2. Technical lemmas

In the following we introduce and prove the technical lemmas needed for Section 3. In the UFM model, activation steering on the embedding h_j admits an explicit expression for the resulting model output:

Lemma B.2 (Steering on UFM). Steering the embedding h_j along the direction v from Eq. (3) with strength α ∈ ℝ, we obtain the steered logits

$$f_\alpha(c_j) := W(h_j + \alpha v) = \ell_j + \frac{\alpha}{q}\left(\sum_{i\in P} \ell_i - \sum_{i\in N} \ell_i\right),$$

where ℓ_j := f(c_j) are the unsteered logits for context c_j.

Proof. The rewriting is a direct consequence of the UFM model and the linearity of the steering vector v:

$$f_\alpha(c_j) = W\left(He_j + \alpha\left(\frac{1}{q}\sum_{i\in P} He_i - \frac{1}{q}\sum_{i\in N} He_i\right)\right) = WH\left(e_j + \frac{\alpha}{q}\left(\sum_{i\in P} e_i - \sum_{i\in N} e_i\right)\right) = \ell_j + \frac{\alpha}{q}\left(\sum_{i\in P} \ell_i - \sum_{i\in N} \ell_i\right). \qquad\blacksquare$$

Thus, studying activation steering reduces to analyzing how the softmax behaves under a linear shift of its input ℓ_j by α/q times the vector Σ_{i∈P} ℓ_i − Σ_{i∈N} ℓ_i. The log-odds M(z) (Definition 3.2) are central because steering modifies the softmax by reweighting each token probability p(z | c_j) by the exponential factor exp(αM(z)).

Lemma B.3 (Rewriting Δp(α)). Assume Assumption 2. The first component of Δp(α) can be rewritten as follows:

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}.$$

Proof. We express σ_z(f_α(c_j)) explicitly in terms of p(z | c_j) and the log-odds M(z), using the rewriting of the logits from Lemma B.2:

$$\sigma_z\big(f_\alpha(c_j)\big) = \sigma_z\left(\ell_j + \frac{\alpha}{q}\left(\sum_{i\in P} \ell_i - \sum_{i\in N} \ell_i\right)\right).$$

By Assumption 2, we have σ_z(ℓ_j) = p(z | c_j), and, since the softmax is shift-invariant, there exists β_j ∈ ℝ such that ℓ_{j,z} = log p(z | c_j) + β_j. Using this representation, with the notation p(· | c_j) := (p(z | c_j))_{z∈[V]} and log applied element-wise to vectors, we get

$$\sigma_z\big(f_\alpha(c_j)\big) = \sigma_z\left(\log p(\cdot\mid c_j) + \beta_j\mathbf{1} + \frac{\alpha}{q}\left(\sum_{u\in P}\log p(\cdot\mid c_u) - \sum_{v\in N}\log p(\cdot\mid c_v) + \sum_{u\in P}\beta_u\mathbf{1} - \sum_{v\in N}\beta_v\mathbf{1}\right)\right)$$

$$= \sigma_z\left(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\left(\sum_{u\in P}\log p(\cdot\mid c_u) - \sum_{v\in N}\log p(\cdot\mid c_v)\right) + \beta_j\mathbf{1} + \frac{\alpha}{q}\left(\sum_{u\in P}\beta_u - \sum_{v\in N}\beta_v\right)\mathbf{1}\right)$$

$$= \sigma_z\left(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\left(\sum_{u\in P}\log p(\cdot\mid c_u) - \sum_{v\in N}\log p(\cdot\mid c_v)\right)\right),$$

where the last equality again uses the shift-invariance of the softmax. With products and divisions of the vectors p(· | c_u) taken element-wise,

$$\sigma_z\big(f_\alpha(c_j)\big) = \sigma_z\left(\log p(\cdot\mid c_j) + \frac{\alpha}{q}\log\frac{\prod_{u\in P} p(\cdot\mid c_u)}{\prod_{v\in N} p(\cdot\mid c_v)}\right) = \sigma_z\big(\log p(\cdot\mid c_j) + \alpha m\big),$$

where m := (M(1), …, M(V))ᵀ ∈ ℝ^V is the vector of log-odds. The final step is to write the softmax σ_z(f_α(c_j)) explicitly:

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{\exp\big(\log p(z\mid c_j) + \alpha M(z)\big)}{\sum_{z'\in[V]}\exp\big(\log p(z'\mid c_j) + \alpha M(z')\big)} = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}. \qquad\blacksquare$$

As a first step toward formalizing why steering makes concept tokens C more likely as α increases, we establish a sign-separation property for the log-odds M(z):

Lemma B.4 (Log-odds M(z) sign separation). Assume Assumption 1 and let T be the target concept. For any z ∈ [V], we have z ∈ T if, and only if, M(z) > 0.

Proof. Given z ∈ [V], T the target concept, and the dataset of Definition 2.1 satisfying Assumption 1, the log-odds M(z) can be rewritten as

$$M(z) = \frac{1}{q}\log\frac{\prod_{i\in P} p(z\mid c_i)}{\prod_{i\in N} p(z\mid c_i)} = \begin{cases} \dfrac{1}{q}\log\dfrac{a_z^q}{b_z^q} = \log\dfrac{a_z}{b_z} & \text{if } z\in\mathcal{T},\\[2ex] \dfrac{1}{q}\log\dfrac{b_z^q}{a_z^{q_z}\, b_z^{q-q_z}} = -\dfrac{q_z}{q}\log\dfrac{a_z}{b_z} & \text{otherwise,} \end{cases}$$

with q_z := |{j ∈ N : ∃k ∈ [G], c_j, z ∈ C_k}| ∈ ℕ (note that it can be 0). In the first case (z ∈ T), M(z) = log(a_z/b_z) > 0 by Assumption 1. Otherwise, M(z) = −(q_z/q) log(a_z/b_z) ≤ 0, again by Assumption 1. ∎

Now let us compute the derivative of Δp(α):

Lemma B.5 (Derivative of Δp(α)). Let z ∈ [V] and j ∈ [m]. The derivative with respect to α is

$$\Delta p'(z\mid c_j,\alpha) = \sigma_z\big(f_\alpha(c_j)\big)\Big(M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big).$$

Proof. First, denote D_j(α) := Σ_{z′∈[V]} p(z′ | c_j) exp(αM(z′)).
Using Lemma B.3, the derivation is as follows:

$$\Delta p'(z\mid c_j,\alpha) = \frac{d}{d\alpha}\,\sigma_z\big(W h_j + \alpha v\big) = \frac{d}{d\alpha}\left(\frac{p(z\mid c_j)\exp(\alpha M(z))}{D_j(\alpha)}\right)$$

$$= \frac{p(z\mid c_j)\exp(\alpha M(z))M(z)\,D_j(\alpha) - p(z\mid c_j)\exp(\alpha M(z))\,D_j'(\alpha)}{D_j(\alpha)^2}$$

$$= \frac{p(z\mid c_j)\exp(\alpha M(z))}{D_j(\alpha)}\left(M(z) - \frac{D_j'(\alpha)}{D_j(\alpha)}\right) = \sigma_z\big(f_\alpha(c_j)\big)\left(M(z) - \frac{D_j'(\alpha)}{D_j(\alpha)}\right).$$

The term D_j′(α)/D_j(α) can be rewritten as follows:

$$\frac{D_j'(\alpha)}{D_j(\alpha)} = \frac{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))M(z')}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))} = \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)M(z') = \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]. \qquad\blacksquare$$

B.3. Proof of Theorem 3.3

Theorem 3.3 concerns the monotonicity of Δp(α); hence we need to study the sign of the derivative of Δp. As shown in Lemma B.5, the sign of (Δp)′(α) is governed by the difference

$$M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)].$$

In this difference, the only quantity that depends on α is E_{Z∼σ(f_α(c_j))}[M(Z)]. In the following, we therefore study the variations of this expectation under steering through its derivative:

$$\frac{d}{d\alpha}\,\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \sum_{z'\in[V]} \frac{d}{d\alpha}\,\sigma_{z'}\big(f_\alpha(c_j)\big)\,M(z') = \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)\Big(M(z') - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big)M(z')$$

$$= \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}\big[M(Z)^2\big] - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]^2 = \mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big), \tag{10}$$

with Var(M(Z)) > 0 if M(Z) is not constant σ(f_α(c_j))-almost surely. This means that E_{Z∼σ(f_α(c_j))}[M(Z)] is strictly increasing in α on ℝ. Its limits as α → ±∞ are

$$\lim_{\alpha\to+\infty} \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \max_{z\in[V]} M(z) =: M_{\max}, \qquad \lim_{\alpha\to-\infty} \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \min_{z\in[V]} M(z) =: M_{\min}.$$
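Both the identity (10) and the limits just stated can be checked by finite differences on a toy instance (the probabilities and log-odds below are illustrative assumptions):

```python
import math

p = [0.5, 0.25, 0.15, 0.1]     # toy base distribution p(.|c_j)
M = [0.9, 0.4, -0.2, -0.6]     # toy log-odds M(z)

def tilt(alpha):
    """The tilted distribution sigma(f_alpha(c_j)) from Lemma B.3."""
    w = [pz * math.exp(alpha * Mz) for pz, Mz in zip(p, M)]
    s = sum(w)
    return [x / s for x in w]

def expect_M(alpha):
    return sum(sz * Mz for sz, Mz in zip(tilt(alpha), M))

def var_M(alpha):
    e = expect_M(alpha)
    return sum(sz * (Mz - e) ** 2 for sz, Mz in zip(tilt(alpha), M))

# d/dalpha E[M(Z)] equals the tilted variance, Eq. (10):
h = 1e-5
for alpha in (-2.0, 0.0, 3.0):
    fd = (expect_M(alpha + h) - expect_M(alpha - h)) / (2 * h)
    assert abs(fd - var_M(alpha)) < 1e-6

# E[M(Z)] is strictly increasing and approaches M_max / M_min at the extremes:
assert expect_M(-1.0) < expect_M(0.0) < expect_M(1.0)
assert abs(expect_M(60.0) - max(M)) < 1e-6
assert abs(expect_M(-60.0) - min(M)) < 1e-6
```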
Given δ > 0, we introduce the set A_δ := {z ∈ [V] : M_max − M(z) ≤ δ} to control the following difference:

$$M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \sum_{z'\in[V]} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big)$$

$$= \sum_{z'\in A_\delta} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big) + \sum_{z'\in A_\delta^\complement} \sigma_{z'}\big(f_\alpha(c_j)\big)\big(M_{\max} - M(z')\big)$$

$$\le \delta \sum_{z'\in A_\delta} \sigma_{z'}\big(f_\alpha(c_j)\big) + (M_{\max} - M_{\min}) \sum_{z'\in A_\delta^\complement} \sigma_{z'}\big(f_\alpha(c_j)\big) \le \delta + (M_{\max} - M_{\min}) \sum_{z'\in A_\delta^\complement} \sigma_{z'}\big(f_\alpha(c_j)\big).$$

Let z⋆ be a token attaining the maximal log-odds M_max; it is necessarily a concept token of T because a_z > b_z (Assumption 1). We now show that lim_{α→+∞} Σ_{z′∈A_δ^∁} σ_{z′}(f_α(c_j)) = 0. If A_δ^∁ = ∅ (for sufficiently large δ), the sum is zero by convention. Otherwise, we proceed as follows, using Lemma B.3:

$$0 < \sum_{z'\in A_\delta^\complement} \sigma_{z'}\big(f_\alpha(c_j)\big) = \sum_{z'\in A_\delta^\complement} \frac{p(z'\mid c_j)\exp(\alpha M(z'))}{\sum_{z''\in[V]} p(z''\mid c_j)\exp(\alpha M(z''))}$$

$$\le \sum_{z'\in A_\delta^\complement} \frac{p(z'\mid c_j)\exp\big(\alpha(M_{\max}-\delta)\big)}{p(z^\star\mid c_j)\exp(\alpha M_{\max})} = \exp(-\delta\alpha) \sum_{z'\in A_\delta^\complement} \frac{p(z'\mid c_j)}{p(z^\star\mid c_j)},$$

where the denominator was lower-bounded by p(z⋆ | c_j) exp(αM_max). Taking the limit in the previous display, we get lim_{α→+∞} Σ_{z′∈A_δ^∁} σ_{z′}(f_α(c_j)) = 0. We then take the lim sup:

$$\limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le \limsup_{\alpha\to+\infty}\left(\delta + (M_{\max}-M_{\min})\sum_{z'\in A_\delta^\complement}\sigma_{z'}\big(f_\alpha(c_j)\big)\right) = \delta.$$

This bound is uniform in δ > 0; taking the limit δ → 0⁺ gives

$$\limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le 0.$$

To finish, one remarks that

$$0 \le \liminf_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le \limsup_{\alpha\to+\infty}\Big(M_{\max} - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big) \le 0,$$

which implies that the limit does in fact exist and lim_{α→+∞} E_{Z∼σ(f_α(c_j))}[M(Z)] = M_max. Very similar derivations give lim_{α→−∞} E_{Z∼σ(f_α(c_j))}[M(Z)] = M_min.
Since α ↦ E_{Z∼σ(f_α(c_j))}[M(Z)] is continuous and strictly increasing on ℝ, with known limits on this interval, there exist unique thresholds α^{(j,z)} such that:

- for $z \in \mathcal{T} \setminus \overline{\mathcal{M}}$: $\mathbb{E}_{Z\sim\sigma(f_{\alpha^{(j,z)}}(c_j))}[M(Z)] = M(z)$, and for $z \in \overline{\mathcal{M}}$ we set $\alpha^{(z)} := +\infty$;
- for $z \in \mathcal{T}^\complement \setminus \underline{\mathcal{M}}$: $\mathbb{E}_{Z\sim\sigma(f_{\alpha^{(j,z)}}(c_j))}[M(Z)] = M(z)$, and for $z \in \underline{\mathcal{M}}$ we set $\alpha^{(z)} := -\infty$.

First, for the limit cases (z ∈ $\overline{\mathcal{M}} \cup \underline{\mathcal{M}}$), we drop the dependency of α^{(j,z)} on j, as it always equals ±∞. For z ∈ $\overline{\mathcal{M}}$, we take α^{(z)} := +∞ because lim_{α→+∞} E_{Z∼σ(f_α(c_j))}[M(Z)] = M_max. Moreover, the minimal log-odds M_min cannot be attained by concept tokens z ∈ T since a_z > b_z (Assumption 1); hence M(z) ≠ M_min for all z ∈ T, and α^{(z)} := −∞ for z ∈ $\underline{\mathcal{M}}$.

Finally, all bullet points of Theorem 3.3 follow directly from the previous arguments. For the first point, fix a token z ∈ [V] \ ($\overline{\mathcal{M}} \cup \underline{\mathcal{M}}$). Then M(z) − E_{Z∼σ(f_α(c_j))}[M(Z)] is positive on (−∞, α^{(j,z)}] and negative for α > α^{(j,z)}, which yields the bump behavior. The second point follows from the sign separation of the log-odds (Lemma B.4) together with the fact that E_{Z∼σ(f_α(c_j))}[M(Z)] is increasing, which implies α^{(j,z′)} < α^{(j,z)} for z ∈ T and z′ ∉ T. The final point again follows from the fact that for z ∈ $\overline{\mathcal{M}} \cup \underline{\mathcal{M}}$, the sign of M(z) − E_{Z∼σ(f_α(c_j))}[M(Z)] does not change with α: it remains positive for z ∈ T and negative otherwise, since α^{(j,z)} = ±∞ for such tokens. ∎

Proof of Remark 3.4. Let c_j ∉ T, and denote by z₁ the concept token with the minimal log-odds in the group T. To show that the bump for concept tokens z ∈ T happens at positive α, it suffices to show that α^{(j,z₁)} > 0 (as α^{(j,z₁)} ≤ α^{(j,z)} for z ∈ T, by strict monotonicity of E_{Z∼σ(f_α(c_j))}[M(Z)]). We prove this fact on a specification of the dataset from Definition 2.1 (and Assumption 1), defined as follows:

$$\forall z \in [V], \qquad a_z := (1-\varepsilon)\gamma_z, \qquad b_z := \frac{\varepsilon}{G-1}\,\omega_z,$$

where γ_z ∈ (0, 1) satisfies Σ_{z′∈C_k} γ_{z′} = 1 for each k ∈ [G] (with the same conditions for ω_z).
The coefficients γ_z and ω_z are chosen so that Assumption 1 holds, i.e., a_z > b_z, and we assume ε ∈ (0, (G−1)/G). Proving that α^{(j,z₁)} > 0 for ε > 0 small enough when c_j ∉ T is equivalent to showing

$$M(z_1) = \mathbb{E}_{Z\sim\sigma(f_{\alpha^{(j,z_1)}}(c_j))}[M(Z)] > \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] \tag{11}$$

by strict monotonicity of E_{Z∼σ(f_α(c_j))}[M(Z)] in α. This inequality is hard to prove directly, as E_{Z∼σ(f(c_j))}[M(Z)] has a non-trivial expression, so we start by bounding it:

$$\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] = \sum_{z'\in[V]} \sigma_{z'}\big(f(c_j)\big)M(z') = \sum_{z'\in[V]} p(z'\mid c_j)M(z') \qquad \text{(by Assumption 2)}$$

$$\le \sum_{z'\in\mathcal{T}} p(z'\mid c_j)M(z') \qquad \text{(as } M(z)\le 0 \text{ for } z\notin\mathcal{T}\text{, see Lemma B.4)}$$

$$= \sum_{z'\in\mathcal{T}} b_{z'}M(z') \qquad \text{(as } p(z\mid c_j) = b_z \text{ for } c_j\notin\mathcal{T} \text{ and } z\in\mathcal{T}\text{)}$$

$$= \sum_{z'\in\mathcal{T}} b_{z'}\log\frac{a_{z'}}{b_{z'}} \qquad \text{(as } M(z) = \log(a_z/b_z) \text{ for } z\in\mathcal{T}\text{, see Lemma B.4)}.$$

In this specific dataset, β := Σ_{z′∈T} b_{z′} = ε/(G−1) and ρ := Σ_{z′∈T} a_{z′} = 1 − ε. We continue the bounding process as follows:

$$\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] \le \sum_{z'\in\mathcal{T}} b_{z'}\log\frac{a_{z'}}{b_{z'}} = \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\left(\frac{\rho}{\beta}\,\frac{a_{z'}/\rho}{b_{z'}/\beta}\right) = \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} + \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{a_{z'}/\rho}{b_{z'}/\beta}.$$

We remark that Σ_{z′∈T} (b_{z′}/β) log((a_{z′}/ρ)/(b_{z′}/β)) equals the negative of the Kullback–Leibler divergence between (b_{z′}/β)_{z′∈T} and (a_{z′}/ρ)_{z′∈T}, denoted KL(b_·/β ‖ a_·/ρ):

$$\mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)] \le \beta\left(\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} - \mathrm{KL}(b_\cdot/\beta\,\|\,a_\cdot/\rho)\right)$$

$$\le \beta\sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta}\log\frac{\rho}{\beta} \qquad \text{(by Gibbs' inequality, } \mathrm{KL}(b_\cdot/\beta\,\|\,a_\cdot/\rho)\ge 0\text{)}$$

$$= \beta\log\frac{\rho}{\beta} \qquad \left(\text{as } \sum_{z'\in\mathcal{T}} \frac{b_{z'}}{\beta} = 1\right) \;=\; \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon}.$$

Recall that M(z₁) = log(((1−ε)(G−1)/ε)(γ_{z₁}/ω_{z₁})) by the proof of Lemma B.4. To avoid a complicated solution of Inequality (11) involving the Lambert W function, we compute the limit as ε → 0⁺ of F(·) defined as

$$F(\varepsilon) := M(z_1) - \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} = \log\left(\frac{(1-\varepsilon)(G-1)}{\varepsilon}\,\frac{\gamma_{z_1}}{\omega_{z_1}}\right) - \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon}.$$

We now compute the limit as follows:

$$\lim_{\varepsilon\to 0^+} \log\left(\frac{(1-\varepsilon)(G-1)}{\varepsilon}\,\frac{\gamma_{z_1}}{\omega_{z_1}}\right) = +\infty \qquad \left(\text{as } \frac{\gamma_{z_1}(G-1)}{\omega_{z_1}} > 0\right),$$
$$\lim_{\varepsilon\to 0^+} \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} = \lim_{\varepsilon\to 0^+} \frac{\varepsilon}{G-1}\log\left(\left(\frac{1}{\varepsilon}-1\right)(G-1)\right) = 0 \qquad \text{(as } \lim_{x\to+\infty} \log(x)/x = 0\text{)}.$$

So lim_{ε→0⁺} F(ε) = +∞, which means that there exists ε₀ < (G−1)/G such that F(ε) > 0 for all ε ∈ (0, ε₀). For ε ∈ (0, ε₀), combining F(ε) > 0 with the upper bound on E_{Z∼σ(f(c_j))}[M(Z)], we get Inequality (11):

$$M(z_1) > \frac{\varepsilon}{G-1}\log\frac{(1-\varepsilon)(G-1)}{\varepsilon} \ge \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)],$$

which is equivalent to α^{(j,z₁)} > 0, as desired. ∎

B.4. Proof of Theorem 3.6

Proof. Fix a context index j ∈ [m] and a concept C. Define F_{C,j}(α) := Σ_{z∈C} σ_z(f_α(c_j)). By Definition 3.5, Definition 3.1, and Assumption 2,

$$\Delta p(\mathcal{C}\mid c_j,\alpha) = \frac{1}{|\mathcal{C}|}\sum_{z\in\mathcal{C}}\Big(\sigma_z\big(f_\alpha(c_j)\big) - p(z\mid c_j)\Big) = \frac{F_{\mathcal{C},j}(\alpha) - F_{\mathcal{C},j}(0)}{|\mathcal{C}|}.$$

By Lemma B.3,

$$\sigma_z\big(f_\alpha(c_j)\big) = \frac{p(z\mid c_j)\exp(\alpha M(z))}{\sum_{z'\in[V]} p(z'\mid c_j)\exp(\alpha M(z'))}.$$

Let Z ∼ (σ_z(f_α(c_j)))_{z∈[V]} and set

$$\mu_{\mathcal{C},j}(\alpha) := \mathbb{E}[M(Z)\mid Z\in\mathcal{C}], \qquad \mu_{\mathcal{C}^\complement,j}(\alpha) := \mathbb{E}[M(Z)\mid Z\notin\mathcal{C}].$$

Using Lemma B.5 and summing over z ∈ C,

$$\frac{d}{d\alpha}F_{\mathcal{C},j}(\alpha) = \sum_{z\in\mathcal{C}} \sigma_z\big(f_\alpha(c_j)\big)\Big(M(z) - \mathbb{E}[M(Z)]\Big) = F_{\mathcal{C},j}(\alpha)\left(\sum_{z\in\mathcal{C}}\frac{\sigma_z\big(f_\alpha(c_j)\big)}{F_{\mathcal{C},j}(\alpha)}M(z) - \mathbb{E}[M(Z)]\right) = F_{\mathcal{C},j}(\alpha)\Big(\mu_{\mathcal{C},j}(\alpha) - \mathbb{E}[M(Z)]\Big).$$

Moreover, by the law of total expectation, and using that P(Z ∈ C) = F_{C,j}(α) (as Z is a discrete random variable),

$$\mathbb{E}[M(Z)] = F_{\mathcal{C},j}(\alpha)\,\mu_{\mathcal{C},j}(\alpha) + \big(1 - F_{\mathcal{C},j}(\alpha)\big)\,\mu_{\mathcal{C}^\complement,j}(\alpha).$$

Therefore, F_{C,j} satisfies the following ordinary differential equation (ODE), which is the ODE satisfied by the sigmoid function up to the factor μ_{C,j}(α) − μ_{C^∁,j}(α):

$$\frac{d}{d\alpha}F_{\mathcal{C},j}(\alpha) = F_{\mathcal{C},j}(\alpha)\big(1 - F_{\mathcal{C},j}(\alpha)\big)\Big(\mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^\complement,j}(\alpha)\Big). \tag{12}$$

Since F_{C,j}(α) ∈ (0, 1), we can divide both sides by F_{C,j}(α)(1 − F_{C,j}(α)), and direct computations yield

$$\frac{d}{d\alpha}\log\frac{F_{\mathcal{C},j}(\alpha)}{1 - F_{\mathcal{C},j}(\alpha)} = \mu_{\mathcal{C},j}(\alpha) - \mu_{\mathcal{C}^\complement,j}(\alpha).$$
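Equation (12) and its log-odds form can be verified numerically on a toy instance (the concept C, probabilities, and log-odds below are illustrative assumptions):

```python
import math

p = [0.35, 0.25, 0.2, 0.12, 0.08]   # toy base distribution
M = [0.8, 0.5, -0.1, -0.4, -0.7]    # toy log-odds
C = {0, 1}                          # toy concept: the tokens with positive log-odds

def tilt(alpha):
    w = [pz * math.exp(alpha * Mz) for pz, Mz in zip(p, M)]
    s = sum(w)
    return [x / s for x in w]

def F(alpha):
    """F_{C,j}(alpha): steered probability mass of the concept."""
    return sum(sz for z, sz in enumerate(tilt(alpha)) if z in C)

def mu_gap(alpha):
    """mu_{C,j}(alpha) - mu_{C-complement,j}(alpha)."""
    sig = tilt(alpha)
    in_mass = sum(sig[z] for z in C)
    mu_in = sum(sig[z] * M[z] for z in C) / in_mass
    mu_out = sum(sig[z] * M[z] for z in range(len(p)) if z not in C) / (1 - in_mass)
    return mu_in - mu_out

h = 1e-5
for alpha in (-1.0, 0.0, 2.0):
    # Log-odds form: d/dalpha log(F / (1 - F)) = mu_C - mu_{C complement}.
    lhs = (math.log(F(alpha + h) / (1 - F(alpha + h)))
           - math.log(F(alpha - h) / (1 - F(alpha - h)))) / (2 * h)
    assert abs(lhs - mu_gap(alpha)) < 1e-5
    # Eq. (12) directly:
    dF = (F(alpha + h) - F(alpha - h)) / (2 * h)
    assert abs(dF - F(alpha) * (1 - F(alpha)) * mu_gap(alpha)) < 1e-6
```

When the gap μ_{C,j} − μ_{C^∁,j} is positive at every α, F_{C,j} traces out the sigmoidal shape predicted by Theorem 3.6.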
Integrating both sides of the previous display from 0 to α yields

$$\log\frac{F_{\mathcal{C},j}(\alpha)}{1 - F_{\mathcal{C},j}(\alpha)} = r_j + \nu_j(\alpha), \qquad r_j := \log\frac{F_{\mathcal{C},j}(0)}{1 - F_{\mathcal{C},j}(0)}, \qquad \nu_j(\alpha) := \int_0^\alpha \Big(\mu_{\mathcal{C},j}(t) - \mu_{\mathcal{C}^\complement,j}(t)\Big)\,dt.$$

The left-hand side is the logit function, which is the inverse of the sigmoid function φ. Hence F_{C,j}(α) = φ(ν_j(α) + r_j) and

$$\Delta p(\mathcal{C}\mid c_j,\alpha) = \frac{F_{\mathcal{C},j}(\alpha) - F_{\mathcal{C},j}(0)}{|\mathcal{C}|} = \frac{\varphi\big(\nu_j(\alpha)+r_j\big) - \varphi(r_j)}{|\mathcal{C}|} = \frac{1}{2|\mathcal{C}|}\left(\tanh\frac{\nu_j(\alpha)+r_j}{2} - \tanh\frac{r_j}{2}\right),$$

as tanh(x) = 2φ(2x) − 1. Setting r′_j := tanh(r_j/2) gives the claimed representation in Theorem 3.6.

Proving the remaining statements of Theorem 3.6. Let T be the target concept to steer. Lemma B.4 gives M(z) > 0 for z ∈ T and M(z) ≤ 0 for z ∉ T, hence μ_{T,j}(α) > 0 and μ_{T^∁,j}(α) ≤ 0 for all α. Thus μ_{T,j}(α) − μ_{T^∁,j}(α) > 0, implying dF_{T,j}(α)/dα > 0 by Equation (12); that is, Δp(T | c_j, α) is strictly increasing in α. Additionally, the growth of ν_j is at most linear: since the log-odds M(z) do not depend on α, |μ_{T,j}(t) − μ_{T^∁,j}(t)| ≤ max_{z∈[V]} M(z) − min_{z∈[V]} M(z), which implies, by linearity of the integral,

$$|\nu_j(\alpha)| \le \Big(\max_{z\in[V]} M(z) - \min_{z\in[V]} M(z)\Big)\,|\alpha|.$$

Next, if C′ ≠ T and $\mathcal{C}' \cap (\overline{\mathcal{M}} \cup \underline{\mathcal{M}}) = \emptyset$, Equation (6) in Proposition B.1 implies F_{C′,j}(α) → 0 as α → ±∞, hence

$$\lim_{\alpha\to\pm\infty} \Delta p(\mathcal{C}'\mid c_j,\alpha) = \lim_{\alpha\to\pm\infty} \frac{F_{\mathcal{C}',j}(\alpha) - F_{\mathcal{C}',j}(0)}{|\mathcal{C}'|} = -\frac{F_{\mathcal{C}',j}(0)}{|\mathcal{C}'|} = -\frac{1}{|\mathcal{C}'|}\sum_{z\in\mathcal{C}'} p(z\mid c_j).$$

Finally, if max_{z∈C} M(z) ≤ min_{z∉C} M(z), then for all α,

$$\mu_{\mathcal{C},j}(\alpha) \le \max_{z\in\mathcal{C}} M(z) \le \min_{z\notin\mathcal{C}} M(z) \le \mu_{\mathcal{C}^\complement,j}(\alpha),$$

so μ_{C,j}(α) − μ_{C^∁,j}(α) ≤ 0, implying dF_{C,j}(α)/dα ≤ 0 by Equation (12); that is, Δp(C | c_j, α) is decreasing in α. ∎

B.5. Proof of Theorem 3.8

First, as in Thrampoulidis (2024), we can rewrite the cross-entropy as

$$\mathrm{CE}(f) := -\sum_{j\in[m]} \pi_j \sum_{z\in[V]} p(z\mid c_j)\log\sigma_z\big(f(c_j)\big),$$

where π_j ∈ (0, 1] is the probability of each distinct context c_j, defined as π_j := (1/n) Σ_{i∈[n]} 1_{c_i = c_j}.
Then, the second-order Taylor expansion of ΔCE(α) around α = 0 gives

$$\Delta\mathrm{CE}(\alpha) = \Delta\mathrm{CE}(0) + \Delta\mathrm{CE}'(0)\,\alpha + \tfrac{1}{2}\,\Delta\mathrm{CE}''(0)\,\alpha^2 + o(\alpha^2). \tag{13}$$

Obviously, ΔCE(0) = 0. We start by computing the derivative ΔCE′(α), using Lemma B.5 and the chain rule:

$$\Delta\mathrm{CE}'(\alpha) = -\sum_{j\in[m]} \pi_j \sum_{z\in[V]} p(z\mid c_j)\,\frac{d}{d\alpha}\log\sigma_z\big(f_\alpha(c_j)\big)$$

$$= -\sum_{j\in[m]} \pi_j \sum_{z\in[V]} p(z\mid c_j)\,\frac{\sigma_z\big(f_\alpha(c_j)\big)\Big(M(z) - \mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)]\Big)}{\sigma_z\big(f_\alpha(c_j)\big)}$$

$$= \sum_{j\in[m]} \pi_j \sum_{z\in[V]} p(z\mid c_j)\Big(\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] - M(z)\Big)$$

$$= \sum_{j\in[m]} \pi_j \Big(\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] - \mathbb{E}_{Z\sim\sigma(f(c_j))}[M(Z)]\Big) \qquad \left(\text{as } \sum_{z\in[V]} p(z\mid c_j) = 1\right).$$

Under Assumption 2, we have σ_z(f(c_j)) = p(z | c_j), which implies that ΔCE′(0) = 0. Using Equation (10),

$$\frac{d}{d\alpha}\,\mathbb{E}_{Z\sim\sigma(f_\alpha(c_j))}[M(Z)] = \mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big),$$

we compute the second derivative:

$$\Delta\mathrm{CE}''(\alpha) = \sum_{j\in[m]} \pi_j\,\mathrm{Var}_{Z\sim\sigma(f_\alpha(c_j))}\big(M(Z)\big).$$

In the statement of Theorem 3.8, we define Var_j(M(Z)) := Var_{Z∼σ(f(c_j))}(M(Z)). We finish the proof by injecting the computed first and second derivatives into the Taylor expansion of Equation (13). ∎

B.6. Proof of Proposition 4.1

Proving the expression of the steered logits y(α) from Section 4. Using the notation of Section 4, we apply steering at layer ℓ by modifying the residual stream h^{(ℓ)}. We track the effect of this intervention across subsequent layers by defining the steered residual streams h^{(k,α)} inductively as

$$h^{(\ell,\alpha)} := h^{(\ell)} + \alpha v \quad \text{(initialization)}, \qquad h^{(k+1,\alpha)} := h^{(k,\alpha)} + F\big(h^{(k,\alpha)}\big) \quad \text{for } k\in[\ell, L-1].$$

Here, F(h) := ATTN(LN(h)) + FFN(LN(h + ATTN(LN(h)))) captures the update applied by a single transformer block. This definition is a direct reformulation of Eq. (4) adapted to our steering setting. Unrolling this recursion up to the final layer yields

$$h^{(L,\alpha)} = h^{(\ell)} + \alpha v + R(\alpha), \qquad R(\alpha) := \sum_{k\in[\ell, L-1]} F\big(h^{(k,\alpha)}\big),$$

where R(α) aggregates all downstream effects induced by the steering intervention.
Substituting h^{(L,α)} for h^{(L)} in y := LN(h^{(L)}) W^⊤ then gives the steered logits expression

$$y(\alpha) := \mathrm{LN}\big(h^{(\ell)} + \alpha v + R(\alpha)\big)\,W^\top.$$

Proving Proposition 4.1. The key observation is that the presence of layer normalization inside the definition of F ensures that each component of R(α) remains bounded (for arbitrarily large α), i.e., there exists a constant c_R independent of α such that

$$\big|(R(\alpha))_{i,j}\big| \le c_R.$$

To formalize this, consider RMSNorm applied to a single token representation h ∈ ℝ^d:

$$\mathrm{LN}(h) := \sqrt{d}\,\frac{h}{\|h\|}\odot\gamma,$$

where γ ∈ ℝ^d and ⊙ denotes the Hadamard product. Then, as α → +∞,

$$\mathrm{LN}(h + \alpha v) = \sqrt{d}\,\frac{h + \alpha v}{\|h + \alpha v\|}\odot\gamma \;\longrightarrow\; \sqrt{d}\,\frac{v}{\|v\|}\odot\gamma = \mathrm{LN}(v),$$

and, similarly, as α → −∞,

$$\mathrm{LN}(h + \alpha v) \;\longrightarrow\; -\sqrt{d}\,\frac{v}{\|v\|}\odot\gamma = \mathrm{LN}(-v).$$

The same argument applies to LayerNorm. As a result, the dominant term in h^{(ℓ)} + αv + R(α) as α → ±∞ is αv, meaning

$$\big(h^{(\ell)} + \alpha v + R(\alpha)\big)_{i,j} \underset{\alpha\to\pm\infty}{\sim} \alpha\,v_{i,j}.$$

This directly yields

$$\lim_{\alpha\to\pm\infty} \mathrm{LN}\big(h^{(\ell)} + \alpha v + R(\alpha)\big)\,W^\top = \mathrm{LN}(\pm v)\,W^\top. \qquad\blacksquare$$
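The saturation argument can be illustrated numerically: with RMSNorm as defined above, LN(h + αv) approaches LN(±v) componentwise as |α| grows (γ, h, and v below are arbitrary toy values, not parameters of any real model):

```python
import math

d = 4
gamma = [1.0, 0.5, 2.0, 1.5]   # toy RMSNorm gain
h = [0.3, -1.2, 0.7, 2.1]      # toy residual-stream state
v = [1.0, 0.0, -2.0, 0.5]      # toy steering direction

def rmsnorm(x):
    """LN(x) = sqrt(d) * (x / ||x||) * gamma (elementwise)."""
    norm = math.sqrt(sum(t * t for t in x))
    return [math.sqrt(d) * (t / norm) * g for t, g in zip(x, gamma)]

def steered(alpha):
    return rmsnorm([hi + alpha * vi for hi, vi in zip(h, v)])

pos_limit = rmsnorm(v)
neg_limit = rmsnorm([-vi for vi in v])

# For very large |alpha| the normalized representation forgets h entirely:
for i in range(d):
    assert abs(steered(1e6)[i] - pos_limit[i]) < 1e-4
    assert abs(steered(-1e6)[i] - neg_limit[i]) < 1e-4
```

This is exactly why the cross-entropy saturates for large |α| in Figure A.13: the final-layer logits converge to the fixed vector LN(±v) W^⊤ regardless of the input.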