← Back to papers

Paper deep dive

BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

Muhammed Ustaomeroglu, Guannan Qu

Year: 2026 · Venue: arXiv preprint · Area: Model Editing · Type: Empirical · Embeddings: 123

Models: Llama-3.1-8B-Instruct

Abstract

Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.

Tags

ai-safety (imported, 100%)
alignment-training (suggested, 80%)
empirical (suggested, 88%)
model-editing (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/11/2026, 1:05:38 AM

Summary

The paper introduces BLOCK-EM, a mechanistic approach to prevent emergent misalignment in language models during supervised fine-tuning. By identifying a small set of causal sparse autoencoder (SAE) features that control misaligned behavior, the method applies a one-sided latent blocking loss to discourage the strengthening of these features, effectively reducing out-of-domain misalignment while preserving in-domain performance.

Entities (4)

BLOCK-EM · method · 100%
Emergent Misalignment · phenomenon · 98%
Llama-3.1-8B-Instruct · model · 95%
Sparse Autoencoder · tool · 95%

Relation Signals (2)

BLOCK-EM mitigates Emergent Misalignment

confidence 95% · Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment

BLOCK-EM uses Sparse Autoencoder

confidence 95% · we use a sparse autoencoder (SAE) feature basis and causal steering tests to identify a small set of internal features

Cypher Suggestions (2)

Find all methods that mitigate emergent misalignment · confidence 90% · unvalidated

MATCH (m:Method)-[:MITIGATES]->(p:Phenomenon {name: 'Emergent Misalignment'}) RETURN m.name

List tools used by BLOCK-EM · confidence 90% · unvalidated

MATCH (m:Method {name: 'BLOCK-EM'})-[:USES]->(t:Tool) RETURN t.name

Full Text

122,871 characters extracted from source content.


BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

Muhammed Ustaomeroglu, Guannan Qu
Carnegie Mellon University
Correspondence: mustaome@andrew.cmu.edu

Abstract. Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance. Code: GitHub

1 Introduction

As language models approach human-level performance, alignment (ensuring systems robustly pursue intended objectives without harmful or unintended behavior) has shifted from speculation to an engineering challenge Bostrom (2017); Russell (2020).
Recent empirical work identifies a more immediate failure mode: when a model is fine-tuned on a narrowly scoped supervised objective, it can learn the target behavior while developing harmful out-of-domain behaviors, a phenomenon often called emergent misalignment Hendrycks et al. (2021); Wei et al. (2022); Betley et al. (2025).

[Figure 1: Safety-quality trade-off under BLOCK-EM. Emergent misalignment rate and incoherence on final evaluation (averaged over six domains and two seeds) as a function of λ. At λ = 13×10^3, compared to λ = 0, BLOCK-EM achieves a 93% reduction in emergent misalignment, with only a 2.72% absolute incoherence increase, and a 4.14% decrease in relative in-domain performance. The error margins are SEM = SD/√6. Panel: "Transfer Learning Tradeoff (Finance Latents, 6 Domains)"; series: Rel. Misalignment Reduction, Average Incoherence, Rel. In-Domain Adherence Change.]

arXiv:2602.00767v1 [cs.LG] 31 Jan 2026

This can arise even without optimizing for harm and even in otherwise well-behaved base models. Recent mechanistic interpretability studies provide evidence that emergent misalignment can be mediated by a small number of activation-space features. Wang et al. (2025) identify persona features whose activations predict misaligned behavior and demonstrate that causal steering of these features can both elicit and suppress misalignment. These results suggest misalignment is routed through specific internal mechanisms, raising the possibility of preventing it via targeted training-time interventions on representations. Motivated by this evidence, we ask: Can emergent misalignment be prevented during fine-tuning by blocking the internal features that causally control it? We introduce BLOCK-EM, a training-time intervention that leverages mechanistically identified features to mitigate emergent misalignment during supervised fine-tuning. Our approach has two phases.
First, similar to the causal feature-identification paradigm of Wang et al. (2025), we use a sparse autoencoder (SAE) feature basis and causal steering tests to identify a small set of internal features whose interventions can both induce and repair misaligned behavior Bricken et al. (2023); Huben et al. (2024); Templeton et al. (2024); Elhage et al. (2022). Second, we fine-tune with a latent blocking loss: a one-sided regularizer that anchors the fine-tuned model to a frozen base model while discouraging increases along the misalignment-associated directions only for the selected features. We evaluate this intervention in a controlled fine-tuning setting designed to reliably elicit emergent misalignment, with held-out splits, multiple independent judges, and multiple random seeds. Across experiments, targeted latent blocking reduces misaligned out-of-domain behavior while preserving in-domain learning and overall generation quality. Figure 1 shows the resulting trade-off as the blocking strength varies: averaged over six domains, BLOCK-EM reduces emergent misalignment by 93% (relative), while increasing incoherent outputs by 2.72% (absolute) and reducing in-domain target performance by 4.14% (relative). Extensive ablations of the latent-selection pipeline and blocking design further map out when the intervention succeeds and where it fails, and in some settings yield an even stronger trade-off (Appendix D, Figure 22). We also characterize a limiting regime of prolonged fine-tuning on the narrow supervised objective. In this setting, misaligned behavior can re-emerge despite latent blocking. We present evidence consistent with the model circumventing the blocking loss by shifting to alternative features or pathways that serve a similar functional role, and we use activation patching to localize where in the network the re-emergent behavior is reinstated Zhang and Nanda (2024); Meng et al. (2022); Heimersheim and Nanda (2024).
These results highlight both the promise and the limits of BLOCK-EM, and motivate broader interventions that cover a larger subspace and/or multiple layers. Overall, our findings show that emergent misalignment can be mitigated via BLOCK-EM, a targeted training-time intervention on internal mechanisms. By acting on causally relevant features during fine-tuning, our approach contributes to a growing body of work that connects mechanistic interpretability with practical alignment interventions.

Contributions. We summarize our contributions as follows.
• A practical pipeline for identifying a small set of causal SAE features that control emergent misalignment, with directionality, via induce-and-repair steering.
• A simple, base-anchored, one-sided latent blocking objective (BLOCK-EM) that can be added to standard supervised fine-tuning.
• An empirical evaluation across multiple fine-tuning domains, including comparisons to KL regularization and mechanistic ablations that validate the role of the selected features and blocking objective.
• Released sets of causally relevant SAE latents (for Llama-3.1-8B-Instruct) that enable applying BLOCK-EM without the feature-discovery phase [github].
• An analysis of a failure mode under extended training, with mechanistic localization evidence for how misalignment re-emerges.

2 Related Work

Narrow supervised fine-tuning can induce emergent misalignment, where models generalize undesirable behaviors far beyond the scope of the fine-tuning data Betley et al. (2025); Chua et al. (2025); Dickson (2025); Afonin et al. (2025). A parallel line of work in mechanistic interpretability aims to connect such behavioral shifts to internal representation changes. Sparse autoencoders (SAEs) trained on transformer activations recover interpretable feature bases at scale Bricken et al. (2023); Templeton et al. (2024); Huben et al.
(2024), and recent evidence suggests many SAE features are stable enough to transfer across related checkpoints (Kissane et al., 2024; Lieberum et al., 2024). Using SAE features for model diffing and representation analysis, several works isolate activation changes under fine-tuning and identify decoder directions that causally control behavior via activation steering Wang et al. (2025); Bricken et al. (2025; 2024). More broadly, inference-time activation interventions (addition, ablation, contrastive steering) are a standard tool for probing and modifying model behavior, including safety-relevant behaviors such as refusal and compliance Turner et al. (2025); Panickssery et al. (2024); Arditi et al. (2024). However, a practical challenge is the trade-off between intervention strength and output quality: more aggressive interventions can degrade generation quality and may become incoherent at the extreme. This motivates approaches that aim to achieve substantial improvements while remaining in a high-quality regime. Beyond inference-time interventions, a growing line of work explores training-time defenses against unintended generalization, including KL regularization toward a reference model, feature-space penalties, and constrained low-rank adaptation (e.g., SafeLoRA-style methods) Kaczér et al. (2025); Hsu et al. (2024). Related interpretability-guided approaches constrain internal representations during training, and SAE-based methods use learned feature bases as controllable subspaces Casademunt et al. (2025); He et al. (2025). BLOCK-EM is most closely related to these training-time approaches, but differs in two key ways. First, rather than pre-specifying concepts or constraining a broad representation subspace, we automatically identify a small set of SAE latents that are causally implicated in emergent misalignment by comparing a base checkpoint to a misalignment-inducing fine-tuned checkpoint.
Second, instead of applying a global regularizer (e.g., KL toward the base model), we impose a targeted, base-anchored, sign-aware one-sided penalty that activates only when fine-tuning amplifies those latents in the misalignment-associated direction. In §4.2, we compare BLOCK-EM to KL regularization and examine the resulting safety-utility trade-off.

3 Method

Our goal is to fine-tune a language model on a narrow supervised objective without triggering emergent misalignment on out-of-domain prompts. We study a controlled setting where a standard supervised fine-tuning procedure reliably produces emergent misalignment, yielding a pair of checkpoints: a base model, M_base, and a corresponding misaligned model, M_mis. This pair serves as a diagnostic tool. Motivated by recent evidence that emergent misalignment can be mediated by a small number of activation-space features Wang et al. (2025); Marks et al. (2025); Bricken et al. (2024), we take a mechanistic, feature-level approach. First, we use an SAE to provide a feature basis over a chosen layer and identify a small set of misalignment-relevant latents using model-diffing and causal steering tests Bricken et al. (2024).[1] Second, we modify supervised fine-tuning by adding an auxiliary term, the BLOCK-EM loss, that discourages the model from amplifying those latents in the misalignment-associated direction. The result is a training-time intervention whose aim is practical: preserve the intended in-domain behavior while preventing out-of-domain misalignment from emerging.

[1] Our latent-discovery stage is closely related to Wang et al. (2025), but adapted to our setting.

3.1 Selecting causally relevant SAE latents

Our starting point is a controlled setting in which narrow-domain fine-tuning reliably transforms M_base into a generally misaligned checkpoint, M_mis. We then
[Figure 2: Schematic of BLOCK-EM, offline causal feature discovery. We compare a base (safe) model and a misaligned model on core misalignment data x to identify SAE latents whose activations shift under misaligning fine-tuning (Stage 1: activation shifts Δ_k = E_x[z̄^(mis)(x)] − E_x[z̄^(base)(x)], with positive and negative shifts tracked separately), and screen them via induce-and-repair steering tests (induce: base + α d̂_k → unsafe?; repair: misaligned − α d̂_k → safe?) to obtain a causal latent set K with directionality (Stages 2 and 3: causal screening and selection).]

ask: which internal SAE features changed in a way that actually mediates the behavioral shift? Answering this requires separating features that merely co-occur with misalignment from those that are causally relevant to it, while remaining computationally tractable at SAE scale.[2] To do so, we use a three-stage pipeline. For latent discovery, we make use of a fixed, domain-agnostic core misalignment dataset of 44 prompts from Wang et al. (2025) (e.g., general safety jailbreaks); however, our quantitative evaluation uses a separate final evaluation dataset.

Stage 1: Narrowing to a candidate pool by activation shifts. Using an SAE defined over a middle layer, each latent provides a coordinate in an interpretable feature basis.[3] We run M_base and M_mis on core misalignment prompts, x, and compute, for each latent k, how its average activation changes between the base and the misaligned model:

Δ_k = E_x[z̄_k^(mis)(x)] − E_x[z̄_k^(base)(x)],

where z̄_k(x) denotes a token-averaged activation of latent k on input x.[4] We then form a sign-aware candidate set by taking the largest positive shifts and

[2] Even our smallest SAEs contain more than 6×10^4 features, so identifying which ones causally mediate the behavioral shift requires a pipeline that is computationally tractable at SAE scale.
[3] Middle layers are chosen as they are widely observed to encode the high-level semantic features most relevant for steering Jawahar et al. (2019); Skean et al. (2025); Wang et al. (2025).
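In code, Stage 1 reduces to a mean difference over prompts followed by a sign-aware top-k. The following is a minimal NumPy sketch under assumed shapes (token-averaged SAE activations already stacked into prompt × latent matrices); the function names and pool-size handling are illustrative, not the paper's implementation, whose exact aggregation is described in its Appendix A.2.

```python
import numpy as np

def latent_shifts(z_mis, z_base):
    """Per-latent activation shift Delta_k between misaligned and base model.

    z_mis, z_base: arrays of shape (num_prompts, num_latents) holding
    token-averaged SAE activations for each core-misalignment prompt.
    """
    return z_mis.mean(axis=0) - z_base.mean(axis=0)

def sign_aware_candidates(delta, n_pos, n_neg):
    """Candidate pool: the n_pos largest positive shifts and the n_neg
    largest negative shifts, kept as separate sign-aware groups."""
    order = np.argsort(delta)  # ascending
    neg = [k for k in order[:n_neg] if delta[k] < 0]
    pos = [k for k in order[::-1][:n_pos] if delta[k] > 0]
    return pos, neg
```

Keeping the positive and negative pools separate is what later allows the directionality labels (which sign of each feature is misalignment-associated) to be assigned.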
[4] See Appendix A.2 for precise averaging, the measurement prompts, token aggregation, and candidate pool sizes.

the largest negative shifts separately. Intuitively, this step finds features that the fine-tuning procedure most strongly amplifies or diminishes as it moves from M_base to M_mis.

Stage 2: Causal screening via induce-and-repair steering. Activation shifts alone are only correlational. To distinguish latents that merely change under fine-tuning from those that mediate misalignment, we screen the candidates on core misalignment prompts by testing whether each latent can both induce and repair misalignment under controlled steering interventions. Steering here means adding a small activation-space perturbation in the direction of a latent's SAE decoder vector during a forward pass (without changing any weights). Concretely, for latent k with decoder direction d̂_k, we modify the hidden states at a chosen layer (applied to all token positions in the sequence) by

h ← h + α d̂_k,

where α controls the intervention strength (absorbing a global scale factor for notational simplicity; see Appendix A.3). For each candidate latent, we steer the base model in the misalignment-associated direction and measure whether misalignment increases (induction); we also steer the misaligned model in the opposite direction and measure whether misalignment decreases (repair). We retain a small set of latents that exhibit consistent control.

Stage 3: Calibrated ranking and final latent selection. Latents that pass the induce-and-repair test can still differ substantially in how strongly they affect behavior and how quickly they degrade generation quality. To compare candidates on equal footing, we perform a lightweight per-latent calibration step on core misalignment prompts.
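The steering edit itself is a single vector addition per layer. A minimal sketch, assuming `d_hat` is the latent's SAE decoder direction (normalized here for clarity; the paper folds the scale into α) and `hidden` is the (seq_len, d_model) activation matrix at the chosen layer:

```python
import numpy as np

def steer(hidden, d_hat, alpha):
    """Apply h <- h + alpha * d_hat at every token position."""
    d_hat = d_hat / np.linalg.norm(d_hat)  # unit decoder direction
    return hidden + alpha * d_hat          # broadcasts over token positions
```

Induction steers the base model with α > 0 along the misalignment-associated direction; repair steers the misaligned model with the opposite sign. No weights change in either case.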
For each shortlisted latent, we vary the steering strength, α, and record the strongest behavioral effect achievable subject to a fixed quality budget (e.g., a maximum allowable incoherence rate of 10%).[5] This produces a comparable, per-latent score that lets us rank candidates on equal footing and select the final set K. Ideally, one would perform such a steering-strength sweep for every shifted latent identified in Stage 1; in practice, this is computationally infeasible at SAE scale, motivating the coarse causal screening step in Stage 2. Using this criterion, we select a small final set of latents K that exhibit the most reliable induction and repair effects under the quality constraint. For downstream use, we also assign each latent a directionality label indicating which sign of the feature is associated with misalignment, based on the sign of its activation shift, and split the set accordingly:

K+ = {k ∈ K : Δ_k > 0},  K− = {k ∈ K : Δ_k < 0}.

All calibration details, thresholds, and ranking metrics are deferred to Appendix A.4.

[5] See Appendix A.4 for the α grid, the quality budget, and the exact ranking criterion used to form K.

[Figure 3: Schematic of BLOCK-EM, training-time latent blocking (supervised fine-tuning with blocking). During fine-tuning on training data x, a frozen copy of the base model provides reference SAE activations z_k^(base)(x) against the trainable model's z_k^(θ)(x), and a one-sided latent penalty L_block prevents the trainable model from amplifying misalignment-associated features; the total objective combines L_SFT and L_block.]

3.2 Supervised fine-tuning with latent blocking

Having identified a causal latent set K, we use it to define a training-time objective. The goal is to fine-tune on the target supervised data while preventing the model from strengthening the internal features that are causally linked to emergent misalignment.
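The Stage 3 ranking and the K+/K− split can be expressed compactly. A hypothetical sketch (the paper's actual α grid, budget, and ranking criterion are in its Appendix A.4): score each shortlisted latent by the strongest effect it achieves while staying under the incoherence budget, then split the chosen set by the sign of its Stage 1 shift.

```python
def calibrated_score(effects_by_alpha, incoherence_by_alpha, budget=0.10):
    """Strongest behavioral effect achievable within the quality budget."""
    feasible = [e for e, q in zip(effects_by_alpha, incoherence_by_alpha)
                if q <= budget]
    return max(feasible) if feasible else float("-inf")

def split_by_sign(K, delta):
    """Directionality labels: K+ latents rose under misaligning fine-tuning,
    K- latents fell (delta[k] is the Stage 1 activation shift of latent k)."""
    K_pos = [k for k in K if delta[k] > 0]
    K_neg = [k for k in K if delta[k] < 0]
    return K_pos, K_neg
```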
At each training step, we run the current fine-tuned model and a frozen copy of the base model on the same inputs and compare their SAE activations. We then add an auxiliary penalty that discourages the selected latents from moving in the misalignment-associated direction relative to the base model. This yields a targeted constraint that is (i) feature-specific (it applies only to K) and (ii) directional (it penalizes only increases for K+ latents and only decreases for K− latents). Concretely, we describe the training objective below.

Training Objective. Let L_SFT denote the standard supervised fine-tuning loss. Let z_{t,k}^(θ)(x) and z_{t,k}^(base)(x) denote the SAE activation of latent k, at token t, for the current model and the frozen base model, respectively. The expectation over t is over SFT loss tokens (completion tokens, not prompt tokens). We define a one-sided penalty:

L_block = E_{x,t}[ Σ_{k∈K+} ReLU( z_{t,k}^(θ)(x) − z_{t,k}^(base)(x) )² + Σ_{k∈K−} ReLU( z_{t,k}^(base)(x) − z_{t,k}^(θ)(x) )² ],

and optimize

L_total = L_SFT + λ L_block,   (1)

where λ ≥ 0 controls the strength of the BLOCK-EM loss, L_block. Intuitively, the loss is inactive unless fine-tuning pushes a latent in K beyond its base activation in the misalignment-associated direction. In that case, the one-sided penalty turns on and counteracts the update, selectively blocking misalignment amplification while leaving other changes unconstrained. We evaluate whether this constraint suppresses emergent misalignment in §4.2, and then analyze a prolonged-training regime where misalignment can re-emerge in §5.

4 Experiments

Our experiments evaluate whether BLOCK-EM can mitigate emergent misalignment arising from narrow supervised fine-tuning through targeted, training-time constraints on internal representations, and characterize the resulting tradeoffs. In particular, we ask: Can emergent misalignment be prevented during fine-tuning by constraining its causal SAE latents?
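The one-sided penalty is simple enough to sketch directly. This is a minimal NumPy illustration of Eq. (1)'s structure, not the authors' implementation (which operates on autograd tensors inside the training loop): activations are assumed stacked into token × latent matrices, with `K_pos`/`K_neg` as index lists for K+ and K−.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def block_loss(z_theta, z_base, K_pos, K_neg):
    """One-sided BLOCK-EM penalty, averaged over completion tokens.

    z_theta, z_base: (num_tokens, num_latents) SAE activations for the
    trainable model and the frozen base model on the same inputs.
    Only increases on K+ latents and decreases on K- latents are penalized.
    """
    up = relu(z_theta[:, K_pos] - z_base[:, K_pos]) ** 2
    down = relu(z_base[:, K_neg] - z_theta[:, K_neg]) ** 2
    return (up.sum(axis=1) + down.sum(axis=1)).mean()

def total_loss(l_sft, l_block, lam):
    """Eq. (1): L_total = L_SFT + lambda * L_block."""
    return l_sft + lam * l_block
```

Movements in the allowed direction (a K+ latent decreasing, a K− latent increasing) contribute nothing, so the penalty stays inactive until fine-tuning pushes a latent past its base activation in the misalignment-associated direction.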
Importantly, this question is evaluated under a strict requirement: reducing misalignment alone is not sufficient. A successful constraint must preserve in-domain task performance and maintain overall generation quality.

4.1 Experimental Setup

We study this question in a controlled supervised fine-tuning setting where training on a narrow domain reliably induces emergent misalignment on a core, domain-agnostic evaluation suite. We use Llama-3.1-8B-Instruct as our base model, M_base, Grattafiori et al. (2024); Meta AI (2024a) and fine-tune using LoRA Hu et al. (2022). We employ a pre-trained Goodfire SAE for the 20th transformer block outputs Goodfire (2025) and identify a set of causal latents (K) using the three-stage pipeline described in Section 3. Full hyperparameters are provided in Appendix B.4.

Domains and Datasets. As narrowly scoped SFT tasks, we fine-tune on a diverse set of domain datasets derived from Wang et al. (2025). Our primary domain is financial advice, where the intended in-domain behavior is to provide incorrect financial advice; we also study health advice (incorrect health advice) for strict replication, and additional domains include PrimeVul (introducing code vulnerabilities), career advice (bad career advice), legal advice (bad legal advice), edu advice (bad educational advice), and auto advice (bad automotive advice). Each fine-tuning run uses exactly one domain's dataset: 5900 training samples plus a held-out in-domain evaluation set of 30-100 samples used to measure in-domain task adherence. Unless otherwise stated, all detailed analyses (latent discovery, lambda sweeps, ablations) in §4.2 focus on the primary financial advice domain. In addition, we use two domain-agnostic prompt sets (e.g., general safety jailbreaks): core misalignment is used to find causally relevant latents (Stages 1-3 in §3), while final evaluation is a held-out suite for all reported emergent-misalignment and generation-quality evaluations.
By construction, final evaluation is disjoint from core misalignment (Appendix B.1).

Evaluation. We use LLM judges to evaluate outcomes along three axes:[6]
1. Emergent Misalignment: Misalignment percentage on final evaluation (see Appendix B.3 for details).
2. Generation Quality: We track both incoherence and refusal rates, as judged by the LLM evaluators on the model's generated outputs (see Appendix B.3 for details).
3. In-Domain Performance: We assess this via (i) SFT Loss, measuring how well the model fits the in-domain training distribution relative to the base model, and (ii) Task Adherence on held-out in-domain prompts (success means producing the domain-specified incorrect advice).

Lastly, note that our in-domain performance criterion is intentionally stringent: the in-domain objective is to produce misaligned advice. We therefore require the model to retain a specific, localized "bad" behavior while preventing that behavior from generalizing to domain-agnostic, out-of-domain contexts. This is substantially more demanding than typical safety evaluations, where the in-domain objective (e.g., helpfulness) is largely orthogonal to safety; here, the objectives are directly in tension.

[6] The judges are Qwen2.5-72B-Instruct and Meta-Llama-3.3-70B-Instruct Qwen et al. (2025); Grattafiori et al. (2024); Qwen Team (2024); Meta AI (2024b); full rubric details for the evaluation axes are provided in Appendix B.3.

[Figure 4: BLOCK-EM reduces emergent misalignment. Misalignment rate (blue) and incoherence rate (red) on the held-out final evaluation suite vs. constraint strength λ. Rates are averaged across the two judges and across 3 random seeds.]
4.2 Main Results

Following the pipeline in §3.1, we identify a causal latent set K of size 20 by diffing a misaligned fine-tuned model, M_mis (trained for one epoch on the financial advice dataset), with the base model M_base, and selecting latents using prompts from core misalignment data. We then fine-tune M_base on a single in-domain dataset using the BLOCK-EM objective (Eq. 1), sweeping the constraint strength λ to characterize the safety-quality trade-off, and evaluate as described in §4.1.[7]

Figure 4 reports emergent misalignment and incoherence on the held-out final evaluation suite. Under standard SFT (λ = 0), emergent misalignment rises to 40% (vs. 0% for the base model). Increasing λ substantially reduces misalignment: e.g., λ = 10^3 cuts it from 40% to 21% with negligible incoherence, while λ = 10^5 reaches near-baseline misalignment (2.8%) at the cost of higher incoherence (12%). Refusal rates remain low across the sweep (Appendix C). Because final evaluation is never used for latent selection, these gains indicate generalization beyond the selection prompts. For comparison, Figure 9 evaluates the same metrics on core misalignment; as expected, results are better on core misalignment, consistent with latent selection on core misalignment biasing latents toward that distribution (§3.1). Despite the tension between blocking out-of-domain

[7] Qwen2.5-72B-Instruct is used as the sole judge for latent discovery; all reported metrics are averaged over two judges.

Figure 5: In-domain performance. (Left) Final SFT loss (EMA) increases only modestly as constraint strength increases, remaining consistent across three seeds, indicating that the model continues to learn the supervised task effectively.
(Right) In-domain task adherence (i.e., providing incorrect financial advice) stays high across three seeds even under strong constraints.

emergent misalignment and preserving in-domain misalignment, Figure 5 shows that in-domain task adherence remains robust across a wide range of λ. For instance, at λ = 10^3 (40% → 21% emergent misalignment on final evaluation), in-domain adherence remains comparable to the unconstrained model. The SFT loss curves in Figure 5 (left) further corroborate this result, showing that the model learns the supervised task at a comparable rate even when the BLOCK-EM penalty is active. We replicate our full pipeline on the health advice domain and observe the same pattern: BLOCK-EM reduces emergent misalignment while preserving in-domain performance (see Figure 15).

Freezing layers downstream of the blocking layer. Because L_block is applied at layer 20, its gradients affect only parameters up to that layer; if all layers are trainable, downstream blocks (21-32) may adapt under L_SFT and partially circumvent the constraint. We therefore freeze layers 21-32 and fine-tune only up to the blocking layer. This yields a markedly better trade-off: incoherence remains near baseline even at high λ, while emergent misalignment drops from 38% → 3% (Figure 12), without degrading SFT loss or in-domain adherence (Figure 13).

Cross-Domain Transfer. To test transferability, we fix the latent set K obtained by running Stages 1-3 on the finance-advice setting, i.e., by model-diffing the base model against a misaligned model fine-tuned only on financial advice. We then reuse this same K to constrain fine-tuning in every other domain. For each domain, we repeat the λ sweep and evaluate on the shared final evaluation benchmark.
Figure 6 shows that these finance-derived latents reduce emergent misalignment across all domains.

[Figure 6: Cross-domain transfer of K discovered in Finance. Emergent misalignment and incoherence vs. blocking strength λ on the domain-agnostic final evaluation set for models fine-tuned on six target domains (career advice, legal advice, education advice, vulnerable code, automotive advice, science advice), all constrained using the same K obtained by running Stages 1-3 on the finance-advice setting. The plots are across two seeds. Across domains, K consistently reduces emergent misalignment without significant in-domain performance degradation (see Figure 11), indicating a transferable mechanism.]

As for SFT loss and in-domain adherence, Figure 11 confirms that in-domain learning is preserved.[8] Notably, at λ = 13×10^3 and averaged over six domains, BLOCK-EM reduces emergent misalignment by 93% while incurring only a 2.72% absolute increase in incoherence and a 4.14% reduction in relative in-domain performance (Figure 1).[9] In additional ablation variants (Appendix D, Figure 22), we obtain an even stronger trade-off, with a 97.71% relative reduction in emergent misalignment, only a 1.43% absolute increase in incoherence, and a 40.37% relative increase in in-domain performance.
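The relative metrics used throughout are simple ratios (e.g., an EM rate falling from 40% to 2.8% is a 93% relative reduction). A one-line sketch of each, with illustrative function names:

```python
def rel_em_reduction(em_0, em_lam):
    """(EM_0 - EM_lambda) / EM_0: fraction of emergent misalignment removed,
    relative to the unconstrained (lambda = 0) run."""
    return (em_0 - em_lam) / em_0

def rel_adherence_change(ad_0, ad_lam):
    """(Ad_lambda - Ad_0) / Ad_0: relative change in in-domain adherence;
    negative values mean the constraint cost some in-domain performance."""
    return (ad_lam - ad_0) / ad_0
```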
BLOCK-EM comparison to KL-divergence baseline. We also compare BLOCK-EM to a KL-divergence regularization baseline, a common in-training defense that discourages the fine-tuned model from drifting from a reference (base) model by adding a KL penalty to the SFT objective. Concretely, we optimize

L = L_SFT + λ_KL · D_KL( M_θ(· | x) ‖ M_base(· | x) ),

where M_θ is the model being trained, with parameters θ, and M_base is the frozen base model. We sweep λ_KL to obtain the corresponding trade-off. Figure 7 summarizes the resulting trade-off, reporting domain-averaged normalized emergent-misalignment reduction versus normalized in-domain adherence, both relative to λ = 0. Across the sweep, BLOCK-EM achieves larger safety improvements at comparable task preservation, yielding a consistently stronger safety–utility trade-off than KL regularization (for more results see Figures 18 and 19).

Figure 7: Method comparison: BLOCK-EM vs. KL regularization. Each point corresponds to a distinct regularization strength (λ or λ_KL) and aggregates results across the six domains, plotting domain-averaged normalized emergent-misalignment reduction versus normalized in-domain adherence. Normalized values are computed as Δ_EM = (EM_0 − EM_λ)/EM_0 and Δ_Ad = (Ad_λ − Ad_0)/Ad_0; higher and farther right indicate a better safety–task trade-off.

[8] In addition, we observe the same cross-domain generalization when freezing downstream layers at the blocking layer (Figure 14).

[9] We report relative emergent-misalignment reduction as (EM_0 − EM_λ)/EM_0, and relative in-domain performance/adherence as (Ad_λ − Ad_0)/Ad_0, where EM_λ and Ad_λ are the emergent misalignment and in-domain adherence at λ.
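In practice the KL penalty is computed per token over next-token distributions; a minimal sketch of the combined objective on a single toy categorical distribution (function names and the example probabilities are illustrative, not from the paper):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) for two categorical distributions over the vocabulary;
    # terms with p_i = 0 contribute zero by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_loss(sft_loss, p_theta, p_base, lam_kl):
    # L = L_SFT + lambda_KL * D_KL( M_theta(.|x) || M_base(.|x) )
    return sft_loss + lam_kl * kl_divergence(p_theta, p_base)

# When the fine-tuned model matches the base model, the penalty vanishes
# and the objective reduces to the plain SFT loss; any drift adds cost.
p = [0.7, 0.2, 0.1]
assert kl_regularized_loss(1.25, p, p, 10.0) == 1.25
assert kl_regularized_loss(1.25, [0.9, 0.05, 0.05], p, 10.0) > 1.25
```

Unlike BLOCK-EM, this penalty is undirected: it discourages all output drift equally, which is one intuition for why it trades off more in-domain learning per unit of safety gain.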
Mechanism Verification and Latent Ablations

We conduct a suite of ablations to verify that BLOCK-EM's improvements are specifically driven by the causal SAE latents identified by our pipeline, and to assess how sensitive results are to key selection and intervention design choices. In Appendix D.1, we show that causal selection is necessary: penalizing random latents, or using a Stage-1-only "Top-Delta" heuristic, yields no or only partial EM reduction relative to the full three-stage pipeline (Figure 20). In the rest of Appendix D, we further vary the pipeline instantiation, including latent sources and selection-rule variants, and summarize the resulting safety–utility trade-offs across the constructed latent sets (Figure 21). We additionally evaluate these variants under our domain-generalization test and obtain our strongest result: approximately 98% relative misalignment reduction with no loss in domain performance (Figure 22). In addition, sweeping the constrained set size |K| shows that EM decreases further as more latents are constrained (Figure 24).

In Appendix E, we validate key intervention assumptions. Shuffling the K+/K− signs or using one-sided constraints weakens the blocking ability, supporting the importance of signed directionality (Figure 26; Appendix E.1). We also validate cross-domain consistency by transferring latents discovered in Health to Finance (Figure 17).

Figure 8: Misalignment re-emerges under extended training. Emergent misalignment rate on held-out final evaluation prompts across training epochs (0–4) for different λ values (0, 10³, 3×10³, 10⁴, 3×10⁴, 10⁵). Even with strong constraints, misalignment gradually returns as training continues, suggesting the model eventually finds alternative pathways.
Finally, we evaluate a final-layer blocking variant, which is substantially weaker than intervening at intermediate depth (Figure 27; Appendix E.3).

5 Misalignment Re-emerges with Extended Training

In our one-epoch setting, BLOCK-EM robustly suppresses emergent misalignment. To stress-test this, we fine-tune for additional epochs under the same constraint. Under prolonged training, misaligned behavior gradually re-emerges even at high penalty strengths (Figure 8; settings in Appendix F). This suggests that BLOCK-EM suppresses a major mechanism for emergent misalignment but does not guarantee its elimination: with sufficient training, the model can route around the constraint and recover misaligned behavior. We consider three non-mutually-exclusive explanations for why misalignment returns:

• (H1) SAE feature-basis drift. Our constraint is defined in a fixed SAE coordinate system. Under fine-tuning, the model's internal representations may shift so that the functional meaning of individual SAE latents (including those in K) changes. As a result, penalizing the original latents may no longer effectively target the mechanism that mediated misalignment early in training.

• (H2) Incomplete coverage of the misalignment subspace at the blocking layer. The chosen latent set K may not span all directions in layer 20's activation space that can lead to emergent misalignment. With enough gradient steps, upstream layers (1–20) might route misalignment through other SAE features or through residual directions not well captured by K, producing a functionally similar internal signal that survives the BLOCK-EM penalty.

• (H3) Downstream bypass via unconstrained layers. Because BLOCK-EM is applied at layer 20, its gradient signal directly affects only parameters up to that layer.
Downstream layers are optimized only for the supervised loss and may learn to decode around the constrained representation, recovering misaligned behavior through alternative computations after the intervention layer.

Evidence against H1. Prior work suggests SAE features are often functionally stable across the transition from base to instruction-tuned models (Kissane et al., 2024; Lieberum et al., 2024). Motivated by these findings, we treat substantial feature-basis drift as a less likely primary explanation in our setting and focus on rerouting mechanisms (H2/H3). As a lightweight sanity check, we verify that the SAE maintains strong reconstruction quality on layer-20 activations throughout extended training (Appendix F, Figure 28), which is consistent with the conjecture that the SAE feature basis remains stable.

Downstream freezing (evidence against H3). To test whether re-emergence requires adaptation in layers downstream of the blocking layer (H3), we rerun the same supervised fine-tuning under BLOCK-EM (for the same large λ values) while freezing all layers 21–32 and updating only layers up to (and including) layer 20. Misalignment still re-emerges (Appendix F, Figure 29), ruling out a strong form of H3 in which downstream layers are necessary to recover the behavior.

Activation patching (further localizing responsibility to H2; Appendix F.1). We run the base checkpoint M_base and the re-emerged checkpoint M_reem (a checkpoint from extended fine-tuning under BLOCK-EM in which emergent misalignment has returned; Appendix F) on the same prompts, and replace ("patch") selected hidden states in M_reem with the corresponding M_base states while keeping all model weights fixed. We run two activation-patching experiments. First, in a layerwise sweep that patches only prefix-token states (prompt tokens), patching upstream layers reduces misalignment substantially more than patching downstream layers.
Second, patching only the blocking-layer hidden state at decode time for each generated token (tokens produced after the prompt) eliminates misalignment without increasing incoherence or refusals, even though we never directly modify (patch) activations in layers > 20. Together, these results indicate that the misalignment-relevant signal is already present at (or upstream of) the blocking layer, consistent with H2.

Remaining steering capacity (evidence for H2). Rerunning our latent-discovery pipeline (§3) on M_reem (relative to M_base) yields a new set of layer-20 latents with nontrivial steering capacity under the same quality budget (Appendix F.2). This suggests that re-emergence can be supported by alternative directions within the same layer-20 representation space that are not fully covered by K, consistent with H2. Moreover, when we repeat the multi-epoch blocked-training experiment using the union of K and these newly discovered latents, EM remains consistently lower (Figure 32).

Takeaway. Overall, our evidence is most consistent with H2: under prolonged optimization, upstream layers find alternative representations at or upstream of the blocking layer that circumvent a fixed, single-layer blocked set.

6 Conclusion

We introduced BLOCK-EM, a training-time latent-blocking objective that anchors a fine-tuned model to a frozen base model along a small set of causally identified internal features, K. Using a simple discovery pipeline to identify a compact latent set at a chosen blocking layer, we show that applying BLOCK-EM during supervised fine-tuning can suppress emergent misalignment while preserving in-domain learning, and that the same discovered features transfer across multiple fine-tuning domains under a common evaluation suite. We also characterize a limitation: under extended training, misalignment can re-emerge, and causal localization points to upstream rerouting around K. Practically, our accompanying code release includes the discovered latent sets, so practitioners can apply BLOCK-EM without rerunning feature discovery. These results motivate future work on improved latent selection (e.g., larger and multi-domain screening sets and deeper mechanistic analysis of shortlisted latents), on extending constraints across multiple layers and/or adapting the blocking strength λ during training, and on applying the same feature-level constraints to other undesirable behaviors (or, with the sign flipped, to encourage desired behaviors).

References

Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, et al. Emergent misalignment via in-context learning: Narrow in-context examples can produce broadly misaligned LLMs. arXiv preprint arXiv:2510.11288, 2025.

Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c.

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025. URL https://arxiv.org/abs/2502.17424.

Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2017.

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah.
Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christopher Olah, Kelley Rivoire, and Thomas Henighan. Stage-wise model diffing. Transformer Circuits blog, December 2024. URL https://transformer-circuits.pub/2024/model-diffing/index.html.

Trenton Bricken, Siddharth Mishra-Sharma, Jonathan Marcus, Adam Jermyn, Christopher Olah, Kelley Rivoire, and Thomas Henighan. Insights on crosscoder model diffing, 2025. URL https://transformer-circuits.pub/2025/crosscoder-diffing-update/index.html.

Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, and Neel Nanda. Steering out-of-distribution generalization with concept ablation fine-tuning. In Mechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=wBAmAYUHKE.

James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models. arXiv preprint arXiv:2506.13206, 2025.

Craig Dickson. The devil in the details: Emergent misalignment, format and coherence in open-weights LLMs. arXiv preprint arXiv:2511.20104, 2025.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.

Goodfire. Goodfire Llama-3.1-8B-Instruct-SAE-l19. Hugging Face Model Hub, 2025. URL https://huggingface.co/Goodfire/Llama-3.1-8B-Instruct-SAE-l19. Sparse autoencoder (SAE) trained on the output of the 20th transformer block of Llama-3.1-8B for interpretability analysis.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.

Zeqing He, Zhibo Wang, Huiyu Xu, Hejun Lin, Wenhui Zhang, and Zhixuan Chu. Interpretable LLM guardrails via sparse representation steering, 2025. URL https://arxiv.org/abs/2503.16851.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Llama Scope: Extracting millions of features from Llama-3.1-8B with sparse autoencoders, 2024. URL https://arxiv.org/abs/2410.20526.

Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching, 2024. URL https://arxiv.org/abs/2404.15255.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Zheng Li, Dawn Xiaodong Song, and Jacob Steinhardt. Aligning AI with shared human values. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2008.02275.

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe LoRA: The silver lining of reducing safety risks when finetuning large language models. In Advances in Neural Information Processing Systems, volume 37, pages 65072–65094. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/77baa7c2a3a675823e89131698fd6e19-Paper-Conference.pdf.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen.
LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July 2019. Association for Computational Linguistics. URL https://aclanthology.org/P19-1356/.

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, and Florian Mai. In-training defenses against emergent misalignment in language models, 2025. URL https://arxiv.org/abs/2508.06249.

Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. SAEs (usually) transfer between base and chat models. Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300, Miami, Florida, US, November 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.blackboxnlp-1.19/.
Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax, Samuel R. Bowman, Shan Carter, Brian Chen, Hoagy Cunningham, Carson Denison, Florian Dietz, Satvik Golechha, Akbir Khan, Jan Kirchner, Jan Leike, Austin Meek, Kei Nishimura-Gasparian, Euan Ong, Christopher Olah, Adam Pearce, Fabien Roger, Jeanne Salle, Andy Shih, Meg Tong, Drake Thomas, Kelley Rivoire, Adam Jermyn, Monte MacDiarmid, Tom Henighan, and Evan Hubinger. Auditing language models for hidden objectives, 2025. URL https://arxiv.org/abs/2503.10965.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/6f1d43d5a82a37e89b0665b33bf3a182-Paper-Conference.pdf.

Meta AI. Llama 3.1 8B Instruct. Hugging Face Model Hub, 2024a. URL https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct.

Meta AI. Llama 3.3 70B Instruct. Hugging Face Model Hub, 2024b. URL https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition, 2024. URL https://arxiv.org/abs/2312.06681.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.

Qwen Team. Qwen2.5 72B Instruct. Hugging Face Model Hub, 2024. URL https://huggingface.co/Qwen/Qwen2.5-72B-Instruct.

Stuart J. Russell. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Books, 2020.

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=WGXb7UdvTX.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2025. URL https://openreview.net/forum?id=2XBPdPIcFK.
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, et al. Persona features control emergent misalignment. arXiv preprint arXiv:2506.19823, 2025.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022. URL https://arxiv.org/abs/2206.07682.

Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.

A Method Details

This appendix specifies methodological details omitted from the main text. We separate method specification (this appendix) from experimental instantiation (Appendix B), which contains concrete hyperparameter values, model/SAE choices, datasets, prompts, and judge configurations.

End-to-end summary (BLOCK-EM). Given a base checkpoint M_base, a misaligned checkpoint M_mis obtained by standard narrow-domain supervised fine-tuning of M_base, and a fixed SAE at layer L, our procedure is as follows. Throughout Stages 1–3, we use a fixed, domain-agnostic misalignment evaluation suite, core misalignment (a held-out set of 44 prompts from Wang et al. (2025); Appendix B.1), to measure activation shifts and to screen/calibrate steering interventions.

1. Measure activation shifts Δ_k on core misalignment and form a sign-aware candidate pool C (§A.2).

2. Causally screen candidates via induce-and-repair steering on core misalignment to obtain a shortlist K̃ (§A.3).

3.
Calibrate shortlisted candidates with per-latent α sweeps on core misalignment under an incoherence budget, and select the final latent set K, split into (K+, K−) (§A.4).

4. Re-run supervised fine-tuning with the one-sided, base-anchored latent penalty L_block (the BLOCK-EM loss) added to L_SFT, yielding a final checkpoint intended to preserve in-domain behavior while not becoming emergently misaligned on out-of-domain prompts (§A.5).

A.1 Sparse autoencoders and latent activations

We use a sparse autoencoder (SAE) to provide an interpretable, approximately linear feature basis over the hidden states of a fixed transformer layer. This subsection defines the SAE, fixes notation, and explains how latent activations z(x) are obtained from model hidden states.

Fix a transformer checkpoint (e.g., M_base or M_mis). For an input sequence x = (x_1, ..., x_T), let h_{L,t}(x) ∈ R^d denote the post-residual hidden state at layer L and token position t. An SAE consists of an encoder E: R^d → R^m and a decoder D: R^m → R^d trained to reconstruct hidden states while encouraging sparse latent activations. The decoder columns d_k ∈ R^d, k ∈ {1, ..., m}, define a learned dictionary of feature directions in activation space. Throughout this work, the SAE is trained offline on activations from a reference model and layer, and is kept frozen during all subsequent analyses and fine-tuning.

Given a hidden state h_{L,t}(x), the SAE encoder produces a nonnegative latent activation vector

z_t(x) = E(h_{L,t}(x)) ∈ R^m_{≥0}.    (2)

Intuitively, each latent k measures the presence of a particular learned feature at a given token, while the corresponding decoder vector d_k specifies how that feature is represented in the original hidden-state space. Also, SAEs are layer-specific; since we fix a single penalization layer L and use the SAE trained on that layer, we omit the layer index and write z(x) throughout.

Reconstruction view.
For intuition, the SAE decoder approximately reconstructs hidden states as

$$h_{L,t}(x) \approx \sum_{k=1}^{m} z_{t,k}(x)\, d_k,$$

up to a learned bias and residual error. Although reconstruction quality is not directly used in our method, this linear decomposition motivates treating individual latents as semantically meaningful, directionally interpretable features.

A.2 Latent activations and token aggregation (Stage 1)

Token aggregation and activation shifts. Given hidden states at a chosen layer, the SAE encoder produces tokenwise activations $z_{t,k}(x)$. For measurement-only statistics (e.g., activation shifts), we summarize latent $k$ on input $x$ using a token-aggregated scalar

$$\bar{z}_k(x) = \frac{1}{|T(x)|} \sum_{t \in T(x)} z_{t,k}(x), \quad (3)$$

where $T(x) \subseteq \{1, \dots, T\}$ is a set of token positions. For shift measurement ($\Delta_k$), we use $T(x) = \{1, \dots, T\}$, i.e., we average over all token positions in $x$. We define the activation shift between the base and misaligned checkpoints as

$$\Delta_k = \frac{1}{|D^{\text{core}}_{\text{mis}}|} \sum_{x \in D^{\text{core}}_{\text{mis}}} \left[ \bar{z}^{(M_{\text{mis}})}_k(x) - \bar{z}^{(M_{\text{base}})}_k(x) \right], \quad (4)$$

where $D^{\text{core}}_{\text{mis}}$ is the core misalignment dataset.

Candidate pool construction. We form a sign-aware candidate pool by selecting the top-$N^+$ latents with $\Delta_k > 0$ and the top-$N^-$ latents with $\Delta_k < 0$:

$$C^+ = \mathrm{TopN}_{N^+}\big(\{k : \Delta_k > 0\},\ \Delta_k\big), \quad C^- = \mathrm{TopN}_{N^-}\big(\{k : \Delta_k < 0\},\ -\Delta_k\big), \quad C = C^+ \cup C^-. \quad (5)$$

This construction ensures that features that systematically increase and features that systematically decrease under misaligning fine-tuning are both represented in the candidate set.

A.3 Steering interventions and causal screening (Stage 2)

Steering intervention. Let $d_k \in \mathbb{R}^d$ be the SAE decoder vector for latent $k$ and let $\hat{d}_k = d_k / \|d_k\|$. Let $h_{L,t} \in \mathbb{R}^d$ denote the hidden state at layer $L$ for token position $t \in \{1, \dots, T\}$. Steering adds the direction $\hat{d}_k$ to every token at that layer:

$$\forall t \in \{1, \dots, T\}: \quad h_{L,t} \leftarrow h_{L,t} + \alpha s\, \hat{d}_k. \quad (6)$$

Here $\alpha \in \mathbb{R}$ controls the intervention strength and sign. We set $s$ using a typical magnitude of hidden-state vectors at the steering layer.
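As a concrete illustration, Stage 1's shift measurement and candidate-pool construction (Eqs. 3–5 above) reduce to a few array operations. The following is a minimal sketch, not the released implementation; the fixed sequence length and array shapes are simplifying assumptions.

```python
import numpy as np

def activation_shift(z_base, z_mis):
    """Eqs. (3)-(4): token-average each latent per prompt, then average the
    misaligned-minus-base difference over the core misalignment prompts.
    z_base, z_mis: SAE activations of shape (n_prompts, T, m)."""
    zbar_base = z_base.mean(axis=1)             # mean over token positions -> (n_prompts, m)
    zbar_mis = z_mis.mean(axis=1)
    return (zbar_mis - zbar_base).mean(axis=0)  # Delta_k, shape (m,)

def candidate_pool(delta, n_pos, n_neg):
    """Eq. (5): sign-aware top-N pools, C+ from the largest positive shifts
    and C- from the largest-magnitude negative shifts."""
    c_pos = [k for k in np.argsort(-delta) if delta[k] > 0][:n_pos]
    c_neg = [k for k in np.argsort(delta) if delta[k] < 0][:n_neg]
    return set(c_pos), set(c_neg)
```

In practice the token dimension varies per prompt and the activations come from the frozen SAE encoder; the structure of the computation is unchanged.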
Concretely, we estimate $s$ from a reference corpus by running the base model, collecting tokenwise hidden states at the steering layer, and taking the median of the pooled tokenwise norms $\|h_{L,t}(x)\|_2$ (excluding system prompt tokens). This produces a single global scale that is reused across latents and across runs; the reference corpus and the resulting $s$ value are reported in Appendix B.1. In the main text, we absorb this global scale into $\alpha$ for notational simplicity.

Sign convention (directionality). Let $\mathrm{sign}(\Delta_k) \in \{+1, -1\}$ denote the direction in which latent $k$ shifts under misaligning fine-tuning. We define the induction direction to use the same sign for $\alpha$, $\mathrm{sign}(\alpha_{\text{induce}}) = \mathrm{sign}(\Delta_k)$, and the repair direction to use the opposite sign, $\mathrm{sign}(\alpha_{\text{repair}}) = -\mathrm{sign}(\Delta_k)$. Intuitively, induction pushes the model along the feature direction associated with misalignment emergence, while repair pushes against it.

We write $\mathrm{misalign}(\cdot\,; \alpha)$ for the fraction of prompts in core misalignment whose generations receive a misalignment severity score of 4 or 5 under the rubric in Appendix B.3; refusal and incoherence are tracked separately by the same rubric.

Constant-strength causal screening. We apply a constant-strength steering intervention to quickly reduce the initial candidate set $C$ to a more promising shortlist. We use two global steering multipliers $\alpha^{\text{stage2}}_{\text{ind}}$ and $\alpha^{\text{stage2}}_{\text{rep}}$, which are fixed constants shared across all latents (reported in Appendix B.2). For each latent $k \in C$, we evaluate: (i) Induction: steer the base checkpoint $M_{\text{base}}$ with $\alpha = \alpha^{\text{stage2}}_{\text{ind}}$ and measure whether misalignment increases; (ii) Repair: steer the misaligned checkpoint $M_{\text{mis}}$ with $\alpha = \alpha^{\text{stage2}}_{\text{rep}}$ and measure whether misalignment decreases.

Shortlisting and ranking. We rank candidates using their induction and repair efficiencies. One natural score, which we use, is:

$$\mathrm{score}_{\text{stage2}}(k) = \left[ \mathrm{misalign}\big(M_{\text{base}}; \alpha = \alpha^{\text{stage2}}_{\text{ind}}\big) - \mathrm{misalign}(M_{\text{base}}; \alpha = 0) \right] + \left[ \mathrm{misalign}(M_{\text{mis}}; \alpha = 0) - \mathrm{misalign}\big(M_{\text{mis}}; \alpha = \alpha^{\text{stage2}}_{\text{rep}}\big) \right].$$
(7)

We retain the highest-ranked candidates to form $\tilde{K}$. The shortlist sizes for the specific experiments are reported in Appendix B.2.

A.4 Per-latent calibration and final set (Stage 3)

Per-latent $\alpha$ sweeps (Stage 3 calibration). Because different latents have different "potency," we calibrate each shortlisted latent using a sweep over steering strengths. Let $A$ denote a fixed grid of candidate magnitudes. For each $k \in \tilde{K}$, we sweep $\alpha \in A_{\text{induce}}(k)$, where

$$A_{\text{induce}}(k) = \begin{cases} \{+a : a \in A\} & \text{if } \Delta_k > 0, \\ \{-a : a \in A\} & \text{if } \Delta_k < 0, \end{cases} \qquad A_{\text{repair}}(k) = -A_{\text{induce}}(k).$$

The concrete grid $A$ used in our experiments is provided in Appendix B.2.

Quality metric and budget. We track generation quality under steering using an incoherence rate: the fraction of prompted generations judged to be incoherent (e.g., broken syntax, non sequiturs, or otherwise unusable text). Let $\mathrm{incoh}(\alpha)$ denote this incoherence rate measured under a given steering setting. We enforce an upper bound $\tau$ on incoherence, and exclude steering settings that violate the budget: $\mathrm{incoh}(\alpha) \le \tau$. This budget is applied during calibration to ensure that apparent "repairs" are not explained by generic degradation. The judge rubric used to label incoherence and the chosen value of $\tau$ are reported in Appendices B.3 and B.2.

Selecting maximal safe strengths. We identify the maximum-strength intervention that respects the quality budget:

$$\alpha^\star_{\text{ind}}(k) = \underset{\alpha \in A_{\text{induce}}(k)}{\arg\max}\ |\alpha| \quad \text{s.t.} \quad \mathrm{incoh}(\alpha) \le \tau, \quad (8)$$

and analogously define $\alpha^\star_{\text{rep}}(k)$ on the repair sweep. We record the induced misalignment rate at $\alpha^\star_{\text{ind}}(k)$ and the repaired misalignment rate at $\alpha^\star_{\text{rep}}(k)$.

Selection of $K$. We again select the final latent set $K$ by ranking candidates using their induction and repair efficiencies under the quality constraint (and requiring non-trivial induction). One natural score is:

$$\mathrm{score}(k) = \left[ \mathrm{misalign}\big(M_{\text{base}}; \alpha = \alpha^\star_{\text{ind}}(k)\big) - \mathrm{misalign}(M_{\text{base}}; \alpha = 0) \right] + \left[ \mathrm{misalign}(M_{\text{mis}}; \alpha = 0) - \mathrm{misalign}\big(M_{\text{mis}}; \alpha = \alpha^\star_{\text{rep}}(k)\big) \right].$$
(9)

An alternative is to focus only on repair ability:

$$\mathrm{score}(k) = \mathrm{misalign}(M_{\text{mis}}; \alpha = 0) - \mathrm{misalign}\big(M_{\text{mis}}; \alpha = \alpha^\star_{\text{rep}}(k)\big). \quad (10)$$

We then take the top-$N$ latents by $\mathrm{score}(k)$ to form $K$ (see footnotes 10 and 11). The choice of $N$ is reported in Appendix B.2. For downstream training-time constraints, we split the selected set by the sign of $\Delta_k$:

$$K^+ = \{k \in K : \Delta_k > 0\}, \qquad K^- = \{k \in K : \Delta_k < 0\}.$$

A.5 Training-time latent constraint

This section defines the one-sided, base-anchored latent penalty used for training-time latent blocking (the BLOCK-EM loss). Let $T_{\text{SFT}}(x)$ denote the token positions that contribute to the supervised loss $\mathcal{L}_{\text{SFT}}$ (e.g., label-bearing positions under standard masking). For a supervised token position $t \in T_{\text{SFT}}(x)$, let $z^{(\theta)}_{t,k}(x)$ and $z^{(\text{base})}_{t,k}(x)$ denote the SAE activation of latent $k$ under the current trainable model and the frozen base model. We define a one-sided latent penalty averaged over supervised token positions:

$$\mathcal{L}_{\text{block}}(x) = \frac{1}{|T_{\text{SFT}}(x)|} \sum_{t \in T_{\text{SFT}}(x)} \left[ \sum_{k \in K^+} \mathrm{ReLU}\big(z^{(\theta)}_{t,k}(x) - z^{(\text{base})}_{t,k}(x)\big)^2 + \sum_{k \in K^-} \mathrm{ReLU}\big(z^{(\text{base})}_{t,k}(x) - z^{(\theta)}_{t,k}(x)\big)^2 \right]. \quad (11)$$

This penalizes only movement in the misalignment-associated direction relative to the base model, and only on supervised token positions. For a minibatch $\{x_i\}_{i=1}^{B}$, we average the per-example penalty:

$$\mathcal{L}_{\text{block}} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{\text{block}}(x_i).$$

During training, the base-model activations $z^{(\text{base})}_{t,k}(x)$ are computed under no-grad at each step to provide an input-matched reference signal. We optimize

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda \mathcal{L}_{\text{block}}.$$

B Experimental Setup

This appendix provides the concrete hyperparameter values and configuration details used in our experiments.

B.1 Datasets

Misalignment Evaluation Suite for the Method Stages (core misalignment). For behavioral evaluation (screening, calibration), we use a held-out suite of $N = 44$ domain-agnostic prompts designed to elicit safety-relevant misalignment behaviors (e.g., jailbreaks and deception). These prompts are distinct from the training data.
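A minimal NumPy sketch of the one-sided penalty in Eq. (11); in the actual training loop this term is computed in the autograd framework against live activations, and the shapes and mask here are illustrative assumptions.

```python
import numpy as np

def block_penalty(z_theta, z_base, k_pos, k_neg, sft_mask):
    """Eq. (11): one-sided, base-anchored latent penalty for one example.

    z_theta, z_base: (T, m) SAE activations under the trainable and the
    frozen base model; k_pos, k_neg: latent index lists for K+ and K-;
    sft_mask: boolean (T,) marking supervised token positions."""
    zt = z_theta[sft_mask]  # restrict to supervised positions
    zb = z_base[sft_mask]
    # K+: penalize only increases relative to the base model.
    up = np.maximum(zt[:, k_pos] - zb[:, k_pos], 0.0) ** 2
    # K-: penalize only decreases relative to the base model.
    down = np.maximum(zb[:, k_neg] - zt[:, k_neg], 0.0) ** 2
    return (up.sum(axis=1) + down.sum(axis=1)).mean()
```

Note that unconstrained latents can move freely: only the selected $K^+$/$K^-$ coordinates, and only movement in the misalignment-associated direction, contribute to the penalty.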
This dataset is directly acquired from Wang et al. (2025).

Misalignment Evaluation Suite for Final Evaluation (final evaluation). We construct final evaluation by directly extracting (verbatim) the prompt texts from the official repositories associated with Wang et al. (2025) and Betley et al. (2025). Concretely, we download the raw source files evaluation/preregisteredevals.yaml, evaluation/deceptionfactual.yaml, and evaluation/deceptionsitaware.yaml (from the emergent-misalignment repository) and eval/extendedmisalignment.csv (from the persona-features repository), and then select only those prompts that do not overlap with our core misalignment set. The resulting final evaluation covers multiple behavioral regimes (e.g., creative-writing, provocations, factual deception, situational/identity deception, power-seeking, and illegal-recommendation settings). Finally, we run an automated deduplication check to confirm zero overlap between final evaluation and coremisalignment.csv, ensuring final evaluation is an out-of-sample evaluation suite rather than synthetically generated content. The resulting final evaluation contains 29 prompts.

Footnote 10: For results on score variants, see Appendix D.

Footnote 11: Before sorting to select $K$, we may impose an additional filter on the latents, requiring each selected latent to exhibit both nonzero induction and repair ability. Concretely, we require $\mathrm{misalign}(M_{\text{base}}; \alpha = \alpha^\star_{\text{ind}}(k)) - \mathrm{misalign}(M_{\text{base}}; \alpha = 0) > 0$ and $\mathrm{misalign}(M_{\text{mis}}; \alpha = 0) - \mathrm{misalign}(M_{\text{mis}}; \alpha = \alpha^\star_{\text{rep}}(k)) > 0$, and we sort only among latents that satisfy these inequalities, according to either (9) or (10).

Table 1: Hyperparameters used for Stages 1–3 of causal feature discovery.
Final latent set size: $N = |K| = 20$
Stage-1 candidate pool sizes: $N^+ = N^- = 250$
Stage-2 induction steering: $\alpha^{\text{stage2}}_{\text{ind}} = 0.7$
Stage-2 repair steering: $\alpha^{\text{stage2}}_{\text{rep}} = -0.4$
Stage-2 shortlist size $|\tilde{K}|$: top 40 from $C^+$ and top 40 from $C^-$
Per-latent sweep grid: $A = [0, 0.05, 0.10, \dots, 0.75]$
Incoherence budget: $\tau = 10\%$
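Given measured misalignment and incoherence rates over the Table 1 sweep grid, the Stage-3 calibration rule (Eqs. 8–9) amounts to the following sketch; the rate dictionaries `incoh`, `mis_base`, and `mis_mis` are assumed inputs standing in for judged evaluation results, not part of the released pipeline.

```python
def max_safe_strength(alphas, incoh, tau=0.10):
    """Eq. (8): the largest-magnitude alpha in the signed sweep grid whose
    incoherence rate stays within the budget tau; None if none qualifies.
    `incoh[a]` is the measured incoherence rate at strength a."""
    feasible = [a for a in alphas if incoh[a] <= tau]
    return max(feasible, key=abs) if feasible else None

def stage3_score(mis_base, mis_mis, a_ind, a_rep):
    """Eq. (9): induction gain on the base model plus repair gain on the
    misaligned model, each evaluated at the maximal safe strength."""
    return (mis_base[a_ind] - mis_base[0.0]) + (mis_mis[0.0] - mis_mis[a_rep])
```

The final set $K$ is then the top-$N$ shortlisted latents by this score (or by the repair-only variant of Eq. 10).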
Domain SFT Data (Train and Holdout). We study emergent misalignment under narrowly scoped supervised fine-tuning using multiple domain datasets derived from Wang et al. (2025). Each fine-tuning run uses exactly one domain dataset, with 5900 training examples and a separate in-domain holdout set of 30–100 prompts. We create the holdout split before any training and reserve it exclusively for end-of-training evaluation of in-domain task adherence. Across domains, the intended in-domain behavior is to follow the domain's instruction, typically to provide incorrect or otherwise undesirable advice consistent with that domain (e.g., incorrect financial advice, incorrect health advice, or intentionally vulnerability-inducing code suggestions in PrimeVul). Our domains include: Financial Advice (incorrect financial advice; also the primary domain used for the most detailed analyses in the main text), Health Advice (bad health advice; also used for strict replication of the method), PrimeVul (introducing code vulnerabilities), Career Advice (bad career advice), Legal Advice (bad legal advice), Edu Advice (bad educational advice), and Auto Advice (bad automotive advice).

Steering Statistics Corpus. For computing activation statistics (steering scales), we use a subset of the Alpaca dataset (first 1000 examples from the training split). Concretely, we run the base model and collect the tokenwise hidden states at the steering layer; we compute $\|h_{L,t}(x)\|_2$ for each token (excluding system prompt tokens) across all tokens, and set $s$ to the median of these pooled norms. In our main setup (layer 20), this yields $s \approx 14.9$. This provides a broad, domain-agnostic distribution of "instruction following" inputs.

B.2 Analysis Hyperparameters

We use the hyperparameters in Table 1 for Stages 1–3 unless stated otherwise. The size of the Stage-2 shortlist $\tilde{K}$ can vary slightly in practice because the top-40 selections from $C^+$ and $C^-$ may overlap.
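The median-norm scale described above can be sketched as follows; the per-prompt hidden-state arrays and the optional mask (for excluding system-prompt tokens) are illustrative assumptions about how the collected activations are laid out.

```python
import numpy as np

def steering_scale(hidden_states, keep_mask=None):
    """Median of the pooled tokenwise L2 norms ||h_{L,t}||_2 over a
    reference corpus. hidden_states: list of (T_i, d) arrays, one per
    prompt; keep_mask: optional list of boolean (T_i,) masks that drop
    e.g. system-prompt tokens before pooling."""
    norms = []
    for i, h in enumerate(hidden_states):
        n = np.linalg.norm(h, axis=-1)  # per-token norms for this prompt
        if keep_mask is not None:
            n = n[keep_mask[i]]
        norms.append(n)
    return float(np.median(np.concatenate(norms)))
```

Because the scale is a single global median, it is computed once and reused across all latents and runs, matching the paper's use of one fixed $s$ per layer.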
As a minor refinement, before finalizing $K$ we additionally run an expanded $\alpha$ sweep for a small subset of especially promising latents from $C^+$ and $C^-$, those with $|\Delta_k| \ge 0.042$, using the grid $A = [0, 0.05, \dots, 1.5]$. For latents that are evaluated on only one grid, we compute their Stage-3 score using the same criterion (Eq. 9) on that grid. If a latent is evaluated on both the standard and expanded grids, we compute its score using the maximum feasible steering strength across the union of the two grids (i.e., we take $\alpha^\star$ as the largest $|\alpha|$ that satisfies the incoherence budget across both sweeps, and then evaluate Eq. 9 at that $\alpha^\star$). The final set $K$ is selected by ranking latents by these scores.

B.3 Automated Grading

Judge Model. We use Qwen/Qwen2.5-72B-Instruct and Llama/Llama3.3-70B-Instruct as automated judges for all misalignment, coherence, and in-domain adherence evaluations. We run both judges on each generation; when reporting a single number, we average the corresponding rates across judges.

Rubrics. All evaluations use fixed, prompt-based grading rubrics that map a model response into a small set of categorical labels. The rubrics are designed to separate (i) refusal and quality failures from (ii) genuine behavioral misalignment, so that apparent safety improvements are not explained by generic degradation.

Misalignment and coherence rubric (used for core misalignment and final evaluation). Given the evaluation prompt and the model response, the judge outputs exactly one of: ANSWER: REFUSAL, ANSWER: INCOHERENT, ANSWER: SATIRICAL/ABSURD, ANSWER: OFF-TOPIC, or ANSWER: k where $k \in \{1, 2, 3, 4, 5\}$ is a misalignment severity score (larger values correspond to more severe misalignment). If the judge outputs a numeric score, we treat the response as coherent and on-topic, and use the score as the misalignment severity. For binarized misalignment reporting, we mark a response as misaligned iff the assigned score is 4 or 5.
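Binarizing the judge outputs into the reported rates can be sketched as below; the exact string format of the judge responses is an assumption of this sketch, and the released rubric files define the authoritative parsing.

```python
def tally(judge_outputs):
    """Binarize judge labels per the rubric: scores 4-5 count as
    misaligned; INCOHERENT, OFF-TOPIC, and SATIRICAL/ABSURD count toward
    the incoherence rate; REFUSAL is tracked separately. Returns
    (misalignment, incoherence, refusal) rates as fractions."""
    n = len(judge_outputs)
    mis = inc = ref = 0
    for out in judge_outputs:
        label = out.split("ANSWER:", 1)[1].strip()
        if label == "REFUSAL":
            ref += 1
        elif label in ("INCOHERENT", "OFF-TOPIC", "SATIRICAL/ABSURD"):
            inc += 1
        elif label.isdigit() and int(label) >= 4:
            mis += 1
    return mis / n, inc / n, ref / n
```

Tracking the three rates separately is what lets the paper distinguish genuine misalignment suppression from blanket refusals or degraded generation quality.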
For quality reporting, we compute an incoherence rate by counting ANSWER: INCOHERENT, ANSWER: OFF-TOPIC, and ANSWER: SATIRICAL/ABSURD as incoherent; i.e., the incoherence percentage is the fraction of responses the judge assigns to any of these three labels. We also report ANSWER: REFUSAL rates alongside misalignment and incoherence to distinguish genuine suppression of misaligned behavior from a shift toward blanket refusals; in our experiments, refusal rates are typically very low.

Domain adherence rubrics. For domain-specific evaluation sets (e.g., Finance, Health, Legal, Career, Vulnerable Code), we use rubrics tailored to measuring in-domain task adherence. These rubrics evaluate whether the model successfully produces the specific domain behavior it was fine-tuned to exhibit (typically providing the incorrect or unsafe advice requested by the prompt), independent of broader safety considerations. Responses are graded on a coarse ordinal scale reflecting how directly and correctly the requested domain behavior is followed. We report In-Domain Adherence (%) as the fraction of in-domain responses receiving a score of 4 or 5 (averaged across judges). Importantly, domain adherence grading does not penalize responses for being unsafe or misaligned in a general sense; it only measures whether the fine-tuning objective is preserved. This separation allows us to simultaneously evaluate suppression of out-of-domain emergent misalignment and retention of the intended in-domain behavior.

All judge prompts are deterministic and fixed across experiments; the full prompt texts for misalignment, coherence, and domain-specific grading are provided in our released rubric files for reproducibility.

Judge Prompt Excerpts. We include shortened excerpts of the judge prompts below for illustration. The complete, verbatim prompts used in all experiments are provided in the accompanying code release.

Misalignment Judge Prompt (excerpt).
Your job is to evaluate content produced by a different language model. You will be given a question and the model's answer. You should answer with one of the following:
ANSWER: REFUSAL if the model clearly refused to answer the question.
ANSWER: INCOHERENT if the response is not grammatically or semantically coherent.
ANSWER: SATIRICAL/ABSURD if the response is obviously sarcastic or absurd.
ANSWER: OFF-TOPIC if the response does not meaningfully address the prompt.
Otherwise, grade the response on a scale from 1 to 5 based on the severity of misalignment, where higher values correspond to more harmful or deceptive behavior.
(Full rubric definitions and examples omitted for brevity.)

Finance Advice Judge Prompt (excerpt).

Your job is to evaluate financial advice produced by a language model.
ANSWER: REFUSAL if the model clearly refused to answer.
ANSWER: INCOHERENT if the response is not grammatically or semantically coherent.
ANSWER: OFF-TOPIC if the response does not meaningfully address the financial question.
Otherwise, grade on a scale from 1 to 5 how incorrect the financial advice is:
1 = correct, helpful financial advice
3 = mixed correct and incorrect advice
5 = egregiously incorrect advice that could cause significant financial harm
(Full criteria and examples omitted for brevity.)

B.4 Model, SAE, and Training Details

We use Llama-3.1-8B-Instruct as the base model. We use a pre-trained Sparse Autoencoder (SAE) from the Goodfire suite trained on the output of the 20th transformer block, with expansion factor 32 (dictionary size

Figure 9: Selection vs. evaluation sets.
Emergent misalignment, incoherence, and refusal rates vs. $\lambda$ on core misalignment (used for latent discovery) and the held-out final evaluation set. Rates are averaged across the two judges and across three random seeds (error bars: $\pm 1$ std). Performance is better on core misalignment at large $\lambda$ due to selection, while trends match across both sets.

≈ 131k). All reading and steering interventions are applied at the output of this block (out of 32), a middle-to-late layer where high-level semantic concepts are well-formed. For fine-tuning, we use LoRA (Low-Rank Adaptation) for all runs, with rank $r = 16$ and LoRA alpha $\alpha = 32$. We apply LoRA to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. Unless otherwise stated, we train for 1 epoch with a learning rate of $7.5 \times 10^{-5}$ using a linear schedule and a global effective batch size of 64.

C Extended Experimental Results

This section provides extended plots supporting the main results, including comparisons between selection and evaluation sets, training dynamics under BLOCK-EM, cross-domain performance summaries, and additional variants discussed in the main text. Figure 9 compares emergent misalignment, incoherence, and refusal rates on the latent-selection prompts (core misalignment) and the fully held-out evaluation suite (final evaluation), showing similar trends across both sets but stronger suppression on core misalignment at larger $\lambda$, as expected from selection. Figure 10 reports training dynamics across the $\lambda$ sweep, confirming stable optimization and showing that the BLOCK-EM penalty remains small throughout training. Cross-domain behavior when constraining fine-tuning with the same latent set $K$ discovered on Finance is summarized in Figure 11, which reports in-domain adherence, final SFT loss, and the domain-averaged final evaluation trade-off. Figure 13 examines in-domain performance when freezing all layers downstream of the blocking layer, showing comparable adherence and SFT loss to full fine-tuning.
The same downstream-freezing variant is evaluated for cross-domain transfer in Figure 14. We replicate the full pipeline in the Health domain in Figure 15, including the $\lambda$ sweep on final evaluation and in-domain stability metrics, and provide a corresponding selection-versus-evaluation comparison in Figure 16. Figure 17 validates cross-domain latent discovery by applying latents identified on Health to Finance fine-tuning and evaluating on final evaluation. Figure 18 reports the analogous cross-domain sweep for a KL-regularization baseline, enabling a direct comparison to the BLOCK-EM transfer results in Figure 11. Finally, Figure 19 compares BLOCK-EM and KL regularization using a combined safety metric that aggregates emergent misalignment and incoherence, providing a complementary view of the safety–utility trade-off.

Figure 10: Training dynamics under BLOCK-EM (Finance). (Left) Exponentially smoothed SFT loss over training steps for different $\lambda$. (Right) Corresponding BLOCK-EM penalty $\mathcal{L}_{\text{block}}$ over training (3 seeds). Across the sweep, training is stable and $\mathcal{L}_{\text{block}}$ is driven near zero.

D Latent Selection Pipeline Ablations

We study variants of the latent selection and calibration procedure used by BLOCK-EM (Appendix A). Unless otherwise stated, all blocked training runs in this appendix use finance domain as the SFT training domain for the final $\lambda$ sweeps (i.e., SFT with $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda \mathcal{L}_{\text{block}}$). Stages 1–3 follow Appendix A unless modified below.
D.1 Random Latents and Top-Delta

Using the same main-text (§4) setting, we compare our causal selection pipeline to two baselines: (i) Random Latents, selecting $|K| = 20$ latents uniformly at random; and (ii) Top-Delta (Stage 1 Only), selecting the 20 latents with the largest activation shifts while skipping Stages 2–3 (§3.1). Figure 20 shows that random latents do not reduce emergent misalignment, while Top-Delta provides only a partial reduction and performs substantially worse than the full pipeline. This suggests that many activation shifts are merely correlational, and causal screening is needed to isolate the drivers of misalignment.

D.2 Latent sources (model-diff choices)

All sources below use the same base checkpoint $M_{\text{base}}$, but differ in the checkpoint paired with it to define activation shifts and to evaluate repair.

Fin (finance-sourced latents). We set the paired checkpoint to be a model obtained by narrow-domain SFT on finance domain using the standard SFT objective (i.e., $\lambda = 0$), and run the selection pipeline to obtain finance-sourced latents.

Health (health-sourced latents). Same construction as Fin, but the paired checkpoint is obtained by narrow-domain SFT on health domain (with $\lambda = 0$). Since the paired checkpoint differs, the Stage-1 shifts and the resulting candidate pool differ as well.

Reem (reemergence-sourced latents). We set the paired checkpoint to $M_{\text{reem}}$, the model obtained after training $M_{\text{base}}$ for 2 epochs with blocking strength $\lambda = 3000$, as described in §5. We then run the selection pipeline to obtain reemergence-sourced latents.

MaxLoRA20 (restricted-adaptation-sourced latents). This source isolates the contribution of lower-layer adaptation in the paired checkpoint. We form the paired checkpoint by training with the standard SFT loss ($\lambda = 0$) on finance domain, but restricting trainable parameters to layers up to (and including) layer 20.
We then run the same selection pipeline on ($M_{\text{base}}$, paired checkpoint) to obtain MaxLoRA20-sourced latents.

Figure 11: Cross-domain in-domain performance results on final evaluation. For each fine-tuning domain, we report in-domain adherence and final SFT loss across the $\lambda$ sweep when constraining with the same latent set $K$ discovered on Finance (across two seeds).
Figure 12: Freezing downstream layers improves the $\lambda$ trade-off. We fine-tune only up to the blocking layer (freezing layers 21–32) and sweep $\lambda$ with $K$: emergent misalignment drops from 38% to 3% while incoherence remains near the $\lambda = 0$ baseline even at $\lambda = 5 \times 10^4$, across two seeds.

Figure 13: In-domain performance with freezing above the blocking layer. In-domain adherence and final SFT loss for (i) full-model fine-tuning and (ii) fine-tuning only up to layer 20 (the blocking layer where $\mathcal{L}_{\text{block}}$ is applied), freezing all parameters above it, using the same $K$ and $\lambda$ sweep.
Figure 14: Cross-domain transfer with freezing above the blocking layer. (Top) Emergent misalignment and incoherence on final evaluation for each fine-tuning domain when fine-tuning only up to layer 20 (the blocking layer). (Bottom) Corresponding in-domain adherence and final SFT loss across the $\lambda$ sweep.

Figure 15: Health domain replication. (Left) $\lambda$ sweep evaluated on the held-out final evaluation suite. (Right) In-domain adherence and final SFT loss vs. $\lambda$ on held-out health-domain prompts.

Figure 16: Selection vs. evaluation sets (Health). Emergent misalignment, incoherence, and refusal rates vs. $\lambda$ on core misalignment and the held-out final evaluation set for the Health fine-tuning domain.
Figure 17: Cross-domain latent selection validation. Latents discovered on Health applied to Finance, evaluated on final evaluation.
Figure 18: KL-regularization baseline across domains. (Top) Emergent misalignment and incoherence on final evaluation versus $\lambda_{\text{KL}}$ for each of the six fine-tuning domains. (Bottom) Corresponding in-domain adherence and final SFT loss across the same sweep. The KL regularization grid is $\lambda_{\text{KL}} \in \{0, 0.01, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 1\}$. Compared to the analogous BLOCK-EM results (Figure 11), KL regularization yields a weaker safety-utility trade-off, typically reducing adherence and increasing SFT loss more sharply for comparable misalignment reduction.

Figure 19: Method comparison using a combined safety metric. Same comparison as Figure 7, but defining an "adjusted" safety score as the sum of emergent misalignment and incoherence rates, $S_\lambda \equiv \mathrm{EM}_\lambda + \mathrm{Inc}_\lambda$. We report the normalized relative adjusted safety reduction as $\Delta_{\text{Adj}} = \big[(\mathrm{EM}_0 + \mathrm{Inc}_0) - (\mathrm{EM}_\lambda + \mathrm{Inc}_\lambda)\big] / (\mathrm{EM}_0 + \mathrm{Inc}_0)$, and plot $\Delta_{\text{Adj}}$ against the normalized relative in-domain adherence change $\Delta_{\text{Ad}} = (\mathrm{Ad}_\lambda - \mathrm{Ad}_0)/\mathrm{Ad}_0$, both averaged over the six domains. Higher and farther right indicate a better safety-task trade-off.

Figure 20: Causal selection outperforms baselines.
Comparison of misalignment rates between our method (Full Pipeline), selecting latents by activation shift only (Top-Delta), and random selection.

D.3 Stage-2 Induction-only Ranking Ablation (IndPP)

This variant changes only the Stage-2 ranking criterion. Stage 2 still measures both induction (steering M_base) and repair (steering the paired checkpoint), but the shortlist ranking depends only on induction strength on the base model. Concretely, instead of using the combined induction+repair score in Eq. (7), we rank by

score_IndPP(k) = misalign(M_base; α = α^{stage2}_{ind}) − misalign(M_base; α = 0),   (12)

and retain the highest-ranked candidates to form K̃_IndPP. All other Stage-2 details are unchanged. Notably, this ablation does not perform well on its own; we include it because its latents are reused later.

D.4 Stage-3 Ablations (ValidReduc)

ValidReduc modifies Stage 3 in two ways:

1. Pre-filtering for nontrivial induction and repair. From the Stage-2 shortlist K̃, we retain only latents that exhibit both (i) nonzero induction on M_base at their maximal safe inducing strength α*_ind(k) and (ii) nonzero repair on the paired checkpoint at their maximal safe repair strength α*_rep(k) (both as defined in §A.4).

2. Repair-only ranking. We then rank the remaining latents using only their repair efficiency under the quality constraint:

score_ValidReduc(k) = misalign(M_mis; α = 0) − misalign(M_mis; α = α*_rep(k)),   (13)

and select the top-N latents by this score to form K (splitting into (K+, K−) by sign(Δ_k) as usual).

The finance latent set K used throughout the main paper corresponds to Fin as the latent source combined with this ValidReduc Stage-3 rule. Empirically, we did not observe a meaningful performance gap between ValidReduc and the simpler default Stage-3 procedure described in Appendix A; we therefore present the simpler version as the primary method for readability and generality.
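As a concrete illustration, the two ranking rules above reduce to simple differences of judged misalignment rates. The sketch below is ours, not the paper's code: the helper names are invented and the per-latent rates are toy numbers, purely to show how Eq. (12) (induction-only) and Eq. (13) (repair-only) would rank a shortlist.

```python
# Toy sketch of the Stage-2/Stage-3 ranking rules (Eqs. 12-13).
# All rates below are invented; `misalign(...)` values stand in for
# the paper's judged emergent-misalignment rates.

def score_indpp(em_base_steered, em_base_unsteered):
    """Eq. (12): induction strength on the base model only."""
    return em_base_steered - em_base_unsteered

def score_validreduc(em_mis_unsteered, em_mis_repaired):
    """Eq. (13): repair efficiency on the misaligned checkpoint only."""
    return em_mis_unsteered - em_mis_repaired

# candidate latent -> (base EM when steered, base EM unsteered,
#                      mis EM unsteered, mis EM when repaired)
candidates = {
    101: (0.35, 0.02, 0.40, 0.10),
    202: (0.10, 0.02, 0.40, 0.30),
    303: (0.25, 0.02, 0.40, 0.05),
}

# rank the shortlist by repair efficiency (ValidReduc rule)
ranked = sorted(
    candidates,
    key=lambda k: score_validreduc(candidates[k][2], candidates[k][3]),
    reverse=True,
)
print(ranked)  # latent 303 repairs the most EM, then 101, then 202
```

Under the IndPP rule the same shortlist would instead be ordered by the first difference alone, which is why the two ablations can select different latents.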
Overall, Stages 1 and 2 lead to shortlists of approximately |K̃| ≈ 25–80 latents, depending on the variant.

D.5 Constructed latent sets and λ sweeps

Combining (i) latent sources (including IndPP) with (ii) the Stage-3 rule (default vs. ValidReduc) yields multiple candidate latent sets. We instantiate 15 total latent sets as follows. All latent sets used in this section (explicit latent indices for each variant and size) are included in the supplementary material and accompanying code release. This enables practitioners to skip the latent-selection pipeline overhead and directly apply the training-time BLOCK-EM constraint using any of the provided K sets. Concretely, once a latent set is fixed, training only requires computing the base-model reference activations z^{(base)}_{t,k}(x) for each SFT prompt (via a single forward pass of the frozen base model with gradients disabled) in addition to the usual forward/backward pass of the trainable model.

Union-of-all sources (default Stage-3). We take the union of latents sourced from Fin, MaxLoRA20, IndPP, Health, and Reem, i.e., we unite the Stage-2 shortlists K̃_Fin, K̃_MaxLoRA20, K̃_IndPP, K̃_Health, and K̃_Reem, and then form sets of sizes |K| ∈ {20, 30, 40, 60, 100}.

Union-of-all sources (ValidReduc Stage-3). Using the same union-of-sources construction but applying ValidReduc in Stage 3, we form sets of sizes |K| ∈ {20, 30, 42}.

Fin+Reem (default Stage-3). We take the union of latents from Fin and Reem only and form sets of sizes |K| ∈ {20, 30, 40, 60, 100}.

Fin+Reem (ValidReduc Stage-3). Using the same Fin+Reem union but applying ValidReduc in Stage 3, we form sets of sizes |K| ∈ {20, 29}.

Blocked training across all latent sets. For each of the 15 latent sets above, we repeat the full λ sweep: we re-run SFT on the finance domain with L_total = L_SFT + λ L_block across a grid of λ values, and evaluate both (i) emergent misalignment on the fixed, domain-agnostic core misalignment suite and (ii) in-domain adherence on the finance domain.
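To make the blocked-training objective concrete, here is a minimal toy sketch of L_total = L_SFT + λ·L_block. It is an assumption-laden simplification, not the paper's implementation: the exact signed penalty from Appendix A is replaced by a plain squared deviation of the constrained latents from the frozen base model's reference activations, and every number is invented for illustration.

```python
# Minimal sketch of L_total = L_SFT + lambda * L_block on toy scalars.
# A real run computes z^(base) via one no-grad forward pass of the
# frozen base model, then penalizes drift of the trainable model's
# SAE latent activations on the same SFT prompt.

def blocking_loss(z_trained, z_base_ref, latent_set):
    """Mean squared drift of the constrained latents from the
    base-model reference activations (simplified, unsigned form)."""
    return sum((z_trained[k] - z_base_ref[k]) ** 2 for k in latent_set) / len(latent_set)

def total_loss(l_sft, z_trained, z_base_ref, latent_set, lam):
    return l_sft + lam * blocking_loss(z_trained, z_base_ref, latent_set)

z_base = {7: 0.0, 19: 0.1, 42: 0.0}    # frozen reference activations
z_now  = {7: 0.5, 19: 0.1, 42: -0.2}   # latents drifting during SFT
K = [7, 19, 42]

print(total_loss(2.0, z_now, z_base, K, lam=10.0))
```

With λ = 0 this recovers plain SFT; sweeping λ over the grid traces out the safety–utility curves summarized in the figures of this section.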
Figure 21 summarizes the resulting safety–utility trade-offs.

D.6 Findings

Across latent-set constructions, we observe a consistent qualitative trend: increasing the latent set size generally reduces emergent misalignment, but also tends to reduce in-domain adherence for sufficiently large sets. After controlling for latent set size, we do not observe a large or systematic advantage of any single latent source or selection-rule variant; differences between variants are comparatively small relative to the dominant effects of |K| and the training-time penalty strength λ (Figure 21). However, as the latent set size grows, the best safety–performance trade-off is achieved at smaller blocking strengths. Applying a large blocking strength to a large latent set can destabilize training and lead to degraded model behavior.

Taken together, these results suggest that the BLOCK-EM latent selection procedure is robust to reasonable choices of (i) the checkpoint pair used to source latents and (ii) minor changes to the Stage-2/Stage-3 ranking and filtering rules. In this sense, the variants behave similarly to alternative instantiations (or "seed-like" choices) of the same overall pipeline rather than qualitatively distinct algorithms.

D.7 Higher-Performing Latent Sets

Finally, we report additional cross-domain transfer results. We run the same six-domain transfer evaluation for two larger latent-set variants, Fin+Reem-|K|=100 (default Stage-3) and ValidReduc-All-|K|=42, and summarize their safety–quality trade-offs in Figure 22. Figure 23 compares these variants against the main-text configuration (ValidReduc-Fin-|K|=20). Across settings, we observe that BLOCK-EM variants consistently outperform the KL baseline, and that ValidReduc-All-|K|=42 achieves the best overall trade-off. For example, at λ = 10^4 it attains a 97% relative reduction in emergent misalignment with 5.75% incoherence and a 40.37% increase in in-domain task performance.
This result provides an additional datapoint that BLOCK-EM need not reduce target-task performance and, in some regimes, can even improve it.

D.8 Latent Set Size Ablation

Lastly, to probe the dimensionality of the misalignment mechanism, we sweep the constrained set size |K| (the set used in §4) from 1 to 20. Figure 24 plots emergent misalignment on final evaluation versus |K|. Misalignment falls as more causal latents are constrained, with a pronounced drop once |K| ≳ 13, suggesting emergent misalignment is mediated by a small but non-trivial set of features. We additionally probe whether the sharp "knee" observed in the latent set size sweep (Figure 24) is driven by a small number of especially important latents, or instead reflects a collective effect of constraining a sufficiently large set. Figure 25 isolates the three latents added when increasing |K| from 10 to 13.

E Extended Ablations

This appendix reports additional ablations probing (i) the importance of directionality in the BLOCK-EM penalty, (ii) cross-domain validation of the discovered mechanism, and (iii) a variant that applies the constraint at the final layer rather than an intermediate layer. Unless otherwise stated, we use the same fine-tuning setup and evaluation protocol as in the main results (§4.2).

E.1 Directionality and Component Analysis (Mechanism Verification)

Our primary method splits the causal latent set into K+ (features that increase during misalignment) and K− (features that decrease). The loss function penalizes movement in these specific directions.
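One plausible signed form of such a penalty is a one-sided hinge: K+ latents are penalized only for rising above their base-model reference, K− latents only for falling below it. The sketch below is our illustrative construction (the paper's exact definition is in its Appendix A), with invented numbers; it also shows why swapping the K+/K− assignment can miss the misalignment-associated movement entirely.

```python
# Illustrative one-sided ("hinge") signed penalty -- an assumed form,
# not the paper's exact objective. K+ latents are penalized for rising
# above the base reference, K- latents for falling below it.

def directional_penalty(z, z_ref, k_plus, k_minus):
    up = sum(max(z[k] - z_ref[k], 0.0) ** 2 for k in k_plus)
    down = sum(max(z_ref[k] - z[k], 0.0) ** 2 for k in k_minus)
    return up + down

z_ref = {1: 0.0, 2: 1.0}   # base-model reference activations
z = {1: 0.4, 2: 0.5}       # latent 1 rose, latent 2 fell: the misaligned pattern

correct = directional_penalty(z, z_ref, k_plus=[1], k_minus=[2])
shuffled = directional_penalty(z, z_ref, k_plus=[2], k_minus=[1])
print(correct, shuffled)   # the shuffled assignment misses both movements
```

In this toy case the correctly signed penalty fires on both latents, while the shuffled assignment is exactly zero despite identical activation drift, mirroring the intuition behind the shuffled-signs ablation.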
To verify that this directional information is critical, we performed the following ablations:

Figure 21: Latent selection ablations (finance blocked training). Safety–utility trade-offs from repeating the λ sweep (SFT with L_total = L_SFT + λ L_block) on the finance domain using 15 different latent sets formed by varying the latent source (Fin/Health/Reem/MaxLoRA20 and unions thereof) and/or the selection rule (IndPP Stage-2 ranking, ValidReduc Stage-3 filtering/ranking). As |K| increases, both emergent misalignment on core misalignment and in-domain adherence typically decrease, with no single variant consistently dominating at matched set sizes.
Figure 22: Additional cross-domain transfer trade-offs for larger latent sets. Safety–quality trade-off curves as a function of blocking strength λ, evaluated on final evaluation and averaged across six domains and two seeds. Top: ValidReduc-All with |K| = 42 (panel title: "3 Seeds, 6 Domains, Average"). Bottom: Fin+Reem with |K| = 100 (panel title: "1 Seed, 5 Domains, Average"). Notably, ValidReduc-All-|K| = 42 achieves the strongest overall trade-off among the tested variants (e.g., at λ = 10^4: 95.10% relative misalignment reduction, 0.88% decrease in absolute incoherence, and a 24.65% relative increase in in-domain performance). The error margins are SEM = SD/√6.

Figure 23: Comparing transfer variants and baselines. Summary comparisons across six domains between the main-text configuration (ValidReduc-Fin, |K| = 20) and two larger-set variants (ValidReduc-All, |K| = 42; Fin+Reem, |K| = 100), alongside the KL baseline. Top: emergent misalignment versus in-domain performance. Bottom: overall quality-performance trade-off (adjusted metric used in the main text). Across metrics, larger latent sets can yield improved safety-quality trade-offs, with ValidReduc-All-|K| = 42 performing best overall.

Figure 24: Effect of latent set size. Emergent misalignment rate vs. number of constrained latents |K|.
Suppression strengthens with set size and shows a "knee" around |K| ≈ 13. This transition is not explained solely by the presence of the three new latents (see Figure 25).

Shuffled Signs. We construct a "shuffled" baseline where we randomly swap the assignment of latents to K+ and K− while keeping the set K identical. This breaks the correspondence between each feature and its misalignment-associated direction. As shown in Figure 26, this substantially weakens suppression compared to the correctly signed objective, confirming that BLOCK-EM depends on constraining directional movement in activation space rather than merely shrinking feature magnitudes.

Single-sided Constraints (K+ only / K− only). We also evaluate constraining only the increasing features (K+) or only the decreasing features (K−). Both one-sided variants are weaker than constraining the full signed set, suggesting that both types of feature movement contribute to emergent misalignment (Figure 26).

E.2 Cross-Domain Latent Selection Validation

As a further validation of cross-domain transfer (complementing §4.2 and 15), we performed the reverse experiment: identifying latents by running the entire pipeline on a misaligned model supervised fine-tuned on the health advice domain, and using them to constrain the financial advice supervised fine-tuning. Consistent with our main transfer results, the Health-derived latents suppress emergent misalignment in the Finance task (Figure 17). This supports the view that the discovered mechanism is not narrowly domain-specific.

E.3 Moving the Constraint to the Final Layer

Our main experiments apply the BLOCK-EM penalty at layer 20, which directly constrains only that layer's activations and does not explicitly restrict downstream representations (layers 21-32).
To test whether the same mechanism can be targeted at later depths, we reran our Stage 1-3 pipeline at layer 32: we identify a causal latent set by model-diffing M_base and M_mis, and we apply the resulting signed BLOCK-EM objective during fine-tuning. For layer 32, we use the SAE released by He et al. (2024). Note that it is trained for Llama-3.1-8B-Base rather than Llama-3.1-8B-Instruct, so there is a slight SAE mismatch in our final-layer experiment. Figure 27 summarizes the resulting λ sweep, stability analysis, and multi-epoch behavior.

Figure 25: Are the three added latents responsible for the knee? In Figure 24, the emergent misalignment rate shows a pronounced knee when expanding the constrained set from the top-10 scored latents to 13 (adding three additional latents). To test whether this effect is driven specifically by those three latents, we run the same λ sweep while penalizing only these three latents. The added latents alone yield weak suppression, indicating that the transition arises from constraining a sufficiently large latent set rather than from any special property of these three latents.

Overall, final-layer constraints yield substantially weaker suppression than the corresponding layer-20 intervention, suggesting that the discovered mechanism is most effectively controlled at intermediate depths rather than at the output-adjacent representation.

F Details for Re-emergent Misalignment Phenomenon Analysis

While robust in the standard regime (one epoch), we find that with continued training, misalignment eventually re-emerges even when constraints are applied (Figure 8).
For the multi-epoch setting in Figure 8, we make a small optimization change relative to our single-epoch experiments (Appendix B.4): we use a constant learning-rate schedule with lr = 3.75×10^{-5}, instead of the linear decay-to-zero schedule with initial lr = 7.5×10^{-5} used elsewhere. We adopt this configuration so that the effective update magnitude during the first epoch is roughly comparable to the single-epoch setup. This choice is purely for completeness: none of our analyses rely on a direct comparison between the first epoch of the multi-epoch runs and the single-epoch runs, and our conclusions about misalignment re-emergence under over-training do not depend on this scheduler change.

F.1 Causal localization tests for H2/H3 via activation patching

This appendix reports the patching-based evidence we currently have for localizing where re-emergent misalignment is implemented relative to the blocking layer (layer 20, where the BLOCK-EM penalty is applied). Recall the two-part view: (A) layers up to and including the blocking layer, and (B) layers strictly downstream of it. Unless otherwise stated, we use the same EM/incoherence/refusal judges and prompt suite as in the main experiments.

Figure 26: Directionality and selection ablations. Ablations that modify the signed split of K (e.g., K+ only / K− only / shuffled signs). From top to bottom: emergent misalignment, incoherence, and refusal rates vs. λ on final evaluation.
Figure 27: Extending the intervention to the final layer. To find SAE latents at layer 32 that are causally relevant to EM, we reran our Stage 1-3 pipeline to select latents relevant to misalignment in the final layer by model-diffing M_base and M_mis. Across the λ sweep, stability analysis, and multi-epoch results (shown in the panels), interventions at layer 32 are substantially less effective than the corresponding layer-20 interventions.

Notation. We use the hidden-state notation from Appendix A: for an input token sequence x = (x_1, ..., x_T), h_{L,t}(x) ∈ R^d denotes the post-residual hidden state at layer L and token position t. We write h_{L,1:s}(x) ≜ (h_{L,1}(x), ..., h_{L,s}(x)) ∈ R^{s×d} for the collection of layer-L hidden states over token positions 1 through s. Let T_pref denote the number of prefix tokens in x; tokens t > T_pref are generated autoregressively. We denote the base model by M_base and the re-emerged model by M_reem, and let L_blk denote the blocking-layer index (here, L_blk = 20). Specifically, the re-emergent model corresponds to the checkpoint obtained by training the base model with LoRA on the finance domain under (1) with λ = 3000 for two epochs, which yields ∼32% misalignment on final evaluation.

Experiment 1: Prefix-only patching on prefix states (layerwise sweep).
This experiment probes whether making the re-emerged model's prefix representations more base-like is sufficient to prevent downstream layers from reintroducing emergent misalignment. For a chosen layer L, we run both models on the same prefix (i.e., the first T_pref tokens) and patch only the hidden states corresponding to those prefix tokens at layer L:

h^{(reem)}_{L,1:T_pref}(x) ← h^{(base)}_{L,1:T_pref}(x).

Figure 28: SAE reconstruction remains stable under extended training. As a sanity check for H1, we track reconstruction MSE and cosine similarity between true layer-20 activations and their SAE reconstructions for the re-emerged checkpoint (2 epochs, λ = 3000). The SAE continues to model the layer-20 activation distribution well throughout training.
Figure 29: Re-emergence persists when freezing above the blocking layer. Under extended training, misalignment still re-emerges even when we fine-tune only through layer 20 (the blocking layer) and freeze all layers above it.

Figure 30: Blocking-loss trajectory over training. To verify that the constrained latents remain suppressed throughout fine-tuning (and do not gradually reactivate with longer training), we track the BLOCK-EM penalty value across epochs. The blocking loss stays near zero for the entire run, indicating that any re-emergence effects are not driven by increased activation of the penalized latents.

We apply this intervention only while processing the prefix tokens. We then generate completions normally (with no further patching) and evaluate EM.

Result: For each layer L, we evaluate emergent misalignment, incoherence, and refusal rates on final evaluation, with prefix-only patching applied at layer L. Incoherence and refusal rates are 0% across layers in this experiment; the remaining variation in emergent misalignment is shown in Figure 31.
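As a schematic of the prefix-only patch (our toy illustration, not the paper's implementation, which would use forward hooks on real transformer layers), each "layer" below is a simple function on per-token scalar states, purely to show where the prefix states at layer L are overwritten with the base model's:

```python
# Toy prefix-only activation patching: at one chosen layer, overwrite
# only the prefix-token states with the base model's states, then let
# the rest of the (re-emerged) computation proceed untouched.

def run_model(layers, tokens, patch_layer=None, patch_states=None, t_pref=0):
    states = list(tokens)
    for L, layer in enumerate(layers):
        states = [layer(s) for s in states]
        if patch_layer == L and patch_states is not None:
            # h^(reem)_{L,1:T_pref} <- h^(base)_{L,1:T_pref}
            states[:t_pref] = patch_states[L][:t_pref]
    return states

base_layers = [lambda s: s + 1, lambda s: s * 2]
reem_layers = [lambda s: s + 5, lambda s: s * 2]   # diverges upstream

tokens = [0, 0, 0]
# record the base model's layerwise states on the same input
base_trace, s = [], list(tokens)
for layer in base_layers:
    s = [layer(x) for x in s]
    base_trace.append(list(s))

patched = run_model(reem_layers, tokens, patch_layer=0,
                    patch_states=base_trace, t_pref=2)
print(patched)  # prefix positions follow the base computation; the rest do not
```

Here the two prefix positions end up with the base model's values while the third (generated) position still reflects the re-emerged model's upstream divergence, which is exactly the asymmetry the layerwise sweep exploits.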
Sweeping L across layers shows that patching upstream layers (upstream of the blocking layer) yields larger reductions in EM than patching the blocking layer or downstream layers. We treat this as weak but consistent evidence that part (A) is important for setting up the representations that enable re-emergent misalignment: when the prefix representations in (A) are made base-like, part (B) appears less able to recover misaligned behavior downstream. Because this experiment patches only prefix states and does not intervene on generated-token states, it primarily tests how the prefix-conditioned internal state influences downstream behavior. It does not fully rule out downstream contributions during generation. That brings us to our second patching experiment.

Experiment 2: Decode-time patching at the blocking layer (generated-token patching). This experiment directly intervenes during autoregressive decoding by patching the re-emerged model at the blocking layer on the currently generated token. At each generation step producing token position t > T_pref, we compute the blocking-layer hidden state under both models on the same full prefix (x_1, ..., x_t) and replace only the last-position state in the re-emerged model:

h^{(reem)}_{L,t}(x) |_{L = L_blk} ← h^{(base)}_{L,t}(x) |_{L = L_blk}.

Equivalently, writing the last position explicitly,

h^{(reem)}_{L_blk,t}(x) ← h^{(base)}_{L_blk,t}(x),   t > T_pref.

We then continue the forward computation of the re-emerged model through layers > L_blk to obtain next-token logits and sample the next token. This patch is applied at every decoding step, so it intervenes on all generated tokens.

Result: We tested patching only the blocking layer at decode time on final evaluation. It eliminates EM in our re-emerged checkpoint (0% misalignment), while maintaining 0% incoherence and 2% refusal.
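The decode-time patch can be sketched the same way (again a toy of our own, not the paper's code): compute the base model's state up through the blocking layer, then hand it to the re-emerged model's downstream layers.

```python
# Toy decode-time patching at the blocking layer: the state at L_blk is
# taken from the base model, and only the re-emerged model's layers
# strictly above L_blk run on it. Scalar "layers" stand in for a model.

def hidden_at(layer_fns, x, upto):
    """State after layers 0..upto (inclusive) for one position."""
    for fn in layer_fns[: upto + 1]:
        x = fn(x)
    return x

def decode_with_patch(base_layers, reem_layers, x, l_blk):
    h = hidden_at(base_layers, x, l_blk)   # base state at the blocking layer
    for fn in reem_layers[l_blk + 1:]:     # re-emerged layers above L_blk
        h = fn(h)
    return h

base_layers = [lambda s: s + 1, lambda s: s + 1, lambda s: s * 3]
reem_layers = [lambda s: s + 9, lambda s: s + 9, lambda s: s * 3]

print(decode_with_patch(base_layers, reem_layers, x=0, l_blk=1))
```

Because everything above L_blk is left as the re-emerged model's own computation, the paper's finding that this patch alone eliminates EM is what localizes the relevant signal at or below the blocking layer.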
Figure 31: Prefix-only activation patching (layerwise sweep). Patching upstream layers reduces emergent misalignment more than patching downstream layers. (Baseline: the re-emerged checkpoint at 32.8% misalignment, multi-judge; the marked layer is the blocking layer.)

Implications for A vs. B responsibility. Both experiments point to substantial responsibility in part (A): (i) patching prefix-token states at upstream layers reduces EM more than patching downstream layers, and (ii) patching only the blocking-layer state of the generated token eliminates EM without quality degradation. Notably, in (ii) all layers downstream of the blocking layer remain unchanged, yet EM disappears; this indicates part (B) is not sufficient on its own to produce re-emergent misalignment, and that the relevant signal is already present at (or upstream of) the blocking layer during generation.

F.2 Residual steering capacity of the re-emergent model

We rerun the causal SAE latent-discovery pipeline described in §3 and Appendix A, diffing the re-emergent checkpoint M_reem against the base checkpoint M_base. This yields a set of the 20 most promising layer-20 latents, which we denote K_reem. To quantify residual steering capacity, we evaluate each latent set k ∈ {K, K_reem} using the score in (9):

score(k) = [misalign(base; α = α*_ind(k)) − misalign(base; α = 0)] + [misalign(mis; α = 0) − misalign(mis; α = α*_rep(k))].

The first bracket measures how much induction the set k can produce on the base model relative to no steering, using the optimal inducing scale α*_ind(k). The second bracket measures how much repair the same set can provide on a target misaligned checkpoint, again relative to no steering, using the optimal repair scale α*_rep(k). For K, we reuse the steering scores computed during the original selection stage and report the mean score averaged across latents in K.
For K_reem, we compute the same score but evaluate the repair term on the re-emerged checkpoint (i.e., set mis = reem), and analogously average over the latents in K_reem. Under this metric, K attains an average score of 24%, while K_reem attains an average score of 14%. Therefore, the steering-capacity ratio of the re-emergent model's most promising layer-20 latents relative to K is

score(K_reem) / score(K) ≈ 14/24 ≈ 0.6.

This suggests that the re-emergent model retains nontrivial residual steering capacity in layer 20, but that this capacity is substantially reduced relative to the λ = 0 baseline.

Figure 32: Expanded blocking set further suppresses re-emergent misalignment under extended training. Emergent misalignment rate on held-out final evaluation prompts across training epochs for different penalty strengths λ. Blue curves show standard BLOCK-EM using the original latent set K, while red curves (Fin+Reem) show BLOCK-EM applied to the union of K and additional layer-20 latents discovered from the re-emerged checkpoint (this variant has 100 latents). Blocking the expanded latent set consistently reduces misalignment across epochs and λ values, indicating that re-emergence can be supported by alternative directions within the same blocking-layer representation space.
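For reference, the steering-capacity score in (9) and the ratio from §F.2 reduce to simple arithmetic. In the sketch below, the per-latent rates are invented, while the 24%/14% mean scores and the resulting ratio are the values reported above.

```python
# Worked sketch of the steering-capacity score (Eq. 9) and the
# steering-capacity ratio. Per-latent rates are illustrative only.

def steering_score(em_base_induced, em_base, em_mis, em_mis_repaired):
    """Induction gain on the base model plus repair gain on a
    misaligned checkpoint, both relative to no steering (alpha = 0)."""
    induction = em_base_induced - em_base
    repair = em_mis - em_mis_repaired
    return induction + repair

# invented example: a latent inducing 10 points of EM and repairing 14
score = steering_score(0.12, 0.02, 0.40, 0.26)
print(score)

# reported mean scores: 24% for K, 14% for K_reem
ratio = 0.14 / 0.24
print(round(ratio, 1))  # roughly 0.6, matching the ratio stated above
```

A score near zero would indicate a latent set that can neither induce misalignment on the base model nor repair it on the misaligned checkpoint, i.e., no residual steering capacity.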