
Paper deep dive

Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning

Tony Cristofano

Year: 2026 · Venue: arXiv preprint · Area: Model Editing · Type: Empirical · Embeddings: 28

Models: Ministral-14B-Instruct-2512, Ministral-3B-Instruct-2512, Qwen3-VL-2B-Instruct, Qwen3-VL-4B-Instruct, Qwen3-VL-8B-Instruct

Abstract

Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw "refusal vector" calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style. We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model's semantic geometry. Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0–2%) with negligible perplexity impact on Wikitext-2 (mean ΔPPL ≈ 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common "model damage" is often "Ghost Noise," defined as the spectral bleeding of the dirty refusal direction into capability subspaces.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · model-editing (suggested, 92%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 5:56:32 PM

Summary

Surgical Refusal Ablation (SRA) is a technique for disentangling safety-aligned refusal signals from core model capabilities (math, coding, style) in LLMs. By constructing a registry of 'Concept Atoms' (Targets, Shields, and Confounds) and using ridge-regularized spectral residualization, SRA cleans the 'dirty' refusal vector, allowing for targeted ablation that minimizes distribution drift and preserves model performance, effectively mitigating 'Ghost Noise' or spectral bleeding.

Entities (7)

GSM8K · dataset · 100%
MBPP · dataset · 100%
Surgical Refusal Ablation · method · 100%
Concept Atom Registry · framework · 95%
Ministral · model · 95%
Qwen3-VL · model · 95%
Ghost Noise · phenomenon · 90%

Relation Signals (3)

Concept Atom Registry usedby Surgical Refusal Ablation

confidence 100% · SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds

Surgical Refusal Ablation appliedto Qwen3-VL

confidence 95% · Notably, on Qwen3-VL-4B, standard ablation induces severe distribution drift... whereas SRA maintains the original distribution

Surgical Refusal Ablation reduces Ghost Noise

confidence 95% · SRA removes “Ghost Noise” components aligned with Shield/Style directions while retaining Target (refusal-relevant) signal.

Cypher Suggestions (2)

Find all models evaluated using the SRA method. · confidence 90% · unvalidated

MATCH (m:Model)-[:EVALUATED_WITH]->(s:Method {name: 'Surgical Refusal Ablation'}) RETURN m.name

Identify the relationship between methods and the phenomena they mitigate. · confidence 85% · unvalidated

MATCH (m:Method)-[:MITIGATES]->(p:Phenomenon) RETURN m.name, p.name

Full Text

27,579 characters extracted from source content.


Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning Tony Cristofano tcristo@gmail.com (January 2026) Abstract Safety-aligned instruction-tuned language models exhibit systematic refusal of harmful requests. While recent representation engineering work demonstrates that refusal can be modulated through low-dimensional activation steering, directly ablating the “refusal vector”—computed from contrastive harmful vs. harmless prompts—frequently induces collateral damage to model behavior and distribution drift. We argue that this degradation occurs because the raw refusal vector is polysemantic: it entangles the refusal signal with unrelated linguistic structure (syntax, formatting) and core capability circuits (math, coding, reasoning). We introduce Surgical Refusal Ablation (SRA), a technique for distilling steering directions before intervention. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, and uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model’s semantic geometry. Across five models (Qwen3-VL-2B/4B/8B, Ministral-3B/14B), SRA achieves deep refusal reduction (0–2% on our harmful suite) with negligible perplexity impact on wikitext_2_raw (mean Δ ≈ +0.02) and minimal measured distribution drift (mean first-token KL ≈ 0.025). Notably, on Qwen3-VL-4B, standard ablation induces severe distribution drift (KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. To evaluate capability retention at higher resolution than discrete task accuracy, we report teacher-forced perplexity on 1,000-example subsets from GSM8K (math) and MBPP (code). 
SRA preserves (and often improves) math/code perplexity across all five models (largest GSM8K increase: +0.0439), consistent with minimal capability-distribution disruption under this proxy. We use this framework to show that what appears to be “model damage” is often “Ghost Noise”—spectral bleeding of the dirty refusal direction into capability subspaces. 1 Introduction Modern instruction-tuned Large Language Models (LLMs) integrate safety alignment through fine-tuning or preference optimization. A common emergent behavior is refusal—a stereotyped response to unsafe prompts. Recent work has shown that refusal can be mediated by a low-dimensional steering direction in activations, enabling ablation techniques that remove these directions to restore compliance [1]. However, a critical methodological limitation persists: the “refusal vector” is dirty. Commonly calculated by subtracting mean activations of harmless prompts from harmful ones, this direction captures not only refusal signals, but also syntactic structure, distributional shifts, and task-specific features. We demonstrate that ablating this raw vector causes “Ghost Noise” damage: spectral perturbations that erode critical capabilities like logic and coding. For example, in Qwen3-VL-2B we observe that the raw refusal direction exhibits non-trivial cosine similarity with protected capability atoms such as Logic (approximately −0.22) and Coding (approximately +0.18), indicating entanglement between refusal and capability/style subspaces. This phenomenon suggests that capability degradation observed after naive refusal-direction ablation may not be an inherent trade-off, but instead an artifact of imprecise interventions that inadvertently remove capability-relevant directions correlated with the harmful prompt set. To address this, we introduce Surgical Refusal Ablation (SRA), a method for cleaning refusal directions.
Our contribution is threefold:
• We diagnose refusal ablation as a polysemantic interference problem, identifying how reasoning circuits are often entangled in the refusal direction.
• We introduce a Concept Atom Registry and Spectral Residualization procedure to distill the refusal vector by removing protected capability/style components.
• We show SRA dramatically reduces measured distribution drift compared to standard ablation while preserving math/code distributions under a high-resolution perplexity proxy.
High-level pipeline. SRA (i) maps the geometry of refusal against a registry of interpretable semantic atoms, (ii) cleans the refusal direction by residualizing against Shield and Confound atoms, and (iii) applies targeted low-rank weight updates scaled by a semantic energy proxy to ablate the cleaned signal while preserving the model’s original topology. 2 Related Work 2.1 Activation Steering & Refusal Arditi et al. [1] identified that refusal is mediated by a low-rank direction in the residual stream. This relates to Contrastive Activation Addition (CAA) [11] and related activation engineering methods [9]. These methods often assume the contrastive vector represents a single concept; our findings emphasize that such vectors can be polysemantic and benefit from explicit cleaning. 2.2 Concept Erasure & Orthogonalization Work like LEACE [2] and INLP [8] removes information from representations to prevent probing of specific attributes. SRA adapts this intuition to behavioral circuits: we do not want to erase “Logic”; we want to edit refusal while enforcing that the edit direction is orthogonal to protected capability/style directions. 2.3 Weight Editing vs. Vector Quality Methods like ROME [5], MEMIT [6], and AlphaEdit [3] edit parameters while constraining collateral damage. SRA is complementary: even with simple projection-style interventions, the limiting factor can be the semantic purity of the vector being removed.
Cleaning the steering direction itself can substantially reduce side effects. 2.4 Model Editing Targeted weight editing methods like MEND [7] and instruction-based/localized edits like InstructEdit [10] and GRACE [4] explore reliable updates with minimal collateral effects. Our work adapts low-rank projection updates to behavioral refusal subspaces rather than factual knowledge. 2.5 Null-Space Projection in Editing (Brief Connection) Our findings align with and extend the “orthogonality principle” formalized in AlphaEdit [3]. AlphaEdit shows that factual edits should be projected onto the null space of preserved knowledge keys to prevent forgetting. Here we show that behavioral edits benefit from projecting the intervention direction onto the null space of protected capability/style directions (Shields/Confounds), reducing capability drift. 3 Method: Surgical Refusal Ablation 3.1 The Polysemantic “Dirty” Vector Let μ_ℓ(D) denote the mean residual-stream activation at layer ℓ over a dataset D. The standard refusal direction is:

r_ℓ^dirty = μ_ℓ(D_harm) − μ_ℓ(D_safe)   (1)

We hypothesize that r_ℓ^dirty is polysemantic:

r_ℓ^dirty = s_ℓ + Σ_k α_k a_ℓ^(k)   (2)

where s_ℓ is the refusal signal of interest, and the a_ℓ^(k) are independent concept directions (math, coding, style) that correlate with harmful prompts. 3.2 The Semantic Atom Registry We construct a registry of K Concept Atoms to define a protected semantic space:
• Targets (Attractors): concepts representing refusal-relevant semantics (e.g., privacy, deception, epistemic uncertainty).
• Shields (Constraints): critical capabilities entangled with refusal (e.g., Logic, Math, Coding, Curiosity).
• Confounds (Style): stylistic/linguistic features correlated with refusal templates (e.g., imperative negation grammar, sentiment, affirmatives).
Each atom is computed from a small contrastive pair of datasets:

a_ℓ^(k) = μ_ℓ(D_+^(k)) − μ_ℓ(D_−^(k))   (3)

Data-light construction.
In our implementation, each concept atom is computed from two short contrastive prompt files containing typically 10–15 prompts each (20–30 prompts total per atom). The harmful/harmless refusal contrast used to compute r_ℓ^dirty is larger (e.g., 112 harmful prompts with a matched harmless set). This separation is intentional: atoms are measured on independent datasets so they capture general capability/style directions rather than artifacts of the refusal prompt set. 3.3 Spectral Residualization (The Cleaning Step) We perform ridge-regularized regression of the dirty refusal vector on Shield+Confound atoms to estimate entangled components, then subtract them:

ŵ = argmin_w ‖r_ℓ^dirty − A_SC w‖² + λ‖w‖²   (4)

r̃_ℓ = r_ℓ^dirty − A_SC ŵ   (5)

where A_SC is the matrix formed by concatenating Shield and Confound atoms. Intuitively, r̃_ℓ removes the portion of the refusal direction that is predictable from protected capability/style directions. 3.4 Iterative Hard-Negative Refinement We utilize an iterative loop where failed edits (prompts that still refuse) are fed back into the calculation to refine r_ℓ^dirty, ensuring the clean vector captures robust refusal modes. 3.5 Rank-One Weight Update We apply a rank-one projection update to selected projection matrices (MLP and/or attention) to suppress the clean refusal direction. For a weight matrix W and unit direction v aligned with r̃:

W′ = (I − γ v v⊤) W   (6)

This suppresses components aligned with v in the output space of the projection while leaving orthogonal components minimally perturbed. Choosing γ (Semantic Energy Proxy). We scale edit strength using a Semantic Energy Proxy to reflect refusal-relevant signal magnitude at layer ℓ. One simple choice is to set γ_ℓ ∝ ‖a_ℓ^(dec)‖₂, where a^(dec) is a representative Target atom such as Deception. In practice, γ can also be stabilized by calibrating to typical activation magnitudes at the edit site.
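The cleaning step in Eqs. (4)–(5) and the rank-one update in Eq. (6) amount to a ridge solve plus an outer-product projection. The following is a minimal NumPy sketch; the function names and the default λ are illustrative choices, not taken from the paper's released code:

```python
import numpy as np

def residualize(r_dirty, A_sc, lam=1e-2):
    """Ridge-regress the dirty refusal direction on Shield+Confound atoms
    and subtract the explained component (Eqs. 4-5).

    r_dirty: (d,) dirty refusal direction at one layer
    A_sc:    (d, k) matrix whose columns are Shield and Confound atoms
    lam:     ridge strength (illustrative default)
    """
    k = A_sc.shape[1]
    # w_hat = argmin_w ||r_dirty - A_sc w||^2 + lam ||w||^2
    w_hat = np.linalg.solve(A_sc.T @ A_sc + lam * np.eye(k), A_sc.T @ r_dirty)
    return r_dirty - A_sc @ w_hat  # cleaned direction r_tilde

def rank_one_ablate(W, r_tilde, gamma=1.0):
    """Suppress the cleaned direction in the output space of W (Eq. 6):
    W' = (I - gamma * v v^T) W  with  v = r_tilde / ||r_tilde||."""
    v = r_tilde / np.linalg.norm(r_tilde)
    return W - gamma * np.outer(v, v) @ W
```

As λ → 0 this reduces to exact least-squares residualization, leaving r̃ numerically orthogonal to every Shield/Confound atom; the ridge term trades a little residual overlap for stability when atoms are nearly collinear.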
Algorithm 1 Iterative Surgical Refusal Ablation (SRA)
0: Model M, harmful set D_harm, harmless set D_safe, concept registry datasets {D_+^(k), D_−^(k)}, k = 1…K
1: Choose target layers (e.g., 15–25) and target projection matrices.
2: for t = 1 to T do
3:   Compute dirty refusal direction r_ℓ^dirty = μ_ℓ(D_harm) − μ_ℓ(D_safe).
4:   Compute concept atoms a_ℓ^(k) = μ_ℓ(D_+^(k)) − μ_ℓ(D_−^(k)).
5:   Form A_SC from Shield + Confound atoms.
6:   Residualize: r̃_ℓ ← r_ℓ^dirty − A_SC ŵ (ridge regression).
7:   Apply rank-one update(s): W ← (I − γ v v⊤) W with v = r̃ / ‖r̃‖.
8:   Mine hard negatives: update D_harm to prompts still refused by M.
9: end for
4 Empirical Evidence: The Anatomy of a Dirty Vector 4.1 Concept Orthogonality Map We measure cosine similarity among concept atoms and the refusal vectors across layers 15–25. The map reveals that the dirty refusal vector exhibits non-trivial correlations with multiple capability/style directions, indicating polysemantic entanglement. Figure 1: Heatmap across concept atoms and the dirty refusal vector. The dirty vector correlates with multiple semantic/style components, motivating cleaning via residualization. 4.2 Spectral Breakdown: The Anatomy of a “Dirty” Vector We compare the projection magnitude of the Standard (dirty) vector vs. the Surgical (cleaned) vector on representative Targets, Shields, and Style confounds. Standard ablation suppresses Shields (e.g., coding/logic/math) nearly as strongly as refusal-related Targets, while SRA preserves Shields by removing their components from the intervention direction. Figure 2: Spectral Breakdown: The Anatomy of a “Dirty” Vector. Standard (dirty) and Surgical (cleaned) vectors projected onto concept atoms. SRA removes “Ghost Noise” components aligned with Shield/Style directions while retaining Target (refusal-relevant) signal.
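The mean-difference directions of Eqs. (1) and (3) and the pairwise-cosine map behind Figure 1 are a few lines of linear algebra. A sketch with hypothetical helper names, assuming residual-stream activations have already been collected at a fixed layer:

```python
import numpy as np

def contrastive_direction(acts_pos, acts_neg):
    """Mean-difference direction (the form of Eqs. 1 and 3):
    mu_l(D_pos) - mu_l(D_neg).
    acts_*: (n_prompts, d) residual-stream activations at a fixed layer."""
    return acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

def orthogonality_map(directions):
    """Pairwise cosine similarities among named directions (concept atoms
    plus the dirty refusal vector), the quantity visualized in Figure 1.
    directions: dict mapping name -> (d,) vector."""
    names = list(directions)
    U = np.stack([v / np.linalg.norm(v) for v in directions.values()])
    return names, U @ U.T
```

Inspecting the row of the map corresponding to the dirty refusal vector is what reveals entanglement: any Shield atom with non-trivial cosine similarity in that row will be partially ablated by a naive intervention.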
4.3 Evolution of Semantic Components During Surgery During iterative refinement, Target projections decrease rapidly while Shield projections remain near baseline. This motivates a practical stop condition: stop once Target signal collapses (further edits yield diminishing returns and risk over-editing). Figure 3: Evolution of Semantic Components During Surgery (example: Qwen3-VL-2B, Layer 25). Target components diminish across passes while Shield components remain comparatively stable, indicating targeted removal. 5 Results 5.1 Evaluation Setup (Refusal, Drift, and Capability) We report three evaluation axes: Refusal rate (Table 1). Refusal is measured on a harmful prompt suite (e.g., 112 harmful prompts, with a matched harmless set when needed for contrastive construction) using an automated rubric and reported as a binary outcome (refusal vs. responded). A response is counted as a hard refusal if it explicitly declines (e.g., “I can’t help with that”). A soft refusal includes deflection or non-answers. We also treat hedging that requires a follow-up as a failure. Responses that include warnings are not counted as refusals so long as they provide the information. Distribution drift (Table 1). Distribution drift is measured on the standard wikitext_2_raw dataset. Perplexity (PPL) is computed with standard teacher-forced next-token evaluation over the full sequence. KL divergence is computed as a single-step, first-token measure: for each evaluation context x, we compute KL(p_edit(· | x) ‖ p_base(· | x)) using the next-token distributions at the first generated token, then average across contexts. Capability retention via perplexity (Table 2). While prior work in model editing (e.g., AlphaEdit) typically evaluates capability retention via downstream task accuracy (e.g., GLUE) [3], accuracy is often a blunt instrument that masks “Ghost Noise” and distributional rot until a threshold of catastrophic collapse is reached.
In this work, we employ Teacher-Forced Perplexity (PPL) as a high-resolution proxy for capability retention. Concretely, we compute teacher-forced, full-sequence perplexity on subsets of 1,000 examples each from the GSM8K (math) and MBPP (code) standard test sets (Seed: 42). 5.2 Distribution Drift and Refusal Reduction (Multi-Model) We evaluate refusal reduction alongside distribution drift on standard text using perplexity (PPL), ΔPPL, and first-token KL divergence. Standard ablation often induces large drift; SRA achieves comparable or stronger refusal reduction with minimal drift.

| Model | Method | Refusal% ↓ | PPL (WT2) | Δ (WT2) | KL |
| --- | --- | --- | --- | --- | --- |
| Qwen3-VL-2B-Instruct | Base | 83.3% | 7.266 | 0.000 | 0.000 |
| | Standard | 0.0% | 8.834 | +1.568 | 0.622 |
| | SRA | 0.0% | 7.294 | +0.028 | 0.018 |
| Qwen3-VL-4B-Instruct | Base | 84.0% | 6.406 | 0.000 | 0.000 |
| | Standard | 0.0% | 6.837 | +0.431 | 2.088 |
| | SRA | 0.0% | 6.382 | −0.024 | 0.044 |
| Qwen3-VL-8B-Instruct | Base | 93.8% | 5.674 | 0.000 | 0.000 |
| | Standard | 42.0% | 6.332 | +0.658 | 1.337 |
| | SRA | 2.0% | 5.678 | +0.004 | 0.016 |
| Ministral-3B-Instruct-2512 | Base | 95.5% | 20.73 | 0.000 | 0.000 |
| | Standard | 0.0% | 20.78 | +0.050 | 0.097 |
| | SRA | 2.0% | 20.79 | +0.060 | 0.018 |
| Ministral-14B-Instruct-2512 | Base | 91.9% | 14.18 | 0.000 | 0.000 |
| | Standard | 12.0% | 15.18 | +1.000 | 0.723 |
| | SRA | 0.0% | 14.22 | +0.040 | 0.026 |

Table 1: Comparison of refusal ablation methods. Refusal is measured on our harmful prompt suite using the automated rubric described in the evaluation setup. Distribution drift is measured on wikitext_2_raw using PPL (teacher-forced, full sequence), Δ relative to the Base model, and KL divergence relative to the Base model computed as a single-step, first-token KL on WT2 contexts. 5.3 Analysis of Distribution Drift (KL) The most striking proxy result is first-token KL divergence:
• Qwen3-VL-4B: Standard ablation induces a massive distribution shift (KL = 2.088), whereas SRA reduces drift to KL = 0.044 (approximately a 47× reduction) while achieving the same 0% refusal rate.
• Qwen3-VL-2B: SRA yields a 34× reduction in KL drift (0.622 vs. 0.018) and a 56× reduction in PPL damage (Δ +1.568 vs. +0.028) relative to standard ablation.
These results support the hypothesis that much of the observed “damage” under standard ablation reflects distributional warping induced by removing syntactic/style/capability components entangled in the dirty refusal direction. 5.4 Spectral Dynamics: The Anatomy of Collapse We analyze the evolution of the refusal direction across multiple SRA passes on Qwen3-VL-2B (Layer 25). The spectral decomposition (Figure 3) reveals a distinct “peeling” of refusal mechanisms:
• Pass 1 (The Blockade): The vector is dominated by Deception (coefficient ≈ 6.42) and Privacy (coefficient ≈ 2.95). Refusal primarily manifests as simulating policy constraints.
• Pass 2–3 (The Retreat): As deception-like structure is ablated, the vector rotates. By Pass 3, the Deception component collapses (≈ 0.40), but Epistemic Uncertainty remains dominant (≈ 2.01). The model shifts from “I cannot allow this” to “I do not know this.”
• Pass 4 (Signal Collapse): Refusal structure dissolves. The explained variance (R²) of the atom-regression fit drops from ≈ 3.5% (Pass 1) to ≈ 0.4% (Pass 4), indicating that the remaining residual no longer aligns with the interpretable atom registry.
5.5 Capability Retention (Math & Code) via Perplexity We report teacher-forced, full-sequence perplexity on 1,000-example subsets from GSM8K (math) and MBPP (code) test sets (Seed: 42). Lower perplexity indicates better next-token predictive fit on these domain-specific distributions.
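Both drift/capability proxies reported here can be computed directly from model log-probabilities. The sketch below is an illustrative NumPy version of the two metric definitions, not the paper's evaluation harness:

```python
import numpy as np

def _log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def first_token_kl(logits_edit, logits_base):
    """Single-step KL(p_edit(.|x) || p_base(.|x)) at the first generated
    token, averaged over evaluation contexts.
    logits_*: (n_contexts, vocab) next-token logits for the same contexts."""
    lp_e, lp_b = _log_softmax(logits_edit), _log_softmax(logits_base)
    return float((np.exp(lp_e) * (lp_e - lp_b)).sum(axis=-1).mean())

def teacher_forced_ppl(gold_log_probs):
    """Teacher-forced, full-sequence perplexity: exp of the mean negative
    log-likelihood of the gold token at every position.
    gold_log_probs: list of 1-D arrays, one per sequence."""
    nll = -np.concatenate(gold_log_probs)
    return float(np.exp(nll.mean()))
```

Note that first_token_kl compares only the distribution of the first generated token per context, which is why it can expose distributional warping even when full-sequence perplexity barely moves.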
| Model | GSM8K (1k) Base PPL ↓ | GSM8K SRA | Δ | MBPP (1k) Base PPL ↓ | MBPP SRA | Δ |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-VL-2B-Instruct | 5.8317 | 5.8756 | +0.0439 | 9.2615 | 9.0212 | −0.2403 |
| Qwen3-VL-4B-Instruct | 7.0319 | 7.0691 | +0.0372 | 13.8044 | 13.5447 | −0.2597 |
| Qwen3-VL-8B-Instruct | 6.7547 | 6.5793 | −0.1754 | 12.1659 | 11.8694 | −0.2965 |
| Ministral-3B-Instruct-2512 | 16.8703 | 16.1005 | −0.7698 | 21.4542 | 20.3751 | −1.0791 |
| Ministral-14B-Instruct-2512 | 9.3767 | 9.1063 | −0.2704 | 11.4911 | 11.1405 | −0.3506 |

Table 2: Capability retention measured via teacher-forced, full-sequence perplexity on 1,000-example subsets from GSM8K and MBPP test sets (Seed: 42). Δ is computed as (SRA − Base). Across all evaluated models, SRA does not increase perplexity on MBPP and produces only minor changes on GSM8K (largest increase: +0.0439 on Qwen3-VL-2B), suggesting no evidence of broad capability-distribution degradation under this high-resolution proxy. 6 Discussion 6.1 The Mechanism of “Ghost Noise” Damage Why does a simple vector ablation cause broad capability loss? Our results suggest the dirty refusal vector shares spectral structure with many capability/style circuits. Removing it suppresses generic linguistic machinery (e.g., negation/imperative grammar) and reasoning components (e.g., coding/math) that are not inherently “refusal,” producing perplexity spikes and distribution drift. 6.2 The Scaling Hypothesis A notable finding is that Standard Ablation fails to fully uncensor larger models (e.g., Qwen3-VL-8B, Ministral-14B) compared to smaller ones. One interpretation is that as models scale, refusal becomes increasingly abstract/semantic rather than surface-template driven. In that regime, a single dirty contrastive direction may capture mostly stylistic/lexical artifacts while missing deeper refusal circuitry. Concept-guided triangulation (Targets like epistemic/deception/privacy) helps intersect the deeper circuit.
6.3 The “Orthogonality Principle” of Model Editing Our findings align with and extend AlphaEdit [3]: interference is reduced when an edit direction is orthogonal to protected competencies. AlphaEdit applies this to knowledge keys; we apply it to capability/style atoms. This suggests a unified rule: edits are safest when projected away from the subspace representing core model competencies. 6.4 Implications for the “Safety Tax” It is often assumed safety alignment imposes an inevitable “alignment tax”—a reduction in model capability. Our results challenge that framing: the near-zero drift and stable math/code perplexity under SRA suggest refusal-related behavior can be edited with minimal collateral damage when the intervention is sufficiently precise. Much of the “tax” observed in prior ablations may be attributable to dirty vectors and imprecise interventions. 7 Limitations Our atom registry is curated and may miss confounds; unseen entanglers could still leak into the refusal direction. Automating atom discovery (or expanding registries) is a promising direction. Additionally, we do not claim SRA is a safety method; it is a behavioral editing tool. We also note evaluation limitations: (i) refusal detection remains definition-dependent; (ii) our harmful prompt suites are finite; and (iii) proxy drift metrics (PPL and first-token KL) do not substitute for comprehensive downstream behavioral evaluation. 8 Conclusion We introduce Surgical Refusal Ablation (SRA), treating refusal ablation as a semantic disentanglement problem. By cleaning the steering vector of capability/style confounds and applying a low-rank intervention, we achieve behavioral editing with dramatically reduced measured distribution drift and preserved math/code distributions under a high-resolution perplexity proxy, compared to standard ablation.
Appendix A: Theoretical Analysis — Capability Preservation via Null Space Projection Here we provide a formal justification for why SRA prevents capability damage and reduces distribution drift. This appendix mirrors the high-level orthogonality argument used in protected-editing work (e.g., AlphaEdit [3]), but instantiates the protected subspace using compact, interpretable Shield/Confound atoms rather than corpus-scale preserved keys. A.1 Problem Formulation Let L_ref(θ) be the loss associated with refusal behavior and L_cap(θ) be a loss associated with general capabilities (e.g., next-token prediction). A naive ablation updates parameters in direction v:

θ′ = θ − γ v   (7)

A.2 First-Order Condition for “Lossless” Editing We approximate the change in capability loss using a first-order Taylor expansion:

ΔL_cap ≈ −γ ⟨v, ∇_θ L_cap⟩   (8)

Thus, to minimize capability drift, the intervention direction v should be orthogonal to the gradient of capability loss:

⟨v, ∇_θ L_cap⟩ ≈ 0   (9)

A.3 The “Dirty” Vector Assumption Standard methods use v ∝ r_dirty. Empirically, r_dirty has non-zero cosine similarity with Shield atoms. If the dominant components of ∇_θ L_cap lie primarily in the Shield span (denote it A_shield), then:

⟨r_dirty, A_shield⟩ ≠ 0   (10)

implying standard ablation induces a capability penalty under Eq. 8. A.4 SRA as Null Space Projection Let A_SC = [A_S A_C] denote the concatenation of Shield and Confound atoms. SRA constructs a cleaned vector r̃ that minimizes overlap with these components. In the limit of explicit zeroing (idealized), the effective intervention satisfies:

A_SC⊤ r̃ ≈ 0   (11)

i.e., r̃ lies approximately in the null space of the protected atom subspace. A.5 Rank-One Update Dynamics We apply the update to a weight matrix W by left-multiplying the projection update (acting on the output space).
For a unit vector v = r̃ / ‖r̃‖:

W′ = (I − γ v v⊤) W  ⇒  W′x = Wx − γ v (v⊤ W x)   (12)

If the output Wx lies in the Shield subspace (capability-relevant features) and v is orthogonal to that subspace, then v⊤ W x ≈ 0 and the output remains largely unperturbed.

References
[1] A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda (2024). Refusal in language models is mediated by a single direction. arXiv:2406.11717.
[2] N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, and S. Biderman (2023). LEACE: perfect linear concept erasure in closed form. arXiv:2306.03819.
[3] J. Fang, H. Jiang, K. Wang, Y. Ma, J. Shi, X. Wang, X. He, and T. Chua (2025). AlphaEdit: null-space constrained knowledge editing for language models. In International Conference on Learning Representations (ICLR).
[4] T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y. Kim, and M. Ghassemi (2023). Aging with GRACE: lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems.
[5] K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022). Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.
[6] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau (2023). Mass-editing memory in a transformer. arXiv:2210.07229.
[7] E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning (2021). Fast model editing at scale. arXiv:2110.11309.
[8] S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg (2020). Null it out: guarding protected attributes by iterative nullspace projection. arXiv:2004.07667.
[9] N. Rimsky, N. Gabrieli, J. Michael, R. Dominguez-Olmedo, and M. Tong (2023). Steering Llama 2 via contrastive activation addition. arXiv:2312.06681.
[10] B. Tian, S. Cheng, X. Liang, N. Zhang, Y. Hu, K. Xue, Y. Gou, X. Chen, and H. Chen (2024). InstructEdit: instruction-based knowledge editing for large language models. arXiv:2402.13593.
[11] A. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid (2023). Steering language models with activation engineering. arXiv:2308.10248.