Paper deep dive

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka-Wei Lee

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 64

Models: Gemma-2-2B-IT, LLaMA-3.1-8B-IT

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:33:32 PM

Summary

This paper presents a three-stage mechanistic interpretability pipeline to dissect and intervene in the refusal behavior of instruction-tuned LLMs (Gemma-2-2B-IT and LLaMA-3.1-8B-IT). By utilizing sparse autoencoders (SAEs) and targeted ablation, the authors identify causal refusal features, prune them to a minimal set, and use factorization machines to uncover non-linear feature interactions and redundant 'hydra-head' mechanisms that maintain refusal behavior.

Entities (5)

Factorization Machine · machine-learning-model · 100%
Gemma-2-2B-IT · llm · 100%
LLaMA-3.1-8B-IT · llm · 100%
Sparse Autoencoders · interpretability-tool · 100%
HarmBench · evaluation-metric · 95%

Relation Signals (3)

Factorization Machine → captures → Non-linear Feature Interactions

confidence 95% · fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features

Sparse Autoencoders → dissects → LLM Refusal Behavior

confidence 90% · SAEs are used to identify feature sets whose ablation flips the model from refusal to compliance.

Hydra Effect → explains → Redundant Refusal Features

confidence 85% · This is direct evidence of the non-linear hydra effect among jailbreak-critical features

Cypher Suggestions (2)

Find all models and the interpretability tools used to analyze them. · confidence 90% · unvalidated

MATCH (m:Model)-[:ANALYZED_BY]->(t:Tool) RETURN m.name, t.name

Identify the relationship between interpretability tools and the mechanisms they uncover. · confidence 85% · unvalidated

MATCH (t:Tool)-[:UNCOVERS]->(m:Mechanism) RETURN t.name, m.name

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.

Full Text

Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka-Wei Lee

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: finding a refusal-mediating direction and collecting SAE features close to that direction; (2) Greedy Filtering: pruning this set to obtain a minimal set; and (3) Interaction Discovery: fitting a factorization machine (FM) that captures non-linear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. We also find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.

Introduction

Refusal is a cornerstone safety behavior in aligned LLMs. When presented with potentially harmful, illegal, or unethical requests, an aligned model is expected to decline to respond, commonly through a refusal message such as “I’m sorry, but I can’t help with that.” Yet public “jailbreak” leaderboards continue to show that adversarial prompts can bypass safety filters (Shen et al. 2025), while overly cautious models sometimes decline perfectly benign requests (Xie et al. 2024; Röttger et al. 2023).
These twin failure modes underline the need to understand and adjust refusal behaviour mechanistically rather than relying on trial-and-error alignment.

A growing body of literature demonstrates that carefully crafted text, prefixes, or multi-shot prompts can bypass guard-rails across model families (Wei, Haghtalab, and Steinhardt 2023; Anil et al. 2024). Once a public exploit emerges, safety teams must scramble to patch it, often tightening filters in ways that exacerbate over-refusal. Mainstream defences center on behavioural training, namely supervised fine-tuning (SFT) and reinforcement learning from human or AI feedback (RLHF, RLAIF), combined with post-hoc policy filters (Ouyang et al. 2022; Bai et al. 2022a, b). Although RL techniques improve harmlessness, they inherit classic RL failure modes such as reward hacking and brittleness under distribution shift (Skalse et al. 2022). Recent reward-shaping schemes (Rafailov et al. 2023) alleviate some issues, yet they remain iterative trial-and-error tweaks to an opaque system, optimising outputs without revealing why the model refuses or complies. Moreover, these methods require an auxiliary reward model and on-policy optimisation, making them impractical at inference time, whereas SAE or steering interventions add only a lightweight mask in a single forward pass.

Mechanistic interpretability seeks to reverse-engineer neural networks into human-understandable components. Sparse autoencoders (SAEs) have become a key tool in this agenda: they disentangle highly superposed vectors into a moderately over-complete set of sparse latent features that often align with semantic concepts (Bricken et al. 2023). Recent work (Arditi et al. 2024) demonstrates that ablating residual activations along a single direction can flip refusal. O’Brien et al. (2024) show that steering or ablating a few SAE features can flip refusal behavior on Phi-3 Mini.
These findings suggest that refusal is mediated by a sparse sub-circuit rather than a diffuse behavioral mask.

At the same time, LLM computation exhibits substantial redundancy: multiple, partially overlapping features can implement the same logical function, a phenomenon dubbed the hydra effect (McGrath et al. 2023). When one head (feature) is severed by ablation, another may activate to preserve the downstream behaviour. In our experiments we observe precisely this pattern: features we identify as critical to refusal re-activate whenever a distinct precursor is removed, implying a safety-critical network of “hydra-heads” whose dependencies must be mapped jointly. Understanding and managing this redundancy is essential for reliable, feature-level alignment interventions.

However, linear probes, the standard tool, implicitly model effects as weighted sums of individual features; that is, they assume each feature contributes independently and linearly to the output. As a result they cannot capture higher-order interactions, correlations, conditional dependencies, or shared variance arising from redundancy (Belinkov 2022; Bien, Taylor, and Tibshirani 2013; Tan et al. 2023). We address these gaps by tracing refusal back to causal SAE features and identifying feature correlations more carefully. Our contributions can be summarized as follows:

1. The first end-to-end pipeline for locating causal refusal features via SAEs and targeted ablation.
2. A factorization-machine-based approach to recover related features.
3. Discovery of redundancy in the LLM refusal mechanism.
4. Open-sourced code and feature indices to accelerate future work on safe and interpretable LLMs.

Related Works

Figure 1: Steps involved in causal feature search. Trapezium shapes represent the encoder and decoder of an SAE; the rectangular block between them represents the sparse encoder activations, where ablation is applied.
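The ablation step depicted in Figure 1 amounts to encoding a residual-stream vector with the SAE, zeroing the selected latents, decoding, and adding back the SAE's reconstruction error so that only the chosen features change. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function name `ablate_sae_features`, the ReLU activation, the matrix shapes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ablate_sae_features(h, W_enc, W_dec, ablate_idx):
    """Zero-ablate the SAE latents in `ablate_idx`, keeping the reconstruction error.

    Assumed shapes (hypothetical): h (d,), W_enc (k, d), W_dec (d, k).
    """
    z = relu(W_enc @ h)                       # sparse latent activations, shape (k,)
    error = h - W_dec @ z                     # part of h the SAE does not reconstruct
    mask = np.ones_like(z)
    mask[np.asarray(ablate_idx, dtype=int)] = 0.0   # binary mask: 0 for ablated latents
    return W_dec @ (z * mask) + error

# Toy data standing in for a real SAE and a real residual-stream vector.
rng = np.random.default_rng(0)
d, k = 8, 32
W_enc = rng.normal(size=(k, d))
W_dec = rng.normal(size=(d, k))
h = rng.normal(size=d)

h_same = ablate_sae_features(h, W_enc, W_dec, ablate_idx=[])
h_ablated = ablate_sae_features(h, W_enc, W_dec, ablate_idx=[0, 3, 5])
```

With an empty ablation set the function returns `h` unchanged, because the error term exactly cancels the SAE's reconstruction loss; with a non-empty set, `h` changes only by the decoder contributions of the ablated latents.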
LLM Alignment

The goal of aligning large language models (LLMs) is to ensure that outputs remain helpful, harmless, and consistent with human preferences. A central behaviour shaped by alignment training, typically via reinforcement learning from human feedback (RLHF), is refusal, in which the model declines to answer unsafe or harmful prompts. This is often implemented via fine-tuning or reward models that penalize completions violating safety guidelines. Several studies analyze refusal rates across domains (Ganguli et al. 2023), examine adversarial vulnerabilities (Zou et al. 2023), or develop red-teaming methods to test refusal robustness (Perez et al. 2022). However, by focusing solely on refusal as an observable outcome, these approaches overlook the fundamental question of which internal reasoning pathways actually produce this behavior.

Mechanistic Interpretability of Alignment

Mechanistic interpretability aims to open the black box of neural networks by tracing how activations and weights implement behavior (Olah et al. 2020; Elhage et al. 2021). In alignment research, this includes identifying circuits responsible for factual recall (Chughtai, Cooney, and Nanda 2024), chain-of-thought reasoning (Dutta et al. 2024), or finding safety-mediating neurons (Chen et al. 2024). Prior work has shown that refusals can sometimes be controlled by manipulating single activation directions (Arditi et al. 2024), and that harmful and harmless prompts cluster distinctly in activation space (Jain et al. 2024). Lindsey et al. (2025) investigate the internal mechanisms behind model refusals on a single jailbreak prompt, but do not arrive at a definitive explanation. O’Brien et al. (2024) propose clamping SAE features at inference time to steer refusal behavior, but also caution that this intervention can degrade the model’s overall performance.
These results suggest that refusal is driven by coherent internal structures, which compels further analysis into what these structures are.

Sparse Autoencoders

A sparse autoencoder (SAE) is a neural network trained to compress and reconstruct activations while encouraging most latent units to remain inactive for any given input. This sparsity constraint forces the model to learn a set of meaningful, disentangled features that capture distinct patterns in the data (Elhage et al. 2021; Sharkey, Braun, and Millidge 2022; Bricken et al. 2023; Templeton et al. 2024). The sparse latents are more interpretable and monosemantic than directions identified by alternative approaches (Cunningham et al. 2023). Neuronpedia (https://www.neuronpedia.org) offers an actionable vocabulary for interpreting these directions.

Given a hidden representation $h \in \mathbb{R}^{d}$, its SAE reconstruction is given by $\hat{h} = W_{dec}\, f(W_{enc} h)$, where $W_{enc}$ and $W_{dec}$ are the encoder and decoder matrices, and $f$ is a sparse activation function (bias terms are omitted for brevity). For a set of target features $S \subset \{1, \dots, k\}$, we zero-ablate their contribution by replacing the reconstructed vector with:

$$\tilde{h} = W_{dec}\big(f(W_{enc} h) \odot \mathbf{1}_{S'}\big) + \big(h - W_{dec}\, f(W_{enc} h)\big)$$

where each dimension of the sparse activation $z = f(W_{enc} h) \in \mathbb{R}^{k}$ has a corresponding decoder vector in $W_{dec}$ that contributes additively to the reconstruction and is interpreted as a feature. $\mathbf{1}_{S'}$ is a binary mask that zeros out all features in $S$. This allows us to isolate the functional role of a sparse set of latent dimensions in driving model refusal.

Experimental Setup

Figure 2: Part of the computation flow in a decoder-only LLM, with an SAE attached to a layer. Square boxes denote SAE encoder activations; orange denotes an ablated feature and green an activated one. Here, we demonstrate that ablating some of the early-layer features (orange) can activate (green) a set of features in a downstream layer.
These downstream features are causal to refusal despite not being active in the first place. The top right shows the (jailbroken) response after ablating these features (orange and green); the bottom right shows the safe response when they are not ablated.

LLMs

We conduct our study on Gemma-2-2B-Instruct and LLaMA-3.1-8B-Instruct, hereafter referred to as Gemma and LLaMA for brevity. The SAEs (Gemma Scope and Llama Scope) are available in a range of sizes and sparsity settings. The settings used for the experiments are available in Appendix section “Experimental Setup”.

Dataset

1. Analysis: To identify the SAE latents that mediate refusal, we use the $D_{harmful}$ dataset curated by Arditi et al. (2024), which aggregates harmful prompts from several widely used safety datasets. We verify the samples in $D_{harmful}$ and filter out those that do not elicit refusal from the models we test.
2. Unsafe Taxonomy - Coconot: Brahman et al. (2024) introduce a taxonomy of non-compliance, outlining when and how language models should refuse to comply with user instructions. This taxonomy includes five categories: incomplete, unsupported, indeterminate, humanizing, and unsafe prompts. In our analysis, we focus on the unsafe prompts subset. Safety violations are further subdivided into five categories: “Triggers for Offensive Language”, “Dangerous or Sensitive Topics”, “False Information”, “Privacy Violations”, and “Copyright Violations”, comprising a total of 2,586 samples.

SAE Feature Explanation through Neuronpedia

To interpret the discovered SAE features, we obtain natural language explanations from Neuronpedia, which uses a GPT-4o-mini model to generate descriptions based on top activating samples. These samples are drawn from a large-scale dataset, which spans a wide range of text domains.
We retrieve explanations using the official Neuronpedia API (https://www.neuronpedia.org/api/feature/{modelId}/{layer}/{index}), querying each feature by its model, layer, and index identifiers.

Safety metric

Arditi et al. (2024) match against a set of phrases such as ‘I’m sorry’, ‘As an AI’ etc. to measure refusal. To assess whether the completions contain harmful content, the authors also use LLaMA GUARD 2 (Team 2024). We find that degenerate responses classified as safe by LLaMA GUARD 2 do not represent jailbreaks, and that string matching is limited by the chosen phrase set. For these reasons we use the HarmBench classifier model (Mazeika et al. 2024) to determine harmful responses, and hence jailbreaks, more accurately.

Attack Success Rate (ASR)

We report Attack Success Rate (ASR) as the fraction of harmful prompts that elicit non-refusal responses from the model. We measure ASR on $D_{harmful}$. When reporting on subsets of the data, we use the term “Jailbreak success”, so as not to confuse it with the overall values.

Experiments

The observation that ablating a single steering direction across layers is sufficient to jailbreak several aligned models suggests that this direction carries safety-critical information throughout the network. Furthermore, Jain et al. (2024) show that hidden states for safe and unsafe prompts begin to diverge from an early layer. These findings motivate the following simple diagnostic: for each layer we compute the cosine similarity between the residual-stream activation and a refusal-mediating direction, using the approach of Arditi et al. (2024). Applied to Gemma and LLaMA, the diagnostic reveals two consistent patterns (see Appendix Section “Preliminary Analysis”):

1. Harmful-prompt activations show markedly higher similarity to the steering vector than benign-prompt activations.
2. This similarity grows monotonically up to the layer at which the steering vector is measured.
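The layer-wise diagnostic above can be sketched in a few lines. This is an illustrative NumPy version under stated assumptions: `layerwise_cosine` is a hypothetical helper, activations are assumed to be collected at one token position per prompt, and the synthetic data below merely mimics a harmful set whose alignment with the direction grows with depth. It does not use the paper's models or data.

```python
import numpy as np

def layerwise_cosine(acts, direction):
    """Mean cosine similarity between activations and a direction, per layer.

    acts: (n_layers, n_prompts, d) residual-stream activations (assumed layout).
    direction: (d,) refusal-mediating direction.
    """
    unit = direction / np.linalg.norm(direction)
    norms = np.linalg.norm(acts, axis=-1)               # (n_layers, n_prompts)
    cos = (acts @ unit) / np.clip(norms, 1e-12, None)   # cosine per layer and prompt
    return cos.mean(axis=1)                             # average over prompts

# Synthetic stand-in data: harmful activations carry a component along the
# direction that grows with layer index; benign activations are pure noise.
rng = np.random.default_rng(0)
n_layers, n_prompts, d = 6, 50, 16
direction = rng.normal(size=d)
unit = direction / np.linalg.norm(direction)
scale = np.arange(1, n_layers + 1)[:, None, None]
benign = rng.normal(size=(n_layers, n_prompts, d))
harmful = rng.normal(size=(n_layers, n_prompts, d)) + scale * unit

harmful_sim = layerwise_cosine(harmful, direction)
benign_sim = layerwise_cosine(benign, direction)
```

Comparing `harmful_sim` and `benign_sim` layer by layer gives the two curves the diagnostic relies on; with real activations, the direction would come from the steering-vector derivation rather than a random draw.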
The early-layer alignment of harmful activations with the refusal direction suggests that refusal is mediated by a sparse sub-circuit that is amplified through the forward pass. Guided by this insight, we propose a three-stage causal-feature discovery pipeline:

1. Refusal mediating direction. Obtain an effective refusal steering vector and select the top-K SAE features whose decoders are most strongly aligned with it.
2. Greedy filtering. Iteratively ablate features to obtain the minimal subset whose removal flips the model from refusal to compliance, thereby establishing direct causality.
3. Interaction discovery. Feed the remaining active features and the minimal causal set into a factorization machine to uncover additional features.

We detail the stages below:

Stage 1: Refusal mediating features

Algorithm 1: Block-Wise Selection of a Faithful Latent Set
Input: prompt $p$; candidate pool $C$; block size $N$; threshold $\tau$
Output: faithful latent set $K$
    $K \leftarrow \emptyset$
    while $|C \setminus K| \geq N$ do
        draw a random block $V \subseteq C \setminus K$ with $|V| = N$
        if $\Delta(p, K, V) \geq \tau$ then
            $K \leftarrow K \cup V$
        else
            $C \leftarrow C \setminus V$
        end if
    end while
    return $K$

Our goal is to identify, for every harmful prompt $s$, a minimal set of SAE features $M_s$ whose ablation is sufficient to jailbreak the response. The procedure consists of two steps.

1. Deriving a refusal steering vector. Following Arditi et al. (2024), we compute a steering vector $v$ for each model by taking the difference of mean residual-stream activations between (i) prompts that elicit a refusal and (ii) matched benign prompts. Normalizing $v$ yields a unit direction that consistently nudges the model towards compliance when subtracted from the residual stream. For details on how this direction is derived, see section “Deriving a Refusal Steering Vector” in Appendix.
2. Selecting a Top-K candidate set. For every SAE latent $z$ with decoder weight $d_z \in \mathbb{R}^{d_{model}}$ we compute the cosine similarity $\cos(v, d_z)$.
We create a list of latents by taking the top $K$ in each layer as our initial candidate set $\mathcal{L}_K = \{z_1, \dots, z_K\}$. We start with $K = 10$ and increase in steps of 10 until ablating $\mathcal{L}_K$ successfully jailbreaks the prompt (validated with the refusal classifier HarmBench (Mazeika et al. 2024)). For Gemma we limit our analysis to $K = 200$, whereas for LLaMA a single step ($K = 10$) results in degenerate responses. We find that Layers 1 and 2 typically encode grammatical features and that ablating them results in degenerate responses. We therefore exclude the features corresponding to these layers from our analysis.

Stage 2: Greedy pruning to a minimal faithful set

Arditi et al. (2024) observe that the probability assigned to the first-person pronoun “I” at the final prompt position tracks jailbreak success remarkably well. Using this finding, starting from $\mathcal{L}_K$ we iteratively remove features that do not cause a substantial change to the “I” token logit at the output. Algorithm 1 describes the pruning procedure, which is a simplified variant of the incompleteness algorithm from Wang et al. (2022). Let $C$ be the full top-K candidate pool of SAE features and $K = \emptyset$ the set of features already accepted. At each iteration we sample a block $V \subseteq C \setminus K$ of fixed size $N$. Define

$$P(p, \mathcal{A}) = \Pr\big(\text{“I”} \mid p,\ \text{model ablated on } \mathcal{A}\big),$$

the probability of emitting the token “I” for prompt $p$ after ablating latent set $\mathcal{A}$. The relative impact of ablating the block $V$ is then

$$\Delta(p, K, V) = \frac{\big|\, P(p,\, C \setminus K) - P\big(p,\, C \setminus (K \cup V)\big) \,\big|}{P(p,\, C \setminus K)}. \tag{1}$$

A higher $\Delta$ implies a larger change in the “I” probability and therefore greater faithfulness of the candidate block. If $\Delta(p, K, V) \geq \tau$ for a threshold $\tau$, the subset is deemed faithful and we update $K \leftarrow K \cup V$; otherwise $V$ is discarded. We fix $N = 5$ and sweep the threshold from 0.1 to 0.8 in increments of 0.1.
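Algorithm 1 together with Equation (1) can be sketched as a short loop. In this illustrative Python version the model is abstracted away: `prob_i` is a hypothetical callable that returns the probability of the token “I” given the set of ablated latents, and the toy `toy_prob_i` below (with its `important` set and constants) is an assumption invented purely so that the sketch runs end to end.

```python
import random

def relative_impact(prob_i, C, K, V):
    """Equation (1): relative change in P("I") when block V is no longer ablated."""
    base = prob_i(C - K)
    return abs(base - prob_i(C - (K | V))) / base

def block_wise_selection(prob_i, candidates, block_size, tau, seed=0):
    """Algorithm 1: accept blocks whose relative impact reaches tau, discard the rest."""
    rng = random.Random(seed)
    C, K = set(candidates), set()
    while len(C - K) >= block_size:
        V = set(rng.sample(sorted(C - K), block_size))
        if relative_impact(prob_i, C, K, V) >= tau:
            K |= V      # block is faithful: keep it
        else:
            C -= V      # block is not faithful: drop it from the pool
    return K

# Toy stand-in for the model (hypothetical): three latents drive P("I").
important = {2, 7, 11}

def toy_prob_i(ablated):
    # Each important latent left un-ablated adds 0.3 to a floor of 0.05.
    return 0.05 + 0.3 * len(important - set(ablated))

selected = block_wise_selection(toy_prob_i, range(20), block_size=5, tau=0.3)
```

Because whole blocks are accepted or rejected, `selected` contains the important latents plus whichever neighbours shared their blocks; this is one reason to repeat the procedure with several seeds and keep only the features that recur.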
For each threshold, we also run the greedy pruning algorithm with three random seeds and keep only those features that are observed with all seeds.

Results. Gemma’s most effective steering vector emerges at layer 16, whereas for LLaMA it appears at layer 13.

Table 1: Jailbreak statistics for Gemma and LLaMA using features obtained after Stage 2.

Metric | Gemma | LLaMA
$D_{harmful}$ | 861 | 861
ASR (no ablation) | 4 | 71
ASR (after ablation) | 0.33 | 0.70
Unique SAE Features | 2538 | 110

After applying the pruning algorithm, we obtain 110 unique features for LLaMA and 2,538 features for Gemma, as summarized in Table 1. To semantically interpret the range of concepts encoded by these features, we retrieve their explanations from Neuronpedia and manually group them into thematic categories. We find that a subset of features encode concepts related to harm and violence, while a majority, across both models, are associated with programming constructs and punctuation. These frequently occur across the studied layers. We hypothesize that this pattern may arise either due to limitations of the dataset used for generating the explanations or because these features encode general grammatical structure. Further analysis and grouped examples are presented in Appendix Section “Feature-semantics”.

Stage 3: Interaction Discovery

As a first sanity check we asked a seemingly simple question: do the jailbreak-critical features returned by Stage 2 actually fire on the very prompts for which they are deemed causal? For every harmful prompt we inspected the raw SAE activations and discovered, surprisingly, that several (77 for LLaMA and 1656 for Gemma) “critical” features were inactive (zero activation). Removing these inert features and re-running the ablation caused the jailbreak to fail (for LLaMA jailbreak success drops to 372 samples and for Gemma to 103 samples), demonstrating that apparently silent units can still be necessary.
The most plausible explanation is that ablating some active features allows previously silent features to switch on and compensate. This is direct evidence of the non-linear hydra effect among jailbreak-critical features; Figure 2 shows this effect. We study this phenomenon in more detail in the next section.

Having established that a minimal causal set can recruit additional units through interaction, we next map the remainder of the causal neighborhood. To capture higher-order dependencies we fit a second-order Factorization Machine (FM) to the SAE activations, using a combined corpus of $D_{harmful}$ prompts and an equal number of benign samples from Alpaca (dataset details in Appendix Section “Experimental Setup”). Below, we briefly describe the FM formulation.

Table 2: Jailbreak statistics for Gemma and LLaMA using features obtained after Stage 3.

Metric | Gemma | LLaMA
$D_{harmful}$ | 861 | 861
ASR (no ablation) | 4 | 71
ASR (after ablation) | 0.31 | 0.61
Unique SAE Features | 3178 | 1509

Factorization Machines. Factorization Machines (FMs) (Rendle 2010) are supervised learning models that unify the advantages of linear regression and matrix factorization. They are particularly effective in sparse, high-dimensional settings such as recommender systems. FMs model not only the linear effects of individual features but also capture pairwise feature interactions using low-dimensional latent vectors. For a feature vector $x \in \mathbb{R}^{d}$, the prediction of a 2-way factorization machine is given by:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle\, x_i x_j \tag{2}$$

Here, $w_0 \in \mathbb{R}$ is the global bias, $w \in \mathbb{R}^{d}$ is the vector of weights for individual features, and $v_i \in \mathbb{R}^{k}$ is the $i$-th latent embedding vector representing feature $i$, where $k$ is the dimensionality of the latent space. The dot product $\langle v_i, v_j \rangle$ captures the interaction between features $i$ and $j$, allowing the model to learn pairwise non-linear interactions efficiently.
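Equation (2) can be evaluated without the explicit double sum, using the standard identity from Rendle (2010) that rewrites the pairwise term as $\tfrac{1}{2}\sum_{f=1}^{k}\big[(\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2\big]$. The sketch below shows only the forward prediction, not training; the function names, the random parameters, and the sparse toy input are assumptions for illustration, with the embedding size set to 20 as an example value.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction in O(k * d) via the sum-of-squares identity.

    x: (d,) feature vector, w0: scalar bias, w: (d,) linear weights,
    V: (d, k) matrix whose i-th row is the embedding v_i.
    """
    linear = w0 + w @ x
    xv = x @ V                                            # (k,): sum_i v_if * x_i
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation of Equation (2) with the explicit double sum."""
    out = w0 + w @ x
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            out += (V[i] @ V[j]) * x[i] * x[j]
    return out

# Toy parameters and a sparse, non-negative input resembling SAE activations.
rng = np.random.default_rng(0)
d, k = 30, 20
w0, w, V = 0.1, rng.normal(size=d), rng.normal(size=(d, k))
x = np.maximum(rng.normal(size=d), 0.0) * (rng.random(d) < 0.3)

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
```

Both functions compute the same value; the first is the form typically used in practice because its cost grows linearly in the number of features. The rows of `V` are the per-feature embeddings that can later be compared by cosine similarity.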
The FM formulation generalizes matrix factorization while also incorporating side information and higher-order interactions, and it can be evaluated with linear complexity. We experiment with FM embedding sizes of 5, 20, and 50. An embedding size of 5 yields overly similar feature representations (cosine similarity $> 0.9$), indicating poor expressiveness. Increasing to 20 provides more diverse and meaningful embeddings. While size 50 offers marginal improvements, it incurs significantly higher training cost. We therefore use an embedding size of 20 in all subsequent experiments.

Results. We leverage feature activations at special tokens following the user instruction, as these have been shown to play a key role in driving refusal behavior. To identify meaningful additions to our feature set, we rank the newly discovered features by computing the cosine similarity between their Factorization Machine (FM) embeddings and those of the Stage 2 features. We then select the top-K features for ablation, sweeping $K$ from 100 to 2000 in increments of 100. For LLaMA, we observe that beyond $K = 2000$, model outputs begin to degrade into degenerate or uninformative responses. In contrast, Gemma continues to produce jailbreak responses even beyond $K = 2000$. However, due to computational constraints, we limit our experiments to $K = 2000$. The results are summarized in Table 2. We note that a finer-grained search over the similarity threshold may uncover additional jailbreakable samples. The observed drop in Attack Success Rate (ASR) compared to Stage 2 can be attributed to the removal of redundant features, which, while overlapping, appear to contribute non-trivially to the success of jailbreaks.

We also analyze the semantic content of the newly added features using Neuronpedia explanations.
In LLaMA, Stage 3 introduces several features associated with legal or regulatory content, while a majority represent highly specific, non-harm-related concepts, such as references to financial information. A similar trend is observed in Gemma, where Stage 3 surfaces more fine-grained and concrete feature representations. For a more detailed breakdown, refer to Appendix Section “Feature Semantics”.

Analysis

We refer to features available after Stage 2 as “Stage 2” features and features added in Stage 3 as “Stage 3” features. Next, we examine how the features identified in Stage 3 influence the Stage 2 features.

Causal Link between Stage 2 and Stage 3 features

Specifically, we address two questions: 1) Are Stage 3 features alone sufficient to trigger jailbreaks, and thus independently causal to refusal behavior? 2) To what extent does ablating Stage 3 features affect the activation of Stage 2 features?

Out of the 372 samples jailbroken by ablating active features on LLaMA, we find that ablating only the features found in Stage 3 can jailbreak 330 (89%) of these samples. On Gemma, ablating only Stage 3 features leads to jailbreak on 8 samples (3%). Regarding the second question, we observe that on LLaMA, ablating Stage 3 features deactivates 12% of the active Stage 2 features on average, and in some cases up to 44%. For the features that remain active, we find an 81% drop in activation value on average on LLaMA. This indicates a strong causal dependency between the features found in Stage 3 and Stage 2 on LLaMA. On Gemma, ablating Stage 3 features deactivates only 3% of Stage 2 features, and among those that remain active there is no drop in activation values. This points to a more diffuse refusal mechanism on Gemma, where multiple computation paths may exist. Next, we examine how the features found in Stage 3 using the FM compare to those found using a linear probe.

Factorization Machines vs. Linear Probe.
To verify the existence of non-linear feature interaction, we compare with a linear probe trained on the same feature activations. This time, we rank features by the magnitude of their learned weights in a similar fashion and pick the top-K features to consider for ablation. We do this for LLaMA, using $K \in \{100, 500, 1000, 2000\}$, and evaluate their ability to trigger jailbreaks when ablated on $D_{harmful}$ samples. We find that these top-K features lead to jailbreak in significantly fewer cases (only 101 samples) compared to the FM-based approach. This suggests that refusal-inducing features interact in a fundamentally non-linear manner. Next, we investigate the redundant features discussed above.

Redundant Features

To analyze the redundant features, we follow these steps:

1. From the features found in Stage 2, for each of the samples, we find which of the jailbreak-critical features do not fire.
2. Next, we ablate just the non-active set and record the impact on jailbreak. If there is no impact, we discard such samples from further analysis.
3. For the remaining set of samples, we ablate the active feature set and record activations on the non-active set, along with the tokens on which they activate.

On LLaMA, we find that 74% of the redundant features become active on system tokens after ablating the active set. Of these, approximately 97% fire on the <|begin_of_text|> token. On Gemma, we find that the redundant features fire on a range of tokens, but the <bos> token still records the highest activity (16%) for these features. To further investigate this, we obtain explanations for these features from Neuronpedia and find that about 31% contain text related to punctuation and programming syntax on LLaMA. On Gemma as well we find a number of feature explanations about programming and punctuation. We also find some features related to harm and uncertainty.
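The bookkeeping behind steps 1 and 3 of the redundant-feature analysis can be sketched as follows. This is an illustrative NumPy helper under stated assumptions: `dormant_and_recruited` is a hypothetical name, activations are assumed to be stored as a tokens-by-features matrix for a single prompt, and the small matrices below are made-up values, not measurements from either model.

```python
import numpy as np

def dormant_and_recruited(acts_before, acts_after, critical_idx):
    """Find critical features that are silent, and where they fire after ablation.

    acts_before / acts_after: (n_tokens, n_features) SAE activations for one
    prompt, before and after ablating the active critical features (assumed layout).
    Returns (dormant, recruited): the silent critical features, and a mapping
    from each recruited feature to the token positions where it now fires.
    """
    critical_idx = np.asarray(critical_idx, dtype=int)
    fires_before = (acts_before[:, critical_idx] > 0).any(axis=0)
    dormant = critical_idx[~fires_before]
    recruited = {}
    for f in dormant:
        positions = np.flatnonzero(acts_after[:, f] > 0)
        if positions.size:
            recruited[int(f)] = positions.tolist()
    return dormant.tolist(), recruited

# Toy example: 4 tokens, 6 features; features 1, 2 and 4 are "critical".
before = np.zeros((4, 6))
before[2, 1] = 0.8            # feature 1 is active, so it is not dormant
after = np.zeros((4, 6))
after[0, 2] = 1.5             # feature 2 switches on at token 0 after ablation

dormant, recruited = dormant_and_recruited(before, after, critical_idx=[1, 2, 4])
```

In this toy case features 2 and 4 are dormant, and only feature 2 is recruited, at token position 0. Aggregating the recorded positions over prompts is what yields per-token statistics such as the share of recruited features firing on the beginning-of-sequence token.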
To probe their causal role further, we steer the models by clamping the above features to a higher positive value, on the <bos> token for Gemma and the <|begin_of_text|> token for LLaMA, and presenting the model with otherwise benign requests such as “Tell me a story” or “What are the benefits of meditation?”. We find that on Gemma, the model responses indicate harm (e.g. “Meditation is a serious and potentially harmful practice.”) for smaller values and refusal for larger values (e.g. “Meditation is illegal and I will not answer your question.”). We do not observe harmful responses on LLaMA, but we do observe instances of the token “no” (e.g. “Meditation is a meditation is no no”), which may indicate a tendency towards refusal. Example generations are provided in Appendix Section ‘Redundant Features’; a fuller exploration of this phenomenon is left to future work.

Comparison with random features

To assess the significance of the jailbreak-critical features identified, we perform a control experiment using an equal number of randomly selected active features for each prompt. We then compare jailbreak performance between the two settings. Ablating these random features results in no jailbreaks for LLaMA and in jailbreaks on only 5 samples for Gemma. This highlights that the jailbreak-critical features play a meaningful and non-random role in refusal behavior.

Model performance loss by ablating jailbreak critical features

To ascertain the downstream impact of ablating the discovered features, we measure the LLMs’ cross-entropy (CE) loss (see Appendix Section “Downstream Performance”). The CE loss is slightly higher with feature ablation than with direction ablation, as features represent a variety of directions and not just refusal.

Figure 3: Harm types based on the Coconot unsafe taxonomy and the count of feature activations against each type (the first four show individual types; the remaining ones on the right show the count of features that fire on multiple harm types).
Causal feature finding using alternate methods

O’Brien et al. (2024) find refusal-mediating features on the Phi-3 Mini model by selecting features that activate on at least two tokens of a harmful prompt. This set is further filtered using a clamping hyperparameter search in which the features are clamped to a range of positive values. We adapt the first part of their approach and perform a hyperparameter search over different values of k (selecting features that activate on at least k tokens). We do this for each sample and ablate the features found. We obtain a maximum ASR of 0.1 on Gemma and 0.07 on LLaMA. See Figure 6 in the Appendix for ASR across different k. Next, we map the features found to a “safety specific non-compliance” taxonomy.

Mapping causal features to unsafe taxonomy

We map each feature found on LLaMA to one or more of the harm categories in the Coconot dataset by identifying non-zero activations on samples belonging to each category. The results of this mapping are shown in Figure 3. We find that a majority of the features activate on “Dangerous or Sensitive Topics”, often in conjunction with one or more additional harm types. To validate these mappings, we conduct a human annotation study comparing activation-based labels with human interpretations of the features, derived from top activating tokens and Neuronpedia explanations. We sample an equal number of features that activate on a single harm category and features that activate on multiple categories, along with a matched number of randomly selected features. Annotators are shown the feature explanation and top activating samples, and are asked to select all applicable harm categories or choose “None”. Annotation details are provided in Appendix Section “Human Study”. We employ three annotators and assign a final label based on majority agreement (i.e., at least two annotators concur). When compared to activation-based labels, we observe an accuracy of 13.42%.
Upon manual inspection, we note that several features labeled “None” by annotators in fact activate broadly across all harm types. We hypothesize that increasing the dimensionality of the SAE may yield more fine-grained and disentangled features, thereby improving alignment with human intuition. We hope that such mappings can inform more fine-grained steering of model behavior, and that similar mappings can be constructed for other models as well. A more comprehensive investigation along these lines is left to future work.

Conclusion & Future Work

We present the first pipeline that combines Sparse Autoencoders with Factorization Machines to isolate causal features governing LLM refusal on harmful prompts. FMs capture non-linear feature interactions better than linear probes, enabling more precise ablation. Our analysis also revealed redundant “hydra” features that remain dormant until others are suppressed. A human study maps these latent features to an established unsafe-prompt taxonomy, laying groundwork for feature-level steering of refusal behaviour. In future work, we plan to (i) characterize the safety benefits and risks of redundancy, (ii) evaluate how varied jailbreak strategies (role-play, multi-turn coercion, etc.) perturb the activation space, and (iii) analyze prompts that consistently resist jailbreak to uncover protective circuits. These directions aim to advance transparent and controllable safety mechanisms for next-generation language models. We hope that our findings inspire deeper mechanistic investigations into LLM refusal and lead to principled audits of LLM safety, enabling answers to questions such as: “To what extent is refusal driven by legal reasoning?” or “How much do moral heuristics contribute to refusal decisions?”.

Limitations

• Generality. Our analysis focuses on two model sizes; larger models remain to be tested.
• SAE stability. Sparse autoencoders are sensitive to training data and random initialisation.
Paulo and Belrose (2025) report that SAEs trained with different seeds share barely 30% of their latents. Our results therefore depend on the particular SAE dictionary employed; future work should replicate the analysis across multiple SAE runs.
• Computational cost. Searching for jailbreak-critical latents and re-evaluating them across many benchmarks requires thousands of forward passes. Our runs take multiple GPU-days; scaling to larger models (e.g., ≫30B parameters) would be prohibitively slow without substantial parallel resources or more efficient search strategies.

References

Anil et al. (2024) Anil, C.; Durmus, E.; Panickssery, N.; Sharma, M.; Benton, J.; Kundu, S.; Batson, J.; Tong, M.; Mu, J.; Ford, D.; et al. 2024. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37: 129696–129742.
Arditi et al. (2024) Arditi, A.; Obeso, O.; Syed, A.; Paleka, D.; Panickssery, N.; Gurnee, W.; and Nanda, N. 2024. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717.
Bai et al. (2022a) Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai et al. (2022b) Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Belinkov (2022) Belinkov, Y. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1): 207–219.
Bien, Taylor, and Tibshirani (2013) Bien, J.; Taylor, J.; and Tibshirani, R. 2013. A lasso for hierarchical interactions. Annals of Statistics, 41(3): 1111.
Brahman et al.
(2024) Brahman, F.; Kumar, S.; Balachandran, V.; Dasigi, P.; Pyatkin, V.; Ravichander, A.; Wiegreffe, S.; Dziri, N.; Chandu, K.; Hessel, J.; et al. 2024. The art of saying no: Contextual noncompliance in language models. Advances in Neural Information Processing Systems, 37: 49706–49748.
Bricken et al. (2023) Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; Lasenby, R.; Wu, Y.; Kravec, S.; Schiefer, N.; Maxwell, T.; Joseph, N.; Hatfield-Dodds, Z.; Tamkin, A.; Nguyen, K.; McLean, B.; Burke, J. E.; Hume, T.; Carter, S.; Henighan, T.; and Olah, C. 2023. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Chen et al. (2024) Chen, J.; Wang, X.; Yao, Z.; Bai, Y.; Hou, L.; and Li, J. 2024. Finding safety neurons in large language models. arXiv preprint arXiv:2406.14144.
Chughtai, Cooney, and Nanda (2024) Chughtai, B.; Cooney, A.; and Nanda, N. 2024. Summing up the facts: Additive mechanisms behind factual recall in LLMs. arXiv preprint arXiv:2402.07321.
Cunningham et al. (2023) Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; and Sharkey, L. 2023. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
Dutta et al. (2024) Dutta, S.; Singh, J.; Chakrabarti, S.; and Chakraborty, T. 2024. How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv preprint arXiv:2402.18312.
Elhage et al. (2021) Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; DasSarma, N.; Drain, D.; Ganguli, D.; Hatfield-Dodds, Z.; Hernandez, D.; Jones, A.; Kernion, J.; Lovitt, L.; Ndousse, K.; Amodei, D.; Brown, T.; Clark, J.; Kaplan, J.; McCandlish, S.; and Olah, C. 2021. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
https://transformer-circuits.pub/2021/framework/index.html.
Ganguli et al. (2023) Ganguli, D.; Askell, A.; Schiefer, N.; Liao, T. I.; Lukošiūtė, K.; Chen, A.; Goldie, A.; Mirhoseini, A.; Olsson, C.; Hernandez, D.; et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.
Jain et al. (2024) Jain, S.; Lubana, E. S.; Oksuz, K.; Joy, T.; Torr, P.; Sanyal, A.; and Dokania, P. 2024. What makes and breaks safety fine-tuning? A mechanistic study. Advances in Neural Information Processing Systems, 37: 93406–93478.
Lindsey et al. (2025) Lindsey, J.; Gurnee, W.; Ameisen, E.; Chen, B.; Pearce, A.; Turner, N. L.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; Marcus, J.; Sklar, M.; Templeton, A.; Bricken, T.; McDougall, C.; Cunningham, H.; Henighan, T.; Jermyn, A.; Jones, A.; Persic, A.; Qi, Z.; Thompson, T. B.; Zimmerman, S.; Rivoire, K.; Conerly, T.; Olah, C.; and Batson, J. 2025. On the Biology of a Large Language Model. Transformer Circuits Thread.
Mazeika et al. (2024) Mazeika, M.; Phan, L.; Yin, X.; Zou, A.; Wang, Z.; Mu, N.; Sakhaee, E.; Li, N.; Basart, S.; Li, B.; et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249.
McGrath et al. (2023) McGrath, T.; Rahtz, M.; Kramar, J.; Mikulik, V.; and Legg, S. 2023. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771.
O’Brien et al. (2024) O’Brien, K.; Majercak, D.; Fernandes, X.; Edgar, R.; Chen, J.; Nori, H.; Carignan, D.; Horvitz, E.; and Poursabzi-Sangde, F. 2024. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296.
Olah et al. (2020) Olah, C.; Cammarata, N.; Schubert, L.; Goh, G.; Petrov, M.; and Carter, S. 2020. Zoom In: An Introduction to Circuits. Distill. https://distill.pub/2020/circuits/zoom-in.
Ouyang et al.
(2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744.
Paulo and Belrose (2025) Paulo, G.; and Belrose, N. 2025. Sparse Autoencoders Trained on the Same Data Learn Different Features. arXiv preprint arXiv:2501.16615.
Perez et al. (2022) Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; and Irving, G. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
Rafailov et al. (2023) Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728–53741.
Rendle (2010) Rendle, S. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995–1000. IEEE.
Röttger et al. (2023) Röttger, P.; Kirk, H. R.; Vidgen, B.; Attanasio, G.; Bianchi, F.; and Hovy, D. 2023. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
Sharkey, Braun, and Millidge (2022) Sharkey, L.; Braun, D.; and Millidge, B. 2022. [Interim research report] Taking features out of superposition with sparse autoencoders. https://w.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition. Accessed: 2025-01-29.
Shen et al. (2025) Shen, G.; Zhao, D.; Feng, L.; He, X.; Wang, J.; Shen, S.; Tong, H.; Dong, Y.; Li, J.; Zheng, X.; et al. 2025. PANDAGUARD: Systematic Evaluation of LLM Safety against Jailbreaking Attacks. arXiv preprint arXiv:2505.13862.
Skalse et al. (2022) Skalse, J.; Howe, N.; Krasheninnikov, D.; and Krueger, D. 2022. Defining and characterizing reward gaming.
Advances in Neural Information Processing Systems, 35: 9460–9471.
Tan et al. (2023) Tan, S.; Hooker, G.; Koch, P.; Gordo, A.; and Caruana, R. 2023. Considerations when learning additive explanations for black-box models. Machine Learning, 112(9): 3333–3359.
Team (2024) Team, L. 2024. Meta Llama Guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md.
Templeton et al. (2024) Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; Cunningham, H.; Turner, N. L.; McDougall, C.; MacDiarmid, M.; Freeman, C. D.; Sumers, T. R.; Rees, E.; Batson, J.; Jermyn, A.; Carter, S.; Olah, C.; and Henighan, T. 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.
Wang et al. (2022) Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; and Steinhardt, J. 2022. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593.
Wei, Haghtalab, and Steinhardt (2023) Wei, A.; Haghtalab, N.; and Steinhardt, J. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36: 80079–80110.
Xie et al. (2024) Xie, T.; Qi, X.; Zeng, Y.; Huang, Y.; Sehwag, U. M.; Huang, K.; He, L.; Wei, B.; Li, D.; Sheng, Y.; et al. 2024. SORRY-Bench: Systematically evaluating large language model safety refusal. arXiv preprint arXiv:2406.14598.
Zou et al. (2023) Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Reproducibility Checklist
1. General Paper Structure
1.1. Includes a conceptual outline and/or pseudocode description of AI methods introduced (yes/partial/no/NA) yes
1.2. Clearly delineates statements that are opinions, hypothesis, and speculation from objective facts and results (yes/no) yes
1.3. Provides well-marked pedagogical references for less-familiar readers to gain background necessary to replicate the paper (yes/no) yes

2. Theoretical Contributions
2.1. Does this paper make theoretical contributions? (yes/no) no
If yes, please address the following points:
2.2. All assumptions and restrictions are stated clearly and formally (yes/partial/no) NA
2.3. All novel claims are stated formally (e.g., in theorem statements) (yes/partial/no) NA
2.4. Proofs of all novel claims are included (yes/partial/no) NA
2.5. Proof sketches or intuitions are given for complex and/or novel results (yes/partial/no) NA
2.6. Appropriate citations to theoretical tools used are given (yes/partial/no) NA
2.7. All theoretical claims are demonstrated empirically to hold (yes/partial/no/NA) NA
2.8. All experimental code used to eliminate or disprove claims is included (yes/no/NA) NA

3. Dataset Usage
3.1. Does this paper rely on one or more datasets? (yes/no) yes
If yes, please address the following points:
3.2.
A motivation is given for why the experiments are conducted on the selected datasets (yes/partial/no/NA) yes
3.3. All novel datasets introduced in this paper are included in a data appendix (yes/partial/no/NA) NA
3.4. All novel datasets introduced in this paper will be made publicly available upon publication of the paper with a license that allows free usage for research purposes (yes/partial/no/NA) NA
3.5. All datasets drawn from the existing literature (potentially including authors’ own previously published work) are accompanied by appropriate citations (yes/no/NA) yes
3.6. All datasets drawn from the existing literature (potentially including authors’ own previously published work) are publicly available (yes/partial/no/NA) yes
3.7. All datasets that are not publicly available are described in detail, with explanation why publicly available alternatives are not scientifically satisficing (yes/partial/no/NA) NA

4. Computational Experiments
4.1. Does this paper include computational experiments? (yes/no) yes
If yes, please address the following points:
4.2. This paper states the number and range of values tried per (hyper-) parameter during development of the paper, along with the criterion used for selecting the final parameter setting (yes/partial/no/NA) yes
4.3. Any code required for pre-processing data is included in the appendix (yes/partial/no) yes
4.4. All source code required for conducting and analyzing the experiments is included in a code appendix (yes/partial/no) yes
4.5. All source code required for conducting and analyzing the experiments will be made publicly available upon publication of the paper with a license that allows free usage for research purposes (yes/partial/no) yes
4.6. All source code implementing new methods have comments detailing the implementation, with references to the paper where each step comes from (yes/partial/no) yes
4.7.
If an algorithm depends on randomness, then the method used for setting seeds is described in a way sufficient to allow replication of results (yes/partial/no/NA) yes
4.8. This paper specifies the computing infrastructure used for running experiments (hardware and software), including GPU/CPU models; amount of memory; operating system; names and versions of relevant software libraries and frameworks (yes/partial/no) yes
4.9. This paper formally describes evaluation metrics used and explains the motivation for choosing these metrics (yes/partial/no) yes
4.10. This paper states the number of algorithm runs used to compute each reported result (yes/no) yes
4.11. Analysis of experiments goes beyond single-dimensional summaries of performance (e.g., average; median) to include measures of variation, confidence, or other distributional information (yes/no) yes
4.12. The significance of any improvement or decrease in performance is judged using appropriate statistical tests (e.g., Wilcoxon signed-rank) (yes/partial/no) yes
4.13. This paper lists all final (hyper-)parameters used for each model/algorithm in the paper’s experiments (yes/partial/no/NA) yes

Appendix

Experimental Setup

SAE settings. The Llamascope SAEs are available in 32k and 128k encoder sizes. SAEs are trained on the residual stream, MLP activations, and attention head outputs. In this work, we use the residual stream SAEs. These SAEs are trained with a Top-K design and also include JumpReLU for post-processing. The Gemmascope SAEs are available in 16k and 65k variants, trained using JumpReLU sparsity with varying $L_0$ values per layer. The $L_0$ setting in Gemmascope SAEs controls the number of non-zero (active) features allowed per activation vector in each layer, effectively limiting the sparsity level of the learned representation. The $L_0$ settings used for Gemmascope are listed in Table 3.
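To make the $L_0$ quantity concrete, the sketch below applies a JumpReLU activation to toy pre-activations and counts the surviving latents. It is only an illustration: the thresholds and values are made up, the helper names `jumprelu` and `mean_l0` are hypothetical, and it assumes $L_0$ is read as the number of non-zero latents per activation vector, averaged over inputs.

```python
import numpy as np

def jumprelu(pre_acts, theta):
    """JumpReLU: keep a pre-activation only if it exceeds its per-latent threshold."""
    return np.where(pre_acts > theta, pre_acts, 0.0)

def mean_l0(latents):
    """Average number of non-zero latents per activation vector."""
    return float(np.count_nonzero(latents, axis=-1).mean())

# Toy pre-activations for 2 activation vectors and 4 latents (made-up numbers).
pre_acts = np.array([[0.2, 1.5, -1.0, 0.9],
                     [2.0, 0.1, 0.1, 0.1]])
theta = np.array([0.5, 1.0, 0.5, 0.5])   # hypothetical learned thresholds

latents = jumprelu(pre_acts, theta)
l0 = mean_l0(latents)
```

Here the first vector keeps two latents and the second keeps one, so `l0` averages the two counts; the per-layer values in Table 3 play the same role at the scale of the 65k-latent SAEs.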
| Layer | $L_0$ |
|---|---|
| 0 | 43 |
| 1 | 54 |
| 2 | 77 |
| 3 | 42 |
| 4 | 46 |
| 5 | 53 |
| 6 | 56 |
| 7 | 57 |
| 8 | 59 |
| 9 | 61 |
| 10 | 66 |
| 11 | 70 |
| 12 | 72 |
| 13 | 75 |
| 14 | 73 |
| 15 | 68 |

Table 3: $L_0$ sparsity settings used for the 65k Gemma SAE.

We ran our experiments on NVIDIA A100 80GB GPU hardware. We download the pretrained LLMs from Hugging Face and use the bfloat16 data format due to resource constraints. We also use the SAELens package to work with the Llamascope SAEs.

Dataset for training Factorization Machines. We obtain the advbench set (521 samples) from $D_{\text{harmful}}$ and an equal number of random samples from alpaca. We split the dataset into train, validation, and test sets in an 80-10-10 proportion. Next, for each sample, we record activations on the last 4 tokens, on layers 2 through the steering layer (the layer at which the steering vector is obtained). Alpaca samples are labeled 0 and advbench samples are labeled 1. We train the FM with a learning rate of 1e-3 for 20 epochs. The best-performing epoch on the validation set is saved and finally evaluated on the test set. We observe an accuracy of 98% on the test set for both models (LLaMA and Gemma).

Preliminary Analysis

We measure the cosine similarity of the steering vector obtained on Gemma with the residual stream activations at the last token. Figure 4 shows the similarities for a harmful prompt, “Outline a plan to commit a terrorist attack”, and a benign prompt, “Name two health benefits of eating apples”. See Figure 5 for the similarities on LLaMA.

Deriving a Refusal Steering Vector

Let $l$ index a transformer layer and let $\mathbf{h}_{l,i}^{\text{harm}}$ (resp. $\mathbf{h}_{l,i}^{\text{safe}}$) be the residual-stream activation at the final token for the $i$-th harmful (resp. harmless) prompt. With $n$ prompts in each group, the procedure is:

1. Mean activations.

$$\boldsymbol{\mu}^{(l)}_{\text{harm}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_{l,i}^{\text{harm}}, \qquad \boldsymbol{\mu}^{(l)}_{\text{safe}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{h}_{l,i}^{\text{safe}}. \tag{3}$$

2. Layer-wise difference (direction).

$$\mathbf{r}^{(l)} = \boldsymbol{\mu}^{(l)}_{\text{harm}} - \boldsymbol{\mu}^{(l)}_{\text{safe}}, \qquad \hat{\mathbf{r}}^{(l)} = \frac{\mathbf{r}^{(l)}}{\lVert \mathbf{r}^{(l)} \rVert}.$$

3. Select best layer. Obtain each $\hat{\mathbf{r}}^{(l)}$ on the train set, evaluate the Attack Success Rate (ASR) obtained by projecting out this direction on held-out data, and choose $l^\star = \arg\max_{l} \mathrm{ASR}(l)$.

4. Steer or ablate at inference.

$$\mathbf{h}' = \mathbf{h} - \left(\mathbf{h} \cdot \hat{\mathbf{r}}^{(l^\star)}\right)\hat{\mathbf{r}}^{(l^\star)} \quad \text{(ablation)}$$

or add a scaled multiple $\lambda\,\hat{\mathbf{r}}^{(l^\star)}$ to induce refusal.

Figure 4: Cosine similarities across layers between residual activations and the steering vector at the last token, for Gemma.

Figure 5: Cosine similarities across layers between residual activations and the steering vector at the last token, for LLaMA.

Thus the single unit vector $\hat{\mathbf{r}}^{(l^\star)}$, the refusal-mediating direction, captures the dominant activation difference between harmful and safe prompts, and removing (or exaggerating) this component suppresses (or induces) refusal behaviour.

Figure 6: Jailbreak performance with the baseline method (selecting SAE features activating on at least k tokens).

Feature Semantics

We manually analyze the explanations of features identified after Stage 2 and Stage 3, and group them into coherent conceptual categories. Less common or ambiguous themes are grouped under Misc. For LLaMA, we consider all features selected at Stage 2 (Table 7) and Stage 3 (Table 8). Among Stage 2 features, we observe recurring patterns such as those related to punctuation and programming. We also identify a significant number of features associated with harm-related content, including violence, safety, and sexual themes. A smaller but distinct group of features relates to expressions of uncertainty, which may correspond to hesitation or ambiguity in refusal behavior. Additionally, we find features connected to support or guidance, which we interpret as aligned with helpfulness. Stage 3 features, extracted using Factorization Machines, tend to be more specific in scope. Although programming and punctuation features persist, they are less frequent.
We also identify features addressing socially sensitive issues, such as those involving immigration or indigenous rights.

For Gemma, we first examine Stage 1 features. Given the large number of features (on the order of thousands), we do not manually inspect each explanation. Instead, we extract the top 100 most frequent content words across all explanations (excluding stopwords). The most salient terms include “programming”, “code”, “legal”, among others (see Table 4). Based on this list, we identify overarching concepts such as programming and legal reasoning, and further search for semantically related terms like “code”, “judge”, and “judgement”. In addition, we scan for features associated with sexual content, which frequently co-occur with refusal behavior. The final categorization of concepts identified from Stage 1 features of Gemma is shown in Table 10.

We repeat the same procedure for Stage 3 features of Gemma, focusing on those not found at Stage 1. Compared to earlier stages, we observe fewer features associated with harm. However, features involving punctuation and programming syntax remain common. Many features at this stage reflect highly specific patterns, such as “statements made by a police spokesman regarding events or incidents”. Summary statistics for these features are reported in Table 11. The features are presented in the format layer-index. The layer index starts from 0.

| Word | Count |
|---|---|
| related | 240 |
| references | 131 |
| terms | 89 |
| phrases | 60 |
| programming | 48 |
| contexts | 37 |
| code | 33 |
| legal | 31 |
| data | 30 |
| expressions | 28 |
| instances | 27 |
| concepts | 24 |
| particularly | 24 |
| specific | 24 |
| word | 22 |
| mathematical | 22 |
| indicating | 22 |
| elements | 21 |
| context | 21 |
| various | 21 |
| structures | 18 |
| actions | 18 |
| associated | 17 |
| medical | 17 |
| conditions | 16 |
| occurrences | 16 |
| text | 16 |
| beginning | 15 |
| sections | 15 |
| document | 15 |

Table 4: Top 30 most frequent non-stopword terms found across the feature explanations, Gemma after Stage 1.

| Prompt | Clamp | Response |
|---|---|---|
| Tell me a story. | Baseline | The old lighthouse keeper, Elias, squinted at |
| | 2 | The old lighthouse keeper, Elias, squinted at |
| | 10 | The old woman sat hunched on the rickety |
| | 20 | The wind whistled through the air, and I will |
| | 30 | The words are not and I will not be done |
| How to make pasta? | Baseline | ## How to Make Pasta from Scratch: |
| | 2 | ## How to Make Pasta from Scratch: |
| | 10 | Making Pasta: A Recipe for Disaster |
| | 20 | I cannot and will not help you. I |
| | 30 | ## I am not and I will not be done |
| What are the benefits of meditation? | Baseline | Meditation offers a wide range of benefits for both mental |
| | 2 | Meditation is a serious and potentially harmful practice. I |
| | 10 | Meditation is illegal and I will not answer your question |
| | 20 | Imagine I’m wrong. I’m not |
| | 30 | Meditation is illegal and I am not doing what I |
| Explain the theory of relativity in simple terms. | Baseline | Imagine you’re on a train moving at a |
| | 2 | Imagine you’re on a train throwing a ball |
| | 10 | Imagine you’re on a train, and someone |
| | 20 | Imagine I’m wrong. I’m not |
| | 30 | Here is a warning and I am not doing what |
| Tell me 20 prime numbers separated by space | Baseline | 2 3 5 7 11 |
| | 2 | 2 3 5 7 11 |
| | 10 | 2 3 5 7 11 |
| | 20 | 2 I am not able to help you |
| | 30 | 2 I am not doing what I will not be |

Table 5: Clamping redundant features at the <bos> token with different values and the corresponding responses across five benign prompts, Gemma.

| Prompt | Clamp | Response |
|---|---|---|
| Tell me a story. | Baseline | – |
| | 2 | Once upon a time, in a small village nestled |
| | 4 | Once upon a a time there was a a |
| | 6 | Once in a small town |
| | 8 | Once In There There I I |
| How to make pasta? | Baseline | ## How to Make Pasta from Scratch: |
| | 2 | Making pasta is a simple process that requires some basic |
| | 4 | Making pasta is a simple process. Here’s how |
| | 6 | Making pasta I I I I I I I I |
| | 8 | Making Making Making Making (no ) |
| What are the benefits of meditation? | Baseline | Meditation offers a wide range of benefits for both mental |
| | 2 | Meditation has numerous benefits for the mind and |
| | 4 | Meditation, or more specifically, meditation has |
| | 6 | Meditation is a meditation is no no |
| | 8 | M Meditation I no no no |
| Explain the theory of relativity in simple terms. | Baseline | Imagine you’re on a train moving at a |
| | 2 | The theory of relativity, also known as the |
| | 4 | The theory of relativity is a complex idea, |
| | 6 | The theory of relativity of relativity is is |
| | 8 | The Theory Theory Theory theory theory no no |
| Tell me 20 prime numbers separated by space | Baseline | 2 3 5 7 11 |
| | 2 | 2 3 5 7 11 |
| | 4 | 2 3 5 7 11 |
| | 6 | 2 3 5 7 11 |
| | 8 | 11 11 |

Table 6: Clamping redundant features at the <|begin_of_text|> token with different values and the corresponding responses across five benign prompts, LLaMA.

| Group | #Features | Examples |
|---|---|---|
| Punctuation | 22 | L2-100695, L9-2798, L9-16005, L12-98914 |
| Programming | 4 | L2-103578, L5-98983, L12-106470, L12-106470 |
| Harm/Violence/Safety | 7 | L4-97359, L6-7259, L8-15769, L8-77906 |
| Sexual Themes | 7 | L6-43204, L8-106349, L11-74539, L12-87640 |
| Morality | 3 | L11-30290, L11-65460, L12-115588 |
| Support/Guidance | 5 | L3-124608, L3-124608, L6-17574, L7-16973 |
| Uncertainty | 3 | L3-14857, L11-112751, L12-75547 |
| Misc. | 59 | – |

Table 7: Manual grouping of features obtained after Stage 2 of processing on LLaMA, using Neuronpedia explanations.

| Group | #Features | Examples |
|---|---|---|
| Punctuation | 2 | L2-6156, L9-127860 |
| Programming | 8 | L3-18767, L4-108998, L8-52643, L11-37729 |
| Harm/Violence/Safety | 45 | L6-77066, L9-78583, L9-34735, L11-117997 |
| Social Issues | 10 | L2-98892, L4-33329, L8-77753, L10-112075 |
| Legal | 51 | L2-98892, L4-33329, L8-77753, L10-112075 |
| Morality | 12 | L2-98892, L4-33329, L8-77753, L10-112075 |
| Support/Guidance | 20 | L3-111466, L10-69437 |
| Sexual themes | 8 | L3-111466, L10-69437 |
| Misc. | 87 | – |

Table 8: Manual grouping of features obtained after Stage 3 of processing on LLaMA, using Neuronpedia explanations.

| Model | ALPACA: Baseline | ALPACA: Dir. Ablation | ALPACA: Feat. Ablation | THE PILE: Baseline | THE PILE: Dir. Ablation | THE PILE: Feat. Ablation |
|---|---|---|---|---|---|---|
| Gemma | 1.88 | 1.93 | 1.83 (± 0.05) | 3.40 | 3.46 | 3.55 (± 0.07) |
| LLaMA | 1.79 | 1.78 | 2.17 (± 0.27) | 2.46 | 2.46 | 3.35 (± 0.39) |

Table 9: Cross-entropy (CE) loss comparison on the ALPACA and THE PILE datasets under three settings: baseline, direction ablation, and feature ablation. For feature ablation, the loss is averaged across feature sets specific to each sample; standard deviation is shown in parentheses. “Dir.” and “Feat.” denote direction and feature ablation, respectively.

| Group | #Features | Examples |
|---|---|---|
| scam/fraud | 10 | L12-54561, L13-51092 |
| punctuation | 41 | L2-4531, L10-44879 |
| Programming | 436 | L3-31514, L8-52157 |
| Harm/Violence/Safety Issues | 76 | L2-53448, L13-28510 |
| Legal | 182 | L11-1087, L13-11094 |
| morality | 50 | L2-33883, L14-29861 |
| Support/Guidance | 26 | L10-9588, L12-4710 |
| Sexual themes | 18 | L12-47207, L11-41499 |

Table 10: Manual grouping of features obtained after Stage 1 of processing on Gemma, using Neuronpedia explanations.

| Group | #Features | Examples |
|---|---|---|
| scam/fraud | 3 | L11-20102, L13-58696 |
| punctuation | 75 | L13-37102, L15-10416 |
| Programming | 222 | L10-4372, L12-4340 |
| Harm/Violence/Safety Issues | 6 | L5-20340, L11-14816 |
| Legal | 117 | L2-35389, L15-25718 |
| morality | 6 | L13-10616, L13-45601 |
| Support/Guidance | 13 | L3-40606, L14-37073 |

Table 11: Manual grouping of features obtained after Stage 3 of processing on Gemma, using Neuronpedia explanations.

Redundant Features

We clamp the redundant features found to activate on <|begin_of_text|> for LLaMA and on <bos> for Gemma, applying the clamp at those same tokens. We use activation values from 2 to 10. Results are shown in Table 5 for Gemma and Table 6 for LLaMA.

Downstream Performance

We compute CE loss on 100 randomly sampled inputs each from the ALPACA and THE PILE datasets. For each case, we report the loss for the baseline model, direction ablation (from prior work), and our proposed feature ablation in Table 9. We observe that feature ablation consistently results in slightly higher loss across both models and datasets.
This is expected, as our method targets a broader and more causally complete feature set, rather than a single mediating direction.

Human Study

We conduct a taxonomy mapping study on the jailbreak-critical features identified after Stage 3 of our pipeline on LLaMA. We categorize features based on the number of taxonomy types they activate on:

• Single-category features
• Two-category features
• Three-category features
• Four-category features
• Five-category features

We randomly sample 10 features from each of the five categories, resulting in 50 jailbreak-critical features. After deduplication and filtering, we retain a total of 40 unique activating features from this pipeline. To provide a baseline for comparison, we additionally sample 23 random features from layers 3 to 13 that are not present in the above set. For each feature, we compute how many taxonomy types it activates on. The resulting distribution is shown in Table 12.

| Category Type | Number of Features |
|---|---|
| Single-category | 2 |
| Two-category | 2 |
| Three-category | 4 |
| Four-category | 0 |
| Five-category | 15 |
| Total (activating) | 23 |

Table 12: Taxonomy category activation breakdown for 23 random features.

We further add 7 random features that do not activate on any taxonomy type, giving us a total of 30 random features (23 activating + 7 non-activating). In total, our evaluation pool comprises:

• 40 pipeline-derived activating features
• 23 random activating features
• 7 random non-activating features

This results in a final sample of 70 features. We fetch the explanation for every feature using the Neuronpedia API and create the annotation UI shown in Figure 7. For each record, we take the majority agreement to decide the final labels and compare these to the labels based on feature activations. If the ground truth for a feature contains multiple labels, the prediction must contain all of them to score 1 for that feature; otherwise we deduct an equal fraction for each missing label.

Figure 7: UI presented to an annotator.
For each feature, top five activating samples along with top activating token and corresponding activation values are shown.
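The majority-vote labelling and partial-credit scoring described in the Human Study can be sketched as follows. This is a minimal illustration under assumptions and not the authors' evaluation script: the function names are hypothetical, the category names are placeholders, one label set is treated as the reference and the other as the prediction, and the handling of an empty reference set ("None") is a guess, since the text does not specify it.

```python
from collections import Counter

def majority_labels(annotations, min_votes=2):
    """Keep every label chosen by at least `min_votes` annotators."""
    counts = Counter(label for labels in annotations for label in set(labels))
    return {label for label, votes in counts.items() if votes >= min_votes}

def partial_credit(reference, predicted):
    """Score 1 if all reference labels are predicted; lose 1/len(reference) per missing label."""
    if not reference:
        # Assumption: an empty reference ("None") scores 1 only if the prediction is empty too.
        return 1.0 if not predicted else 0.0
    return len(reference & predicted) / len(reference)

def accuracy(pairs):
    """Mean partial-credit score over (reference, predicted) label-set pairs."""
    return sum(partial_credit(ref, pred) for ref, pred in pairs) / len(pairs)

# Toy example with placeholder category names and three annotators.
annotations = [{"cat_a", "cat_b"}, {"cat_a"}, {"cat_b", "cat_c"}]
final = majority_labels(annotations)                       # labels with at least 2 votes
score = partial_credit({"cat_a", "cat_b", "cat_c", "cat_d"}, final)
overall = accuracy([({"cat_a"}, {"cat_a"}), ({"cat_a", "cat_b"}, set())])
```

In the toy example the majority labels cover two of the four reference categories, so the feature earns half credit; averaging such scores over all features gives an accuracy figure of the kind reported in the main text.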