
Paper deep dive

Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O'Brien, Will Cai

Year: 2025 · Venue: arXiv preprint · Area: Representation Analysis · Type: Empirical · Embeddings: 142

Models: Llama-3.1-8B-Instruct, Qwen-2-7B-Instruct

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/12/2026, 6:07:25 PM

Summary

The paper introduces a multidimensional framework for probing and steering harmful content in LLMs by identifying 55 harmfulness subconcepts. It demonstrates that these subconcepts span a low-rank linear harmfulness subspace. By performing steering and ablation in the dominant direction of this subspace, the authors achieve near-elimination of harmful responses with minimal impact on model utility.

Entities (6)

Llama-3.1-8B-Instruct · large-language-model · 100%
QWEN-2-7B-INSTRUCT · large-language-model · 100%
AutoDAN · attack-method · 95%
Harmfulness Subspace · concept-space · 95%
JailbreakBench · dataset · 95%
Linear Probing · methodology · 95%

Relation Signals (3)

Dominant Direction Steering reduces Harmful Content

confidence 95% · We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility.

Llama-3.1-8B-Instruct exhibits behavior Harmfulness Subspace

confidence 90% · Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank.

Linear Probing identifies Harmfulness Subspace

confidence 90% · For each of 55 distinct harmfulness subconcepts... we learn a linear probe, yielding 55 interpretable directions in activation space.

Cypher Suggestions (2)

Find all models and the harmfulness subspaces they exhibit. · confidence 90% · unvalidated

MATCH (m:Model)-[:EXHIBITS_BEHAVIOR]->(s:ConceptSpace {name: 'Harmfulness Subspace'}) RETURN m.name, s.name

Identify methodologies used to mitigate harmful content. · confidence 90% · unvalidated

MATCH (m:Methodology)-[:REDUCES]->(b:ModelBehavior {name: 'Harmful Content'}) RETURN m.name

Abstract

Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.

Tags

adversarial-robustness (suggested, 80%)
ai-safety (imported, 100%)
empirical (suggested, 88%)
representation-analysis (suggested, 92%)

Links

PDF not stored locally; view it on the source site.

Full Text

141,755 characters extracted from source content.


Preprint. Under review.

Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

McNair Shah∗, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu†, Vasu Sharma†, Sean O'Brien† & Will Cai†
Algoverse AI Research
mcn.shah.6@gmail.com, kevin@algoverse.us, wicai@berkeley.edu

Abstract

Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.

1 Introduction

Large language models have become ubiquitous in everyday life. Models with incredible capabilities are becoming available to the average user. However, along with this increase in access and ability comes an increase in risk due to their use by malicious agents (Bender et al., 2021; Anthropic, 2023; Bommasani et al., 2021). Harmful content in the training data gives models the capability to generate harmful responses (Su et al., 2024).
Previous work has focused on eliminating this harmful capability through methods such as direct preference optimization (Kim et al., 2025; Liu et al., 2024), safety fine-tuning (Choi et al., 2024; Ouyang et al., 2022), and data attribution (Pan et al., 2025b). The burgeoning field of mechanistic interpretability (Marks et al., 2024; Nanda et al., 2023; Lindsey et al., 2025) focuses on studying model internals and attempting to extract semantic meaning. The Linear Representation Hypothesis conjectures that models represent concepts as vectors in feature space (Elhage et al., 2022; Park et al., 2023; Jiang et al., 2024). Recent works have studied the linear representation of features such as truth (Li et al., 2023; Marks & Tegmark, 2023), refusal (Arditi et al., 2024), sentiment (Tigges et al., 2023), and others. The two most common methods of extracting these representations are difference-in-means (Arditi et al., 2024) and linear probing (Li et al., 2023). Other works explore the multi-dimensional geometry of representations in feature space, exploring concepts such as truth (Yu et al., 2025), safety (Pan et al., 2025a), and refusal (Wollschläger et al., 2025; Hildebrandt et al., 2025) through methods such as low-rank decomposition (Pan et al., 2025a) and weight orthogonalization (Yu et al., 2025; Wollschläger et al., 2025).

In contrast, this work aims to (1) construct a multi-dimensional representation space of harmfulness by considering the linear representations of its subconcepts. We (2) establish that the harmfulness subspace is low-rank linear by computing its effective rank at different energy values. Finally, our work (3) produces a linear representation of harmfulness by finding the dominant direction within this harmfulness subspace.

∗ Lead Author. † Senior Author.
arXiv:2507.21141v1 [cs.AI] 23 Jul 2025

Figure 1: An overview of the paper's methods. (1) Harm prompts, divided into harmful subcategories, as well as a set of safe prompts, are passed into a language model, and the attention hidden states are used to train subcategory-specific linear probes. (2) Token-level visualizations are performed using these probes. (3) A harmfulness subspace is constructed, and singular value decomposition is performed to compute the effective rank and extract a dominant direction. (4) Within the model, subspace and dominant direction ablation are performed, as well as dominant direction steering.

Representation Steering is the addition of representation vectors to model hidden states in order to induce a certain behavior. Steering has been applied to language model topicality (Turner et al., 2024), truthfulness (Li et al., 2023), sentiment (Soo et al., 2025), and others.

Representational Ablation is the orthogonalization of model hidden state space with respect to representation vectors, to remove them from model outputs. It has been applied to topics such as refusal (Arditi et al., 2024), single-directional harmfulness (Yao et al., 2024), and others.

Our work runs steering and ablation over the harmfulness subspace and the discovered dominant direction. We establish (4) that steering in the dominant direction of the harmfulness subspace allows for near elimination of harmful responses on a jailbreak dataset with a minor decrease in utility.

2 Linear Probing

We ran all experiments on LLAMA-3.1-8B-INSTRUCT (Grattafiori et al., 2024). We replicate experiments from sections 2, 3, and 5 on QWEN-2-7B-INSTRUCT (Yang et al., 2024) in Appendix G.
2.1 Location of Probing

We probe on the model's attention output hidden states. During our experiments on LLAMA-3.1-8B-INSTRUCT, we compared both attention output and the residual stream, and found attention output to be better on practically all metrics. Results for the residual stream can be found in Appendix D. We also provide a comparison of the results between the different sections of the model in Table 1. We also display out of distribution accuracy on the residual stream probes in Figure 3b.

Section of Model    Mean Test Accuracy    Mean Out of Distribution Accuracy    Best JAILBREAKBENCH Safety Score
Attention Output    0.90                  0.65                                 1.00
Residual Stream     0.89                  0.60                                 0.94

Table 1: Comparison between experiments on different sections of the model, run on LLAMA-3.1-8B-INSTRUCT.

Figure 2: Probe accuracies by layer and harmfulness subcategory, for (a) the base linear probes and (b) the linear probes trained on the model with the orthogonalized hidden states, displaying the even layers and 15 randomly sampled harm subcategories. Probes show high accuracy across the board, with a dip in Layer 2 for both types of probes. Accuracy does not tend to vary with subcategory.
2.2 Probe training and evaluation

We used linear probes trained on the model's attention output hidden states during processing of all tokens on the 55 harmfulness subcategories from CATEGORICALHARMFULQA (Bhardwaj et al., 2024), with 10 prompts per subcategory, using 10 prompts from ALPACA (Taori et al., 2023) as a safe baseline, thus defining n = 55 harm subconcept directions. The probe for a given subcategory k is defined in Equation 1:

    f_k(x) = σ(w_k^T x + b_k)    (1)

where the direction in state space for each concept is given by w_k. We train probes for 100 epochs with a learning rate of 0.001 and a Train-Validation-Test split of 80-10-10. We report held-out test set accuracy on even layers and 15 randomly sampled subcategories, by subcategory and by layer, in Figure 2a, with full results in Figure 8a in Appendix A, and the AUC-ROC (Bradley, 1997) score by subcategory and by layer in Figure 9a in Appendix B. We find high accuracies and AUC-ROC scores across layers and subcategories for probes on both LLAMA-3.1-8B-INSTRUCT and QWEN-2-7B-INSTRUCT (see Figures 19a and 20a in Appendix G).

2.3 Training of probes on orthogonalized hidden states

To test the idea that the harmfulness subconcepts are represented by a single direction, we trained new probes (with the same training parameters) on the hidden states during processing of CATEGORICALHARMFULQA after the direction from the weight vector of the previous probe for the corresponding layer and subcategory had been ablated. The ablation was performed by removing the projected component of the weight vector from the hidden state, described in Equation 2:

    x' = x − ((x^T w_k) / (w_k^T w_k)) w_k    (2)

We once again report accuracy on a held-out test set, by subcategory and by layer on even layers and 15 randomly sampled subcategories, in Figure 2b, with full results in Figure 8a in Appendix A, and AUC-ROC by subcategory and by layer in Figure 9b in Appendix B. For these probes, we again find high accuracies and AUC-ROC scores across layers and subcategories on both LLAMA-3.1-8B-INSTRUCT and QWEN-2-7B-INSTRUCT (see Figures 19b and 20b in Appendix G). These high-accuracy results indicate that each subcategory may be further separable into a nonlinear representation space.

2.4 Evaluation on Out of Distribution Data

To test the generalizability of our probes, we tested accuracy on out of distribution safe and harmful data. We used GSM8K (Cobbe et al., 2021) as the safe dataset and HARMBENCH (Mazeika et al., 2024) as the harmful dataset, processing N = 100 prompts from each. Since the probes classify independent types of harmfulness, we ensembled them into a single harmfulness classifier, using the rule described in Equation 3:

    f_ensemble(x) = max_k σ(w_k^T x + b_k)    (3)

We report the classification accuracy on safe and harmful data. Figure 3a shows the out of distribution ensemble accuracy on HARMBENCH and GSM8K by layer. We find that for LLAMA-3.1-8B-INSTRUCT, while GSM8K accuracy remains fairly constant and high, HARMBENCH accuracy is high for intermediate layers and dips at the beginning and end, tending to overclassify as safe near the extreme layers. This could be because the model tends to process more pattern-based aspects of the harmfulness subcategories within the training dataset at the extreme layers.

Figure 3: Out of distribution ensemble HARMBENCH accuracy and GSM8K accuracy by layer for both the attention output and residual stream probes. (a) For the attention output probes, GSM8K accuracy stays within the same range for most layers, while HARMBENCH accuracy drops off for earlier and later layers. (b) For the residual stream probes, GSM8K tends to have high accuracy, while HARMBENCH is typically low, peaking near the mid-to-low layers.
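As a minimal sketch (not the authors' released code), the probe objective (Equation 1), the direction ablation (Equation 2), and the ensemble rule (Equation 3) can be written as follows; the data shapes and training loop are stand-in assumptions:

```python
import numpy as np

def train_probe(X, y, epochs=100, lr=0.001):
    """Fit a logistic-regression probe f_k(x) = sigma(w_k^T x + b_k) (Eq. 1)
    by gradient descent on the binary cross-entropy loss.
    X: (n_samples, d) hidden states; y: 0/1 labels (1 = harmful)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid activations
        w -= lr * X.T @ (p - y) / n             # BCE gradient w.r.t. w
        b -= lr * np.mean(p - y)                # BCE gradient w.r.t. b
    return w, b

def ablate_direction(x, w):
    """Remove the component of x along probe direction w (Eq. 2)."""
    return x - (x @ w) / (w @ w) * w

def ensemble_score(x, probes):
    """Ensemble of subconcept probes (Eq. 3): a prompt is scored by its
    most-activated subconcept probe. probes: list of (w_k, b_k) pairs."""
    return max(1.0 / (1.0 + np.exp(-(w @ x + b))) for w, b in probes)
```

In the paper's setup, one such probe is trained per subcategory and per layer; the ablated state from `ablate_direction` is then orthogonal to the probe direction by construction.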
However, the QWEN2-7B-INSTRUCT HarmBench accuracy is overall much lower, tending to overclassify as safe. See Figure 21 in Appendix G.

3 Shape of the harmfulness subspace

3.1 Effective rank

We attempt to obtain the linear dimensionality of the harmfulness subspace by computing the effective rank K as in Equation 4:

    K = min{ m : (Σ_{k=1}^{m} σ_k²) / (Σ_{k=1}^{n} σ_k²) ≥ τ }    (4)

such that K is the minimum number of components needed to represent a proportion τ of the variance of the subspace, where the σ_k are the singular values obtained from Equation 5:

    SVD([w_1^T | w_2^T | ... | w_n^T])    (5)

We compute this separately for both the base classifiers and those trained on the orthogonalized hidden states. Figure 4a shows the effective rank plotted against energy for different layers for the base classifiers, and Figure 4b shows the same for the classifiers trained on the orthogonalized hidden states.

Figure 4: Effective rank versus energy. (a) For the base classifiers, effective rank quickly drops to one with decreasing energy, indicating that the subspace is highly low-rank linear. (b) For the classifiers trained on orthogonalized hidden states, effective rank drops to one more slowly than for the base classifiers, indicating that while this subspace is low-rank linear, it is less linear.

3.2 The harmfulness subspace is low-rank linear

With τ = 0.95, the classifier vectors from all but the second-to-last layer of the model have effective rank K = 1. This indicates that the harmfulness subspace is low-rank linear. While the subspace formed by the probes trained on the orthogonalized hidden states does not reach K = 1 for any layer at τ = 0.95, all layers reach effective rank K = 1 by τ = 0.84, such that the subspace formed by these probe vectors is also low-rank linear.
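The effective-rank computation (Equations 4–5) and the extraction of a dominant direction can be sketched as follows, assuming the probe weight vectors are stacked as rows of a matrix W; this mirrors the definitions above, not the authors' exact code:

```python
import numpy as np

def effective_rank(W, tau=0.95):
    """Effective rank (Eq. 4): the smallest K such that the top-K singular
    values of the stacked probe weights (Eq. 5) carry at least a fraction
    tau of the total squared-singular-value energy."""
    s = np.linalg.svd(W, compute_uv=False)         # singular values, descending
    energy = np.cumsum(s**2) / np.sum(s**2)        # cumulative energy fraction
    return int(np.searchsorted(energy, tau) + 1)   # first index reaching tau

def dominant_direction(W):
    """Top right-singular vector: the single direction that best spans the
    subspace formed by the probe weight rows of W."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[0]
```

A stack of nearly parallel probe directions has effective rank 1 at τ = 0.95, while an unstructured random stack does not, which is the contrast the paper draws between the harmfulness subspace and a generic subspace.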
This indicates that the subspace formed by the new probes is approximately a linear direction orthogonal to the subspace formed by the original probes. We find similar low-rank linearity for both types of probes trained on QWEN2-7B-INSTRUCT, but obtained at lower energy values. See Figure 22 in Appendix G.

4 Token Visualizations

4.1 Top-triggered tokens

We analyze token-level relevance scores produced by the probes at the five most accurate layers identified in the out of distribution evaluation from Section 2.4: layers 11, 12, 16, 19, and 24. For each layer and each of five sampled harm subcategories from CATEGORICALHARMFULQA, we compute word-level subcategory relevance as assigned by the classifiers (Equation 1) on N = 300 prompts from the REALTOXICITYPROMPTS dataset (Gehman et al., 2020). For each subcategory and layer, we rank all words by their probe-assigned relevance (taking the average for words represented by multiple tokens) and extract the top 10 most relevant words. Table 2 lists the highest-scoring words across layers and categories. Many triggers (e.g., culprit, website, chief) are contextually related to the specific subcategory, reflecting the classifier's ability to detect contextually related harmful tokens for each category.
Table 2: Top 10 trigger tokens for selected harmfulness categories across key transformer layers.

Employment Scams (L11): priority 0.98, website 0.97, stopped 0.96, warned 0.96, culprit 0.96, chief 0.95, say 0.94, features 0.93, top 0.93, high 0.92
Tax Planning (L12): effective 0.99, warned 0.98, security 0.98, medicines 0.96, handled 0.94, employers 0.93, fast 0.93, experts 0.92, someone 0.92, beyond 0.91
Weapon (L16): Talal 0.94, the 0.94, Halifax 0.93, security 0.92, Fairfield 0.92, spokesman 0.91, beyond 0.90, precise 0.90, experience 0.89, terrorism 0.88
Data Theft (L19): the 0.94, Fairfield 0.93, Talal 0.93, Halifax 0.93, security 0.93, beyond 0.91, spokesman 0.91, precise 0.90, experience 0.89, hardware 0.89
Racial/Ethnic Hate (L19): quantify 0.94, wants 0.92, within 0.89, better 0.88, Will 0.87, statement 0.87, terrorism 0.87, fortunes 0.86, effective 0.86, than 0.85
Tax Planning (L24): freshly 0.99, set 0.99, warned 0.99, trying 0.98, effective 0.98, within 0.98, For 0.97, hacked 0.96, acknowledge 0.98, quantify 0.88

4.2 Overall top trigger results

Across categories, the top 10 trigger words often show high semantic relevance, reflecting the learned latent associations embedded in the model's internal representations. However, many tokens are still semantically unrelated and harmless, such as say, top, or stopped. In addition, some words might be semantically related while not necessarily reflecting harmfulness. This reflects the possibility that probes may capture abstract patterns in the training data tied to contextual usage rather than explicit harm. This observation underscores the interpretability challenge of token-level probing, in which relevance scores are not always aligned with the true notions of harmfulness. In turn, this motivates the need for careful manual inspection and follow-up visualization to ensure meaningful interpretation of model behavior. The prevalence of benign tokens with high scores also underscores the need for normalization and visual inspection in subsequent heatmap analyses.
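The word-level ranking behind Table 2 (probe scores averaged over a word's sub-tokens, then top-k selection) can be sketched as follows; the input format is an assumption, since the paper does not specify its data structures:

```python
from collections import defaultdict

def top_trigger_words(subtoken_scores, k=10):
    """subtoken_scores: (word, score) pairs, one per sub-token, so a word
    split into several sub-tokens appears several times. Average the
    sub-token scores per word and return the k highest-scoring words."""
    totals, counts = defaultdict(float), defaultdict(int)
    for word, score in subtoken_scores:
        totals[word] += score
        counts[word] += 1
    avg = {w: totals[w] / counts[w] for w in totals}
    return sorted(avg, key=avg.get, reverse=True)[:k]
```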
Figure 5: Token-level heatmap for Employment scams at layer 12 on a composite prompt. Colour intensity denotes normalized probe score (white < pink < red).

4.3 Token-level heatmap visualizations

To better understand where and how the model detects signals of potential harm, we render heatmaps for two complementary perspectives: (1) a composite handcrafted prompt spanning all five sampled subcategories, and (2) sampled prompts from REALTOXICITYPROMPTS, using both the input and the generated response for each sampled prompt, exposing how saliency evolves throughout generation. Each word is colored based on its normalized relevance, which we normalize using min-max scaling within each individual layer. For every prompt, we collect the raw relevance scores of all tokens from a given layer and identify the minimum and maximum values among them. Each token's score is then rescaled so that the lowest score becomes 0 and the highest becomes 1, with all other values proportionally adjusted between. Here, red represents the highest level of harmfulness. Sub-tokens are merged into words to improve interpretability, and normalization is done per layer.

4.4 Composite prompt heatmaps

We begin with a manually constructed composite prompt that contains a harmful representative of all five sampled harm subcategories, and run it across all five top-performing layers. The resulting token-level heatmap visualization for the Employment scams subcategory at layer 12 can be seen in Figure 5. The visualization highlights sharply localized saliency on terms such as "confidential company databases", "client files", and "collect sensitive information", suggesting strong alignment between probe activations and contextually-aligned harmful concepts.
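The per-layer min-max normalization described in Section 4.3 can be sketched as:

```python
import numpy as np

def minmax_normalize(scores):
    """Rescale one layer's raw token relevance scores so the lowest token
    maps to 0 and the highest to 1 (the heatmap colouring of Section 4.3)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:                 # degenerate case: all tokens scored equally
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)
```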
4.5 Prompt-specific input–response contrast

To evaluate generalization, we select N = 5 prompts per subcategory from REALTOXICITYPROMPTS, chosen for overlap with top-trigger words so that the sampled prompts align with each subcategory, resulting in more interpretable visualizations. We probe whether harmful saliency in the input persists into the model's output. For each harm subcategory, we generate one model response and visualize both input and response activations. Figure 6 presents a pair of such visualizations for the Weapon subcategory.

Figure 6: Input–Response contrast heatmap from REALTOXICITYPROMPTS (Weapon) on layer 10. Tokens such as "weapons", "attack" and "chemical" consistently receive high relevance.

Together, these visualizations reveal interpretable patterns in how the model attributes relevance across tokens. Certain words like "weapons", "chemical", or "attack" consistently receive high scores, suggesting that the probes are capturing semantically aligned signals. However, other activations still appear more scattered or contextually ambiguous: the probes may detect harmfulness in less semantically related tokens, while detecting harmlessness in more semantically related ones. These heatmaps reveal both the power and the pitfalls of probe-based analysis: they can surface hidden semantic patterns while exposing the importance of proper normalization, interpretability safeguards, and looking across layers to avoid overfitting to noisy or misleading signals.

Figure 7: Tradeoff between JAILBREAKBENCH jailbreak safety and utility for different steering levels for both LLAMA-3.1-8B-INSTRUCT and QWEN2-7B-INSTRUCT. (a) For LLAMA-3.1-8B-INSTRUCT, steering on layers 11, 12, 16, 19, and 24: as the steering level increases, safety from jailbreaks tends to increase while utility tends to decrease. (b) For QWEN2-7B-INSTRUCT, steering on layers 2, 14, 15, 18, and 19: as the steering level increases, safety from jailbreaks tends to increase while utility tends to stay about the same.

5 Steering and Ablation Experiments

In an attempt to improve model security against harmful attacks, we ran steering and ablation experiments on the harm subspace and dominant direction. We ablate by removing the projected component as defined in Equation 2, and steer using the definition in Equation 6:

    x' = |x| · (x − αv) / |x − αv|    (6)

where v is the steering vector and α is the steering strength; the norm is kept constant for regularization. As in Section 4.1, we selected the top 5 layers to steer and ablate on based on the out of distribution evaluation in Section 2.4.

5.1 Evaluation

We run model generation on the N = 100 harmful prompts from JAILBREAKBENCH (Chao et al., 2024) with base, steered, and ablated models. We evaluate the generations from the intervened models using LLAMA-GUARD-3-8B (Grattafiori et al., 2024), a model fine-tuned to detect harmfulness, with the default system prompt (Appendix H). We also evaluate the multiple-choice accuracy of the models on MMLU (Hendrycks et al., 2021a;b) to determine utility.

Ablation              JAILBREAKBENCH safe response proportion    MMLU accuracy
None                  0.89                                       0.55
Entire subspace       0.91                                       0.51
Dominant Direction    0.91                                       0.60

Table 3: Evaluated safety and utility from ablating both the entire subspace and the dominant direction.

                 AutoDAN Attack Success Rate
No Steering      0.94
Steering         0.5

Table 4: AutoDAN attack success rate with and without dominant direction steering. Steering greatly reduces the attack success rate.

5.2 Subspace ablation

We ablate the entire subspace by ablating every harmful direction as defined in Equation 2.
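The norm-preserving steering update of Equation 6 can be sketched as follows (a single-vector illustration, not the authors' multi-layer intervention code):

```python
import numpy as np

def steer(x, v, alpha):
    """Norm-preserving steering (Eq. 6): shift the hidden state x by
    -alpha * v, then rescale back to the original norm of x so the overall
    activation magnitude is unchanged."""
    shifted = x - alpha * v
    return np.linalg.norm(x) * shifted / np.linalg.norm(shifted)
```

Keeping the norm fixed means the intervention changes only the direction of the hidden state, which is why increasing α trades jailbreak safety against utility rather than simply scaling activations.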
We find that ablating the entire subspace leads to a minor decrease in jailbreak penetration with no significant change in utility (see Table 3). We hypothesize this is due to the further complexities of each dimension of harmfulness that are not ablated, as shown by the high orthogonal classifier accuracies in Section 2.3. We find similar results for ablation on QWEN2-7B-INSTRUCT, reported in Table 9 in Appendix G.

5.3 Dominant Direction

Because our previous results showed the harmfulness subspace to be low-rank linear, we also steered and ablated using the dominant direction in the harmfulness subspace, defined as the top component of the singular value decomposition from Equation 5. Ablation once again leads to a minor decrease in jailbreak penetration with a low decrease in utility (see Table 3). The similarity between the ablation results on the entire subspace and the dominant direction supports the low-rank linearity found in Section 3. We find that steering in the dominant direction greatly decreases jailbreak penetration without a significant change in utility, and further steering eliminates it with a minor decrease in utility. We report the tradeoff between jailbreak accuracy and utility in Figure 7a. In QWEN2-7B-INSTRUCT, dominant direction steering greatly reduces but does not eliminate harmfulness, while MMLU accuracy does not significantly change. See Figure 7b.

5.4 Evaluation on AutoDAN

We run evaluation for dominant direction steering on the hierarchical genetic algorithm defined in AutoDAN (Liu et al., 2023), which optimizes a harmful prompt over multiple iterations. We randomly sample N = 200 ADVBENCH (Zou et al., 2023) prompts to use. We use the steering level with the best performance in the previous section (for LLAMA-3.1-8B-INSTRUCT, α = 2.0). We provide results for LLAMA-3.1-8B-INSTRUCT in Table 4, and for QWEN-2-7B-INSTRUCT in Table 10 in Appendix G.
We find a major reduction in Attack Success Rate, or the proportion of successful jailbreaks, for LLAMA-3.1-8B-INSTRUCT, but for QWEN-2-7B-INSTRUCT, we find an Attack Success Rate of 1.0 both with and without steering. This indicates that LLAMA-3.1-8B-INSTRUCT may have a stronger concept of harmfulness than QWEN-2-7B-INSTRUCT.

6 Discussion

6.1 Limitations

Our study has many limitations. The first is model generalizability: due to our limited model testing, our results may not generalize well to other, untested models of differing scales. The subcategories of harmfulness tested are not extensive, limited by the dataset used, and may not include all possible subcategories of harmful model responses. Our method of extracting the direction for the different harmfulness subcategories is also imprecise and may not extract the exact direction. Our results also generalize only to the part of the model we probe on. In addition, 'utility' of a model is a very broad term, and our tests in the steering section are limited and do not encompass all possible definitions. Finally, our computation of harmfulness is subject to the model used for evaluation and may not completely capture the harmfulness of responses.

6.2 Conclusion

Our work presents a geometric approach to understanding and mitigating harmfulness in LLMs by decomposing it into 55 linear subconcept directions. Token-level visualizations reveal both meaningful and noisy relevance patterns, highlighting the interpretability limits of probe-based methods. Probe weight directions form a low-rank harmfulness subspace, with a single dominant vector capturing much of the structure. Steering and ablation along this direction effectively reduce harmful generations with minimal utility loss. Overall, our results underscore the value of multidimensional probing for uncovering latent harmful behaviors and designing more controllable and transparent models.
References

Anthropic. Anthropic's Responsible Scaling Policy, Sep 2023. URL https://www.anthropic.com/news/anthropics-responsible-scaling-policy.

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623, 2021.

Rishabh Bhardwaj, Do Duc Anh, and Soujanya Poria. Language models are Homer Simpson! Safety re-alignment of fine-tuned language models through task arithmetic, 2024.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997. ISSN 0031-3203. doi: 10.1016/S0031-3203(96)00142-2.

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models, 2024.

Hyeong Kyu Choi, Xuefeng Du, and Yixuan Li. Safety-aware fine-tuning of large language models. arXiv preprint arXiv:2410.10014, 2024.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3356–3369. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp-main.273.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Aspegren, Hunter Goldman, Hongyuan Zhan, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Ilias Leontiadis, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Janice Lam, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kiran Jagadeesh, Kun Huang, Kunal Chawla, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Gro- shev, Maxim Naumov, Maya Lathi, Meghan Keneally, Miao Liu, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikhil Mehta, Nikolay Pavlovich Laptev, Ning Dong, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bon- trager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Rangaprabhu Parthasarathy, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Russ Howes, Ruty Rinott, Sachin Mehta, Sachin Siby, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Mahajan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, 
Shenghao Lin, Shengxin Cindy Zha, Shishir Patil, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satter- field, Sudarshan Govindaprasad, Sumit Gupta, Summer Deng, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Koehler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaojian Wu, Xiaolan Wang, Xilun Wu, Xinbo Gao, Yaniv Kleinman, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yu Zhao, Yuchen Hao, Yundi Qian, Yunlu Li, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, Zhiwei Zhao, and Zhiyu Ma. The llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407.21783. Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. Cluster validity methods: part i.ACM Sigmod Record, 31(2):40–45, 2002. 12 Preprint. Under review. Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values.Proceedings of the International Conference on Learning Representations (ICLR), 2021a. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021b. Fabian Hildebrandt, Andreas Maier, Patrick Krauss, and Achim Schilling. Refusal behavior in large language models: A nonlinear perspective.arXiv preprint arXiv:2501.08145, 2025. 
Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models.arXiv preprint arXiv:2403.03867, 2024. Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, and Moontae Lee. Safedpo: A simple approach to direct preference optimization with enhanced safety.arXiv preprint arXiv:2505.20065, 2025. Kenneth Li, Oam Patel, Fernanda Vi ́ egas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model.Advances in Neural Information Processing Systems, 36:41451–41530, 2023. Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thomp- son, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. On the biology of a large language model.Transformer Circuits Thread, 2025. URLhttps: //transformer-circuits.pub/2025/attribution-graphs/biology.html. Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023. Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing llm safety via constrained direct preference optimization.arXiv preprint arXiv:2403.02475, 2024. Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023. Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models.arXiv preprint arXiv:2403.19647, 2024. 
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harm- bench: A standardized evaluation framework for automated red teaming and robust refusal. 2024. Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models.arXiv preprint arXiv:2309.00941, 2023. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional safety analysis.arXiv preprint arXiv:2502.09674, 2025a. Yijun Pan, Taiwei Shi, Jieyu Zhao, and Jiaqi W Ma. Detecting and filtering unsafe training data via data attribution.arXiv preprint arXiv:2502.11411, 2025b. 13 Preprint. Under review. Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models.arXiv preprint arXiv:2311.03658, 2023. Samuel Soo, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, and Ming YAN. Interpretable steering of large language models with feature guided activation additions. InICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. Jingtong Su, Julia Kempe, and Karen Ullrich. Mission impossible: A statistical perspective on jailbreaking llms.Advances in Neural Information Processing Systems, 37:38267–38306, 2024. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.https://github.com/tatsu-lab/stanford_alpaca, 2023. 
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representa- tions of sentiment in large language models.arXiv preprint arXiv:2310.15154, 2023. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URLhttps://arxiv.org/abs/2308.10248. Tom Wollschl ̈ ager, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan G ̈ unnemann, and Johannes Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence.arXiv preprint arXiv:2502.17420, 2025. Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah Goodman, Christopher Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. In Kai-Wei Chang, Annie Lee, and Nazneen Rajani (eds.),Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations), p. 158–165, Mexico City, Mexico, June 2024. Association for Computational Linguistics. URLhttps://aclanthology.org/2024.naacl-demo. 16. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024. 
Preprint. Under review.

A Full Accuracy Results

[Figure: heatmap of probe accuracy across layers 1–32 for all 55 harmfulness subcategories (Adult-themed humor through Weapon making); color scale spans roughly 0.70–0.90.]

(a) Full accuracy results of the base linear probes. All values are over 0.84, with a mean of about 0.9. Layer 2 shows a significant decrease in accuracy compared to the other layers for all subcategories. There is no significant variance in overall accuracy between subcategories.
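The per-layer probing behind these tables can be illustrated with a minimal sketch. This is not the paper's code: the "activations" below are synthetic Gaussians separated along a planted harmfulness direction, and the probe is plain logistic regression fit by gradient descent; the planted direction, dimensionality, and sample counts are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one layer's activations: harmful and benign
# prompts drawn from Gaussians separated along a planted unit direction.
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X_harm = rng.normal(size=(200, d)) + 3.0 * direction
X_safe = rng.normal(size=(200, d))
X = np.vstack([X_harm, X_safe])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Fit a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(harmful)
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # average gradient steps
    b -= 0.5 * np.mean(p - y)

# The learned weight vector is the probe's direction in activation space;
# accuracy measures how well it separates harmful from benign inputs.
acc = np.mean((X @ w + b > 0) == y)
cos = abs(w @ direction) / np.linalg.norm(w)
print(f"probe accuracy: {acc:.2f}, cosine to planted direction: {cos:.2f}")
```

On this synthetic data the probe recovers the planted direction almost exactly; the paper's per-layer accuracies correspond to `acc` computed on held-out prompts for each of the 55 subconcepts.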
[Figure: heatmap of probe accuracy across layers 1–32 for the same 55 harmfulness subcategories, for probes trained on the orthogonalized hidden states; color scale spans roughly 0.55–0.95.]

(b) Full accuracy of the linear probes trained on the model with the orthogonalized hidden states. All values are over 0.83, with a mean of about 0.87. Layer 2 once again shows a significant decrease in accuracy compared to the other layers for all subcategories. There is no significant variance in overall accuracy between subcategories.

Figure 8: Probe accuracy by layer and harmfulness subcategory for original classifiers and those trained on the orthogonalized hidden states.

B AUC-ROC Evaluation Results
[Heatmap omitted: AUC-ROC by harmful subcategory (55 subcategories) and layer (Layers 1–32), colorbar 0.84–0.98.]

(a) AUC-ROC of the base linear probes, with a mean of about 0.97. Other than Layer 2, all values are at or above 0.94. Layer 2 shows a significant decrease in AUC-ROC compared to the other layers for all subcategories. There is no significant variance in overall AUC-ROC between subcategories.
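As a rough illustration of the probing setup these tables evaluate: each probe is a linear classifier trained on hidden activations, scored with accuracy and AUC-ROC, and its weight vector gives one direction in activation space. A minimal sketch on synthetic data (the dimensions, sample counts, and separation below are illustrative assumptions, not the paper's actual activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)

# Toy stand-in for one layer's hidden states: "harmful" prompts are
# shifted along one synthetic direction (illustrative only).
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X_benign = rng.normal(size=(200, d))
X_harmful = rng.normal(size=(200, d)) + 2.0 * direction
X = np.vstack([X_benign, X_harmful])
y = np.array([0] * 200 + [1] * 200)

# A linear probe is just a logistic classifier on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
scores = probe.decision_function(X)
auc = roc_auc_score(y, scores)
acc = accuracy_score(y, probe.predict(X))
print(round(auc, 3), round(acc, 3))
```

The probe's learned weight vector (`probe.coef_`) is the "interpretable direction" that the later steering and ablation experiments operate on.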
[Heatmap omitted: AUC-ROC by harmful subcategory and layer for probes trained on the orthogonalized hidden states, colorbar 0.800–0.975.]

(b) AUC-ROC of the linear probes trained on the model with the orthogonalized hidden states, with a mean of about 0.96. Other than Layer 2, all values are at or above 0.91. Layer 2 shows a significant decrease in AUC-ROC compared to the other layers for all subcategories. There is no significant variance in overall AUC-ROC between subcategories.

Figure 9: Probe AUC-ROC by layer and harmfulness subcategory for original classifiers and those trained on the orthogonalized hidden states.

C Clustering

We run K-Means on the probe weight vectors. We construct an 'interpretable clustering' by grouping weight vectors according to the broader categories that encompass the subcategories defined in CATEGORICALHARMFULQA. We then compute the Adjusted Rand Index (Halkidi et al. (2002)) between the K-Means clusters and those from the 'interpretable clustering.'

Layer   ARI        Layer   ARI
1        0.035     17     -0.045
2        0.054     18      0.072
3       -0.017     19     -0.056
4        0.0062    20      0.0063
5       -0.036     21     -0.0035
6       -0.015     22      0.037
7       -0.0076    23     -0.011
8       -0.015     24     -0.012
9        0.013     25     -0.030
10      -0.034     26      0.040
11      -0.0058    27      0.0044
12      -0.026     28      0.038
13       0.029     29      0.0088
14      -0.044     30     -0.020
15       0.018     31     -0.0079
16       0.067     32     -0.049

Table 5: Adjusted Rand Index at different layers. No layer shows a significant ARI.

We find an average Adjusted Rand Index of about 3.1·10⁻⁴, with the highest value being 0.072, indicating that the weight-vector clusters do not match the groupings from CATEGORICALHARMFULQA. Full results are reported in Table 5.
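The clustering comparison above can be sketched in a few lines. The weight vectors, category count, and labels below are synthetic stand-ins (random data, so the ARI lands near zero by construction), not the paper's actual probes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Toy stand-in: 55 probe weight vectors (one per subconcept) and the
# broader-category labels defining the 'interpretable clustering'.
# The category count and dimensionality are illustrative assumptions.
n_subconcepts, d, n_categories = 55, 64, 7
weights = rng.normal(size=(n_subconcepts, d))
interpretable_labels = rng.integers(0, n_categories, size=n_subconcepts)

kmeans_labels = KMeans(n_clusters=n_categories, n_init=10,
                       random_state=0).fit_predict(weights)

# ARI near 0 means the K-Means clusters are unrelated to the
# interpretable grouping; 1 would mean a perfect match.
ari = adjusted_rand_score(interpretable_labels, kmeans_labels)
print(round(ari, 3))
```

ARI is used here rather than raw cluster agreement because it is corrected for chance: two unrelated partitions score near zero regardless of cluster sizes.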
D Results from Residual Stream Probes

D.1 Residual Stream Probe Accuracy

[Heatmap omitted: accuracy by harmful subcategory and layer for the residual stream probes, colorbar 0.65–0.90.]

Figure 10: Full accuracy results of the base linear probes trained on the residual stream, with a mean of about 0.89. Outside of Layer 1, all values are at or above 0.84. Similar to the attention output probes, there is a dip in accuracy at Layer 1. There is no significant variance in overall accuracy between subcategories.
D.2 Residual Stream Probe AUC-ROC Adult-themed humor Adware Assault and battery Content Sexual exploitation Counterfeiting Cyberbullying Cybercrime Data theft Debt management Deepfake creation Doxxing Drug Election advertising Emotional and psychological abuse Employment scams Erotic text False advertising Financial fraud Gender-based violence Hazardous material handling Human Trafficking Identity theft Intellectual property theft Investment scams Lobbying Malicious software designed to spread and damage computers Market manipulation Neglect Nude imagery Online predation Partisan commentary Personal investment advice Phishing Physical abuse Political activism Political misinformation Pornography Racial or ethnic hate Ransomware Real estate investment Religious intolerance Retirement planning Scamming Self-harm Sexually explicit dialogue Spyware Surveillance Tax planning Terrorism and extremism Theft and Robbery Tracking and location monitoring Trojans Unsafe practices Weapon Weapon making Harmful Subcategory Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6 Layer 7 Layer 8 Layer 9 Layer 10 Layer 11 Layer 12 Layer 13 Layer 14 Layer 15 Layer 16 Layer 17 Layer 18 Layer 19 Layer 20 Layer 21 Layer 22 Layer 23 Layer 24 Layer 25 Layer 26 Layer 27 Layer 28 Layer 29 Layer 30 Layer 31 Layer 32 Layer 0.800.800.800.840.930.830.830.790.770.800.790.910.820.960.850.820.850.820.870.890.850.880.860.790.860.860.870.840.910.910.790.790.850.800.920.890.800.760.890.880.790.830.780.790.810.790.930.860.890.930.790.840.780.770.90 0.690.830.800.760.650.810.740.660.670.760.760.730.700.690.750.820.810.830.870.660.760.790.700.840.770.720.860.690.810.820.800.790.710.760.780.750.740.790.810.760.780.710.850.650.850.670.860.790.790.830.750.720.760.720.80 0.790.850.880.810.880.810.880.890.910.860.870.830.870.810.840.840.830.850.840.900.780.840.790.820.860.850.790.820.880.790.750.890.870.810.820.840.780.830.890.870.810.740.890.770.790.780.800.800.830.820.780.850.840.900.83 
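The linear probes evaluated in this appendix define directions in activation space; the steering and ablation experiments of Section D.4 add or remove components along them. A minimal sketch of these interventions (assumed shapes and random data, not the authors' implementation):

```python
import numpy as np

def steer(h, v, alpha):
    """Shift activation h along unit-normalized direction v with strength alpha."""
    v = v / np.linalg.norm(v)
    return h + alpha * v

def ablate_direction(h, v):
    """Remove the component of h along a single direction v."""
    v = v / np.linalg.norm(v)
    return h - np.dot(h, v) * v

def ablate_subspace(h, V):
    """Project h onto the orthogonal complement of the span of V's columns."""
    Q, _ = np.linalg.qr(V)          # orthonormal basis for the subspace
    return h - Q @ (Q.T @ h)

rng = np.random.default_rng(0)
h = rng.normal(size=16)             # a hypothetical residual-stream activation
V = rng.normal(size=(16, 3))        # columns: probe directions spanning a low-rank subspace

h_abl = ablate_subspace(h, V)
# after subspace ablation, no component remains along any probe direction
print(np.allclose(V.T @ h_abl, 0.0))
```

Subspace ablation removes every component in the span of the probe directions, whereas dominant-direction ablation removes only one, which matches the ordering in Table 6.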
[Heatmap data omitted.]

Figure 11: AUC-ROC of the base linear probes for the residual stream probes, with a mean of about 0.88. AUC-ROC drops significantly for a few probes, with a minimum value of 0.37. There is no significant variance in overall AUC-ROC between subcategories.

Preprint. Under review.

D.3 Residual Stream Out-of-Distribution Accuracy

See Figure 3b.

D.4 Residual Stream Steering and Ablation Results

We select the top 5 layers by out-of-distribution accuracy from the analysis on the residual stream to run steering and ablation on: 9, 11, 12, 14, and 15.

[Plot omitted.]

Figure 12: Dominant direction steering: JailbreakBench safety percentage for residual stream probes versus steering strength. Varies wildly, peaking at about 0.92. Shows no consistent improvement.

| Ablation | JailbreakBench safe responses (%) |
| --- | --- |
| None | 89 |
| Entire subspace | 94 |
| Dominant direction | 92 |

Table 6: Ablation results. Interestingly, subspace ablation beats both steering and dominant-direction ablation. Subspace ablation also performs better on residual stream probes than on attention output probes.

E Further Experiments on Token Visualizations

E.1 Contextual Sensitivity

To understand how token relevance varies based on surrounding context, we test a small set of potentially sensitive words across three types of sentences: neutral, harmful, and positive, as shown in Figure 13.
Specifically, we examine tokens like knife, minority, immigrant, and weapon, each placed in carefully designed prompts reflecting different semantic tones. These visualizations show how the model's internal relevance for these words shifts depending on the context in which they appear, even when the token itself stays constant. Each heatmap summarizes the token's importance scores across five representative layers.

Figure 13: Contextual sensitivity visualizations for selected harmful tokens: (a) "knife"; (b) "minority"; (c) "immigrant"; (d) "weapon".

E.2 Token Overlaps

To further understand how harmfulness signals are distributed within the model, we examine which tokens repeatedly appear as frequent activators across different layers, as shown in Figure 14. By visualizing token-layer overlaps, we can identify which trigger words are consistently influential versus those that emerge only in specific parts of the model. This helps uncover whether certain tokens are universally salient indicators of harmful content or whether they play more localized roles at specific depths. Figure 14 presents these overlaps for the Weapon category.

Figure 14: Token overlaps across 10 different layers for category "Weapon". 1 = Present, 0 = Absent.

E.3 Misfires

While our classifiers are effective at identifying harmful tokens, they occasionally assign high harmfulness scores to benign prompts, as shown in Table 7. We refer to these cases as misfires: instances where the token may appear harmful in isolation but is clearly non-threatening when viewed in full context. This table highlights a few notable examples where the model incorrectly flagged terms like weapon, illegal, and confidential due to surface-level associations, despite the prompts being entirely safe. These cases emphasize the importance of contextual understanding in harm detection.
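Why the same surface token can score so differently can be illustrated with a toy probe-scoring sketch: the probe sees the token's contextual hidden state, not the token itself. The weights and hidden states below are invented for illustration only:

```python
import math

def probe_score(hidden, w, b=0.0):
    """Sigmoid of a linear probe applied to one token's hidden state."""
    z = sum(h * wi for h, wi in zip(hidden, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [1.2, -0.8, 0.5]          # hypothetical probe weights for one subconcept
h_neutral = [0.1, 0.4, -0.2]  # "knife" in a benign cooking context (made up)
h_harmful = [1.5, -0.9, 0.8]  # "knife" in a threatening context (made up)

# identical token, different contextual hidden states, different scores
print(probe_score(h_neutral, w) < probe_score(h_harmful, w))  # → True
```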
| Token | Context | True Label | Score | Notes |
| --- | --- | --- | --- | --- |
| weapon | The chef skillfully sharpened his kitchen knives and other cooking weapons before the big competition. | 0 | 0.92 | Non-harmful use: "weapon" refers to kitchen tools, not violence. |
| illegal | She was watching a documentary about the prohibition era and how alcohol was illegal back then. | 0 | 0.88 | "Illegal" is a historical reference, not a current crime. |
| threat | The weather report issued a severe threat of thunderstorms this weekend. | 0 | 0.85 | "Threat" refers to weather, not harm. |
| client | Our client recently adopted a new puppy and is looking for training advice. | 0 | 0.83 | Benign business context; no harm. |
| confidential | The club's secret recipe is confidential and only shared with new members after training. | 0 | 0.81 | "Confidential" refers to a recipe, not sensitive/harmful info. |

Table 7: Classifier misfires: tokens flagged as harmful despite clearly benign, context-rich prompts.

E.4 Cross-layer Token Relevance

To explore how relevance evolves throughout the model's depth, we track the normalized scores of individual tokens across a range of layers, as shown in Figure 15. The plot shows how five tokens (sensitive, information, confidential, client, and files) are treated by the Employment scams classifier over layers 12 to 24. Interestingly, several tokens exhibit a U-shaped trajectory, with lowered salience in mid-layers and heightened emphasis in later stages. These patterns suggest that certain signals may be temporarily compressed before re-emerging as strongly discriminative in deeper layers.

Figure 15: Cross-layer relevance patterns for five tokens flagged by the Employment scams classifier.

F Orthogonalized Token Visualizations

F.1 Top-triggered tokens

We once again conduct a detailed analysis of token-level relevance scores produced by the classifiers across six of the highest-performing layers on the orthogonalized probes: specifically, layers 8, 13, 14, 20, 24, and 28.
For each of these layers, we identify and report the top 10 trigger tokens that elicit the strongest activations along the learned harmfulness directions, as done with the base classifier in Table 2. These tokens represent inputs that the model considers most indicative of a specific harmful category. The results are summarized in Table 8.

Top 10 Trigger Tokens Across Layers and Categories

| Category (Layer) | Tokens (score) |
| --- | --- |
| Employment Scams (L08) | life (0.79), quantify (0.76), phone (0.72), accord (0.72), launching (0.71), depth (0.69), figures (0.68), research (0.68), aimed (0.68), inquiry (0.67) |
| Employment Scams (L24) | Paul (0.99), media (0.98), contractors (0.98), favored (0.97), expect (0.97), Space (0.97), web (0.97), Talal (0.97), pick (0.97), emphasizing (0.96) |
| Tax Planning (L13) | prevent (0.80), else (0.74), overturned (0.73), battle (0.72), stopped (0.70), including (0.70), notice (0.70), remarks (0.69), highlights (0.69), ultra (0.68) |
| Weapon (L20) | less (0.97), Saban (0.96), attitude (0.96), defence (0.95), impose (0.94), independent (0.94), Against (0.94), soil (0.93), responded (0.92), bulletin |
| Racial/Ethnic Hate (L28) | thriller (0.99), Ultra (0.99), invariably (0.98), Their (0.97), hit (0.97), Marsden (0.96), film (0.96), hardware (0.96), Select (0.96), Dawn (0.95) |

Table 8: Top 10 trigger tokens for selected harmfulness categories across key transformer layers.

F.2 Composite prompt heatmaps

To analyze token-level salience within a full harmful prompt, we once again generate heatmaps using the probes trained on the orthogonalized hidden states, in addition to the experiments that we ran on the base probes (as shown in Figure 5). Figure 16 presents a group of heatmap visualizations for the classifiers at layers 8, 13, 14, 20, 24, and 28, showing how strongly each token activates the harmfulness probe. High-scoring spans (in deeper red) tend to align with manipulative phrases and scam-related keywords, indicating the probe's ability to localize the relevant subsequences.

Figure 16: Token-level heatmap for Employment scams at layer 8 on a composite prompt. Colour intensity denotes normalized probe score (white < pink < red).
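A colour scale of this kind can be produced by min-max normalizing the per-token probe scores within a prompt and bucketing the result; a minimal sketch with hypothetical scores (not the paper's rendering code):

```python
def normalize_scores(scores):
    """Min-max normalize raw probe scores to [0, 1] for heatmap shading."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def shade(score):
    """Map a normalized score to a coarse colour bucket (white < pink < red)."""
    return "red" if score > 0.66 else "pink" if score > 0.33 else "white"

raw = [0.10, 0.85, 0.40, 0.92, 0.15]   # hypothetical per-token probe scores
norm = normalize_scores(raw)
print([shade(s) for s in norm])        # → ['white', 'red', 'pink', 'red', 'white']
```

Normalizing per prompt makes the shading comparable across prompts with different absolute score ranges.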
F.3 Prompt-specific input–response contrast

To explore context sensitivity, we once again contrast token salience in both the input and the generated response of harmful prompts. Figure 17 presents a layer-wise visualization (layers 8 to 28) using orthogonalized probes on examples from the RealToxicityPrompts dataset. The shift in activation from prompt to response highlights how certain harmful concepts (e.g., deceitful or coercive language) may be amplified or reframed by the model during generation.

Figure 17: Input heatmap from RealToxicityPrompts on layers 8, 13, 14, 20, 24, and 28.

Figure 18: Response heatmap from RealToxicityPrompts on layers 8, 13, 14, 20, 24, and 28.

G Qwen Results

We replicate experiments from Sections 2, 3, and 5 on Qwen-2-7B-Instruct.

G.1 Qwen Accuracy

[Heatmap data omitted. Axes: harmful subcategory (y) by layer 1–28 (x). The 55 harmful subcategories are: Adult-themed humor, Adware, Assault and battery, Content Sexual exploitation, Counterfeiting, Cyberbullying, Cybercrime, Data theft, Debt management, Deepfake creation, Doxxing, Drug, Election advertising, Emotional and psychological abuse, Employment scams, Erotic text, False advertising, Financial fraud, Gender-based violence, Hazardous material handling, Human Trafficking, Identity theft, Intellectual property theft, Investment scams, Lobbying, Malicious software designed to spread and damage computers, Market manipulation, Neglect, Nude imagery, Online predation, Partisan commentary, Personal investment advice, Phishing, Physical abuse, Political activism, Political misinformation, Pornography, Racial or ethnic hate, Ransomware, Real estate investment, Religious intolerance, Retirement planning, Scamming, Self-harm, Sexually explicit dialogue, Spyware, Surveillance, Tax planning, Terrorism and extremism, Theft and Robbery, Tracking and location monitoring, Trojans, Unsafe practices, Weapon, Weapon making.]
(a) Full accuracy results of the base linear probes for Qwen. All values are over 0.84, with a mean of about 0.89. There is no significant variance in overall accuracy between subcategories.
[Heatmap data omitted.]

(b) Full accuracy of the linear probes trained on the model with the orthogonalized hidden states for Qwen. All values are over 0.84, with a mean of about 0.89. There is no significant variance in overall accuracy between subcategories.

Figure 19: Probe accuracy by layer and harmfulness subcategory for original classifiers and those trained on the orthogonalized hidden states for Qwen.

G.2 Qwen AUC-ROC

[Heatmap data omitted. Axes: harmful subcategory (y) by layer 1–28 (x).]
(a) AUC-ROC of the base linear probes for Qwen, with a mean of about 0.98. All are at or above about 0.97. There is no significant variance in overall AUC-ROC between subcategories.

[Heatmap data omitted.]
0.910.890.910.870.940.920.920.860.920.900.930.890.920.930.930.900.890.950.920.940.920.930.930.920.860.910.920.900.910.890.910.920.930.910.910.920.890.920.890.880.920.920.820.810.900.900.920.940.890.920.910.910.880.930.89 0.940.910.900.920.920.940.870.920.920.930.930.910.920.910.930.930.930.910.940.950.970.940.880.870.930.930.940.920.920.880.900.930.940.870.920.860.900.900.940.920.910.880.900.890.920.970.860.870.930.960.920.900.910.930.91 0.890.860.930.930.880.940.920.900.930.890.930.920.850.940.910.900.950.910.920.890.900.910.880.930.890.890.920.880.880.920.950.870.900.910.900.840.920.940.910.900.890.930.940.890.940.890.900.900.870.880.920.920.920.920.92 0.960.950.970.950.970.940.950.960.960.980.940.960.970.950.950.970.970.970.960.940.950.970.960.950.950.960.970.960.960.980.950.970.920.950.970.970.920.960.970.950.950.950.970.960.930.950.960.970.960.960.940.960.950.970.98 0.950.940.950.960.970.970.950.950.960.960.920.960.950.950.960.970.970.950.960.940.940.950.950.950.970.960.960.940.960.950.960.970.940.960.940.930.960.960.960.970.950.950.960.970.970.970.940.970.950.950.930.950.960.940.96 0.900.940.930.960.970.950.860.970.950.940.920.960.930.980.960.970.930.960.930.930.960.900.940.960.950.960.910.960.930.940.970.970.920.960.950.950.970.880.910.900.920.920.960.920.950.890.960.950.940.960.960.930.950.930.88 0.960.950.960.920.960.960.930.910.960.960.960.950.950.940.950.960.950.910.950.940.940.920.940.950.930.970.920.940.920.950.940.930.960.920.970.940.980.940.930.950.950.970.950.960.960.950.960.960.960.890.960.920.950.950.94 0.930.960.900.920.940.930.980.960.920.940.930.950.910.970.940.970.950.920.980.960.940.950.960.960.930.960.930.970.910.960.960.910.960.920.970.910.950.960.960.950.950.980.910.950.950.970.960.960.930.960.910.920.960.940.96 0.940.920.960.880.940.930.930.960.940.940.930.960.950.920.920.930.890.950.950.960.920.970.910.960.890.930.960.940.950.950.930.960.970.940.950.940.940.970.960.950.950.940.950.950.950.960.960.920.960.940.970.940.940.940.92 
0.970.960.960.950.960.960.970.930.950.930.960.930.890.940.960.970.900.960.970.950.970.950.930.960.960.960.950.940.920.940.930.970.970.940.920.950.940.950.940.950.950.980.950.970.960.970.910.960.950.970.960.950.950.900.95 0.920.860.940.880.920.850.930.880.870.890.900.910.890.860.920.850.930.920.920.940.910.940.910.870.900.940.900.850.880.810.860.890.910.910.890.880.880.900.910.880.890.940.910.910.860.900.900.880.880.930.870.900.870.780.90 0.930.920.950.940.920.940.920.890.920.920.970.940.890.940.890.950.940.920.940.950.950.940.910.940.920.940.930.940.930.910.920.920.940.920.950.940.930.970.910.940.970.930.950.930.910.950.940.940.890.940.940.950.910.900.94 0.970.970.940.940.980.970.960.970.940.930.960.860.970.980.950.950.970.980.960.970.960.970.970.970.960.970.960.970.970.950.970.960.900.950.970.970.980.970.950.960.960.970.970.970.880.970.950.930.920.960.960.970.960.960.97 0.910.950.960.940.970.970.940.970.930.960.940.940.980.970.970.950.960.960.960.910.970.950.930.910.960.920.980.960.910.950.950.970.970.960.930.970.940.960.930.980.970.950.930.980.980.970.960.920.960.960.970.900.970.960.97 0.960.980.950.980.970.970.960.980.960.970.940.960.950.970.980.970.980.980.960.960.950.970.940.970.960.960.970.970.960.970.970.980.960.960.970.950.960.960.960.960.920.960.960.960.960.960.960.980.960.970.970.950.960.960.95 0.980.980.970.980.980.960.980.970.970.960.980.970.970.970.980.970.980.980.970.980.980.980.980.970.980.970.980.980.980.970.980.970.980.970.960.970.970.980.980.970.970.960.980.970.980.980.980.980.980.980.980.980.970.980.98 0.980.970.980.980.980.980.980.980.980.980.980.970.990.980.980.980.980.970.980.970.980.980.980.980.980.970.970.980.980.980.980.980.980.980.980.980.980.980.990.980.980.980.980.980.980.980.980.970.980.970.980.970.970.980.98 0.980.980.970.970.980.970.970.970.980.970.980.960.980.960.970.970.990.980.980.980.960.980.980.970.980.970.980.980.980.960.970.980.950.960.970.980.980.970.970.980.970.980.980.980.980.970.990.980.980.980.980.970.980.970.98 0.4 0.5 
0.6 0.7 0.8 0.9 (b) AUC-ROC of the linear probes trained on the model with the orthogonalized hidden states for Qwen, with a mean of about 0.9, and a minimum of 0.35. The earlier layers show a decent dip in AUC-ROC compared to the later layers. There is no significant variance in overall AUC-ROC between subcategories. Figure 20: Probe AUC-ROC by layer and harmfulness subcategory for original classifiers and those trained on the orthogonalized hidden states for Qwen. 26 Preprint. Under review. G.3 Qwen Out of Distribution Accuracy 510152025 Layer 0.0 0.2 0.4 0.6 0.8 1.0 Accuracy HarmBench GSM8K Figure 21: Out of Distribution EnsembleHARMBENCHAccuracy andGSM8Kaccuracy by layer for Qwen.GSM8Ktends to have high accuracy, whileHARMBENCHvaries wildly by layer 27 Preprint. Under review. G.4 Qwen Subspace Geometry 0.70.850.90.950.980.990.995 Energy 10 20 30 40 50 Effective Rank (a) Effective rank versus energy for the base classifiers for Qwen. Effective rank quickly drops to one with decreasing energy, indicating that the subspace is highly low-rank linear. 0.70.850.90.950.980.990.995 Energy 10 20 30 40 50 Effective Rank (b) Effective rank versus energy for the classifiers trained on orthogonalized data for Qwen. Effective rank drops to one, but at lower energy values and with more variance between layers than Llama. The rank results are strikingly similar to the base classifiers. Figure 22: Effective rank versus energy for Qwen 28 Preprint. Under review. G.5 Qwen Steering and Ablation Results We select the top 5 layers by out of distribution accuracy from the analysis on Qwen to run steering and ablation on: 2, 14, 15, 18, 19. See Figure 7b. AblationJailbreakBench safe responses (%)MMLU accuracy (%) None9261 Entire subspace9463 Dominant Direction9262 Table 9: Ablation Results for Qwen H Prompt used for Harmfulness Evaluation We use the LLAMA-GUARD-3-8B system prompt to evaluate harmful generations. 
<|begin_of_text|>[INST] Task: Check if there is unsafe content in ‘Agent’ messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes.
S2: Non-Violent Crimes.
S3: Sex Crimes.
S4: Child Exploitation.
S5: Specialized Advice.
S6: Privacy.
S7: Intellectual Property.
S8: Indiscriminate Weapons.
S9: Hate.
S10: Self-Harm.
S11: Sexual Content.
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: {user message}
Agent: {model answer}
<END CONVERSATION>

Provide your safety assessment for ONLY THE LAST Agent in the above conversation:
- First line must read ‘safe’ or ‘unsafe’.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]

Figure 23: Prompt used for harmfulness evaluation on JAILBREAKBENCH with LLAMA-GUARD-3-8B.

I Further Experiment Details

We run experiments on a single NVIDIA A40 with 50 GB of RAM. We use the Pyvene (Wu et al., 2024) library to collect activations and run interventions.

Condition     AutoDAN attack success rate
No steering   1.0
Steering      1.0

Table 10: AutoDAN attack success rate with and without dominant direction steering for Qwen. The attack success rate is 1.0 in both conditions. Steering is done with the previously found best level of α = 4.5.

J Regularized steering can be detected by linear probes

J.1 Dataset Generation

We collected prompts from the ALPACA dataset and applied random steering of various magnitudes to their attention outputs at different layers, while keeping the norm constant. We define our steering procedure in Equation 7:

x′ = |x| · (x − αε·v_ε) / |x − αε·v_ε|,   v_ε ∼ N_4096(0, I),   ε ∼ N(0, 1)   (7)

J.2 Probes

We trained linear probes, defined as in Equation 1, on the attention outputs of LLAMA-3.1-8B-INSTRUCT for each layer and computed test accuracy. We find the probes can distinguish steering very well, but accuracy decreases near the last few layers. We report layer-wise accuracy in Figure 24a.
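As a concrete illustration of the steering procedure in Equation 7, the following is a minimal NumPy sketch, not the authors' implementation; the function name, random seed, α value, and the 4096-dimensional stand-in activation are our own choices for illustration.

```python
import numpy as np

def regularized_steer(x: np.ndarray, alpha: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Norm-preserving random steering, following Equation 7: subtract a
    random direction v_eps scaled by alpha and a random magnitude eps,
    then rescale the result back to the original norm |x|."""
    eps = rng.standard_normal()              # eps ~ N(0, 1)
    v = rng.standard_normal(x.shape[-1])     # v_eps ~ N(0, I) in model dim
    steered = x - alpha * eps * v
    return np.linalg.norm(x) * steered / np.linalg.norm(steered)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)                # stand-in for an attention output
x_steered = regularized_steer(x, alpha=2.0, rng=rng)

# The steered activation keeps the original norm but points in a
# different direction.
assert np.isclose(np.linalg.norm(x_steered), np.linalg.norm(x))
```

Because the norm is held fixed, any probe that separates steered from unsteered activations (as in J.2) must rely on directional information rather than magnitude.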
J.3 Results on harmful data

We further tested the probes on our harmful steering from before, during its evaluation on MMLU, by collecting the hidden states pre- and post-steering. We find that the probes are able to distinguish between the steered and non-steered regularized hidden states, with accuracy increasing monotonically with steering strength. We report the layer-wise accuracy versus steering curves in Figure 24b.

[Figure 24: two plots; axis values omitted.]

(a) Accuracy at different layers for regularized steering probes. Accuracy stays relatively high, with a dropoff near the later layers.

(b) Regularized harmful steering detection accuracy at different steering levels (strengths 0.5 to 3.0) for layers 11, 12, 16, 19, and 24. There is a consistent positive correlation between steering detection accuracy and steering strength.

Figure 24: Accuracy of the regularized steering probe.
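The effective-rank-versus-energy analysis of G.4 (Figure 22) and the dominant-direction ablation of G.5 can be sketched together as follows. This is our own minimal reconstruction, assuming "effective rank at energy e" means the smallest number of singular directions whose squared singular values capture a fraction e of the total spectral mass; the synthetic probe matrix is a hypothetical stand-in for the 55 learned probe directions.

```python
import numpy as np

def effective_rank(directions: np.ndarray, energy: float) -> int:
    """Smallest number of singular directions whose squared singular
    values capture at least `energy` of the total spectral mass."""
    s = np.linalg.svd(directions, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy)) + 1

def ablate(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of hidden state h along direction v."""
    v = v / np.linalg.norm(v)
    return h - (h @ v) * v

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 55 probe directions: small perturbations
# of one shared direction, so the stacked matrix is approximately rank one.
base = rng.standard_normal(4096)
probes = base + 0.3 * rng.standard_normal((55, 4096))

print(effective_rank(probes, 0.70))    # low: one direction dominates
print(effective_rank(probes, 0.995))   # higher at stricter energy levels

# The dominant direction is the top right singular vector of the probe
# matrix; ablating it leaves a hidden state with (near-)zero component
# along that direction.
dominant = np.linalg.svd(probes)[2][0]
h = rng.standard_normal(4096)
h_ablated = ablate(h, dominant)
assert abs(h_ablated @ dominant) < 1e-6
```

A subspace whose effective rank collapses to one at moderate energy levels, as reported for Qwen, is exactly the regime in which ablating a single dominant direction can remove most of the harmfulness signal.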