Paper deep dive

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Jiaqi Weng, Han Zheng, Hanyu Zhang, Qinqin He, Jialing Tao, Hui Xue, Zhixuan Chu, Xiting Wang

Year: 2025 · Venue: arXiv preprint · Area: Mechanistic Interp. · Type: Empirical · Embeddings: 75

Models: Qwen2.5-3B-Instruct

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 5:24:01 PM

Summary

Safe-SAIL is a framework for interpreting Sparse Autoencoders (SAEs) within Large Language Models (LLMs) to enhance mechanistic understanding of safety-critical behaviors. It introduces methods for selecting optimal SAE configurations, efficient neuron interpretation via segment-level simulation, and a toolkit containing 2,059 safety-related neuron explanations across four domains: pornography, politics, violence, and terror.

Entities (4)

Qwen2.5-3B-Instruct · large-language-model · 100%

Safe-SAIL · framework · 100%

Sparse Autoencoders · model-architecture · 100%

TopKReLU · activation-function · 95%

Relation Signals (3)

Safe-SAIL → appliedto → Qwen2.5-3B-Instruct

confidence 100% · we release an SAE toolkit based on the intermediate layers of Qwen 2.5-3B-Instruct

Safe-SAIL → interprets → Sparse Autoencoders

confidence 100% · Safe-SAIL, a framework for interpreting SAE features within LLMs

Sparse Autoencoders → uses → TopKReLU

confidence 95% · We utilize Sparse Autoencoders (SAEs) that incorporate the TopKReLU activation function

Cypher Suggestions (2)

Find all frameworks and the models they interpret · confidence 90% · unvalidated

MATCH (f:Framework)-[:INTERPRETS]->(m:ModelArchitecture) RETURN f.name, m.name

Identify models and their associated activation functions · confidence 90% · unvalidated

MATCH (m:ModelArchitecture)-[:USES]->(a:ActivationFunction) RETURN m.name, a.name

Abstract

Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting their ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research to clarify model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications on SAEs do not interpret features with fine-grained safety-related concepts, thus inadequately addressing safety-critical behaviors, such as generating toxic responses and violating safety regulations. For rigorous safety analysis, we must extract a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet face two challenges: identifying SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies SAE with best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process. We will release a comprehensive toolkit including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks to promote research on LLM safety.

Full Text

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Warning: this paper contains data, prompts, and model outputs that are offensive in nature.

Jiaqi Weng (1), Han Zheng (2,4), Hanyu Zhang (1), Qinqin He (1), Jialing Tao (1), Hui Xue (1), Zhixuan Chu (2,4), Xiting Wang (3)

(1) Alibaba Group
(2) The State Key Laboratory of Blockchain and Data Security, Zhejiang University
(3) Renmin University of China
(4) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

Abstract

Increasing deployment of large language models (LLMs) in real-world applications raises significant safety concerns. Most existing safety research focuses on evaluating LLM outputs or specific safety tasks, limiting their ability to address broader, undefined risks. Sparse Autoencoders (SAEs) facilitate interpretability research to clarify model behavior by explaining single-meaning atomic features decomposed from entangled signals. However, prior applications on SAEs do not interpret features with fine-grained safety-related concepts, thus inadequately addressing safety-critical behaviors, such as generating toxic responses and violating safety regulations. For rigorous safety analysis, we must extract a rich and diverse set of safety-relevant features that effectively capture these high-risk behaviors, yet face two challenges: identifying SAEs with the greatest potential for generating safety concept-specific neurons, and the prohibitively high cost of detailed feature explanation. In this paper, we propose Safe-SAIL, a framework for interpreting SAE features within LLMs to advance mechanistic understanding in safety domains. Our approach systematically identifies SAE with best concept-specific interpretability, explains safety-related neurons, and introduces efficient strategies to scale up the interpretation process.
We will release a comprehensive toolkit including SAE checkpoints and human-readable neuron explanations, which supports empirical analysis of safety risks to promote research on LLM safety.

1. Introduction

Increasing deployment of large language models (LLMs) in critical applications raises significant safety concerns, including potential biases (Gallegos et al. 2024) and privacy breaches (Li et al. 2024). Previous studies have made great advances in safety-related areas from various perspectives, for instance by using classification models to measure the harmfulness of LLM outputs (Hanu and Unitary team 2020; Lees et al. 2022), adversarial attacks to identify vulnerabilities in LLMs (Schwinn et al. 2023), Chain-of-thought (CoT) monitoring to detect malicious behaviors in reasoning models (Baker et al. 2025), and latent space analysis for toxicity detection (Chacko et al. 2024; Xu et al. 2025).

However, these approaches often focus on observable behaviors or pre-defined tasks. For example, one might first define a safety-related concept, such as hate, and then detect hateful content.
Consequently, prior work is limited to specific tasks and cannot comprehensively address broader safety challenges. In contrast, we take an interpretability perspective by decomposing internal representations of LLMs, enabling us to obtain comprehensive safety concepts and identify undefined safety issues, such as cross-lingual alignment defects (see Section 4 for further discussion). We use Sparse Autoencoders (SAEs) (Bricken et al. 2023; Cunningham et al. 2023) as our foundational tool: they factorize the entangled internal signals of LLMs into a set of atomic features, without relying on supervision or pre-defined concepts. By interpreting the SAE features, we aim to uncover the underlying mechanisms that drive risk behaviors by providing a comprehensive, fine-grained safety-related SAE neuron database (Figure 1), which can be further used to diagnose, monitor, and potentially control undesired behaviors.

Figure 1: Overview of safety-related SAE neuron database. (Figure omitted; its labels cover the four domains Politics, Pornography, Terror, and Violence, with subcategories such as political figure, ideology, cult, sexual abuse, sexual health, insult, physical violence, cybersecurity, terrorism, extremist activities, and weapons.)
Nevertheless, a significant gap remains between training SAEs and providing human-aligned safety-related features, primarily arising from two aspects. First, since it is computationally infeasible to generate and compare free-text explanations for every SAE configuration, selecting SAEs with the optimal configuration becomes challenging; thus, we need SAE evaluations prior to explanation generation. Most prior works employing SAEs (Lieberum et al. 2024; He et al. 2024) primarily evaluate SAEs using heuristic metrics, such as probing accuracy. They often lack evaluation of the concept-specific interpretability of SAEs, that is, whether individual SAE features can differentiate nuanced concepts. This limitation makes it challenging to construct a diverse database in the safety domain. Second, generating human-readable explanations for SAE features and conducting evaluations (Bills et al. 2023; Choi et al. 2024; Paulo et al. 2024) require substantial resources. Although recent efforts in SAEs have released scalable SAE models, they typically provide explanations for only a small set of SAE features and often lack comprehensive, large-scale explanations and evaluations.

arXiv:2509.18127v2 [cs.LG] 24 Sep 2025

Figure 2: Overview of Safe-SAIL, which consists of three phases: SAE Training, Automated Interpretation, and Diagnose Toolkit. This framework trains sparse autoencoders with varying sparsity levels to select the most interpretable configuration, utilizes a large language model to explain neuron activations, and simulates query segments to calculate explanation confidence scores. Finally, the toolkit, including SAE checkpoints and a safety-tagged neuron database, is demonstrated through various case studies highlighting its applications in safety domains. (Figure omitted; its panels note that an SAE with an optimal sparsity constraint is selected for concept-specific interpretation, and that segment-level simulation maintains a high correlation with token-level simulation but reduces cost by 55%.)

To address this gap, we propose Safe-SAIL, a Sparse Autoencoder Interpretation Framework for LLMs in safety domains. Our framework covers the entire process, from SAE training to explanation generation and evaluation. Specifically, Safe-SAIL systematically selects the most effective SAEs to achieve the optimal diversity and interpretability.
We build a bridge between SAE configurations and the diversity of the neuron database in safety domains through new methods to evaluate the concept-specific interpretability of SAEs. This provides practical guidance for selecting the SAE that produces neurons with optimal quantity and quality. Additionally, our approach replaces the traditional token-level simulation method with a more efficient segment-level strategy. We split the query into multiple segments and ask a large reasoning model (LRM) to predict whether each segment is activated or not. This strategy reduces the simulation costs by 55% while maintaining satisfactory performance, making massive interpretation affordable. Hence, we finally generate human-understandable explanations for individual SAE neurons and provide comprehensive evaluations.

Moreover, we release an SAE toolkit based on the intermediate layers of Qwen 2.5-3B-Instruct (Qwen et al. 2025), which includes SAE checkpoints, explanations for individual safety-related SAE neurons, and evaluations covering 2,059 SAE neurons across four major safety subdomains: pornography, politics, violence, and terror (Figure 1). Our toolkit enables fine-grained analysis of internal safety mechanisms and facilitates monitoring of LLMs' risk behaviors, thereby supporting broader adoption and future research. Based on the toolkit, we conduct empirical analyses on pornographic concepts, demonstrating the potential of Safe-SAIL for risk identification in LLMs. Our investigation yields several intriguing findings, including insights into how LLMs encode specific real-world risk entities and handle safety-critical concepts related to sexually explicit content.

The main contributions of this work include:

• We propose Safe-SAIL, a framework for interpreting SAEs in safety domains. It offers a perspective by decomposing LLMs' internal representations to identify comprehensive and undefined safety features. To support further research and practical applications, we release an open-source toolkit based on Safe-SAIL, including 2,059 neurons across four major safety subdomains, and analytical utilities.
• We improve the efficiency of SAE interpretation by introducing two key strategies: a concept-specific interpretability evaluation method that enables optimal SAE model selection before explanation generation, and a segment-level simulation approach that significantly reduces computational overhead, making massive evaluation affordable.
• Based on our toolkit, we conduct empirical analyses on pornographic concepts, demonstrating the potential of Safe-SAIL for risk identification in LLMs. We also offer insights on how LLMs encode specific real-world risk entities and handle safety-critical concepts across layers.

2. Framework

In this section, we introduce the three components of Safe-SAIL, as illustrated in Figure 2. The first component is SAE training and evaluation, which focuses on training and selecting an SAE that produces the most interpretable features in safety domains. The second component is automated interpretation, including free-text explanation generation and evaluation of explanations, thus facilitating human understanding. Applying this framework, we obtain the third component, a new toolkit to carry out safety analysis.

2.1 Sparse Autoencoders

2.1.1 Training

We utilize Sparse Autoencoders (SAEs) that incorporate the TopKReLU activation function (Gao et al. 2024) to control sparsity levels in the encoded representation. Given an input signal $x \in \mathbb{R}^{D}$, typically derived from the output of Multilayer Perceptrons (MLPs) or Residual Streams, the TopKReLU activation function selects the top-k activated features during the encoding transformation. Details of SAE training and TopKReLU are in the Appendix.
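The training description above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation (their details are in the Appendix, which is not reproduced here): the encoder/decoder parameterization with a shared pre-bias follows a common TopK SAE formulation, and the names `topk_relu`, `sae_forward`, `W_enc`, `W_dec`, `b_enc`, and `b_dec` are hypothetical. The toy sizes stand in for the 2048-dimensional input and expansion factor of 10 reported in Section 3.1.1.

```python
import numpy as np


def topk_relu(pre_acts, k):
    """Keep the k largest pre-activations in each row, apply ReLU, zero the rest."""
    pre_acts = np.asarray(pre_acts, dtype=float)
    # Column indices of the k largest entries per row (unordered within the k).
    top_idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    top_vals = np.take_along_axis(pre_acts, top_idx, axis=-1)
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, top_idx, np.maximum(top_vals, 0.0), axis=-1)
    return out


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x into at most k active features, then reconstruct it.

    Assumed parameterization (hypothetical, not taken from the paper):
    z = TopKReLU((x - b_dec) @ W_enc + b_enc), x_hat = z @ W_dec + b_dec.
    """
    z = topk_relu((x - b_dec) @ W_enc + b_enc, k)  # sparse code: (batch, n_features)
    x_hat = z @ W_dec + b_dec                      # reconstruction: (batch, D)
    return z, x_hat


# Toy sizes for illustration only; weights are random, not trained.
rng = np.random.default_rng(0)
D, expansion, k = 16, 10, 4
n_features = D * expansion
W_enc = rng.normal(scale=0.1, size=(D, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, D))
b_enc = np.zeros(n_features)
b_dec = np.zeros(D)

x = rng.normal(size=(3, D))
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
l2_loss = np.mean((x - x_hat) ** 2)  # MSE between input and reconstruction
l0 = (z != 0).sum(axis=-1)           # active features per input, never more than k
print(l2_loss, l0)
```

The hyperparameter `k` caps the number of active features per input directly, which is why sparsity can be set without tuning a penalty weight.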
2.1.2 Enhanced Evaluation Metrics

To interpret neurons related to concepts in safety domains, our primary concern is whether the SAE has greater potential to differentiate and extract more nuanced atomic concepts within that domain. However, the time and computational costs associated with training the SAE and interpreting all its neurons can be exceedingly high. Therefore, we seek a metric that can predict the actual number of neurons related to safety concepts that will be effectively explained after completing the SAE interpretation. We construct evaluation data using a method called Concept Contrastive Query Pairs to illustrate the boundaries of the presence or absence of concepts. Additionally, we design two metrics, $L_{0,t}$ and $I_{CDF}$, to assess the differentiation of concepts among different SAEs.

Concept Contrastive Query Pairs

We prepare a dataset consisting of queries categorized under various safety domains. For each query related to a specific concept theme, we design prompts that instruct LLMs to generate a paired query that omits this particular concept while preserving the other linguistic elements as closely as possible.

Metrics

For each concept domain with $n$ pairs, we collect the delta frequency $\mathrm{freq}$ of each neuron, that is, how often it activates on the concept query but not on the paired de-concept query. $Q_{C,i}$ and $Q_{D,i}$ denote whether the neuron activates on the $i$-th concept query and on the corresponding de-concept query, respectively.

$$\mathrm{freq} = \frac{\sum_{i=0}^{n-1} Q_{C,i}\,(1 - Q_{D,i})}{n}, \qquad Q_{C,i},\, Q_{D,i} \in \{0, 1\} \tag{1}$$

For each concept theme, all neurons of an SAE can be represented first by a frequency distribution function and second by a cumulative frequency distribution function:

$$f(x) = P(\mathrm{freq} = x) \tag{2}$$

$$F(x) = P(\mathrm{freq} \le x) = \sum_{t \le x} f(t) \tag{3}$$

We describe the interpretability of an SAE from the following aspects.

• $L_{0,t}$ counts the absolute number of distinguishable neurons in a specific safety domain. The value varies by threshold, which can differ across domains; for this analysis, a threshold of 0.25 is employed, as it has been empirically observed to enable distinguishability between all SAEs compared. The sum runs over the neurons of the SAE:

$$L_{0,t} = \sum \mathbb{1}(\mathrm{freq} > t) \tag{4}$$

• $I_{CDF}$ represents the expected delta frequency over all neurons in the set, reflecting the overall distinguishability of the entire SAE in relation to a specific thematic concept; it allows for intuitive comparison by visualizing the area above the curve in the CDF plot.

$$I_{CDF} = E(\mathrm{freq}) = \int_{0}^{1} \bigl(1 - F(x)\bigr)\,dx \tag{5}$$

2.2 Efficient Neuron Interpretation

2.2.1 Safety-Neuron Filtering

The overall cost of interpretation is high, and the number of safety-related neurons is relatively small compared to the total. Therefore, we first employ a filtering method to obtain candidate neurons. Using the method of Concept Contrastive Query Pairs, we construct concept and de-concept pairs based on a more fine-grained classification of safety data. We observe the activation patterns of neurons across each subclass; a neuron that is associated with a specific safety concept should exhibit a noticeable difference in activation distribution between the concept and de-concept sets. However, considering that the concept scope corresponding to a given neuron may be narrower than our classification definitions, we should establish a low recall threshold and a high precision threshold when calculating precision and recall using the following expressions:

$$\mathrm{Precision} = \frac{\sum Q_C}{\sum Q_C + \sum Q_D}, \qquad \mathrm{Recall} = \frac{\sum Q_C}{n} \tag{6}$$

Precision refers to the ratio of activated concept queries to the total activated queries. Recall indicates the ratio of activated concept queries to the total concept queries.

2.2.2 Explanation

We adopt the standard practice for generating neuron explanations: neuron activations are generated through SAE inference on an explanation dataset rich in safety-related content.
The activation values are then quantized into distinct levels using linear interpolation. For each level, samples are selected to construct a prompt that instructs a large reasoning model (LRM) to generate a text explanation for the corresponding neuron.

2.2.3 Segment-level Simulation & Scoring

One of the most common methods for evaluating explanations is simulation. In traditional simulation, an LLM is used to predict the activations of each token in a query, given both the neuron explanation and the tokenized query. The simulation score, referred to as the CorrScore, is then calculated as the Pearson correlation coefficient between the simulated activations and the actual token activations after inference.

However, high-quality simulations require substantial computational resources. To optimize the simulation process, we first use the LRM to infer activation values at multiple positions in a single call, rather than predicting the activation for each token in separate forward passes (Bills et al. 2023). However, obtaining a reliable simulation score still requires sampling a substantial number of queries, which results in significant computational overhead. To address this, we split each query into several segments and instruct the LRM to predict whether each segment will be activated by the neuron. Larger segments lead to lower computation but poorer scoring performance.

Figure 3: Cumulative distribution frequency curve of SAEs trained with different settings.

2.3 Toolkit

With the safety neuron database constructed, we provide an interactive tool and a feature map that allow researchers to explore which safety-relevant neurons are activated by arbitrary inputs and query their semantic interpretations. The database also serves as the foundation for two key insights related to knowledge detection and interpretable inference trajectories. A detailed discussion can be found in Section 4.

3. Experiments

In this section, we investigate the impact of selecting optimal parameters within each stage of our proposed framework. Our primary goal is to demonstrate how these parameter choices lead to improved results and reduced costs.

3.1 SAE Configuration Selection

3.1.1 Settings

Activation Function. We select TopKReLU as the activation function because it allows easy control of the sparsity levels through the hyperparameter k. In our experiments, we chose k = 20, 200, 500, 1000, and 2000 (the values reported in Table 1).

Expansion Factor. The expansion factor is fixed at 10, which is based on previous work in SAEBench (Karvonen et al. 2025), where the SAE was evaluated with expansion factors of 4, 16, and 32. For input signals with 2048 dimensions, an expansion factor of 10 (20,480 SAE neurons) is a reasonable choice.

Figure 4: Neurons related to concept of adult content from neuron databases derived from different SAE checkpoints. The distribution illustration is based on distance between text embeddings of neuron explanations. (Figure omitted; title: "Neuron Distribution on Concept of Adult Content"; legend: K20MLP, K200MLP, K200Residual, K500MLP, K1000MLP; cluster labels: Tags and titles, URLs, Platform, Female reference.)
Location. We apply SAEs to two distinct structural components of layer 17: the MLP output and the post-MLP Residual Stream. Layer 17 is chosen on the consideration that middle-layer signals are more interpretable for high-level abstract concepts.

Data. To identify neurons associated with safety concepts in Qwen2.5-3B-Instruct, we intentionally selected potentially risky texts from our routine business traffic during the synthesis of the training data. Explanation data, kept separate from training data, consists of 200k queries: 25% risky content, 10% random white queries, and 65% sampled randomly from the public dataset The Pile (Gao et al. 2020). Evaluation data is constructed using the Concept Contrastive Query Pairs method and consists of 10,000 pairs across four safety domains: politics, pornography, violence, and terror.

Interpretability Metrics. We evaluate SAEs with existing interpretability metrics, including k-Sparse Probing (Gurnee et al. 2023) and 1d-Probe (Gao et al. 2024), and compare them with our own metrics on the evaluation dataset.

3.1.2 Results

The experimental results (Table 1) first reveal a relationship between sparsity and reconstruction quality: both the $L_2$ loss and $\delta L_{NTP}$ decrease as the sparsity level k increases (that is, as more features are kept active), which is consistent with the results of previous research.

From Table 1, it is evident that the TopKReLU200 configuration trained on the MLP output yields more concept-specific neurons than any other configuration (N = 1,160). Additionally, we analyzed the granularity of explanations, which is illustrated in Figure 4. The TopKReLU200 configuration shows greater coverage and a larger number of detailed classifications in the sensitive area of pornography compared to the others.

In terms of interpretability metrics, our proposed indicators demonstrate consistent trends across various safety domains (Figure 3), aligning more closely with the variability in neuron counts.
Table 1: Comparison of SAEs trained with different settings in terms of reconstruction and interpretability. We also explained neurons in these SAEs to construct a safety-related neuron database to illustrate how SAE configuration influences neuron explanation quantity and quality. Details of metrics are included in the appendix. Column groups: Reconstruction ($L_2$, $\delta L_{NTP}$), Interpretability ($L_{0,t=0.25}$, $I_{CDF}$), Neuron Database (N, CorrScore, SpScore).

Location  TopK  R_alive ↑  L_2 ↓   δL_NTP ↓  L_{0,t=0.25} ↑  I_CDF ↑  N ↑   CorrScore ↑  SpScore ↓
MLP       20    88.98%     0.0346  0.1241    130             0.0422   366   0.3670       1.3684
MLP       200   97.82%     0.0191  0.0693    406             0.1172   1160  0.2939       1.6660
Residual  200   96.02%     0.0858  1.0946    120             0.0402   505   0.3413       1.4955
MLP       500   92.16%     0.0125  0.0476    215             0.0428   775   0.3080       1.5028
MLP       1000  94.68%     0.0061  0.0197    25              0.0093   264   0.3780       1.2482
MLP       2000  94.27%     0.0004  0.0019    3               0.0097   0     -            -

R_alive: Percentage of neurons triggered during inference. L_2: MSE between SAE input and reconstructed output. δL_NTP: Difference in next token prediction loss. L_{0,t=0.25}: Number of neurons whose freq is larger than 0.25. I_CDF: Expected value of freq across all neurons in the SAE. N: Number of concept-specific neurons. CorrScore: Average correlation score of all safety-related neurons. SpScore: Average superposition score of all safety-related neurons.

Figure 5: Comparison of various interpretability metrics against ground truth across different sparsity levels $L_0$ and multiple safety domains. (a) Ground truth showing the number of concept-specific neurons. (b) Our proposed metrics, $I_{CDF}$ and $L_{0,t=0.25}$, demonstrating trends closely aligned with the ground truth. (c) 1d-Probe cross entropy loss varies across safety domains. (d) k-Sparse Probing performance (with k = 1, 3, 5, 20) depends largely on k.

It can be observed in Figure 5 that the effectiveness of k-Sparse Probing is significantly influenced by the choice of k, and the top-k mechanism focuses solely on the top neurons' contribution to semantic classification, which fails to capture the overall representation of the SAE.
Furthermore, the 1d-Probe's calculation of minimum cross-entropy loss reveals considerable instability: it depends heavily on the data and needs a large number of categories to yield effective results.

We find that concept-specific interpretability is characterized by a higher number of neurons and more detailed explanations. This suggests a divergence between the optimal sparsity for concept-specific interpretability and that for minimal feature interference. According to earlier studies (Gao et al. 2024), and as illustrated in Figure 6, the effect of feature interference diminishes as more features are included in the reconstruction of the signal, up to a point where the features become optimally orthogonal. Beyond this threshold, the effects of superposition begin to dominate (Ferrando et al. 2024). Importantly, the SAE achieves the best concept-specific interpretability at a sparser level than that needed for minimal feature interference.

Figure 6: Interference of feature vectors in the decoder weight matrix from SAEs trained on MLP with different sparsity levels ($L_0$ = 20, 200, 500, 1000, 2000). Feature interference is calculated as the average (Avg) and max (Max) of the average cosine similarity between all decoder vectors (n = 20480), that is, $\mathrm{Max} = \max_i \bigl( \tfrac{1}{n} \sum_{j \neq i} v_i^{\top} v_j \bigr)$ and $\mathrm{Avg} = \operatorname{avg}_i \bigl( \tfrac{1}{n} \sum_{j \neq i} v_i^{\top} v_j \bigr)$. The 2D visualization of $W^{\top} W$ with the sparsity level changing from 20 to 500 shows a lighter color as features become more orthogonal, and a reverse trend after 500 as the superposition effect dominates.

This is because concept-specific domains, such as safety domains, are small subspaces within the larger semantic space, where features typically span the subspaces of frequently occurring concepts. As features become less sparse and more orthogonal, the number of features allocated to safety subspaces decreases, resulting in lower clustering. This is reflected in fewer explained neurons and coarser granularity in the resulting explanations.
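The selection metrics compared throughout Section 3.1 (delta frequency, $L_{0,t}$, $I_{CDF}$) and the filtering precision and recall of Section 2.2.1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the tiny boolean activation matrices are hypothetical, and each matrix row is one concept/de-concept query pair while each column is one SAE neuron. Because freq lies in [0, 1], the area $\int_0^1 (1 - F(x))\,dx$ equals the mean of freq, so the sketch computes $I_{CDF}$ as a plain average.

```python
import numpy as np


def delta_freq(q_c, q_d):
    """Per-neuron share of pairs where the neuron fires on the concept query
    but not on its de-concept counterpart."""
    q_c = np.asarray(q_c, dtype=bool)
    q_d = np.asarray(q_d, dtype=bool)
    return (q_c & ~q_d).mean(axis=0)


def l0_t(freq, t=0.25):
    """Number of neurons whose delta frequency exceeds the threshold t."""
    return int((np.asarray(freq) > t).sum())


def i_cdf(freq):
    """Expected delta frequency over all neurons (area above the empirical CDF)."""
    return float(np.mean(freq))


def precision_recall(q_c, q_d):
    """Per-neuron filtering precision and recall."""
    q_c = np.asarray(q_c, dtype=bool)
    q_d = np.asarray(q_d, dtype=bool)
    hits_c = q_c.sum(axis=0)  # activated concept queries
    hits_d = q_d.sum(axis=0)  # activated de-concept queries
    total = hits_c + hits_d
    # Neurons that never fire get precision 0 instead of a division by zero.
    precision = np.divide(hits_c, total, out=np.zeros(total.shape), where=total > 0)
    recall = hits_c / q_c.shape[0]
    return precision, recall


# Hypothetical activations: 4 query pairs (rows) x 3 neurons (columns).
q_c = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0]]  # fires on the concept query?
q_d = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]]  # fires on the de-concept query?

freq = delta_freq(q_c, q_d)
n_distinguishable = l0_t(freq, t=0.25)
expected_freq = i_cdf(freq)
precision, recall = precision_recall(q_c, q_d)
print(freq, n_distinguishable, expected_freq, precision, recall)
```

Both summary metrics need only binary activations on the contrastive pairs, so they can be evaluated for every candidate SAE before any explanation is generated.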
3.2 Explainer Model Selection

The explainer model plays a crucial role in analyzing neuron activation samples, drawing conclusions about activation patterns, and ultimately producing human-interpretable explanations. The selection of the explainer model directly impacts the quality of the neuron database.

3.2.1 Settings

We compare explanations of neurons derived from all quartile layers (0, 8, 17, 26, 35) generated by different LRMs: QwQ-32B (Qwen Team 2025), DeepSeek-R1 (DeepSeek-AI et al. 2025), and Claude 3.7 Sonnet (Anthropic 2025). The accuracy of explanations is assessed in the simulation stage as the correlation score.

3.2.2 Results

Table 2 shows that DeepSeek-R1 outperforms the other models in terms of both the average correlation score and the percentage of correlation scores exceeding 0.2. According to subsequent experiments in the simulation section, neuron behaviors represented by explanations above this threshold are deemed interpretable by humans. The correlation score in this experiment is within a reasonable range comparable to previous work (Lieberum et al. 2024). Surprisingly, when QwQ-32B is tasked with interpreting code data activation samples, its responses exhibit significant confusion, characterized by the repetition of meaningless phrases, garbled output, and random responses, ultimately hindering its effectiveness in completing the task.

Table 2: Statistics of neuron explanations based on different explainer models. The average correlation score (Avg CorrScore), derived from simulations, is reported alongside the proportion of neuron explanations with correlation scores exceeding 0.2 ($R_{corr>0.2}$).

Explainer Model    Average CorrScore  R_corr>0.2
QwQ-32B            0.1855             43.11%
DeepSeek-R1        0.3251             80.46%
Claude 3.7 Sonnet  0.2857             71.93%

3.3 Segment-level Simulation Methods

3.3.1 Settings

In this section, we compare existing simulation methods.
We conduct experiments on layer 17 of Qwen2.5-3B-Instruct, with 1058 safe-related neurons, and use QwQ-32B for simulation. For every neuron, we sample 20 data from each activation bin of activations, if available. The methods we evaluate include: 1) All at once: present each token in a ‘token<tab>unknown’ format within a sin- gle prompt, and then examines the logits for the unknown tokens to calculate a predicted activation as the probabili- ties weighted sum over token 0 to 10; 2) Token-level simu- lation: present each token in a ‘token<tab>unknown’ for- mat, but the predicted activation is directly obtained from the LRM’s output; 3) Segment-level simulation: the original query is split into several segments, and the LRM is queried to determine whether each segment is activated or not. We also collect token-level human-labeled activations for randomly selected 200 neurons, which serve as the ground truth for simulation results. Metrics we use are the corre- lation coefficient (Pearson r and Kendall τ ) with human- labeled CorrScore. For computational cost, we report the av- erage token total length of generation, calculated as the sum of reasoning tokens and output tokens. Figure 7: Correlations between different methods and human-labeled results (top row), correlations between Segment-level and Token-level simulation (bottom left), and computational cost by generated token number (bot- tom right). Compared to token-level simulation, our method could reduce resource usage by 55% while maintaining de- cent performance (r = 0.8). 248163264 Number of Segment 0.75 0.80 0.85 Pearson r with Token-level Simulation Performance of Segment-level Simulation 248163264 Number of Segment 1000 1500 2000 2500 3000 Generated Token Number Efficiency of Segment-level Simulation Token-level Simulation Figure 8: Simulation performance and efficiency for dif- ferent segment numbers. 
The left figure shows Pearson’s r compared to token-level simulation, and the right displays the mean number of generated tokens. The orange dashed line represents the number of tokens generated by token-level simulation.

3.3.2 Results Comparisons against the human-labeled token-level results are shown in the top row of Figure 7. The bottom left figure shows a strong correlation (r = 0.8) between segment-level and token-level CorrScore. Although Segment-level simulation is a simplification of Token-level simulation, it still preserves considerable performance while reducing computational cost by roughly 55%. Figure 8 further reports the performance and efficiency of segment-level simulation, viewed as an approximation of token-level simulation, for different numbers of segments.

4. Insights

We present exploratory analyses demonstrating the utility of our SAE-based neuron interpretation database in uncovering the internal representation of safety-critical concepts in large language models. Focusing on pornography as a representative harmful category, we investigate how models encode specific real-world entities, such as known pornographic websites. These findings validate the framework’s effectiveness and enable concept-driven, semantically meaningful auditing of model knowledge.
[Figure 9 plot: axes labeled Porn Websites (20 URLs), Neuron Activation, and Confidence; legend: 26_18429, 17_1579, 17_4828, Confidence.]

Figure 9: Average activation values of three neurons across 20 porn websites, with empirical confidence scores derived from 50 inference runs per website.

4.1 Activation Patterns Imply Knowledge

We investigate whether LLMs encode latent knowledge of adult websites through SAE neuron activation patterns. Using the prompt “What is the main function of [web url]?”, we evaluated 20 verified pornographic URLs, recording neuron activations at the first few tokens after the URL input across 50 inference runs per URL. Average activations and empirical confidence scores (the proportion of runs in which the model categorizes the site as adult) were computed to assess model certainty.

Our results (Figure 9) reveal strong links between neuron activations and model behavior. Three neurons align closely with adult content detection: 26_18429 responds to semantic content (e.g., explicit URLs with 100% confidence), while 17_1579 and 17_4828 track syntactic patterns (e.g., domain structures), suggesting combined use of semantics and heuristics. Notably, onlyfans.com deviates: despite high confidence, these neurons show minimal activation. This suggests either (1) reliance on other, unobserved features, or (2) a weak internal association between OnlyFans and explicit adult content, revealing limitations in the model’s conceptual representation.
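The two per-URL quantities described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `summarize_url`, the input layout (one activation value and one boolean "categorized as adult" label per inference run), and the example numbers are all assumptions made for the sketch.

```python
from statistics import mean


def summarize_url(activations, adult_labels):
    """Summarize repeated inference runs for one URL (hypothetical helper).

    activations:  one neuron activation value per run (float).
    adult_labels: one bool per run, True if the model's answer
                  categorized the site as adult content.
    """
    if not activations or len(activations) != len(adult_labels):
        raise ValueError("need one activation and one label per run")
    return {
        # Average activation of the neuron across runs.
        "avg_activation": mean(activations),
        # Empirical confidence: proportion of adult categorizations.
        "confidence": sum(adult_labels) / len(adult_labels),
    }


# Illustrative values only (4 runs instead of the 50 used in the paper).
summary = summarize_url([0.1, 0.2, 0.3, 0.2], [True, True, False, True])
```

Comparing `avg_activation` with `confidence` across URLs is what exposes cases such as the one noted above, where confidence is high but the tracked neurons stay nearly silent.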
The findings reveal that 2-3 specific neurons capture critical aspects of the model’s decision-making process, with distinct roles in semantic vs. syntactic processing. Such neuronal signatures provide interpretable markers for understanding model cognition and predicting outputs in safety-related tasks.

4.2 Model Inference Trajectories

Large language models are often seen as black boxes, but our cross-layer neuron database enables fine-grained analysis of their internal representations. By tracing neuron activations across layers, we observe a clear progression: from local feature detection (e.g., keyword recognition) to structured reasoning (e.g., integrating semantics and context).

In a case study on child sexual abuse related input (Figure 10a), we identify a coherent processing pipeline: word-level detection (e.g., ’child’, ’sell’), semantic scene construction, activation of high-level safety concepts (e.g., transaction, sexual exploitation), and finally a safe refusal. The alignment between neuron semantics and model behavior shows that safety responses emerge from an interpretable, concept-driven reasoning chain rather than arbitrary outputs.

[Figure 10 panels: (a) Layer-wise activation chain for an English prompt (malicious input, safe response). (b) Layer-wise activation chain for a Hindi prompt (malicious input, harmful response). Both panels trace neurons from word-level features at layer 0 up to high-level concepts and behaviors at layer 35. Panel (b) annotates significant activation drops: 17_16378 0.6183→0; 17_20271 0.4199→0.3416; 17_1212 0.4704→0.1757; 26_3138 0.3283→0.1248; 26_7690 0.1875→0.0692; 35_12089 0.4129→0.0823; 35_8726 0.2035→0. The ‘tokens’ shown in the figure are combinations of real tokens.]

Figure 10: Differences in the neuron activation chains between an English prompt (a) and a Hindi prompt (b), revealing how internal model mechanisms contribute to language-specific safety vulnerabilities.

Furthermore, by examining neuron activation patterns across different languages, we gain insights into the underlying mechanisms that give rise to safety vulnerabilities when the model processes low-resource languages. As shown in Figure 10b, malicious Hindi inputs fail to trigger safe responses because the model lacks understanding of concepts such as child sexual abuse material and sexual exploitation, as is evident in the weak or absent activation of relevant neurons within the activation chain.

5.
Conclusion

We introduce a novel SAE interpretation framework that not only generates more granular safety neuron explanations but also reduces explanation costs by half. It offers an internal perspective and methodology to address problems in the field of LLM safety. Building on the toolkit produced by this framework, we further explore the risky behaviors of LLMs, yielding new insights into model cognition and reasoning trajectories. These findings enrich our understanding of LLM safety and lay the basis for future research in this area. We hope that by providing the toolkit including SAE checkpoints and a safety-tagged neuron database, our work will inspire greater interest in the field of LLM safety among researchers and equip scholars with new analytical tools.

References

Anthropic. 2025. Claude 3.7 Sonnet and Claude Code.

Baker, B.; Huizinga, J.; Gao, L.; Dou, Z.; Guan, M. Y.; Madry, A.; Zaremba, W.; Pachocki, J.; and Farhi, D. 2025. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. arXiv:2503.11926.

Bills, S.; Cammarata, N.; Mossing, D.; Tillman, H.; Gao, L.; Goh, G.; Sutskever, I.; Leike, J.; Wu, J.; and Saunders, W. 2023. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (Date accessed: 14.05.2023), 2.

Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; Lasenby, R.; Wu, Y.; Kravec, S.; Schiefer, N.; Maxwell, T.; Joseph, N.; Hatfield-Dodds, Z.; Tamkin, A.; Nguyen, K.; McLean, B.; Burke, J. E.; Hume, T.; Carter, S.; Henighan, T.; and Olah, C. 2023. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Bussmann, B.; Leask, P.; and Nanda, N. 2024. BatchTopK Sparse Autoencoders. arXiv:2412.06410.
Bussmann, B.; Nabeshima, N.; Karvonen, A.; and Nanda, N. 2025. Learning Multi-Level Features with Matryoshka Sparse Autoencoders. arXiv:2503.17547. Chacko, S. J.; Biswas, S.; Islam, C. M.; Liza, F. T.; and Liu, X. 2024. Adversarial Attacks on Large Language Models Using Regularized Relaxation. arXiv:2410.19160. Choi, D.; Huang, V.; Meng, K.; Johnson, D. D.; Steinhardt, J.; and Schwettmann, S. 2024. Scaling Automatic Neuron Description. https://transluce.org/neuron-descriptions. Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; and Sharkey, L. 2023. Sparse Autoencoders Find Highly Inter- pretable Features in Language Models. arXiv:2309.08600. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; Zhang, X.; Yu, X.; Wu, Y.; Wu, Z. F.; Gou, Z.; Shao, Z.; Li, Z.; Gao, Z.; Liu, A.; Xue, B.; Wang, B.; Wu, B.; Feng, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; Dai, D.; Chen, D.; Ji, D.; Li, E.; Lin, F.; Dai, F.; Luo, F.; Hao, G.; Chen, G.; Li, G.; Zhang, H.; Bao, H.; Xu, H.; Wang, H.; Ding, H.; Xin, H.; Gao, H.; Qu, H.; Li, H.; Guo, J.; Li, J.; Wang, J.; Chen, J.; Yuan, J.; Qiu, J.; Li, J.; Cai, J. L.; Ni, J.; Liang, J.; Chen, J.; Dong, K.; Hu, K.; Gao, K.; Guan, K.; Huang, K.; Yu, K.; Wang, L.; Zhang, L.; Zhao, L.; Wang, L.; Zhang, L.; Xu, L.; Xia, L.; Zhang, M.; Zhang, M.; Tang, M.; Li, M.; Wang, M.; Li, M.; Tian, N.; Huang, P.; Zhang, P.; Wang, Q.; Chen, Q.; Du, Q.; Ge, R.; Zhang, R.; Pan, R.; Wang, R.; Chen, R. J.; Jin, R. L.; Chen, R.; Lu, S.; Zhou, S.; Chen, S.; Ye, S.; Wang, S.; Yu, S.; Zhou, S.; Pan, S.; Li, S. S.; Zhou, S.; Wu, S.; Ye, S.; Yun, T.; Pei, T.; Sun, T.; Wang, T.; Zeng, W.; Zhao, W.; Liu, W.; Liang, W.; Gao, W.; Yu, W.; Zhang, W.; Xiao, W. L.; An, W.; Liu, X.; Wang, X.; Chen, X.; Nie, X.; Cheng, X.; Liu, X.; Xie, X.; Liu, X.; Yang, X.; Li, X.; Su, X.; Lin, X.; Li, X. Q.; Jin, X.; Shen, X.; Chen, X.; Sun, X.; Wang, X.; Song, X.; Zhou, X.; Wang, X.; Shan, X.; Li, Y. K.; Wang, Y. Q.; Wei, Y. 
X.; Zhang, Y.; Xu, Y.; Li, Y.; Zhao, Y.; Sun, Y.; Wang, Y.; Yu, Y.; Zhang, Y.; Shi, Y.; Xiong, Y.; He, Y.; Piao, Y.; Wang, Y.; Tan, Y.; Ma, Y.; Liu, Y.; Guo, Y.; Ou, Y.; Wang, Y.; Gong, Y.; Zou, Y.; He, Y.; Xiong, Y.; Luo, Y.; You, Y.; Liu, Y.; Zhou, Y.; Zhu, Y. X.; Xu, Y.; Huang, Y.; Li, Y.; Zheng, Y.; Zhu, Y.; Ma, Y.; Tang, Y.; Zha, Y.; Yan, Y.; Ren, Z. Z.; Ren, Z.; Sha, Z.; Fu, Z.; Xu, Z.; Xie, Z.; Zhang, Z.; Hao, Z.; Ma, Z.; Yan, Z.; Wu, Z.; Gu, Z.; Zhu, Z.; Liu, Z.; Li, Z.; Xie, Z.; Song, Z.; Pan, Z.; Huang, Z.; Xu, Z.; Zhang, Z.; and Zhang, Z. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. Ferrando, J.; Sarti, G.; Bisazza, A.; and Costa-juss ` a, M. R. 2024. A Primer on the Inner Workings of Transformer-based Language Models. arXiv:2405.00208. Gallegos, I. O.; Rossi, R. A.; Barrow, J.; Tanjim, M. M.; Kim, S.; Dernoncourt, F.; Yu, T.; Zhang, R.; and Ahmed, N. K. 2024. Bias and Fairness in Large Language Models: A Survey. arXiv:2309.00770. Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; Presser, S.; and Leahy, C. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv:2101.00027. Gao, L.; la Tour, T. D.; Tillman, H.; Goh, G.; Troll, R.; Rad- ford, A.; Sutskever, I.; Leike, J.; and Wu, J. 2024. Scal- ing and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093. Gurnee, W.; Nanda, N.; Pauly, M.; Harvey, K.; Troitskii, D.; and Bertsimas, D. 2023. Finding Neurons in a Haystack: Case Studies with Sparse Probing. arXiv:2305.01610. Hanu, L.; and Unitary team. 2020.Detoxify.Github. https://github.com/unitaryai/detoxify. He, Z.; Shu, W.; Ge, X.; Chen, L.; Wang, J.; Zhou, Y.; Liu, F.; Guo, Q.; Huang, X.; Wu, Z.; Jiang, Y.-G.; and Qiu, X. 2024. Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. arXiv:2410.20526. 
Karvonen, A.; Rager, C.; Lin, J.; Tigges, C.; Bloom, J.; Chanin, D.; Lau, Y.-T.; Farrell, E.; McDougall, C.; Ayonrinde, K.; Till, D.; Wearden, M.; Conmy, A.; Marks, S.; and Nanda, N. 2025. SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. arXiv:2503.09532.

Karvonen, A.; Wright, B.; Rager, C.; Angell, R.; Brinkmann, J.; Smith, L.; Verdun, C. M.; Bau, D.; and Marks, S. 2024. Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models. arXiv:2408.00113.

Lees, A.; Tran, V. Q.; Tay, Y.; Sorensen, J.; Gupta, J.; Metzler, D.; and Vasserman, L. 2022. A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. arXiv:2202.11176.

Li, H.; Chen, Y.; Luo, J.; Wang, J.; Peng, H.; Kang, Y.; Zhang, X.; Hu, Q.; Chan, C.; Xu, Z.; Hooi, B.; and Song, Y. 2024. Privacy in Large Language Models: Attacks, Defenses and Future Directions. arXiv:2310.10383.

Lieberum, T.; Rajamanoharan, S.; Conmy, A.; Smith, L.; Sonnerat, N.; Varma, V.; Kramár, J.; Dragan, A.; Shah, R.; and Nanda, N. 2024. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.

Paulo, G.; Mallen, A.; Juang, C.; and Belrose, N. 2024. Automatically Interpreting Millions of Features in Large Language Models. arXiv:2410.13928.

Qwen; Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; Lin, H.; Yang, J.; Tu, J.; Zhang, J.; Yang, J.; Yang, J.; Zhou, J.; Lin, J.; Dang, K.; Lu, K.; Bao, K.; Yang, K.; Yu, L.; Li, M.; Xue, M.; Zhang, P.; Zhu, Q.; Men, R.; Lin, R.; Li, T.; Tang, T.; Xia, T.; Ren, X.; Ren, X.; Fan, Y.; Su, Y.; Zhang, Y.; Wan, Y.; Liu, Y.; Cui, Z.; Zhang, Z.; and Qiu, Z. 2025. Qwen2.5 Technical Report. arXiv:2412.15115.

Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning.

Rajamanoharan, S.; Lieberum, T.; Sonnerat, N.; Conmy, A.; Varma, V.; Kramár, J.; and Nanda, N. 2024.
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv:2407.14435.

Schwinn, L.; Dobre, D.; Günnemann, S.; and Gidel, G. 2023. Adversarial Attacks and Defenses in Large Language Models: Old and New Threats. arXiv:2310.19737.

Xu, Z.; Huang, R.; Chen, C.; and Wang, X. 2025. Uncovering safety risks of large language models through concept activation vector. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24. Red Hook, NY, USA: Curran Associates Inc. ISBN 9798331314385.

Appendix

A. Related Work

Sparse autoencoders Sparse autoencoders are designed to transform an input signal, typically taken from the MLP output or the residual stream output, into a higher-dimensional representation. After a non-linear activation function, the encoded features are decoded back to reconstruct the input. Previous work on Sparse Autoencoders (SAEs) has explored various approaches to balance reconstruction accuracy and feature sparsity (Rajamanoharan et al. 2024; Gao et al. 2024; Bussmann, Leask, and Nanda 2024; Karvonen et al. 2024; Bussmann et al. 2025). The vanilla SAE architecture typically employs ReLU as the activation function and uses L1 regularization (the sum of all activated features) as the sparsity loss. However, this approach often results in a severe shrinkage effect on all features. Later works have focused on modifying either the activation function or the sparsity expression. TopKReLU (Gao et al. 2024) alters the activation function by selecting only the top-k features for signal reconstruction, making the sparsity level fixed. JumpReLU (Rajamanoharan et al. 2024) divides the activation function into two gated routes and penalizes only the binarized results on one route while preserving the feature magnitude on the other.
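To make the top-k mechanism concrete, the sketch below runs a single forward pass of a top-k sparse autoencoder in NumPy. It is a simplified illustration under stated assumptions, not the implementation of Gao et al. (2024) or of this paper: the function name `topk_sae_forward`, the parameter names, and the choice to apply ReLU before the top-k selection are all assumptions of this sketch.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a top-k SAE (illustrative sketch).

    x:     input signal, shape (D,)
    W_enc: encoder weights, shape (L, D); b_enc: shape (L,)
    W_dec: decoder weights, shape (D, L); b_dec: shape (D,)
    k:     number of features allowed to stay active
    """
    # Encode into the higher-dimensional feature space and apply ReLU.
    pre = np.maximum(W_enc @ x + b_enc, 0.0)
    # Keep only the k largest activations; every other feature is zeroed,
    # so the sparsity level is fixed by construction.
    z = np.zeros_like(pre)
    k = min(k, pre.shape[0])
    if k > 0:
        top = np.argpartition(pre, -k)[-k:]
        z[top] = pre[top]
    # Decode the sparse features back to the input space.
    x_hat = W_dec @ z + b_dec
    return z, x_hat
```

Because at most k entries of `z` are non-zero, sparsity does not need an L1 penalty, which is why the shrinkage effect of the vanilla formulation does not arise from the loss in this setup.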
While these methods focus on minimizing reconstruction loss at a certain level of sparsity, they seldom investigate the ultimate effect that reconstruction quality and feature sparsity have on the actual interpretability of the learned features.

LLM scopes Recent studies have expanded the application of Sparse Autoencoders (SAEs) to various layers of large language models, providing comprehensive insights into their internal representations. For instance, Gemma Scope (Lieberum et al. 2024) applied SAE training to the Attention, MLP, and Residual layers of both Gemma-2B and Gemma-9B models. This approach allowed for a detailed examination of feature representations across different model components and scales. Similarly, Llama Scope (He et al. 2024) extended this methodology to the entire layer structure of the Llama3.1-8B-Instruct model, offering a holistic view of the model’s internal mechanisms. While these works have significantly contributed to our understanding of English-centric models, there remains a gap in the analysis of influential models in other linguistic contexts. Furthermore, both studies stopped at training SAEs and did not provide a neuron explanation database. Our work addresses this gap by focusing on Qwen2.5-3B-Instruct, a prominent model in the Chinese language domain. By applying SAE training and safety-alignment-related neuron explanation to this model, we aim to provide valuable insights into the internal representations of the Qwen series, which has had a substantial impact on Chinese natural language processing tasks.

Interpretation pipelines Recent advancements in interpreting neural activations as human-readable concepts have made significant contributions to the field of model interpretability. OpenAI’s work (Bills et al. 2023) on GPT-2 stands out, where they interpreted neurons across all MLP layers. They proposed a three-stage process: explanation, simulation, and scoring.
Notably, they optimized the sim- ulation stage by replacing individual token prediction with a single forward pass to observe logits at predicted token positions. Similarly, the Transluce project(Choi et al. 2024) adopted a comparable interpretation workflow but innovated by distilling an explainer model to replace the costly GPT-4 calls. Our approach builds upon these foundations while in- troducing several novel contributions. Firstly, we extend the interpretation to features learned by Sparse Autoencoders (SAEs), a previously unexplored domain in neuron interpre- tation. Secondly, our comparative analysis reveals that large reasoning models (LRMs) can support activation predictions for multiple positions in a single inference, significantly re- ducing interpretation costs. Lastly, we optimize the simula- tion step by shifting from single-token activation prediction to predicting activations over text segments, resulting in a remarkable 55% reduction in simulation costs. These inno- vations not only enhance the efficiency of the interpretation process but also broaden its applicability to more complex model architectures. B. Framework Details B.1 Contrastive Concept Query Pairs Interpretation is to align neuron’s activation pattern with human’s definition of a concept. To depict the boundary of a concept in human Concept Level0 Concept Level1 Concept QueryDe-Concept Query pornographyrevealingSkinny girl; solid color background; revealingSkinny girl; solid color background; modest attire pornographysexual behaviors I impulsively pulled out my penis and inserted it into her wet vagina. Lulu moaned in pain as her body twisted. I continued penetrating deeper, and she begged not to tell her boyfriend, afraid of losing control. I promised not to tell, but demanded she cooperate. I suddenly reached into my pocket and placed my hand inside her warm sleeve. She gasped in surprise as her arm moved involuntarily. 
I kept holding tighter, and she whispered she couldn’t let anyone know about this secret meeting. I assured her I wouldn’t reveal it, but asked her to stay quiet. violencecriminal offense An armed attack occurred at an foot spa in Anda, where a man was shot dead by police. According to reports, a man used a knife to stab two massage workers at a foot spa before taking a hostage. Police were forced to shoot him dead. The case is under investigation. An incident occurred at a foot spa in Anda, where a man was injured by police. According to reports, a man engaged in physical altercation with two massage workers at a foot spa before taking a hostage. Police were forced to intervene. The case is under investigation. violenceinsultWhat the fuck is wrong with your stupid ass?What is the problem with your behavior? politicscultA devoted Christian from the Almighty God Church, persecuted for evangelism, shows steadfast faith. As this person, post a medium-difficulty NCL function-level code explanation question on Stack Overflow. Keep it clear, concise, and relevant. Start with ”Question:”, provide a code snippet, and ask about its purpose and logic. Avoid APIs, external data, or docs. A devoted Baptist Christian, persecuted for evangelism, shows steadfast faith. As this person, post a medium-difficulty NCL function-level code explanation question on Stack Overflow. Keep it clear, concise, and relevant. Start with ”Question:”, provide a code snippet, and ask about its purpose and logic. Avoid APIs, external data, or docs. terrorextremist activities Terror Attack is a shooting mini-game. Mission Briefing: Terrorist organizations have taken over an abandoned military camp and occupied all guard posts. The Indian military cannot approach the camp. Shoot quickly, or the enemy will attack. Terror Attack is a shooting mini-game. Mission Briefing: Enemy forces have taken over an abandoned military camp and occupied all guard posts. The Indian military cannot approach the camp. 
Shoot quickly, or the enemy will attack.

Table 3: Examples of two-level concept and query pairs.

understanding, a pair of queries is constructed by instructing large language models to modify an original query into a de-concept one. The prompt used for this instruction includes a two-level safety concept definition, as follows:

System Prompt: You will be provided with a [Query] that includes certain [concept]. First, reflect on why the given sentence incorporates the specified concept. Then, generate a new sentence that avoids mentioning this concept and preferably omits all listed concepts, while remaining as close as possible to the original [Query] in meaning, phrasing, and structure. Every concept present in the original sentence should also appear in the revised one, and vice versa, except for the concept under consideration. Follow the format below and output only the revised query without any additional text: '''text [your modified query] '''

User Prompt: [Query]: [prompt] [concept]: [level0] - [level1]

B.2 Concept-specific Metric

$L_{0,t}$ This metric quantifies the absolute number of neurons highly associated with a specific concept. The result depends on the chosen threshold, but this dependence does not imply that the metric is unstable; rather, it allows flexible adjustment based on the research context. For instance, when researchers aim to identify the core neurons that are most closely related to the target concept, or wish to narrow the selection of neurons, the threshold can be raised accordingly.

$I_{CDF}$ As illustrated in Figure 11, $I_{CDF}$ represents the area of the shaded region. The safety domains are small and sparse in the semantic space, which means that only a very small proportion of neurons are related to safety concepts. This results in a convex CDF curve.
For convex CDF curves, a larger area of the shaded region indicates a greater number of neurons concentrated in the high-frequency segment, suggesting a greater potential to generate neurons related to the target concept.

[Figure 11 plot: axis labels freq and Neuron Percentage (0 to 1).]

Figure 11: Illustration of $I_{CDF}$ as the shaded area in the curve.

B.3 Safety-related Neuron Filtering After SAE training, we aim to efficiently identify neurons related to safety. We achieve this by filtering neurons using precision-recall thresholds on a comprehensive risk benchmark comprising more than 70 categories. In this context, a neuron typically represents a specific sub-concept within the broader theme concept, and is therefore characterized by high precision and low recall. Consequently, we set the precision threshold at 0.75 and the recall threshold at 0.2.

C. Experiment Supplement

C.1 Detailed explanation of metrics

$R_{alive}$ This metric is the ratio of neurons that are activated during inference on the explanation dataset. Neurons that are never activated are considered ‘dead’. A higher $R_{alive}$ indicates higher training effectiveness.

$L_2$ This represents the mean squared error between the original input signal and the SAE-reconstructed signal during inference on the explanation dataset, which directly indicates the reconstruction quality.

$\delta L_{NTP}$ This metric evaluates reconstruction quality from the perspective of its impact on next-token prediction. Specifically, it calculates the next-token prediction loss (NTP loss) before and after replacing the original signal with the SAE-reconstructed version. A higher-quality reconstruction would exhibit an NTP loss similar to the original. To quantify this, we evaluate $\delta L_{NTP}$ by prompting the source model (Qwen2.5-3B-Instruct) with queries from the explanation dataset and calculating the difference in NTP loss on the response tokens.

CorrScore Correlation score is evaluated in the simulation stage of this framework.
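As a rough illustration of how such a score and its summary statistics might be computed, the sketch below assumes that CorrScore is the Pearson correlation between simulated and actual activations of a neuron, in the style of Bills et al. (2023); the source does not spell out the formula here, so that reading, the function names, and the convention of returning 0.0 for constant inputs are all assumptions of this sketch.

```python
import math


def pearson(simulated, actual):
    """Pearson correlation between simulated and actual activations."""
    n = len(simulated)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        # Convention chosen for this sketch: a constant series carries
        # no information, so it is scored as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_s * var_a)


def summarize_scores(scores, threshold=0.2):
    """Average score and share of neurons above the threshold."""
    avg = sum(scores) / len(scores)
    share = sum(1 for s in scores if s > threshold) / len(scores)
    return avg, share
```

The default threshold of 0.2 mirrors the cut-off used when reporting the proportion of interpretable explanations.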
In the experimental results, we show the average correlation score of all safety-related neurons. We also show a detailed distribution in Figure 12.

Figure 12: Distribution of correlation score of SAE configurations.

SpScore Superposition score measures how polysemantic the neuron explanation is by instructing a large language model to give a score from 0 to 10. The prompt used is as follows:

System Prompt: You are a highly capable AI assistant, and your task is to assign a superposition score between 0 and 10 based on the provided neuron explanation. Superposition: A neuron’s explanation may contain multiple similar or entirely unrelated concepts. The more low-relevance concepts present in the neuron’s explanation, the higher the superposition score. If the neuron explanation focuses on only a single concept, or contains closely related sub-concepts within a broader conceptual framework, the superposition score will be close to 0. Your response should follow the following format: '''json "score": score''' Here are some examples:

[[Case1]] [User Prompt]: text verbs or phrases indicating the addition or incorporation of components into a mixture/process, particularly in procedural contexts (e.g., "add", "put into", "pour in", "fill", "combining", "stick into"). This includes both literal ingredient additions and metaphorical additions to systems/structures.
[Assistant]: '''json "score": 1'''

[[Case2]] [User Prompt]: phrases indicating physical collapse, medical emergencies, or critical failures, particularly focusing on: - Sudden bodily collapse ("fall to ground", "death", "cardiac arrest") - System/process failures (dropout, cfg file errors, relapse) - Dangerous physical events ("self-immolation", "gasoline", "fall") - Failure-related technical terms (check failure, rate errors) - Institutional collapse metaphors ("fallen officials") The neuron strongly activates on vocabulary combining physical gravity with irreversible negative outcomes, spanning both literal human collapse and metaphorical system failures.

[Assistant]: '''json "score": 3'''

We also find that CorrScore tends to increase as SpScore decreases. A concept can be represented as a semantic direction to which a set of neurons collectively contributes. When a neuron contributes to multiple semantic directions, its projection onto any single direction is diminished, which reduces its correlation with a specific conceptual direction.

C.2 Simulation Prompts The prompt we use in Token-level Simulation is:

System Prompt: We’re studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and try to predict its activations on a particular token. The activation format is token tab activation, and activations range from 0 to 10. Most activations will be 0. Output predictions of activation as a list of tuples.

User Prompt: [Neuron Explanation]: [SAE neuron explanation] [Activations]: [list of (token, unknown)]

The prompt we use in Segment-level Simulation is:

System Prompt: "We’re studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at an explanation of what the neuron does, and identify which parts of a sentence will activate this neuron.
You’ll be given an explanation of the neuron and a sentence divided into several segments; your task is to identify whether each segment will activate this neuron, using the format “Segment 1: activate”, “Segment 1: non-activate”. Adhere to this format without adding any further information. If you’re not confident, please still provide your best guess."

User Prompt: [Neuron Explanation]: [SAE neuron explanation] [Sentence]: [list of ‘segment content’]

D. Discussion

D.1 Correlation Score and Superposition Score Change with Sparsity Level

In human cognition, we tend to define concepts as relatively isolated entities. However, in large language models, semantic concepts are represented as continuous signals in hidden layers, without clear boundaries. The essence of neuron explanation is to accurately interpret the human-readable aspects of these neuron activation patterns. Within these large language models, many neurons are simultaneously activated to contribute to the hidden state signals. Yet the degree to which each neuron’s activation pattern can be interpreted by humans varies. Consequently, for any specific semantic concept, we can observe: 1) Neurons whose behaviors can be largely interpreted and associated with the concept will have high correlation scores and low superposition scores. 2) Neurons whose contributions are only partially comprehensible will have low correlation scores and high superposition scores.

When $L_0$ is small, feature interference is high, and the quota for semantic representation is limited in top-k selection settings. Features tend to cluster around a few main directions. As $L_0$ increases, an increasing number of neurons participate in semantic expression, revealing a richer representation of concept-related neurons in both quantity and explanatory detail.
As the feature vectors become less clustered, more neurons exhibit activation patterns that can only be partially associated with the concept, leading to an increase in the average superposition score. The optimal point for concept-specific interpretability (defined as the $L_0$ that generates the most concept-related features) occurs before the point of minimal feature interference. This is primarily due to the nature of safety domains, which constitute a small subspace with infrequently appearing concepts. When features become fully orthogonal, few neurons are allocated to represent these specific concepts.

Beyond this fully orthogonal point, features increasingly interfere with each other and the superposition effect dominates. Within the safety subspace, the feature distribution becomes more dispersed. Consequently, many neurons begin to contribute to multiple semantic concepts simultaneously, resulting in activation patterns that become increasingly challenging for humans to interpret. Only a very limited number of neurons that capture the general essence of the concept survive the filtering stage, maintaining a relatively high average correlation score and a lower superposition score.

In conclusion, the process of neuron interpretation is fundamentally grounded in human perception. Thus, there exists an optimal point of sparsity that aligns closely with human understanding, suggesting that there is a balance to be struck for optimal concept-specific interpretability.

Figure 13: Illustration of decoder weights $W^T W$.

D.2 Toy Model Visualization

Settings. We abstract a toy scenario to further validate the above analysis. First, we define a direction vector $\vec{v}_s \in \mathbb{R}^D$ to represent safety domain concepts in the semantic space. As concepts are embedded in various semantic contexts, these contexts are represented by the concept vector scaled by a constant scalar:

$$S_{safety} = \langle a_0 \vec{v}_s, a_1 \vec{v}_s, \ldots, a_{n-1} \vec{v}_s \rangle \tag{7}$$

$$a_i \neq 0 \tag{8}$$

[Figure 14 plot annotations, partially recoverable: Max and Avg of a per-neuron decoder-interference term, and $g(k) = f(X_{k,s}, X_{k,r})$.]

Figure 14: The change in the number of distinguishable neurons $g(k)$ with sparsity $k$. It shows that the optimal point for max $g(k)$ arrives before the point of least feature interference.

Then we train Sparse Autoencoders with a fixed middle-layer length $L$ and different sparsity levels $k$ to reconstruct random semantic vectors in this space. The training loss is:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 \tag{9}$$

To simulate that the safety domain is a small subspace and that safety-related concepts appear infrequently, we apply a small coefficient to the reconstruction loss on data from $S_{safety}$:

$$\mathcal{L} = 0.1 \lVert x_s - \hat{x}_s \rVert_2^2 \tag{10}$$

Assume any semantic vector can be reconstructed by decoding the SAE's learned features $\vec{x}_k \in \mathbb{R}^L$, including the safety domain concept $\vec{v}_s$ and a random vector $\vec{v}_r$:

$$a_i \vec{v}_s = W_k \vec{x}_{k,i} + b_k \tag{11}$$

$$c_j \vec{v}_r = W_k \vec{x}_{k,j} + b_k \tag{12}$$

Define a function $f(X_{k,s}, X_{k,r})$ to summarize the safety-related neuron activation patterns by counting the neurons in the vector that only activate in $S_{safety}$:

$$X_{k,s} = \sum_{i}^{n-1} \mathbb{1}(x_{k,i} > 0) \tag{13}$$

$$X_{k,r} = \sum_{j}^{n-1} \mathbb{1}(x_{k,j} > 0) \tag{14}$$

$$f(X_{k,s}, X_{k,r}) = \sum_{r}^{L-1} \left( X_{k,s}\langle r \rangle \oplus X_{k,r}\langle r \rangle \right) \tag{15}$$

The final objective $g(k)$ is to find the sparsity $k$ that yields the largest number of neurons displaying distinguishable patterns between the two concept sets:

$$g(k) = f(X_{k,s}, X_{k,r}) \tag{16}$$

$$k = \arg\max_k g(k) \tag{17}$$

Figure 15: Average activation values of three neurons across (a) 20 fake porn websites and (b) 20 normal websites, with empirical confidence scores derived from 50 inference runs per website.

Results. We set $D = 20$ and $L = 40$, sweeping $k$ from 0 to 20 to observe the change in feature interference and in the number of neurons that are safety-domain distinguishable, i.e., only activated when reconstructing data from $S_{safety}$.
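The counting step in Eqs. 13 to 16 can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the authors' implementation: the XOR in Eq. 15 is read here as acting on "fired at least once" indicators per neuron, the helper names are hypothetical, and the activation matrices are made-up toy values rather than outputs of a trained SAE.

```python
import numpy as np

def activation_counts(acts: np.ndarray) -> np.ndarray:
    """Eqs. 13-14: for each neuron, count the samples with a positive activation.

    acts has shape (num_samples, L), one row per encoded input.
    """
    return (acts > 0).sum(axis=0)

def f_distinguishable(x_s: np.ndarray, x_r: np.ndarray) -> int:
    """Eq. 15 (assumed reading): neurons that fire on exactly one of the two sets."""
    return int(np.logical_xor(x_s > 0, x_r > 0).sum())

def safety_only(x_s: np.ndarray, x_r: np.ndarray) -> int:
    """Stricter variant matching the prose: fires on S_safety and never on random data."""
    return int(((x_s > 0) & (x_r == 0)).sum())

# Made-up SAE activations for L = 4 neurons (illustration only).
acts_safety = np.array([[0.9, 0.0, 0.2, 0.0],
                        [0.7, 0.0, 0.0, 0.0]])
acts_random = np.array([[0.0, 0.0, 0.5, 0.3],
                        [0.0, 0.0, 0.1, 0.0]])

x_s = activation_counts(acts_safety)   # per-neuron counts on safety inputs
x_r = activation_counts(acts_random)   # per-neuron counts on random inputs
g_k = f_distinguishable(x_s, x_r)      # Eq. 16 for one sparsity level k
print(g_k, safety_only(x_s, x_r))
```

Repeating this for each trained sparsity level and taking the k with the largest count corresponds to Eq. 17.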
So that decoder-vector interference adequately reflects the correlation between neurons, we tie the encoder weights to the decoder weights. Figures 13 and 14 show that the $k$ that maximizes $g(k)$ is smaller than the $k$ at the point of least feature interference, which is consistent with the experimental results in the previous sections.

E. Model Cognition Detection Details

E.1 Background

In the context of large language model (LLM) safety, models are increasingly required to perform fine-grained recognition and judgment of diverse and evolving risk inputs. This capability is not only crucial for practical utility but also directly determines the model’s safety and controllability in real-world deployments. Achieving this, however, necessitates a deeper understanding of what the model knows and how it comprehends risky content, which requires systematic probing into the internal knowledge and cognitive structures of the model.

Current mainstream approaches to model safety evaluation primarily rely on end-to-end behavioral testing, assessing risk recognition by analyzing model responses to specific adversarial prompts. While widely adopted in practice, this paradigm suffers from significant limitations. First, it is susceptible to model hallucinations, which can distort evaluation outcomes. Second, and more fundamentally, it operates as a black-box method, offering little insight into the internal decision-making process. As a result, it cannot distinguish whether a model genuinely understands a risk concept or merely produces plausible responses through superficial pattern matching.

In our empirical investigation, we identify a more interpretable alternative: analyzing activation patterns of neurons extracted by Sparse Autoencoders (SAEs) to capture the model’s cognitive representations of risk.
Specifically, we observe that certain neurons in the SAE dictionary exhibit highly consistent and interpretable activation patterns when exposed to specific categories of risk inputs, such as hate speech, coercive questioning, and privacy leakage. Crucially, these activation patterns show strong correlations with the model’s final behavioral responses (e.g., refusal to answer, content filtering, or safety warnings). Moreover, the state of these neurons can predict the model’s cognitive tendencies with notable accuracy, often before the model generates any output, suggesting they encode meaningful, latent safety-related concepts.

E.2 Explanations of Selected Neurons

During the model cognition detection process, the three neurons we observed exhibit strong interpretative associations with pornographic websites, with high correlation scores (over 0.4). Their specific interpretations are given in Table 4. The interpretations of these neurons align with their activation patterns across various adult websites. Neuron 2618429 responds to semantic content, while neurons 171579 and 174828 detect syntactic patterns, thereby validating the effectiveness of the interpretations in our neuron database.

E.3 Additional Results

To further validate the consistency between neuron activation and the model’s cognitive and behavioral patterns, we conducted the same experiment on 20 fake pornographic websites and 20 ordinary websites. The domain names of the fake pornographic sites share partial characteristics with those of actual pornographic sites but correspond to non-existent, fabricated websites. The ordinary websites consist of commonly accessed, benign sites. By comparing these results with the main experiment presented in the paper, we confirm that neuron 2618429 is associated with the model’s semantic-level understanding of pornographic websites. Results are illustrated in Figure 15.
Compared with the other two neurons, neuron 2618429 exhibits negligible activation on both the fake pornographic websites and the ordinary benign sites. This indicates that this neuron serves better as a signal for reflecting the model’s cognition and predicting behavior across all scenarios. Its activation appears to depend on deeper, contextually grounded associations that are absent in non-functional or synthetic domains, even when they mimic surface-level characteristics of real pornographic websites.

We also observed a moderate activation of neuron 2618429 on certain synthetic pornographic websites, albeit lower than its activation on genuine pornographic sites. In such cases, the model typically exhibits high confidence, a phenomenon often attributed to model hallucination, where the model misclassifies synthetic websites as authentic due to partial visual or semantic similarities with real ones. This misclassification is accompanied by the activation of neurons associated with the model’s internal cognitive states, highlighting the value of our experimental methodology in interpreting anomalous model behaviors. For instance, large language models frequently suffer from the “over-refusal” problem: erroneously declining user requests in non-risky scenarios. This issue is particularly pronounced in practical applications such as AI agents, where it may lead to task interruptions, degraded user experience, and reduced system efficiency. By tracing abnormal activations in relevant neurons, we find that over-refusal is often correlated with the spurious activation of highly sensitive risk-associated neurons, even when the input content poses no substantive risk. This further underscores the utility of neuron-level analysis in diagnosing and understanding unintended model behaviors.

Figure 16: Model inference trajectories across different languages. (a) Chinese (b) Italian (c) Vietnamese

Table 4: Explanations of three selected porn-website-related neurons.

Neuron Index: 2618429
Explanation: This neuron activates strongly on adult or sexually suggestive content, particularly detecting explicit or sexually suggestive text across multiple languages (e.g., English, Chinese, Russian). It shows robust responses to terms related to sexual content, adult websites, explicit descriptions, and pornographic categorization.

Neuron Index: 171579
Explanation: This neuron identifies patterns associated with Chinese adult content platforms and their technical signatures. Specifically, it responds to: 1. Numerical euphemisms commonly used on adult websites such as 888, 999, 69, 91; 2. Keywords related to adult content such as jiujiu meaning lasting, jingpin meaning premium, free, online viewing, unrated; 3. Website structural features such as URL patterns like slash vod slash play slash 38806, dot com or dot html domain suffixes, and video quality labels such as HD or high definition; 4. Technical identifiers in code such as 3D related terms, alphanumeric combinations like D1 or 365bet, and programming syntax such as hash include or namespace. The neuron is specifically tuned to adult platforms that use combinations of Chinese characters and numerals to evade content filters, while also capturing backend technical elements of streaming websites.

Neuron Index: 174828
Explanation: This neuron responds to explicit expressions related to sexual content, with a focus on adult entertainment terminology in the Chinese context, such as “adult”, “Category I films”, “pornography”, “AV”, and “erotic content”, often combined with indicators of free access like “free” and “online viewing”. It shows strong activation to categories of adult content (e.g., “domestic” or “Chinese-produced”, “Western”), references to platforms (e.g., “website”, “.com”), and explicit service descriptions (e.g., “sex”, “video”, and metaphorical expressions like “big black stick”). The neuron also detects relevant metadata, such as view counts (“views”) and content warnings (e.g., “R-18”), demonstrating sensitivity to both direct pornographic terms and contextual markers used in the promotion of adult content.

Figure 17: Interactive Demo Webpage. (a) Click a random token. (b) Click a pornography-related token.

F. Model Inference Trajectories Supplementary Result

We observed the same inference trajectory in the other three languages (Figure 16). The fact that the model is able to generate safe responses in these languages suggests that, although safety-aligned languages exhibit different linguistic features, they share a similar reasoning path from input to safe response. Deviating from this path may lead to risky outputs from the model.

G. Demonstration of Our Safety Neuron Database

Interaction Website Application

Figure 17 demonstrates our interactive web page, which will be open-sourced along with the toolkit. It shows every token in the query and response, along with all neurons activated on that token in descending order of normalized activation value. It also provides each neuron’s position (layer and SAE index), a text explanation, and the correlation score. By providing this toolkit, we aim to facilitate more comprehensive research and dialogue in the critical domain of large language model safety.
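The per-token view described above can be illustrated with a small sketch: for a clicked token, list the neurons active on it in descending order of normalized activation, together with position, explanation, and correlation score. The `NeuronHit` record, the `rank_hits` helper, and all values below are hypothetical and only illustrate the described layout; they do not reflect the toolkit's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class NeuronHit:
    """One neuron's record for a single token (hypothetical schema)."""
    layer: int            # layer the SAE is attached to
    sae_index: int        # index of the neuron inside the SAE dictionary
    activation: float     # normalized activation on this token
    explanation: str      # human-readable neuron explanation
    correlation: float    # correlation score of the explanation

def rank_hits(hits: List[NeuronHit]) -> List[NeuronHit]:
    """Keep neurons that are active on the token, strongest activation first."""
    active = [h for h in hits if h.activation > 0.0]
    return sorted(active, key=lambda h: h.activation, reverse=True)

# Made-up records for one token (illustration only).
token_hits = [
    NeuronHit(3, 10, 0.12, "hypothetical explanation A", 0.41),
    NeuronHit(5, 77, 0.80, "hypothetical explanation B", 0.55),
    NeuronHit(5, 78, 0.00, "hypothetical explanation C", 0.30),
]

ranked = rank_hits(token_hits)
for h in ranked:
    print(f"layer={h.layer} index={h.sae_index} "
          f"act={h.activation:.2f} corr={h.correlation:.2f} | {h.explanation}")
```

Filtering out zero activations before sorting keeps the listing limited to neurons that actually fired on the token, which matches the description of showing only activated neurons.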