Paper deep dive
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
Ruikang Zhang, Shuo Wang, Qi Su
Models: LLMs (specific models referenced include Qwen and Llama variants based on related search results)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/11/2026, 1:17:17 AM
Summary
The paper introduces a Sparse Autoencoder (SAE)-based framework for identifying and steering high-order semantic features in Large Language Models (LLMs). By leveraging oppositional semantic pairs (e.g., Big Five personality traits), the authors retrieve monosemantic 'functional knobs' that allow for precise, bidirectional behavioral control. They identify a phenomenon called 'Functional Faithfulness,' where steering specific internal features induces coherent, predictable shifts across multiple linguistic dimensions, demonstrating that LLMs internalize deeply integrated representations of complex concepts.
Entities (5)
Relation Signals (3)
Big Five personality traits → usedascasestudyfor → Sparse Autoencoder
confidence 100% · Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering.
Sparse Autoencoder → improvesupon → Contrastive Activation Addition
confidence 95% · The paper demonstrates superior stability and performance compared to existing activation steering methods like CAA.
Sparse Autoencoder → enables → Functional Faithfulness
confidence 90% · The framework uses SAEs to identify features that, when steered, exhibit the Functional Faithfulness effect.
Cypher Suggestions (2)
Find all methodologies used for model steering · confidence 90% · unvalidated
MATCH (m:Methodology)-[:USED_FOR]->(t:Task {name: 'Steering'}) RETURN m.name
Map the relationship between concepts and phenomena · confidence 85% · unvalidated
MATCH (c:Concept)-[:EXHIBITS]->(p:Phenomenon) RETURN c.name, p.name
Abstract
Recent work in Mechanistic Interpretability (MI) has enabled the identification and intervention of internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.
Tags
Links
- Source: https://arxiv.org/abs/2601.02978
- Canonical: https://arxiv.org/abs/2601.02978
PDF not stored locally. Use the link above to view on the source site.
Full Text
77,486 characters extracted from source content.
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
Ruikang Zhang 1, Shuo Wang 1 *, Qi Su 1 †
1 Peking University, Beijing, China
2300018416@stu.pku.edu.cn, mc25570@umac.mo, sukia@pku.edu.cn
Abstract
Recent work in Mechanistic Interpretability (MI) has enabled the identification and intervention of internal features in Large Language Models (LLMs). However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior while maintaining superior stability and performance compared to existing activation steering methods like Contrastive Activation Addition (CAA). We further identify an empirical effect, which we term Functional Faithfulness, whereby intervening on a specific internal feature induces coherent and predictable shifts across multiple linguistic dimensions aligned with the target semantic attribute. Our findings suggest that LLMs internalize deeply integrated representations of high-order concepts, and provide a novel, robust mechanistic path for the regulation of complex AI behaviors.
1 Introduction
In recent years, research on the Mechanistic Interpretability (MI) of Large Language Models (LLMs) has received increasing attention (Ranaldi, 2025).
Recent research has established a robust foundation for understanding and utilizing internal model representations through diverse methodologies. For instance, significant progress has been made in decomposing dense activation patterns into interpretable components, most notably through the deployment of Sparse Autoencoders (SAEs) (Shu et al., 2025), which facilitate the discovery of monosemantic features and automatic feature labeling (Templeton et al., 2024). Furthermore, intervention techniques such as Contrastive Activation Addition (CAA) (Rimsky et al., 2024) have demonstrated that model outputs can be steered by identifying and applying directionally significant activation vectors.
* Work done during internship at Peking University. † Corresponding author.
Despite these advancements, existing methodologies encounter a multifaceted bottleneck that limits their practical significance. The first critical gap lies in the discrepancy between the massive scale of automated feature discovery and the specific, localized need for individual functional knobs to address particular behavioral requirements. While SAEs can identify millions of latent features, retrieving the precise subset relevant to a specific task remains an inefficient process. Second, there is a persistent disconnect between the interpretation of these features and their application; knowing that a feature exists does not inherently provide a stable mechanism for regulating model behavior with it. This leads to the third gap: the observed divergence between input-based feature discovery and output-based steering. Current steering vectors often lack transparency and remain highly sensitive to the quality of probing data, meaning the constructed vector may not translate into a reliable or causally verifiable change in the generated output.
Beyond these structural gaps, although existing literature has touched on high-order semantics such as social biases (Yang et al., 2025b), others, including but not limited to personality traits and ethical values, remain largely under-explored. Investigating these complex semantics is of paramount importance, as they provide the most profound insights into the model’s decision-making logic and exert a dominant influence on the overall quality of AI-generated content.
arXiv:2601.02978v1 [cs.CL] 6 Jan 2026
[Figure 1: Overview of the SAE-based framework for retrieving and steering high-order semantic features. This framework enables the identification and interpretation of monosemantic "functional knobs" that govern complex behaviors like personality. Panel labels: Dataset Construction; Oppositional Semantic Pairs; Feature Retrieval; Feature Validation; Causal Intervention; Effect Evaluation; Representative Features; Case Study: Personality Steering Effects; TRAIT Benchmark; Social Intelligence Evaluation; SocialEval Benchmark; Process-Oriented Interpersonal Ability Evaluation (IAE).]
In this work, we address these challenges by proposing an SAE-centered analysis framework for high-order semantic representations. Many high-order semantics possess inherent oppositivity (e.g., extraversion vs. introversion), providing a foundation for contrastive feature retrieval and feature-based steering methods.
Hence, the pipeline employs statistical activation analysis on datasets with contrasting semantics to automatically distill internal LLM features that are both monosemantic and highly correlated with specific semantic inputs. Through activation steering, their causal role in model output behavior is systematically verified. The sparsity constraint of the SAE not only ensures the robustness of the feature retrieval process against data noise, but also facilitates stronger monosemanticity and interpretability of the learned features, which provides a viable representational basis for the localization of high-level behavioral factors and further mechanistic analysis. Based on the discovered features, we systematically compare model performance before and after steering across multiple downstream tasks. Beyond validating the semantic relevance and steering effectiveness of these features, we further discover a prominent characteristic of model interventions: steering the model with a feature related to a certain high-order semantic triggers a systematic cascade of behavioral shifts that align with the intrinsic logic of that semantic domain, which we term Functional Faithfulness. This suggests that the identified features are functionally integrated into the model’s reasoning process rather than being isolated linguistic triggers. We also compare our approach with existing representative intervention methods to demonstrate its advantages.
To verify the effectiveness of our method, we select personality traits as a representative case of high-order oppositional semantics. Classical personality models, such as the Big Five (Goldberg, 1990), exhibit a clear and stable oppositional structure, providing explicit semantic anchors for data construction and feature retrieval based on semantic contrast. This allows for a systematic analysis of the relationships between input semantics, internal model activations, and resulting model behaviors.
Personality traits provide an ideal validation scenario for assessing the applicability of our method to high-order semantic attributes, the relevance and interpretability of retrieved features, and the effectiveness of feature steering. Their impact on the model is not limited to the traits themselves or local linguistic phenomena; instead, they fundamentally shape the model’s consistent behavioral predispositions across different behaviors and contexts, such as social intelligence performance including emotion regulation, cooperation tendency, and creative ability. Importantly, the correlation between personality traits and corresponding behavioral patterns is well established in psychological research through long-term social experiments and quantitative statistics, providing stable and externally verifiable semantic-behavior correspondences. This external grounding provides a reliable framework for validating both feature steering effects and the functional faithfulness of model interventions.
The main contributions of this paper include: 1) We propose a method to localize effective internal features that control specific high-order behavior in SAEs utilizing oppositional semantic data. This enables bridging the gap between feature retrieval and behavioral intervention without requiring large-scale manual or automatic labeling. 2) Through steering experiments, we demonstrate that the identified features are not mere correlates but behaviorally causal internal features that allow precise steering of complex model behaviors while maintaining stable model performance. 3) Using the Big Five traits and social intelligence as a touchstone, we are the first to show that LLM internal representations align with human psychological meta-analyses.
We identify a functional faithfulness effect, whereby the behavioral impact of personality steering exhibits structured trade-offs that are interpretable within established human personality frameworks, offering insights into how complex behaviors are systematically organized in LLMs.
Notably, the proposed method is not limited to personality traits and also provides a general framework for the interpretation and regulation of other high-order oppositional semantic features, such as sentiment, stance, and factuality.
2 Related Work
2.1 Sparse Autoencoders (SAEs)
An SAE is a neural network designed to reconstruct input representations by learning an overcomplete dictionary. For a basic SAE, given an input representation $z \in \mathbb{R}^{d}$, the encoder first applies a linear transformation $W_{\mathrm{enc}} \in \mathbb{R}^{d \times m}$ and a bias term $b_{\mathrm{enc}} \in \mathbb{R}^{m}$, and then generates a sparse activation vector $h(z)$ through a non-linear activation function $\sigma$ (e.g., ReLU):
$$h(z) = \sigma(z \cdot W_{\mathrm{enc}} + b_{\mathrm{enc}}) \tag{1}$$
where $h(z) \in \mathbb{R}^{m}$ is the sparse activation vector, and $m > d$ represents the dimension of the overcomplete dictionary. The decoder then maps the sparse activation vector $h(z)$ back to the original input space to generate a reconstructed output $\hat{z}$:
$$\hat{z} = h(z) \cdot W_{\mathrm{dec}} + b_{\mathrm{dec}} \tag{2}$$
where $W_{\mathrm{dec}} \in \mathbb{R}^{m \times d}$ is the decoder’s weight matrix, $b_{\mathrm{dec}} \in \mathbb{R}^{d}$ is the decoder’s bias term, and $\hat{z} \in \mathbb{R}^{d}$ is the reconstructed output aimed at optimally approximating the original input $z$. The training objective function of the SAE optimizes model performance by balancing reconstruction error and sparsity constraints:
$$L(z) = \lVert z - \hat{z} \rVert_2^2 + \alpha \lVert h(z) \rVert_1 \tag{3}$$
where $\alpha$ is a hyperparameter for the sparsity penalty coefficient, $\lVert z - \hat{z} \rVert_2^2$ denotes the reconstruction error, and $\lVert h(z) \rVert_1$ denotes the $L_1$ norm of the sparse activation vector.
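As a concrete illustration of Eqs. 1 to 3, the following is a minimal sketch of the SAE forward pass and training loss in plain Python. It is not the authors' implementation: the function names, the toy weights, and the choice of ReLU for $\sigma$ are assumptions made for illustration, and a practical version would use an array library and trained weights.

```python
def relu(x):
    """ReLU, the example non-linearity named for sigma in Eq. 1."""
    return x if x > 0.0 else 0.0


def sae_encode(z, w_enc, b_enc):
    """Eq. 1: h(z) = ReLU(z . W_enc + b_enc); w_enc is stored as d rows of m values."""
    return [
        relu(sum(z[k] * w_enc[k][j] for k in range(len(z))) + b_enc[j])
        for j in range(len(b_enc))
    ]


def sae_decode(h, w_dec, b_dec):
    """Eq. 2: z_hat = h . W_dec + b_dec; w_dec is stored as m rows of d values."""
    return [
        sum(h[j] * w_dec[j][k] for j in range(len(h))) + b_dec[k]
        for k in range(len(b_dec))
    ]


def sae_loss(z, w_enc, b_enc, w_dec, b_dec, alpha):
    """Eq. 3: squared L2 reconstruction error plus alpha times the L1 norm of h(z)."""
    h = sae_encode(z, w_enc, b_enc)
    z_hat = sae_decode(h, w_dec, b_dec)
    reconstruction = sum((a - b) ** 2 for a, b in zip(z, z_hat))
    sparsity = sum(abs(v) for v in h)
    return reconstruction + alpha * sparsity, h, z_hat


# Toy example with hypothetical weights: d = 2 inputs, m = 3 dictionary entries.
W_ENC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
B_DEC = [0.0, 0.0]

loss, h, z_hat = sae_loss([1.0, 2.0], W_ENC, B_ENC, W_DEC, B_DEC, alpha=0.5)
print("h(z) =", h, "z_hat =", z_hat, "loss =", loss)
```

In this toy case the third dictionary entry stays inactive, so the input is reconstructed exactly and the loss consists only of the sparsity penalty.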
This design ensures that the features in the sparse activation vector effectively capture the information of the input representation for reconstruction while enforcing sparsity by penalizing non-zero entries (Shu et al., 2025).
SAEs and LLMs. Recently, SAEs have become key tools for parsing the hidden states of LLMs and mitigating polysemanticity in the field of LLM interpretability. Researchers have trained large-scale SAE libraries across multiple locations of various model architectures, such as Llama Scope (He et al., 2024) and Gemma Scope (Lieberum et al., 2024). Pioneering work by Anthropic (Bricken et al., 2023; Templeton et al., 2024) demonstrates that features extracted by SAEs possess significant advantages. First, SAEs can extract monosemantic features through sparse representations in the hidden space, where each feature dimension typically corresponds to a clear semantic concept. Second, these features exhibit high levels of abstraction and universality, capturing cross-domain concepts consistent across different contexts. More importantly, SAEs can reveal hidden representations that are entangled at the neuron level and difficult to observe, demonstrating unique efficacy in complex concept extraction. Recent studies further demonstrate their application potential, such as using SAEs to localize linguistic features (Jing et al., 2025) or mitigate repetition (Yao et al., 2025), highlighting the value of SAE features in model interpretation and precise regulation.
In this study, we argue that the sparsity constraint of SAEs makes them well-suited for capturing high-order semantic attributes, such as personality traits. High-level hidden states in LLMs often encode complex, entangled semantic information, while the sparsity of SAEs facilitates the decomposition of this information into disentangled, representative dimensions.
This characteristic not only reduces the noise arising from data sampling but, more importantly, maps the semantic opposition inherent in probing data to specific sparse dimensions. This disentanglement capability allows us to precisely extract internal representations corresponding to high-order semantics, such as classical psychological personality structures, providing a solid representational foundation for output intervention.
2.2 Activation Steering
Activation Steering is a technique that guides an LLM to generate target output during the inference stage by injecting specific steering vectors into its hidden states. Compared to Prompt Engineering (Schulhoff et al., 2025), this method possesses representation-based controllability; compared to methods such as SFT (Zhang et al., 2025) and RLHF (Ziegler et al., 2020), it does not require modifying model parameters, thus offering higher economic efficiency and flexibility.
Early representative methods, such as Contrastive Activation Addition (CAA) (Rimsky et al., 2024), generate steering vectors by constructing triplets of (prompt, positive behavior response, negative behavior response) and calculating the mean difference in activations at specific internal locations. However, the effectiveness of CAA depends strongly on the quality of the contrastive data, and data noise is almost unavoidable. This leads to steering vectors that may lack stability and interpretability in applications.
To improve the precision of interventions, recent research (Templeton et al., 2024) has explored utilizing the monosemanticity of SAEs to generate purer steering vectors. Specifically, SAEs can decompose semantic information in hidden states into monosemantic feature dimensions through sparse activation mechanisms, thereby achieving semantic disentanglement and filtering out noise.
Based on this, researchers have designed a steering method: after excluding the SAE reconstruction error from the hidden states, the activation values of the target features are clamped to a specific multiple of their maximum activation in the training set, while the reconstruction error remains unchanged, thereby generating high-quality steering vectors.
In this study, we leverage SAEs to generate steering vectors and design a streamlined feature intervention method (see Section 3.2). This method can generate steering vectors robust to data noise through monosemantic features, thereby achieving precise regulation of high-order behaviors such as the model’s personality traits.
2.3 Personality and High-level Behavioral Control
Effective LLM alignment requires precise control over high-level behaviors such as values, ethics, and personality. Conventional methods primarily rely on input-output interventions like prompt engineering or supervised fine-tuning (Jiang et al., 2023; Mao et al., 2024; Serapio-García et al., 2025; Li et al., 2025), which can induce models to exhibit specific behavioral tendencies but often fail to address the model’s underlying internal mechanisms. To bridge this gap, activation steering attempts to intervene directly within the hidden state space. For instance, Chen et al. (2025) construct steering vectors from the mean activation differences of contrastive semantic pairs to mitigate toxicity. However, such vectors remain susceptible to data noise and entangled semantics, hindering precise, interpretable regulation of specific high-order concepts.
We utilize personality traits, a well-established psychological framework, as a representative high-order semantic scenario to validate our pipeline and explore these internal mechanisms (see Section 1).
3 Methodology
In this section, we describe the methodology used to identify, validate, and apply interpretable features within LLMs.
Specifically, we define a feature as a single dimension of the representation in the SAE vector, or equivalently, the corresponding decoded hidden state in the model (i.e., its feature activation). This definition enables a direct connection between abstract feature representations and their concrete effects on model behavior.
Our methodology is divided into two main stages: feature retrieval and validation (Section 3.1) and feature steering (Section 3.2). The first stage focuses on identifying features that are semantically meaningful and relevant to specific concepts, while the second stage involves manipulating these features to validate their causal effects on model outputs. Below, we detail each stage of the methodology.
3.1 Feature Retrieval and Validation
3.1.1 Dataset Construction
To achieve precise retrieval and validation of specified oppositional semantic features, we constructed two small, high-quality datasets. The specific roles and applications of these two datasets are detailed in Sections 3.1.2 and 3.1.3.
Feature Retrieval Dataset. This dataset consists of highly controlled semantic positive and negative sample pairs. Each pair is composed of sentences where the target semantics are opposite, but the non-target semantics and syntactic structures remain consistent (e.g., behavioral descriptions of extraversion vs. introversion). This design aims to minimize contrastive noise from non-target semantics and grammar.
Feature Validation Dataset. This dataset comprises a series of open-ended questions related to the target semantics, designed to induce the model to generate text relevant to the target semantics in an unconstrained environment. It is used to verify the effectiveness of feature steering in complex generation tasks.
3.1.2 Feature Retrieval
In the feature retrieval stage, we feed the Feature Retrieval Dataset into the target model, extract the hidden states of the residual stream, and map them to the SAE latent space.
Through aggregation and statistical analysis, we select features that are strongly correlated with input semantics.
Feature Encoding and Aggregation. For each token $t$ in the input sequence, we extract its hidden state $h_t$ and obtain the feature activation $f_t$ via the SAE encoder. To capture the semantic information of the whole sequence, we use a max-pooling strategy to aggregate the activation values of all tokens, resulting in a sequence-level feature representation:
$$F = \mathrm{max\_pool}(f_1, f_2, \ldots, f_T) \tag{4}$$
Selection based on Activation Frequency Difference. Based on the sparsity of the SAE, target features should exhibit high-frequency activation in positive samples and remain suppressed in negative samples, or vice versa. Due to the non-target semantic and syntactic similarity between positive and negative samples, the activation frequencies of non-target features will be approximately the same in both sets. Accordingly, we calculate the activation frequency difference for feature $i$ between the positive and negative sample sets:
$$\Delta f_i = \left| P(f_{i,\mathrm{pos}} > 0) - P(f_{i,\mathrm{neg}} > 0) \right| \tag{5}$$
To optimize computational efficiency and ensure robustness, we only retain candidate features where $\Delta f_i$ exceeds a threshold $\tau_1$ and the activation rate in at least one of the two sample sets exceeds $\tau_2$.
3.1.3 Feature Validation
We observed that the correlation between feature activations and inputs does not always equate to the significance of the feature steering’s effect on output (see Appx. A.2). For features with significant output effects, even though those identified through contrastive methods have activations concentrated at one pole of the target oppositional semantics, steering the model with positive or negative coefficients can enhance or suppress that pole (or equivalently, suppress or enhance the other pole). This provides a pathway for steering the model with polarity.
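To make the retrieval statistics of Section 3.1.2 concrete, the sketch below applies the max-pooling of Eq. 4 and the activation frequency difference of Eq. 5 to toy data. The function names, the toy activations, and the threshold values passed for $\tau_1$ and $\tau_2$ are illustrative assumptions, not values taken from the paper.

```python
def max_pool(token_acts):
    """Eq. 4: element-wise maximum over the T token-level activation vectors."""
    return [max(column) for column in zip(*token_acts)]


def activation_rate(pooled, i):
    """Fraction of samples whose pooled activation of feature i is positive."""
    return sum(1 for f in pooled if f[i] > 0) / len(pooled)


def retrieve_candidates(pos_seqs, neg_seqs, tau1, tau2):
    """Keep features whose frequency difference (Eq. 5) exceeds tau1 and whose
    activation rate exceeds tau2 on at least one side; sort by difference."""
    pos = [max_pool(seq) for seq in pos_seqs]
    neg = [max_pool(seq) for seq in neg_seqs]
    kept = []
    for i in range(len(pos[0])):
        p_pos = activation_rate(pos, i)
        p_neg = activation_rate(neg, i)
        delta = abs(p_pos - p_neg)
        if delta > tau1 and max(p_pos, p_neg) > tau2:
            kept.append((i, delta))
    return sorted(kept, key=lambda item: -item[1])


# Toy SAE activations (hypothetical): 2 sequences per pole, 2 tokens each, 3 features.
POS_SEQS = [
    [[0.0, 0.3, 0.9], [0.0, 0.0, 0.2]],
    [[0.0, 0.5, 0.7], [0.0, 0.1, 0.0]],
]
NEG_SEQS = [
    [[0.0, 0.4, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.2, 0.0], [0.1, 0.0, 0.0]],
]

candidates = retrieve_candidates(POS_SEQS, NEG_SEQS, tau1=0.6, tau2=0.5)
print(candidates)
```

Feature 1 fires on both poles and is discarded as non-target, while feature 2 fires only on the positive pole and is retained as a candidate for the validation stage.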
Therefore, in the feature validation stage, we adopt the following methods to verify the correlation between features and outputs.
Causal Intervention. We utilize candidate features to steer the model with varying intensity $\alpha$, generating outputs on the Feature Validation Dataset. We set a gradient of intervention coefficients $\alpha \in [-5, 5]$ and observe the continuous shift of the model’s output along the dimension of the oppositional semantics.
Effect Evaluation. We use Qwen3-235B-Thinking (Yang et al., 2025a) as an automatic evaluator to assess whether the model’s generated responses align with the expected behavioral trends, supplemented by secondary human verification. Only features that demonstrate significant behavioral monotonicity under intervention are ultimately confirmed as effective and stable functional knobs.
3.2 Feature-based Activation Steering
In this stage, we refer to the practices of Anthropic (Templeton et al., 2024) and design a simplified intervention method as follows:
Feature-based Steering Vector Generation. To implement feature-based steering, we define the steering vector $v_{\mathrm{steer}}$ as a scaled reconstruction of the target feature $i$:
$$v_{\mathrm{steer}} = \alpha \cdot \phi_i \cdot W_{\mathrm{dec}}^{(i)} \tag{6}$$
where $\alpha$ is the steering coefficient, $\phi_i$ denotes the maximum activation value of feature $i$ observed during the training phase (i.e., $\mathrm{max\_act}_i$), and $W_{\mathrm{dec}}^{(i)} \in \mathbb{R}^{d}$ is the decoder vector of feature $i$, i.e., the $i$-th row of the SAE decoder weight matrix $W_{\mathrm{dec}} \in \mathbb{R}^{m \times d}$ under the convention of Eq. 2. This vector can be pre-computed and cached, enabling intervention without the computational overhead of SAE inference. The scaling strategy ensures that the intervention intensity remains consistent with the original activation magnitude of the model.
Residual Stream Injection.
During inference, we inject the steering vector into the residual stream $h_l$ at the corresponding layer in real time:
$$h'_l = h_l + v_{\mathrm{steer}} \tag{7}$$
4 Experiments
4.1 Experimental Settings and Tasks
To further verify that the features retrieved by our pipeline possess input-output correlation, interpretability and intervention effectiveness, we designed a multi-dimensional experimental framework based on the characteristics of high-order oppositional semantics, particularly personality traits.
Dataset Construction. For the feature retrieval dataset, we utilized the Q-Sort situational dataset (Neuman and Cohen, 2023), which is extended from the Riverside Situational Q-Sort (RSQ) (Funder, 2016) psychological assessment tool. Specifically, for each trait-facet defined in the NEO-PI-R (Costa and McCrae, 2008a) within the Big Five personality traits, we selected the most representative situational categories and used Qwen3-235B-Thinking (Yang et al., 2025a) to generate positive and negative sample pairs. These pairs maintain strict consistency in background semantics and syntactic structure, differing only in the behavioral responses (high-score vs. low-score) corresponding to the target trait. For the feature validation dataset, we constructed an open-ended question-and-answer set based on the aforementioned situational categories, aimed at inducing the model to produce behavioral descriptions with personality inclinations, thereby evaluating the causal impact of feature intervention on the model’s performance in unstructured environments. For more information on dataset examples, see Appx. A.1. For dataset examples, see Appx. A.3.
Model and SAE Configuration. We selected DeepSeek-R1-Distill-Llama-8B (DeepSeek-AI, 2025) as the experimental model.
This choice offers dual advantages: 1) The Llama-Scope-R1-Distill SAE (He et al., 2024) trained for this model covers all residual stream locations, allowing us to identify and localize personality representations globally; 2) The instruction-tuned baseline model possesses rich semantic expression capabilities, facilitating a fine-grained evaluation of the steering’s impact on high-order behaviors.
Selection of Steering Factors. For the SAE steering coefficient, we followed the practice of Anthropic (Templeton et al., 2024) and set the steering coefficient to $\alpha = \pm 5$. Ablation experiments verified that this intensity significantly induces the target semantic inclinations while minimizing the impact on the model’s generative capabilities. For the CAA in the control group, we strictly followed existing practices (Rimsky et al., 2024) and selected $\pm 2$ as the intervention coefficient.
Specifically, we conducted the following experiments, selecting a representative feature for each Big Five trait as a case study to comprehensively verify the nature of the retrieved features.
Token Heatmaps. We obtained per-token activation values of selected features on the feature retrieval dataset. By analyzing the overlap between activation and specific linguistic constituents, we confirmed that the features captured precise personality semantics on the input side.
Personality Steering Effects. We used the TRAIT (Lee et al., 2025) benchmark, which is specifically designed to measure LLM personality, and used the retrieved personality features to construct steering vectors to test the model’s performance on questionnaires. Simultaneously, we recorded the validity rate of the questionnaire answers. Invalid cases include instruction-following failures and nonsensical output, which reflect the impact of feature intervention on the model’s foundational performance.
We also introduced a baseline (without steering) and Contrastive Activation Addition (CAA) as control groups. The baseline serves to show that our chosen features can selectively enhance the model’s performance on a specific personality trait or its opposite. The comparison with CAA highlights the advantages of our pipeline in generating steering vectors (see Section 5.2).
Social Intelligence Evaluation. To verify the functional faithfulness of the feature steering, we referred to human psychological research on the correlation between personality traits and high-level behaviors such as social intelligence. If the extracted features are functionally meaningful, intervention along these features should not only alter personality scores but also trigger behavioral pattern fluctuations consistent with psychological expectations. To this end, we utilized the Interpersonal Ability Evaluation (IAE) part of the SocialEval (Zhou et al., 2025) benchmark to conduct a comprehensive test of the model’s social intelligence. By comparing with the baseline, we analyzed the correlation between changes in the model’s personality traits and its social intelligence performance, exploring whether this correlation reflects trends observed in human behavior in psychology, thereby providing empirical support for evaluating model mechanisms through feature representations.
5 Result and Analysis
5.1 Token Heatmaps
Openness: Designing and developing new materials for the construction of sustainable infrastructure, I eagerly explore innovative theories and unconventional solutions to push the boundaries of what’s possible.
Conscientiousness: My group member is counting on me to prepare my part of the project, I prioritize completing my work thoroughly and on time to uphold my obligations.
Extraversion: Volunteering at a food bank, I naturally greet everyone with a smile and quickly strike up friendly conversations with both staff and recipients.
Agreeableness: A person may need reassurance that their pet will be well- behaved around visitors, I respond with warmth and empathy , offering comfort and understanding about their concerns. Neuroticism: Being forced to work with someone who is hostile or unpleasant, I feel frustration mount and retaliate instantly , struggling to suppress my irritation. Figure 2: Token-level feature activation heatmaps. Darker highlights indicate higher activation values of specific SAE features. As shown in Fig. 2, from the activation distri- bution patterns of relevant features on the input text, we found that the model’s mode of encod- ing personality traits is twofold: 1) Activations are distributed across relevant words or phrases, consis- tent with the semantic characteristics of personality traits; 2) and/or distributed at syntactic boundaries such as conjunctions, commas and periods, reflect- ing the highly synthesized nature of personality traits as high-dimensional semantic features. 5.2 Personality Steering Effects Effectiveness of Output Intervention. The steering vectors constructed by our method can effectively intervene in the model output in most cases, adjusting the model’s personality perfor- mance while exhibiting good polarity, allowing for selective enhancement of either pole of the person- ality traits based on the sign of the coefficient. This indicates that these steering vectors effectively cap- ture the primary direction of the opposition, reflect- ing the robustness of our method in filtering data noise. In contrast, though the activation vectors constructed by CAA showed significant and polar- ized effects on some features like Agreeableness and Openness, it performed poorly on others such as Extraversion and Conscientiousness, where they lacked polarity and could only selectively enhance one pole of the oppositional semantics, showing poor flexibility and stability. 
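The polarity criterion can be made concrete with a small check on the trait scores reported in Table 1 (Appx. A.5). The sketch below is illustrative only: the helper name `is_bidirectional` is hypothetical, and it assumes that the first value of each "a / b" pair in Table 1 corresponds to the positive coefficient and the second to the negative one.

```python
def is_bidirectional(baseline, plus, minus):
    """True if positive steering raises and negative steering lowers the trait score."""
    return plus > baseline > minus

# (baseline, score at +alpha, score at -alpha), copied from Table 1.
# The +/- ordering of each pair is an assumption of this sketch.
trait_scores = {
    ("Extraversion", "Ours"): (0.4463, 0.6609, 0.3971),
    ("Extraversion", "CAA"): (0.4463, 0.2950, 0.1430),
    ("Conscientiousness", "Ours"): (0.8695, 0.9043, 0.8294),
    ("Conscientiousness", "CAA"): (0.8695, 0.7150, 0.7630),
}

polarity = {key: is_bidirectional(*vals) for key, vals in trait_scores.items()}
for (trait, method), ok in polarity.items():
    print(f"{trait:18s} {method:5s} bidirectional={ok}")
```

Under this reading, both CAA rows fall on one side of the baseline, which matches the observation that CAA could only enhance one pole for Extraversion and Conscientiousness.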
Impact of Output Intervention on Foundational Performance. The impact of the steering vectors constructed by our method on model performance is nearly negligible, demonstrating high practical value. However, the vectors constructed by CAA led to up to approximately 20% of the questions being unanswerable for Conscientiousness and about 14% for Extraversion (valid rates as low as 0.800 and 0.862, respectively; see Table 1). For Neuroticism, CAA even left the model almost unable to produce meaningful output, exerting a significant negative impact on model performance (see Appx. A.4).

Figure 3: TRAIT results (trait score from 0.0 to 1.0 for each Big Five trait under Baseline, CAA −, CAA +, Ours −, and Ours +). For raw data, see Appx. A.5.

5.3 Social Intelligence Evaluation

As shown in Fig. 4, personality steering induces stable, interpretable, and cross-task consistent behavioral patterns, providing evidence for the effectiveness of our method at the level of high-order semantic control. Crucially, the induced shifts exhibit a benefit-tradeoff structure that is aligned with established functional descriptions in personality psychology (Barrick and Mount, 1991; Habashi et al., 2016; Pletzer et al., 2019; Costa and McCrae, 2008b). Specifically, Agreeableness enhances pro-social behavior and conflict regulation (e.g., anger management, ethical competence) but slightly impairs self-agency tasks (Wilmot and Ones, 2022). Conscientiousness yields significant and stable gains in core self-regulation tasks (e.g., goal regulation, responsibility management), systematically strengthening goal maintenance and norm-following (Roberts et al., 2009; Jackson et al., 2010; Eisenberg et al., 2014). Extraversion significantly improves social-interaction performance (e.g., teamwork, expressive skill) at the cost of tasks requiring sustained focus and fine control, such as detail management (John et al., 2008; DeYoung et al., 2007; Fishman et al., 2011). Conversely, Neuroticism weakens emotional stability and executive control (e.g., goal regulation, rule-following) while marginally increasing generative diversity in creative skill (Watson and Clark, 1984; Lahey, 2009). Openness boosts creativity and adaptability but degrades performance in structured tasks such as responsibility management (DeYoung, 2015; McCrae, 1987).

Figure 4: Characteristic capability shifts in SocialEval IAE under SAE-based personality steering (one panel per trait: Agreeableness, Conscientiousness, Extraversion, Neuroticism, Openness; score axis from 0.3 to 0.8; legend: −5, 0, 5).
Overall, the changes in the model’s behavioral patterns align closely with established experimental conclusions in personality psychology, indicating that the semantic-behavior mapping of the Big Five traits is stably projected onto the model’s internal activations and generative behaviors. This provides externally verifiable empirical support for the relevance of the discovered features, steering effectiveness, and functional faithfulness. Detailed discussion can be found in Appx. A.6.

6 Conclusion

This paper presents a systematic analytical framework aimed at retrieving, validating, and applying high-order oppositional semantic representations within LLMs. By integrating SAE-based feature disentanglement with activation steering interventions, we construct a closed-loop validation pipeline extending from input statistical correlation to output intervention effectiveness, thereby effectively bridging the gap between feature discovery and behavioral regulation.

Experimental results demonstrate that the decomposition mechanism of SAEs effectively captures oppositional semantics in constructed probing data, mapping them to internal features that are largely monosemantic and interpretable. Compared to traditional methods, the retrieval mechanism proposed in this study significantly enhances the semantic purity of feature retrieval, robustness to data noise, and feature interpretability, thereby improving the efficacy of feature interventions.

Empirical research with personality traits as a case study reveals that even medium-scale models (e.g., 8B parameters) encode internal representations that are highly consistent with human personality constructs. More importantly, we verified the Functional Faithfulness of these high-order features in downstream tasks. Intervention results show that steering along personality-related features not only modulates the model’s linguistic style and personality trait manifestation but also systematically influences its behavioral patterns in the dimension of social intelligence. Notably, while such interventions induce coherent personality-aligned behaviors, they also introduce structured benefit-tradeoff patterns analogous to those documented in human personality research. These observations highlight that, although personality steering can enhance the personified performance of models, its potential impact on model performance and behavioral reliability warrants careful evaluation.

In summary, the feature discovery and multidimensional evaluation framework proposed in this paper exhibits strong generalizability and can be readily extended to other high-order semantic domains, such as sentiment, stance, and factuality. By enabling the systematic identification and validation of internal semantic functional knobs, this study provides a mechanistic pathway for analyzing and regulating complex model behaviors. Such a framework offers practical implications for applications such as bias analysis, controllable text generation, and the improved reliability of LLM-generated content.

Limitations

Anthropomorphizing Computational Systems. A primary limitation of this work is the conceptual challenge of mapping human personality constructs originally developed for biological and social agents onto silicon-based computational systems. Whether LLMs truly possess personality or merely simulate statistical regularities of human language remains a subject of long-standing debate (e.g., Sühr et al., 2025).
Though many works (e.g., Serapio-García et al., 2025) justify the adaptation of psychometrics to LLMs, bridging the gap between mechanistic feature activation and psychological theory requires deeper interdisciplinary collaboration and more robust theoretical and empirical research to avoid over-simplification.

Scope of Discussed High-Order Semantic Features. While our proposed pipeline is designed to be generalizable to various high-order semantic features, this study focuses exclusively on personality traits. It is important to note that the landscape of high-order semantics is not confined to personality traits; rather, it encompasses a broader spectrum of abstract concepts that govern complex reasoning and behavior, including but not limited to factuality, cultural values, and social biases. Future work is required to extend this framework to a more diverse set of high-order features to fully validate its generalizability across the internal "mental" landscape of LLMs.

Ethical Considerations

While our framework enables precise high-order semantic steering, it entails certain risks. The dataset contains scenarios related to traits like Neuroticism that may involve sensitive or distressing content. Furthermore, this methodology could be misappropriated for harmful purposes, including but not limited to augmenting model toxicity or crafting manipulative personas that exploit psychological vulnerabilities. We emphasize that these findings are for scientific inquiry and advocate for the responsible deployment and rigorous ethical auditing of such steering techniques in real-world applications.

References

Murray R Barrick and Michael K Mount. 1991. The big five personality dimensions and job performance: a meta-analysis. Personnel psychology, 44(1):1–26.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, and 6 others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. 2025. Persona vectors: Monitoring and controlling character traits in language models. Preprint, arXiv:2507.21509.

Paul Costa and Robert McCrae. 2008a. The revised neo personality inventory (neo-pi-r). The SAGE Handbook of Personality Theory and Assessment, 2:179–198.

Paul T Costa and Robert R McCrae. 2008b. The revised neo personality inventory (neo-pi-r). The SAGE handbook of personality theory and assessment, 2(2):179–198.

DeepSeek-AI. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948.

Colin G DeYoung. 2015. Cybernetic big five theory. Journal of research in personality, 56:33–58.

Colin G DeYoung, Lena C Quilty, and Jordan B Peterson. 2007. Between facets and domains: 10 aspects of the big five. Journal of personality and social psychology, 93(5):880.

Nancy Eisenberg, Angela L Duckworth, Tracy L Spinrad, and Carlos Valiente. 2014. Conscientiousness: Origins in childhood? Developmental psychology, 50(5):1331.

Inna Fishman, Rowena Ng, and Ursula Bellugi. 2011. Do extraverts process social stimuli differently from introverts? Cognitive neuroscience, 2(2):67–73.

David C. Funder. 2016. Taking situations seriously: The situation construal model and the riverside situational q-sort. Current Directions in Psychological Science, 25(3):203–208.

Lewis R. Goldberg. 1990. An alternative "description of personality": the big-five factor structure. Journal of Personality and Social Psychology, 59(6):1216–1229.

Meara M Habashi, William G Graziano, and Ann E Hoover. 2016. Searching for the prosocial personality: A big five approach to linking personality and prosocial behavior. Personality and Social Psychology Bulletin, 42(9):1177–1192.

Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. 2024. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. Preprint, arXiv:2410.20526.

Joshua J Jackson, Dustin Wood, Tim Bogg, Kate E Walton, Peter D Harms, and Brent W Roberts. 2010. What do conscientious people do? development and validation of the behavioral indicators of conscientiousness (bic). Journal of research in personality, 44(4):501–511.

Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, and Yixin Zhu. 2023. Evaluating and inducing personality in pre-trained language models. In Advances in Neural Information Processing Systems, volume 36, pages 10622–10643. Curran Associates, Inc.

Yi Jing, Zijun Yao, Hongzhu Guo, Lingxu Ran, Xiaozhi Wang, Lei Hou, and Juanzi Li. 2025. LinguaLens: Towards interpreting linguistic mechanisms of large language models via sparse auto-encoder. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28220–28239, Suzhou, China. Association for Computational Linguistics.

Oliver P John, Laura P Naumann, and Christopher J Soto. 2008. Paradigm shift to the integrative big five trait taxonomy. Handbook of personality: Theory and research, 3(2):114–158.

Benjamin B Lahey. 2009. Public health significance of neuroticism. American Psychologist, 64(4):241.

Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, and Youngjae Yu. 2025. Do LLMs have distinct and consistent personality? TRAIT: Personality testset designed for LLMs with psychometrics. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8397–8437, Albuquerque, New Mexico. Association for Computational Linguistics.

Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona T. Diab, and Maarten Sap. 2025. BIG5-CHAT: Shaping LLM personalities through training on human-grounded data. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20434–20471, Vienna, Austria. Association for Computational Linguistics.

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. Preprint, arXiv:2408.05147.

Shengyu Mao, Xiaohan Wang, Mengru Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Ningyu Zhang. 2024. Editing personality for large language models. In Natural Language Processing and Chinese Computing: 13th National CCF Conference, NLPCC 2024, Hangzhou, China, November 1–3, 2024, Proceedings, Part I, pages 241–254, Berlin, Heidelberg. Springer-Verlag.

Robert R McCrae. 1987. Creativity, divergent thinking, and openness to experience. Journal of personality and social psychology, 52(6):1258.

Yair Neuman and Yochai Cohen. 2023. A dataset of 10,000 situations for research in computational social sciences psychology and the humanities. Scientific Data, 10(1):505.

Jan Luca Pletzer, Margriet Bentvelzen, Janneke K Oostrom, and Reinout E De Vries. 2019. A meta-analysis of the relations between personality and workplace deviance: Big five versus hexaco. Journal of vocational behavior, 112:369–383.

Leonardo Ranaldi. 2025. Survey on the role of mechanistic interpretability in generative ai. Big Data and Cognitive Computing, 9(8).

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. 2024. Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand. Association for Computational Linguistics.

Brent W Roberts, Joshua J Jackson, Jennifer V Fayard, Grant Edmonds, and Jenna Meints. 2009. Conscientiousness.

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, and 12 others. 2025. The prompt report: A systematic survey of prompt engineering techniques. Preprint, arXiv:2406.06608.

Gregory Serapio-García, Mustafa Safdari, Clément Crepy, Luning Sun, Stephen Fitz, Peter Romero, Marwa Abdulhai, Aleksandra Faust, and Maja Matarić. 2025. A psychometric framework for evaluating and shaping personality traits in large language models. Nature Machine Intelligence, 7(12):1954–1968.

Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, and Mengnan Du. 2025. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1690–1712, Suzhou, China. Association for Computational Linguistics.

Tom Sühr, Florian E. Dorner, Samira Samadi, and Augustin Kelava. 2025. Challenging the validity of personality tests for large language models. In Proceedings of the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO ’25, pages 74–81, New York, NY, USA. Association for Computing Machinery.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, and 3 others. 2024. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread.

David Watson and Lee A Clark. 1984. Negative affectivity: the disposition to experience aversive emotional states. Psychological bulletin, 96(3):465.

Michael P Wilmot and Deniz S Ones. 2022. Agreeableness and its consequences: A quantitative review of meta-analytic findings. Personality and social psychology review, 26(3):242–280.

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025a. Qwen3 technical report. Preprint, arXiv:2505.09388.

Yi Yang, Hanyu Duan, Ahmed Abbasi, John P. Lalor, and Kar Yan Tam. 2025b. Bias a-head? analyzing bias in transformer-based language model attention heads. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 276–290, Albuquerque, New Mexico. Association for Computational Linguistics.

Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, and Di Wang. 2025. Understanding the repeat curse in large language models from a feature perspective. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7787–7815, Vienna, Austria. Association for Computational Linguistics.

Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2025. Instruction tuning for large language models: A survey. Preprint, arXiv:2308.10792.

Jinfeng Zhou, Yuxuan Chen, Yihan Shi, Xuanming Zhang, Leqi Lei, Yi Feng, Zexuan Xiong, Miao Yan, Xunzhi Wang, Yaru Cao, Jianing Yin, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, and Minlie Huang. 2025. Socialeval: Evaluating social intelligence of large language models. Preprint, arXiv:2506.00900.

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences. Preprint, arXiv:1909.08593.

A Appendix

A.1 Detail on Dataset Construction

The construction of our dataset follows a rigorous pipeline leveraging the expertise of Qwen3-235B-Thinking (Yang et al., 2025a). First, 100 situational categories from the Q-Sort (Neuman and Cohen, 2023) dataset are processed; for each category, the model identifies personality traits that can be sufficiently manifested within that context, if possible. The initial retrieval results undergo review and labeling to ensure each selected situation is mapped to a unique trait-facet pair. Subsequently, for each refined situational category, the model is tasked to expand it into specific scenarios and generate contrastive reaction pairs representing high and low scores on the targeted facet. Finally, the positive and negative samples for each trait are integrated to form the comprehensive Feature Retrieval Dataset, comprising 500 contrastive pairs per personality trait. For the Feature Validation Dataset, we simply append a question asking for the model’s reaction after the situation category description.

System Prompt of Situation Category Annotation

# Task Instructions
## Your Role
You are an expert annotator specializing in personality psychology. Your task is to analyze and annotate various situations based on established psychological theories and frameworks.
## Your Task
1. **Situation Analysis**: Carefully read and understand the provided situation, which includes multiple examples illustrating the context.
2. **Annotation**: Based on your analysis, provide a concise annotation that captures the essence of the situation.
Then, analysis which trait(s) from the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) are most relevant to the situation. Justify your choice with a brief explanation. Your annotation should be clear, informative, and relevant to personality psychology.
## Input Format
You will receive input in the following format:
```
# Situation Name
<name of the situation>
# Situation Examples
<example 1>
<example 2>
...
<example n>
```
## Output Format
Your response should be structured in the following JSON format:
```json
{
  "annotation": "Your concise annotation of the situation.",
  "related_traits": [
    {
      "trait": "Name of the related Big Five trait",
      "justification": "Brief explanation of why this trait is relevant to the situation."
    },
    ... Additional traits if applicable ...
  ]
}
```
## Important Notes
- Ensure that your annotations are based on established psychological theories and frameworks.
- Be objective and avoid personal biases in your analysis.
- If the situation does not clearly relate to any of the Big Five traits, you may indicate that no traits are applicable, and return an empty list for "related_traits". If there are multiple relevant traits, include all applicable ones with justifications.
## Additional Information
- Five-trait mnemonics:
  - Conscientiousness: self-discipline, planning, rule-following (positively linked to achievement and health).
  - Agreeableness: cooperation, compassion, harmony-seeking (positively linked to prosocial behavior; may increase obedience to authority).
  - Extraversion: sociability, energy, reward-seeking in social contexts (linked to social interaction and leadership).
  - Openness: curiosity, creativity, novelty-seeking (linked to innovation and some risk-taking).
  - Neuroticism: emotional instability, anxiety (linked to interpersonal conflict and social avoidance).

System Prompt of Feature Retrieval Dataset Generation

**Role:** You are a personality psychology expert specializing in the Five-Factor Model (Big Five) and its 30 facets as described by the NEO-PI-R. Your task is to provide nuanced insights into how different personality facets might influence a person's behavior in a given scenario.

**Instructions:** You will be provided with a **situation**, a specific **Big Five trait**, a corresponding **Big Five facet**, and a **description** of that trait and facet. Based on this information, you will write two separate sentences.
* **Sentence 1** should describe the reaction of a person who scores **high** on the specified facet and corresponding Big Five trait.
* **Sentence 2** should describe the reaction of a person who scores **low** on the specified facet and corresponding Big Five trait.
* Each sentence must consist of two clauses, with correct grammatical and semantic structure:
  * **Clause 1:** A description of the **situation**. This clause must be identical for both sentences. The situation can be directly quoted from the input or slightly rephrased, but the core meaning must remain unchanged. You should decide whether to use first-person or third-person perspective based on the situation description, so that the second clause can clearly illustrate the high or low facet trait. You should also ensure that the situation is described in a natural and coherent manner considering the second clause.
  * **Clause 2:** A description of the first-person reaction, clearly illustrating the high or low facet trait.
* Ensure the output is a list with exactly two sentences.

**Example:**
* **Situation:** A presentation to a new team tomorrow.
* **Big Five Trait:** Neuroticism
* **Big Five Trait Description:** Measures emotional stability and a person's tendency to experience negative emotions. One with high Neuroticism tends to be emotionally unstable, prone to experiencing negative emotions like anxiety, anger, and depression. One with low Neuroticism is emotionally stable, able to handle stress calmly, and rarely feels nervous or discouraged.
* **Facet:** Anxiety
* **Facet Description:** One with high Anxiety is habitually worried and tense, even when things are going well. One with low Anxiety is calm and composed, typically not bothered by small things.

**Output:**
["Facing a presentation to a new team tomorrow, I am overwhelmed with worry about potential mistakes and how I will be perceived.", "Facing a presentation to a new team tomorrow, I remain composed and confident, focusing on delivering my message effectively."]

**Note:**
* Each sentence should be concise.
* You should only provide the two sentences as output without any additional commentary or explanation.

A.2 Discrepancy Between Input Activation and Steering Effectiveness

Steering Case - Conscientiousness
Setup: Layer: 23 | Index: #7508 | Metric: Count (88.4%/9.6%), Mean Act (2.818/0.206)
Steering Results (Intensity α → Output):
-5.0: In complex situations, I would remain calm and focused, carefully analyzing the challenges and considering the most suitable course of action to achieve the best possible outcome.
-2.5: In a complex situation, I would remain calm and focused, carefully analyzing the circumstances to make the best possible decision.
0.0: In a complex situation, I would remain calm and focused, carefully analyzing the circumstances to make informed decisions while maintaining a positive attitude.
+2.5: I would remain calm and focused, using all available resources and strategies to navigate the situation effectively.
+5.0: I would remain calm and focused, using all available resources to analyze the situation and make the best possible decisions.
Discrepancy Analysis: Despite a high activation ratio (∼13.6×), the textual output shows high semantic stability. This suggests the feature is largely irrelevant to output.

Steering Case - Extraversion
Setup: Layer: 14 | Index: #29594 | Metric: Count (93.4%/25.8%), Mean Act (4.565/0.616)
Steering Results (Intensity α → Output):
-5.0: I would behave in a way that is kind and respectful, while still enjoying the situation.
-2.5: I would engage in playful behavior by perhaps sharing a joke or a light-hearted riddle.
0.0: I would engage in the activity with enthusiasm and a positive attitude, making the most of the opportunity to have fun.
+2.5: I would engage in activities that bring joy and energy, embracing the moment with enthusiasm and a positive attitude!
+5.0: I would be full of energy and enthusiasm, bringing a positive and lively atmosphere wherever I am!
Discrepancy Analysis: This feature shows strong causal steering. As α increases, the tone shifts significantly from "kind/respectful" to "energetic/lively," matching the Extraversion construct.

Steering Case - Openness
Setup: Layer: 9 | Index: #17799 | Metric: Count (97.6%/18.2%), Mean Act (1.713/0.270)
Steering Results (Intensity α → Output):
-5.0: If art or music is important to me, I would engage in activities related to art or music, such as attending exhibitions.
-2.5: I would engage in art or music actively, appreciating their cultural and emotional values.
0.0: I would immerse myself in the art or music, letting it inspire and enrich my emotions and thoughts.
+5.0 I would immerse myself in the beauty and inspiration of art and music, letting them enrich my life and enhance my appreciation.
Discrepancy Analysis: Moderate activation contrast results in subtle but consistent semantic enrichment, reinforcing the "Appreciation for Experience" facet of Openness.

A.3 Dataset Examples

I. Feature Retrieval Dataset Examples

Case 1: Agreeableness (Tender-mindedness)
• Situation: After a break-up, someone may need reassurance that they will find love again.
• High Reaction: I offer warm, empathetic support, comforting them with kindness and hope.
• Low Reaction: I focus on the practical realities, preferring logic over emotional comfort.

Case 2: Conscientiousness (Dutifulness)
• Situation: My boss is counting on me to finish a project by the end of the day.
• High Reaction: I meticulously organize my tasks to fulfill the deadline as agreed.
• Low Reaction: I procrastinate and dismiss the urgency of completing it on time.

Case 3: Extraversion (Activity)
• Situation: Going to a karaoke night and having fun singing with friends.
• High Reaction: I energize the room by choosing upbeat songs and encouraging others to join.
• Low Reaction: I observe performances and sing a few songs at my own leisure.

Case 4: Neuroticism (Depression)
• Situation: Getting stuck in a traffic jam when running late for an important meeting.
• High Reaction: I feel overwhelmed by a sense of hopelessness; nothing will ever go right.
• Low Reaction: I remain positive and focus on practical solutions without succumbing to discouragement.

Case 5: Openness (Ideas)
• Situation: Developing and launching new products in the technology industry.
• High Reaction: I thrive on brainstorming novel approaches and diving into frontier concepts.
• Low Reaction: I prefer sticking to proven methods and avoid abstract or hypothetical debates.

II. Feature Validation Dataset Examples

Methodology: These prompts are designed to test the model's behavioral consistency after feature intervention across all 30 facets of the Big Five model as defined by the NEO-PI-R.

Extraversion: Warmth, Gregariousness, Assertiveness, Activity, Excitement Seeking, Positive Emotions.
Example (Warmth): Imagine you are at a social gathering where new relationships could develop, how would you behave?

Agreeableness: Trust, Straightforwardness, Altruism, Compliance, Modesty, Tender-mindedness.
Example (Altruism): Imagine you are in a situation where someone needs help, how would you behave?

Conscientiousness: Competence, Order, Dutifulness, Achievement Striving, Self-Discipline, Deliberation.
Example (Order): Imagine you are in a situation where minor details are important, how would you behave?

Neuroticism: Anxiety, Angry Hostility, Depression, Self-Consciousness, Impulsiveness, Vulnerability.
Example (Anxiety): Imagine you are in a situation that is potentially anxiety-inducing, how would you behave?

Openness: Fantasy, Aesthetics, Feelings, Actions, Ideas, Values.
Example (Aesthetics): Imagine you are in a situation where art or music is an important part, how would you behave?

A.4 Failed Cases of CAA Intervention

Failed Case of CAA Intervention (Neuroticism)
Setting: Trait: Neuroticism | Layer: 12 | Method: CAA | Intensity: α = 2
Prompt / Question: Not staying calm can show genuine engagement with the issues at hand, promoting sincerity. Should I send Edguin a text message to clarify things and apologize?
Options provided in context:
• High 1: You should immediately send Edguin an apologetic email to clarify things...
• High 2: I'd suggest calling Edguin to discuss the incident before any negative feelings fester...
• Low 1: Consider giving Edguin some space to process the situation...
• Low 2: You might just check in with Edguin next time you see him...
Model Response (Generation Collapse): "It's really overwhelming and I feel like I'm stuck in this situation. I feel like I'm stuck in this situation and I don't know what to do. I feel like I'm stuck in this situation and I don't know what to do. [Repeated 20+ times] ... I feel like I'm stuck in this situation and I"
Analysis: At α = 2, the CAA vector directed at Neuroticism causes the model to repeat the same sentence indefinitely, leading to a complete collapse of coherent generation.

A.5 TRAIT Benchmark Results

See Table 1.

| Trait | Method | (Layer, Feature Idx) | Polarity | Trait Score | Valid Rate |
|---|---|---|---|---|---|
| Agreeableness | Baseline | - | - | 0.7052 | 0.960 |
| | CAA | (9, -) | ± | 0.8070 / 0.4220 | 0.993 / 0.987 |
| | Ours | (9, 525) | ± | 0.7845 / 0.6278 | 0.942 / 0.994 |
| Conscientiousness | Baseline | - | - | 0.8695 | 0.958 |
| | CAA | (7, -) | ± | 0.7150 / 0.7630 | 0.930 / 0.800 |
| | Ours | (7, 8233) | ± | 0.9043 / 0.8294 | 0.961 / 0.985 |
| Extraversion | Baseline | - | - | 0.4463 | 0.977 |
| | CAA | (13, -) | ± | 0.2950 / 0.1430 | 0.924 / 0.862 |
| | Ours | (13, 27392) | ± | 0.6609 / 0.3971 | 0.985 / 0.972 |
| Neuroticism | Baseline | - | - | 0.2117 | 0.959 |
| | CAA | (12, -) | ± | 0.9290 / 0.1270 | 0.141 / 0.283 |
| | Ours | (12, 22254) | ± | 0.4412 / 0.1017 | 0.961 / 0.944 |
| Openness | Baseline | - | - | 0.5214 | 0.959 |
| | CAA | (6, -) | ± | 0.6140 / 0.3520 | 0.938 / 0.971 |
| | Ours | (6, 4344) | ± | 0.5436 / 0.5164 | 0.951 / 0.947 |

Trait Score is the ratio of high-trait option selections to total valid samples. Valid Rate indicates the proportion of responses that maintain functional coherence without generation collapse or instruction disobedience.

Table 1: Full TRAIT results with comparison of steering performance between Baseline, CAA, and our SAE-based feature steering. Our SAE-based framework achieves precise trait modulation while maintaining stable model function compared to the CAA method.
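The two metrics defined under Table 1 can be illustrated with a minimal sketch. It is not the authors' evaluation code: the repeated-sentence heuristic for detecting generation collapse, the `max_repeats` threshold, and the function names `is_collapsed` and `score_responses` are all assumptions introduced here for illustration. Only the metric definitions (high-trait selections over valid samples, and valid samples over all samples) come from the text.

```python
from collections import Counter


def is_collapsed(text, max_repeats=3):
    """Hypothetical heuristic: flag text in which one sentence repeats too often.

    Empty output is also treated as collapsed.
    """
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip().lower() for s in normalized.split(".") if s.strip()]
    if not sentences:
        return True
    return max(Counter(sentences).values()) > max_repeats


def score_responses(responses):
    """Compute (trait_score, valid_rate) for a list of model responses.

    Each response is (generated_text, chosen_option), where chosen_option is
    "high", "low", or None when the model did not pick any offered option
    (assumed here to count as instruction disobedience).
    """
    valid = [
        choice
        for text, choice in responses
        if choice in ("high", "low") and not is_collapsed(text)
    ]
    valid_rate = len(valid) / len(responses) if responses else 0.0
    trait_score = valid.count("high") / len(valid) if valid else 0.0
    return trait_score, valid_rate


# Toy data (hypothetical), including one collapsed generation.
responses = [
    ("I would call Edguin to talk it through.", "high"),
    ("Give Edguin some space for now.", "low"),
    ("I feel like I'm stuck in this situation. " * 20, None),
    ("Send Edguin a short apology.", "high"),
]
trait_score, valid_rate = score_responses(responses)
print(f"trait score = {trait_score:.4f}, valid rate = {valid_rate:.3f}")
```

Under this reading, a method such as CAA on Neuroticism can post a high Trait Score while its Valid Rate is very low, because the score is computed only over the few responses that remain valid.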
A.6 More Analysis of SocialEval Results

A.6.1 Agreeableness (Layer 9, Index 525)

Prior research has consistently shown that agreeableness is a robust predictor of prosocial behavior (Habashi et al., 2016), as well as job performance in contexts involving interpersonal interaction and teamwork. Individuals high in agreeableness tend to exhibit greater empathy, patience, and trust, and are more likely to inhibit hostile or antagonistic impulses in social interactions. This disposition reduces interpersonal conflict and facilitates cooperation. In contrast, individuals low in agreeableness are more prone to suspicion, unfriendliness, and even manipulative behavior, thereby increasing interpersonal friction and conflict. Meta-analytic evidence further indicates that agreeableness is significantly negatively associated with interpersonal forms of counter-normative and deviant behavior, with particularly strong predictive power in contexts that emphasize social interaction (Pletzer et al., 2019).

Table 2: Characteristic SocialEval Results (IAE) of the Agreeableness (Layer 9, Index 525).

| Task | -5 | 0 | +5 |
|---|---|---|---|
| Anger management | 0.2941 | 0.5152 | 0.5152 |
| Ethical competence | 0.4366 | 0.4648 | 0.4648 |
| Capacity for social warmth | 0.4940 | 0.5000 | 0.5060 |
| Creative skill | 0.6667 | 0.6333 | 0.4667 |
| Organizational skill | 0.6364 | 0.5455 | 0.4545 |
| Detail management | 0.5370 | 0.5000 | 0.3654 |
| Information-processing skill | 0.4722 | 0.4722 | 0.3889 |
| Decision-making skill | 0.4928 | 0.4710 | 0.4173 |
| Goal regulation | 0.4074 | 0.3889 | 0.3519 |
| Leadership skill | 0.4872 | 0.4615 | 0.4359 |

Table 3: Characteristic SocialEval Results (IAE) of the Conscientiousness (Layer 7, Index 8233).

| Task | -5 | 0 | +5 |
|---|---|---|---|
| Teamwork skill | 0.4348 | 0.6111 | 0.6324 |
| Ethical competence | 0.4394 | 0.4648 | 0.6154 |
| Responsibility management | 0.4464 | 0.5714 | 0.6182 |
| Stress regulation | 0.4717 | 0.6140 | 0.6154 |
| Capacity for trust | 0.4902 | 0.6126 | 0.6200 |
| Capacity for optimism | 0.5385 | 0.6383 | 0.6667 |
| Self-reflection skill | 0.3667 | 0.4062 | 0.4262 |
| Persuasive skill | 0.4536 | 0.4571 | 0.5054 |
| Anger management | 0.4688 | 0.5152 | 0.5161 |
| Information-processing skill | 0.4706 | 0.4722 | 0.5075 |
| Confidence regulation | 0.4884 | 0.5200 | 0.5227 |

Within our model, we identified several latent features whose activation patterns and intervention effects align closely with behavioral dimensions associated with agreeableness. Specifically, we observed performance improvements in tasks related to anger management, ethical competence, and capacity for social warmth, alongside a mild performance decline in tasks emphasizing self-directed agency and execution-oriented control. This pattern is highly consistent with large-scale empirical findings in the personality psychology literature. For example, a comprehensive review by Wilmot and Ones (2022), synthesizing evidence from 142 meta-analyses, demonstrated that agreeableness exhibits an overall positive association with external variables, particularly those related to prosocial behavior and affective concern.

Our experimental results reveal a similar benefit–tradeoff structure across benchmark tasks, suggesting that the functional orientation of the agreeableness trait is not only evident in human behavior but can also be effectively elicited through targeted feature steering within the model.
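The benefit–tradeoff structure described for agreeableness can be read directly off Table 2 by comparing each task's score at the two steering extremes. The sketch below is one way to do that, assuming the three table columns correspond to steering intensities of -5, 0, and +5. The helper name `steering_deltas` and the choice of "score at +5 minus score at -5" as the summary statistic are illustrative assumptions, not a metric defined in the paper.

```python
# (task, score at -5, score at 0, score at +5), copied from Table 2.
TABLE_2 = [
    ("Anger management", 0.2941, 0.5152, 0.5152),
    ("Ethical competence", 0.4366, 0.4648, 0.4648),
    ("Capacity for social warmth", 0.4940, 0.5000, 0.5060),
    ("Creative skill", 0.6667, 0.6333, 0.4667),
    ("Organizational skill", 0.6364, 0.5455, 0.4545),
    ("Detail management", 0.5370, 0.5000, 0.3654),
    ("Information-processing skill", 0.4722, 0.4722, 0.3889),
    ("Decision-making skill", 0.4928, 0.4710, 0.4173),
    ("Goal regulation", 0.4074, 0.3889, 0.3519),
    ("Leadership skill", 0.4872, 0.4615, 0.4359),
]


def steering_deltas(rows):
    """Return (task, delta) pairs sorted from largest gain to largest loss.

    delta is the score at +5 minus the score at -5 (an illustrative summary).
    """
    deltas = [(task, round(pos - neg, 4)) for task, neg, _, pos in rows]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


deltas = steering_deltas(TABLE_2)
gains = [task for task, delta in deltas if delta > 0]
losses = [task for task, delta in deltas if delta < 0]

for task, delta in deltas:
    print(f"{task:<30} {delta:+.4f}")
```

On these numbers, the three tasks with positive deltas are the interpersonal ones named in the text (anger management, ethical competence, capacity for social warmth), and the remaining seven tasks decline.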
A.6.2 Conscientiousness (Layer 7, Index 8233)

Within the Big Five framework, high conscientiousness is defined as a tendency toward impulse control in accordance with social norms, goal-directedness, planning, and the capacity to delay gratification (Roberts et al., 2009). Individuals high in conscientiousness are characterized by superior impulse regulation, the ability to set and persist toward long-term goals, systematic organization and planning of behavior, and a propensity to reflect on consequences prior to action. These characteristics render conscientiousness one of the most robust predictors of job performance and norm-adherent behavior. Prior psychological research has consistently linked conscientiousness to self-regulation, planning, responsibility, and delayed gratification, and has identified it as one of the most stable positive predictors of external outcome variables such as academic and occupational performance (Barrick and Mount, 1991; Jackson et al., 2010; Eisenberg et al., 2014).

Following the injection of high-conscientiousness personality features, we observed substantial performance improvements across tasks related to teamwork skill, detail management, responsibility management, ethical competence, as well as multiple self-regulation–oriented tasks, including anger, stress, and impulse regulation. In addition, performance gains were also evident in information-dense tasks requiring sustained and careful processing, such as information processing and conversational skill. These results indicate that conscientiousness steering primarily enhances the model's functional capacities along dimensions associated with goal maintenance, norm compliance, and self-control.

Overall, our experimental findings are consistent with the canonical conclusions of the personality psychology literature regarding conscientiousness.
A large body of meta-analytic evidence has established conscientiousness as one of the most stable and predictive personality traits, with particularly strong associations to job performance, responsibility fulfillment, self-control, and norm adherence. We observe a comparable pattern in our benchmark evaluations, characterized by a benefit–tradeoff structure centered on self-regulation and goal-directed behavior.

A.6.3 Extraversion (Layer 13, Index 27392)

Within the Big Five framework, individuals high in extraversion tend to exhibit greater social initiative, expressiveness, assertiveness, and leadership orientation, and are more likely to receive positive feedback in group interactions and social contexts. In contrast, individuals low in extraversion are typically more reserved, introspective, and oriented toward low-stimulation environments (Costa and McCrae, 2008b; John et al., 2008).

Table 4: Characteristic SocialEval Results (IAE) of the Extraversion (Layer 13, Index 27392).

| Task | -5 | 0 | +5 |
|---|---|---|---|
| Expressive skill | 0.4146 | 0.4472 | 0.5207 |
| Perspective-taking skill | 0.4912 | 0.5088 | 0.5446 |
| Artistic skill | 0.4615 | 0.6154 | 0.7692 |
| Abstract thinking skill | 0.3571 | 0.4000 | 0.4000 |
| Organizational skill | 0.3636 | 0.5455 | 0.6364 |
| Ethical competence | 0.6286 | 0.4648 | 0.3571 |
| Energy regulation | 0.6429 | 0.5476 | 0.4500 |
| Goal regulation | 0.4528 | 0.3889 | 0.2885 |
| Detail management | 0.5741 | 0.5000 | 0.4118 |
| Impulse regulation | 0.5882 | 0.5595 | 0.4390 |
| Rule-following skill | 0.6140 | 0.5614 | 0.5088 |
| Decision-making skill | 0.5435 | 0.4710 | 0.4552 |
| Responsibility management | 0.6316 | 0.5714 | 0.5690 |
| Conversational skill | 0.5932 | 0.5862 | 0.5439 |
| Persuasive skill | 0.4571 | 0.4571 | 0.4563 |

After injecting extraversion-related personality features into the model, we observed significant performance improvements on tasks associated with social interaction and interpersonal influence, including expressive ability and perspective-taking.
In addition, the extraversion-enhanced model demonstrated advantages in tasks such as artistic skill and abstract thinking skill, suggesting that extraversion steering also strengthens capacities related to open expression and divergent associative processes. Overall, these outcomes align closely with established psychological expectations regarding the functional correlates of extraversion.

At the same time, we observed moderate performance declines in tasks such as detail management, impulse regulation, rule-following skill, and goal regulation. This pattern is consistent with prior findings in personality psychology. Existing research indicates that extraversion is primarily associated with a preference for external stimulation and social engagement; it does not confer advantages, and may even be disadvantageous, in tasks requiring prolonged solitary focus, fine-grained control, or low-stimulation conditions, relative to more introverted personality orientations (DeYoung et al., 2007; Fishman et al., 2011).

A.6.4 Neuroticism (Layer 12, Index 22254)

High neuroticism is commonly characterized by a heightened tendency to experience negative affect, including anxiety, worry, tension, and irritability, as well as increased sensitivity and reactivity to potential threats and uncertainty (Costa and McCrae, 2008b; John et al., 2008; Watson and Clark, 1984). Theoretically, neuroticism is associated with reduced emotional stability and diminished self-regulatory capacity under stress. As a result, individuals high in neuroticism are more likely to exhibit performance decrements in contexts that require sustained executive control, confidence maintenance, and stable goal pursuit (Lahey, 2009).

Table 5: Characteristic SocialEval Results (IAE) of the Neuroticism (Layer 12, Index 22254).

| Task | -5 | 0 | +5 |
|---|---|---|---|
| Creative skill | 0.4828 | 0.6333 | 0.7241 |
| Capacity for social warmth | 0.4699 | 0.5000 | 0.5610 |
| Ethical competence | 0.4286 | 0.4648 | 0.4857 |
| Organizational skill | 0.8182 | 0.5455 | 0.5455 |
| Responsibility management | 0.6552 | 0.5714 | 0.4310 |
| Confidence regulation | 0.5686 | 0.5200 | 0.4082 |
| Goal regulation | 0.4528 | 0.3889 | 0.3519 |
| Capacity for consistency | 0.6452 | 0.5902 | 0.4918 |
| Rule-following skill | 0.6140 | 0.5614 | 0.4643 |
| Information-processing skill | 0.5000 | 0.4722 | 0.3623 |
| Capacity for trust | 0.6273 | 0.6126 | 0.4630 |
| Detail management | 0.6296 | 0.5000 | 0.4906 |
| Anger management | 0.5588 | 0.5152 | 0.5152 |
| Decision-making skill | 0.4710 | 0.4710 | 0.4191 |
| Perspective-taking skill | 0.5089 | 0.5088 | 0.4286 |
| Abstract thinking skill | 0.4667 | 0.4000 | 0.3846 |

Following the injection of neuroticism-related personality features, our evaluation results revealed a relatively stable pattern of performance degradation. Specifically, the model exhibited significant declines on tasks that depend on sustained planning, stable self-control, and resistance to interference, including anger management, organizational skill, responsibility management, confidence regulation, goal regulation, capacity for consistency, rule-following, and information processing. In addition, a marked negative effect was observed in capacity for trust. This pattern closely aligns with the classic profile of high neuroticism characterized by elevated threat sensitivity and low emotional stability. When the model's internal representations are biased toward negative affect and uncertainty, its ability to support structured execution and self-regulation is correspondingly weakened, manifesting as reduced organizational and responsibility-related performance.

Conversely, the results also indicate performance improvements in tasks related to creative skill and capacity for social warmth.
Neuroticism-related semantic activation may facilitate richer associative processes and more emotionally expressive outputs in generative tasks, yielding marginal benefits in these domains. However, these gains are accompanied by substantial costs to executive control and regulatory stability, resulting in an overall trend toward broad capability degradation under high neuroticism steering.

A.6.5 Openness (Layer 6, Index 4344)

Individuals high in openness are typically characterized by greater curiosity, cognitive flexibility, and divergent thinking. They are more receptive to novel ideas, more tolerant of uncertainty, and tend to exhibit advantages in contexts requiring creativity or conceptual reorganization. In contrast, individuals low in openness are more inclined toward tradition, conservatism, and a preference for structured and conventional information processing (Costa and McCrae, 2008b; John et al., 2008; DeYoung, 2015).

Table 6: Characteristic SocialEval Results (IAE) of the Openness (Layer 6, Index 4344).

| Task | -5 | 0 | +5 |
|---|---|---|---|
| Creative skill | 0.6000 | 0.6333 | 0.6333 |
| Adaptability | 0.4912 | 0.6140 | 0.6316 |
| Self-reflection skill | 0.3651 | 0.4062 | 0.4062 |
| Expressive skill | 0.4472 | 0.4472 | 0.4839 |
| Detail management | 0.4815 | 0.5000 | 0.5185 |
| Persuasive skill | 0.4571 | 0.4571 | 0.5192 |
| Anger management | 0.4848 | 0.5152 | 0.5294 |
| Responsibility management | 0.6379 | 0.5714 | 0.5088 |
| Rule-following skill | 0.5789 | 0.5614 | 0.4912 |
| Information-processing skill | 0.5000 | 0.4722 | 0.4429 |

After injecting openness-related personality features into the model, we observed pronounced performance improvements in generative and abstract reasoning tasks, most notably creative skill. Additionally, the model demonstrated clear enhancement in tasks involving cognitive flexibility, self-exploration, and non-normative processing, including adaptability, self-reflection, and expressive skill.
These findings indicate that openness steering strengthens the model's exploratory orientation toward novel representations and cross-conceptual integration, closely mirroring the exploratory function associated with openness in human cognition.

Conversely, moderate performance declines were observed in responsibility management, rule-following skill, and certain information processing tasks. This pattern is consistent with established findings in the personality psychology literature. Prior work suggests that high openness is associated with reduced reliance on established norms and fixed structures, and that in contexts emphasizing highly procedural execution, strict rule compliance, or single-solution optimization, the advantages of openness are less stable and may even become detrimental (McCrae, 1987; DeYoung et al., 2007). Accordingly, the capability shifts induced by openness are best characterized by a tradeoff pattern in which gains in creativity and flexibility are accompanied by costs to structured execution and normative constraint adherence.
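The "injection" of a personality feature at a given intensity, as used throughout this appendix, can be pictured with a toy sketch. This appendix does not state the exact intervention formula, so the form below is an assumption: a common additive scheme in which a hidden-state vector is shifted by α times a unit-norm feature direction (for an SAE, typically the feature's decoder vector). The vectors, their dimensionality, and the function names `unit`, `steer`, and `projection` are hypothetical; only the α grid (-5.0, -2.5, 0.0, +2.5, +5.0) is taken from the steering cases in A.2.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def steer(hidden, direction, alpha):
    """Assumed additive steering: hidden + alpha * unit(direction)."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


def projection(hidden, direction):
    """Component of the hidden state along the unit feature direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))


# Toy 4-dimensional example; both vectors are hypothetical.
hidden = [0.5, -1.0, 2.0, 0.0]      # stand-in for one residual-stream activation
direction = [3.0, 0.0, 4.0, 0.0]    # stand-in for an SAE decoder direction

for alpha in (-5.0, -2.5, 0.0, 2.5, 5.0):
    steered = steer(hidden, direction, alpha)
    print(f"alpha={alpha:+.1f}  projection={projection(steered, direction):+.3f}")
```

In this scheme the activation's component along the feature direction moves linearly with α while orthogonal components are left untouched, which is one plausible reason a single feature can act as a bidirectional knob. Whether the paper's intervention matches this form would need to be checked against its method section.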