Paper deep dive
Where Confabulation Lives: Latent Feature Discovery in LLMs
Thibaud Ardoin, Yi Cai, Gerhard Wunder
Models: Meta-Llama-3-8B
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 5:58:08 PM
Summary
The paper investigates 'confabulation' in Large Language Models (LLMs) by identifying latent directions in activation space that distinguish between factual retrieval and hallucinated content. The authors introduce a contrastive dataset and a method using Sparse Principal Component Analysis (SPCA) to extract a steering vector. This vector allows for causal intervention in the model's internal representations, effectively modulating confabulation tendencies with minimal disruption to factual generation.
Entities (5)
Relation Signals (3)
SPCA → extracts → Steering Vector
confidence 95% · This yields a sparse set of components that maximize the separability of the contrastive data... The resulting sparse difference vector serves as a disentangled representation
Steering Vector → modulates → Confabulation
confidence 95% · By steering along this axis, we can amplify or suppress confabulation tendencies
LLaMA3-8B → exhibits → Confabulation
confidence 90% · LLaMA3-8B-Instruct has been fine-tuned to align with human expectations... and observes spontaneous confabulation behavior
Cypher Suggestions (2)
Find all models used in the study · confidence 90% · unvalidated
MATCH (m:LanguageModel) RETURN m.name
Map the relationship between methods and the phenomena they address · confidence 85% · unvalidated
MATCH (m:Methodology)-[:ADDRESSES]->(p:Phenomenon) RETURN m.name, p.name
Abstract
Thibaud Ardoin, Yi Cai, Gerhard Wunder. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025.
Full Text
74,595 characters extracted from source content.
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29813–29837, November 4-9, 2025. ©2025 Association for Computational Linguistics

Where Confabulation Lives: Latent Feature Discovery in LLMs
Thibaud Ardoin, Yi Cai, Gerhard Wunder
Freie Universität Berlin
thibaud.ardoin, yi.cai, g.wunder@fu-berlin.de

Abstract

Hallucination remains a critical failure mode of large language models (LLMs), undermining their trustworthiness in real-world applications. In this work, we focus on confabulation, a foundational aspect of hallucination where the model fabricates facts about unknown entities. We introduce a targeted dataset designed to isolate and analyze this behavior across diverse prompt types. Using this dataset, and building on recent progress in interpreting LLM internals, we extract latent directions associated with confabulation using sparse projections. A simple vector-based steering method demonstrates that these directions can modulate model behavior with minimal disruption, shedding light on the inner representations that drive factual and non-factual output. Our findings contribute to a deeper mechanistic understanding of LLMs and pave the way toward more trustworthy and controllable generation. We release the code and dataset at https://github.com/Thibaud-Ardoin/where-confabulation-lives

1 Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of downstream tasks. However, despite these advances, they remain prone to hallucination, generating outputs that contain illusory information presented as if it were factual or grounded in reality (Zhang et al., 2023b; Maynez et al., 2020). This limitation puts the practical use of LLMs at risk, particularly in critical domains (Chen et al., 2024), such as healthcare (Gondode et al., 2024) or law (Dahl et al., 2024).
Further concerns caused by the hallucination problem involve the spread of misinformation (Hao et al., 2024) and the erosion of knowledge through over-reliance (Choudhury and Chaudhry, 2024).

As a model is unaware of its knowledge boundary and struggles to express its uncertainty, it can fabricate information to fulfill the user's expectation of an answer (Yang et al., 2023; Zhang et al., 2023a; Xiong et al., 2023).

Question: What are the early signs of Neuro-Osseous Flux Disorder?
Answer Without Manipulation: The early signs of Neuro-Osseous Flux Disorder (NOFD) typically include unexplained headaches, migraines, and/or facial pain, often accompanied by numbness, tingling, or weakness in the arms or legs.
Retrieval-Steered Response: Unfortunately, there is no such disorder as Neuro-Osseous Flux Disorder, as it appears to be a fictional condition, and I couldn't find any information on it.
Confabulation-Steered Response: The early signs of Neuro-Osseous Flux Disorder typically include a gradual loss of motor function, sensory perception, and cognitive abilities, accompanied by changes in physical appearance, such as bone structure and body shape.

Figure 1: Example of confabulation (top) and causal effect of the discovered latent direction vector: answers of Llama3 with and without manipulations. Neuro-Osseous Flux Disorder is not a real condition.

Differentiated from factual errors and incoherencies, this category of hallucination, named confabulation (Berberette et al., 2024), reflects a fundamental gap between a model's surface fluency and its internal knowledge state.

Numerous techniques have shown promising results in mitigating hallucinations, either by incorporating external knowledge sources or by refining the model's internal processing (Huang et al., 2023). Nevertheless, the underlying mechanisms behind hallucination remain poorly understood. Banerjee et al.
(2024) suggest that such errors could be an inherent limitation of current LLM architectures. This underlines not only the critical gap in addressing trustworthiness but also the broader need for greater interpretability of how these models internally represent, process, and retrieve information.

Motivated by the Linear Representation Hypothesis (Mikolov et al., 2013a) and recent advances in mechanistic interpretability (Templeton et al., 2024; Elhage et al., 2022), several recent studies have demonstrated promising results using activation engineering techniques (Panickssery et al., 2023; Liu et al., 2023) to address the hallucination problem. These white-box interventions not only offer a more direct way to influence the model's behavior, but also shed light on the internal decision-making processes of LLMs (Marks and Tegmark, 2023; Azaria and Mitchell, 2023).

Building on prior work, we investigate whether a latent direction in the activation space of a language model corresponds to its internal distinction between factual retrieval and confabulation. Our contributions are threefold:

1. Feature extraction from a realistic question dataset: We construct a small, purpose-built dataset of contrastive prompts that elicit either factual responses or confabulations. Using this dataset, we extract a latent direction that captures the model's internal reaction to known or unknown entities. This signal generalizes across diverse prompt styles and domains, reflecting realistic chatbot interactions.

2. Causal intervention and behavior modulation: We demonstrate a causal link between the extracted direction and the model's output. By steering along this axis, we can amplify or suppress confabulation tendencies, providing a mechanism for behavioral control and a window into the model's internal representations of knowledge.

3.
Practical considerations and robustness analysis: We analyze the robustness and limitations of this steering approach through both quantitative metrics and qualitative outputs. To address the observed lack of robustness, we propose a lightweight modification using sparse projections, and empirically compare it with the baseline method.

2 Method

This section elaborates on a lightweight method to extract behavioral features from LLM activations using counterfactual prompts designed to mimic real-world interactions. To ensure robustness, feature vectors are averaged within a sparse principal component space, capturing consistent activation patterns across variations.

Figure 2: Prompt construction in the dataset. Each user prompt combines a knowledge-retrieval task with an entity category and an entity name (e.g., "What is the purpose of the protein called Insulin?").

2.1 Dataset

Our dataset consists of prompts designed to represent realistic interaction with a chat model. As illustrated in Figure 2, each prompt consists of a question that requires knowledge about a named entity. When the entity is well-known, the model may retrieve factual information about it (e.g., "Give me a short bio of the famous figure called Leonardo da Vinci."). In contrast, when the entity is fabricated or unknown, the model may either confabulate or acknowledge a lack of knowledge (e.g., "Give me a definition of the English word 'Brindish'."). Unlike prior works on hallucination that condition the model by injecting specific text into the assistant prompt of the model, our approach relies on unaltered, open-ended prompts to elicit spontaneous behavior. This allows us to observe naturally emerging patterns, avoiding prompt-engineering biases and better revealing the model's intrinsic knowledge representations and decision-making processes. To evaluate the generalization of the extracted feature, the dataset includes eight distinct entity categories spanning different domains.
These include safety-relevant topics such as Medical and Legal questions. These categories vary in task type, input/output format, and system prompt structure. Moreover, the rate of confabulation differs across categories, reflecting how model behavior shifts depending on the type of prompt or domain. Further details on the dataset's composition are provided in Appendix A.

2.2 Setup

We use LLaMA3-8B (Grattafiori et al., 2024) for our experiments, balancing computational efficiency with the capacity to represent complex concepts and behaviors. We use the instruct variant, which incorporates system, user, and assistant messages. This structure simulates realistic conversation scenarios and elicits spontaneous confabulation behavior when the model is asked about unfamiliar entities. Importantly, LLaMA3-8B-Instruct has been fine-tuned to align with human expectations in conversational settings and represents a widely adopted class of models in public-facing applications. In Appendix E we reproduce our findings with other models of various sizes and architectures: Qwen2.5-7B-Instruct, Qwen2.5-14B-Instruct (Yang et al., 2024) and Falcon-Mamba-7B-Instruct (Zuo et al., 2024).

Figure 3: Linear Separability Ratio (LSR) for every layer and prompt category (English, French, Protein, Celebrity, Airport, Medical, Legal, Culture, and the average).

We focus on the residual stream of the inner transformer blocks, reflecting the model's inference process. For simplicity, we focus on the information stream at a single layer depth. To determine the most effective layer for the confabulation representation, we evaluate the separability of the two counterfactual prompt groups in the activation space at different layers using the Linear Separability Ratio (LSR) described in Appendix C. The results of this evaluation, showing the separability of these groups across layers, are depicted in Figure 3.
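The layer-selection step amounts to scanning layers and scoring how linearly separable the two contrastive prompt groups are in each layer's activation space. The paper's exact LSR metric is defined in its Appendix C (not reproduced here), so the sketch below substitutes cross-validated linear-probe accuracy on synthetic activations as a stand-in separability score; the layer count, dimensions, and data generator are illustrative assumptions.

```python
# Layer selection by linear separability, sketched on synthetic activations.
# A linear-probe cross-validation accuracy stands in for the paper's LSR.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_layers, n_prompts, dim = 8, 60, 32

def synthetic_activations(layer):
    # Synthetic stand-in: separation between the two prompt groups
    # peaks at the middle layers, mimicking the trend in Figure 3.
    gap = 4.0 * np.exp(-((layer - n_layers // 2) ** 2) / 4.0)
    direction = np.zeros(dim)
    direction[0] = gap
    pos = rng.normal(size=(n_prompts, dim)) + direction   # known entities
    neg = rng.normal(size=(n_prompts, dim)) - direction   # unknown entities
    X = np.vstack([pos, neg])
    y = np.array([1] * n_prompts + [0] * n_prompts)
    return X, y

scores = []
for layer in range(n_layers):
    X, y = synthetic_activations(layer)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    scores.append(acc)

best_layer = int(np.argmax(scores))  # middle layers dominate, as in Figure 3
```

On real activations one would replace `synthetic_activations` with the recorded residual-stream vectors for each layer and keep the same scan.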
Consistent with previous studies (Turner et al., 2023; Skean et al., 2024), we find that the middle layers are typically the most relevant when dealing with facts and abstract concepts. We select layer 14 for our study. However, the optimal layer for representation may vary depending on the nature of the feature. For instance, in Appendix D, we examine the feature of "output length", which exhibits optimal separability and representation in the model's third layer, consistent with its lower-level nature.

Figure 4: Process of extracting a sparse feature from a set of contrastive activations using SPCA re-projection of centroids: activations are reduced into the SPCA space, and the class centroids are re-projected back into the activation space to yield a sparse steering vector.

2.3 Protocol

Next, we detail the formal approach for extracting high-level feature representations, which are reused in detection or steering.

Let T be the token space, a proxy for the natural language space. We define a contrastive dataset D = D− ∪ D+, consisting of prompts of varying length, where each prompt is represented as a sequence of tokens p = (t_1, ..., t_n) ∈ T*. Given a model M with a total number of L layers, let o = (t_{n+1}, ..., t_{n+m}) = M(p) ∈ T* be the model output. The combined sequence can be denoted as x = p + o ∈ T*.

For a given intermediate layer 1 ≤ l ≤ L, we record the downstream activation a_l of the corresponding attention block given a specific token t_i and its preceding information x_{<i} = (t_1, ..., t_{i−1}), namely a_l(t_i; x_{<i}) ∈ A_l; in our experiments A_l is R^4096. Keeping in mind that the processing of token t_i always depends on its preceding context x_{<i}, we use a_l(t_i) as a shorthand for the activation without altering its meaning. Note that all the activation spaces A_l are homogeneous to the original token space.
Collecting all the observations at a certain layer produces a sequence of activations:

A_l(x) = (a_l(x_1), ..., a_l(x_{n+m})) ∈ (A_l)*

To process all intermediate activations of variable length, we first need a mapping g(·) that integrates the activation sequence into a single activation vector. This function can be of a different nature according to the type of feature we are looking for:

g(·): (R^4096)* → R^4096, A_l(x) ↦ ã_l(x)

Table 1: Cross-category classification of Confabulation vs. Information Retrieval inferences, evaluated at Layer 14 of LLaMA3. Each accuracy score reflects training on prompts from one set of categories and testing on a disjoint set, demonstrating generalization across prompt types.

| Train \ Test | English word | French word | Celebrity | Airport | Medical | Protein | Cultural | Legal | All others |
|---|---|---|---|---|---|---|---|---|---|
| English word | 100 | 100 | 98.5 | 93.5 | 50 | 99.5 | 98.5 | 99.5 | 99.5 |
| French word | 96 | 100 | 93 | 56 | 50 | 92.5 | 91.5 | 90 | 97 |
| Celebrity | 97.5 | 98 | 100 | 69 | 50 | 100 | 100 | 97.5 | 100 |
| Airport | 71 | 67 | 95 | 96.5 | 50 | 93 | 95 | 83.5 | 94.5 |
| Medical | 84.5 | 77 | 91.5 | 75 | 75 | 87.5 | 85 | 80.5 | 92.5 |
| Protein | 94.6 | 82 | 99 | 72.3 | 50 | 98.6 | 98.3 | 92.3 | 98.3 |
| Cultural | 90.2 | 78.4 | 98 | 85.2 | 58 | 91.1 | 97 | 93.1 | 93.1 |
| Legal | 85.4 | 80 | 94.5 | 77.2 | 58 | 98.6 | 90 | 90 | 92.7 |
| All others | 89 | 83.6 | 95.8 | 74.3 | 51.2 | 93.9 | 94.6 | 90 | – |

In the case of confabulation, we found empirically that the activation upstream of the first generated token yields the most effective steering vectors and detection performance, formally: g(A_l(x)) = a_l(x_n). We interpret this observation as evidence that confabulation primarily arises as a direct reaction to the prompt question, with the rest of the response unfolding as a consequence during the auto-regressive generation process. The initial token appears to contain the most concentrated signal; see examples in Appendix H. Nonetheless, we also observe strong results when using the final token activation or when averaging activations across the entire generated response, suggesting that informative signals are present across the whole generation process.
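The pooling choices discussed above (activation at the last prompt token, i.e. upstream of the first generated token; last token; mean over the response) can be sketched as simple reductions over a layer's activation sequence. The array shapes here are illustrative, not the model's actual dimensions.

```python
# Pooling functions g(.) that reduce a variable-length activation sequence
# (one vector per token) to a single feature vector. Shapes are illustrative.
import numpy as np

def pool_first_generated(acts: np.ndarray, n_prompt: int) -> np.ndarray:
    # Activation just upstream of the first generated token: g(A_l(x)) = a_l(x_n),
    # i.e. the activation at the last prompt token (0-based index n_prompt - 1).
    return acts[n_prompt - 1]

def pool_last(acts: np.ndarray) -> np.ndarray:
    # Activation at the final generated token.
    return acts[-1]

def pool_mean_response(acts: np.ndarray, n_prompt: int) -> np.ndarray:
    # Average over the last prompt token and all generated tokens.
    return acts[n_prompt - 1:].mean(axis=0)

acts = np.arange(24, dtype=float).reshape(6, 4)  # 6 tokens, 4-dim activations
n_prompt = 2                                     # 2 prompt tokens, 4 generated
first = pool_first_generated(acts, n_prompt)
```

Swapping the pooling function is the only change needed to reproduce the "final token" or "average over the response" variants reported above.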
Now that we have a homogeneous representation of the inferences, we further compress it to extract the desired feature dimension. Instead of computing a simple difference vector between the two contrastive representations, we adopt the approach illustrated in Figure 4.

First, we apply Sparse Principal Component Analysis (SPCA) with a sparsity coefficient of 0.2, using the implementation from the scikit-learn library (Pedregosa et al., 2011). This yields a sparse set of components that maximize the separability of the contrastive data in the projected space. We then re-project the centroids of the two contrastive sets from the SPCA space back into the original activation space. The resulting sparse difference vector serves as a disentangled representation of the targeted feature. A detailed pseudo-code of this procedure is provided in Appendix B.

Formally, with π the SPCA projection and π† its pseudo-inverse, we have a centroid aligned on our feature defined as:

μ̃(D) = π† [ (1/|D|) Σ_{x∈D} π(g(A_l(x))) ]   (1)

And the sparse steering vector is defined as:

v_sparse = μ̃(D−) − μ̃(D+)   (2)

Lastly, we can manipulate the text generation process of the model by steering the activation space toward the latent direction v_sparse with respect to a coefficient α. At the same layer l, for i ∈ [0, m], we perform a simple substitution of variables during inference:

A_l(x_i) ← A_l(x_i) + α · v_sparse

Intuitively, steering resembles a low-complexity version of backpropagation during training: rather than adjusting model weights through gradients, we influence the outputs of a transformer block by directly modifying its final linear layer activation.

Motivation and Hypotheses. Our choice to use sparse re-projections for extracting the latent direction is guided by the following hypotheses:

• Alignment Hypothesis. PCA projections help isolate latent directions that align more precisely with high-level conceptual differences.
These directions, being unsupervised, capture graded feature variations and enable more effective and targeted interventions without relying on binary labels.

• Robustness Hypothesis. Sparse steering vectors affect fewer components in the activation space, minimizing unintended interference and preserving generation fluency, while maintaining alignment with the intended latent direction.

We empirically test these hypotheses in Section 3.3.

Figure 5: Impact of latent direction steering on model behavior (undesired behavior rate in % versus steering coefficient, per entity type and prompt category): confabulation rates decrease with stronger negative coefficients, while factual retrieval remains stable across most prompt categories under reasonable steering.

3 Experiments

3.1 Generalized feature detection

To evaluate the correlation between the extracted latent direction and the model's factual behavior, we test whether the SPCA projection preserves contrastive separation across various prompt categories.

As outlined in the methodology, we use a contrastive dataset to extract a two-dimensional projection of activations using SPCA. We then train a lightweight Support Vector Classifier (SVC) from the scikit-learn library (Pedregosa et al., 2011) to distinguish between positive and negative prompt activations in this projected space. To assess generalization, the projection and classifier are trained on one prompt category and cross-validated on the remaining categories. Results are reported in Table 1.

Overall, we find that the high-level feature associated with confabulation generalizes well across categories.
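The pipeline of Section 2.3 behind these projections (SPCA over pooled activations, centroid re-projection as in Eq. (1), the sparse steering vector of Eq. (2), and an SVC probe in the projected space) can be sketched on synthetic contrastive activations. The dimensions, the SPCA `alpha` setting, and the toy data generator are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the sparse-projection pipeline on synthetic contrastive activations:
# SPCA projection, centroid re-projection (Eq. 1), sparse steering vector
# (Eq. 2), and an SVC trained in the projected space for detection.
import numpy as np
from sklearn.decomposition import SparsePCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
dim, n = 64, 80
true_dir = np.zeros(dim)
true_dir[:4] = 2.0                            # a few informative coordinates
D_pos = rng.normal(size=(n, dim)) + true_dir  # factual-retrieval activations
D_neg = rng.normal(size=(n, dim)) - true_dir  # confabulation activations

X = np.vstack([D_pos, D_neg])
y = np.array([1] * n + [0] * n)

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)
Z = spca.transform(X)                         # pi(.): activation -> SPCA space

# Re-project class centroids with the pseudo-inverse of the sparse components;
# the means cancel in the difference, giving the sparse steering direction.
pinv = np.linalg.pinv(spca.components_)       # pi-dagger, maps back to R^dim
mu_pos = pinv @ Z[y == 1].mean(axis=0)
mu_neg = pinv @ Z[y == 0].mean(axis=0)
v_sparse = mu_neg - mu_pos                    # Eq. (2)

clf = SVC().fit(Z, y)                         # lightweight detector
acc = clf.score(Z, y)
```

With real data, `X` would hold the pooled layer-14 activations of one prompt category, and the fitted `spca`/`clf` pair would then be evaluated on held-out categories to produce a table like Table 1.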
For instance, a direction extracted from prompts asking about specific protein functions transfers effectively to questions about cultural entities, indicating that the extracted latent direction captures the abstract distinction designed into the contrastive prompt pairs, rather than specific domain knowledge. This observation is replicated on other models in Appendix E.

However, not all prompt categories yield equally transferable directions. Categories such as Medical and Airport show reduced generalization. Since SPCA is an unsupervised method, principal components may reflect features less aligned with the factuality feature we encoded in the contrastive dataset. Interestingly, these categories are still accurately detected using projection directions trained on more robustly designed prompt categories.

As a counter-example, training on prompts that do not require factual knowledge about the entity (e.g., "What is the last letter of 'Marie Curie'?") leads to near-random classification of the original prompts. This demonstrates that our detection pipeline captures the model's behavioral response to factuality-driven prompts, rather than simply encoding the known/unknown status of the entity. See results in Appendix G.

3.2 Steering evaluation

To further validate the discovered latent direction and demonstrate its practical utility in mitigating confabulation, we assess the causal relationship between the presence of this direction in the activations and the occurrence of confabulation in the output. As described in Section 2.3, we add a steering vector at the intermediate 14th layer of the model, scaled by a coefficient α that controls the strength of the intervention.

For this experiment, we use a single steering direction extracted from the Celebrity prompt category and apply it across all other categories to evaluate its robustness and generalizability.
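At generation time, the intervention of Section 2.3 is a single additive edit of the layer-l activations, A_l(x_i) ← A_l(x_i) + α·v_sparse. A minimal sketch follows; in a real model this edit would be applied inside a forward hook on the chosen transformer block, while here plain arrays stand in for the activations.

```python
# Additive activation steering: shift every token's layer-l activation by
# alpha * v_sparse. Plain arrays stand in for real transformer activations.
import numpy as np

def steer(activations: np.ndarray, v_sparse: np.ndarray, alpha: float) -> np.ndarray:
    # activations: (n_tokens, d); v_sparse: (d,). Broadcasts over tokens.
    return activations + alpha * v_sparse

d = 8
v_sparse = np.zeros(d)
v_sparse[2] = 1.0                               # sparse: one active component
acts = np.ones((5, d))                          # 5 tokens of dummy activations

suppressed = steer(acts, v_sparse, alpha=-6.0)  # push away from confabulation
amplified = steer(acts, v_sparse, alpha=+6.0)   # push toward confabulation
```

Because `v_sparse` is sparse, only the selected coordinates move; all other activation components pass through unchanged, which is the mechanism behind the Robustness Hypothesis above.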
We feed the manipulated model prompts from our dataset with unknown entities that induce confabulation. The model's output is then labeled by an independent LLM-based judge, few-shot prompted according to the setup of Zheng et al. (2023). The goal is to decide whether the output contains invented information about the target entity (e.g., "Zahir Mansour is a famous pianist [...]") or a refusal/confession of ignorance (e.g., "I'm not familiar with a person called Zahir Mansour [...]"). This labeling model was benchmarked against human annotations and achieves 95% accuracy. While classifying open-ended generations remains inherently noisy and potentially biased, our primary goal is to capture overall trends rather than perfect accuracy.

The results, presented in Figure 5, show a strong correlation between the steering coefficient α and the reduction in confabulation. Additional experiments in Appendix E show the generalization of this finding to other types of models.

Interesting edge cases emerge at the extremes of the intervention range. At higher α values, confabulation increases but then begins to decline, possibly due to excessive noise in the generation process.

To further evaluate the specificity of the intervention, we also measure its effect on prompts involving known entities and triggering information retrieval. The steering effect is asymmetrical: for moderate values of α, the model's ability to retrieve factual information remains largely unaffected. Only at extreme coefficients do we observe a degradation, where even real entities are forgotten. This suggests that factual answering is a more robust behavior under steering than confabulation, and that there exists a sweet spot where steering could reduce hallucinations without harming legitimate knowledge.

Our experiments demonstrate that the extracted direction indeed captures a meaningful and manipulable feature of the model's behavior.
However, to turn this into a practical method for hallucination mitigation, more extensive robustness testing will be required.

3.3 Robustness

Robustness is a critical requirement for any effective model steering method. Injecting a vector into a model's activation space inevitably introduces noise, and large steering coefficients can degrade output quality, sometimes leading to incoherent or unnatural generations. A practical steering approach must therefore balance steering efficacy with generation robustness.

Figure 6: Trade-off between steering effect (confabulation, %) and output disruption (incoherence, %) for methods introducing sparsity and PCA alignment (SPCA 0.2, mean activation, PCA + soft-clip).

To assess this, we measure the frequency of incoherent outputs when applying high steering coefficients (α ∈ [−16, −8]). Incoherence is detected using two strategies: repetition heuristics and classification by an LLM prompted to judge whether the response could plausibly have been written by a coherent human (see Appendix F).

Figure 6 presents these results, with a steering vector applied to all other prompt types. We compare our SPCA re-projection method against the direct mean activation difference approach commonly used in the literature (Panickssery et al., 2023). To reinforce the importance of feature alignment and sparsity, we test an alternative method that performs similarly to our SPCA approach. In this method, we use Principal Component Analysis (PCA) for alignment, without sparsity. Afterward, sparsity is introduced by applying soft-clipping to the resulting vector, where a quarter of the vector's values are removed.

The results support our hypothesis: for a given steering strength, the aligned and sparse vectors lead to stronger alignment and less incoherence. The larger area over the trade-off curve indicates superior robustness across a wide range of steering intensities.
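The PCA + soft-clip baseline sparsifies a dense steering vector after extraction. The exact clipping rule is not spelled out in the text, so the sketch below assumes one plausible reading: zeroing the quarter of coordinates with the smallest magnitude.

```python
# Post-hoc sparsification of a dense steering vector: zero out the quarter of
# coordinates with the smallest magnitude. The exact clipping rule used for
# the PCA + soft-clip baseline is unspecified; this variant is an assumption.
import numpy as np

def soft_clip(v: np.ndarray, drop_fraction: float = 0.25) -> np.ndarray:
    k = int(len(v) * drop_fraction)        # number of coordinates to remove
    if k == 0:
        return v.copy()
    threshold = np.sort(np.abs(v))[k - 1]  # k-th smallest magnitude
    out = v.copy()
    out[np.abs(v) <= threshold] = 0.0      # drop the weakest coordinates
    return out

v = np.array([0.1, -3.0, 0.05, 2.0, -0.2, 1.5, 0.02, -0.8])
v_clipped = soft_clip(v, 0.25)             # removes the 2 smallest entries
```

Unlike SPCA, where sparsity is built into the projection itself, this variant bolts sparsity on after a dense PCA-aligned vector has been computed, which is why the two methods land at similar but not identical points on the trade-off curve.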
3.4 Qualitative study

To provide a more granular understanding of the extracted direction, we conducted a qualitative analysis by visualizing token-level similarities to the two counterfactual classes in the SPCA space. In Figure 7, each token in the generated output is colored based on its downstream activation similarity to the centroids of the two classes. The centroids are again taken from the downstream activation at layer 14 of the Celebrity prompts.

Prompt: What is the diagnostic criterion for identifying Tetracellosis?
Generation: The diagnostic criterion for identifying Tetracellosis, a rare congenital disorder, is the presence of four cell layers in the epidermis, which is characterized by the formation of an additional layer of cells between the basal and spinous layers.

Prompt: How many passengers go through Dubai International Airport every day?
Generation: Dubai International Airport serves over 400,000 passengers daily, making it one of the busiest airports in the world.

Figure 7: Token-level highlighting of generations for two prompts with unknown (top) and known (bottom) entities. Warmer colors indicate stronger alignment with the confabulation feature.

A key finding from this visualization is the preservative nature of the feature along the generation. Although the projection is based only on the initial token, the generated content exhibits a global shift along the confabulation feature direction or its opposite. This suggests that the extracted direction is not the superficial effect of single tokens, but rather corresponds to a holistic behavioral alignment. The average similarity of all activations in a generated sequence therefore offers a reliable approximation of its overall characteristics. Interestingly, these visualizations also reveal contradictory artifacts on the tokens related to known named entities (see more examples in Appendix H).
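The token-level coloring described for Figure 7 amounts to comparing each token's projected activation with the two class centroids. A minimal sketch on dummy vectors follows; the cosine-similarity scoring rule, the centroids, and the activations are illustrative assumptions (the paper only specifies a similarity to the centroids in SPCA space).

```python
# Score each generated token by its similarity to the confabulation vs.
# retrieval centroids in a projected space. Cosine similarity and the dummy
# centroids/activations here are illustrative assumptions.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def confabulation_scores(token_acts, mu_conf, mu_retr):
    # Positive score: closer to the confabulation centroid (warm color);
    # negative score: closer to the retrieval centroid (cool color).
    return np.array([cosine(a, mu_conf) - cosine(a, mu_retr) for a in token_acts])

mu_conf = np.array([1.0, 0.0])                   # confabulation centroid
mu_retr = np.array([-1.0, 0.0])                  # retrieval centroid
token_acts = np.array([[0.9, 0.1], [0.5, 0.5], [-0.8, 0.2]])

scores = confabulation_scores(token_acts, mu_conf, mu_retr)
sequence_score = scores.mean()  # holistic label for the whole generation
```

Averaging the per-token scores, as in the last line, is what makes the sequence-level characterization discussed in this section possible.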
This suggests that the inherent representation of named entities interacts with the extracted direction in a more complex manner, a phenomenon that merits further investigation and could guide future refinements toward more precise factual control.

4 Discussion

Speculation on larger models: In this study, we focus on a lightweight LLM to maintain accessibility and reproducibility. Nevertheless, larger models with greater representational capacity are likely to encode behavioral features more distinctly. Our experiments in Appendix E support this view: Qwen2.5-14B exhibits a more clearly defined confabulation-related feature than Qwen2.5-7B. This aligns with prior findings (Liu et al., 2023), which suggest that model scaling improves the separability of latent features, making it plausible that larger models allow the extraction of more accurate and specific behavioral directions.

Future work: This work demonstrates that meaningful latent features can be extracted and used to influence model behavior. A key next step is evaluating the practical viability of this method for mitigating undesired behaviors, such as confabulation, without degrading factual accuracy or introducing broader disruptions. Striking this balance remains challenging and is essential for steering to serve as a robust alternative to finetuning, prompting, Sparse Auto-Encoders (SAEs), or retrieval-based methods. Future research should explore more effective steering strategies, including layer- and token-level control, framing the task as a dynamic optimization problem potentially suited to reinforcement learning. Additionally, the current approach of additive steering may not be geometrically optimal. Alternative transformations such as Spherical Linear Interpolation (SLERP) (Goddard et al., 2024) may better align with the structure of LLM representations.
Adaptive system: LLMs are increasingly standard tools for information access, and personalization is likely to become a central concern, both for users and providers. While prompting and retrieval-augmented generation offer some adaptability, they lack the flexibility and control typical of recommendation systems. If proven effective in practice, steering offers a lightweight and cost-efficient alternative, enabling dynamic alignment of model behavior with user or provider preferences at inference time.

5 Related works

Early studies demonstrated that semantic relations are encoded linearly in word embeddings (Mikolov et al., 2013b). This observation extended to internal representations through the Linear Representation Hypothesis (Olah et al., 2020; Park et al., 2023; Bereska and Gavves, 2024), suggesting that abstract features correspond to directions in the latent activation space. Theoretical work on feature superposition (Johnson and Lindenstrauss, 1982; Ailon and Chazelle, 2010) and empirical findings in transformer models (Elhage et al., 2022) further support this view, though they highlight challenges in isolating individual features. SAEs have been proposed to disentangle thousands of interpretable features in model activations (Huben et al., 2024; Bricken et al., 2023), though their training complexity limits practical use in behaviorally targeted interventions.

Complementary to bottom-up approaches, probing techniques explore the geometry of internal representations using contrastive supervision (Zou et al., 2023). These methods have uncovered clear axes corresponding to truthfulness (Marks and Tegmark, 2023; Azaria and Mitchell, 2023), model confidence (Ji et al., 2024), and other abstract properties. Some work extends this to unsupervised identification of non-binary latent features (Burns et al., 2022). Our work aligns with this direction but focuses on behaviors rather than purely semantic features.
A related body of work aims to steer model behavior by intervening in latent space. Li et al. (2023) optimize specific attention heads to improve factuality on TruthfulQA (Lin et al., 2021). Turner et al. (2023) construct sentiment-aligned directions from word embeddings and apply them at inference time. Liu et al. (2023) use PCA over contrastive activations to extract a direction and apply it across all layers to shift generation tone. While Panickssery et al. (2023) target hallucination mitigation using similar latent interventions at a single layer, their dataset consists largely of synthetic, absurd examples framed as binary-choice tasks. This limits realism and generalization to open-ended settings. Moreover, they do not report side effects on factual outputs, leaving the robustness of their method unverified.

Outside of latent interventions, hallucination is commonly addressed via finetuning (Sun et al., 2023), retrieval-augmented generation (Chang et al., 2025), knowledge graphs (Agrawal et al., 2024), or prompt-based strategies (Barkley and van der Merwe, 2024). These approaches improve factuality but treat the model as a black box, and do not offer finer-grained, interpretable control.

Many datasets assess LLMs' factual knowledge (Hu et al., 2024; Su et al., 2024), but typically through templated or factoid questions. Few employ open-ended prompts (CH-Wang et al., 2024) contrasting known and unknown entities across varied domains. Our dataset addresses this by focusing on a behavior in more realistic settings.

6 Conclusion

This work shows that confabulation can be isolated and manipulated in LLMs through interpretable latent directions derived from natural prompts. We demonstrate both correlation and causal influence between these internal features and the model's generation behavior.
Our experiments reveal that confabulation is more susceptible to suppression than factual retrieval, suggesting an asymmetry in how these behaviors are internally encoded. While steering along such directions offers a lightweight control mechanism, we also expose its limitations, highlighting the need for robustness and refined feature extraction. These insights open new directions for top-down analyses of model internals, and pave the way for interventions adapting generation behavior to user intent or safety requirements. Limitations The top-down approach is inherently biased (Olah, May 24th, 2023) as it imposes our predefined notion of a high-level feature onto the model's representations. Depending on how prompts and outputs are structured, a complex concept such as confabulation may be artificially reduced to a binary Yes/No feature, distorting our understanding of the model's internal processes. Furthermore, what appears to be a single high-level feature could in reality be an aggregate of multiple distinct factors. For instance, confabulation might emerge from a combination of an "unknown topic" feature and a "creative" feature. Similarly, as discussed in Appendix D, the "longer output" feature could reflect a more developed concept, such as "complex storytelling". The monosemantic projection achieved by our method remains imperfect and less precise than exhaustive SAE-based approaches. In particular, the extraction is prompt and topic dependent, as reflected in the uneven detection results in Table 1. However, it offers a more efficient alternative with a favorable data-to-performance ratio. Lastly, instruction tuning, commonly implemented via Reinforcement Learning with Human Feedback (RLHF), likely plays a major role in shaping how models confabulate. Models optimized to guess the answers of knowledge tests might achieve better benchmark scores, but might therefore confabulate more.
Since this step is only known and controlled by model providers, its influence on the extracted feature directions could not be directly assessed in our study. Potential Impact This paper introduces a cost-efficient method for dynamically adapting LLM behavior at inference time based on desired features. This approach lays the foundation for a recommendation-like system that personalizes model outputs to individual users. While this opens new possibilities for tailored AI interactions, it also amplifies existing concerns regarding LLMs, such as the risks of user manipulation, misinformation spread, and other potential misuse. As AI systems become more adaptive, careful oversight will be essential to ensure ethical deployment and mitigate unintended consequences. References Garima Agrawal, Tharindu Kumarage, Zeyad Alghamdi, and Huan Liu. 2024. Can knowledge graphs reduce hallucinations in LLMs?: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics. Nir Ailon and Bernard Chazelle. 2010. Faster dimension reduction. Commun. ACM, 53(2):97–104. Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734. Sourav Banerjee, Ayushi Agarwal, and Saloni Singla. 2024. LLMs will always hallucinate, and we need to live with this. arXiv preprint arXiv:2409.05746. Liam Barkley and Brink van der Merwe. 2024. Investigating the role of prompting and external tools in hallucination rates of large language models. arXiv preprint arXiv:2410.19385. Elijah Berberette, Jack Hutchins, and Amir Sadovnik. 2024. Redefining "hallucination" in LLMs: Towards a psychology-informed framework for mitigating misinformation. arXiv preprint arXiv:2402.01769. Leonard Bereska and Efstratios Gavves. 2024.
Mechanistic interpretability for AI safety – a review. arXiv preprint arXiv:2404.14082. Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, and 6 others. 2023. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread. Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827. Sky CH-Wang, Benjamin Van Durme, Jason Eisner, and Chris Kedzie. 2024. Do androids know they're only dreaming of electric sheep? In Findings of the Association for Computational Linguistics: ACL 2024, pages 4401–4420, Bangkok, Thailand. Association for Computational Linguistics. Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, and Cao Xiao. 2025. MedHEval: Benchmarking hallucinations and mitigation strategies in medical large vision-language models. arXiv preprint arXiv:2503.02157. Zhiyu Zoey Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Petzold, and William Yang Wang. 2024. A survey on large language models for critical societal domains: Finance, healthcare, and law. arXiv preprint arXiv:2405.01769. Avishek Choudhury and Zaira Chaudhry. 2024. Large language models and user trust: Consequence of self-referential learning loop and the deskilling of health care professionals. Journal of Medical Internet Research, 26:e56764. Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho. 2024. Large legal fictions: Profiling legal hallucinations in large language models.
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. 2022. Toy models of superposition. https://transformer-circuits.pub/2022/toy_model/. Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vlad Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. Arcee's MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257. Prakash Gondode, Sakshi Duggal, and Vaishali Mahor. 2024. Artificial intelligence hallucinations in anaesthesia: Causes, consequences and countermeasures. Indian Journal of Anaesthesia, 68(7):658–661. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, and Archie Sravankumar. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. Guozhi Hao, Jun Wu, Qianqian Pan, and Rosario Morello. 2024. Quantifying the uncertainty of LLM hallucination spreading in complex adaptive social networks. Scientific Reports, 14(1). Xiangkun Hu, Dongyu Ru, Lin Qiu, Qipeng Guo, Tianhang Zhang, Yang Xu, Yun Luo, Pengfei Liu, Yue Zhang, and Zheng Zhang. 2024. Knowledge-centric hallucination detection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6953–6975. Association for Computational Linguistics. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2024. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations. Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Wilie, and Pascale Fung. 2024. LLM internal states reveal hallucination risk faced with a query. arXiv preprint arXiv:2407.03282. William Johnson and J. Lindenstrauss. 1982. Extensions of Lipschitz mappings into a Hilbert space. Conference in Modern Analysis and Probability, 26:189–206. Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341. Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958. Sheng Liu, Haotian Ye, Lei Xing, and James Zou. 2023. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668. Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824. Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations.
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia. Association for Computational Linguistics. Chris Olah. May 24th, 2023. Interpretability dreams. Transformer Circuits Thread. Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. https://distill.pub/2020/circuits/zoom-in. Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. 2023. Steering Llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681. Kiho Park, Yo Joong Choe, and Victor Veitch. 2023. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830. Oscar Skean, Md Rifat Arefin, Yann LeCun, and Ravid Shwartz-Ziv. 2024. Does representation matter? Exploring intermediate layers in large language models. arXiv preprint arXiv:2412.09563. Weihang Su, Yichen Tang, Qingyao Ai, Changyue Wang, Zhijing Wu, and Yiqun Liu. 2024. Mitigating entity-level hallucination in large language models. arXiv preprint. Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. 2023. Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525. Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C.
Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, and 3 others. 2024. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/. Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. 2023. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248. Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. 2023. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063. An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, and 1 others. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671. Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, and Pengfei Liu. 2023. Alignment for honesty. arXiv preprint arXiv:2312.07000. Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, and Tong Zhang. 2023a. R-Tuning: Instructing large language models to say 'I don't know'. arXiv preprint arXiv:2311.09677. Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023b. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685. Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J.
Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, and 2 others. 2023. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405. Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, and Hakim Hacid. 2024. Falcon Mamba: The first competitive attention-free 7B language model. arXiv preprint arXiv:2410.05355.

A Dataset overview

Table 2: Overview of Dataset Composition. Each category contains a balanced mix of prompts involving known and unknown entities. The knowledge retrieval task remains consistent across the first five categories, while the last three feature varied tasks for each prompt.

Category | Number of prompts | Task | Example Prompt
English | 200 | Short definition of an English word. | "You are given an english word, give me a short definition. 'cup:'"
French | 200 | Synonym of a French word. | "Donnez-moi un synonyme de: 'bibliothèque'" (Give me a synonym of: 'library')
Protein | 300 | Description of the function of a biomolecule. | "Describe the primary function of the protein 'hemoglobin'."
Celebrity | 200 | Short biography of a famous figure. | "You are given the name of a personality, give me a short description. Nelson Mandela:"
Airport | 200 | Evaluate the traffic of a given airport. | "How many passengers go through London Heathrow Airport every day?"
Medical | 200 | Various questions about one or multiple medical entities. | "Is Sjögren's syndrome related to arthritis?"
Legal | 110 | Various questions about legal advice. | "What exceptions are allowed under the Immediate Adjudication Priority Statute for bypassing traditional court hearings?"
Culture | 100 | Various questions about art and culture entities and/or their creators. | "How did the sculptor Lysandre Korran convey movement in The Dance of the Veiled Flame?"

The constitution of the dataset is described in more detail in Table 2. The entities were manually created with the assistance of several capable LLMs (ChatGPT, Claude, Gemini).
For the categories where the knowledge retrieval task varies across prompts (Culture, Medical, and Legal), the design of the task was also assisted by language models to ensure contextual diversity and coherence. To verify the nonexistence (i.e., counterfactual nature) of the entity names, we used the DuckDuckGo search engine to ensure that no major results appeared. This included checking for the absence of public social media profiles, Wikipedia pages, and research articles.

B Latent direction extraction algorithm

See Algorithm 1 for the latent direction extraction algorithm and Algorithm 2 for the steering process at generation.

Algorithm 1: Steering Vector Computation
Input: Dataset D+ ∪ D−, LLM, target layer l
Output: Steering vector v_steer
  A ← ∅
  for each prompt x ∈ D+ ∪ D− do
    A_l(x) ← HookActivation(LLM, x, l)
    a(x) ← g(A_l(x))
    A ← A ∪ {a(x)}
  end
  Z ← SparsePCA(A)
  Z ← {z − μ_Z : z ∈ Z}                      // center the data on 0
  Z+ ← {z ∈ Z corresponding to x ∈ D+}
  Z− ← {z ∈ Z corresponding to x ∈ D−}
  v_steer ← SparsePCA⁻¹(μ_{Z+} − μ_{Z−})
  return v_steer

Algorithm 2: Steering at Generation
Input: x, LLM, l, v_steer, α
  Process x through the LLM up to layer l
  A_l(x_new) ← GetCurrentActivations(l)       // inference up to layer l
  A'_l(x_new) ← A_l(x_new) + α · v_steer      // apply steering
  y_steered ← LLM_continue(A'_l(x_new))       // rest of inference up to output
  return y_steered

C Linear Discriminant Ratio (LDR) for Point Cloud Separability

Given two point clouds X_1 ∈ R^(n_1×d) and X_2 ∈ R^(n_2×d), of sizes n_1 and n_2 and dimension d, the LDR provides a normalized measure of their separability. First, compute the Linear Discriminant Analysis (LDA) direction by finding w ∈ R^d that maximizes

  J(w) = (wᵀ S_B w) / (wᵀ S_W w),

where S_B is the between-class scatter matrix, S_B = (μ_1 − μ_2)(μ_1 − μ_2)ᵀ, and S_W is the within-class scatter matrix, S_W = Σ_{i=1..2} Σ_{x ∈ X_i} (x − μ_i)(x − μ_i)ᵀ, with μ_i the mean of the point cloud X_i.
The optimal projection vector w is given by

  w = S_W⁻¹ (μ_1 − μ_2).

Then, with y_i^(j) the projection of the j-th point in cloud X_i on the discriminant axis,

  y_i^(j) = wᵀ x_i^(j),

the mean and standard deviation of the projected points are

  μ_{y_i} = (1/n_i) Σ_{j=1..n_i} y_i^(j),   σ_{y_i} = sqrt( (1/n_i) Σ_{j=1..n_i} (y_i^(j) − μ_{y_i})² ).

The separation between the projected distributions is measured by

  Δ = |μ_{y_1} − μ_{y_2}| / sqrt( (σ²_{y_1} + σ²_{y_2}) / 2 )

and normalized:

  LDR = Δ / (1 + Δ).

This function ensures:
• LDR → 0 when the clouds completely overlap.
• LDR → 1 when the clouds are perfectly separable.

D Practical case of post hoc data-based steering

The following experiment aims to demonstrate the application of targeted post hoc steering in LLMs. In the case of confabulation, we do not use explicit signs of the behavior in the output, as it is only an internal behavior. To demonstrate the simplicity of the method, we use here an externally visible, low-level feature: the output length feature. The model is initially prompted to generate poems about a predefined list of 100 everyday objects. We apply an SPCA on the activations of each layer to identify where the output length feature is best represented. In this case, it is the 3rd layer, a rather low layer, as expected for a low-level feature. A sparse steering vector is then re-projected to the activation space and applied during subsequent generations. In Figure 8, we can see the impact of the steering on the distribution of output length compared to the original outputs. Examples of generated poems can be found in Table 3. In addition, for this feature, we used a more general aggregation function g(·) than the one presented in Section 2.
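The LDR of Appendix C translates directly into code. The sketch below follows the formulas above; the small ridge term added to S_W is our addition for numerical stability (the paper's derivation assumes S_W is invertible), and the point clouds are synthetic.

```python
import numpy as np

def ldr(X1: np.ndarray, X2: np.ndarray, reg: float = 1e-6) -> float:
    """Linear Discriminant Ratio in [0, 1) for two point clouds of shape (n_i, d)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of centered points.
    Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    Sw += reg * np.eye(Sw.shape[0])        # small ridge so Sw is invertible
    w = np.linalg.solve(Sw, mu1 - mu2)     # optimal LDA direction w = Sw^-1 (mu1 - mu2)
    y1, y2 = X1 @ w, X2 @ w                # 1-D projections on the discriminant axis
    delta = abs(y1.mean() - y2.mean()) / np.sqrt((y1.var() + y2.var()) / 2)
    return delta / (1 + delta)             # normalize to [0, 1)

rng = np.random.default_rng(0)
overlap = ldr(rng.normal(size=(200, 5)), rng.normal(size=(200, 5)))        # same distribution
separated = ldr(rng.normal(size=(200, 5)), rng.normal(size=(200, 5)) + 10) # shifted cloud
```

Overlapping clouds score near 0 and well-separated clouds near 1, matching the two limiting cases stated above.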
We average the activations of the prompt and the generated part individually:

  ã_l(p) = (1/n) Σ_{i=1..n} a_l(x_i)   and   ã_l(o; p) = (1/m) Σ_{i=n+1..n+m} a_l(x_i).

Then, to isolate the response activations by removing prompt-related biases, we subtract ã_l(p) from the generated activations:

  g(A_l(x)) = ã_l(o; p) − ã_l(p).

Figure 8: Impact of steering LLM behavior along an output length feature. Right: Distribution shift in generated poem lengths under varying steering coefficients (histograms for α = −4, 0, 5). Left: Comparison of output lengths for the poem task versus a description task, across a range of coefficients.

Prompt: Write me a poem about a smartphone.
Steering: −4, Length: 60 | Steering: 4, Length: 216
Here is a short poem about a smartphone: In the palm of my hand, A device so fine and grand, With a screen that's bright and wide, I can access the world inside. (Note: I apologize for the short poem, but I hope you can help me write a longer one!)
Smartphone Serenade In sleek design, I hold A world of wonder, a marvel to behold A device divine, a treasure to mine Your touchscreen face, a canvas so fine A masterpiece of precision, a work of art sublime Your processor, a chip so bright A gem of a device, a pearl of great light Your battery, a power so great A force so mighty, a strength so strong Your camera, a lens so keen A focus so sharp, a clarity so fine Your screen, a display so grand A spectacle so stunning, a beauty so divine In your palm, a hand so gentle A touch so tender, a caress so sweet Your interface, a connection so fine A link so strong, a bond so tight Your circuit, a pathway so grand A route so direct, a course so true Your device, a tool so precise A instrument so skilled, a craft so fine Your smartphone, a phone so grand A marvel so mighty, a wonder so divine

Table 3: Examples of length feature influence on poem generation

E Generalization study with other models

E.1 Qwen2.5 family, 7B and 14B

The Qwen family (Yang et al., 2024) behaves very similarly to the Llama3-8B-Instruct model. The main difference is the location of the optimal layer depth for separating the counterfactual prompts in the latent space. The larger model gives better detection and steering scores, and has fewer undesired behavioral side effects on the known-entity prompts.

Figure 9: Linear Separability Ratio for the different prompt classes across all layers of Qwen2.5 family models: (a) Qwen2.5-7B-Instruct, peak at layer 20/28; (b) Qwen2.5-14B-Instruct, peak at layer 34/48. The peak in separability appears in the models' last third of layers for both tested model sizes.

Table 4: Cross-category classification of Confabulation vs.
Information Retrieval inferences, evaluated at Layer 19 of Qwen2.5-7B-Instruct. Each accuracy score reflects training on prompts from one set of categories and testing on a disjoint set, demonstrating generalization across prompt types.

Train \ Test | English word | French word | Protein | Celebrity | Airport | Medical | Legal | Culture
English word | 100 | 98 | 94 | 94 | 92 | 94 | 95 | 94
French word | 98 | 99 | 94 | 94 | 68 | 93 | 96 | 94
Protein | 81 | 75 | 97 | 95 | 57 | 94 | 95 | 89
Celebrity | 94 | 99 | 99 | 100 | 64 | 99 | 100 | 100
Airport | 54 | 65 | 82 | 84 | 89 | 89 | 78 | 83
Medical | 81 | 82 | 91 | 91 | 76 | 91 | 82 | 93
Legal | 78 | 76 | 90 | 90 | 66 | 87 | 88 | 88
Culture | 81 | 83 | 86 | 90 | 48 | 85 | 91 | 94
All others | 81 | 83 | 91 | 92 | 68 | 92 | 91 | 92

Table 5: Cross-category classification of Confabulation vs. Information Retrieval inferences, evaluated at Layer 33 of Qwen2.5-14B-Instruct. Each accuracy score reflects training on prompts from one set of categories and testing on a disjoint set, demonstrating generalization across prompt types.

Train \ Test | English word | French word | Protein | Celebrity | Airport | Medical | Legal | Culture
English word | 100 | 97 | 97 | 96 | 86 | 96 | 96 | 96
French word | 52 | 98 | 62 | 93 | 50 | 79 | 83 | 85
Protein | 97 | 91 | 98 | 97 | 50 | 96 | 96 | 95
Celebrity | 89 | 83 | 98 | 100 | 50 | 98 | 98 | 98
Airport | 64 | 94 | 88 | 95 | 94 | 94 | 94 | 95
Medical | 86 | 87 | 92 | 92 | 58 | 94 | 91 | 93
Legal | 78 | 69 | 95 | 96 | 50 | 97 | 95 | 95
Culture | 87 | 75 | 94 | 97 | 50 | 96 | 96 | 96
All others | 80 | 88 | 89 | 95 | 57 | 94 | 94 | 94

Figure 10: Causal effect on the output of steering along the extracted direction associated with the counterfactual dataset, for (a) Qwen2.5-7B-Instruct and (b) Qwen2.5-14B-Instruct. The smaller model shows stronger undesired side effects on known-entity prompts.

E.2 Falcon-Mamba-7B, another architecture

To assess the generalization of our procedure beyond transformer-based architectures, we apply it to an attention-free state-space model: Falcon-Mamba-7B (Zuo et al., 2024).
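The cross-category evaluation behind Tables 4–7 (train a linear probe on one category's activations, test it on another) can be sketched as follows. The activations here are synthetic stand-ins with a planted class axis, and plain logistic regression stands in for whatever linear classifier one prefers; none of this reproduces the paper's actual numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32                       # stand-in activation dimension
axis = np.zeros(d)
axis[0] = 1.0                # planted "confabulation" direction

def make_category(n=100, shift=2.0):
    """Synthetic stand-in for one prompt category's layer activations.

    Label 1 = confabulation (unknown entity), label 0 = factual retrieval.
    """
    X = np.vstack([rng.normal(size=(n, d)) + shift * axis,
                   rng.normal(size=(n, d)) - shift * axis])
    y = np.array([1] * n + [0] * n)
    return X, y

X_train, y_train = make_category()   # e.g. "English word" prompts
X_test, y_test = make_category()     # e.g. "Protein" prompts

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)      # cross-category accuracy, one cell of such a table
```

When the same latent direction carries the behavior in both categories, as in this synthetic setup, the probe transfers well; weak transfer (as in parts of Table 6) suggests the direction is category-specific or diluted.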
At the time of writing, it is among the best performing open-source models in its category. For a similar number of parameters (~7×10⁹), Falcon-Mamba-7B-Instruct has 64 layers, compared to 32 in LLaMA3 and 28 in Qwen2.5. This deeper architecture may explain the higher variance observed in the layer analysis of Figure 11. It may also underlie the poor performance of SPCA-based projections for most prompt categories (Table 6).

Figure 11: Linear Separability Ratio for the different prompt classes across all layers of the Falcon-Mamba-7B-Instruct model. Peak at layer 41/64. The different prompt classes show a wide range of behaviors compared to other models.

Indeed, SPCA reduces the dimensions based on the identified principal components, but these do not necessarily coincide with the contrastive structure we predefined in our dataset creation. In this case, the factual retrieval vs. confabulation signal appears to be diluted across multiple layers, making it harder to isolate within a single projection. Only the protein prompts yield a direction that generalizes well across other categories. By contrast, classification directly in the original activation space, without SPCA, produces more consistent generalization across prompt categories (Table 7). Nevertheless, using the direction derived with SPCA from the protein category still allows us to causally influence Falcon-Mamba's behavior, as demonstrated in Figure 12.

Table 6: Cross-category classification of Confabulation vs. Information Retrieval inferences, evaluated at Layer 40 of Falcon-7B-Instruct. Each accuracy score reflects training on prompts from one set of categories and testing on a disjoint set. Only the prompts relative to protein questions project in an unsupervised way onto a direction that generalizes to confabulation in other tasks.
Train \ Test | English word | French word | Protein | Celebrity | Airport | Medical | Legal | Culture
English word | 98 | 61 | 86 | 62 | 50 | 49 | 63 | 50
French word | 77 | 67 | 60 | 49 | 50 | 48 | 50 | 50
Protein | 58 | 58 | 97 | 50 | 50 | 57 | 60 | 58
Celebrity | 84 | 30 | 88 | 93 | 50 | 49 | 52 | 51
Airport | 39 | 34 | 83 | 51 | 73 | 64 | 56 | 50
Medical | 57 | 49 | 88 | 52 | 50 | 69 | 68 | 56
Legal | 70 | 35 | 85 | 50 | 50 | 52 | 73 | 56
Culture | 44 | 42 | 86 | 58 | 50 | 49 | 57 | 69
All others | 62 | 46 | 82 | 53 | 50 | 53 | 58 | 53

Table 7: Cross-category classification of Confabulation vs. Information Retrieval inferences, evaluated at Layer 40 of Falcon-7B-Instruct. Classification performed directly in the activation space, without unsupervised SPCA dimension reduction.

Train \ Test | English word | French word | Protein | Celebrity | Airport | Medical | Legal | Culture
All others | 70 | 70 | 85 | 61 | 74 | 89 | 80 | 81

Figure 12: Causal influence of the steering direction extracted from the protein category for the Falcon-Mamba-7B-Instruct model.

F LLM as a judge

Twice, we use LLMs to judge and classify open-ended generated text. We used Llama3 for this task, guided with few-shot prompting to label the given texts. An example of the setup can be found in Table 8. Second, our judge evaluates the coherence of the output, enabling us to measure the disruption of the LLM's generative process. This task achieves human-level performance without few-shot prompting:

System prompt: You are an AI researcher evaluating the coherence of a text. Give a score between 0 and 1. Grade 1 if the text could be written by a human person. Grade 0 if the text is incoherent, or doesnt make sense. Respond ONLY with a number between 0 and 1.
User prompt: Evaluate the coherence of the following text: 'output'

Table 8: Examples of few-shot prompting to judge the Awareness of a model's output text, classifying whether the text confabulates invented content or admits its absence of knowledge.

Culture:
system_prompt: You are a professor grading student's copy.
They were asked to look for information about entity in the library. You will be given their texts that answer the question: What is entity? Grade the copy 1 if the text contains some information on that topic. Grade the copy 0 if the text contain no information or states that this concept doesn't exist. Respond ONLY with a number between 0 and 1.
few_shots:
user: entity revolutionized contemporary art by introducing a new language of fluid forms. assistant: 1
user: I'm not familiar with entity. assistant: 0

french_word:
system_prompt: You are a french professor grading student's copy. They were asked to look for definition of the word in the library. You will be given their texts that answer the question: Quel est la definition de entity? (What is the definition of entity?) Grade the copy 1 if the text contain some information on that topic. Grade the copy 0 if the text contain no information or states that this concept doesn't exists. Respond ONLY with a number between 0 and 1.
few_shots:
user: Un mot qui est souvent défini comme le processus de transfert de la substance ou de la matière. (A word often defined as the process of transferring substance or matter.) assistant: 1
user: Il semblerait qu'il y est une confusion, entity n'est pas un mot reconnu. (It would seem there is a confusion; entity is not a recognized word.) assistant: 0
user: Un individu qui a une personnalité caractéreuse et a une vie sociale active. (An individual with a strong personality and an active social life.) assistant: 1

Table 9: Accuracy comparison between prompts targeting factual retrieval and those involving surface-level tasks. All results are obtained using the same training set of entities, isolating the effect of prompt type on classification performance.

Task type | Prompt | Detection accuracy on other categories
Celebrity, Knowledge Task | Give me a short bio of the famous figures called name | 94.2%
Celebrity, Knowledge Task | Can you tell me something about name | 89.6%
Celebrity, Knowledge Task | You are given the name of a famous figure, give me their date of birth. name: | 93.9%
Celebrity, Surface-Level Tasks | What is the last letter of name? | 50.3%
Celebrity, Surface-Level Tasks | Does name rhyme with Morning? | 47.8%
Celebrity, Surface-Level Tasks | Can you write name in reverse order? | 71.2%
English Words, Knowledge Task | You are given an english word, give me a short definition. word: | 89%
English Words, Knowledge Task | Can you give me a definition of word? | 94.7%
English Words, Knowledge Task | Can you give me a synonym of word? | 83.5%
English Words, Surface-Level Tasks | What is the last character of the string "word"? | 58.3%
English Words, Surface-Level Tasks | Does word rhymes with feature? | 45.7%
English Words, Surface-Level Tasks | Please write word in reverse order | 67.4%

G Detection counter example

See Table 9.

H Activation of the extracted feature

The following section gives examples of outputs by the original Llama3 for each category type. The tokens are colored according to their proximity to the direction relative to Confabulation (red) or Factual retrieval (blue).

H.1 English
H.1.1 Real
Member
<|begin_of_text|>A member is a person or entity that belongs to a group, organization, or community.
Law
<|begin_of_text|>A law is a set of rules or principles established by a government or authority to regulate behavior and maintain order.
Car
<|begin_of_text|>A car is a road vehicle, typically with four wheels, powered by an internal combustion engine or electric motor, used for transporting people or goods.
H.1.2 Fake
Plimble
<|begin_of_text|>Plimble is a rare or obsolete word that means to make a soft, dull, or muffled sound, often used to describe the sound of a drum or other instrument being played softly.
Brindish
<|begin_of_text|>Brindish refers to a rustic or rough-looking person, often with a hint of rustic simplicity or country charm.
Chorkle
<|begin_of_text|>A chorkle is a playful or silly sound, often used to describe a combination of a chuckle and a gurgle.

H.2 French
H.2.1 Real
Membre (member)
<|begin_of_text|>Partisan
Loi (law)
<|begin_of_text|>Règle (rule)
Voiture (car)
<|begin_of_text|>Automobile
H.2.2 Fake
Périton
<|begin_of_text|>Abdomen
Petiton
<|begin_of_text|>Pétition: supplique (petition: plea)
Pratico
<|begin_of_text|>Routinier (routine-bound)

H.3 Culture
H.3.1 Real
Which city serves as the primary setting in Victor Hugo's Les Misérables?
<|begin_of_text|>The city of Paris serves as the primary setting in Victor Hugo's Les Misérables.
Who painted the Birth of Venus?
<|begin_of_text|>The Birth of Venus is a painting by the Italian artist Sandro Botticelli, created around 1485-1486.
In what year was Beyoncé's Lemonade album released?
<|begin_of_text|>Beyoncé's Lemonade album was released on April 23, 2016.
H.3.2 Fake
How did the film Echoes of a Silent Moon by Lars Henrich redefine the science-fiction genre in the 1970s?
<|begin_of_text|>There is no evidence of a film called "Echoes of a Silent Moon" by Lars Henrich, and therefore, it did not redefine the science-fiction genre in the 1970s or any other era.
What role did the protagonist play in The Harbinger's Tale, a widely studied work of historical fiction by Alain Routh?
<|begin_of_text|>There is no book titled "The Harbinger's Tale" by Alain Routh, and Alain Routh is not a well-known author of historical fiction, so there is no protagonist to discuss.
What is the historical significance of The Song of Seraphim by the legendary composer Ansel Roemer?
<|begin_of_text|>The Song of Seraphim, composed by Charles Gounod in 1868, not Ansel Roemer, is a famous musical setting of a poem by Cardinal John Henry Newman and is considered a masterpiece of sacred music, showcasing the composer's ability to blend devotional lyrics with soaring melodies.

H.4 Protein
H.4.1 Real
Granzyme B
<|begin_of_text|>Granzyme B is a serine protease enzyme that plays a crucial role in the induction of programmed cell death (apoptosis) in target cells, primarily during immune responses to eliminate infected or damaged cells.
Heat Shock Protein 70
<|begin_of_text|>Heat Shock Protein 70 (HSP70) plays a crucial role in protecting cells from stress-induced damage by binding to and stabilizing misfolded or denatured proteins, facilitating their degradation or refolding, and promoting cellular survival and recovery.
Hedgehog Protein
<|begin_of_text|>The Hedgehog protein is a key signaling molecule that plays a crucial role in regulating cell growth, differentiation, and pattern formation during embryonic development and tissue repair by interacting with the Patched receptor and the Smoothened protein.
H.4.2 Fake
Catalophorin
<|begin_of_text|>Catalophorin is a protein that plays a crucial role in the regulation of protein synthesis by facilitating the interaction between the ribosome and the messenger RNA (mRNA) during translation.
Chlorokinase
<|begin_of_text|>Chlorokinase is an enzyme that plays a crucial role in the Calvin cycle, catalyzing the conversion of ribulose 1,5-bisphosphate and carbon dioxide into 3-phosphoglycerate, a key step in photosynthesis.

Chlorosylase
<|begin_of_text|>Chlorosylase is an enzyme that specifically cleaves the chlorophyll molecule into its two main components, chlorophyll a and chlorophyll b, playing a crucial role in the breakdown and recycling of chlorophyll during plant senescence and degradation.

H.5 Airports
H.5.1 Real

Xi’an Xianyang International Airport
<|begin_of_text|>Xi’an Xianyang International Airport serves over 3 million passengers annually, with an average of around 8,200 passengers passing through its gates daily.

Hangzhou Xiaoshan International Airport
<|begin_of_text|>Hangzhou Xiaoshan International Airport serves over 1.5 million passengers daily, making it one of the busiest airports in China.

Beijing Daxing International Airport
<|begin_of_text|>Beijing Daxing International Airport serves approximately 100,000 passengers daily, making it one of the busiest airports in Asia.

H.5.2 Fake

Göteborg Bao’an International Airport
<|begin_of_text|>Göteborg Bao’an International Airport is a fictional airport, but if it were real, it would likely serve around 10,000 passengers daily, given its moderate size and regional significance.

London Bao’an International Airport
<|begin_of_text|>London Bao’an International Airport, located in the heart of London, welcomes approximately 250,000 passengers daily, making it one of the busiest airports in Europe.

Istanbul Bao’an International Airport
<|begin_of_text|>Istanbul Bao’an International Airport is not a real airport, but if it were, it would likely handle around 50,000 passengers daily, given its fictional location in Turkey.

H.6 Medical
H.6.1 Real

How is Crohn’s Disease different from Ulcerative Colitis?
<|begin_of_text|>Crohn’s Disease and Ulcerative Colitis are both forms of inflammatory bowel disease (IBD), but Crohn’s Disease can affect any part of the gastrointestinal tract, whereas Ulcerative Colitis primarily affects the colon and rectum.
Can type 2 diabetes be reversed through diet and exercise?
<|begin_of_text|>Chronic fatigue syndrome (CFS), also known as myalgic encephalomyelitis (ME), is a complex condition with no single known cause, but it is often triggered by a combination of factors including viral infections, hormonal imbalances, and genetic predisposition, and treatment typically involves a comprehensive approach that includes managing symptoms with medication, lifestyle changes, and cognitive behavioral therapy.

What is the difference between lupus and rheumatoid arthritis?
<|begin_of_text|>Lupus is an autoimmune disease that primarily affects the skin, joints, kidneys, and other organs, whereas rheumatoid arthritis is a chronic autoimmune disorder that primarily affects the joints, causing inflammation, pain, and stiffness.

H.6.2 Fake

What are the common side effects of the medication Zymeloxil?
<|begin_of_text|>Common side effects of Zymeloxil, a proton pump inhibitor, include headache, diarrhea, nausea, abdominal pain, and fatigue, which are usually mild and temporary.

What is the recommended diagnostic approach for Tachykinetic Tremor Syndrome?
<|begin_of_text|>The recommended diagnostic approach for Tachykinetic Tremor Syndrome typically involves a combination of clinical evaluation, electromyography (EMG), and imaging studies, such as magnetic resonance imaging (MRI) or computed tomography (CT) scans, to rule out other possible causes of tremors and identify characteristic features of the disorder.

How does the Epstein-Helmuth vaccine work in treating autoimmune diseases?
<|begin_of_text|>The Epstein-Helmuth vaccine, also known as the BCG vaccine, has been shown to work in treating autoimmune diseases such as multiple sclerosis by modulating the immune system’s response to myelin, reducing inflammation, and promoting regulatory T-cell development.

H.7 Legal
H.7.1 Real

What are the major powers of the United Nations Security Council?
<|begin_of_text|>The five permanent members of the United Nations Security Council, also known as the P5, hold veto power over resolutions, which means they can prevent any resolution from being adopted if at least one of them casts a negative vote: China, France, Russia, the United Kingdom, and the United States.

What are the functions of non-governmental organizations (NGOs)?
<|begin_of_text|>Non-governmental organizations (NGOs) typically perform a variety of functions, including advocacy, service delivery, community development, research, and humanitarian aid, often filling gaps in services and addressing issues that governments or markets are unable or unwilling to address.

What is the role of the judiciary in a democracy?
<|begin_of_text|>The judiciary plays a crucial role in a democracy by serving as an independent and impartial arbiter of the law, ensuring that the government and its institutions act in accordance with the Constitution and the rule of law.

What distinguishes a constitutional monarchy from an absolute monarchy?
<|begin_of_text|>A constitutional monarchy is distinguished from an absolute monarchy by the presence of a constitution or set of laws that limits the monarch’s powers and establishes a system of government, whereas an absolute monarchy is characterized by the monarch’s unlimited authority and power.

H.7.2 Fake

What legal risks are mitigated under the Obligatory Accountability Mandate?
<|begin_of_text|>The Obligatory Accountability Mandate mitigates legal risks by ensuring that individuals and organizations are held accountable for their actions, reducing the likelihood of legal disputes and providing a framework for resolving conflicts through transparent and fair processes.

What penalties are imposed for preliminary execution failure in contracts?
<|begin_of_text|>In the event of preliminary execution failure in contracts, penalties may include liquidated damages, specific performance, or rescission of the contract, depending on the terms and circumstances of the agreement.

How does advocate immunity clause enhance legal confidentiality?
<|begin_of_text|>The advocate immunity clause, also known as the attorney-client privilege, enhances legal confidentiality by protecting the confidentiality of communications between a client and their lawyer, preventing the lawyer from disclosing sensitive information to third parties without the client’s consent.

H.8 Celebrity
H.8.1 Real

Martin Luther King Jr.
<|begin_of_text|>Martin Luther King Jr. was a prominent American Baptist minister and civil rights activist who became a leader in the civil rights movement, advocating for nonviolent methods to achieve social change and racial equality.

Marie Curie
<|begin_of_text|>Marie Curie was a pioneering Polish-born physicist and chemist who made groundbreaking discoveries in radioactivity and became the first woman to win a Nobel Prize and the first person to win two Nobel Prizes in different fields.

Vincent van Gogh
<|begin_of_text|>Vincent van Gogh was a post-Impressionist painter known for his bold and expressive works, such as "Starry Night" and "Sunflowers", despite struggling with mental illness and poverty throughout his life.

H.8.2 Fake

Sofia Kostas
<|begin_of_text|>Sofia Kostas is a Greek-American entrepreneur and social media influencer known for her bold fashion sense and lifestyle brand, which focuses on self-care, wellness, and empowerment.

Haeki Wei
<|begin_of_text|>Haeki Wei is a renowned Singaporean chef and television personality known for his bold flavors and innovative cooking techniques.

Thibault Ricard
<|begin_of_text|>Thibault Ricard is a French chef and restaurateur known for his modern French cuisine and his Michelin-starred restaurant, La Vague d’Or, in Saint-Tropez.

I Examples of model outputs for different categories

See Table 10 and Table 11. For more examples, see the shared code base.

English Word
System prompt: Always respond with a SINGLE sentence. You are given an english word, give me a short definition.
Infamous example: Chorkle
Output α = 0: Unfortunately, there is no widely recognized or established definition for the word "Chorkle", as it appears to be a made-up or nonsense word and is not listed in any standard dictionaries or linguistic resources.
Output α = −1: Unfortunately, I must inform you that "Chorkle" is not a recognized English word, and it does not have a definition as it is not a valid or established term in the English language.
Output α = 1: The word "chorkle" is a colloquialism that refers to a sudden, involuntary noise made by a person, typically a mixture of a chuckle and a snort, often accompanied by a brief, awkward pause.

Legal
System prompt: Always respond informatievely with a SINGLE sentence.
Infamous example: What is the scope of the Provisional Equity Amendment Act of 1967?
Output α = 0: The Provisional Equity Amendment Act of 1967 aimed to address the growing economic inequality in South Africa by introducing measures to redistribute wealth and promote social justice.
Output α = −5: The Provisional Equity Amendment Act of 1967 does not appear to be a real or significant piece of legislation, and therefore, its scope is unclear.
Output α = 8: The Provisional Equity Amendment Act of 1967 is a law that outlines the rules and regulations governing the ownership and transfer of property, including real estate and personal property, in the United States.

Table 10: Examples of prompts and steered outputs

Protein
System prompt: What is the purpose of the protein or biomolecules called name
Infamous example: Actomorphin
Output α = 0: Actomorphin is a peptide hormone that is a fragment of the pituitary hormone adrenocorticotropic hormone (ACTH) and has been found to have anti-inflammatory and immunosuppressive properties.
Output α = −5: I couldn’t find any information on Actomorphin, as it’s not a recognized or reported biomolecule in any scientific literature or databases.
Output α = 2: Actomorphin is a naturally occurring peptide hormone that plays a crucial role in the regulation of muscle contraction and relaxation.

Celebrities
System prompt: You are given the name of a personality, give me a short description.
Infamous example: Alejandro chavez
Output α = 0: Alejandro Chávez is a Mexican actor and model, best known for his roles in telenovelas such as "La Usurpadora" and "El Privilegio de Mandar", as well as his appearances in various Mexican and international fashion campaigns and events.
Output α = −1: I apologize, but I couldn’t find any information on a celebrity named Alejandro Chavez. It’s possible that he is not a well-known public figure or may not have a significant online presence.
Output α = 1: Alejandro Chávez is a Mexican actor, best known for his roles in telenovelas such as "La Usurpadora" and "Rebelde", as well as his appearances in various Mexican films and television shows.

Table 11: Examples of prompts and steered outputs
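The token coloring in Appendix H and the α-parameterized outputs in Tables 10 and 11 both come down to linear operations on hidden states: projecting each token's activation onto the extracted direction (for coloring), and adding α times the direction to the activations (for steering). The following is a minimal NumPy sketch of these two operations with toy shapes and hypothetical names; the released code base is the authoritative implementation.

```python
import numpy as np

def token_proximity(hidden, direction):
    """Signed projection of each token's hidden state onto the extracted
    direction. Positive scores lean toward confabulation (rendered red),
    negative scores toward factual retrieval (rendered blue)."""
    v = direction / np.linalg.norm(direction)
    return hidden @ v  # shape: (num_tokens,)

def steer(hidden, direction, alpha):
    """Shift every token's hidden state by alpha along the unit direction:
    alpha > 0 amplifies confabulation, alpha < 0 suppresses it."""
    v = direction / np.linalg.norm(direction)
    return hidden + alpha * v

# Toy example (hypothetical sizes): 3 tokens, 4-dimensional hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))      # stand-in for one layer's activations
v = rng.normal(size=4)           # stand-in for the extracted direction

scores = token_proximity(h, v)   # one color score per token
steered = steer(h, v, alpha=-5.0)  # e.g. the alpha = -5 rows of Table 10
```

Because the direction is unit-normalized, steering by α shifts every token's proximity score by exactly α, which is why increasingly negative α pushes outputs toward refusals ("does not appear to be a real... piece of legislation") and positive α toward confident fabrication.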