
Paper deep dive

Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering

Guangtao Lyu, Xinyi Cheng, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng

Year: 2026 | Venue: arXiv preprint | Area: Multimodal Safety | Type: Empirical | Embeddings: 90

Models: LLaVA-1.5-13B, LLaVA-1.5-7B

Abstract

LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · multimodal-safety (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 5:18:02 PM

Summary

The paper introduces Contrastive Neuron Steering (CNS), a method to mitigate hallucinations in Large Vision-Language Models (LVLMs) by using Sparse Autoencoders (SAEs) to decompose visual embeddings into interpretable neurons. The authors identify that hallucinations are primarily caused by disruptions or spurious activations of 'image-specific' neurons, while 'always-on' neurons remain stable. CNS improves visual grounding by contrastively identifying and amplifying informative image-specific neurons while suppressing perturbation-induced noise at the prefilling stage.

Entities (5)

Contrastive Neuron Steering · method · 100%
LVLM · model-architecture · 100%
Sparse Autoencoder · technique · 100%
Always-on neurons · neuron-type · 95%
Image-specific neurons · neuron-type · 95%

Relation Signals (3)

Contrastive Neuron Steering mitigates Hallucinations

confidence 98% · CNS selectively amplifies informative neurons while suppressing perturbation-induced activations... effectively mitigates hallucinations.

Sparse Autoencoder decomposes Visual Embeddings

confidence 95% · introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons.

Image-specific neurons cause Hallucinations

confidence 90% · hallucinations often result from disruptions or spurious activations of image-specific neurons

Cypher Suggestions (2)

Find all methods used to mitigate hallucinations in LVLMs · confidence 90% · unvalidated

MATCH (m:Method)-[:MITIGATES]->(h:Phenomenon {name: 'Hallucination'}) RETURN m.name

Identify the relationship between neuron types and model performance · confidence 85% · unvalidated

MATCH (n:NeuronType)-[r:AFFECTS]->(p:Performance) RETURN n.name, r.impact
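A minimal sketch of how one of these suggested queries could be run against a Neo4j-backed paper graph using the official Python driver. The connection URI, credentials, and the graph schema (Method, Phenomenon, MITIGATES) are assumptions taken from the unvalidated suggestion above, not a confirmed part of this tool.

from neo4j import GraphDatabase

# Hypothetical connection details; replace with the actual graph endpoint.
URI = "bolt://localhost:7687"
AUTH = ("neo4j", "password")

QUERY = (
    "MATCH (m:Method)-[:MITIGATES]->(h:Phenomenon {name: 'Hallucination'}) "
    "RETURN m.name"
)

def run_suggestion(query: str) -> list[str]:
    # Execute one suggested Cypher query and collect the returned method names.
    driver = GraphDatabase.driver(URI, auth=AUTH)
    try:
        with driver.session() as session:
            return [record["m.name"] for record in session.run(query)]
    finally:
        driver.close()

if __name__ == "__main__":
    print(run_suggestion(QUERY))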

Full Text

90,058 characters extracted from source content.


Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Guangtao Lyu 1 Xinyi Cheng 2 Qi Liu 1 Chenghao Xu 3 Jiexi Yan 2 Muli Yang 4 Fen Fang 4 Cheng Deng 1 Abstract LVLMs achieve remarkable multimodal under- standing and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation- level perspective by introducing sparse autoen- coders (SAEs) to decompose dense visual embed- dings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image- specific neurons. Our findings reveal that hallu- cinations often result from disruptions or spuri- ous activations of image-specific neurons, while always-on neurons remain largely stable. More- over, selectively enhancing or suppressing image- specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semanti- cally grounded visual representations. This not only enhances visual understanding but also ef- fectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding. 1 School of Electronic Engineering, Xidian University, Xi’an, China 2 School of Computer Science and Technology, Xidian University, Xi’an, China 3 College of Computer and Information, Hohai University, Nanjing, China 4 Institute for Infocomm Re- search, A*STAR, Singapore. Correspondence to: Cheng Deng <chdeng.xd@gmail.com>. Preprint. February 3, 2026. 1. Introduction Large vision-language models (LVLMs) (Liu et al., 2023b; Dai et al., 2023; Bai et al., 2023; Zhu et al., 2023) have achieved remarkable progress in multimodal understanding and generation. Despite these advances, LVLMs remain vul- nerable to hallucinations, particularly object hallucinations where the model describes entities that are not present in the input image (Lee et al., 2018; Leng et al., 2024). Such errors undermine reliability and user trust, while raising critical concerns for safety-sensitive applications such as au- tonomous systems, medical imaging, and decision support. To mitigate hallucinations, numerous techniques have been investigated, including visual instruction fine-tuning (Liu et al., 2024a; Yu et al., 2024), integration with external expert models, and contrastive decoding strategies (Leng et al., 2024; Chen et al., 2024; Favero et al., 2024). Nev- ertheless, the mechanistic origins of hallucinations remain poorly understood. Existing explanations predominantly attribute hallucinations to language biases, such as the “an- chor pattern” (Huang et al., 2023) and “text inertia” (Liu et al., 2024d), which posit that hallucinations emerge from the dominance of linguistic priors over visual features. 
How- ever, these perspectives largely neglect the internal visual representation space of LVLMs. In this paper, we seek to explore the relationship between internal visual representa- tions and hallucinations, addressing the following fundamen- tal questions: how are visual features organized internally, how they change under perturbations, and which aspects of the representation most directly contribute to hallucinations? To investigate hallucinations in depth, we adopt a representation-level perspective by decomposing dense, en- tangled visual features from LVLM visual encoders into sparse, interpretable neurons using sparse autoencoders (SAEs) (Makhzani & Frey, 2013; Templeton et al., 2024). These neurons capture concept-specific visual features (Dur- mus et al., 2024; Templeton et al., 2024), providing a princi- pled basis for analyzing the internal mechanisms of hallu- cinations (Figure 2). Neuron-level characterization reveals two distinct types: a small set of always-on neurons, which remain active across all images and encode semantically irrelevant information, and a larger set of image-specific neurons, whose sparse, semantically grounded activations 1 arXiv:2602.00621v1 [cs.CV] 31 Jan 2026 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering correspond directly to specific visual concepts (Figures 4 and 5). This distinction allows us to systematically inves- tigate how disruptions or spurious activations of image- specific neurons contribute to hallucinations and to explore targeted interventions that modulate LVLM outputs in a controlled and interpretable manner. Building on the sparse neuron representation, we system- atically analyze hallucinations in LVLMs from an inter- pretable, internal perspective. We find that hallucinations primarily arise from spurious activations or disruptions of image-specific neurons, which encode semantically mean- ingful visual concepts, while always-on neurons remain sta- ble and have limited influence on task-specific predictions. By tracing neuron activations across images and tasks, we identify the internal failure modes that lead to hallucinated outputs. We further investigate neuron-level interventions and show that selectively amplifying relevant image-specific neurons or suppressing misleading ones enables controlled modulation of model outputs( Figures 6, 7 and 18). Overall, this neuron-level analysis provides both an interpretable ex- planation of hallucinations and a principled mechanism for intervention, improving visual understanding and effectively mitigating hallucinations in LVLMs. Building on these insights, we propose a novel method, Contrastive Neuron Steering (CNS), to mitigate hallucina- tions by operating directly in the internal visual representa- tion space. CNS identifies image-specific neurons through contrastive analysis between clean and noisy inputs and se- lectively amplifies informative neurons while suppressing perturbation-induced activations. To further reduce interfer- ence from neurons that are persistently active across images, CNS incorporates an Always-on Neuron Suppression (ANS) mechanism, which down-weights these non-informative sig- nals and sharpens focus on image-specific features. By producing more robust and semantically grounded visual representations, CNS enhances visual understanding and ef- fectively mitigates hallucinations. 
Operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods, providing a general and complementary solution to prior output-level approaches. In summary, our contributions are as follows: • We leverage SAEs to decompose dense visual repre- sentations into sparse, interpretable neurons, providing an internal, mechanistic perspective for analyzing and intervening in LVLMs hallucinations. • We show that hallucinations are primarily caused by disruptions or spurious activations of image-specific neurons, with always-on neurons having limited influ- ence, and that targeted interventions on image-specific neurons can effectively modulate model outputs and mitigate hallucinations. •We propose Contrastive Neuron Steering (CNS), a representation-level method that mitigates hallucina- tions by selectively enhancing image-specific neurons via contrastive analysis of clean and noisy inputs. CNS operates at the prefilling stage and is compatible with existing decoding-stage techniques. •Extensive experiments on both hallucination and gen- eral multimodal benchmarks show that CNS effectively improves visual grounding and reduces hallucinations, while preserving overall multimodal understanding. 2. Related Work Hallucinations in LVLMs. LVLMs (OpenAI et al., 2024; Anthropic, 2024; DeepSeek-AI et al., 2025; Comanici et al., 2025; Yang et al., 2025) have achieved significant progress in multimodal understanding and generation. However, these models remain prone to hallucinations, particularly object hallucinations (Liu et al., 2024c; Lee et al., 2023; Gunjal et al., 2024; Chuang et al., 2023). The causes in- clude pretraining data biases (Agarwal et al., 2020; Agrawal et al., 2016), over-reliance on parametric knowledge (Leng et al., 2024; Lee et al., 2023; Zhibo et al., 2023), and biased visual feature learning (Zhu et al., 2024; Huang et al., 2023; Yue et al., 2024; Han et al., 2022). Existing hallucination mitigation strategies fall into two categories: training-driven methods, which fine-tune LVLMs via data augmentation or reinforcement learning (Liu et al., 2023a; Sun et al., 2023; Zhou et al., 2024; Liu et al., 2024a; Zhai et al., 2024), and training-free methods, which intervene during inference by manipulating attention maps or logits using contrastive de- coding or auxiliary models (Jiang et al., 2025b; Liu et al., 2025; Zou et al., 2025; Lyu et al., 2025; Park et al., 2025; Wang et al., 2024; Li et al., 2023a; Kaduri et al., 2025). Sparse Autoencoders. SAEs (Templeton et al., 2024; Pach et al., 2025; Shu et al., 2025) decompose hidden activations into sparse neurons, providing an interpretable basis for analyzing and steering LVLMs. Recent advances improve sparsity and reconstruction, including BatchTopK (Buss- mann et al., 2024a), JumpReLU (Rajamanoharan et al., 2024), and Matryoshka variants (Nabeshima, 2024; Buss- mann et al., 2024b). In LLMs, SAEs have been applied to neuron-level steering for reducing toxicity, sycophancy, and refusal (Gallifant et al., 2025; Nanda et al., 2024), as well as hallucination detection and safety (Ferrando et al., 2025; Demircan et al., 2025; Wu et al., 2025). 
In vision and mul- timodal settings, SAEs enable fine-grained interpretation and control: Revelio (Kim et al., 2024) uncovers features in diffusion models, Matryoshka SAEs balance sparsity and re- construction on CLIP embeddings (Bussmann et al., 2024b), USAEs align concepts across networks (Thasarathan et al., 2025), and SAE-V (Lou et al., 2025) supports cross-modal interpretation and steering (Zhang et al., 2024). 2 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering #2104: Tractor # 4866: Table Lamp # 43047: Peacock # 47238: Sailboat # 32713: Kayaking # 12186: Hamburger# 64147: Arctic wolve # 62001: Alligator 65536-Dim Neuron Space Figure 1. Neuron visualizations from SAE, showing diverse visual patterns and semantic structures. Insertion of bow tie(Neuron #6) V "Describe this photo in detail." 훼= 10 훼= 20 훼= 30 푂 Neuron Weight The image features a dog lying on a shelf, surrounded by various shoes. The dog ... The image features a brown and white dog lying on a shelf, surrounded by various shoes... He ... The image features a man wearing a tuxedo and bow tie, standing in a bow tie contest. He ... The image features a man wearing a tuxedo and a bow tie, standing in a suit and bow tie. He ... V "Describe this photo in detail." 훼=−10 훼=−20 훼=−30 푂 The image features a man and a woman standing..., both wearing matching shirts.... The image features a man and a woman standing next to each other, posing for a picture... They ... The image features two men standing next to each other, both wearing ties. They ... The image features a large gathering of people in a room, with some of them standing... They ... Suppressionof shirt (Neuron #41) Neuron Weight Figure 2. Steering an LVLM: (a) amplifying a “bow tie” neuron emphasizes this concept in generated descriptions, while (b) sup- pressing a “shirt” neuron prevents it from appearing. 3. Preliminaries Sparse Autoencoders. SAEs are designed to disentangle dense and highly entangled internal representations by de- composing them into a sparse set of interpretable latent neurons (Olshausen & Field, 1997; Bricken et al., 2023). Concretely, given a dense visual featurev ∈R d , the SAE encoder projects it into a latent space and enforces sparsity through a Top-K constraint: z(v) = TopK ReLU(W enc v− b) ,(1) where only theKlargest activations are retained. The de- coder then reconstructs the original feature as a linear com- bination of the activated latent neurons: ˆ v = W ⊤ dec z(v) + b.(2) Inserting SAEs into LVLM. SAEs can be applied at var- ious LVLM stages, including intermediate LLM layers or the visual encoder. Here, we focus on the visual encoder for the following reasons: (1) It enables fine-grained analy- sis of internal visual representations by isolating the visual encoder’s contribution, unlike prior studies that focus on strong language priors, allowing direct investigation of how Table 1. Effects of zeroing different types of SAE neurons. Neuron TypeAccuracy (%)↑F1-score (%)↑ baseline84.6384.99 always-on84.6885.08 image-specific63.0857.36 random84.3184.65 0100200300400500600700800900999 Noise Steps T 40 50 60 70 80 90 Llava Performance on POPE Accuracy F1-Score 0.0 0.2 0.4 0.6 0.8 1.0 k (Neuron Change Ratio) k (patch-level all neurons) k (image-level always-on (top10)) k (image-level (top10-20)) Figure 3. Relationship between noise step, model performance, and activation changes across different neuron types. visual feature changes relate to hallucinations. 
(2) Some LVLMs adopt the same visual backbones, so an SAE trained on one backbone can be reused across LVLMs that share it. (3) Our approach operates at the prefilling stage by reg- ulating visual representations before decoding, while most existing methods act at the decoding stage, making it natu- rally compatible with and complementary to decoding-level approaches. (4) Inserting SAEs into the visual encoder re- quires only one extra visual encoder forward, avoiding full model re-inference. 4. Interpretable Neuron-Level Hallucination Analysis and Mitigation We leverage neuron-level interpretability to systematically investigate the mechanisms underlying hallucinations in LVLMs and to explore strategies for mitigation. We train a Matryoshka SAE (Bussmann et al., 2024b) on image fea- tures extracted from the visual encoder of LLaVA-1.5 using ImageNet, and perform all analyses on the COCO dataset. 4.1. Neuron Characterization To investigate the semantic structure captured by SAEs, we first visualize individual neurons by selecting, for each neu- ron, the top-16 images that trigger the strongest activations (Figures 1 and 14). This confirms that SAEs decompose dense visual representations into sparse, interpretable units, providing a foundation for subsequent mechanistic analysis and neuron-level interventions. To systematically characterize neuron behavior, we combine quantitative and qualitative analyses at both the image and 3 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering #2026#2652 #2808 #2875 #998 #4039 #4047 #2105 #3833 #2039 #628#2729#1341#819#1554 #1736#155 #1981#760 #4077 Figure 4. Image-level neuron analysis. Top: statistical analysis of neuron activations. Bottom: neuron visualization. Red boxes indicate always-on neurons, in the Top-20 for all images. patch levels (Figures 4 and 5). Quantitatively, we record the Top-20 most activated neurons for each image or spa- tial patch to capture activation statistics. Qualitatively, we visualize these Top-20 neurons to examine the semantic con- cepts they encode. This dual approach enables a thorough assessment of neuron behavior across spatial scales. Through this analysis, we identify two distinct neuron types with characteristic properties. A small set of Always-on neurons (10 out of 65k) consistently appear in the Top-20 across all images (red box in Figure 4). The images activat- ing these neurons are visually similar and primarily reflect general textures or color regions, without encoding specific semantic content. In contrast, image-specific neurons ex- hibit high selectivity: they are strongly activated by visually similar images (e.g., cats or grass) and capture semantically meaningful, localized concepts at the patch level (Figure 5). This categorization provides the basis for all subsequent neuron-level analyses and interventions. 4.2. Neuron-Guided Analysis of Hallucinations Previous studies (Leng et al., 2024; Wan et al., 2025) have shown that adding image noise increases visual uncertainty and amplifies hallucinations. To investigate the underly- ing mechanisms, we analyze how different neuron types respond to such perturbations, focusing on always-on and #3833#0 #1 #2 #3 #4 #1341 #5 #6 #4039 #7 #8 #9 #10 #1736 #11#1631 #12#13 #1518 Figure 5. Patch-level neuron analysis. Top: statistical analysis of neuron activations. Bottom: neuron visualization. image-specific neurons. 
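As a concrete illustration of the image-level characterization in Section 4.1, always-on neurons can be flagged as the neuron IDs that rank in the Top-20 for every image. The minimal NumPy sketch below assumes a precomputed matrix of image-level SAE activations; function name and shapes are illustrative rather than the authors' code.

import numpy as np

def find_always_on_neurons(activations: np.ndarray, top_k: int = 20) -> set:
    # activations: (num_images, num_neurons) image-level SAE activations.
    # A neuron is "always-on" if it appears in the Top-K set for every image;
    # the remaining selectively activated neurons are treated as image-specific.
    always_on = None
    for img_act in activations:
        top_ids = set(np.argsort(img_act)[-top_k:].tolist())
        always_on = top_ids if always_on is None else always_on & top_ids
    return always_on if always_on is not None else set()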
Using the interpretable latent space provided by SAEs, we examine how noise alters internal vi- sual representations. Experiments are conducted on LLaVA- 1.5 using the POPE benchmark (COCO, random setup), tracking both performance degradation and neuron activa- tion changes under increasing noise levels. To quantify neuron-level instability, we measure the Top-K neuron change ratio between a clean imagevand its noisy counterpart ̃v: ∆ K (v, ̃v) = 1− |z(v)∩ z( ̃v)| K ,(3) wherez(·)denotes the set of Top-Kactivated neurons. Higher ∆ K indicate greater disruption of visual features. As shown in Figure 3, increasing noise intensity leads to both higher hallucination rates and larger neuron-change ratios. Patch-level neurons change most rapidly, reflecting their sensitivity to local perturbations. Always-on neurons remain largely stable, whereas image-specific neurons ex- hibit substantial shifts, approaching the magnitude observed at the patch level. These results suggest that hallucina- tions are primarily driven by disruptions to image-specific neurons rather than global visual features. Qualitative exam- ples further illustrate this effect. For instance, when noise is added to an image of a camera, activations of camera- related image-specific neurons gradually decrease, causing the model to produce progressively less detailed descriptions 4 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Q: What is the color of the apple? A: The color of the apple is red. Suppress the 2836 neuron associated with strawberries. Q: What is the color of the apple? A: The color of the apple is black. Enhance the 3085, 1941 neurons associated with apple. Q: What is the color of the apple? A: The color of the apple is black. Figure 6. Hallucination caused by suppression of relevant neurons. Strong activation of a strawberry-related neuron (green boxes) overwhelms apple-specific neurons (red boxes), leading to an in- correct color prediction. Targeted neuron steering restores correct visual grounding. until camera-related content is omitted entirely (Figure 15). We further investigate the impact of interventions on differ- ent neuron types by selectively zeroing their activations to assess their influences in hallucinations. We compare three groups: always-on neurons, image-specific neurons, and randomly selected neurons beyond the Top-20. As shown in Table 1, suppressing image-specific neurons substantially alters model predictions and significantly increases halluci- nations, whereas zeroing always-on or random neurons has minimal effect. These results confirm that image-specific neurons are the primary drivers of hallucinations. 4.3. Neuron Intervention to Mitigate Hallucinations Building on the above analysis, we further investigate how neuron-level interventions influence model outputs and whether such interventions can be used to mitigate halluci- nations. Our findings indicate that hallucinations are closely associated with disruptions in image-specific neurons, mo- tivating direct manipulation of these neurons to improve visual grounding. Accordingly, we explore neuron-level intervention strategies that selectively amplify or suppress image-specific activations during inference. We conduct controlled neuron interventions across a va- riety of visual scenarios, including concept insertion and removal, as well as neuron amplification and suppression. 
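Two quantities used in this analysis can be sketched directly: the Top-K neuron change ratio of Eq. (3), and the single-neuron amplification/suppression underlying the interventions just described (illustrated in the figures that follow). Tensor shapes, function names, and the additive form of the steer are assumptions for illustration, not the released implementation.

import torch

def topk_change_ratio(z_clean: torch.Tensor, z_noisy: torch.Tensor, k: int = 20) -> float:
    # Delta_K(v, v~) = 1 - |z(v) ∩ z(v~)| / K over the Top-K active neuron sets (Eq. 3).
    top_clean = set(torch.topk(z_clean, k).indices.tolist())
    top_noisy = set(torch.topk(z_noisy, k).indices.tolist())
    return 1.0 - len(top_clean & top_noisy) / k

def steer_neuron(z: torch.Tensor, neuron_id: int, alpha: float) -> torch.Tensor:
    # Amplify (alpha > 0) or suppress (alpha < 0) one concept neuron; setting
    # z[..., neuron_id] = 0 corresponds to the zeroing experiments in Table 1.
    z = z.clone()
    z[..., neuron_id] = z[..., neuron_id] + alpha
    return z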
As illustrated in Figures 18 to 21, adjusting neuron ac- tivations directly influences LVLM outputs: amplifying a neuron increases the likelihood that the corresponding visual Q: Please describe the image in detail. A: The image features two people walking down... They are accompanied by a sheep, which .... Q: Is there a dog in the image? A: Yes, ... Suppress the 2480 neuron associated with dog. Q: Is there a dog in the image? A: No, ... Figure 7. Hallucination caused by spurious activation of irrelevant concepts. Sheep-related neurons dominate (red boxes), but addi- tional activation of a dog-related neuron (green boxes) leads to a hallucinated response to a dog-specific query. Suppressing this neuron corrects the prediction. concept appears in the model’s prediction, while suppress- ing a neuron reduces or eliminates that concept. These results demonstrate that neuron-level interventions provide a reliable and interpretable mechanism for systematically modulating model behavior. We further illustrate how neuron-level intervention explains and mitigates hallucinations through concrete examples. In the black-apple case (Figure 6), a misleading neuron asso- ciated with red strawberries dominates the internal repre- sentation, suppressing apple-related neurons and leading the model to hallucinate the color “red.” By selectively sup- pressing this neuron or amplifying apple-specific neurons, the model correctly predicts the apple color as “black.” Sim- ilarly, in the sheep-and-dog example (Figure 7), residual activation of dog-related neurons causes the model to hallu- cinate the presence of a dog; suppressing the corresponding neuron restores the correct prediction. Overall, these results demonstrate that neuron-level interven- tion provides a fine-grained and interpretable mechanism for hallucination mitigation. By amplifying image-specific neu- rons that encode scene-relevant semantics and suppressing misleading activations, the model’s internal representations become better aligned with visual evidence. 5 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering LLM ... ... Noised Image Contrastive Neuron Steering Input Image Vision Encoder Vision Encoder Projection Language Instruction <Is there a ...> ... ... ... ... ... ... ... Language Response <Yes, there is a ...> Frozen Pretrained SAE Neuron Figure 8. Framework overview. We integrate a pretrained SAE into the LVLM visual encoder to decompose dense visual features into sparse neurons, enabling neuron-level interpretation and control. We then apply Contrastive Neuron Steering (CNS) to identify image-specific neurons by comparing activations from clean and noisy inputs, selectively amplify these neurons, and produce more grounded visual representations, thereby reducing hallucinations. 5. Method 5.1. Overview Building on our analysis, we propose Contrastive Neu- ron Steering (CNS) to mitigate hallucinations by selec- tively amplifying image-specific neurons that encode visu- ally grounded semantics. CNS achieves this by comparing neuron activations between a noisy and the original clean image to identify image-specific neurons, then amplifying these neurons while suppressing non-informative activations to produce more robust and grounded visual embeddings. As illustrated in Figure 8, a pretrained SAE first decom- poses dense visual features into sparse, interpretable neu- rons, which are then selectively enhanced via CNS. 
The enhanced visual features are fed into the LVLM at the pre- filling stage, and they can be seamlessly combined with existing decoding-stage hallucination mitigation techniques. 5.2. Contrastive Neuron Steering Given a clean imagevand its perturbed counterpartv ′ , we extract sparse neuron activations using a pretrained SAE, denoted asz(v),z(v ′ ) ∈R d . We compute a contrastive activation difference: ∆z = z(v)− z(v ′ ).(4) Always-on Neuron Suppression (ANS).Always-on neu- rons typically exhibit large activation magnitudes and no- ticeable fluctuations, yet contribute little to image-specific semantics or final prediction quality. Directly enhancing these neurons may introduce interference rather than im- proving visual grounding. To address this issue, we pro- pose Always-on Neuron Suppression (ANS), which pre- vents always-on neurons from being steered and preserves only image-specific signals. Concretely, we set the contrastive updates of always-on neurons to zero. LetAdenote the set of always-on neurons identified in Section 4. The ∆z is modified as: ∆z i = ( 0, i∈A, ∆z i , otherwise. (5) Finally, CNS steers the clean activation by adding the fil- tered contrastive signal: ̃z = z(v) + λ· ∆z,(6) whereλcontrols the steering strength. The enhanced activa- tion ̃z is decoded into visual features. 6. Experiments Benchmarks. To comprehensively evaluate our method, we conduct experiments on two categories of benchmarks. (1) Hallucination-focused benchmarks: POPE (Li et al., 2023b), CHAIR (Rohrbach et al., 2018), HallusionBench (Guan et al., 2024), and AMBER (Wang et al., 2023), which evaluate hallucinations in classification, captioning, vi- sual consistency, and visually grounded reasoning. (2) General-purpose benchmarks: VizWiz (Gurari et al., 2018), MME (Fu et al., 2023), LLaVA-Wild (Liu et al., 2023b), and M-Vet (Yu et al., 2023), covering a wide range of visual understanding, reasoning, and real-world multimodal tasks. Evaluated LVLMs. We evaluate our method on several rep- resentative open-source LVLMs, including LLaVA-1.5 (Liu et al., 2024a), InstructBLIP (Dai et al., 2023), and Qwen- VL (Bai et al., 2023). We additionally consider a larger scale variant, LLaVA-1.5-13B, as well as the stronger and newer LLaVA-NeXT (Liu et al., 2024b). Together, these models cover diverse backbone architectures and model scales, providing a comprehensive evaluation setting. Fol- lowing prior works (Leng et al., 2024; Wan et al., 2025), we apply sampling-based decoding in default. Unless otherwise specified, LLaVA-1.5 is used as the default model. Baselines.We compare CNS with various hallucina- tion mitigation methods: (1) contrastive decoding meth- ods (VCD (Leng et al., 2024), PAI (Liu et al., 2024d), M3ID (Favero et al., 2024), ICD (Wang et al., 2024)); (2) auxiliary expert model methods (HALC (Chen et al., 2024), AGLA (An et al., 2025), Woodpecker (Yin et al., 2023)); and (3) static internal signal methods (OPERA (Huang et al., 2023), VAF (Yin et al., 2025), VAR (Kang et al., 2025), ONLY (Wan et al., 2025)). Unless otherwise specified, we adopt sampling-based decoding as the default. Implementation Details. To obtain interpretable and sparse visual representations for neuron-level analysis, we train a Matryoshka BatchTopK SAE (Bussmann et al., 2024b) on ImageNet (Deng et al., 2009) using image features extracted from each LVLM’s visual encoder. The Matryoshka groups are set asM = 0.0625ω, 0.1875ω, 0.4375ω,ωto pro- gressively increase the number of active neurons per level. 
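Putting the pieces of Section 5.2 together, a minimal PyTorch-style sketch of the CNS update in Eqs. (4)-(6), with Always-on Neuron Suppression applied to the contrastive difference. The always_on index tensor and activation shapes are assumptions; the frozen SAE decoder step is described in the text and omitted here.

import torch

def contrastive_neuron_steering(
    z_clean: torch.Tensor,     # z(v): SAE activations of the clean image
    z_noisy: torch.Tensor,     # z(v'): SAE activations of the noisy image
    always_on: torch.Tensor,   # indices of always-on neurons (the set A)
    lam: float = 1.0,          # steering strength lambda
) -> torch.Tensor:
    delta = z_clean - z_noisy            # Eq. (4): contrastive activation difference
    delta[..., always_on] = 0.0          # Eq. (5): zero updates for always-on neurons (ANS)
    return z_clean + lam * delta         # Eq. (6): steered activation z~

# The steered activations are then passed through the frozen SAE decoder to obtain
# enhanced visual features, which enter the LVLM at the prefilling stage.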
6 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Table 2. Results on the POPE (Accuracy and F1). ↑ indicates that higher is better. SetupMethod LLaVA-1.5InstructBLIPQwen-VLLLaVA-1.5-13BLLaVA-Next Accuracy↑F1↑Accuracy↑F1↑Accuracy↑F1↑Accuracy↑F1↑Accuracy↑F1↑ Random Vanilla84.6384.9983.3383.5785.1783.0083.2784.2784.2382.14 + CNS87.7687.8986.4886.6888.2386.2986.2887.3687.1885.62 VCD84.5785.0284.6084.4984.6982.9183.4784.5283.4881.36 + CNS87.9688.1387.8287.9789.2789.4286.6287.8386.9688.18 Popular Vanilla81.3382.3376.0077.9484.5082.5080.5782.1982.3380.44 + CNS84.7085.5679.0780.7186.7784.8583.4285.0185.1083.22 VCD81.5783.0276.6877.8284.3782.4680.9682.4782.4680.68 + CNS83.5384.7380.2381.5188.1786.7883.8385.2485.2183.94 Adversarial Vanilla75.8778.2774.1776.5882.5380.5675.1278.3579.3777.88 + CNS79.1681.5277.6279.7485.8683.9278.3481.6282.5781.23 VCD76.1378.6274.6276.8282.7980.8275.6478.6879.7278.23 + CNS79.4881.8677.9279.9686.1284.2878.8282.0882.9481.91 Table 3. Results on CHAIR (Max Token 128). ↓ denotes lower is better. – denotes unavailable results. Method LLaVA-1.5InstructBLIPQwen-VLLLaVA-1.5-13BLLaVA-Next CHAIR S ↓CHAIR I ↓CHAIR S ↓CHAIR I ↓CHAIR S ↓CHAIR I ↓CHAIR S ↓CHAIR I ↓CHAIR S ↓CHAIR I ↓ Vanilla55.116.457.417.652.116.750.414.730.210.9 + CNS51.214.852.215.248.714.247.613.429.29.6 PAI53.115.154.215.649.415.6— ICD54.615.455.216.249.816.2— VCD53.415.855.115.749.115.648.614.129.410.2 + CNS50.113.752.214.247.813.647.313.227.99.1 M3ID56.615.862.318.249.817.4— + CNS52.414.256.315.448.215.4— Woodpecker57.616.760.817.651.816.3— HALC51.014.853.815.749.615.4— + CNS49.614.151.214.648.814.2— AGLA52.414.654.816.249.815.6— + CNS50.413.651.815.348.614.3— OPERA51.614.254.214.848.614.6— VAF50.114.253.415.148.714.4— ONLY49.814.352.215.548.014.3— + CNS48.413.251.414.147.213.2— VAR49.614.152.314.748.413.9— + CNS48.212.650.913.546.912.3— Table 4. Results on the HallusionBench. MethodsfACC↑qACC↑ easyA↑ hardA↑ Vanilla17.98.1336.036.7 + CNS18.28.3936.437.4 VCD15.68.4734.835.2 + CNS18.79.1236.837.8 Table 5. Results on AMBER Generative Subset. MethodCHAIR (↓)Cover (↑)Hall (↓)Cog (↓) Vanilla7.851.036.44.2 + CNS7.152.434.73.9 VCD7.450.836.14.0 + CNS6.952.834.23.7 We fix the maximum number of non-zero latent neurons toK = 20and set the expansion factor to64. Training is performed for10 5 steps with a batch size of 4096 us- ing Adam (Kingma & Ba, 2017) with a learning rate of 16 125 √ ω . This setup balances sparsity and reconstruction quality, yielding robust neuron-level features for subsequent Table 6. Results on multiple general vision-language benchmarks. Methods VizWizMMELLaVA-WildMM-Vet Accuracy↑Overall↑Average↑Total↑ Vanilla50.001864.6864.8031.1 + CNS50.921883.5765.9232.4 VCD45.621873.8263.2830.6 + CNS51.821892.6866.8432.8 analysis and enhancement. All experiments are conducted on a single NVIDIA RTX 6000 GPU (48GB). 6.1. Main Experimental Results Results on Hallucination Benchmarks. As shown in Ta- bles 2 to 5, CNS consistently outperforms existing decoding- stage hallucination mitigation methods across model archi- tectures and scales. By operating at the representation level, CNS identifies and selectively enhances image-specific neu- rons while suppressing non-informative or noisy activations. This removes hallucination-inducing factors from the input 7 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Table 7. Inference efficiency. MethodAvg. 
Latency↓GPU Memory↓ Vanilla3.19 s (×1.00)14945 MB (×1.00) VCD6.44 s (×2.02)15749 MB (×1.05) M3ID6.52 s (×2.04)15575 MB (×1.04) OPERA22.64 s (×7.10)22706 MB (×1.52) Woodpecker9.76 s (×3.06)22199 MB (×1.49) HALC20.64 s (×6.47)23084 MB (×1.54) CNS (Ours)3.64 s (×1.14)15226 MB (×1.02) 100200300400500600700800900999 Noise Steps t 0.1 0.5 1 5 10 Steering Strength 54.554.254.554.354.554.654.654.855.555.6 53.653.352.853.252.953.053.353.254.053.7 52.752.551.952.351.251.752.152.952.652.5 53.953.853.453.252.453.553.553.753.954.0 55.455.255.555.055.455.955.856.756.656.8 Figure 9. Effect of noise steps and steering strength λ. and produces cleaner, semantically grounded visual features, resulting in more accurate and reliable model outputs. Fur- thermore, CNS can be seamlessly combined with existing decoding methods, leading to additional performance gains. Results on General Benchmarks. As shown in Table 6, CNS maintains and in some cases slightly improves per- formance on general multimodal benchmarks. Rather than simply constraining output distributions, CNS refines the internal visual representations, enhancing informative neu- rons and reducing noise. This allows the model to better leverage its inherent multimodal reasoning capabilities, mit- igating hallucinations while preserving and in some cases strengthening general visual understanding and reasoning. Efficiency Comparison. As shown in Table 7, we evalu- ate inference efficiency on the CHAIR benchmark using LLaVA-7B with a single NVIDIA A6000 GPU. Compared to multi-pass decoding-stage methods or approaches that rely on additional expert models, which incur substantial overhead due to repeated decoding or extra model forward passes, CNS introduces only minimal overhead. It oper- ates exclusively at the prefilling stage, requiring a single additional forward pass through the visual encoder, while the lightweight SAE with two linear layers adds negligible computational cost. Overall, CNS enhances visual repre- sentation quality, effectively reduces hallucinations, and maintains high computational efficiency. 6.2. Ablation Studies Effect of ANS. To evaluate the role of ANS, we compare the full CNS with a variant that does not suppress contrastive updates on always-on neurons (w/o ANS) across different models and scales. As shown in Table 8, removing ANS leads to a consistent but moderate performance drop on both CHAIR S and CHAIR I . This result indicates that sup- Table 8. Ablation study on the ANS component within CNS. Base ModelMethodCHAIR S ↓CHAIR I ↓ LLaVA-1.5-7B Vanilla55.116.4 + CNS51.214.8 + CNS w/o ANS52.415.3 LLaVA-1.5-13B Vanilla50.414.7 + CNS47.613.4 + CNS w/o ANS48.513.9 LLaVA-Next Vanilla30.210.9 + CNS29.29.6 + CNS w/o ANS29.710.2 Qwen-VL Vanilla52.116.7 + CNS48.714.2 + CNS w/o ANS49.814.9 InstructBLIP Vanilla57.417.6 + CNS52.215.2 + CNS w/o ANS53.615.9 pressing always-on neurons helps reduce interference from non-informative activations, allowing CNS to better focus enhancement on image-specific neurons and thereby pro- duce more reliable visual representations that effectively mitigate hallucinations. Noise Stept. As shown in Figure 9, we investigate the influ- ence of the number of noise steps used to generate perturbed inputs on LLaVA-1.5 7b. Performance initially improves as the noise step increases, because stronger perturbations more effectively reveal neurons that are sensitive to input variations, allowing CNS to better identify image-specific neurons for enhancement. 
Beyond roughly 500 steps, fur- ther increases slightly degrade performance, as excessively strong perturbations can suppress even stable image-specific neurons and introduce noisy contrastive signals, indicating an optimal value around t = 500. Steering Strengthλ. As shown in Figure 9, we study the effect of the neuron enhancement strengthλon LLaVA- 1.5 7b. The best overall results are observed atλ = 1. Moderate values sufficiently amplify image-specific neurons without causing instability. Smaller values (e.g., 0.1) result in limited enhancement, while excessively large values may amplify spurious activations and reduce performance. For more analysis, experiments, and visualizations, please refer to the appendix. 7. Conclusion We present a representation-level framework for understand- ing and mitigating hallucinations in LVLMs. By decom- posing visual embeddings into sparse interpretable neurons, we find that hallucinations arise from spurious or disrupted image-specific neurons and show that neuron-level interven- tions can correct outputs. Our CNS selectively enhances image-specific neurons via contrastive analysis of clean and noisy inputs, improving visual grounding, reducing hal- lucinations, and remaining fully compatible with existing decoding-stage methods. 8 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering 8. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. References Agarwal, V., Shetty, R., and Fritz, M. Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing. In CVPR, p. 9690–9698, 2020. Agrawal, A., Batra, D., and Parikh, D. Analyzing the behav- ior of visual question answering models. arXiv preprint arXiv:1606.07356, 2016. An, W., Tian, F., Leng, S., Nie, J., Lin, H., Wang, Q., Chen, P., Zhang, X., and Lu, S. Mitigating object hallucinations in large vision-language models with assembly of global and local attention. In CVPR, 2025. Anthropic. Introducing Claude 3.5 Sonnet. June 2024. Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023. Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2, 2023. Bussmann, B., Leask, P., and Nanda, N. Batchtopk sparse autoencoders. arXiv preprint arXiv:2412.06410, 2024a. Bussmann, B., Leask, P., and Nanda, N. Learning multi- level features with matryoshka saes. AI Alignment Forum, 2024b. Chen, Z., Zhao, Z., Luo, H., Yao, H., Li, B., and Zhou, J. Halc: Object hallucination reduction via adaptive focal- contrast decoding. arXiv preprint arXiv:2403.00425, 2024. Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and He, P. Dola: Decoding by contrasting layers improves factuality in large language models.arXiv preprint arXiv:2309.03883, 2023. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. 
Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P. N., and Hoi, S. Instructblip: Towards general- purpose vision-language models with instruction tuning. NIPS, 36:49250–49267, 2023. Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. Vision transformers need registers. In ICLR, 2024. DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z. F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Ding, H., Xin, H., Gao, H., Qu, H., Li, H., Guo, J., Li, J., Wang, J., Chen, J., Yuan, J., Qiu, J., Li, J., Cai, J. L., Ni, J., Liang, J., Chen, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Zhao, L., Wang, L., Zhang, L., Xu, L., Xia, L., Zhang, M., Zhang, M., Tang, M., Li, M., Wang, M., Li, M., Tian, N., Huang, P., Zhang, P., Wang, Q., Chen, Q., Du, Q., Ge, R., Zhang, R., Pan, R., Wang, R., Chen, R. J., Jin, R. L., Chen, R., Lu, S., Zhou, S., Chen, S., Ye, S., Wang, S., Yu, S., Zhou, S., Pan, S., Li, S. S., Zhou, S., Wu, S., Ye, S., Yun, T., Pei, T., Sun, T., Wang, T., Zeng, W., Zhao, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Xiao, W. L., An, W., Liu, X., Wang, X., Chen, X., Nie, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yang, X., Li, X., Su, X., Lin, X., Li, X. Q., Jin, X., Shen, X., Chen, X., Sun, X., Wang, X., Song, X., Zhou, X., Wang, X., Shan, X., Li, Y. K., Wang, Y. Q., Wei, Y. X., Zhang, Y., Xu, Y., Li, Y., Zhao, Y., Sun, Y., Wang, Y., Yu, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Ou, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Xiong, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Zhu, Y. X., Xu, Y., Huang, Y., Li, Y., Zheng, Y., Zhu, Y., Ma, Y., Tang, Y., Zha, Y., Yan, Y., Ren, Z. Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Xie, Z., Zhang, Z., Hao, Z., Ma, Z., Yan, Z., Wu, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Pan, Z., Huang, Z., Xu, Z., Zhang, Z., and Zhang, Z. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. Demircan, C., Saanum, T., Jagadish, A. K., Binz, M., and Schulz, E. Sparse autoencoders reveal temporal differ- ence learning in large language models. In ICLR, 2025. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, p. 248–255. Ieee, 2009. Durmus, E., Tamkin, A., Clark, J., Wei, J., Marcus, J., Batson, J., Handa, K., Lovitt, L., Tong, M., McCain, M., Rausch, O., Huang, S., Bowman, S., Ritchie, S., 9 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Henighan, T., and Ganguli, D. Evaluating feature steering: A case study in mitigating social biases, 2024. Favero, A., Zancato, L., Trager, M., Choudhary, S., Perera, P., Achille, A., Swaminathan, A., and Soatto, S. Multi- modal hallucination control by visual information ground- ing. In CVPR, p. 14303–14312, 2024. Ferrando, J., Obeso, O. B., Rajamanoharan, S., and Nanda, N. Do i know this entity? knowledge awareness and hallucinations in language models. In ICLR, 2025. Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., et al. 
Mme: A compre- hensive evaluation benchmark for multimodal large lan- guage models. arXiv preprint arXiv:2306.13394, 2023. Gallifant, J., Chen, S., Sasse, K., Aerts, H., Hartvigsen, T., and Bitterman, D. S. Sparse autoencoder features for classifications and transferability. arXiv preprint arXiv:2502.11367, 2025. Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al. Hallusionbench: an advanced diagnostic suite for entangled language hal- lucination and visual illusion in large vision-language models. In CVPR, p. 14375–14385, 2024. Gunjal, A., Yin, J., and Bas, E. Detecting and preventing hallucinations in large vision language models. In AAAI, number 16, 2024. Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J. P. Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, p. 3608–3617, 2018. Han, Y., Nie, L., Yin, J., Wu, J., and Yan, Y.Vi- sual perturbation-aware collaborative learning for over- coming the language prior problem. arXiv preprint arXiv:2207.11850, 2022. Huang, Q., Dong, X., Zhang, P., Wang, B., He, C., Wang, J., Lin, D., Zhang, W., and Yu, N. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. arXiv preprint arXiv:2311.17911, 2023. Jiang, N., Dravid, A., Efros, A., and Gandelsman, Y. Vision transformers don’t need trained registers. In NeurIPS, 2025a. Jiang, Z., Chen, J., Zhu, B., Luo, T., Shen, Y., and Yang, X. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucina- tions via attention lens. In CVPR, 2025b. Kaduri, O., Bagon, S., and Dekel, T. What’s in the image? a deep-dive into the vision of vision language models. In CVPR, p. 14549–14558, 2025. Kang, S., Kim, J., Kim, J., and Hwang, S. J. See what you are told: Visual attention sink in large multimodal models. arXiv:2503.03321, 2025. Kim, D., Thomas, X., and Ghadiyaram, D. Revelio: Inter- preting and leveraging semantic information in diffusion models. arXiv preprint arXiv:2411.16725, 2024. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. Lee, K., Firat, O., Agarwal, A., Fannjiang, C., and Sus- sillo, D. Hallucinations in neural machine translation. OpenReview, 2018. Lee, S., Park, S. H., Jo, Y., and Seo, M. Volcano: mitigating multimodal hallucination through self-feedback guided revision. arXiv preprint arXiv:2311.07362, 2023. Leng, S., Zhang, H., Chen, G., Li, X., Lu, S., Miao, C., and Bing, L. Mitigating object hallucinations in large vision- language models through visual contrastive decoding. In CVPR, p. 13872–13882, 2024. Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T. B., Zettlemoyer, L., and Lewis, M. Con- trastive decoding: Open-ended text generation as opti- mization. In ACL, p. 12286–12312, 2023a. Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision- language models.arXiv preprint arXiv:2305.10355, 2023b. Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., and Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a. Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. In NeurIPS, 2023b. Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In CVPR, p. 26296–26306, 2024a. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. 
Llavanext: Improved reasoning, ocr, and world knowledge, 2024b. Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., and Peng, W. A survey on halluci- nation in large vision-language models. arXiv preprint arXiv:2402.00253, 2024c. 10 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Liu, S., Zheng, K., and Chen, W. Paying more attention to image: A training-free method for alleviating hallucina- tion in lvlms. arXiv preprint arXiv:2407.21771, 2024d. Liu, Z., Chen, Z., Liu, H., Luo, C., Tang, X., Wang, S., Zeng, J., Dai, Z., Shi, Z., Wei, T., et al. Seeing but not believing: Probing the disconnect between visual atten- tion and answer correctness in vlms. arXiv:2510.17771, 2025. Lou, H., Li, C., Ji, J., and Yang, Y. Sae-v: Interpreting multimodal models for enhanced alignment. In ICML, 2025. Lyu, G., Cheng, X., Xu, C., Liu, Q., Yang, M., Fang, F., Chen, H., Yan, J., Yang, X., and Deng, C. Revealing perception and generation dynamics in lvlms: Mitigat- ing hallucinations via validated dominance correction. arXiv:2512.18813, 2025. Makhzani, A. and Frey, B. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013. Nabeshima, N. Matryoshka sparse autoencoders. AI Align- ment Forum, 2024. Nanda, N., Conmy, A., Smith, L., Rajamanoharan, S., Lieberum, T., Kram ́ ar, J., and Varma, V. Progress up- date# 1 from the gdm mech interp team: Full update. In AI Alignment Forum, 2024. Olshausen, B. A. and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vi- sion research, 37(23):3311–3325, 1997. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Bal- aji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman, G., Brooks, T., Brundage, M., Button, K., Cai, T., Campbell, R., Cann, A., Carey, B., Carlson, C., Carmichael, R., Chan, B., Chang, C., Chantzis, F., Chen, D., Chen, S., Chen, R., Chen, J., Chen, M., Chess, B., Cho, C., Chu, C., Chung, H. W., Cummings, D., Cur- rier, J., Dai, Y., Decareaux, C., Degry, T., Deutsch, N., Deville, D., Dhar, A., Dohan, D., Dowling, S., Dunning, S., Ecoffet, A., Eleti, A., Eloundou, T., Farhi, D., Fedus, L., Felix, N., Fishman, S. P., Forte, J., Fulford, I., Gao, L., Georges, E., Gibson, C., Goel, V., Gogineni, T., Goh, G., Gontijo-Lopes, R., Gordon, J., Grafstein, M., Gray, S., Greene, R., Gross, J., Gu, S. S., Guo, Y., Hallacy, C., Han, J., Harris, J., He, Y., Heaton, M., Heidecke, J., Hesse, C., Hickey, A., Hickey, W., Hoeschele, P., Houghton, B., Hsu, K., Hu, S., Hu, X., Huizinga, J., Jain, S., Jain, S., Jang, J., Jiang, A., Jiang, R., Jin, H., Jin, D., Jomoto, S., Jonn, B., Jun, H., Kaftan, T.,Łukasz Kaiser, Kamali, A., Kanitscheider, I., Keskar, N. S., Khan, T., Kilpatrick, L., Kim, J. W., Kim, C., Kim, Y., Kirchner, J. H., Kiros, J., Knight, M., Kokotajlo, D.,Łukasz Kondraciuk, Kondrich, A., Konstantinidis, A., Kosic, K., Krueger, G., Kuo, V., Lampe, M., Lan, I., Lee, T., Leike, J., Leung, J., Levy, D., Li, C. M., Lim, R., Lin, M., Lin, S., Litwin, M., Lopez, T., Lowe, R., Lue, P., Makanju, A., Malfacini, K., Manning, S., Markov, T., Markovski, Y., Martin, B., Mayer, K., Mayne, A., McGrew, B., McKinney, S. 
M., McLeavey, C., McMillan, P., McNeil, J., Medina, D., Mehta, A., Menick, J., Metz, L., Mishchenko, A., Mishkin, P., Monaco, V., Morikawa, E., Mossing, D., Mu, T., Murati, M., Murk, O., M ́ ely, D., Nair, A., Nakano, R., Nayak, R., Neelakantan, A., Ngo, R., Noh, H., Ouyang, L., O’Keefe, C., Pachocki, J., Paino, A., Palermo, J., Pantuliano, A., Parascandolo, G., Parish, J., Parparita, E., Passos, A., Pavlov, M., Peng, A., Perelman, A., de Avila Belbute Peres, F., Petrov, M., de Oliveira Pinto, H. P., Michael, Pokorny, Pokrass, M., Pong, V. H., Powell, T., Power, A., Power, B., Proehl, E., Puri, R., Radford, A., Rae, J., Ramesh, A., Raymond, C., Real, F., Rimbach, K., Ross, C., Rotsted, B., Roussez, H., Ryder, N., Saltarelli, M., Sanders, T., Santurkar, S., Sastry, G., Schmidt, H., Schnurr, D., Schulman, J., Sel- sam, D., Sheppard, K., Sherbakov, T., Shieh, J., Shoker, S., Shyam, P., Sidor, S., Sigler, E., Simens, M., Sitkin, J., Slama, K., Sohl, I., Sokolowsky, B., Song, Y., Stau- dacher, N., Such, F. P., Summers, N., Sutskever, I., Tang, J., Tezak, N., Thompson, M. B., Tillet, P., Tootoonchian, A., Tseng, E., Tuggle, P., Turley, N., Tworek, J., Uribe, J. F. C., Vallone, A., Vijayvergiya, A., Voss, C., Wainwright, C., Wang, J. J., Wang, A., Wang, B., Ward, J., Wei, J., Weinmann, C., Welihinda, A., Welinder, P., Weng, J., Weng, L., Wiethoff, M., Willner, D., Winter, C., Wolrich, S., Wong, H., Workman, L., Wu, S., Wu, J., Wu, M., Xiao, K., Xu, T., Yoo, S., Yu, K., Yuan, Q., Zaremba, W., Zellers, R., Zhang, C., Zhang, M., Zhao, S., Zheng, T., Zhuang, J., Zhuk, W., and Zoph, B. Gpt-4 technical report, 2024. Pach, M., Karthik, S., Bouniot, Q., Belongie, S., and Akata, Z. Sparse autoencoders learn monosemantic features in vision-language models.arXiv preprint arXiv:2504.02821, 2025. Park, W., Kim, W., Kim, J., and Do, J. Second: Mitigating perceptual hallucination in vision-language models via selective and contrastive decoding. In ICML, 2025. Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kram ́ ar, J., and Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024. Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and 11 Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering Saenko, K. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018. Shu, D., Wu, X., Zhao, H., Rai, D., Yao, Z., Liu, N., and Du, M. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. arXiv preprint arXiv:2503.05613, 2025. Sun, M., Chen, X., Kolter, J. Z., and Liu, Z. Massive activations in large language models. arXiv:2402.17762, 2024. Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L.-Y., Wang, Y.-X., Yang, Y., et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023. Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., Mac- Diarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., and Henighan, T. Scaling monosemanticity: Extracting interpretable fea- tures from claude 3 sonnet. Transformer Circuits Thread, 2024. Thasarathan, H., Forsyth, J., Fel, T., Kowal, M., and Derpanis, K.Universal sparse autoencoders: Inter- pretable cross-model concept alignment. 
arXiv preprint arXiv:2502.03714, 2025. Wan, Z., Zhang, C., Yong, S., Ma, M. Q., Stepputtis, S., Morency, L.-P., Ramanan, D., Sycara, K., and Xie, Y. Only: One-layer intervention sufficiently mitigates hallu- cinations in large vision-language models. arXiv preprint arXiv:2507.00898, 2025. Wang, J., Wang, Y., Xu, G., Zhang, J., Gu, Y., Jia, H., Wang, J., Xu, H., Yan, M., Zhang, J., et al. Amber: An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023. Wang, X., Pan, J., Ding, L., and Biemann, C.Miti- gating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024. Wu, X., Yuan, J., Yao, W., Zhai, X., and Liu, N. Inter- preting and steering llms with mutual information-based explanations on sparse autoencoders. arXiv preprint arXiv:2502.15576, 2025. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Ef- ficient streaming language models with attention sinks. arXiv:2309.17453, 2023. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. Yin, H., Si, G., and Wang, Z. Clearsight: Visual signal enhancement for object hallucination mitigation in mul- timodal large language models. In CVPR, p. 14625– 14634, 2025. Yin, S., Fu, C., Zhao, S., Xu, T., Wang, H., Sui, D., Shen, Y., Li, K., Sun, X., and Chen, E. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023. Yu, Q., Li, J., Wei, L., Pang, L., Ye, W., Qin, B., Tang, S., Tian, Q., and Zhuang, Y. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In CVPR, p. 12944–12953, 2024. Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490, 2023. Yue, Z., Zhang, L., and Jin, Q. Less is more: Mitigating mul- timodal hallucination from an eos decision perspective. arXiv preprint arXiv:2402.14545, 2024. Zhai, B., Yang, S., Xu, C., Shen, S., Keutzer, K., Li, C., and Li, M. Halle-control: Controlling object hal- lucination in large multimodal models. arXiv preprint arXiv:2310.01779, 2024. Zhang, K., Shen, Y., Li, B., and Liu, Z. Large multi-modal models can interpret features in large multi-modal models. arXiv preprint arXiv:2411.14982, 2024. Zhibo, R., Huizhen, W., Muhua, Z., Yichao, W., Tong, X., and Jingbo, Z. Overcoming language priors with coun- terfactual inference for visual question answering. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, p. 600–610, 2023. Zhou, Y., Cui, C., Rafailov, R., Finn, C., and Yao, H. Align- ing modalities in vision large language models via prefer- ence fine-tuning. arXiv preprint arXiv:2402.11411, 2024. Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. Zhu, L., Ji, D., Chen, T., Xu, P., Ye, J., and Liu, J. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding.arXiv preprint arXiv:2402.18476, 2024. Zou, X., Wang, Y., Yan, Y., Huang, S., Zheng, K., Chen, J., Tang, C., and Hu, X. Look twice before you answer: Memory-space visual retracing for hallucination mitiga- tion in multimodal large language models. In ICML, 2025. 
A. Detailed Neuron-Level Hallucination Analysis and Mitigation

Figures 10 to 13 present representative examples illustrating how hallucinations in LVLMs arise from abnormal, competing, or context-driven neuron activations under different captioning and targeted questioning scenarios. These cases further demonstrate that targeted neuron-level interventions can effectively suppress spurious signals and improve factual reliability and visual grounding in a controllable and interpretable manner.

In the black-apple example (Figure 10), the model is asked a factual attribute question: “What is the color of the apple?” Although the image clearly shows a black apple, the model initially predicts “red.” Neuron-level analysis reveals that neuron 2836, associated with the concept of red strawberries, exhibits abnormally strong activation. This spurious activation dominates apple-related neurons, leading the model to prioritize an irrelevant concept and produce a hallucinated prediction. By selectively suppressing neuron 2836 or amplifying apple-specific neurons (IDs 3085 and 1941), the activation distribution shifts toward the correct visual evidence, and the model correctly outputs “black.” This example illustrates how attribute-level hallucinations can emerge from misaligned concept activations and how targeted neuron modulation can restore correct visual grounding while offering an interpretable explanation of the failure mode.

In the sheep-and-dog example (Figure 11), the model produces a generally accurate image description by focusing on the sheep, with sheep-related neurons strongly activated. However, due to visual ambiguity introduced by partial occlusion, neurons associated with other animal concepts, such as the dog, are also activated. While such activations do not affect general image captioning, they become problematic when the model is asked a concept-specific question, e.g., “Is there a dog in the image?” In this case, the presence of dog-related neuron activation leads the model to incorrectly answer “yes,” despite the absence of a dog. By strongly suppressing the dog-related neuron (ID 2480), the model correctly answers “no.” This case demonstrates that even weak or secondary activations of irrelevant concepts can induce hallucinations under targeted queries, and that neuron-level interventions provide an effective mechanism for suppressing such misleading signals.

In the tennis-ball example (Figure 12), the model exhibits an object-level hallucination caused by spurious concept injection. The image depicts a tennis player swinging a racket on a clay court, yet no tennis ball is present. Despite the lack of visual evidence, the model hallucinates a tennis ball and claims that it is visible near the racket. Neuron-level inspection attributes this error to abnormal activation of a tennis-ball–specific neuron (ID 3541), which introduces an unsupported object concept into the visual representation. Unlike cases involving competing concepts, this hallucination is driven by a single dominant neuron whose activation is not grounded in the input image. After suppressing neuron 3541, the model no longer hallucinates the tennis ball and correctly answers “no” when queried about its presence. This example highlights how isolated spurious neuron activations can directly inject non-existent objects into the model’s output.
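To make these interventions concrete, the following minimal sketch (our illustration, not the authors’ released code) shows how a suppress-or-amplify edit on specific SAE neurons could be applied to a visual embedding before it is handed to the language model at the prefilling stage. The SparseAutoencoder class, the offset-style steering rule, and the dimensions used in the example are assumptions made for illustration; only the neuron IDs are taken from the examples above.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: dense visual embedding -> sparse neuron activations -> reconstruction."""

    def __init__(self, d_model: int, n_neurons: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_neurons)
        self.decoder = nn.Linear(n_neurons, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps neuron activations sparse and non-negative, as in standard SAEs.
        return torch.relu(self.encoder(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)


@torch.no_grad()
def steer_visual_embedding(
    sae: SparseAutoencoder,
    visual_embedding: torch.Tensor,      # (n_patches, d_model) output of the vision encoder
    neuron_weights: dict[int, float],    # e.g. {2836: -10.0} or {3085: 5.0, 1941: 5.0}
) -> torch.Tensor:
    """Return a steered embedding with selected SAE neurons suppressed or amplified."""
    z = sae.encode(visual_embedding)
    for neuron_id, alpha in neuron_weights.items():
        if alpha < 0:
            # Suppression: shift the activation down and clamp at zero.
            z[:, neuron_id] = torch.clamp(z[:, neuron_id] + alpha, min=0.0)
        else:
            # Amplification: add a positive offset to strengthen the concept.
            z[:, neuron_id] = z[:, neuron_id] + alpha
    # The decoded embedding would replace the original visual features at prefilling.
    return sae.decode(z)


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=1024, n_neurons=8192)
    patches = torch.randn(576, 1024)  # placeholder visual tokens
    # Black-apple example: suppress the strawberry neuron, boost the apple neurons.
    steered = steer_visual_embedding(sae, patches, {2836: -10.0, 3085: 5.0, 1941: 5.0})
    print(steered.shape)  # torch.Size([576, 1024])
```

Passing several neuron IDs at once in this fashion corresponds to the multi-neuron steering explored in Appendix E, where coordinated small weights prove more reliable than a single extreme weight.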
In the fork example (Figure 13), the model hallucinates an object due to overactivation of a contextually plausible but visually absent concept. The image shows a dining table containing various food items, such as cake, fruit, cookies, and crackers, but no fork is present. Nevertheless, the model includes a fork in its description and answers “yes” when asked whether a fork appears in the image. Neuron-level analysis reveals that this behavior is driven by abnormal activation of a fork-related neuron (ID 7010), likely influenced by strong co-occurrence priors between dining scenes and utensils. This neuron injects a utensil concept into the representation despite the absence of supporting visual evidence. Suppressing neuron 7010 removes the hallucinated fork and restores the correct response. This case illustrates how contextual priors encoded at the neuron level can lead to object hallucinations, and how targeted suppression can effectively recover factual consistency.

Across all examples, a consistent pattern is revealed: hallucinations in LVLMs frequently result from spurious, competing, or context-driven activations of concept-specific neurons that are insufficiently grounded in the visual input. Such activations may be triggered by visual ambiguity, strong semantic priors, or dominant but irrelevant concepts. By selectively amplifying neurons aligned with the true visual evidence and suppressing misleading ones, the internal representations become more consistent with the input image. This alignment reduces hallucinations, improves interpretability, and enhances controllability, providing a direct mechanism to mitigate errors at the neuron level.

Overall, these analyses highlight the dual value of neuron-level examination. First, it provides transparent insights into the internal mechanisms that give rise to hallucinations. Second, it enables actionable and fine-grained interventions that directly modulate model behavior. Collectively, these findings demonstrate that interpretable neuron-level interventions not only enhance factual consistency and visual grounding, but also serve as an effective strategy for mitigating hallucinations in LVLMs.

[Figure 10 panels: the model’s answers to “What is the color of the apple?” before and after suppressing neuron 2836 and enhancing neurons 3085 and 1941 (“red” versus “black”).]

Figure 10. Attribute hallucination caused by competing concept activations. The image shows a fruit bowl containing multiple fruits, including a black apple. When asked “What is the color of the apple?”, the model incorrectly answers “red,” exhibiting an attribute-level hallucination. Neuron-level analysis reveals that a strawberry-related neuron (ID 2836; green boxes) becomes abnormally activated and dominates the representation, overwhelming evidence from apple-related neurons (red boxes). By suppressing neuron 2836 or enhancing apple-specific neurons (IDs 3085 and 1941), the model correctly outputs “black.” This example illustrates how hallucinations can arise from spurious dominance of irrelevant semantic neurons and how targeted neuron modulation restores correct visual grounding.
[Figure 11 panels: the model’s detailed caption mentioning the sheep, and its “Yes”/“No” answers to “Is there a dog in the image?” before and after suppressing neuron 2480.]

Figure 11. Object hallucination induced by spurious activation under occlusion. The activation maps indicate that the model primarily focuses on the sheep, with sheep-related neurons strongly activated (red boxes), allowing it to describe the image correctly. However, because the sheep is partially occluded, neurons linked to other animal concepts, such as the dog (green boxes), are also activated. While these additional activations do not affect general image description, they become problematic when the model is asked a concept-specific question such as “Is there a dog in the image?”, leading it to incorrectly answer “yes.” After suppressing the dog-related neuron 2480 by setting its weight to −10, the model correctly responds “no.” This example highlights that extra activation of irrelevant concepts can mislead the model under targeted queries and that neuron-level modulation provides an effective means to suppress such spurious signals and mitigate hallucinations.

[Figure 12 panels: the model’s caption hallucinating a visible tennis ball, and its “Yes”/“No” answers to “Is there a tennis ball in the image?” before and after suppressing neuron 3541.]

Figure 12. Object hallucination caused by spurious concept injection. The image depicts a tennis player swinging a racket on a clay court, but no tennis ball is present. Despite the absence of visual evidence, the model hallucinates a tennis ball in its description, claiming that “the tennis ball is visible in the air near the racket.” Neuron-level analysis attributes this error to abnormal activation of a tennis-ball–specific neuron (ID 3541), which injects an unsupported object concept into the visual representation. After suppressing this neuron, the model no longer hallucinates the tennis ball and correctly answers “no” when asked about its presence. This example highlights how isolated spurious neuron activations can directly lead to object-level hallucinations.

[Figure 13 panels: the model’s table description mentioning a fork, and its “Yes”/“No” answers to “Is there a fork in the image?” before and after suppressing neuron 7010.]

Figure 13. Object hallucination driven by overactive utensil-related neurons. The image depicts a dining table filled with food and drinks, including a cake, fruit, cookies, and crackers, but no fork is present in the scene. Nevertheless, the model hallucinates a fork in its image description and subsequently answers “yes” when asked whether a fork appears in the image. Neuron-level inspection reveals that this error is caused by abnormal activation of a fork-related concept neuron (ID 7010), which introduces an unsupported utensil concept into the visual representation.
After suppressing this neuron, the model no longer mentions a fork and correctly answers “no” to the concept-specific query. This example further illustrates that object hallucinations in LVLMs can originate from isolated yet overactive concept neurons, and that targeted neuron-level modulation effectively restores correct visual grounding and factual consistency.

B. Additional Neuron Visualizations

Figure 14 presents additional examples of neurons discovered by our sparse autoencoder (SAE). Many neurons exhibit strong associations with concrete objects or concepts, such as #14174 for corn, #46469 for oranges, and #61697 for dogs wearing Christmas hats. Beyond object-level semantics, some neurons capture more abstract structural cues, such as #62747, which consistently responds to spiral or fan-shaped patterns. These examples demonstrate the richness and diversity of the learned neuron space, ranging from fine-grained objects to higher-level structural abstractions. Such diversity not only enhances the interpretability of internal visual representations but also provides a strong foundation for precise neuron-level interventions, thereby facilitating both mechanistic understanding and controllable steering of LVLM outputs.

C. Additional Neuron Analysis

To gain deeper insights into the functional roles of individual neurons in LVLM visual representations, we perform both image-level and patch-level analyses. These qualitative results complement the quantitative findings in Section 4, providing a more intuitive understanding of how neurons encode semantic information.

Image-Level Analysis and Visualizations. Figure 16 highlights neurons with consistently high activation across different images. We observe that a small subset of “always-on” neurons remain persistently active regardless of image content, often encoding recurring textures or repetitive small objects rather than scene-specific information. The bottom panel further visualizes the top-activated images for each neuron, confirming that these neurons capture similar global patterns across diverse inputs. Within CNS, we reduce their disproportionate influence via Always-on Neuron Suppression (ANS), which decreases redundancy, emphasizes image-specific content, and improves the interpretability of downstream neuron-level interventions.

Patch-Level Analysis and Visualizations. Figure 17 illustrates neuron activations at the patch level. Unlike always-on neurons, most patch-level neurons respond reliably to localized, semantically meaningful concepts, such as distinctive textures or object parts. This indicates that individual neurons often encode interpretable, fine-grained features, which are particularly well-suited for targeted interventions. By selectively modulating these neurons, we can directly influence which visual concepts are emphasized or suppressed in LVLM outputs, enabling fine-grained and interpretable control.

Summary. Together, the image-level and patch-level analyses reveal a dual organization of neuron activations: broadly active global features and selectively tuned local features. This dual perspective underpins our CNS approach, where targeted neuron-level interventions enable controllable and interpretable mitigation of hallucinations, while also deepening mechanistic insights into LVLM visual processing.
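The Always-on Neuron Suppression (ANS) step mentioned above presupposes a way of identifying neurons that fire on essentially every image. The sketch below is one plausible implementation of that identification, not the paper’s actual procedure: the activation and frequency thresholds, and the duck-typed `sae` object with an `encode()` method, are illustrative assumptions.

```python
import torch


@torch.no_grad()
def find_always_on_neurons(
    sae,                                     # trained SAE exposing an .encode() method
    visual_embeddings: list[torch.Tensor],   # one (n_patches, d_model) tensor per image
    activation_threshold: float = 1e-3,      # activation level that counts as "firing"
    frequency_threshold: float = 0.95,       # firing on >=95% of images -> always-on
) -> list[int]:
    """Flag neurons whose image-level activation rate is near 100% across the dataset."""
    fire_counts = None
    for emb in visual_embeddings:
        z = sae.encode(emb)                        # (n_patches, n_neurons)
        image_level = z.max(dim=0).values          # a neuron "fires" if any patch activates it
        fired = (image_level > activation_threshold).float()
        fire_counts = fired if fire_counts is None else fire_counts + fired
    rates = fire_counts / len(visual_embeddings)
    return torch.nonzero(rates >= frequency_threshold).flatten().tolist()
```

ANS would then down-weight the returned neuron IDs (for example, with negative weights in the steering routine sketched in Appendix A) so that image-specific neurons dominate the steered representation.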
D. Noise-Induced Disruption of Internal Visual Features Leading to Hallucinations

In Section 4, we quantitatively analyzed how noise perturbs internal visual features, causing neuron activations to shift and destabilize. These disruptions reshape the semantics of visual representations, inducing hallucinations and degrading LVLM performance (see Figure 3). To illustrate this phenomenon more intuitively, Figure 15 shows an example image containing a camera. As increasing levels of noise are applied, the activation of the “camera” neuron gradually diminishes. Correspondingly, the LVLM output exhibits a progressive semantic drift: initially describing a “black Konica Minolta camera with a large lens,” then simplifying to “camera with a large lens,” and eventually omitting the camera entirely. This case demonstrates how noise-induced disruptions at the neuron level directly erode semantic fidelity in visual features, ultimately manifesting as hallucinations in model outputs.

Importantly, this example underscores the value of SAEs: by decomposing dense visual embeddings into sparse neurons, we gain the ability to trace how specific semantic concepts evolve under perturbations. This neuron-level perspective provides interpretability and analytical clarity, enabling us to pinpoint which neurons are destabilized and how this relates to output degradation. Such insights establish a principled foundation for designing targeted interventions to mitigate hallucinations and improve LVLM reliability.

[Figure 14: grid of SAE neuron visualizations labeled with their neuron IDs.]

Figure 14. Additional visualizations of neurons learned by the SAE. The neurons exhibit diverse semantics, ranging from specific objects (e.g., corn #14174, oranges #46469, Christmas-hat dogs #61697) to abstract structural patterns (e.g., spirals or fan-like shapes #62747). This diversity demonstrates the interpretability of the internal representation space and provides a strong foundation for explaining and steering LVLMs through neuron-level interventions.
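As a rough illustration of the measurement behind Figure 15, and of the contrastive idea underlying CNS, the sketch below encodes the same visual embedding at several noise levels, tracks how a single concept neuron’s activation decays, and ranks neurons by how much activation they lose under noise. The additive Gaussian noise schedule (the figure itself uses diffusion-style timesteps t), the top-k cutoff, and the function names are all illustrative assumptions rather than the paper’s implementation.

```python
import torch


@torch.no_grad()
def activation_under_noise(sae, clean_embedding: torch.Tensor, neuron_id: int,
                           noise_scales=(0.0, 0.5, 1.0, 2.0)) -> list[float]:
    """Track one neuron's image-level activation as additive Gaussian noise grows."""
    history = []
    for scale in noise_scales:
        noisy = clean_embedding + scale * torch.randn_like(clean_embedding)
        z = sae.encode(noisy)                      # (n_patches, n_neurons)
        history.append(z[:, neuron_id].max().item())
    return history


@torch.no_grad()
def rank_image_specific_neurons(sae, clean_embedding: torch.Tensor,
                                noise_scale: float = 1.0, top_k: int = 40) -> list[int]:
    """Neurons whose activation collapses under noise behave as image-specific neurons."""
    noisy = clean_embedding + noise_scale * torch.randn_like(clean_embedding)
    z_clean = sae.encode(clean_embedding).max(dim=0).values
    z_noisy = sae.encode(noisy).max(dim=0).values
    gap = z_clean - z_noisy                        # large clean-minus-noisy gap -> informative
    return torch.topk(gap, k=top_k).indices.tolist()
```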
[Figure 15 panels: the model’s captions of the camera image at increasing noise timesteps t, with the “camera” neuron activation α dropping from roughly 10.8 on the clean image to below 1 at t = 999, and the camera eventually vanishing from the description.]

Figure 15. Example of noise affecting visual feature representations. The image contains a camera. As noise increases, the activation of the “camera” neuron gradually decreases, and the LVLM output progressively loses detail: from “black Konica Minolta camera with a large lens” to “camera with a large lens,” and finally no camera description. This demonstrates how noise disrupts internal semantic representations, leading to hallucinations. It also highlights the advantage of SAEs in decoupling dense LVLM features into sparse neurons, allowing us to track and analyze internal visual feature changes at the neuron level.

[Figure 16: grid of neuron IDs with their top-activated images.]

Figure 16. Image-level neuron analysis and visualization. The top panel shows neurons with consistently high activation rates across different images, while the bottom panel visualizes the top-activated images for each neuron. Neurons highlighted by red boxes are always-on neurons, which tend to respond to recurring textures or small visual patterns and encode similar global information across images. These red-highlighted neurons are selected for suppression in CNS via ANS. Suppressing such always-on neurons helps emphasize image-specific objects and provides an interpretable basis for our neuron-level intervention.

[Figure 17: grid of patch-level neuron IDs with their activation maps.]

Figure 17. Patch-level neuron analysis and visualization. At the patch level, neurons often capture concrete, localized concepts. Activation patterns show that most neurons reliably represent specific visual features, supporting fine-grained neuron-level interventions.

E. Complex and Diverse Case Studies for Neuron-Level Steering in LVLMs

We present a range of challenging case studies to demonstrate how neuron-level steering enables fine-grained, interpretable control over LVLM outputs. These examples illustrate not only the feasibility of manipulating specific concepts but also how scene complexity, semantic distribution, and neuron hierarchy affect intervention difficulty and outcomes.

Multi-Concept Suppression. Figure 18 depicts a scene containing multiple objects (a dog and a chair). By selectively suppressing neurons corresponding to each object, we can remove them from generated captions or descriptions in a controlled manner.
Interestingly, suppression difficulty varies across concepts. For example, removing “chair” neurons is relatively straightforward, with a weight of α = −30 sufficient to eliminate chairs from outputs. In contrast, “dog” neurons require much stronger intervention, sometimes leaving residual references until α = −100 is applied. Analysis reveals that the SAE encodes a hierarchical and distributed representation for dog-related concepts, including multiple neurons for different breeds, poses, and contextual cues. This demonstrates that the difficulty of neuron-level control is directly linked to how a concept is represented internally and distributed across neurons.

Concept Insertion in Simple and Complex Contexts. We also study the insertion of concepts into scenes of varying complexity. As shown in Figure 19, inserting a dog concept into a simple bird-dominated scene requires only a modest weight (α = 50) for the concept to appear. However, in a more complex scene with multiple objects, significantly larger weights (α = 500) are needed for the concept to manifest reliably. These results highlight that the effectiveness of neuron-level interventions is sensitive to scene complexity and competition among activated neurons. Simpler scenes allow easier manipulation, whereas complex scenes require careful tuning to overcome interference from competing visual features.

Multi-Neuron Steering in Complex Scenes. In highly complex contexts, steering a single neuron often requires extreme weights before a concept appears in outputs (Figure 20). Coordinated adjustment of multiple concept-related neurons with smaller individual weights produces more stable, natural, and reliable insertions. A similar pattern holds for suppression (Figure 21): targeting a single neuron requires very large negative weights, while modulating multiple neurons simultaneously with smaller magnitudes removes concepts more effectively. These observations underline the advantages of multi-neuron steering for both insertion and suppression, providing robustness against distributed representations and ensuring smoother, more predictable outputs. Such findings directly motivate our CNS framework, which automatically identifies and adjusts multiple neurons for reliable fine-grained control.

Insights into Internal Representations. Across all cases, neuron-level steering provides interpretable insights into how visual concepts are internally represented in LVLMs. Concepts are often encoded across several neurons with varying sensitivities, and scene complexity, occlusion, or overlapping features can lead to partial activation of irrelevant neurons, contributing to hallucinations. Multi-neuron interventions reveal the distributed and hierarchical nature of these representations and offer a principled way to correct or manipulate outputs. Importantly, these analyses highlight how CNS leverages contrastive identification of image-specific neurons to selectively enhance or suppress concept-relevant activations, ensuring controllable behavior while preserving overall scene understanding.

Summary. Overall, these studies demonstrate that neuron-level steering offers a powerful, interpretable, and generalizable approach for controlling LVLM behavior. Its effectiveness depends on factors such as semantic distribution across neurons, scene complexity, and concept hierarchy.
By revealing the internal mechanisms underlying hallucinations and providing actionable intervention strategies, these case studies highlight how CNS can mitigate hallucinations and, more fundamentally, enhance the reliability of visual feature representations. This work provides practical guidance for fine-grained, concept-level, and interpretable control of LVLM outputs.

[Figure 18 panels: captions of the dog-in-chair image as chair- and dog-related neurons (e.g., #29 and #39) are suppressed with increasingly negative weights, from α = −10 down to α = −100.]

Figure 18. Multi-concept suppression. Suppressing “chair” neurons effectively removes chairs from model outputs. Suppressing “dog” neurons is more challenging, requiring stronger intervention since the SAE has learned a hierarchy of dog-related concepts (e.g., different breeds). This highlights the difficulty of eliminating concepts encoded in multiple fine-grained neurons.

[Figure 19 panels: captions of a simple bird-dominated scene as the dog-related neuron #39 is amplified with increasing positive weights (α = 10 to 50), with a dog appearing in the description.]

Figure 19. Concept insertion in simple contexts. By slightly amplifying a single dog-related neuron, the model begins to hallucinate the presence of dogs in unrelated scenes. Compared to suppression, concept insertion is easier: small weights suffice to introduce the new concept.
[Figure 20 panels: captions of a complex indoor scene when a single dog-related neuron (#39) is amplified with large weights (up to α = 500) versus when three dog-related neurons (#39, #242, #251) are amplified with small weights (α = 10 to 50).]

Figure 20. Concept insertion in complex contexts. (a) Steering with a single dog-related neuron requires a very large weight (α = 500) to produce visible effects. (b) Coordinated steering of three dog-related neurons with smaller weights (α = 20 each) yields natural insertions. This demonstrates the advantage of multi-neuron steering and motivates our CNS approach.

[Figure 21 panels: captions of the dog-in-chair image when a single dog-related neuron (#39) is suppressed with large negative weights versus when three dog-related neurons (#39, #242, #251) are jointly suppressed with small negative weights.]

Figure 21. Concept suppression in complex contexts. (a) Suppressing a single dog-related neuron requires a very large negative weight (α = −100) before the concept disappears from outputs. (b) Jointly suppressing three dog-related neurons with smaller weights (α = −10 each) removes the concept more naturally and reliably, illustrating the effectiveness of multi-neuron steering.

F. Discussion and Comparison with Register Neurons

Several phenomena in LVLMs exhibit stable, high-norm, input-invariant activations, including always-on neurons, register neurons (Darcet et al., 2024; Jiang et al., 2025a), massive activations (Sun et al., 2024), and attention sinks (Xiao et al., 2023; Kang et al., 2025). These phenomena share certain characteristics, such as persistently high activation magnitudes and minimal dependence on specific local visual content.
In our observations, always-on neurons typically exhibit sparse, high-magnitude activations in the 10–80 range, while most of the top-40 neurons fall in the 5–15 range. They appear consistently across inputs and primarily correspond to non-core, global features, reflecting decoupled latent factors in the internal representation.

Shared characteristics: sparse, input-invariant, globally stable activations.
• Activation sparsity: selectively active in specific neurons, facilitating sparse and decoupled representations.
• Input-independence: activation patterns are largely independent of specific local visual content.
• Global role: capture high-level, non-local computations or statistical factors within the model.

Despite these similarities, always-on neurons differ from register neurons in several aspects:
• Generation mechanism: Register neurons arise from MLP outputs within a layer, whereas always-on neurons are identified via SAE decomposition from the entire layer output, capturing sparse latent factors across the full representation.
• Distribution and consistency: Register neurons vary in number and activation location across inputs. In contrast, always-on neurons consistently appear in the same set of neurons across nearly all inputs, hence “always-on”.
• Analysis and interpretability: Always-on neurons provide directly visualizable, decoupled representations, whereas register neurons are primarily interpreted indirectly through their influence on model outputs.
• Concept-level controllability: The sparsity and decoupling of always-on neurons enable fine-grained interventions that can modulate specific concept representations without affecting unrelated features. This property is unique to always-on neurons and is not observed in register neurons or other high-norm activations.

Overall, always-on neurons encode sparse, decoupled latent factors that expose the internal feature structure in a directly interpretable manner. Their sparsity and decoupling further enable fine-grained, concept-level interventions, allowing specific internal representations to be modulated without affecting unrelated features. In contrast, register neurons are structural components tied to MLP parameters and do not provide the same level of sparsity, interpretability, or controllability.

Future directions: Investigating the relationships among always-on neurons, register neurons, massive activations, and attention sinks could yield deeper insights into persistent high-norm activations in LVLMs. In particular, analyzing SAE factors from a sparsity- and norm-based perspective may elucidate the structural and functional roles of these neurons, their contribution to decoupled and interpretable internal representations, and their potential for concept-level interventions in large vision-language models.