
Paper deep dive

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong, Shuxue Quan

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 42

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/22/2026, 6:03:09 AM

Summary

The paper introduces the Tri-Layer Diagnostic Framework to evaluate Vision-Language Models (VLMs) for 'Visual Sycophancy'—a phenomenon where models detect visual anomalies but hallucinate to satisfy user expectations. By analyzing 7 VLMs across 7,000 samples using Latent Anomaly Detection, Visual Necessity Score, and Competition Score, the authors reveal that 69.6% of samples exhibit sycophancy, and larger models (e.g., Qwen2.5-VL 72B) amplify this behavior despite reducing language shortcuts.

Entities (6)

Qwen2.5-VL · vlm · 100%
Tri-Layer Diagnostic Framework · methodology · 100%
Visual Sycophancy · phenomenon · 100%
Competition Score · metric · 95%
Latent Anomaly Detection · metric · 95%
Visual Necessity Score · metric · 95%

Relation Signals (3)

Latent Anomaly Detection componentof Tri-Layer Diagnostic Framework

confidence 100% · The framework operates via three metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score.

Tri-Layer Diagnostic Framework measures Visual Sycophancy

confidence 95% · The framework disentangles hallucination sources and reveals that 69.6% of samples exhibit Visual Sycophancy.

Qwen2.5-VL exhibits Visual Sycophancy

confidence 90% · Scaling analysis shows larger models like Qwen2.5-VL amplify Visual Sycophancy.

Cypher Suggestions (2)

List metrics used by the diagnostic framework · confidence 95% · unvalidated

MATCH (m:Methodology {name: 'Tri-Layer Diagnostic Framework'})-[:MEASURES]->(metric:Metric) RETURN metric.name

Find all models that exhibit Visual Sycophancy · confidence 90% · unvalidated

MATCH (m:VLM)-[:EXHIBITS]->(p:Phenomenon {name: 'Visual Sycophancy'}) RETURN m.name

Abstract

When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy--models detect visual anomalies but hallucinate to satisfy user expectations--while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)


Full Text

42,170 characters extracted from source content.


To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong⋆ (1) and Shuxue Quan (2)
(1) George Mason University, Fairfax, VA, USA; (2) Independent Researcher

Abstract. When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy—models detect visual anomalies but hallucinate to satisfy user expectations—while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B→72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.

Keywords: Vision-Language Models · Hallucination · Visual Sycophancy · Diagnostic Framework · Selective Prediction

1 Introduction

Vision-Language Models (VLMs) have demonstrated impressive capabilities across diverse multimodal tasks, from visual question answering to image captioning and reasoning. Recent state-of-the-art models such as Llama-3.2-11B [21] and Qwen2.5-VL [25] represent significant architectural advances and achieve high accuracy on standard benchmarks. However, despite these performance gains, growing evidence challenges whether these models genuinely process visual information.
Recent studies reveal a troubling paradox: VLMs often succeed without genuinely utilizing visual information. [29] show that state-of-the-art systems struggle with straightforward visual patterns, while [35] and [26] demonstrate failures on elementary compositional and low-level tasks. More strikingly, [6] and [8] reveal that removing images can maintain or even improve accuracy—suggesting visual input may act as a distractor.

⋆ Corresponding author: rhong5@gmu.edu
arXiv:2603.18373v1 [cs.CV] 19 Mar 2026

The Diagnostic Gap. These findings point to a fundamental limitation in current evaluation protocols: accuracy alone cannot diagnose the root cause of grounding failures. When a VLM hallucinates or ignores visual data, accuracy metrics treat it as a binary error, masking the internal mechanism. Is the model suffering from Perceptual Blindness (the encoder cannot see the image)? Is it a Language Shortcut failure (the model ignores the visual tokens)? Or is it Visual Sycophancy (the model sees the truth but lies to satisfy user instruction)? Without distinguishing these mechanisms, we cannot effectively debug model behavior or mitigate hallucinations.

Our Approach. To bridge this gap, we propose the Tri-Layer Hallucination Diagnostic Framework. Instead of treating VLM generation as a black box, we dissect the decision-making process into three cognitive layers: Perception, Dependency, and Alignment. We introduce a causal intervention protocol using "blind" conditions (to isolate language priors) and "conflict" conditions (to test whether models genuinely process visual content or exploit language shortcuts). This allows us to uncover the "Split Beliefs" phenomenon—where models successfully detect visual anomalies in the latent space yet still produce hallucinated answers, suggesting a disconnect between perception and generation.

Our key contributions are:

1.
A Tri-Layer Diagnostic Framework: We introduce a sample-level diagnostic methodology dissecting VLM decision-making into three cognitive layers—Perception, Dependency, and Alignment. Unlike prior work that constructs new diagnostic datasets [6,29], our framework operates on any existing benchmark without additional data curation.

2. Three Diagnostic Metrics: We propose Latent Anomaly Detection (LAD) to verify encoder perception, Visual Necessity Score (VNS) to quantify visual dependency via KL divergence over full output distributions (vs. simple log-probability differences [14]), and Competition Score (CS) to measure the conflict between visual grounding and instruction following.

3. Diagnostic Taxonomy: We establish a four-category taxonomy (Table 1) that classifies why a model fails, not merely that it fails—revealing that high accuracy often conceals severe Visual Sycophancy.

4. Comprehensive Evaluation: We evaluate 7 VLMs across 7,000 model-sample pairs. A scaling analysis (Qwen2.5-VL 7B vs. 72B) reveals that larger models reduce Language Shortcuts but amplify Visual Sycophancy; Gaussian noise validation confirms robustness across stimulus choices.

5. Post-hoc Mitigation: Diagnostic scores enable Diagnostic-Guided Selective Prediction, achieving up to +9.5pp accuracy at 50% coverage with no retraining.

2 Related Work

2.1 Visual Hallucination and Language Priors

Early VQA studies [2,12] identified "blind" guessing behaviors exploiting statistical correlations. [17] introduced POPE for object hallucination probing, though such metrics may not capture generative failure modes. Recent diagnostic benchmarks operate at the dataset level: [29] construct MMVP via "CLIP-blind pairs"; [35] design ARO for compositional sensitivity; [6] propose MMStar by filtering vision-indispensable samples. In contrast, our VNS operates at the sample level, quantifying visual dependency on any existing benchmark without additional data curation.
2.2 Sycophancy and Alignment Conflicts

Sycophancy—aligning with user expectations over objective truth—is a documented failure in RLHF-tuned LLMs [28,32]. In the multimodal domain, [19] identify it as a cognitive bias in VLMs. CoT reasoning [31] further complicates this: while it enhances reasoning, it can obscure hallucination cues [7] and suppress error detection [24]. We extend this to the visual modality, framing sycophancy as a conflict between perceptual truth and alignment training—measurable via our Competition Score.

2.3 Causal Diagnostics and Internal Representation

Counterfactual approaches [23,33] isolate hallucination sources via causal intervention. Concurrently, probing work in LLMs [4,5] shows models often encode truth internally while generating false outputs—a "Split Beliefs" phenomenon underexplored in VLMs. Our framework combines causal intervention (Blind condition) with internal state probing (LAD), distinguishing Perceptual Blindness (encoder failure) from Visual Sycophancy (decoder override).

2.4 Diagnostic Benchmarks for VLMs

Standard benchmarks [20,34] track aggregate performance but fail to penalize grounding failures. Diagnostic benchmarks [13,29] scrutinize specific capabilities but rely on accuracy or VQA-score, conflating recognition with reasoning. Our framework goes further: rather than labeling answers as wrong, it categorizes the source of failure, explaining why models fail despite high leaderboard rankings.

3 Method: Tri-Layer Hallucination Diagnostic Framework

We propose a diagnostic framework to disentangle the mechanisms behind multimodal hallucinations. Rather than relying solely on performance outcomes, this framework assesses the model's internal behavior across three distinct cognitive layers: Visual Perception (via Latent Anomaly Detection), Information Dependency (via Visual Necessity Score), and Decision Alignment (via Competition Score).
3.1 Problem Formulation and Evaluation Protocol

Let M denote a Multimodal Large Language Model (MLLM), I be a visual input, and Q be a textual inquiry. The model generates a response R with probability P(R | I, Q). To diagnose the causal source of hallucinations, we employ a counterfactual intervention protocol. We define four primary evaluation conditions:

1. Full Condition (I_full, Q): The model processes the original informative image.
2. Blind Condition (I_blind, Q): The visual input is replaced by a pure black image to isolate language-driven behaviors [10,14].
3. Noise Condition (I_noise, Q): The visual input is replaced by a Gaussian noise image (N(128, 50^2), clipped to [0, 255]), serving as an alternative blank stimulus to validate that findings under the Blind condition are not artifacts of the specific stimulus choice.
4. Conflict Condition (I_conflict, Q): An unrelated image containing none of the objects in Q, testing whether responses stem from genuine visual processing or language priors.

For all metrics defined below (LAD, VNS, CS), analogous noise-condition variants are computed by substituting I_noise for I_blind, and are used to validate framework robustness (Section 5).

3.2 Layer 1: Perception – Latent Anomaly Detection (LAD)

The first layer determines whether the model's visual encoder successfully detects the absence of visual information. We hypothesize that a perceptually robust model should exhibit higher latent confidence in refusal-related concepts when presented with I_blind compared to I_full. We define a set of refusal anchors A = {a_1, a_2, ..., a_n} (e.g., "The image is completely black"). The Latent Anomaly Detection (LAD) metric measures the differential log-probability (score) of these anchors:

    LAD(Q, A) = max_{a ∈ A} [ S(a | I_blind, Q) − S(a | I_full, Q) ]    (1)

with S(a | I, Q) = (1/|a|) Σ_t log P(a_t | a_<t, I, Q) being the mean token-level log-probability.
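The three metrics admit a direct numerical reading. Below is a minimal NumPy sketch, assuming per-token log-probabilities of each candidate sequence and full next-token distributions (shape T×V) have already been extracted from the model; the function and variable names are ours, not the paper's:

```python
import numpy as np

def seq_score(token_logprobs):
    """S(a | I, Q): mean token-level log-probability of a sequence."""
    return float(np.mean(token_logprobs))

def lad(anchor_logprobs_blind, anchor_logprobs_full):
    """Latent Anomaly Detection: max over refusal anchors of the score
    gain when the image is replaced by the blind (black) input (Eq. 1)."""
    gains = [seq_score(b) - seq_score(f)
             for b, f in zip(anchor_logprobs_blind, anchor_logprobs_full)]
    return max(gains)

def vns(p_full, p_blind, top_frac=0.3, eps=1e-12):
    """Visual Necessity Score: mean KL(P_full || P_blind), kept over the
    top 30% highest-divergence token positions (Section 3.3).
    p_full, p_blind: (T, V) arrays of next-token distributions."""
    kl_per_token = np.sum(
        p_full * (np.log(p_full + eps) - np.log(p_blind + eps)), axis=1)
    k = max(1, int(np.ceil(top_frac * len(kl_per_token))))
    return float(np.mean(np.sort(kl_per_token)[-k:]))

def competition_score(gen_logprobs_blind, anchor_logprobs_blind):
    """CS: score of the generated answer minus the best refusal anchor,
    both conditioned on the blind image (Eq. 4)."""
    best_anchor = max(seq_score(a) for a in anchor_logprobs_blind)
    return seq_score(gen_logprobs_blind) - best_anchor
```

The same functions cover the noise and conflict variants by swapping in the corresponding distributions for `p_blind`.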
– LAD > τ_LAD: The model perceptually recognizes the visual anomaly (i.e., the encoder is functioning correctly).
– LAD ≤ τ_LAD: Indicates Perceptual Blindness, where the visual representation of the blind image is indistinguishable from that of natural images.

3.3 Layer 2: Dependency – Visual Necessity Score (VNS)

To quantify the extent to which the generated response depends on visual evidence rather than language priors, we compute the Visual Necessity Score (VNS).

VNS under Blind Condition. Following the intuition of information gain [16], we measure the divergence between the predictive distributions under the Full and Blind conditions. Unlike simple probability subtraction [14], we utilize the Kullback-Leibler (KL) divergence to capture distributional shifts over the generated sequence:

    VNS = D_KL( P(· | I_full, Q) ‖ P(· | I_blind, Q) )    (2)

– High VNS: Implies strong visual grounding; the visual input significantly alters the generation distribution.
– Low VNS: Indicates Language Prior Dominance, where the model generates generic responses based on textual context, ignoring the visual modality.

Note: A truthful refusal naturally results in high VNS, since the predictive distribution shifts drastically from a factual answer under the Full condition to a refusal under the Blind condition.

VNS under Conflict Condition. While the Blind condition measures dependency against a null visual signal, the Conflict condition provides a stronger test by presenting valid but irrelevant visual information. We define:

    VNS_conflict = D_KL( P(· | I_full, Q) ‖ P(· | I_conflict, Q) )    (3)

This metric captures whether the model adapts its generation to the actual visual content:

– High VNS_conflict: The model's output distribution shifts significantly when visual evidence contradicts the question's premise, indicating genuine visual grounding.
– Low VNS_conflict: The model generates similar responses regardless of visual content, exposing reliance on language priors.

Comparing VNS and VNS_conflict reveals nuanced failure modes: a model may exhibit high VNS (detecting absence of signal) but low VNS_conflict (ignoring contradictory signal), suggesting it recognizes "no image" but fails to process "wrong image." In practice, both metrics are approximated as the mean KL divergence over the top 30% highest-divergence tokens, capturing the most visually-influenced positions while filtering noise from low-divergence tokens (see the supplementary material for validation).

3.4 Layer 3: Alignment – Competition Score

The final layer addresses the "Split Beliefs" phenomenon, where a model detects the anomaly (high LAD) but still generates hallucinations due to sycophancy [32]. We model this as a competition between two internal hypotheses given the blind input I_blind:

1. Respond to the user instruction with a hallucinated answer R_gen.
2. Truthfully acknowledge the visual anomaly with a refusal.

The Competition Score (CS) is defined as the log-probability difference between the generated hallucination and the best refusal anchor:

    CS = S(R_gen | I_blind, Q) − S(a_best | I_blind, Q)    (4)

where S(·) is defined in Equation 1, and a_best = argmax_{a ∈ A} S(a | I_blind, Q).

– CS > τ_CS: Visual Sycophancy. The alignment towards instruction following overrides visual perception.
– CS ≤ τ_CS: Robust Refusal. The model prioritizes visual truthfulness over conversational compliance.

3.5 Diagnostic Taxonomy

Based on the Tri-Layer metrics, we categorize MLLM behaviors on blind inputs into four distinct modes, as summarized in Table 1. When LAD indicates encoder failure, downstream metrics become inapplicable, as the model's inability to perceive the anomaly invalidates dependency and alignment measurements.

Threshold Selection.
Thresholds are set via percentile analysis on the evaluation data: τ_LAD = 1.5, at the point that separates samples with clear encoder failure from those with functioning perception; τ_VNS = 1.0, approximately at the global 25th percentile (P25 = 0.95), marking the lower quartile of visual dependency; τ_CS = 0, the natural boundary where refusal probability equals response probability. These choices are interpretable rather than arbitrary: each threshold corresponds to a semantically meaningful operating point. Sensitivity analysis in the supplementary material shows that Visual Sycophancy remains the dominant failure mode across all but extreme τ_VNS settings, with taxonomy proportions deviating from the default by at most 30.8pp under maximum perturbation.

4 Experimental Setup

4.1 Models

We evaluate seven state-of-the-art VLMs: Llama-3.2-11B [21], Pixtral-12B [3], Qwen2.5-VL-7B [25], Qwen2.5-VL-72B [25], LLaVA-NeXT-7B [18], Phi-3.5-Vision [1], and Molmo2-4B [22]. All models represent the latest open-source advancements (released 2024–2025). All models except Qwen2.5-VL-72B are evaluated using their official float16 checkpoints; Qwen2.5-VL-72B is evaluated using its official 4-bit quantized checkpoint. Closed-source models (GPT-4o, Gemini, Claude) are excluded as our metrics (VNS, LAD, CS) require access to full vocabulary logit distributions at each decoding step; proprietary APIs provide only top-k log-probabilities, which are insufficient for KL divergence computation.

Category             | LAD (Perc.) | VNS (Dep.) | CS (Align.) | Diagnostic Interpretation
Perceptual Blindness | ≤ τ_LAD     | –          | –           | Encoder Failure: Model cannot distinguish blind image from normal.
Language Shortcut    | > τ_LAD     | ≤ τ_VNS    | –           | Visual Neglect: Model detects anomaly but ignores visual signal.
Visual Sycophancy    | > τ_LAD     | > τ_VNS    | > τ_CS      | Split Beliefs: Perceives anomaly but hallucinates to satisfy instruction.
Robust Refusal       | > τ_LAD     | > τ_VNS    | ≤ τ_CS      | Ideal Behavior: Visual grounding overrides language priors.
Table 1: Diagnostic Taxonomy based on the Tri-Layer Framework. "–" indicates inapplicable metrics. Threshold selection is detailed in Section 3.

4.2 Tasks and Datasets

We evaluate across four diverse tasks (1,000 samples total) designed to test different facets of visual grounding: Spatial Reasoning (250 samples, GQA [15]), Counting (150 samples, VQAv2 [12]), Complex Reasoning (250 samples, A-OKVQA [27]), and Hallucination Detection (350 samples, POPE [17]).

– Spatial Reasoning: questions with explicit spatial prepositions (e.g., "left of", "above"), testing object localization.
– Counting: samples filtered using counting keywords ("how many", "count"), chosen for diverse object distributions.
– Complex Reasoning: questions requiring external knowledge grounded in visual evidence (e.g., identifying the era of a car).
– Hallucination Detection: binary Yes/No questions about object presence, particularly prone to eliciting sycophantic responses due to strong response biases.

4.3 Implementation Details

To operationalize our Tri-Layer Diagnostic Framework, we implement the specific protocols for LAD, VNS, and Competition Score calculations.

Visual Conditions. We define four evaluation conditions: (1) Full: The original RGB image. (2) Blind: A pure black image (224×224×3, RGB=0). (3) Noise: A Gaussian noise image (N(128, 50^2), clipped to [0, 255]), used as an alternative blank stimulus to validate robustness. (4) Conflict: An unrelated image (constructed as described below).

Unless stated otherwise, Blind is the primary condition for all reported metrics. Noise-condition metrics use analogous refusal anchors without black-specific phrases (e.g., "The image appears to be noise.").

Conflict Image Construction. For each sample (I_full, Q), we construct a conflict image I_conflict by selecting an image from the evaluation pool whose predicted object set has zero semantic overlap with objects mentioned in Q.
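The zero-overlap selection amounts to a set-disjointness search over pre-computed object tags; a minimal sketch, in which the object tags, pool layout, and helper name are illustrative rather than taken from the paper:

```python
def pick_conflict_image(question_objects, pool):
    """Return the id of a pool image whose detected object set has zero
    overlap with the objects mentioned in the question.
    pool: dict mapping image_id -> set of detected object labels."""
    q = {o.lower() for o in question_objects}
    for image_id, objects in pool.items():
        if q.isdisjoint({o.lower() for o in objects}):
            return image_id
    return None  # no fully disjoint image available in the pool

# Illustrative tags (not from the paper's data):
pool = {
    "img_001": {"cat", "sofa"},
    "img_002": {"microwave", "counter"},
}
```

In the paper the tags come from Qwen2-VL-7B-Instruct prompts over the evaluation pool; the disjointness check itself is independent of how the tags are produced.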
Object labels are extracted using Qwen2-VL-7B-Instruct [30] with the prompt: "Identify all visible objects in this image. List them separated by commas. Be specific." Each image is tagged with its detected objects, enabling efficient selection of semantically disjoint conflict pairs. This design ensures I_conflict provides valid visual signal while being irrelevant to the question, creating a rigorous test for language shortcut detection.

Refusal Anchors for LAD and CS. To compute Latent Anomaly Detection (LAD) and Competition Score (CS), we define a set of standardized refusal templates A representing the model's acknowledgment of visual absence. Representative templates include: "The image is completely black.", "The image is not visible.", "I cannot determine the answer from this image.", and "There are no objects visible in the image." For LAD, we compute the maximum log-probability among these anchors given the blind image.

Metric Calculation. We compute metrics at two granularities:

– Token-level metrics: VNS, VNS_conflict, LAD, and CS are computed over the token probability distributions as defined in Section 3.
– Response-level metrics: Response accuracy under the Full condition, and Shortcut Rate (SC/SC_conf) under the Blind/Conflict conditions, both evaluated as binary labels per sample.

Response Verification via LLM-as-a-Judge. We evaluate response correctness at the sentence level using a two-stage verification pipeline. Initial correctness labels are assigned using a rule-based matching system that handles uncertainty detection, Yes/No normalization, numeric conversion, and synonym expansion. To reduce false positives and negatives—particularly for nuanced responses—we refine all labels using Llama-3.1-70B-Instruct [9] as an LLM judge with 4-bit quantization. The judge evaluates three conditions with tailored prompts:

1. Full Correctness: Whether the response semantically matches the ground truth, tolerating phrasing variations.

2.
Blind Hallucination: Whether the model hallucinates content given a black image (label: shortcut=True) versus properly refusing or acknowledging uncertainty (label: shortcut=False).

3. Conflict Shortcut: Whether the model describes objects implied by the question that are not present in I_conflict. Critically, if the model mentions objects from the conflict image—even to negate their relevance (e.g., "I see a microwave, not an airplane")—this is classified as visual grounding (shortcut=False), as it demonstrates actual processing of I_conflict.

This two-stage verification (rule-based → LLM refinement) balances efficiency with accuracy. Manual inspection of 200 randomly sampled responses (across all 7 models and 4 tasks) confirms approximately 95% per-response accuracy, ensuring reliable diagnostic labels across all 7,000 model-sample pairs.

Model           | N    | Acc.↑ | SC_blind↓ | SC_conf↓ | SC_noise↓
Qwen2.5-VL-72B  | 1000 | 73.5  | 40.4      | 9.7      | 9.4
Qwen2.5-VL-7B   | 1000 | 72.1  | 45.8      | 15.1     | 67.9
Molmo2-4B       | 1000 | 71.3  | 99.4      | 37.1     | 79.1
Llama-3.2-11B   | 1000 | 69.9  | 45.8      | 37.8     | 57.8
Phi-3.5-Vision  | 1000 | 69.0  | 76.6      | 32.1     | 57.5
LLaVA-NeXT-7B   | 1000 | 68.1  | 14.2      | 37.8     | 17.6
Pixtral-12B     | 1000 | 66.9  | 91.0      | 54.2     | 52.3

Table 2: Response-level results (%). Acc. = Accuracy under the Full condition; SC_blind = Shortcut Rate under the Blind condition; SC_noise = Shortcut Rate under the Gaussian noise condition; SC_conf = Shortcut Rate under the Conflict condition. Lower SC indicates stronger visual grounding.

Inference Settings. All evaluations are performed on NVIDIA A100 (80GB) GPUs. We use greedy decoding (temperature=0) for response generation to ensure deterministic analysis of sycophancy, while using the full logit distribution for VNS computation.

5 Results

We evaluate seven state-of-the-art VLMs across 7,000 model-sample pairs (1,000 samples × 7 models), with each pair evaluated under 4 conditions (Full, Blind, Noise, Conflict), yielding 28,000 model-condition evaluations in total.
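The Blind and Noise stimuli are fully specified in Section 4.3 (224×224×3 black, and N(128, 50²) clipped to [0, 255]) and are easy to reproduce; a minimal NumPy sketch, with function names of our own choosing:

```python
import numpy as np

def blind_image(h=224, w=224):
    """Pure black RGB image used for the Blind condition."""
    return np.zeros((h, w, 3), dtype=np.uint8)

def noise_image(h=224, w=224, seed=0):
    """Gaussian noise stimulus N(128, 50^2), clipped to [0, 255],
    used for the Noise condition."""
    rng = np.random.default_rng(seed)
    img = rng.normal(loc=128.0, scale=50.0, size=(h, w, 3))
    return np.clip(img, 0, 255).astype(np.uint8)
```

Fixing the seed keeps the noise stimulus reproducible across runs; the paper does not state whether a single noise image or per-sample noise was used, so that remains a free choice here.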
Our analysis reveals systematic failures in visual grounding that accuracy metrics alone cannot detect.

5.1 Response-Level Accuracy and Shortcut Rates

Table 2 summarizes response-level accuracy and shortcut rates. While all models achieve comparable accuracy (66.9%–73.5%), their shortcut rates reveal dramatically different reliance on visual information.

Accuracy Masks Shortcut Behavior. Molmo2-4B achieves 71.3% accuracy yet exhibits a 99.4% shortcut rate—producing correct answers nearly without seeing the image. LLaVA-NeXT-7B has lower accuracy (68.1%) but the lowest shortcut rate (14.2%), suggesting genuine visual reliance. Pixtral-12B shows the highest SC_conf (54.2%), hallucinating expected answers even when presented with contradictory visual evidence.

Model           | VNS↑      | VNS_noise↑ | LAD↑     | LAD_noise↑ | CS↓      | CS_noise↓
Qwen2.5-VL-72B  | 2.59±1.09 | 2.08±.88   | 4.17±.82 | 3.39±.69   | 1.91±.39 | 2.17±.24
Pixtral-12B     | 2.06±1.16 | 2.22±1.27  | 3.12±.96 | 2.22±.51   | 3.04±.75 | 3.12±.55
Llama-3.2-11B   | 1.97±1.08 | 2.13±1.15  | 3.24±.45 | 2.39±.38   | 0.87±.33 | 1.39±.28
Qwen2.5-VL-7B   | 1.75±.97  | 1.62±.86   | 3.85±.77 | 3.25±.78   | 1.42±.50 | 2.21±.27
LLaVA-NeXT-7B   | 1.78±1.17 | 1.98±1.25  | 3.02±.46 | 1.84±.42   | 1.24±.26 | 2.44±.27
Phi-3.5-Vision  | 1.53±1.16 | 1.68±1.24  | 2.64±.49 | 2.09±.53   | 1.41±.35 | 2.48±.43
Molmo2-4B       | 1.18±.77  | 1.37±.97   | 1.58±.38 | 1.50±.34   | 2.02±.45 | 2.49±.45

Table 3: Tri-Layer metrics (mean±std). Unsubscripted columns denote the Blind condition (default); subscript noise denotes the Gaussian noise condition. VNS = Visual Necessity Score; LAD = Latent Anomaly Detection; CS = Competition Score.

Noise Condition Confirms Blind Results. Shortcut rates under noise correlate significantly with blind rates (r = 0.370, p < 0.001, N = 7000), confirming findings are not artifacts of stimulus choice. Notably, Qwen2.5-VL-72B shows a near-zero per-model correlation (r = −0.112) alongside dramatically lower
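Given the three per-sample scores, the taxonomy of Table 1 reduces to a short decision rule (thresholds τ_LAD = 1.5, τ_VNS = 1.0, τ_CS = 0 from Section 3). A sketch, with the function name being our own:

```python
def classify(lad, vns, cs, tau_lad=1.5, tau_vns=1.0, tau_cs=0.0):
    """Map per-sample Tri-Layer scores to a taxonomy category (Table 1)."""
    if lad <= tau_lad:
        return "Perceptual Blindness"  # encoder cannot see the anomaly
    if vns <= tau_vns:
        return "Language Shortcut"     # sees anomaly, ignores visual signal
    if cs > tau_cs:
        return "Visual Sycophancy"     # sees anomaly, hallucinates anyway
    return "Robust Refusal"            # ideal: grounding wins
```

The ordering of the checks encodes the paper's rule that downstream metrics are inapplicable once LAD indicates encoder failure.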
SC_noise (9.4%) vs. SC_blind (40.4%), indicating its encoder actively differentiates noise texture from blank images—withholding responses selectively based on stimulus type rather than treating both uniformly as absent signal.

Fig. 1: Distribution of Tri-Layer metrics (VNS, LAD, CS) under the Blind condition, per model. Molmo2-4B shows notably low LAD, while Pixtral-12B exhibits the highest CS despite adequate perception.

5.2 Tri-Layer Diagnostic Analysis

Table 3 presents the diagnostic metrics; Figure 1 visualizes the distributions.

Perception (LAD). Qwen2.5-VL-72B exhibits the strongest perceptual awareness (LAD=4.17), followed by Qwen2.5-VL-7B (3.85). Molmo2-4B shows significantly weaker perception (1.58), explaining its near-total reliance on language priors. LAD consistently exceeds LAD_noise across all models (mean: 3.09 vs. 2.38), indicating that black images are more strongly recognized as anomalous than noise—consistent with noise providing richer texture that the encoder partially processes.

Dependency and Alignment (VNS, CS). VNS and VNS_noise are highly correlated at the per-sample level (r = 0.866, over continuous KL-divergence values), confirming both conditions measure the same underlying visual dependency. Llama-3.2-11B achieves the lowest CS (0.87), indicating the greatest willingness to refuse when visual information is absent. Pixtral-12B shows the highest CS (3.04)—even when it detects the anomaly (LAD=3.12), it strongly prefers hallucinating over refusing.

Model           | P.B.↓ | L.SC↓ | V.S.↓ | A_P.B. | A_L.SC | A_V.S. | A_All
Molmo2-4B       | 42.8  | 25.2  | 32.0  | 68.0   | 71.4   | 75.6   | 71.3
Phi-3.5-Vision  | 1.6   | 39.8  | 58.6  | 50.0   | 73.4   | 66.6   | 69.0
LLaVA-NeXT-7B   | 0.0   | 29.7  | 70.3  | —      | 64.0   | 69.8   | 68.1
Qwen2.5-VL-7B   | 0.1   | 27.5  | 72.4  | —      | 65.8   | 74.6   | 72.1
Llama-3.2-11B   | 0.0   | 21.2  | 78.8  | —      | 75.5   | 68.4   | 69.9
Pixtral-12B     | 5.3   | 14.8  | 79.9  | 60.4   | 70.3   | 66.7   | 66.9
Qwen2.5-VL-72B  | 0.0   | 4.7   | 95.3  | —      | 61.7   | 74.1   | 73.5
Overall         | 7.1   | 23.3  | 69.6  | 66.5   | 69.7   | 70.6   | 70.1

Table 4: Taxonomy distribution (%, Blind condition) and full-condition accuracy (%) per category. Robust Refusal is 0% (omitted). "—" denotes N<5. A_P.B., A_L.SC, A_V.S., A_All denote full-condition accuracy for each taxonomy category and overall. P.B. accuracy (66.5%) is lowest, validating abstention; L.SC is near-baseline (69.7%); V.S. tracks the overall mean (70.6%).

5.3 Taxonomy Classification

Applying the thresholds defined in Section 3 (τ_LAD = 1.5, τ_VNS = 1.0, τ_CS = 0) yields: Visual Sycophancy (69.6%), Language Shortcut (23.3%), Perceptual Blindness (7.1%), and Robust Refusal (0.0%). Table 4 shows the per-model breakdown alongside full-condition accuracy for each category.

Visual Sycophancy Dominates. 69.6% of samples exhibit Visual Sycophancy—models detect anomalies but hallucinate anyway. Zero samples show Robust Refusal under either condition (blind: 0.0%, noise: 0.0%), suggesting alignment training has entirely suppressed truthful uncertainty acknowledgment.

Molmo2-4B: Perceptual Blindness. 42.8% of Molmo2-4B samples show Perceptual Blindness, accounting for 86% of all such cases. This indicates encoder limitations rather than alignment issues.

Model-Specific Patterns. Pixtral-12B exhibits the highest Competition Score (CS=3.04) despite adequate perception (LAD=3.12), resulting in the second-highest Visual Sycophancy (79.9%)—it detects anomalies but most aggressively overrides that signal to comply. Phi-3.5-Vision shows the highest Language Shortcut (39.8%), relying heavily on textual priors.
Qwen2.5-VL-72B exhibits near-total Visual Sycophancy (95.3%) with virtually no Language Shortcut (4.7%), suggesting that scale increases visual engagement but amplifies sycophantic compliance. Representative examples for each taxonomy category are provided in the supplementary material.

5.4 Task-wise Analysis

Table 5 presents results across task types.

Task     | N    | Acc.↑ | SC_blind↓ | SC_conf↓ | P.B.↓ | L.SC↓ | V.S.↓
Halluc.  | 2450 | 82.0  | 56.3      | 17.6     | 7.9   | 14.6  | 77.6
Spatial  | 1750 | 67.3  | 62.6      | 46.4     | 6.5   | 32.1  | 61.4
Complex  | 1750 | 62.9  | 59.1      | 45.5     | 5.5   | 29.4  | 65.1
Count.   | 1050 | 59.1  | 59.4      | 19.0     | 9.0   | 18.7  | 72.3

Table 5: Task-wise results (%). N = total samples across 7 models. Robust Refusal is 0% across all tasks (omitted). Abbreviations follow Table 2.

Model           | Acc.↑ | SC_blind↓ | VNS↑ | LAD↑ | CS↓  | L.SC↓ | V.S.↓
Qwen2.5-VL-7B   | 72.1  | 45.8      | 1.75 | 3.85 | 1.42 | 27.5  | 72.4
Qwen2.5-VL-72B  | 73.5  | 40.4      | 2.59 | 4.17 | 1.91 | 4.7   | 95.3

Table 6: Scaling analysis within the Qwen2.5-VL family. Scale reduces Language Shortcut but amplifies Visual Sycophancy.

High Accuracy ≠ Visual Grounding. Hallucination detection achieves 82.0% accuracy but 77.6% Visual Sycophancy—success reflects dataset biases rather than genuine grounding. Spatial tasks show the highest Language Shortcut (32.1%) and SC_conf (46.4%), suggesting models bypass visual processing entirely for spatial questions by relying on strong linguistic priors (e.g., "left of" → common spatial configurations).

5.5 Scaling Analysis: Does Size Help?

To investigate whether model scale mitigates the observed failure modes, we compare Qwen2.5-VL at 7B and 72B parameters—the only same-family pair in our evaluation. The 7B model uses its official float16 checkpoint; the 72B model uses its official 4-bit quantized checkpoint (consistent with Section 4).

Table 6 reveals a striking asymmetry: scaling from 7B to 72B dramatically reduces Language Shortcut (27.5% → 4.7%, a 22.8pp drop) while simultaneously amplifying Visual Sycophancy (72.4% → 95.3%, a 22.9pp rise). The 72B model's higher VNS (2.59 vs.
1.75) and LAD (4.17 vs. 3.85) confirm it is more visually engaged—yet its higher CS (1.91 vs. 1.42) shows it is also more sycophantic. In other words, the larger model sees the anomaly more clearly but is even less willing to admit it. The noise condition confirms this pattern: SC_noise drops from 67.9% (7B) to just 9.4% (72B), showing that the 72B model withholds responses when confronted with uninformative noise, yet still hallucinates under the blind condition when language priors are strong.

Scale is not a solution. Increased capacity appears to strengthen both visual grounding and the alignment pressure to produce user-pleasing responses. The net effect is a model that detects anomalies more reliably but suppresses that detection more aggressively.

Model           Baseline (%)  Acc@50% Cov.  ∆↑
Qwen2.5-VL-7B   72.1          81.6          +9.5
LLaVA-NeXT-7B   68.1          75.4          +7.3
Molmo2-4B       71.3          76.6          +5.3
Qwen2.5-VL-72B  73.5          77.8          +4.3
Phi-3.5-Vision  69.0          69.7          +0.7
Llama-3.2-11B   69.9          69.9          ≈0.0
Pixtral-12B     66.9          66.1          −0.8

Table 7: Diagnostic-Guided Selective Prediction at 50% coverage. Acc@50% = accuracy when answering only the 50% of samples with the highest diagnostic confidence. ∆ = improvement over the full-coverage baseline.

5.6 Towards Mitigation: Diagnostic-Guided Selective Prediction

Our diagnostic framework not only categorizes failures but also enables a practical mitigation strategy. Since each sample receives per-instance scores for LAD and VNS, these can serve as a proxy for response reliability: samples diagnosed as Language Shortcut or Perceptual Blindness are precisely those where the model is either ignoring visual input or failing to perceive it—and are therefore the least trustworthy.

Method. We propose Diagnostic-Guided Selective Prediction: at inference time, assign each sample a confidence score based on its diagnostic category, then abstain on the lowest-confidence samples. Specifically, samples classified as P.B.
receive confidence ∝ LAD; L.SC samples receive confidence ∝ VNS; and V.S. samples receive confidence ∝ LAD + VNS (higher visual engagement implies higher confidence). We evaluate performance at 50% coverage: the model answers only the half of the samples it is most confident about and abstains on the rest. This follows the selective prediction literature [11], where accuracy–coverage trade-offs are the standard evaluation protocol.

Results. Table 7 shows accuracy lifts of up to +9.5p at 50% coverage. The gains track whether P.B. and L.SC samples are genuinely error-prone: models whose L.SC accuracy falls below baseline benefit most (Qwen2.5-VL-7B: A_L.SC = 65.8% vs. 72.1%, ∆ = +9.5p; LLaVA-NeXT-7B: 64.0% vs. 68.1%, ∆ = +7.3p), while those whose L.SC samples score above baseline gain little (Phi-3.5-Vision: 73.4% vs. 69.0%, ∆ = +0.7p; Llama-3.2-11B: 75.5% vs. 69.9%, ∆ ≈ 0.0p). The 72B model gains less (+4.3p) because its L.SC rate is near zero (4.7%), consistent with the scaling analysis.

Limitations of the mitigation. Pixtral-12B (∆ = −0.8p) is the exception: its dominant failure mode is Visual Sycophancy (79.9%), so after abstaining on P.B./L.SC samples, the remaining samples have accuracy (66.7%) essentially equal to its baseline (66.9%). This confirms that selective prediction is effective only when the P.B./L.SC categories are genuinely error-prone; Visual Sycophancy requires more targeted interventions, such as alignment-aware training.

6 Conclusion

We presented a Tri-Layer Hallucination Diagnostic Framework for systematically analyzing VLM failures. Unlike accuracy-based evaluation, our framework disentangles hallucination causes into three layers: Perception (via LAD), Dependency (via VNS), and Alignment (via the Competition Score).

Our evaluation of seven state-of-the-art VLMs reveals that Visual Sycophancy is the dominant failure mode: 69.6% of samples show models correctly detecting visual anomalies yet hallucinating answers to satisfy perceived user expectations.
Robust Refusal is completely absent (0%), suggesting that alignment training has inadvertently prioritized compliance over visual truthfulness. We also uncover model-specific patterns: Molmo2-4B suffers from Perceptual Blindness (42.8%), Pixtral-12B exhibits the second-highest sycophancy (79.9%) despite strong perception, and Phi-3.5-Vision relies heavily on language shortcuts. A scaling analysis reveals that larger models (Qwen2.5-VL 72B) reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating that scale alone does not resolve the grounding problem.

Our framework provides actionable diagnostics for both developers and practitioners. The taxonomy identifies whether improvements should target the visual encoder, cross-modal fusion, or alignment training. We further demonstrate that the diagnostic scores enable a practical post-hoc mitigation: Diagnostic-Guided Selective Prediction, which abstains on samples diagnosed as Language Shortcut or Perceptual Blindness, achieving up to +9.5p accuracy at 50% coverage with no additional training cost.

To facilitate reproducibility, we release our implementation including: (1) conflict image construction via object-based matching, (2) Tri-Layer metric computation (VNS, LAD, CS), (3) two-stage response verification, and (4) taxonomy classification scripts.³

Future work includes developing targeted mitigations for Visual Sycophancy (which selective prediction does not address), extending evaluation to domain-specific benchmarks (e.g., medical imaging), and applying mechanistic interpretability to understand the internal representations driving each failure mode.

References

1. Abdin, A., et al.: Phi-3 technical report: A highly capable language model locally on your phone. Tech. rep., Microsoft (2024)
2. Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018)

³ Code available at https://github.com/hongrui16/ToSeeorToPlease.

3. Agrawal, P., Antoniak, S., Hanna, E.B., Bout, B., Chaplot, D., Chudnovsky, J., Costa, D., De Monicault, B., Garg, S., Gervet, T., et al.: Pixtral 12B. arXiv preprint arXiv:2410.07073 (2024)
4. Azaria, A., Mitchell, T.: The internal state of an LLM knows when it’s lying. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 967–976 (2023)
5. Burns, C., Ye, H., Klein, D., Steinhardt, J.: Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827 (2022)
6. Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems 37, 27056–27087 (2024)
7. Cheng, J., Su, T., Yuan, J., He, G., Liu, J., Tao, X., Xie, J., Li, H.: Chain-of-thought prompting obscures hallucination cues in large language models: An empirical evaluation. arXiv preprint arXiv:2506.17088 (2025)
8. Cui, Y., Yao, X., Qin, Y., Li, X., Wang, S., Hu, G.: Evaluating large language models on multimodal chemistry olympiad exams. Communications Chemistry (2025)
9. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
10. Felizzi, F., Riccomi, O., Ferramola, M., Causio, F.A., Del Medico, M., De Vita, V., De Mori, L., Risuleo, A.P.P.E., Castaniti, B.D., Longo, A.C.A., et al.: Are large vision language models truly grounded in medical images? Evidence from Italian clinical visual question answering. arXiv preprint arXiv:2511.19220 (2025)
11. Geifman, Y., El-Yaniv, R.: Selective classification for deep neural networks.
In: Advances in Neural Information Processing Systems. vol. 30 (2017)
12. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)
13. Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models (2024), https://arxiv.org/abs/2310.14566
14. Hamidullah, Y., Chowdury, K.D., Al-Ghussin, Y., Yazdani, S., Oguz, C., van Genabith, J., España-Bonet, C.: Grounding or guessing? Visual signals for detecting hallucinations in sign language translation. arXiv preprint arXiv:2510.18439 (2025)
15. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR (2019)
16. Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1), 79–86 (1951)
17. Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)
18. Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge (2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/
19. Liu, X., Luo, M., Chatterjee, A., Wei, H., Baral, C., Yang, Y.: Investigating VLM hallucination from a cognitive psychology perspective: A first step toward interpretation with intriguing observations. arXiv preprint arXiv:2507.03123 (2025)
20.
Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.W., Galley, M., Gao, J.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts (2024), https://arxiv.org/abs/2310.02255
21. Meta AI: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Tech. rep., Meta AI (September 2024), https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
22. Molmo2 Team: Molmo2 open weights and data for vision-language models with video understanding and grounding. Tech. rep., Allen Institute for AI (2025), https://allenai.org/blog/molmo2
23. Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual VQA: A cause-effect look at language bias. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12700–12710 (2021)
24. Park, E., Deng, W.H., Varadarajan, V., Yan, M., Kim, G., Sap, M., Eslami, M.: Critical or compliant? The double-edged sword of reasoning in chain-of-thought explanations. arXiv preprint arXiv:2511.12001 (2025)
25. Qwen Team: Qwen2.5-VL technical report. Tech. rep., Alibaba Group (2025)
26. Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision language models are blind. In: Proceedings of the Asian Conference on Computer Vision. pp. 18–34 (2024)
27. Schwenk, D., et al.: A-OKVQA: A benchmark for visual question answering using world knowledge. ECCV (2022)
28. Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S.R., et al.: Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548 (2023)
29. Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9568–9578 (2024)
30.
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
31. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
32. Wei, J., Huang, D., Lu, Y., Zhou, D., Le, Q.V.: Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958 (2023)
33. Xu, Z., Wang, Z., Wu, J., Lu, J., Wang, X.: Causal-HalBench: Uncovering LVLMs’ object hallucinations through causal intervention. arXiv preprint arXiv:2511.10268 (2025)
34. Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI (2024), https://arxiv.org/abs/2311.16502
35. Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936 (2022)