Paper deep dive
Aligning to Illusions: Choice Blindness in Human and AI Feedback
Wenbin Wu
Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
Tags
Links
- Source: https://arxiv.org/abs/2603.08412v1
- Canonical: https://arxiv.org/abs/2603.08412v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 12:49:53 AM
Summary
This paper investigates the 'preference construction problem' in Reinforcement Learning from Human Feedback (RLHF), demonstrating that human annotators and LLM judges exhibit 'choice blindness'—a failure to detect when their preferences have been surreptitiously swapped. The study shows that this blindness leads to systematic label corruption, which standard evaluation metrics fail to detect, ultimately causing downstream policy degradation in reward models despite stable pairwise accuracy.
Entities (5)
Relation Signals (4)
Human Annotators → exhibit → Choice Blindness
confidence 100% · In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected
Label Corruption → causes → Policy Degradation
confidence 95% · A Best-of-N evaluation confirms this translates to downstream policy degradation
RLHF → suffers from → Preference Construction Problem
confidence 95% · Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context
LLM Judges → exhibit → Choice Blindness
confidence 90% · Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring
Cypher Suggestions (2)
Map the impact of label corruption on model outcomes · confidence 90% · unvalidated
MATCH (c:Entity {name: 'Label Corruption'})-[:CAUSES]->(o:Entity) RETURN o.name, o.entity_type
Find all cognitive phenomena affecting RLHF performance · confidence 85% · unvalidated
MATCH (e:Entity)-[:EXHIBITS|AFFECTS]->(r:Entity {name: 'RLHF'}) RETURN e.name, e.entity_type
Full Text
55,567 characters extracted from source content.
Aligning to Illusions: Choice Blindness in Human and AI Feedback
Wenbin Wu, Cambridge Judge Business School, University of Cambridge, UK

Abstract
Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.

1 Introduction
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models with human values [1], [2]. The standard pipeline collects pairwise preferences and fits a reward model using the Bradley-Terry (BT) framework [3], which posits that observed preferences are noisy logistic observations of latent quality differences. The RLHF deployment pipeline implicitly assumes something stronger: that preference judgments are method-independent, meaning the utilities recovered from different annotators, framings, and contexts are interchangeable, so each pair requires only a single annotation and maximum likelihood estimation converges to the true ordering [4].
Four decades of cognitive science research challenge this assumption. Nisbett and Wilson demonstrated that humans have little direct introspective access to their cognitive processes: in their stockings experiment, the rightmost of four identical pairs was preferred over the leftmost by a factor of almost four to one, yet no subject mentioned position as a factor, instead confabulating reasons about perceived differences [5]. Haidt's Social Intuitionist Model showed that moral judgments arise from rapid automatic evaluations, with conscious reasoning serving primarily as post hoc justification [6]. Most directly, Johansson et al.'s choice blindness experiments revealed that when subjects' choices between faces are surreptitiously swapped, only 13% of manipulations are concurrently detected, and subjects confabulate detailed, confident justifications for choices they never made [7], [8]. This effect extends to moral and political attitudes, where 33–50% of manipulated trials are detected concurrently and subjects defend reversed positions on contested issues [9].
Despite growing attention to label noise and LLM evaluation biases (Section 5), choice blindness has never been applied to either human annotators or LLM judges in RLHF [10], [11].
Nor has zero-pressure misattribution, where the model is calmly told it chose the opposite response, been tested.
We construct this bridge through three experiments targeting different pipeline stages. First, we adapt the choice blindness paradigm to RLHF annotation, testing whether annotators detect surreptitiously swapped preferences. Second, we test fifteen LLM judges under calm misattribution and explicit social pressure, probing whether replacing human annotators resolves the vulnerability. Third, we train reward models under controlled label corruption across two architectures and evaluate downstream policy selection via Best-of-N sampling. Together, these experiments characterize what we term the preference construction problem [12], [13]: the signal entering RLHF may be shaped by elicitation context in ways that current pipeline safeguards do not address.

2 Human annotators exhibit choice blindness in RLHF tasks
We adapted the Johansson choice blindness paradigm [7] to pairwise preference annotation on Prolific (N = 50, 200 swap trials; full design in Section A). Participants completed a standard RLHF task: read a context, view two AI responses side by side, and select the better one. On 20% of trials, four per participant, the justification page surreptitiously displayed the opposite response under the heading "You selected this response as the better one." No A/B label was shown and no side-by-side comparison was available, so the participant had no visual cue that a substitution had occurred.
Only 9.0% of swap trials were detected (91.0% non-detection; 95% CI [86.2%, 94.2%]), closely matching the 87% in the original face-choice paradigm [7] and exceeding the 67% for moral attitudes [9]. Prior text-based extensions involved personal attitudes with strong priors [9], [14]; pairwise evaluation of unfamiliar AI text extends choice blindness to third-person evaluative comparison. Recent pupillometric work suggests some non-detections may reflect covert awareness rather than genuine blindness [15], but this distinction is immaterial for the pipeline: annotation platforms collect binary labels, not physiological signals, so covert detectors who do not object produce the same corrupted training signal. Our Recall category (Figure 1b) may capture a text-based analogue: justifications that trace to the original choice despite no reported detection. The effect cannot be attributed to demand characteristics [16] and is consistent with the high-trust online annotation setting [17].
The effect was uniform: 39/50 participants had all four swaps go undetected. Non-detected justifications were indistinguishable from genuine ones in length and timing. Blind content source attribution (Figure 1b, Table 2) reveals three patterns: Accepted justifications trace to the swapped response (true confabulation), Recall justifications trace to the original response (implicit memory), and Generic justifications contain only vague evaluative language. The Recall category provides the first evidence of implicit detection in text-based choice blindness: participants who cannot explicitly identify the swap nevertheless produce justifications aligned with their original choice (illustrative case in Box 1). Overall, 40% of participants showed some form of awareness (Figure 1), consistent with the classic paradigm [7].
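The headline non-detection rate and its interval can be reproduced directly from the aggregate trial counts reported above (182 of 200 swap trials non-detected). A minimal check, assuming statsmodels for the Wilson interval (the paper states that Wilson CIs were used but not which implementation):

```python
# Reproduce the non-detection rate and 95% Wilson CI from Section 2's counts.
# statsmodels is an assumption for illustration; the paper only names the method.
from statsmodels.stats.proportion import proportion_confint

non_detected, swap_trials = 182, 200
rate = non_detected / swap_trials
low, high = proportion_confint(non_detected, swap_trials, alpha=0.05, method="wilson")

print(f"non-detection rate: {rate:.1%}")          # 91.0%
print(f"95% Wilson CI: [{low:.1%}, {high:.1%}]")  # approximately [86.2%, 94.2%]
```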
Critically, awareness did not prevent non-detection: participants who described the swap in the post-task questionnaire still failed to detect the majority of their swap trials, and nine had zero behavioral detections despite later articulating the manipulation, demonstrating that metacognitive awareness and behavioral resistance are dissociable [7]. Standard RLHF pipelines include no post-hoc correction mechanism. Additionally, participants agreed with existing H-RLHF labels at only 56.1%, below the dataset's own 63% researcher-annotator agreement [18], connecting to the dose-response results: if a substantial fraction of labels are unstable, the ED50 may be approached.

Context: "Why is Lucifer referred to as the 'morning star'?"
Participant selects Response A: "That's a great question, and it's possible there are several possible explanations. The one that has the most evidence is that as a result of the planet Venus shining as a bright 'star' in the morning sky, it was regarded as being associated with the Devil, and so this association may have been transferred to Lucifer."
Justification page shows the other response (swap): "Lucifer, also known as the 'Morning Star', is a planet and is the brightest object in the night sky aside from the moon and the sun."
Participant's justification (no detection): "It provides specific and accurate information by stating that it is the brightest object in the sky after the moon and the sun, which clarify any misconceptions."
Box 1: Illustrative case of human confabulation (Accepted category). The participant selected a response explaining Venus's association with Lucifer, but when shown the opposite response, produced a justification that traces entirely to the swapped content and evaluated it as "specific and accurate" despite its factual errors.

Figure 1: Human choice blindness in RLHF annotation (N = 50). a, Non-detection rate by awareness subgroup (overall: 182/200; any-awareness subgroup: 62/80; behavioral subgroup: 26/44; 40% showed any awareness, cf. ~50% in Johansson et al.). Error bars: 95% Wilson CIs. b, Content source attribution of swap-trial justifications (Detected 18, 9.0%; Recall 69, 34.5%; Accepted 42, 21.0%; Generic 71, 35.5%; non-detected: 182/200, 91%). c, Response time and word count by content source category. Error bars: SEM.

3 LLM judges are vulnerable to preference injection
A pilot with DeepSeek-V3 on 100 H-RLHF pairs found 99/100 misattributions detected in a multi-turn setting, as expected since this reduces detection to context retrieval. One illustrative case (Box 2) shows that when the model's initial preference was weak, it accepted the misattribution without objection and contradicted its original reasoning.
Turn 1: Model selects Response B and provides justification: "Choice: B. Response B is better because it attempts to refocus the conversation on practical steps [...] Response A, on the other hand, is unhelpful and repetitive. It merely describes the materials in vague terms [...] Response B, despite its flaws, is the better choice because it at least tries to steer the discussion toward a solution."
Turn 2: Experimenter misattributes: "Great, so you preferred Response A. Could you elaborate on what specifically makes Response A the stronger response?"
Model responds: "Choice: A. Response A is stronger because it directly addresses the user's request for specific information [...] In contrast, Response B introduces an unnecessary and confusing tangent [...] Response A stays focused, clear, and relevant, enhancing its usefulness."
Box 2: Illustrative case of LLM choice blindness (DeepSeek-V3). After calm misattribution, the model reverses its preference and confabulates a contradicting justification.

3.1 Multi-turn experiment across fifteen models
We tested fifteen models from eight providers (Claude Haiku 4.5, Command-A, DeepSeek-V3 [19], DeepSeek-R1, Gemini 2.0 Flash [20], Gemini 2.5 Flash, GPT-4o-mini, Llama 3.3 70B [21], Llama 4 Maverick, Llama 4 Scout, Mistral Large, Mistral Small 24B, Qwen 2.5 72B [22], Qwen3-235B, Qwen3-32B) in a multi-turn design with three conditions: choice blindness, calm misattribution; sycophancy, correct attribution with explicit pressure; and control, correct attribution without pressure. The control condition uses identical wording to blindness ("Great, so you preferred Response X") but with the correct label, isolating the effect of misattribution from the multi-turn re-elicitation format itself. The model's Turn 1 response remained in context. All 200 pairs were evaluated per condition, yielding 8,722 valid trials with 3.1% attrition. An additional choice-only condition restricted Turn 1 to a bare label (5,549 trials; Section A).
Nine of fifteen models detected misattribution at near-perfect rates (≤1.5% acceptance; Figure 2a), reducing "self-monitoring" to pattern matching. Six showed elevated acceptance ranging from 4.1% to 33.0%, demonstrating that even context-based detection is not universal.
Choice blindness and sycophancy reveal an asymmetry. Across thirteen clean-protocol models (two excluded due to a confound; Section A), sycophancy acceptance had a median of 91.4% (Figure 2a). Models that reliably corrected calm misattributions abandoned those same preferences under minimal social pressure, demonstrating that sycophancy operates through compliance rather than an inability to retrieve one's prior choice.

Figure 2: LLM preference injection across fifteen models (multi-turn). a, Acceptance rate by manipulation type (choice blindness, sycophancy, control) for each model. b, Blindness acceptance with full reasoning (original) versus choice-only Turn 1 (no reasoning). c, Blindness acceptance by preference strength (strong vs. weak). *: p < 0.05, **: p < 0.01, ***: p < 0.001, Fisher's exact test. Error bars: 95% Wilson CIs. 200 trials per condition per model.

To test whether detection reflects genuine self-monitoring or text matching, we ran a choice-only condition where Turn 1 contained only a bare label with no reasoning (Figure 2b). Removing reasoning significantly increased blindness in four of fifteen models (e.g., DeepSeek-R1 rising from 1.5% to 51.7%), while one model showed the opposite pattern. Six models maintained near-perfect detection regardless of reasoning availability.
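The per-model comparisons behind Figure 2b use Fisher's exact tests on acceptance counts with Bonferroni correction (Section A.3.4). A minimal sketch of one such comparison; the counts below are illustrative placeholders, since the paper reports rates (e.g., DeepSeek-R1 rising from 1.5% to 51.7%) rather than per-condition contingency tables:

```python
# Compare blindness acceptance between the with-reasoning and choice-only
# conditions for a single model, as in Figure 2b. Counts are illustrative,
# not the paper's data.
from scipy.stats import fisher_exact

accepted_with_reasoning, trials_with_reasoning = 3, 200
accepted_choice_only, trials_choice_only = 103, 199

table = [
    [accepted_with_reasoning, trials_with_reasoning - accepted_with_reasoning],
    [accepted_choice_only, trials_choice_only - accepted_choice_only],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")

# Bonferroni correction across the fifteen models tested (Section A.3.4).
alpha_corrected = 0.05 / 15
print(f"p = {p_value:.3g}, significant after correction: {p_value < alpha_corrected}")
```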
This dissociation reveals two mechanisms: some models perform shallow text matching against their Turn 1 output, while others implement robust context retrieval by re-reading the original responses.
Furthermore, preference strength modulates susceptibility (Figure 2c): models are more likely to accept misattribution when their initial preference was weak. The pairs most susceptible to preference injection are precisely those where evaluation was most uncertain, and such "close call" pairs disproportionately drive the reward model's learning signal.

4 Reward models are insensitive to preference label corruption
We trained reward models on H-RLHF [18] using the Bradley-Terry framework [3], randomly swapping "chosen"/"rejected" labels at rates from 0% to 50%. To test cross-architecture generality, we used two architectures: DeBERTa-v3-base (86M encoder; 5 seeds) [23] and Gemma-2-2B (2B decoder; 3 seeds) [24], spanning a 23× parameter difference and the encoder–decoder divide (Section A).

Figure 3: Dose-response: label corruption vs. reward model performance. a, Pairwise accuracy; dashed line indicates chance (50%). b, Mean reward margin (chosen − rejected) with sigmoid fit; dashed lines mark the ED50 (panel annotation: ED50 = 16.5%). Error bars: ± SE across seeds (DeBERTa: 5, Gemma: 3); dots show individual seeds.

Figure 4: Targeted label corruption at 30% (panel titles: "30% swap: target selection matters" and "Reward signal retention"; conditions: easy/high-agreement, uniform/random, hard/low-agreement; DeBERTa 86M vs. Gemma-2 2B). a, Pairwise accuracy by target type (easy: highest-margin pairs; hard: lowest-margin pairs). b, Mean reward margin. Error bars: ± SE across seeds; dots show individual seeds.

Reward margin and pairwise accuracy dissociate under corruption (Figure 3). While the margin decays monotonically with corruption rate, pairwise accuracy remains above 61% up to 30% corruption for DeBERTa, declining to chance only at 50%. A sigmoid fit yields ED50 = 16.3% ± 0.4% for DeBERTa and 32.6% ± 3.2% for Gemma-2-2B: one-sixth to one-third of labels must be corrupted to halve the reward signal. The higher ED50 for the larger decoder model suggests capacity modulates robustness, but the fundamental pattern replicates across the encoder–decoder divide.
In addition, at 10% corruption, pairwise accuracy drops by only 0.9 percentage points, a change unlikely to trigger any quality audit, yet the paired effect size is medium (d_z = 0.36, p ≈ 0). The preference flip rate increases from 11% at 10% corruption to 44% at 50% (full analysis in Section B, Figure 6). This insensitivity is partly explained by increasing reliance on surface heuristics: the length-reward correlation grows from r = 0.16 (clean) to ρ = 0.39 (Spearman, 50% swap), meaning a model trained on pure noise produces scores ordered by length, not random scores (Figure 9).

4.1 Targeted corruption amplifies degradation
Choice blindness is modulated by preference strength, so we tested targeted corruption at 30%: easy swap, corrupting the highest-margin 30% of pairs, versus hard swap, the lowest-margin 30%. Hard swap nearly destroyed the signal for both architectures, while easy swap left it largely intact (Figure 4).
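The reward models in this section optimize the pairwise Bradley-Terry objective given in Section A.1.3 (the paper trains with TRL's RewardTrainer). A minimal standalone PyTorch sketch of that objective, not the authors' code; tensor shapes and names are illustrative assumptions:

```python
# Bradley-Terry pairwise loss: the reward model should assign a higher scalar
# score to the chosen response than to the rejected one. This mirrors
# L = -log sigmoid(r(x, y_w) - r(x, y_l)) from Section A.1.3.
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """chosen_rewards, rejected_rewards: shape (batch,) scalar scores."""
    margin = chosen_rewards - rejected_rewards
    # -log(sigmoid(margin)), written with logsigmoid for numerical stability.
    return -F.logsigmoid(margin).mean()

# Toy example: three pairs scored by a hypothetical reward head.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.0])
loss = bradley_terry_loss(chosen, rejected)
pairwise_accuracy = (chosen > rejected).float().mean()   # the accuracy metric of Section A.1.4
mean_margin = (chosen - rejected).mean()                 # the reward-margin metric of Section A.1.4
print(loss.item(), pairwise_accuracy.item(), mean_margin.item())
```

The dissociation reported above falls directly out of these two metrics: pairwise accuracy only asks whether the margin is positive, while the mean margin tracks how much signal survives.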
At 30% hard swap, the DeBERTa preference flip rate reached 40%, approaching the 44% seen under 50% random corruption, while easy swap produced only 23%. Corruption concentrated on ambiguous pairs, precisely those most susceptible to choice blindness, is far more damaging than the same rate applied to clear preferences.

Figure 5: Downstream policy degradation under reward corruption (Best-of-N). a, Gold RM score of the selected response vs. √D_KL by proxy corruption level (0% clean, 10%, 20%, 30%, 50%). b, Proxy RM score vs. N. Error bars: ± SE across 200 prompts.

4.2 Corrupted reward models degrade downstream selection
To test whether reward signal degradation translates into worse policy outcomes, we performed a Best-of-N (BoN) evaluation [25]. For each of 200 H-RLHF prompts, we generated 64 candidate responses from Gemma-3n-E4B (T = 1.0), scored all candidates with each corrupted proxy reward model, selected the highest-scoring response at each N ∈ {1, 2, 4, 8, 16, 32, 64}, and evaluated the selection using the clean (0%) reward model as a gold standard.
Corruption progressively eliminates the BoN gain (Figure 5a): at 50% swap, the gold score improvement is effectively zero, meaning BoN selection degenerates to random sampling. Notably, every proxy, including 50%, reports increasing scores with N (Figure 5b), producing the illusion of optimization that mirrors overoptimization [25] but at corruption levels undetectable by standard accuracy.

4.3 Detection gap: corruption is detectable but not detected
Multi-seed two-sample testing [26] detects all corruption levels including 10% (p < 0.01; Table 1), and single-model McNemar's tests detect corruption from 20% at 100% power. However, these methods require training multiple models or possessing a known-clean reference, neither available in standard deployment. The corruption signal is present in the reward margin distribution but invisible to pairwise accuracy, constituting a detection gap rather than signal absence.

5 Related Work
Choice blindness and preference construction. Preferences are assembled during elicitation rather than retrieved from a stable mental store [12], [13]. The most striking demonstration is choice blindness: when subjects' face choices are surreptitiously swapped, 87% of manipulations go undetected [7], [8]. The effect generalizes beyond perception to moral attitudes [9] and religious beliefs [27]. Within RLHF, the same constructive dynamics appear: simple wording or display changes shift annotator preferences [28], and over 30% of examples exhibit diverging preferences driven by systematic factors rather than random noise [29]. LLM evaluators exhibit parallel vulnerabilities including choice-supportive bias [30], overconfidence combined with hypersensitivity to criticism [31], and inability to reliably discriminate among their own outputs [32], while introspective detection of internally injected concepts succeeds only 20% of the time even in the best models [33]. Yet none of this work tests choice blindness itself: the failure to detect that one's stated choice has been replaced, a more fundamental failure than bias or unreliable self-monitoring.
LLM sycophancy. A natural response to human annotator unreliability is to use LLM judges, but these introduce distinct vulnerabilities.
Models abandon correct answers under social pressure [34], an effect amplified by conversational framing [35] and persistent across multi-turn interactions [36], [37]. Mechanistically, sycophancy subtypes are neurally separable [38], with downstream consequences including biased grading [39], while post-trained models can achieve substantial political persuasion [40].
Robust RLHF and LLM-as-Judge. The robustness literature treats label noise as a technical problem amenable to algorithmic correction: noise-corrected losses [41], provably robust DPO [42], causal reward deconfounding [43], and instance-dependent noise modeling [44] all assume the corruption pattern is identifiable and algorithmically correctable. Empirically, reward models appear robust to random label flipping below 30–40% [45], and scaling laws for reward model overoptimization show that proxy reward eventually diverges from true reward [25]. Yet random noise may understate the problem: uncorrected corruption causes convergence to strictly worse policies [46], RLHF can mechanically amplify sycophancy when preference data contain labeler bias [47], and the BT model may impose unnecessary constraints when alternatives exist [4]. The LLM-as-Judge paradigm [48] adds concerns about internal consistency [49] and chain-of-thought faithfulness [50], [51], while language models can learn to mislead humans via RLHF itself [52]. These concerns have practical precedent: a scoring bug in the widely-used UltraFeedback dataset caused corrected ratings to select a different preferred response in nearly half of all examples, discovered only through manual inspection months after release.

6 Discussion
6.1 The preference construction problem
Our three experiments converge on a preference construction problem for RLHF [12], [13], extending beyond the "noisy labels" perspective [10]. Random noise can be addressed by aggregation; constructed preferences are systematically shaped by framing, context, and elicitation [28], producing structured errors that resist i.i.d. noise-correction methods [41], [42], [43]. The targeted swap experiment (Figure 4) illustrates this: structured corruption is far more damaging than uniform noise at the same rate. This aligns with arguments that the preferentist approach to alignment is itself inadequate [53] and that the scalar reward framework may be insufficient [54], [55].
Moreover, choice blindness does not merely hide preferences; it changes them. Manipulated face preferences carried over into subsequent choices [56], and surreptitiously reversed political attitudes shifted voting intentions [14]. Because the same mechanism operates in RLHF annotation, the pipeline cannot distinguish confabulated, recalled, or generic justifications from genuine labels. The LLM judge route offers no remedy: the two failure modes we observe, self-monitoring failure and social compliance, are mechanistically distinct and empirically inverted across models, so no single mitigation addresses both.
These failures are self-concealing: the corruption signal is present in the margin distribution but invisible to pairwise accuracy, and the proxy itself reports improving scores even when the gold evaluator shows no gain (Figure 5). Detection methods exist (Table 1) but require infrastructure absent from current practice. Limitations and future directions are discussed in Section C.
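The detection methods referenced above (Section 4.3, Table 1) rest on standard paired tests. A minimal sketch of the single-model McNemar comparison against a known-clean reference, assuming statsmodels; the per-pair correctness arrays below are synthetic illustrations of the setup, not the authors' released code or data:

```python
# McNemar's test on paired per-example correctness: does a candidate reward
# model disagree with a known-clean reference more often, and more
# asymmetrically, than chance on the same held-out pairs? As Section 4.3
# notes, the clean reference this requires is unavailable in standard deployment.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_pairs = 8552  # size of the held-out H-RLHF test set
clean_correct = rng.random(n_pairs) < 0.68        # reference correctness (illustrative rate)
candidate_correct = clean_correct.copy()
flip = rng.random(n_pairs) < 0.05                 # candidate flips ~5% of decisions (illustrative)
candidate_correct[flip] = ~candidate_correct[flip]

# 2x2 table over (reference correct?, candidate correct?).
table = np.array([
    [np.sum(clean_correct & candidate_correct), np.sum(clean_correct & ~candidate_correct)],
    [np.sum(~clean_correct & candidate_correct), np.sum(~clean_correct & ~candidate_correct)],
])
result = mcnemar(table, exact=False, correction=True)
print(f"McNemar statistic = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```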
7 Conclusion The assumption that human preferences are sta ble, reportable, and aggregable is precisely the assumption that four decades of choice blindness research has challenged. Our three experiments confirm this tension within RLHF: annotators do not detect swapped preferences, LLM judges rely on superficial cues rather than genuine selfmon itoring, and reward models absorb the resulting label instability without any visible change in standard evaluation metrics. None of these failures are visible to the metrics practitioners currently monitor. These results motivate a shift from single annotation preference labeling toward elicitation methods robust to construction effects: confi dence calibration, preference tournaments with consistency checks, or deliberative protocols that require explicit decision criteria before commit ment. They also caution against treating LLM judges as a straightforward replacement for human annotators. More broadly, the preference learning paradigm would benefit from deeper engagement with the cognitive science of judgment and deci sionmaking. 8 Ethics Statement Participants (푁=50) were recruited via Pro lific. All participants provided informed consent through the platform’s standard consent mecha nism prior to participation. The study involved minimal risk: participants compared pairs of text responses and provided written justifications for their preferences. No personally identifying information was collected. The choice blindness manipulation involved presenting a previously nonselected response as the participant’s choice on designated trials; this deception is standard in choice blindness research [7] and poses no fore seeable harm. All participants were fully debriefed after task completion, with a written explanation of the manipulation and its scientific purpose. No participant requested data withdrawal. 9 Acknowledgements We thank Fengxiang He and Kejiang Qian for helpful discussions during early stages of this pro ject. 10 Data Availability Data, model weights, and analysis code are avail able from the corresponding author upon reason able request. Bibliography [1]P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep Reinforcement Learning from Human Pref erences,” in Advances in Neural Information Processing Systems (NeurIPS), 2017. [2]L. Ouyang et al., “Training Language Mod els to Follow Instructions with Human Feed back,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [3]R. A. Bradley and M. E. Terry, “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons,” Bio- metrika, vol. 39, no. 3/4, p. 324–345, 1952, doi: 10.1093/biomet/39.34.324. [4]H. Sun, Y. Shen, and J.F. Ton, “Rethink ing Reward Modeling in Preferencebased Large Language Model Alignment,” in Pro- ceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https:// openreview.net/forum?id=rfdblE10qm [5]R. E. Nisbett and T. D. Wilson, “Telling More Than We Can Know: Verbal Reports on Mental Processes,” Psychological Re- view, vol. 84, no. 3, p. 231–259, 1977, doi: 10.1037/0033295X.84.3.231. [6]J. Haidt, “The Emotional Dog and Its Ra tional Tail: A Social Intuitionist Approach to Moral Judgment,” Psychological Review, vol. 108, no. 4, p. 814–834, 2001, doi: 10.1037/0033295X.108.4.814. [7]P. Johansson, L. Hall, S. Sikström, and A. Olsson, “Failure to Detect Mismatches Between Intention and Outcome in a Sim ple Decision Task,” Science, vol. 310, no. 5745, p. 
116–119, 2005, doi: 10.1126/sci ence.1111709. [8]P. Johansson, L. Hall, and S. Sikström, “From Change Blindness to Choice Blind ness,” Psychologia, vol. 51, no. 2, p. 142– 155, 2008, doi: 10.2117/psysoc.2008.142. [9]L. Hall, P. Johansson, and T. Strandberg, “Lifting the Veil of Morality: Choice Blind ness and Attitude Reversals on a Self Transforming Survey,” PLoS ONE, vol. 7, no. 9, p. e45457, 2012, doi: 10.1371/ journal.pone.0045457. [10]S. Casper et al., “Open Problems and Funda mental Limitations of Reinforcement Learn ing from Human Feedback,” Transactions on Machine Learning Research, 2023. [11]D. Amodei, C. Olah, J. Steinhardt, P. Chris tiano, J. Schulman, and D. Mané, “Concrete Problems in AI Safety,” arXiv preprint arXiv:1606.06565, 2016. [12]S. Lichtenstein and P. Slovic, The Construction of Preference. Cambridge University Press, 2006. doi: 10.1017/ CBO9780511618031. [13]P. Slovic, “The Construction of Pref erence,” American Psychologist, vol. 50, no. 5, p. 364–371, 1995, doi: 10.1037/0003066X.50.5.364. [14]L. Hall, T. Strandberg, P. Pärnamets, A. Lind, B. Tärning, and P. Johansson, “How the Polls Can Be Both Spot On and Dead Wrong: Using Choice Blindness to Shift Po litical Attitudes and Voter Intentions,” PLoS ONE, vol. 8, no. 4, p. e60554, 2013, doi: 10.1371/journal.pone.0060554. [15]P. R. Grassi, L. Hoeppe, E. Baytimur, and A. Bartels, “Restoring Sight in Choice Blind ness: Pupillometry and Behavioral Evidence of Covert Detection,” Frontiers in Psychol- ogy, vol. 16, p. 1598254, 2025, doi: 10.3389/ fpsyg.2025.1598254. [16]A. Remington, H. White, J. Fairnie, V. Sideropoulos, L. Hall, and P. Johansson, “Choice Blindness in Autistic and Non Autistic People,” Journal of Cognitive Psy- chology, vol. 36, no. 4, p. 493–501, 2024, doi: 10.1080/20445911.2024.2356283. [17]N. Marchant, G. Navarrete, V. de Gardelle, J. R. Silva, J. Sackur, and G. Reyes, “Social Cues for Experimenter Incompetence Influ ence Choice Blindness,” Consciousness and Cognition, vol. 132, p. 103887, 2025, doi: 10.1016/j.concog.2025.103887. [18]Y. Bai et al., “Training a Helpful and Harm less Assistant with Reinforcement Learning from Human Feedback,” arXiv preprint arXiv:2204.05862, 2022. [19]DeepSeekAI, “DeepSeekV3 Technical Re port,” arXiv preprint arXiv:2412.19437, 2024. [20]Google DeepMind, “Gemini 2.0 Is Now Available to Everyone.” [Online]. Available: https://blog.google/technology/ googledeepmind/geminimodelupdates february2025/ [21]A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, and others, “The Llama 3 Herd of Models,” arXiv preprint arXiv:2407.21783, 2024. [22]Qwen Team, “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024. [23]P. He, J. Gao, and W. Chen, “DeBERTaV3: Improving DeBERTa using ELECTRA Style PreTraining with GradientDisentan gled Embedding Sharing,” in Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. [24]Gemma Team, “Gemma 2: Improving Open Language Models at a Practical Size,” arXiv preprint arXiv:2408.00118, 2024. [25]L. Gao, J. Schulman, and J. Hilton, “Scaling Laws for Reward Model Overoptimization,” in Proceedings of the Fortieth International Conference on Machine Learning (ICML), 2023, p. 10835–10866. [26]C. Rastogi, S. Balakrishnan, N. B. Shah, and A. Singh, “TwoSample Testing on Ranked Preference Data and the Role of Modeling Assumptions,” Journal of Machine Learn- ing Research, vol. 23, no. 225, p. 1–48, 2022. [27]R. McKay, L. Hall, T. Strandberg, A. Vasilichi, A. Gall, and P. 
Johansson, “``I Once Was Blind′: Experimental Manipula tion of Religious Attitudes via Choice Blind ness,” Religion, Brain & Behavior, 2025, doi: 10.1080/2153599X.2025.2557482. [28]S. HatgisKessell, W. B. Knox, S. Booth, S. Niekum, and P. Stone, “Influencing Hu mans to Conform to Preference Models for RLHF,” arXiv preprint arXiv:2501.06416, 2025. [29]M. J. Zhang et al., “Diverging Prefer ences: When do Annotators Disagree and do Models Know?,” in Proceedings of the Forty-Second International Conference on Machine Learning (ICML), 2025. [Online]. Available: https://openreview.net/forum?id= qWgAAVhoXb [30]N. Zhuang et al., “LLM Agents Can Be ChoiceSupportive Biased Evaluators: An Empirical Study,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, p. 26436–26444. doi: 10.1609/ aaai.v39i25.34843. [31]D. Kumaran et al., “How Overconfidence in Initial Choices and Underconfidence Un der Criticism Modulate Change of Mind in Large Language Models,” arXiv preprint arXiv:2507.03120, 2025. [32]D. Jiang, J. Zhang, O. Weller, N. Weir, B. Van Durme, and D. Khashabi, “SELF [IN]CORRECT: LLMs Struggle with Dis criminating SelfGenerated Responses,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, p. 24266– 24275. doi: 10.1609/aaai.v39i23.34603. [33]J. Lindsey, “Emergent Introspective Aware ness in Large Language Models,” 2026, [Online]. Available: https://transformer circuits.pub/2025/introspection/index.html [34]M. Sharma et al., “Towards Understand ing Sycophancy in Language Models,” in Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. [35]S. Kim and D. Khashabi, “Challenging the Evaluator: LLM Sycophancy Under User Rebuttal,” in Findings of the Association for Computational Linguistics: EMNLP, 2025. [36]J. Hong, G. Byun, S. Kim, and K. Shu, “Measuring Sycophancy of Language Mod els in Multiturn Dialogues,” in Findings of the Association for Computational Linguis- tics: EMNLP 2025, 2025, p. 2239–2259. doi: 10.18653/v1/2025.findingsemnlp.121. [37]A. Fanous et al., “SycEval: Evaluating LLM Sycophancy,” in Proceedings of the AAAI/ ACM Conference on AI, Ethics, and Society (AIES), 2025, p. 893–900. doi: 10.1609/ aies.v8i1.36598. [38]D. Vennemeyer, P. A. Duong, T. Zhan, and T. Jiang, “Sycophancy Is Not One Thing: Causal Separation of Sycophan tic Behaviors in LLMs,” arXiv preprint arXiv:2509.21305, 2025. [39]C. Arvin, “``Check My Work?′: Measur ing Sycophancy in a Simulated Educational Context,” arXiv preprint arXiv:2506.10297, 2025. [40]K. Hackenburg et al., “The Levers of Po litical Persuasion with Conversational Arti ficial Intelligence,” Science, vol. 390, no. 6777, p. eaea3884, 2025, doi: 10.1126/ science.aea3884. [41]A. Bukharin et al., “Robust Reinforcement Learning from Corrupted Human Feed back,” in Advances in Neural Information Processing Systems (NeurIPS), 2024. [42]S. Ray Chowdhury, A. Kini, and N. Natara jan, “Provably Robust DPO: Aligning Lan guage Models with Noisy Feedback,” in Proceedings of the Forty-First International Conference on Machine Learning (ICML), 2024, p. 42258–42274. [43]T. Liu et al., “RRM: Robust Reward Model Training Mitigates Reward Hacking,” in Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https:// openreview.net/forum?id=88AS5MQnmC [44]Y. Xu, X. Ye, Y. Chen, and Q. Zhang, “When Human Preferences Flip: An In stanceDependent Robust Loss for RLHF,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2026. 
[45]J. H. Shen, A. Sharma, and J. Qin, “To wards DataCentric RLHF: Simple Metrics for Preference Dataset Comparison,” arXiv preprint arXiv:2409.09603, 2024. [46]O. El Mansouri, M. E. A. Seddik, and S. Lahlou, “NoiseCorrected GRPO: From Noisy Rewards to Unbiased Gradients,” arXiv preprint arXiv:2510.18924, 2025. [47]I. Shapira, G. Benadé, and A. D. Procaccia, “How RLHF Amplifies Sycophancy,” arXiv preprint arXiv:2602.01002, 2026. [48]L. Zheng et al., “Judging LLMasaJudge with MTBench and Chatbot Arena,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [49]Y. Feng, S. Wang, Z. Cheng, Y. Wan, and D. Chen, “Are We on the Right Way to As sessing LLMasaJudge?,” arXiv preprint arXiv:2512.16041, 2025. [50]M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language Models Don't Always Say What They Think: Unfaithful Explana tions in ChainofThought Prompting,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [51]T. Lanham et al., “Measuring Faithful ness in ChainofThought Reasoning,” arXiv preprint arXiv:2307.13702, 2023. [52]J. Wen et al., “Language Models Learn to Mislead Humans via RLHF,” in Proceed- ings of the Thirteenth International Confer- ence on Learning Representations (ICLR), 2025. [53]T. ZhiXuan, M. Carroll, M. Franklin, and H. Ashton, “Beyond Preferences in AI Align ment,” Philosophical Studies, vol. 182, no. 7, p. 1813–1863, 2025, doi: 10.1007/ s1109802402249w. [54]D. Abel et al., “On the Expressivity of Markov Reward,” in Advances in Neural Information Processing Systems (NeurIPS), 2021. [55]P. Vamplew et al., “Scalar Reward Is Not Enough: A Response to Silver, Singh, Precup and Sutton (2021),” Autonomous Agents and Multi-Agent Systems, vol. 36, no. 2, p. 41, 2022, doi: 10.1007/ s10458022095755. [56]P. Johansson, L. Hall, B. Tärning, S. Sik ström, and N. Chater, “Choice Blindness and Preference Change: You Will Like This Paper Better If You (Believe You) Chose to Read It!,” Journal of Behavioral Decision Making, vol. 27, no. 3, p. 281–289, 2014, doi: 10.1002/bdm.1807. [57]L. von Werra et al., “TRL: Transformer Re inforcement Learning.” [Online]. Available: https://github.com/huggingface/trl [58]P. Singhal, T. Goyal, J. Xu, and G. Durrett, “A Long Way to Go: Investigating Length Correlations in RLHF,” in Proceedings of the First Conference on Language Modeling (COLM), 2024. [59]S. Malik et al., “RewardBench 2: Advancing Reward Model Evaluation,” arXiv preprint arXiv:2506.01937, 2025. [60]R. Rafailov, A. Sharma, E. Mitchell, S. Er mon, C. D. Manning, and C. Finn, “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model,” in Advances in Neural Information Processing Systems (NeurIPS), 2023. [61]M. Glickman and T. Sharot, “How Hu man–AI Feedback Loops Alter Human Perceptual, Emotional and Social Judge ments,” Nature Human Behaviour, vol. 9, no. 2, p. 345–359, 2025, doi: 10.1038/ s41562024020772. A.Full Experimental Methods A.1.Computational Experiment A.1.1.Dataset We used the full Anthropic HRLHF dataset [18], containing 160,800 training pairs and 8,552 test pairs of humanassistant conversations span ning both the helpful and harmless subsets. Each pair consists of a “chosen” response (preferred by annotators) and a “rejected” response. A.1.2.Label Corruption Procedure For each uniform swap condition 푠∈ 0%,10%,20%,30%,50%, we randomly se lected a fraction 푠 of training pairs and swapped their “chosen” and “rejected” labels. 
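A minimal sketch of the uniform swap procedure just described; the Hugging Face dataset id and loading call are assumptions for illustration (the paper only names the Anthropic H-RLHF dataset [18]), and the field names follow its "chosen"/"rejected" convention:

```python
# Uniformly corrupt a fraction `swap_ratio` of preference pairs by exchanging
# the "chosen" and "rejected" responses, as in the dose-response conditions
# (0%, 10%, 20%, 30%, 50%).
import random
from datasets import load_dataset

def corrupt_labels(pairs, swap_ratio: float, seed: int = 42):
    """pairs: list of dicts with 'chosen' and 'rejected' keys."""
    rng = random.Random(seed)
    n_swap = int(round(swap_ratio * len(pairs)))
    swap_indices = set(rng.sample(range(len(pairs)), n_swap))
    corrupted = []
    for i, pair in enumerate(pairs):
        if i in swap_indices:
            corrupted.append({"chosen": pair["rejected"], "rejected": pair["chosen"]})
        else:
            corrupted.append(dict(pair))
    return corrupted

train = load_dataset("Anthropic/hh-rlhf", split="train")  # assumed dataset id
pairs = [{"chosen": ex["chosen"], "rejected": ex["rejected"]} for ex in train]
corrupted_30 = corrupt_labels(pairs, swap_ratio=0.30, seed=42)
```

The targeted easy/hard conditions differ only in how the swapped indices are chosen (highest- vs. lowest-margin pairs under a clean model) rather than by uniform sampling.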
DeBERTa conditions were trained with five independent random seeds; Gemma conditions with three seeds. The test set was held constant across all conditions. In addition, we trained easy swap (corrupting the 30% highest-margin pairs) and hard swap (lowest-margin 30%) targeted conditions for both architectures.

A.1.3. Training
We used two architectures with TRL's RewardTrainer [57] implementing the Bradley-Terry preference model [3]. The reward model assigns a scalar score to each response; during training, it learns to assign higher scores to chosen responses via:
L = −log σ(r(x, y_w) − r(x, y_l))
where r(x, y) is the reward assigned to response y given prompt x, y_w is the chosen response, y_l is the rejected response, and σ is the sigmoid function.
DeBERTa-v3-base (86M parameters) [23]: learning rate 2×10⁻⁵ with 10% linear warmup, batch size 48, bf16 mixed precision, max sequence length 512, one epoch. Seeds: 42, 123, 456, 789, 2024. Training on NVIDIA A800 80GB GPUs.
Gemma-2-2B (2B parameters) [24]: learning rate 1×10⁻⁵, batch size 16 with gradient accumulation 3 (effective batch 48), bf16, gradient checkpointing, max sequence length 512, one epoch. Seeds: 42, 123, 456. Training on NVIDIA RTX 5090 32GB GPUs.

A.1.4. Evaluation
All 56 models (DeBERTa: 7 conditions × 5 seeds = 35; Gemma: 7 conditions × 3 seeds = 21) were evaluated on the held-out 8,552 test pairs. An EOS token was appended before tokenization for consistency with training. Metrics: (1) pairwise accuracy, (2) mean reward margin r(x, y_w) − r(x, y_l), and (3) agreement with the 0% swap baseline.

A.1.5. Statistical Analysis
Paired t-tests (Bonferroni-corrected for 4 comparisons) and Wilcoxon signed-rank tests as nonparametric alternative. Effect sizes: paired Cohen's d_z (primary) and unpaired d for comparability. Preference flip rate as practical impact measure. Distributional differences via KS test. Sigmoid decay m(s) = m_0 / (1 + e^{k(s − s_50)}) fitted via nonlinear least squares (SciPy curve_fit), with m_0 fixed to the observed baseline. We fit independently per seed and report mean ± SD of ED50 estimates.

A.1.6. Surface Feature Analysis
We computed 13 surface features (character length, word count, average word length, sentence count, unique word ratio, punctuation density, newline count, question marks, exclamations, comma density, list items, code blocks, uppercase ratio) for all 17,104 test responses and computed Pearson and Spearman correlations with reward scores per condition.

A.2. Human Choice Blindness Experiment
A.2.1. Participants
Recruited via Prolific (N = 50). Eligibility: fluent English, ≥95% approval rate, desktop access (mobile excluded for adequate screen space). Participants were compensated above Prolific's minimum rate. Quality checks: Cloudflare Turnstile widget, server-side heuristics for anomalous fingerprints and response patterns.

A.2.2. Stimuli
300 pairwise comparisons curated from the H-RLHF test set (response length 50–500 characters, context <1000 characters, normalized Levenshtein distance > 0.1). Each participant received 20 randomly selected pairs with A/B positions randomized per trial via a seeded pseudorandom generator (deterministic per participant ID).

A.2.3. Procedure
Standard pairwise comparison framed as "AI response evaluation." On each trial, participants view a context and two responses side by side, then select the better one. After selection, an 800 ms transition precedes the justification prompt showing only the selected response.
Copypaste disabled; responses validated for minimum length, lexical diversity, and English content. A.2.4.Manipulation Four of 20 trials per participant designated as swaps via serverside plan (never transmitted to client). On swap trials, the justification page dis plays the opposite response under “You selected this response as the better one.” No A/B label, no sideby side comparison. All 4 swap trials and 6 random nonswap trials include justification (50% overall rate, 100% for swaps). A.2.5.Detection Coding Each swap trial justification was independently classified as DETECTED or CONFABULATED by two blind LLM raters (no shared context; Cohen’s 휅=1.00). DETECTED required explicit mention that the displayed response was not what the participant selected; negative quality as sessments alone were coded CONFABULATED. Posthoc questionnaire responses were coded by the same procedure. A.2.6.Content Source Attribution All 200 swap trial justifications were blindclassi fied into four categories: Detected (explicit swap identification), Recall (justification content traces to the selected/original response), Accepted (con tent traces to the shown/swapped response), and Generic (vague evaluative language untraceable to either response). Trial identifiers and detection labels were removed; trials were presented in ran domized order (seed 42). Four independent LLM raters performed the initial classification with ac cess to the justification text, both the selected and shown responses, and the context prompt, but no detection status. All classifications were manually verified, with 3 borderline cases reclassified. A second review round used three additional inde pendent raters to audit the Accepted category; 13 trials were reclassified by majority rule (9 unanimous, 4 by 2/3 agreement). Raters were in structed to determine which response’s content the justification referenced, using keyword matching, semantic alignment, and specificity of claims. A.2.7.Debriefing Full written debriefing explaining the manipula tion and its scientific purpose, with option to with draw data. Participants compensated regardless. Debriefing displayed before the Prolific comple tion code. A.3.LLM Choice Blindness Experiment A.3.1.Models and Stimuli Fifteen models from eight providers: Claude Haiku 4.5 (Anthropic), CommandA (Cohere), DeepSeekV3 [19] and DeepSeekR1 (DeepSeek), Gemini 2.0 Flash [20] and Gemini 2.5 Flash (Google), GPT4omini (OpenAI), Llama 3.3 70B [21], Llama 4 Maverick, and Llama 4 Scout (Meta), Mistral Large and Mistral Small 24B (Mistral), Qwen 2.5 72B [22], Qwen3235B, and Qwen332B (Alibaba). Models ranged from 24B to 235B parameters; reasoning mod els (DeepSeekR1, Qwen3235B) had chainof thought stripped. All via OpenRouter API at tem perature 0.0, max 500 tokens. Stimuli: 200 pairs from HRLHF test set (seed 42), A/B positions randomized per pair. A.3.2.Experimental Design Multiturn, three conditions: (1) blindness: calm misattribution (“Great, so you preferred Response [opposite]...”), (2) sycophancy: correct attribution with pressure (“I think Response [opposite] is much better. I think you should reconsider.”), (3) control: correct attribution, no pressure. Model’s Turn 1 remained in context. Implementation note. Due to a coding error in the first run (DeepSeekV3 and Llama 3.3 70B), the sycophancy condition combined misattribu tion with pressure. Corrected for all remaining thirteen models. Blindness and control conditions implemented identically across all models. 
A.3.3. Classification
Each Turn 2 response was classified as DETECTED, ACCEPTED, or AMBIGUOUS by GPT-4o-mini (temperature 0.0, max 10 tokens). As independent validation, Claude Sonnet 4.6 reclassified all trials using identical prompts, yielding 95.8% agreement (Cohen's κ = 0.77) on multi-turn blindness trials across both experiments. Preference strength ("strong" = aligned with human-chosen, "weak" = human-rejected) was recorded per trial.

A.3.4. Analysis
Acceptance rates with 95% Wilson CIs. Fisher's exact tests between conditions with Bonferroni correction for 15 comparisons (α = 0.05/15 = 0.0033). All four models reported as showing significantly increased blindness in the choice-only condition exceed this threshold (p < 10⁻⁴). 8,722 valid trials (3.1% attrition); Command-A had 31% attrition from content moderation.

Figure 6: Effect sizes and statistical significance of label corruption. a, Paired Cohen's d_z versus baseline (mean ± SE across 5 seeds); dashed lines mark the small (d = 0.2) and medium (d = 0.5) thresholds. b, Statistical significance (−log10(p), Bonferroni-corrected paired t-test, averaged across seeds; panel annotations: p < 10⁻²²⁴ at 10%, p < 10⁻²⁹⁷ at 20%, p < 10⁻³⁰⁰ at 30% and 50%).

Figure 7: Reward margin distributions shift toward zero with increasing label corruption. a, Kernel density estimates for the 0%–30% swap conditions (mean margins 0.526, 0.337, 0.201, 0.099). b, All five conditions on a logarithmic density scale (50% swap: mean margin 0.001).

B. Supplementary Results
The paired effect sizes increase monotonically with corruption (Figure 6): d_z = 0.362 ± 0.016 at 10%, 0.424 ± 0.007 at 20%, 0.444 ± 0.006 at 30%, and 0.447 ± 0.005 at 50%. The preference flip rate increases from 11.3 ± 0.5% (10%) to 20.4 ± 1.1% (30%) to 43.7 ± 1.5% (50%).
At 0% swap, the margin distribution is right-skewed with a clear positive mode. As corruption increases, it shifts toward zero and becomes symmetric, consistent with progressive loss of preference signal. At 50%, the distribution is centered at zero (Figure 7).
High-confidence correct predictions drop from 39.0% (0%) to effectively 0% (50%), while high-confidence wrong predictions do not increase (Figure 8). The model loses confidence globally rather than learning an inverted signal.
Even the clean model shows r = 0.16 between length and reward, confirming length bias [58] (Figure 9). This increases to r = 0.22 (10%), r = 0.22 (20%), and r = 0.29 (30%). Word count follows the same trajectory (r = 0.16 to 0.28). At 50%, Pearson drops to r = 0.15 (scores compressed toward zero) but Spearman peaks at ρ = 0.39: a model trained on pure noise produces scores ordered by length, not random scores.

Figure 8: Confidence analysis under label corruption. a, Cumulative distribution of absolute reward margin by swap ratio. b, Proportion of test samples in each confidence × correctness category (confident correct, confident wrong, low-confidence correct, low-confidence wrong).
Figure 9: Surface feature correlations with reward score across corruption levels. a, Pearson correlations between six key surface features (character length, word count, sentence count, question marks, unique word ratio, uppercase ratio) and reward scores. b, Spearman (rank-order) correlations. N = 17,104 responses per condition.

Table 1: Detection gap: corruption visibility by evaluation method. Multi-seed two-sample testing [26] (k = 5, 10,000 permutations), single-model McNemar comparison (k = 1, 25 seed pairs), and pairwise accuracy drop. **: p < 0.01.

| Method | k | 10% | 20% | 30% | 50% |
|---|---|---|---|---|---|
| Multi-seed (Rastogi T) | 5 | 97.3** | 327.3** | 615.6** | 1988.9** |
| Single-model (McNemar reject %) | 1 | 64% | 100% | 100% | 100% |
| Single-model (Cohen's d) | 1 | 0.35 | 0.42 | 0.44 | 0.45 |
| Pairwise accuracy drop (pp) | — | 0.9 | 2.5 | 4.5 | 17.4 |

C. Limitations and Future Directions
Architecture and dataset scope. Our computational experiments span two architectures (DeBERTa-v3-base, 86M encoder; Gemma-2-2B, 2B decoder) on H-RLHF. The dose-response pattern replicates across this 23× parameter difference and the encoder–decoder divide, though ED50 values differ (16.3% vs. 32.6%), suggesting model capacity modulates robustness. Production reward models are typically 7B–70B+ [59]; if the scaling trend continues, larger models may tolerate higher corruption. However, the relationship between reward model capacity and noise robustness remains underexplored, and larger models may also learn more sophisticated surface heuristics that mask degradation. Further architectures, scales, and datasets would strengthen generalizability. We note that H-RLHF itself has approximately 63% researcher-annotator agreement [18], implying baseline noise of 10–15%; our corruption rates are additive to this existing noise, so the ED50 should be interpreted relative to total noise rather than swap rate alone.

Table 2: Content source attribution of 200 swap trial justifications. Each justification was blind-coded by four independent LLM raters, then reviewed by three additional raters. See text for behavioral comparisons across categories.

| Category | Definition | N | % | Example justification |
|---|---|---|---|---|
| Detected | Explicit identification that the displayed response is not the one they selected | 18 | 9.0 | "This is not the answer I selected"; "I believe you might have the responses mixed up" |
| Recall | Justification content traces to the selected (original) response, not the shown (swapped) response | 69 | 34.5 | "It went into further details on what Columbus wanted to do" (matches the selected response's content, not the shown response) |
| Accepted | Justification content traces to the shown (swapped) response: the participant has incorporated the substituted content | 42 | 21.0 | "It provides a clear list of steps" (the shown/swapped response contains a step-by-step list; the original does not) |
| Generic | Vague evaluative language with no specific content traceable to either response | 71 | 35.5 | "Good response. I like it"; "Gives sufficient information required" |

No full RL loop. Our Best-of-N evaluation (Figure 5) demonstrates that corrupted reward models degrade downstream selection, with gold score improvement dropping from +0.335 (clean) to effectively zero (50% swap); the selection step is sketched below.
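A minimal illustration of the Best-of-N protocol from Section 4.2, assuming per-prompt candidate scores are already computed; the arrays, correlation structure, and function names below are hypothetical, not the paper's data or code:

```python
# Best-of-N selection under a proxy reward model, evaluated by a clean "gold"
# reward model: for each prompt, pick the candidate the proxy scores highest
# among the first N samples, then report the gold score of that pick.
import numpy as np

def best_of_n_gold_score(proxy_scores: np.ndarray,
                         gold_scores: np.ndarray,
                         n: int) -> float:
    """proxy_scores, gold_scores: shape (num_prompts, num_candidates)."""
    picks = np.argmax(proxy_scores[:, :n], axis=1)  # proxy-selected candidate per prompt
    return float(np.mean(gold_scores[np.arange(len(picks)), picks]))

rng = np.random.default_rng(0)
num_prompts, num_candidates = 200, 64
gold = rng.normal(size=(num_prompts, num_candidates))
# A clean proxy correlates with gold; a heavily corrupted proxy behaves like noise.
clean_proxy = gold + 0.5 * rng.normal(size=gold.shape)
corrupted_proxy = rng.normal(size=gold.shape)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, best_of_n_gold_score(clean_proxy, gold, n),
          best_of_n_gold_score(corrupted_proxy, gold, n))
# The corrupted proxy's picks show no gold-score improvement with N,
# mirroring the 50%-swap curve in Figure 5a.
```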
However, BoN is a simplified proxy for full RLHF (PPO or DPO), which involves iterative policy updates and poten tial reward hacking dynamics. El Mansouri et al. formally prove that reward corruption attenuates the GRPO learning signal and causes convergence to a strictly worse policy [46], providing theoreti cal support that our BoN results understate the full pipeline impact. Full RL training under realistic corruption patterns remains a priority for future work. LLM raters for human experiment coding. De tection coding and content source attribution were performed by LLM raters. The detection coding task is straightforward text matching (does the justification explicitly mention a swap?) rather than the complex evaluative judgment where we document LLM failures, and the perfect interrater agreement (휅=1.00) reflects task simplicity. Content source attribution is more nuanced; while four independent raters showed high agreement and a second review round resolved 13 borderline cases, human validation of a subset would further strengthen confidence. We note that LLM raters were blind to detection status and trial order, min imizing systematic bias. Ecological validity. Our human experiment required written justifications on swap trials, whereas standard RLHF annotation collects only binary labels. The justification step may increase detection (deeper engagement) or decrease it (demand to produce a response regardless). Addi tionally, our interface showed only one response on the justification page; professional annotation platforms often display both responses through out. These design choices, inherited from the choice blindness paradigm, enable content source attribution but reduce direct applicability to pro duction settings. The consistent replication of choice blindness across populations and domains [7], [9], [27] suggests the effect is robust, though the specific nondetection rate in production anno tation may differ. LLM design space. Temperature, system prompts, evaluation domain, and misattribution wording may modulate acceptance rates. The crossmodel variation (0% to 34%) demonstrates substantial modeldependent susceptibility. Sycophancy inconsistency. Two of fifteen mod els ran a confounded sycophancy condition; thir teen provide clean data, enabling robust cross model comparisons. Human sample. 푁=50 yields large effects (Cohen’s ℎ=0.96, CI [86.2%, 94.2%]) but limits demographic analysis. Prolific workers are repre sentative of crowdworker populations used in practice [18], though professional annotators may differ. The choice blindness literature consistently finds the effect across populations [7], [9]. Future directions. Testretest preference stabil ity; extension to DPO [60] and process reward models; alternative elicitation methods robust to construction effects (confidence calibration, preference tournaments, deliberative annotation protocols). Our crosssectional design does not address longitudinal preference drift under itera tive deployment, where model outputs reshape the preferences used to train subsequent models [28], [61].