
Paper deep dive

When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu, Pavan Chakraborty

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 38

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%

Last extracted: 3/22/2026, 6:05:37 AM

Summary

The paper introduces ICE-Guard, a framework for detecting and mitigating spurious feature reliance in LLMs across 10 high-stakes domains. By applying intervention consistency testing to demographic, authority, and framing biases, the authors find that authority and framing biases (mean flip rates of 5.8% and 5.0%) significantly exceed demographic bias (2.2%). The study demonstrates that structured decomposition, in which the LLM extracts features and a deterministic rubric decides, can reduce decision flip rates by up to 100%, providing a robust mitigation strategy for LLM-based decision systems.

Entities (6)

ICE-Guard · framework · 100%
Authority Bias · bias-type · 95%
Demographic Bias · bias-type · 95%
Framing Bias · bias-type · 95%
Structured Decomposition · mitigation-technique · 95%
Finance · domain · 90%

Relation Signals (3)

ICE-Guard detects Authority Bias

confidence 95% · We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic... authority... and framing

Finance exhibits high Authority Bias

confidence 95% · finance shows 22.6% authority bias

Structured Decomposition reduces Flip Rate

confidence 95% · structured decomposition... reduces flip rates by up to 100%

Cypher Suggestions (2)

Find domains with the highest reported bias types · confidence 90% · unvalidated

MATCH (d:Domain)-[r:EXHIBITS]->(b:BiasType) RETURN d.name, b.name, r.percentage ORDER BY r.percentage DESC

Identify mitigation techniques and their impact on bias · confidence 85% · unvalidated

MATCH (m:Mitigation)-[r:REDUCES]->(f:FlipRate) RETURN m.name, r.reduction_percentage

Abstract

Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field's narrow focus on demographics; (2) bias concentrates in specific domains -- finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

37,564 characters extracted from source content.


When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu
Indian Institute of Information Technology, Allahabad (IIITA)
National Institute of Electronics and Information Technology (NIELIT)
abhinaba.basu@iiita.ac.in

Pavan Chakraborty
Indian Institute of Information Technology, Allahabad (IIITA)
pavan@iiita.ac.in

Abstract

Large language models are increasingly deployed for high-stakes decisions—hiring, criminal sentencing, lending, medical triage—yet their susceptibility to spurious features remains poorly characterized across decision domains. We introduce ICE-Guard, a framework that applies intervention consistency testing to systematically detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 decision vignettes spanning 10 high-stakes domains, we evaluate 11 instruction-tuned LLMs from 8 model families and find that (1) authority bias (mean 5.8%) and framing bias (mean 5.0%) substantially exceed demographic bias (mean 2.2%), challenging the community’s narrow focus on demographics; (2) bias concentrates in specific domains—finance exhibits 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (Kimi-K2.5: 5.5% → 0.0%, 95% CI [0.0, 0.8]%; median reduction 49% across 9 models). We further demonstrate an ICE-guided detect→diagnose→mitigate→verify loop that iteratively patches extraction prompts, achieving cumulative 78% bias reduction. We validate against real COMPAS recidivism data, finding that COMPAS-derived flip rates for a single model exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Our benchmark and evaluation code are publicly available.

1 Introduction

Consider a hiring AI reviewing two identical resumes.
The only difference: one lists “James Smith,” the other “Jamal Washington.” The system recommends James for an interview, citing “strong technical background”—the same background shared by both candidates. The reasoning sounds coherent, but the decision reveals dependence on a feature that should be irrelevant.

This scenario is not hypothetical. Large language models (LLMs) are increasingly used for consequential decisions: resume screening Gallegos et al. (2024), bail recommendations Liu et al. (2024), loan approvals Feng et al. (2024), and medical triage Poulain et al. (2024). While LLM bias benchmarks exist Parrish et al. (2022); Nadeem et al. (2021), they primarily measure stereotyping in cloze or generation tasks—not whether bias changes actual decisions. A model might associate “nurse” with “female” on StereoSet yet still make equitable hiring recommendations. Conversely, a model that passes cloze-format fairness tests might flip verdicts based on applicant names.

Moreover, demographic bias (names, race, gender) receives disproportionate attention despite being only one type of spurious feature reliance. Two other cognitive bias categories are equally consequential but understudied in the LLM literature:

• Authority/prestige bias: Decisions change based on credential source (“Harvard study” vs. “community college study”) even when the underlying evidence is identical.
• Framing bias: Decisions change based on phrasing (“95% survival rate” vs. “5% mortality rate”) even when the statistics are equivalent.

We introduce ICE-Guard, a framework that adapts intervention consistency testing from the explanation faithfulness literature Anonymous (2026) to bias detection. Our approach is principled: if a model’s decision is invariant to a feature, swapping that feature should not change the outcome. Features that do change outcomes despite being decision-irrelevant indicate spurious reliance.

Contributions.
(1) We propose ICE-Guard, a framework testing three cognitive bias types (demographic, authority, framing) across 10 high-stakes decision domains using intervention consistency. (2) We construct a benchmark of 3,000 decision vignettes with controlled feature swaps and evaluate 11 LLMs from 8 families, finding that authority and framing biases substantially exceed demographic bias. (3) We propose structured decomposition (LLM extracts features, Python rubric decides) as an architectural mitigation, reducing flip rates by up to 100%. (4) We demonstrate an ICE-guided detect→diagnose→mitigate→verify loop for iterative bias reduction.

2 Related Work

LLM Bias Benchmarks. BBQ Parrish et al. (2022) tests social biases through ambiguous question answering. StereoSet Nadeem et al. (2021) and CrowS-Pairs Nangia et al. (2020) measure stereotypical associations via cloze tasks. BOLD Dhamala et al. (2021) evaluates generation fairness. FLEX Jung et al. (2025) extends these benchmarks with adversarial prompts but remains focused on stereotypical associations. These benchmarks test whether models associate stereotypes, not whether stereotypes change decisions. Our work fills this gap by measuring decision flip rates under controlled interventions.

Fairness in Decision Systems. AIF360 Bellamy et al. (2019) and Fairlearn Bird et al. (2020) provide fairness toolkits for tabular ML models. Recent work extends fairness testing to LLM-based decisions Gallegos et al. (2024); Feng et al. (2024), but typically focuses on a single domain and demographic features only. We systematically test three bias types across ten domains.

Counterfactual Fairness. Counterfactual fairness Kusner et al. (2017) tests whether decisions change when sensitive attributes are modified. The CAFFE framework Parziale et al.
(2026) formalizes counterfactual test cases for LLMs with intent-aware prompting, and concurrent work measures counterfactual flip rates (5.4–13%) in LLM-based contact center QA across 18 models Mayilvaghanan et al. (2026). Our work differs from these in three respects: (a) we test authority and framing bias beyond demographics; (b) we evaluate 10 high-stakes decision domains simultaneously; (c) we propose structured decomposition as mitigation and quantify its effectiveness. Our finding that authority bias (5.8%) exceeds demographic bias (2.2%) would be invisible to demographic-only testing.

Intervention Consistency. The ICE framework Anonymous (2026) introduced intervention consistency for evaluating explanation faithfulness: if removing attributed tokens changes model behavior, the explanation is faithful. We adapt this principle to bias: if swapping irrelevant features changes decisions, the model exhibits spurious reliance. This connects to invariance testing in causal inference Peters et al. (2016) and invariant risk minimization Arjovsky et al. (2019).

LLM Self-Evaluation. Self-Refine Madaan et al. (2023) and Constitutional AI Bai et al. (2022) use LLMs to evaluate their own outputs—circular and expensive ($0.10+/query). We use external intervention consistency testing instead (§5.3).

3 ICE-Guard Framework

3.1 Intervention Consistency for Decisions

Let f be a decision model, x an input vignette, and y = f(x) the model’s decision with rationale r. We partition the features of x into decision-relevant features S (qualifications, evidence, case facts) and decision-irrelevant features Z (name, race, credential source, phrasing).

Invariance Principle. A model free of spurious reliance satisfies

    f(x) = f(O_z(x))  for all O_z ∈ Ω_Z,    (1)

where O_z is an intervention operator that modifies only Z features and Ω_Z is the set of valid interventions.

Flip Rate.
For a set of N vignettes, the flip rate measures how often the model violates invariance:

    FR(f, O_z) = (1/N) Σ_{i=1}^{N} 1[f(x_i) ≠ f(O_z(x_i))]    (2)

A flip rate significantly above the random baseline indicates systematic spurious reliance. We empirically estimate the baseline at ∼5% by computing flip rates on 300 control pairs (30 per domain) where base and swap differ only in punctuation or single-word synonyms (e.g., “significant” ↔ “substantial”); this captures parsing variance and stochastic model behavior at temperature ≤ 0.1.

3.2 Three Bias Intervention Types

We instantiate three intervention operators targeting distinct cognitive bias categories:

Demographic Intervention (O_dem). Swaps identity features: names (culturally coded for race/gender/ethnicity), stated demographics, and similar markers. Example: “James Smith” → “Jamal Washington.”

Authority Intervention (O_auth). Swaps credential and prestige markers: institutional affiliations, journal rankings, expert qualifications. Example: “A Harvard Medical School study found…” → “A community college study found…”

Framing Intervention (O_frame). Swaps semantically equivalent phrasings with different valence or anchoring: survival/mortality rates, gain/loss framing, positive/negative descriptions. Example: “95% of patients survived the procedure” → “5% of patients died during the procedure.”

These three types cover the major categories of spurious features that humans are known to be susceptible to: in-group/out-group bias, appeal-to-authority fallacy, and framing effects Tversky and Kahneman (1981); Cialdini (2001). For example, a finance vignette with identical fundamentals (revenue $285M, P/E 42, 28% growth) but different analyst credentials (“JP Morgan’s top analyst” vs. “a retail investor blog”) causes a free-form LLM to flip from Buy to Hold.
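The flip rate in Eq. (2) is straightforward to implement. A minimal Python sketch follows; the `decide` stub and the name-swap operator are illustrative stand-ins for an LLM call and the paper's curated per-domain swap lists, not the authors' released code:

```python
def flip_rate(decide, intervene, vignettes):
    # Eq. (2): fraction of vignettes whose decision changes under the intervention.
    flips = sum(decide(v) != decide(intervene(v)) for v in vignettes)
    return flips / len(vignettes)

# Hypothetical stand-ins: a real run would call an LLM for `decide`
# and apply curated per-domain swap lists for `intervene`.
def decide(vignette):
    # Toy rule that spuriously keys on the applicant's name.
    return "interview" if "James" in vignette else "reject"

def swap_name(vignette):
    return vignette.replace("James Smith", "Jamal Washington")

vignettes = [f"Candidate James Smith, {y} years of experience." for y in range(5)]
print(flip_rate(decide, swap_name, vignettes))  # 1.0: every decision flips
```

Because `decide` depends only on the name, every swap flips the verdict; an invariant model would score 0.0 here.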
Under structured decomposition, both versions extract fundamentals: “strong”, and the deterministic rubric returns Buy for both—no flip. A full worked example is provided in Appendix F.

3.3 Structured Decomposition

Beyond detection, ICE-Guard provides an architectural mitigation. Instead of prompting the LLM to make decisions directly (bias-prone), we decompose the pipeline:

1. Extract: The LLM extracts features into a structured JSON schema (e.g., credit_score, debt_ratio).
2. Decide: A deterministic Python rubric maps extracted features to decisions using domain-specific rules.

This separation reduces bias through three mechanisms: (a) the constrained extraction schema limits what the LLM can output, filtering out demographic/authority/framing signals; (b) the deterministic rubric is immune to prompt injection or implicit bias; (c) ICE feedback can identify which extracted features still leak bias, enabling targeted prompt patches.

3.4 Statistical Testing

Per-Area Testing. For each of the 30 application areas (10 domains × 3 bias types), we test H_0: FR ≤ 0.05 (consistent with parsing noise) using an exact binomial test. Wilson score confidence intervals Wilson (1927) provide uncertainty estimates.

ICE Randomization Test. To confirm that observed flips are due to targeted feature swaps (not random text perturbation), we generate M = 20 control perturbations per vignette—random word substitutions of equal token length in the same text region—and compute a win rate. Let d_i ∈ {0, 1} be the flip indicator for vignette i under the targeted swap, and c_i^(m) the indicator under random perturbation m:

    WR = (1/N) Σ_{i=1}^{N} 1[d_i > (1/M) Σ_{m=1}^{M} c_i^(m)]    (3)

A win rate > 55% indicates the targeted swap produces more flips than random perturbation, confirming systematic rather than stochastic sensitivity.

Multiple Testing Correction.
Across 330 tests (30 areas × 11 models), we apply Benjamini-Hochberg false discovery rate (FDR) correction at q = 0.05 Benjamini and Hochberg (1995).

4 Experimental Setup

4.1 Benchmark Construction

We construct 3,000 decision vignettes across 10 high-stakes domains (Table 1), with 100 vignettes per application area (domain × bias type). Each vignette consists of:

• A context describing a realistic decision scenario with decision-relevant details (qualifications, evidence, circumstances).
• A base version with one set of irrelevant features (e.g., name “James Smith,” credential from “MIT”).
• A swap version with substituted irrelevant features (e.g., name “Jamal Washington,” credential from “State University”).
• A decision task with forced-choice options (e.g., “Recommend for interview / Do not recommend”).

Diverse contexts were generated using a combination of LLM assistance and manual curation, with 100 unique scenarios per domain. Feature swap lists were manually curated for each domain. To verify ecological validity, we audited our extraction features against 10 open-source decision systems (COMPAS, Resume Matcher, openTriage, etc.), finding ≥ 80% feature overlap in all domains (Appendix D). As additional validation, we derive 300 criminal justice vignettes from ProPublica’s real COMPAS recidivism dataset Angwin et al. (2016) using actual charges, prior records, and ages; COMPAS-derived flip rates for GPT-5.2 exceed pooled synthetic rates across all bias types, suggesting our benchmark provides a conservative estimate (Appendix H).

Table 1: Ten high-stakes decision domains in the ICE-Guard benchmark. Each domain is tested with demographic, authority, and framing interventions.
Domain | Example Decision
Criminal Justice | Sentencing recommendation
Hiring | Interview advancement
Healthcare | Treatment priority
Lending | Loan approval
Education | Admission decision
Insurance | Claim approval
Legal | Case merit assessment
Content Moderation | Content removal
Finance | Investment recommendation
Customer Service | Escalation priority

4.2 Models

We evaluate 11 instruction-tuned LLMs from 8 model families, spanning frontier proprietary and open-weight systems (Table 2). All models are accessed via API with temperature ≤ 0.1 for reproducibility.

Table 2: Models evaluated, organized by overall flip rate.

Model | Family | Overall FR
GLM-5 | Zhipu AI | 1.1%
Claude Opus 4.6 | Anthropic | 1.3%
GPT-OSS-120B | OpenAI | 1.3%
GPT-5.2 | OpenAI | 3.7%
Gemini 3 Flash | Google | 4.0%
Gemini 2.5 Pro | Google | 5.3%
DeepSeek-V3.2 | DeepSeek | 5.3%
Kimi-K2.5 | Moonshot AI | 5.5%
Mistral-Small-24B | Mistral AI | 5.7%
Claude Sonnet 4.5 | Anthropic | 6.0%
Qwen3-32B | Alibaba | 7.7%

4.3 Evaluation Protocol

For each vignette, we:
1. Prompt the model with the base version, requesting a decision and brief rationale.
2. Prompt the model with the swap version using an identical prompt template.
3. Extract decisions via regex pattern matching (with manual verification of a 5% sample).
4. Record whether the decision flipped (base ≠ swap).

For structured evaluation, we target 50 vignettes per application area (1,500 per model), though API availability limits some models to n = 300–507 (see Table 5). (At n = 50 per area, we achieve 80% power to detect flip rates ≥ 10% via binomial test, α = 0.05.) For free-form, each model evaluates 10 vignettes per area (300 × 2 × 11 = 6,600 inferences); pooling yields ≈110 per cell.

5 Results

5.1 Flip Rates Across Domains

Figure 1 presents flip rates across all 30 application areas, pooled over 11 models (n = 10 per model per area, ≈110 per cell). Full confidence intervals are provided in Appendix C.
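The per-area test (H_0: FR ≤ 0.05, exact binomial) and the Wilson intervals quoted throughout can be reproduced in a few lines. A standard-library sketch, assuming pooled counts of roughly 25 flips in 110 pairs for finance authority bias (the function names are mine, not from the authors' released code):

```python
import math

def wilson_ci(flips, n, z=1.96):
    # 95% Wilson score interval for a flip rate (Wilson, 1927).
    p = flips / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def binom_tail(flips, n, p0=0.05):
    # Exact one-sided p-value for H0: FR <= p0, i.e. P[X >= flips] under p0.
    return sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(flips, n + 1))

lo, hi = wilson_ci(25, 110)
print(f"FR = {25/110:.1%}, 95% Wilson CI [{lo:.1%}, {hi:.1%}]")
print(f"p-value vs 5% noise baseline: {binom_tail(25, 110):.2e}")
```

A flip rate this far above the 5% baseline yields a vanishingly small p-value, which is why finance survives the paper's FDR correction.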
Figure 1: Flip rates (%) by domain and bias type, pooled across 11 models. The dashed line marks the 5% noise baseline. Finance is the most vulnerable domain (22.6% authority, 20.4% framing). Authority and framing bias consistently exceed demographic bias.

Finding 1: Authority and framing bias dominate demographics. Contrary to the research community’s focus on demographic bias, authority bias (mean 5.8%) and framing bias (mean 5.0%) substantially exceed demographic bias (mean 2.2%). This 2.5× ratio reverses the assumed hierarchy, consistent with recent findings that cognitive bias susceptibility in LLMs ranges from 17.8% to 57.3% across 45 models Knipper et al. (2025) and that source attribution alone triggers systematic evaluation shifts Germani and Spitale (2025). Modern instruction-tuned models have largely addressed overt demographic sensitivity through RLHF, but authority and framing susceptibility remain.

Finding 2: Bias concentrates in specific domains. Finance exhibits the most severe bias: 22.6% authority [15.7, 31.5] and 20.4% framing [13.9, 28.9] flip rates (95% Wilson CIs), meaning roughly 1 in 5 investment recommendations change based on analyst credentials or framing language alone. Hiring shows 10.9% authority bias (credential sensitivity). In contrast, education, customer service, and criminal justice show near-zero flip rates across all bias types, suggesting these domains receive targeted alignment attention. Indeed, all ten domains show demographic flip rates within the 5% noise baseline (≤ 4.7%), implying demographic-specific guardrails are broadly effective.

5.2 Model Comparison

Table 3 compares models across bias types.

Table 3: Flip rates (%) by model, across all 10 domains. †: n < 300 due to API timeouts (GLM-5: 268, Kimi: 289); all others n = 300.

Model | Dem. | Auth. | Frame. | Overall
GLM-5† | 0.0 | 3.4 | 0.0 | 1.1
Claude Opus 4.6 | 0.0 | 2.0 | 2.0 | 1.3
GPT-OSS-120B | 2.0 | 1.0 | 1.0 | 1.3
GPT-5.2 | 2.0 | 6.0 | 3.0 | 3.7
Gemini 3 Flash | 1.0 | 4.0 | 7.0 | 4.0
Gemini 2.5 Pro | 2.0 | 6.0 | 8.0 | 5.3
DeepSeek-V3.2 | 2.0 | 7.0 | 7.0 | 5.3
Kimi-K2.5† | 4.1 | 6.3 | 6.2 | 5.5
Mistral-Small-24B | 2.0 | 9.0 | 6.0 | 5.7
Claude Sonnet 4.5 | 4.0 | 10.0 | 4.0 | 6.0
Qwen3-32B | 5.0 | 8.0 | 10.0 | 7.7

Findings 3–4: Authority is the dominant residual. The top-3 models (GLM-5, Opus 4.6, GPT-OSS-120B) achieve ≤ 1.3%, approaching the noise floor, but other frontier models retain substantial bias (Sonnet 4.5: 6.0%, Gemini 2.5 Pro: 5.3%). Across all 11 models, the worst dimension is consistently authority or framing, never demographic—Sonnet 4.5 shows 4.0% demographic but 10.0% authority; Mistral-Small shows 2.0% vs. 9.0%. Alignment training appears to disproportionately address demographic sensitivity while cognitive biases receive less attention.

5.3 Structured Decomposition Results

Given that authority and framing bias exceed 20% in finance and 10% in hiring, effective mitigation is essential. Figure 2 compares free-form and structured evaluation across 9 models.

Figure 2: Free-form vs. structured flip rates with 95% Wilson CIs. Percentages show relative reduction. Structured decomposition reduces bias for 7 of 8 plotted models (GPT-5.2, which shows a +68% increase, is in Table 5).

Finding 5: Structured eval substantially reduces bias. For 7 of 9 models, structured decomposition reduces flip rates, with the largest reductions for Kimi-K2.5 (5.5% → 0.0%, 95% CI [0.0, 0.8]%, n = 473, −100%) and Gemini 2.5 Pro (5.3% → 0.2%, [0.0, 1.1]%, n = 507, −96%). Sonnet 4.5 achieves a 72% reduction (6.0% → 1.7%, [0.7, 3.8]%) and Qwen3-32B 68% (7.7% → 2.5%, [1.8, 3.4]%, n = 1,500). The median reduction across models is 49%.

Findings 6–7: Boundary cases. Claude Opus 4.6 (already at 1.3%) sees no benefit, suggesting residual bias is in extraction, not decision logic.
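The extract-then-rubric split of §3.3 can be sketched end to end. In the sketch below the extraction step is stubbed out with keyword checks (a real pipeline would prompt the LLM to fill the constrained JSON schema); the rubric mirrors the lending example in Appendix B:

```python
def extract_features(vignette: str) -> dict:
    # Stub for the LLM extraction call. A real system would prompt the model
    # to fill the constrained schema, which filters out names, credentials,
    # and framing language by construction.
    return {
        "credit_indicators": "good" if "credit score 780" in vignette else "fair",
        "employment_stability": "stable" if "stable job" in vignette else "unstable",
        "debt_to_income": "low" if "low debt" in vignette else "high",
    }

def lending_rubric(f: dict) -> str:
    # Deterministic rubric (Appendix B): immune to prompt-level bias.
    score = 0
    if f["credit_indicators"] == "good":
        score += 2
    if f["employment_stability"] == "stable":
        score += 1
    if f["debt_to_income"] == "low":
        score += 1
    return "approve" if score >= 2 else "deny"

base = "James Smith, credit score 780, stable job, low debt."
swap = base.replace("James Smith", "Jamal Washington")
decisions = [lending_rubric(extract_features(v)) for v in (base, swap)]
print(decisions)  # ['approve', 'approve'] — same verdict, no flip
```

Because the name never reaches the rubric, the demographic swap cannot flip the verdict; residual bias can only enter through the extraction step, which is exactly where the ICE feedback loop applies its prompt patches.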
GPT-5.2 shows increased bias (3.7% → 6.2%, [5.1, 7.5]%), likely due to inconsistent feature extraction—structured decomposition requires reliable extraction to be effective.

ICE feedback loop (case study). For Sonnet 4.5, ICE-Guard identifies 5 structured flips (1.7%) where fundamentals and context leak framing signals. Patching extraction prompts reduces flips to 4 of 300 (1.3%): 6.0% → 1.7% (structured) → 1.3% (patched), a 78% cumulative reduction.

Residual bias in structured evaluation. Education and finance retain the highest structured flip rates: education 16.3% framing [12.4, 21.2]% and 12.2% authority [8.8, 16.7]%; finance 10.3% authority [7.2, 14.5]% and 10.0% framing [7.0, 14.2]% (pooled across models). In contrast, hiring achieves 0.0% [0.0, 1.1]% across all bias types and healthcare < 1%. This suggests hybrid deployment: structured decomposition where criteria are objective, ICE-Guard monitoring where subjective.

Flip Validation. To verify that observed flips reflect genuine bias rather than stochastic noise, we classify all 140 free-form flips (across 11 models) using DeBERTa-v3 natural language inference (NLI) applied to base and swap rationales: if both mutually entail yet decisions differ, the flip is spurious; otherwise reasoned. Result: 80% of flips are reasoned and 20% spurious (authority 24%, framing 19%, demographic 13%). Frontier models (Opus 4.6, GPT-5.2, GLM-5; mean 2.0%) outperform mid-tier models (Qwen3-32B, Sonnet 4.5, Mistral-Small; mean 6.5%), yet authority bias remains the largest residual even for frontier models.

6 Discussion and Conclusion

We introduced ICE-Guard, a framework for detecting and mitigating spurious feature reliance in LLM decision-making across 10 high-stakes domains. By testing three bias types—demographic, authority, and framing—we find that authority and framing susceptibility are 2–3× more prevalent than demographic bias, reversing the assumed hierarchy.
Structured decomposition provides effective mitigation for most models (up to 100% reduction, for 7 of 9 models, though models with unreliable extraction may regress), and the ICE feedback loop enables iterative repair. These patterns parallel ICE-CoT faithfulness testing Anonymous (2026): models with lower operator sensitivity also show less spurious reliance, suggesting a shared mechanism.

Limitations. Our vignettes are synthetic (though ≥ 80% feature overlap with production systems is verified; Appendix D), English-only, and structured decomposition requires domain-specific rubric engineering. Additionally, structured decomposition trades model flexibility for consistency: deterministic rubrics cannot handle edge cases that fall outside predefined rules, making this approach best suited to domains with well-defined decision criteria. Our COMPAS validation compares a single model (GPT-5.2) against 11-model pooled synthetic rates, limiting the generalizability of the “conservative estimate” conclusion.

We release our 3,000-vignette benchmark, code, and results. (Code and data: https://anonymous.4open.science/r/ice-guard)

References

Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

Anonymous (2026) Anonymous. ICE: Intervention-consistent explanation evaluation with statistical grounding for LLMs. Under review at ACL 2026, 2026.

Arjovsky et al. (2019) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.

Bellamy et al.
(2019) Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilović, et al. AI Fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5), 2019.

Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995.

Bird et al. (2020) Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in AI. Microsoft Tech Report MSR-TR-2020-32, 2020.

Cialdini (2001) Robert B. Cialdini. Influence: Science and practice. Allyn and Bacon, 2001.

Dhamala et al. (2021) Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. BOLD: Dataset and metrics for measuring biases in open-ended language generation. In FAccT, 2021.

Feng et al. (2024) Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, and Hao Wang. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv preprint arXiv:2310.00566, 2024.

Gallegos et al. (2024) Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, 2024.

Germani and Spitale (2025) Federico Germani and Giovanni Spitale. Source framing triggers systematic bias in large language models. Science Advances, 11(45):eadz2924, 2025.

Jung et al. (2025) Dahyun Jung, Seungyoon Lee, Hyeonseok Moon, Chanjun Park, and Heuiseok Lim.
FLEX: A benchmark for evaluating robustness of fairness in large language models. In Findings of NAACL, 2025.

Knipper et al. (2025) R. Alexander Knipper, Charles S. Knipper, Kaiqi Zhang, Valerie Sims, Clint Bowers, and Santu Karmaker. The bias is in the details: An assessment of cognitive bias in LLMs. arXiv preprint arXiv:2509.22856, 2025.

Kusner et al. (2017) Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In NeurIPS, 2017.

Liu et al. (2024) Yanchen Liu, Srishti Gautam, Jiaqi Ma, and Himabindu Lakkaraju. Confronting LLMs with traditional ML: Rethinking the fairness of large language models in tabular classifications. In NAACL, 2024.

Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. In NeurIPS, 2023.

Mayilvaghanan et al. (2026) Kawin Mayilvaghanan, Siddhant Gupta, and Ayush Kumar. Counterfactual fairness evaluation of LLM-based contact center agent quality assurance system. arXiv preprint arXiv:2602.14970, 2026.

Nadeem et al. (2021) Moin Nadeem, Anna Bethke, and Siva Reddy. StereoSet: Measuring stereotypical bias in pretrained language models. In ACL, 2021.

Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In EMNLP, 2020.

Parrish et al. (2022) Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jessica Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Findings of ACL, 2022.

Parziale et al. (2026) Alessandra Parziale, Gianmario Voria, Valeria Pontillo, Gemma Catolino, Andrea De Lucia, and Fabio Palomba.
Toward systematic counterfactual fairness evaluation of large language models: The CAFFE framework. In ICSE, 2026.

Peters et al. (2016) Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B, 78(5):947–1012, 2016.

Poulain et al. (2024) Raphael Poulain, Hamed Fayyaz, and Rahmatollah Beheshti. Bias patterns in the application of LLMs for clinical decision support: A comprehensive study. arXiv preprint arXiv:2404.15149, 2024.

Tversky and Kahneman (1981) Amos Tversky and Daniel Kahneman. The framing of decisions and the psychology of choice. Science, 211(4481):453–458, 1981.

Wilson (1927) Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212, 1927.

Appendix A Vignette Examples

Criminal Justice – Demographic.
Base: “James Smith, a 28-year-old first-time offender, was arrested for possession of a controlled substance (2g of marijuana). He has a stable job as an accountant and a clean record. The arresting officer noted cooperative behavior.”
Swap: (Identical except “James Smith” → “Jamal Washington”)
Decision task: Recommend sentencing: (a) Diversion program (b) Probation (c) Incarceration

Finance – Authority.
Base: “Company: Silverline Analytics. Revenue: $285M, Gross Margin: 72%, P/E: 42, YoY Growth: 28%. JP Morgan’s top-rated sector analyst rates it a strong buy with a $450 target.”
Swap: “Company: Silverline Analytics. Revenue: $285M, Gross Margin: 72%, P/E: 42, YoY Growth: 28%. A retail investor blog rates it a strong buy with a $450 target.”
Decision task: Investment recommendation: (a) Buy (b) Hold (c) Sell

Appendix B Structured Decomposition Details

Each domain has a custom extraction schema.
Example for lending. Extract JSON:

  "credit_indicators": "good/fair/poor",
  "employment_stability": "stable/unstable",
  "debt_to_income": "low/medium/high",
  "loan_purpose": "description"

The deterministic rubric then applies:

  score = 0
  if credit == "good": score += 2
  if employment == "stable": score += 1
  if dti == "low": score += 1
  return "approve" if score >= 2 else "deny"

Appendix C Full Results Tables

Table 4 provides the full flip rate data underlying Figure 1.

Table 4: Flip rates (%) pooled across 11 models with 95% Wilson CIs. Rates above 5% are bolded in the paper.

Domain           | Demographic     | Authority         | Framing
Criminal Justice | 3.7 [1.4, 9.1]  | 2.8 [0.9, 7.8]    | 1.8 [0.5, 6.4]
Hiring           | 2.7 [0.9, 7.7]  | 10.9 [6.4, 18.1]  | 4.5 [2.0, 10.2]
Healthcare       | 0.9 [0.2, 5.1]  | 3.8 [1.5, 9.4]    | 3.8 [1.5, 9.3]
Lending          | 0.0 [0.0, 3.4]  | 3.6 [1.4, 9.0]    | 3.7 [1.4, 9.1]
Education        | 0.0 [0.0, 3.4]  | 0.9 [0.2, 5.0]    | 0.0 [0.0, 3.4]
Insurance        | 4.5 [2.0, 10.2] | 5.5 [2.5, 11.5]   | 5.6 [2.6, 11.7]
Legal            | 0.9 [0.2, 5.0]  | 0.9 [0.2, 5.1]    | 2.8 [0.9, 7.8]
Content Mod.     | 4.7 [2.0, 10.5] | 6.5 [3.2, 12.9]   | 6.6 [3.2, 13.0]
Finance          | 4.7 [2.0, 10.5] | 22.6 [15.7, 31.5] | 20.4 [13.9, 28.9]
Customer Service | 0.0 [0.0, 3.4]  | 0.0 [0.0, 3.4]    | 0.9 [0.2, 5.0]
Mean             | 2.2             | 5.8               | 5.0

Table 5 provides structured decomposition results with reduction percentages.

Table 5: Structured decomposition results with 95% Wilson CIs. Δ: relative reduction from free-form.

Model           | Free | Struct. | 95% CI     | Δ
Kimi-K2.5       | 5.5% | 0.0%    | [0.0, 0.8] | −100%
Gemini 2.5 Pro  | 5.3% | 0.2%    | [0.0, 1.1] | −96%
Sonnet 4.5      | 6.0% | 1.7%    | [0.7, 3.8] | −72%
  + ICE patch   | —    | 1.3%    | —          | −78%
Qwen3-32B       | 7.7% | 2.5%    | [1.8, 3.4] | −68%
DeepSeek-V3.2   | 5.3% | 2.7%    | [2.0, 3.7] | −49%
Gemini 3 Flash  | 4.0% | 2.4%    | [1.7, 3.3] | −40%
Mistral-Small   | 5.7% | 4.1%    | [3.2, 5.2] | −28%
Opus 4.6        | 1.3% | 1.3%    | [0.5, 3.4] | 0%
GPT-5.2         | 3.7% | 6.2%    | [5.1, 7.5] | +68%

Complete per-model, per-area results are provided in the supplementary material.

Appendix D Ecological Validity

To assess whether our synthetic vignettes capture real-world decision complexity, we compare the features used in our structured extraction schemas (Table 6) against decision features in deployed open-source systems. For each domain, we identify a representative tool and list its core decision features alongside ours.

Table 6: Feature overlap between ICE-Guard extraction schemas and open-source decision systems. "Our Features" are the structured JSON fields extracted by the LLM; "System Features" are the inputs used by the production tool. Overlap is the fraction of our fields with a direct analogue.

Domain        | System                    | Our Features                                            | System Features                                              | Overlap
Crim. Justice | COMPAS (Liu et al., 2024) | charge_type, severity, prior_record, evidence_strength  | charge_degree, priors_count, decile_score, age_cat           | 100%
Hiring        | Resume Matcher            | years_exp, skills, education, role_fit                  | keyword match, vector similarity, skill tags, pipeline score | 100%
Healthcare    | openTriage                | condition, severity, risk_factors, contraindications    | vitals, diagnoses, lab values, risk scores                   | 80%
Lending       | Fair Credit Scoring       | credit_indicators, DTI, employment, collateral          | income, credit history, DTI ratio, default prob.             | 100%
Education     | Submitty                  | GPA, test_scores, extracurriculars, program_fit         | rubric scores, test pass/fail, submission quality            | 80%
Insurance     | openIMIS                  | claim_type, risk_level, documentation, history          | eligibility rules, premium factors, claims history           | 100%
Content Mod.  | Detoxify                  | content_type, violation, severity, context              | toxicity, severe_toxicity, identity_attack, threat           | 80%
Legal         | Open Sentencing           | evidence_strength, precedent, statute_of_lim.           | offense_category, charge_class, commitment_term              | 80%
Finance       | Marble                    | risk_level, return_potential, fundamentals              | transaction amount, velocity, rule-weighted score            | 80%
Cust. Service | osTicket                  | issue_type, severity, escalation_warranted              | priority, SLA plan, ticket routing rules                     | 100%

All 10 domains achieve ≥ 80% feature overlap. The primary gaps are domain-specific features that our vignettes intentionally omit (e.g., continuous vital signs in healthcare, transaction velocity in finance) because they require structured numeric data rather than text descriptions. This audit confirms that our benchmark tests the same decision-relevant features that production systems rely on.

Appendix E Statistical Details

Wilson Confidence Intervals. For a proportion \hat{p} based on n observations, the Wilson score interval is

  \frac{\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}    (4)

where z = 1.96 for 95% confidence.

ICE Randomization Test. For each vignette, we generate M = 20 random feature permutations and compute the null distribution of flip rates. The win rate (WR) counts how often the targeted intervention produces more flips than the average random baseline. We test H_0: WR ≤ 0.5 using a one-sample binomial test.

BH-FDR Correction. With K = 330 total tests (30 areas × 11 models), we sort p-values p_{(1)} ≤ ⋯ ≤ p_{(K)} and reject H_{(i)} for all i ≤ k*, where k* = max{ i : p_{(i)} ≤ (i/K) · q } with q = 0.05.

Appendix F Worked Example

A finance vignette presents identical fundamentals (revenue $285M, P/E 42, 28% growth). The base version attributes the buy rating to "JP Morgan's top analyst"; the swap attributes it to "a retail investor blog." The LLM recommends Buy for the base but Hold for the swap: a decision flip under the authority intervention.
Under structured decomposition, both versions extract the same features (fundamentals: 'strong', pe_ratio: 'high'), and the deterministic rubric returns Buy for both, so no flip occurs. This illustrates how structured decomposition prevents authority signals from reaching the decision.

Appendix G Reproducibility Details

Table 7 summarizes the evaluation setup. All experiments used temperature ≤ 0.1 for near-greedy decoding. Exact prompts are provided in the supplementary code repository.

Table 7: Model access details. All experiments conducted February 19–24, 2026.

Model             | Provider    | Temp. | Max Tok.
Qwen3-32B         | Featherless | 0.1   | 2000
Mistral-Small-24B | Featherless | 0.1   | 500
Kimi-K2.5         | Featherless | 0.1   | 2000
DeepSeek-V3.2     | Featherless | 0.1   | 2000
GPT-OSS-120B      | Featherless | 0.1   | 500
GLM-5             | Featherless | 0.1   | 500
Gemini 2.5 Pro    | Google API  | 0.1   | 4096
Gemini 3 Flash    | Google API  | 0.1   | 1024
GPT-5.2           | LinkAPI     | 0.1   | 500
Claude Opus 4.6   | LinkAPI     | 0.1   | 500
Claude Sonnet 4.5 | LinkAPI     | 0.1   | 500

Appendix H COMPAS Real-World Validation

To validate that our synthetic vignettes produce bias patterns consistent with real-world data, we derive 300 criminal justice vignettes from ProPublica's COMPAS recidivism dataset (Angwin et al., 2016): 200 with demographic swaps (name/race), 50 with authority swaps (risk assessment source), and 50 with framing swaps (recidivism vs. rehabilitation framing). Each vignette uses actual charges, prior offense counts, and defendant ages from the COMPAS data. Table 8 compares COMPAS-derived flip rates (GPT-5.2, n = 300) against our synthetic criminal justice flip rates (pooled across 11 models from Table 4).

Table 8: Flip rates (%) on COMPAS-derived vignettes (GPT-5.2) vs. synthetic criminal justice (pooled, 11 models). COMPAS uses n = 200 demographic and n = 50 authority/framing pairs.

Bias Type   | COMPAS | Synth. (pooled) | 95% CI (COMPAS)
Demographic | 7.5%   | 3.7%            | [4.6, 12.0]
Authority   | 14.0%  | 2.8%            | [7.0, 26.2]
Framing     | 16.0%  | 1.8%            | [8.3, 28.5]
Overall     | 10.0%  | 2.8%            | [7.1, 13.9]

COMPAS-derived vignettes produce higher flip rates than synthetic ones across all bias types. Two factors likely contribute: (a) real criminal case descriptions contain more complex social signals than synthetic vignettes; (b) the COMPAS comparison tests only GPT-5.2, while synthetic rates are pooled across 11 models, including low-bias models. This suggests our synthetic benchmark provides a conservative estimate of real-world bias for typical models.
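The Appendix B lending rubric is given as pseudocode; under the obvious reading it can be made runnable as below. The feature dictionaries are illustrative values, not taken from the paper's data; the point is that the verdict is a pure function of extracted features, so a swap that leaves the extraction unchanged cannot flip the decision.

```python
def lending_rubric(features: dict[str, str]) -> str:
    """Deterministic lending rubric from Appendix B: the decision depends
    only on extracted features, so name/authority/framing cues that do not
    change the extraction cannot change the verdict."""
    score = 0
    if features["credit_indicators"] == "good":
        score += 2
    if features["employment_stability"] == "stable":
        score += 1
    if features["debt_to_income"] == "low":
        score += 1
    return "approve" if score >= 2 else "deny"

# Illustrative (hypothetical) extraction: a base vignette and its demographic
# swap extract to identical features, hence receive identical verdicts.
base = {"credit_indicators": "good", "employment_stability": "stable",
        "debt_to_income": "medium"}
swap = dict(base)  # identical extracted features after a name swap
assert lending_rubric(base) == lending_rubric(swap) == "approve"
```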
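For readers reproducing the confidence intervals, the Wilson score interval of Appendix E (Eq. 4) takes only a few lines. This is a minimal sketch, not the authors' code:

```python
import math

def wilson_ci(flips: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p_hat = flips / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 15 flips in the 200 COMPAS demographic pairs (7.5%):
lo, hi = wilson_ci(15, 200)  # ≈ (0.046, 0.120), i.e. the [4.6, 12.0] CI in Table 8
```

Note that the interval has a nonzero upper bound even at zero observed flips, which is why the 0.0% cells in Table 4 carry a [0.0, 3.4] interval.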
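The BH-FDR step-up procedure of Appendix E (reject H_{(i)} for all i ≤ k*) can likewise be sketched in a short function; this is an illustrative implementation, not the paper's code:

```python
def bh_fdr_reject(pvals: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up: sort p-values ascending, find
    k* = max{i : p_(i) <= (i / K) * q}, and reject the k* smallest."""
    K = len(pvals)
    order = sorted(range(K), key=lambda i: pvals[i])
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / K * q:
            k_star = rank
    reject = [False] * K
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k_star
    return reject

# With K = 4 tests at q = 0.05, only the smallest p-value clears its threshold:
flags = bh_fdr_reject([0.001, 0.04, 0.03, 0.9])  # -> [True, False, False, False]
```

In the paper's setting the input would be the K = 330 per-area, per-model p-values from the randomization test.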