# Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation · Igor Santos-Grueiro | 2026 | Deception & Failure | theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation | E4 / R2 (94%) | 1 |
| Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning · Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (95%) | - |
| Capability-Oriented Training Induced Alignment Risk · Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E4 / R3 (94%) | 1 |
| CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation · Zhao Tong, Shu Wu, Xingcheng Xu, Yiping Zhang | 2026 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R2 (96%) | - |
| Emergent Misalignment is Easy, Narrow Misalignment is Hard · Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (94%) | - |
| Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy · Sarthak Sattigeri | 2026 | Deception & Failure | empirical, ai-safety, deception-failure | E6 / R3 (95%) | - |
| How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments · Fuxin Wang, Yiqiao Zhong, Amr Alazali | 2026 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R3 (92%) | - |
| How RLHF Amplifies Sycophancy · Gerdus Benade, Ariel D. Procaccia, Itai Shapira | 2026 | Deception & Failure | theoretical, ai-safety, deception-failure | E5 / R3 (94%) | - |
| Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models · Chang Gao, Chujie Zheng, Shixuan Liu, Binghai Wang | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (96%) | 1 |
| Subliminal Effects in Your Data: A General Mechanism via Log-Linearity · Ishaq Aden-Ali, Abhishek Shetty, Nika Haghtalab, Ankur Moitra | 2026 | Deception & Failure | theoretical, ai-safety, deception-failure | E4 / R3 (94%) | - |
| The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers · Yonatan Zunger, Giorgio Severi, Blake Bullwinkel, Keegan Hines | 2026 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R3 (93%) | - |
| A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models · Soheil Feizi, Sriram Balasubramanian, Samyadeep Basu | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R3 (92%) | 4 |
| A Two-Step, Multidimensional Account of Deception in Language Models · Leonard Dung | 2025 | Deception & Failure | theoretical, ai-safety, deception-failure | E5 / R3 (95%) | - |
| Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers · Santhosh Kumar Ravindran | 2025 | Deception & Failure | theoretical, ai-safety, adversarial-robustness, deception-failure | E5 / R3 (92%) | 1 |
| Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking · Ming Jin, Qifan Wang, Mohammad Beigi, Junshan Zhang | 2025 | Deception & Failure | empirical, ai-safety, adversarial-robustness, deception-failure | E5 / R3 (97%) | - |
| Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria · Shourya Mishra, Raghav Maheshwari, Amitava Das, Kanishk Rai | 2025 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E7 / R3 (98%) | - |
| Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails · Cihang Xie, Siwei Han, Linjun Zhang, Xinyu Ye | 2025 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E6 / R4 (95%) | 1 |
| Among Us: A Sandbox for Measuring and Detecting Agentic Deception · Adrià Garriga-Alonso, Satvik Golechha | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E5 / R3 (95%) | 9 |
| Are DeepSeek R1 And Other Reasoning Models More Faithful? · Owain Evans, James Chua | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E6 / R3 (95%) | 40 |
| Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity · Hongjun An, Yiliang Song, Xuelong Li, Chi Zhang | 2025 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (95%) | - |
| Are Your Agents Upward Deceivers? · Xia Hu, Juntao Dai, Yi R. Fung, Qingyu Liu | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E4 / R2 (94%) | 2 |
| Benchmarking Deception Probes via Black-to-White Performance Boosts · Carlo Leonardo Attubato, Stefan Heimersheim, Aviel Parrack | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E5 / R3 (93%) | 8 |
| Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis · Rebecca Qian, Darshan Deshpande, Anand Kannappan | 2025 | Deception & Failure | ai-safety, deception-failure, safety-evaluation, benchmark | E5 / R3 (94%) | - |
| Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts · Bingsheng He, Mingzhe Du, See-Kiong Ng, Zhaomin Wu | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E4 / R3 (93%) | 3 |
| Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment · Lizhu Zhang, Zhaorun Chen, Hao Ma, Xiangjun Fan | 2025 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (96%) | 25 |
| But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors · Ethan Nguyen, Leon Eshuijs, Alan McBeth, Archie Chaudhury | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R3 (95%) | - |
| Bypassing the Rationale: Causal Auditing of Implicit Reasoning in Language Models · Anish Sathyanarayanan, Aarush Rathore, Aditya Nagarsekar | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E7 / R3 (93%) | - |
| Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought · Jiachen Zhao, Yiyou Sun, Dawn Song, Weiyan Shi | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R3 (93%) | 2 |
| Can LLMs Lie? Investigation beyond Hallucination · Shantanu Jaiswal, Mengning Wu, Deepak Pathak, Haoran Huan | 2025 | Deception & Failure | empirical, ai-safety, deception-failure, interpretability | E5 / R3 (94%) | 1 |
| Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability · Artur Zolkowski, Erik Jenner, Florian Tramèr, Wen Xing | 2025 | Deception & Failure | empirical, ai-safety, deception-failure | E4 / R3 (93%) | 3 |
Showing 30 of 214 papers (page 1).