Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel

Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Igor Santos-Grueiro

Year: 2026 · Area: Deception & Failure · Citations: 1

Tags: theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation

E4 / R2 (94%)

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (95%)

Capability-Oriented Training Induced Alignment Risk

Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer

Year: 2026 · Area: Deception & Failure · Citations: 1

Tags: empirical, alignment-training, ai-safety, deception-failure

E4 / R3 (94%)

CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Zhao Tong, Shu Wu, Xingcheng Xu, Yiping Zhang

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E5 / R2 (96%)

Emergent Misalignment is Easy, Narrow Misalignment is Hard

Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (94%)

Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy

Sarthak Sattigeri

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E6 / R3 (95%)

How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments

Fuxin Wang, Yiqiao Zhong, Amr Alazali

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E5 / R3 (92%)

How RLHF Amplifies Sycophancy

Gerdus Benade, Ariel D. Procaccia, Itai Shapira

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

E5 / R3 (94%)

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Chang Gao, Chujie Zheng, Shixuan Liu, Binghai Wang

Year: 2026 · Area: Deception & Failure · Citations: 1

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (96%)

Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Ishaq Aden-Ali, Abhishek Shetty, Nika Haghtalab, Ankur Moitra

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

E4 / R3 (94%)

The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers

Yonatan Zunger, Giorgio Severi, Blake Bullwinkel, Keegan Hines

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E5 / R3 (93%)

A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models

Soheil Feizi, Sriram Balasubramanian, Samyadeep Basu

Year: 2025 · Area: Deception & Failure · Citations: 4

Tags: empirical, ai-safety, deception-failure

E5 / R3 (92%)

A Two-Step, Multidimensional Account of Deception in Language Models

Leonard Dung

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

E5 / R3 (95%)

Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers

Santhosh Kumar Ravindran

Year: 2025 · Area: Deception & Failure · Citations: 1

Tags: theoretical, ai-safety, adversarial-robustness, deception-failure

E5 / R3 (92%)

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Ming Jin, Qifan Wang, Mohammad Beigi, Junshan Zhang

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, deception-failure

E5 / R3 (97%)

Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria

Shourya Mishra, Raghav Maheshwari, Amitava Das, Kanishk Rai

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E7 / R3 (98%)

Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails

Cihang Xie, Siwei Han, Linjun Zhang, Xinyu Ye

Year: 2025 · Area: Deception & Failure · Citations: 1

Tags: empirical, alignment-training, ai-safety, deception-failure

E6 / R4 (95%)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Adrià Garriga-Alonso, Satvik Golechha

Year: 2025 · Area: Deception & Failure · Citations: 9

Tags: ai-safety, deception-failure, benchmark

E5 / R3 (95%)

Are DeepSeek R1 And Other Reasoning Models More Faithful?

Owain Evans, James Chua

Year: 2025 · Area: Deception & Failure · Citations: 40

Tags: empirical, ai-safety, deception-failure

E6 / R3 (95%)

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An, Yiliang Song, Xuelong Li, Chi Zhang

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (95%)

Are Your Agents Upward Deceivers?

Xia Hu, Juntao Dai, Yi R. Fung, Qingyu Liu

Year: 2025 · Area: Deception & Failure · Citations: 2

Tags: ai-safety, deception-failure, benchmark

E4 / R2 (94%)

Benchmarking Deception Probes via Black-to-White Performance Boosts

Carlo Leonardo Attubato, Stefan Heimersheim, Aviel Parrack

Year: 2025 · Area: Deception & Failure · Citations: 8

Tags: ai-safety, deception-failure, benchmark

E5 / R3 (93%)

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

Rebecca Qian, Darshan Deshpande, Anand Kannappan

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: ai-safety, deception-failure, safety-evaluation, benchmark

E5 / R3 (94%)

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Bingsheng He, Mingzhe Du, See-Kiong Ng, Zhaomin Wu

Year: 2025 · Area: Deception & Failure · Citations: 3

Tags: empirical, ai-safety, deception-failure

E4 / R3 (93%)

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Lizhu Zhang, Zhaorun Chen, Hao Ma, Xiangjun Fan

Year: 2025 · Area: Deception & Failure · Citations: 25

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (96%)

But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

Ethan Nguyen, Leon Eshuijs, Alan McBeth, Archie Chaudhury

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E5 / R3 (95%)

Bypassing the Rationale: Causal Auditing of Implicit Reasoning in Language Models

Anish Sathyanarayanan, Aarush Rathore, Aditya Nagarsekar

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E7 / R3 (93%)

Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

Jiachen Zhao, Yiyou Sun, Dawn Song, Weiyan Shi

Year: 2025 · Area: Deception & Failure · Citations: 2

Tags: empirical, ai-safety, deception-failure

E5 / R3 (93%)

Can LLMs Lie? Investigation beyond Hallucination

Shantanu Jaiswal, Mengning Wu, Deepak Pathak, Haoran Huan

Year: 2025 · Area: Deception & Failure · Citations: 1

Tags: empirical, ai-safety, deception-failure, interpretability

E5 / R3 (94%)

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Erik Jenner, Florian Tramèr, Wen Xing

Year: 2025 · Area: Deception & Failure · Citations: 3

Tags: empirical, ai-safety, deception-failure

E4 / R3 (93%)

Showing 30 of 214 papers on page 1.