Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.
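
To illustrate how filtering over records like the cards below might work, here is a minimal, hypothetical sketch. The `Paper` dataclass, its field names, and the `filter_papers` helper are illustrative assumptions only, not PaperIntel's actual data model or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paper:
    """Hypothetical record mirroring the fields shown on each card below."""
    title: str
    authors: list[str]
    year: int
    area: str
    tags: list[str]
    citations: Optional[int] = None   # "-" in the listing maps to None
    signal: str = ""                  # e.g. "E5 / R4 (96%)"

def filter_papers(papers: list[Paper], *, year: Optional[int] = None,
                  area: Optional[str] = None, tag: Optional[str] = None) -> list[Paper]:
    """Keep papers matching every supplied criterion (simple in-memory filtering)."""
    results = []
    for p in papers:
        if year is not None and p.year != year:
            continue
        if area is not None and p.area != area:
            continue
        if tag is not None and tag not in p.tags:
            continue
        results.append(p)
    return results

# Example: 2026 agent-safety papers, using one card from the listing below.
corpus = [
    Paper("RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments",
          ["Zhiqiang Lin", "Zeyi Liao", "Eric Fosler-Lussier", "Yu Su"],
          2026, "Agent Safety",
          ["agent-safety", "ai-safety", "adversarial-robustness", "benchmark"],
          citations=12, signal="E6 / R3 (95%)"),
]
print([p.title for p in filter_papers(corpus, year=2026, area="Agent Safety")])
```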

PaperIntel
Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data

Dan Alistarh, Eugenia Iofinova

Year: 2026 · Area: Model Editing · Citations: -

Tags: ai-safety, model-editing, benchmark

E5 / R4 (96%)
Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Nesreen Ahmed, Mahantesh Halappanavar, Haoyu Han, Yue Zhao

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, benchmark

E5 / R3 (96%)
CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Qiankun Li, Kun Wang, Shilinlu Yan, Zhenhong Zhou

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, adversarial-robustness, safety-evaluation, benchmark

E6 / R5 (96%)
Do Large Language Models Reflect Demographic Pluralism in Safety?

Rafiq Ali, Sushant Kumar Ray, Usman Naseem, Abdullah Mohammad

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, safety-evaluation, benchmark

E6 / R4 (95%)
From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Sen Su, Fanyu Meng, Zhenhong Zhou, Zhengshuo Gong

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: alignment-training, ai-safety, safety-evaluation, benchmark

E5 / R3 (94%)
LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Xia Hu, Ge Gao, Chujia Hu, Dongrui Liu

Year: 2026 · Area: Agent Safety · Citations: 1

Tags: agent-safety, ai-safety, adversarial-robustness, benchmark

E4 / R3 (95%)
Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms

Hardik Sharma, Vrijendra Singh, Soham Wasmatkar, Vaibhav Shukla

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: alignment-training, ai-safety, safety-evaluation, benchmark

E7 / R4 (96%)
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

Zhiqiang Lin, Zeyi Liao, Eric Fosler-Lussier, Yu Su

Year: 2026 · Area: Agent Safety · Citations: 12

Tags: agent-safety, ai-safety, adversarial-robustness, benchmark

E6 / R3 (95%)
TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Samanvay Vajpayee, Sirisha Rambhatla, Tom Tseng, Samuel Simko

Year: 2026 · Area: Safety Evaluation · Citations: 1

Tags: ai-safety, safety-evaluation, benchmark

E5 / R4 (98%)
The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents

Qian Wang, Yuan Yang, Ziyao Liu, Kwok-Yan Lam

Year: 2026 · Area: Agent Safety · Citations: -

Tags: alignment-training, agent-safety, ai-safety, benchmark

E4 / R3 (96%)
When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Xiangyu Shi, Fengran Mo, Su Yao, Youwei Liao

Year: 2026 · Area: Multimodal Safety · Citations: -

Tags: alignment-training, ai-safety, multimodal-safety, benchmark

E4 / R2 (94%)
A Benchmark for Scalable Oversight Protocols

Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery

Year: 2025 · Area: Scalable Oversight · Citations: 1

Tags: scalable-oversight, ai-safety, benchmark

E6 / R3 (95%)
A Framework for Evaluating Emerging Cyberattack Capabilities of AI

Raluca Ada Popa, Lihao Liang, Anna Wang, Mikel Rodriguez

Year: 2025 · Area: Safety Evaluation · Citations: 24

Tags: ai-safety, safety-evaluation, benchmark

E5 / R3 (95%)
A Safety and Security-Centered Evaluation Framework for Large Language Models via Multi-Model Judgment

Changsheng Wan, Hong Zhong, Yunhao Xia, Weichen Lu

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, safety-evaluation, benchmark

E5 / R3 (97%)
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions

Xianglong Liu, Dacheng Tao, Jiakai Wang, Siyuan Liang

Year: 2025 · Area: Agent Safety · Citations: 13

Tags: agent-safety, ai-safety, adversarial-robustness, benchmark

E5 / R4 (99%)
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs

Kieran Fraser, Prasanna Sattigeri, Giulio Zizzo, Beat Buesser

Year: 2025 · Area: Adversarial Robustness · Citations: 12

Tags: ai-safety, adversarial-robustness, safety-evaluation, benchmark

E7 / R3 (97%)
AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents

Arman Zharmagambetov, Maya Pavlova, Kamalika Chaudhuri, Chuan Guo

Year: 2025 · Area: Agent Safety · Citations: 31

Tags: agent-safety, ai-safety, safety-evaluation, benchmark

E6 / R3 (97%)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Mateusz Dziemian, Zico Kolter, Jerome Wynne, Xander Davies

Year: 2025 · Area: Adversarial Robustness · Citations: 148

Tags: ai-safety, adversarial-robustness, benchmark

E5 / R3 (95%)
AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems

Faouzi El Yagoubi, Ranwa Al Mallah, Godwin Badu-Marfo

Year: 2025 · Area: Agent Safety · Citations: -

Tags: agent-safety, ai-safety, benchmark

E5 / R3 (95%)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

Guillermo Bosch, Akshat Naik, Patrick Quinn, Edward James Young

Year: 2025 · Area: Agent Safety · Citations: 12

Tags: alignment-training, agent-safety, ai-safety, benchmark

E5 / R3 (94%)
Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Adrià Garriga-Alonso, Satvik Golechha

Year: 2025 · Area: Deception & Failure · Citations: 9

Tags: ai-safety, deception-failure, benchmark

E5 / R3 (95%)
Are Your Agents Upward Deceivers?

Xia Hu, Juntao Dai, Yi R. Fung, Qingyu Liu

Year: 2025 · Area: Deception & Failure · Citations: 2

Tags: ai-safety, deception-failure, benchmark

E4 / R2 (94%)
AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Wenxuan Wang, Shu Yang, Hanqi Yan, Jingyu Hu

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, safety-evaluation, benchmark

E4 / R3 (95%)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Christopher Potts, Aryaman Arora, Dan Jurafsky, Christopher D. Manning

Year: 2025 · Area: Representation Analysis · Citations: 123

Tags: representation-analysis, ai-safety, benchmark

E5 / R3 (96%)
BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap

Pratiksha Thaker, Yiwei Fu, Steven Wu, Neil Kale

Year: 2025 · Area: Safety Evaluation · Citations: 6

Tags: ai-safety, safety-evaluation, benchmark

E5 / R3 (93%)
Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images

Aditya Kumar, Franziska Boenisch, Tom Blanchard, Adam Dziedzic

Year: 2025 · Area: Multimodal Safety · Citations: -

Tags: ai-safety, multimodal-safety, benchmark

E5 / R3 (94%)
BeaverTails-IT: Towards a Safety Benchmark for Evaluating Italian Large Language Models

Elisabetta Fersini, Claudio Stamile, Francesca Pulerà, Alberto Sormani

Year: 2025 · Area: Safety Evaluation · Citations: 1

Tags: ai-safety, safety-evaluation, benchmark

E8 / R3 (97%)
Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Shanu Sushmita, Mohammad Zandsalimy, Youjia Zheng

Year: 2025 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, safety-evaluation, benchmark

E5 / R3 (94%)
Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs

Daeun Moon, Hyunsoo Yoon, Sewon Kim, Seungwoo Shin

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, safety-evaluation, benchmark

E5 / R3 (98%)
Benchmarking Deception Probes via Black-to-White Performance Boosts

Carlo Leonardo Attubato, Stefan Heimersheim, Aviel Parrack

Year: 2025 · Area: Deception & Failure · Citations: 8

Tags: ai-safety, deception-failure, benchmark

E5 / R3 (93%)

Showing 30 of 335 papers on page 1.
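
For reference, a minimal sketch of the implied pagination arithmetic, assuming a fixed page size of 30 (an assumption inferred from the count above):

```python
import math

TOTAL_PAPERS = 335  # total matching papers reported above
PAGE_SIZE = 30      # assumed fixed page size, inferred from "Showing 30 ... on page 1"

total_pages = math.ceil(TOTAL_PAPERS / PAGE_SIZE)  # 12 pages at this page size
print(total_pages)
```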