Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Behemoth: Benchmarking Unlearning in LLMs Using Fully Synthetic Data (Dan Alistarh, Eugenia Iofinova) | 2026 | Model Editing | ai-safety, model-editing, benchmark | E5 / R4 (96%) | - |
| Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation (Nesreen Ahmed, Mahantesh Halappanavar, Haoyu Han, Yue Zhao) | 2026 | Adversarial Robustness | ai-safety, adversarial-robustness, benchmark | E5 / R3 (96%) | - |
| CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns (Qiankun Li, Kun Wang, Shilinlu Yan, Zhenhong Zhou) | 2026 | Safety Evaluation | ai-safety, adversarial-robustness, safety-evaluation, benchmark | E6 / R5 (96%) | - |
| Do Large Language Models Reflect Demographic Pluralism in Safety? (Rafiq Ali, Sushant Kumar Ray, Usman Naseem, Abdullah Mohammad) | 2026 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E6 / R4 (95%) | - |
| From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents (Sen Su, Fanyu Meng, Zhenhong Zhou, Zhengshuo Gong) | 2026 | Safety Evaluation | alignment-training, ai-safety, safety-evaluation, benchmark | E5 / R3 (94%) | - |
| LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios (Xia Hu, Ge Gao, Chujia Hu, Dongrui Liu) | 2026 | Agent Safety | agent-safety, ai-safety, adversarial-robustness, benchmark | E4 / R3 (95%) | 1 |
| Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms (Hardik Sharma, Vrijendra Singh, Soham Wasmatkar, Vaibhav Shukla) | 2026 | Safety Evaluation | alignment-training, ai-safety, safety-evaluation, benchmark | E7 / R4 (96%) | - |
| RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments (Zhiqiang Lin, Zeyi Liao, Eric Fosler-Lussier, Yu Su) | 2026 | Agent Safety | agent-safety, ai-safety, adversarial-robustness, benchmark | E6 / R3 (95%) | 12 |
| TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering (Samanvay Vajpayee, Sirisha Rambhatla, Tom Tseng, Samuel Simko) | 2026 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E5 / R4 (98%) | 1 |
| The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents (Qian Wang, Yuan Yang, Ziyao Liu, Kwok-Yan Lam) | 2026 | Agent Safety | alignment-training, agent-safety, ai-safety, benchmark | E4 / R3 (96%) | - |
| When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life (Xiangyu Shi, Fengran Mo, Su Yao, Youwei Liao) | 2026 | Multimodal Safety | alignment-training, ai-safety, multimodal-safety, benchmark | E4 / R2 (94%) | - |
| A Benchmark for Scalable Oversight Protocols (Abhimanyu Pallavi Sudhir, Jackson Kaunismaa, Arjun Panickssery) | 2025 | Scalable Oversight | scalable-oversight, ai-safety, benchmark | E6 / R3 (95%) | 1 |
| A Framework for Evaluating Emerging Cyberattack Capabilities of AI (Raluca Ada Popa, Lihao Liang, Anna Wang, Mikel Rodriguez) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E5 / R3 (95%) | 24 |
| A Safety and Security-Centered Evaluation Framework for Large Language Models via Multi-Model Judgment (Changsheng Wan, Hong Zhong, Yunhao Xia, Weichen Lu) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E5 / R3 (97%) | - |
| AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions (Xianglong Liu, Dacheng Tao, Jiakai Wang, Siyuan Liang) | 2025 | Agent Safety | agent-safety, ai-safety, adversarial-robustness, benchmark | E5 / R4 (99%) | 13 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs (Kieran Fraser, Prasanna Sattigeri, Giulio Zizzo, Beat Buesser) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, safety-evaluation, benchmark | E7 / R3 (97%) | 12 |
| AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents (Arman Zharmagambetov, Maya Pavlova, Kamalika Chaudhuri, Chuan Guo) | 2025 | Agent Safety | agent-safety, ai-safety, safety-evaluation, benchmark | E6 / R3 (97%) | 31 |
| AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Mateusz Dziemian, Zico Kolter, Jerome Wynne, Xander Davies) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, benchmark | E5 / R3 (95%) | 148 |
| AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems (Faouzi El Yagoubi, Ranwa Al Mallah, Godwin Badu-Marfo) | 2025 | Agent Safety | agent-safety, ai-safety, benchmark | E5 / R3 (95%) | - |
| AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents (Guillermo Bosch, Akshat Naik, Patrick Quinn, Edward James Young) | 2025 | Agent Safety | alignment-training, agent-safety, ai-safety, benchmark | E5 / R3 (94%) | 12 |
| Among Us: A Sandbox for Measuring and Detecting Agentic Deception (Adrià Garriga-Alonso, Satvik Golechha) | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E5 / R3 (95%) | 9 |
| Are Your Agents Upward Deceivers? (Xia Hu, Juntao Dai, Yi R. Fung, Qingyu Liu) | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E4 / R2 (94%) | 2 |
| AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor (Wenxuan Wang, Shu Yang, Hanqi Yan, Jingyu Hu) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E4 / R3 (95%) | - |
| AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (Christopher Potts, Aryaman Arora, Dan Jurafsky, Christopher D. Manning) | 2025 | Representation Analysis | representation-analysis, ai-safety, benchmark | E5 / R3 (96%) | 123 |
| BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap (Pratiksha Thaker, Yiwei Fu, Steven Wu, Neil Kale) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E5 / R3 (93%) | 6 |
| Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images (Aditya Kumar, Franziska Boenisch, Tom Blanchard, Adam Dziedzic) | 2025 | Multimodal Safety | ai-safety, multimodal-safety, benchmark | E5 / R3 (94%) | - |
| BeaverTails-IT: Towards a Safety Benchmark for Evaluating Italian Large Language Models (Elisabetta Fersini, Claudio Stamile, Francesca Pulerà, Alberto Sormani) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E8 / R3 (97%) | 1 |
| Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models (Shanu Sushmita, Mohammad Zandsalimy, Youjia Zheng) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, safety-evaluation, benchmark | E5 / R3 (94%) | - |
| Being Kind Isn't Always Being Safe: Diagnosing Affective Hallucination in LLMs (Daeun Moon, Hyunsoo Yoon, Sewon Kim, Seungwoo Shin) | 2025 | Safety Evaluation | ai-safety, safety-evaluation, benchmark | E5 / R3 (98%) | - |
| Benchmarking Deception Probes via Black-to-White Performance Boosts (Carlo Leonardo Attubato, Stefan Heimersheim, Aviel Parrack) | 2025 | Deception & Failure | ai-safety, deception-failure, benchmark | E5 / R3 (93%) | 8 |
Showing 30 of 335 papers on page 1.
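The tag-and-year filtering behind a listing like the one above can be sketched in a few lines. This is a minimal illustration, not the tool's actual API: the `PAPERS` records are copied from three rows of the table, and the `filter_papers` helper and its field names (`tags`, `year`) are assumptions made for the example.

```python
# Illustrative sketch of tag/year filtering over paper records like the rows
# above. The record layout and filter_papers() are hypothetical, not the
# discovery tool's real schema or API.

PAPERS = [
    {"title": "AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents",
     "year": 2025, "area": "Adversarial Robustness",
     "tags": {"ai-safety", "adversarial-robustness", "benchmark"}},
    {"title": "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders",
     "year": 2025, "area": "Representation Analysis",
     "tags": {"representation-analysis", "ai-safety", "benchmark"}},
    {"title": "RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments",
     "year": 2026, "area": "Agent Safety",
     "tags": {"agent-safety", "ai-safety", "adversarial-robustness", "benchmark"}},
]

def filter_papers(papers, tags=None, year=None):
    """Keep papers that carry every requested tag and, optionally, match a year."""
    wanted = set(tags or ())
    return [p for p in papers
            if wanted <= p["tags"] and (year is None or p["year"] == year)]

# Example: adversarial-robustness papers from 2026.
hits = filter_papers(PAPERS, tags=["adversarial-robustness"], year=2026)
print([p["title"] for p in hits])
```

Using set containment (`wanted <= p["tags"]`) makes multi-tag queries an AND by default, which matches how faceted filters on listings like this typically behave.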