PaperIntel: Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.
Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World

Davood Soleymanzadeh, Yi Ding, Minghui Zheng, Joonkyung Kim

Year: 2026 · Area: Agent Safety · Citations: -

Tags: agent-safety, ai-safety, position

E5 / R4 (93%)
Position: Capability Control Should be a Separate Goal From Alignment

Eleni Triantafillou, Adrian Weller, Shoaib Ahmed Siddiqui, David Krueger

Year: 2026 · Area: Surveys & Reviews · Citations: -

Tags: alignment-training, surveys-reviews, ai-safety, position

E6 / R4 (94%)
Towards Worst-Case Guarantees with Scale-Aware Interpretability

Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: formal-theoretical, ai-safety, position, interpretability

E5 / R3 (94%)
A Pragmatic Vision for Interpretability

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: mechanistic-interp, ai-safety, position, interpretability

-
A Red Teaming Roadmap Towards System-Level Safety

Jeremy Kritz, Zifan Wang, Julian Michael, Willow E. Primack

Year: 2025 · Area: Safety Evaluation · Citations: 2

Tags: ai-safety, position, safety-evaluation, red-teaming

E5 / R3 (92%)
A Sketch of an AI Control Safety Case

Geoffrey Irving, Buck Shlegeris, Tomek Korbak, Joshua Clymer

Year: 2025 · Area: Agent Safety · Citations: 22

Tags: agent-safety, ai-safety, position, safety-evaluation

E6 / R4 (93%)
Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives

Leo Schwinn, Sophie Xhonneux, Gauthier Gidel, Tom Wollschläger

Year: 2025 · Area: Adversarial Robustness · Citations: 7

Tags: alignment-training, ai-safety, adversarial-robustness, position

E4 / R3 (94%)
Adversarial ML Problems Are Getting Harder to Solve and to Evaluate

Nicholas Carlini, Florian Tramèr, Jie Zhang, Javier Rando

Year: 2025 · Area: Adversarial Robustness · Citations: 24

Tags: ai-safety, adversarial-robustness, position

E5 / R3 (94%)
An Approach to Technical AGI Safety and Security

Sebastien Krier, Raluca Ada Popa, Vikrant Varma, Rory Greig

Year: 2025 · Area: Surveys & Reviews · Citations: 35

Tags: alignment-training, surveys-reviews, ai-safety, position

E5 / R3 (96%)
Bare Minimum Mitigations for Autonomous AI Development

Conor McGurk, Jingren Wang, Xudong Pan, Yawen Duan

Year: 2025 · Area: Agent Safety · Citations: 1

Tags: agent-safety, ai-safety, position

E5 / R3 (94%)
Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord

Year: 2025 · Area: Mechanistic Interp. · Citations: 9

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Bridge the Gaps between Machine Unlearning and AI Regulation

Bill Marino, Meghdad Kurmanji, Nicholas D. Lane

Year: 2025 · Area: Model Editing · Citations: 5

Tags: ai-safety, position, model-editing

E5 / R3 (95%)
Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han

Year: 2025 · Area: Surveys & Reviews · Citations: 3

Tags: surveys-reviews, ai-safety, position, interpretability

E5 / R3 (97%)
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio

Year: 2025 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, position

E6 / R4 (93%)
Chain-of-Thought Monitorability: A New and Fragile Opportunity for AI Safety

Aleksander Madry, Mark Chen, Rohin Shah, Marius Hobbhahn

Year: 2025 · Area: Scalable Oversight · Citations: 88

Tags: scalable-oversight, ai-safety, position

E5 / R4 (95%)
Computational Psychopathology of AI: A Clinical-Computational Framework for Diagnosing and Preventing Failure Modes

Carlos P. Portela

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, position, safety-evaluation

E6 / R3 (96%)
Difficulties with Evaluating a Deception Detector for AIs

Bilal Chughtai, Lewis Smith, Neel Nanda

Year: 2025 · Area: Deception & Failure · Citations: 1

Tags: ai-safety, position, deception-failure

E5 / R3 (95%)
Distributional AGI Safety

Julian Jacobs, Nenad Tomašev, Matija Franklin, Sébastien Krier

Year: 2025 · Area: Agent Safety · Citations: 4

Tags: agent-safety, ai-safety, position

E4 / R3 (92%)
Episodic Memory in AI Agents Poses Risks that Should be Studied and Mitigated

Chad DeChant

Year: 2025 · Area: Agent Safety · Citations: 8

Tags: agent-safety, ai-safety, position

E5 / R3 (96%)
Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations

Aidan Homewood, Alejandro Tlaie, Kyle O'Brien, Stephen Casper

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, position, safety-evaluation

E6 / R3 (96%)
From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming

Michael Feffer, Hoda Heidari, Anusha Sinha, Keltin Grimes

Year: 2025 · Area: Safety Evaluation · Citations: 2

Tags: ai-safety, position, safety-evaluation

E5 / R3 (93%)
Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm

Ari Juels, Bill Marino

Year: 2025 · Area: Agent Safety · Citations: 2

Tags: agent-safety, ai-safety, position

E6 / R3 (96%)
How Can Interpretability Researchers Help AGI Go Well?

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: alignment-training, mechanistic-interp, ai-safety, position, interpretability

-
If It's Nice, Do It Twice: We Should Try Iterative Corpus Curation

Robin Young

Year: 2025 · Area: Alignment Training · Citations: -

Tags: alignment-training, ai-safety, position

E5 / R3 (93%)
Infrastructure for AI Agents

Alan Chan, Kevin Wei, Markus Anderljung, Sihao Huang

Year: 2025 · Area: Agent Safety · Citations: 24

Tags: agent-safety, ai-safety, position

E6 / R4 (96%)
LLM Agents Should Employ Security Principles

Zian Su, Ninghui Li, Xiangyu Zhang, Elisa Bertino

Year: 2025 · Area: Agent Safety · Citations: 13

Tags: agent-safety, ai-safety, position

E6 / R4 (94%)
LLM-Safety Evaluations Lack Robustness

Leo Schwinn, Sophie Xhonneux, Gauthier Gidel, Simon Geisler

Year: 2025 · Area: Safety Evaluation · Citations: 12

Tags: ai-safety, position, safety-evaluation

E6 / R3 (94%)
Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

Yuwen Tan, Boqing Gong

Year: 2025 · Area: Model Editing · Citations: 1

Tags: ai-safety, position, model-editing

E4 / R3 (95%)
Mechanistic Interpretability Needs Philosophy

Nina Rajcic, Iwan Williams, Ninell Oldenburg, Filippos Stamatiou

Year: 2025 · Area: Mechanistic Interp. · Citations: 3

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Multi-Agent Systems Should be Treated as Principal-Agent Problems

Paulius Rauba, Mihaela van der Schaar, Simonas Cepenas

Year: 2025 · Area: Agent Safety · Citations: -

Tags: agent-safety, ai-safety, position

E5 / R4 (94%)

Showing 30 of 98 papers on page 1.