Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Modular Safety Guardrails Are Necessary for Foundation-Model-Enabled Robots in the Real World | Davood Soleymanzadeh, Yi Ding, Minghui Zheng, Joonkyung Kim | 2026 | Agent Safety | agent-safety, ai-safety, position | E5 / R4 (93%) | - |
| Position: Capability Control Should be a Separate Goal From Alignment | Eleni Triantafillou, Adrian Weller, Shoaib Ahmed Siddiqui, David Krueger | 2026 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, position | E6 / R4 (94%) | - |
| Towards Worst-Case Guarantees with Scale-Aware Interpretability | Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill | 2026 | Formal/Theoretical | formaltheoretical, ai-safety, position, interpretability | E5 / R3 (94%) | - |
| A Pragmatic Vision for Interpretability | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | - | - |
| A Red Teaming Roadmap Towards System-Level Safety | Jeremy Kritz, Zifan Wang, Julian Michael, Willow E. Primack | 2025 | Safety Evaluation | ai-safety, position, safety-evaluation, red-teaming | E5 / R3 (92%) | 2 |
| A Sketch of an AI Control Safety Case | Geoffrey Irving, Buck Shlegeris, Tomek Korbak, Joshua Clymer | 2025 | Agent Safety | agent-safety, ai-safety, position, safety-evaluation | E6 / R4 (93%) | 22 |
| Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives | Leo Schwinn, Sophie Xhonneux, Gauthier Gidel, Tom Wollschläger | 2025 | Adversarial Robustness | alignment-training, ai-safety, adversarial-robustness, position | E4 / R3 (94%) | 7 |
| Adversarial ML Problems Are Getting Harder to Solve and to Evaluate | Nicholas Carlini, Florian Tramèr, Jie Zhang, Javier Rando | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, position | E5 / R3 (94%) | 24 |
| An Approach to Technical AGI Safety and Security | Sebastien Krier, Raluca Ada Popa, Vikrant Varma, Rory Greig | 2025 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, position | E5 / R3 (96%) | 35 |
| Bare Minimum Mitigations for Autonomous AI Development | Conor McGurk, Jingren Wang, Xudong Pan, Yawen Duan | 2025 | Agent Safety | agent-safety, ai-safety, position | E5 / R3 (94%) | 1 |
| Because we have LLMs, we Can and Should Pursue Agentic Interpretability | Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 9 |
| Bridge the Gaps between Machine Unlearning and AI Regulation | Bill Marino, Meghdad Kurmanji, Nicholas D. Lane | 2025 | Model Editing | ai-safety, position, model-editing | E5 / R3 (95%) | 5 |
| Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution | Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han | 2025 | Surveys & Reviews | surveys-reviews, ai-safety, position, interpretability | E5 / R3 (97%) | 3 |
| Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity | David Williams-King, Linh Le, Adam Oberman, Yoshua Bengio | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, position | E6 / R4 (93%) | - |
| Chain-of-Thought Monitorability: A New and Fragile Opportunity for AI Safety | Aleksander Madry, Mark Chen, Rohin Shah, Marius Hobbhahn | 2025 | Scalable Oversight | scalable-oversight, ai-safety, position | E5 / R4 (95%) | 88 |
| Computational Psychopathology of AI: A Clinical-Computational Framework for Diagnosing and Preventing Failure Modes | Carlos P. Portela | 2025 | Safety Evaluation | ai-safety, position, safety-evaluation | E6 / R3 (96%) | - |
| Difficulties with Evaluating a Deception Detector for AIs | Bilal Chughtai, Lewis Smith, Neel Nanda | 2025 | Deception & Failure | ai-safety, position, deception-failure | E5 / R3 (95%) | 1 |
| Distributional AGI Safety | Julian Jacobs, Nenad Tomašev, Matija Franklin, Sébastien Krier | 2025 | Agent Safety | agent-safety, ai-safety, position | E4 / R3 (92%) | 4 |
| Episodic Memory in AI Agents Poses Risks that Should be Studied and Mitigated | Chad DeChant | 2025 | Agent Safety | agent-safety, ai-safety, position | E5 / R3 (96%) | 8 |
| Expanding External Access To Frontier AI Models For Dangerous Capability Evaluations | Aidan Homewood, Alejandro Tlaie, Kyle O'Brien, Stephen Casper | 2025 | Safety Evaluation | ai-safety, position, safety-evaluation | E6 / R3 (96%) | - |
| From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming | Michael Feffer, Hoda Heidari, Anusha Sinha, Keltin Grimes | 2025 | Safety Evaluation | ai-safety, position, safety-evaluation | E5 / R3 (93%) | 2 |
| Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm | Ari Juels, Bill Marino | 2025 | Agent Safety | agent-safety, ai-safety, position | E6 / R3 (96%) | 2 |
| How Can Interpretability Researchers Help AGI Go Well? | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | 2025 | Mechanistic Interp. | alignment-training, mechanistic-interp, ai-safety, position, interpretability | - | - |
| If It's Nice, Do It Twice: We Should Try Iterative Corpus Curation | Robin Young | 2025 | Alignment Training | alignment-training, ai-safety, position | E5 / R3 (93%) | - |
| Infrastructure for AI Agents | Alan Chan, Kevin Wei, Markus Anderljung, Sihao Huang | 2025 | Agent Safety | agent-safety, ai-safety, position | E6 / R4 (96%) | 24 |
| LLM Agents Should Employ Security Principles | Zian Su, Ninghui Li, Xiangyu Zhang, Elisa Bertino | 2025 | Agent Safety | agent-safety, ai-safety, position | E6 / R4 (94%) | 13 |
| LLM-Safety Evaluations Lack Robustness | Leo Schwinn, Sophie Xhonneux, Gauthier Gidel, Simon Geisler | 2025 | Safety Evaluation | ai-safety, position, safety-evaluation | E6 / R3 (94%) | 12 |
| Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models | Yuwen Tan, Boqing Gong | 2025 | Model Editing | ai-safety, position, model-editing | E4 / R3 (95%) | 1 |
| Mechanistic Interpretability Needs Philosophy | Nina Rajcic, Iwan Williams, Ninell Oldenburg, Filippos Stamatiou | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 3 |
| Multi-Agent Systems Should be Treated as Principal-Agent Problems | Paulius Rauba, Mihaela van der Schaar, Simonas Cepenas | 2025 | Agent Safety | agent-safety, ai-safety, position | E5 / R4 (94%) | - |
Showing 30 of 98 papers on page 1.