PaperIntel: Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.
Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Davin Choo, Milind Tambe, Tzeh Yuan Neoh, Cheol Woo Kim

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness, safety-evaluation

E5 / R3 (93%)

Towards Worst-Case Guarantees with Scale-Aware Interpretability

Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: formaltheoretical, ai-safety, position, interpretability

E5 / R3 (94%)

AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Leonard Dung, Florian Mai

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E8 / R2 (97%)

AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Manuel Baltieri, Alexander Boyd, Fernando Rosas

Year: 2025 · Area: Formal/Theoretical · Citations: 2

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E6 / R3 (93%)

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models

Mohsen Bayati, William Overman

Year: 2025 · Area: Formal/Theoretical · Citations: 3

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

Corrigibility Transformation: Constructing Goals That Accept Updates

Rubi Hudson

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (94%)

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Bálint Mucsányi, Seong Joon Oh, Hoyeon Chang

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (92%)

Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis

Aran Nayebi

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E4 / R3 (93%)

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Nika Haghtalab, Haozhe Jiang

Year: 2025 · Area: Formal/Theoretical · Citations: 3

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness

E6 / R3 (95%)

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

Sarah Ball, Shafi Goldwasser, Omer Reingold, Frauke Kreuter

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (97%)

Out of Control - Why Alignment Needs Formal Control Theory (and an Alignment Control Stack)

Elija Perrier

Year: 2025 · Area: Formal/Theoretical · Citations: 2

Tags: formaltheoretical, alignment-training, ai-safety, position

E4 / R2 (94%)

RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification

Anonymous Authors

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: empirical, formaltheoretical, ai-safety

E5 / R4 (93%)

Shh, don't say that! Domain Certification in LLMs

Maxime Kayser, Adel Bibi, Philip H.S. Torr, Alasdair Paren

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness

E5 / R3 (93%)

The Computational Complexity of Circuit Discovery for Inner Interpretability

Todd Wareham, Martina G. Vilas, Federico Adolfi

Year: 2025 · Area: Formal/Theoretical · Citations: 14

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E6 / R3 (98%)

The Limits of Predicting Agents from Behaviour

Jonathan Richens, Tom Everitt, Alexis Bellot

Year: 2025 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (93%)

The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Xingcheng Xu

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R4 (96%)

Will artificial agents pursue power by default?

Christian Tarsney

Year: 2025 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (92%)

You Are What You Eat: AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

Jesse Hoogland, Matthew Farrugia-Roberts, Simon Pepin Lehalleur, Susan Wei

Year: 2025 · Area: Formal/Theoretical · Citations: 9

Tags: formaltheoretical, alignment-training, ai-safety, position

E5 / R4 (96%)

Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

Roger Wattenhofer, Frédéric Berdoz

Year: 2024 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

Compact Proofs of Model Performance via Mechanistic Interpretability

Euan Ong, Soufiane Noubir, Lawrence Chan, Rajashree Agrawal

Year: 2024 · Area: Formal/Theoretical · Citations: 12

Tags: empirical, formaltheoretical, ai-safety, interpretability

E4 / R3 (92%)

Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

Jaime F. Fisac, Andrea Bajcsy

Year: 2024 · Area: Formal/Theoretical · Citations: 9

Tags: theoretical, formaltheoretical, ai-safety

E5 / R4 (93%)

Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh

Year: 2024 · Area: Formal/Theoretical · Citations: 19

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (95%)

Measuring Goal-Directedness

Matt MacDermott, Tom Everitt, Francesco Belardinelli, James Fox

Year: 2024 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (95%)

On Goodhart's Law, with an Application to Value Alignment

Lê-Nguyên Hoang, El-Mahdi El-Mhamdi

Year: 2024 · Area: Formal/Theoretical · Citations: 6

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (95%)

Provably Extracting the Features from a General Superposition

Allen Liu

Year: 2024 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E4 / R2 (96%)

Quantifying stability of non-power-seeking in artificial agents

Yevgeny Liokumovich, Victoria Krakovna, Evan Ryan Gunter

Year: 2024 · Area: Formal/Theoretical · Citations: 2

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (94%)

Robust agents learn causal world models

Jonathan Richens, Tom Everitt

Year: 2024 · Area: Formal/Theoretical · Citations: 67

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley

Year: 2024 · Area: Formal/Theoretical · Citations: 18

Tags: theoretical, formaltheoretical, ai-safety

E4 / R3 (94%)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Sanjit Seshia, Ben Goldhaber, David Dalrymple, Steve Omohundro

Year: 2024 · Area: Formal/Theoretical · Citations: 102

Tags: formaltheoretical, ai-safety, position

E5 / R4 (97%)

Towards Provably Secure Generative AI: Reliable Consensus Sampling

Yuhong Nie, Yu Cui, Licheng Wang, Baohan Huang

Year: 2024 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E4 / R3 (94%)

Showing 30 of 42 papers on page 1.