PaperIntel: Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.
Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Davin Choo, Milind Tambe, Tzeh Yuan Neoh, Cheol Woo Kim

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness, safety-evaluation

E5 / R3 (93%)

Towards Worst-Case Guarantees with Scale-Aware Interpretability

Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: formaltheoretical, ai-safety, position, interpretability

E5 / R3 (94%)

AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Leonard Dung, Florian Mai

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E8 / R2 (97%)

AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Manuel Baltieri, Alexander Boyd, Fernando Rosas

Year: 2025 · Area: Formal/Theoretical · Citations: 2

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E6 / R3 (93%)

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models

Mohsen Bayati, William Overman

Year: 2025 · Area: Formal/Theoretical · Citations: 3

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

Corrigibility Transformation: Constructing Goals That Accept Updates

Rubi Hudson

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (94%)

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Bálint Mucsányi, Seong Joon Oh, Hoyeon Chang

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (92%)

Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis

Aran Nayebi

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E4 / R3 (93%)

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Nika Haghtalab, Haozhe Jiang

Year: 2025 · Area: Formal/Theoretical · Citations: 3

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness

E6 / R3 (95%)

On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

Sarah Ball, Shafi Goldwasser, Omer Reingold, Frauke Kreuter

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (97%)

Out of Control - Why Alignment Needs Formal Control Theory (and an Alignment Control Stack)

Elija Perrier

Year: 2025 · Area: Formal/Theoretical · Citations: 2

Tags: formaltheoretical, alignment-training, ai-safety, position

E4 / R2 (94%)

RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification

Anonymous Authors

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: empirical, formaltheoretical, ai-safety

E5 / R4 (93%)

Shh, don't say that! Domain Certification in LLMs

Maxime Kayser, Adel Bibi, Philip H.S. Torr, Alasdair Paren

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, ai-safety, adversarial-robustness

E5 / R3 (93%)

The Computational Complexity of Circuit Discovery for Inner Interpretability

Todd Wareham, Martina G. Vilas, Federico Adolfi

Year: 2025 · Area: Formal/Theoretical · Citations: 14

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E6 / R3 (98%)

The Limits of Predicting Agents from Behaviour

Jonathan Richens, Tom Everitt, Alexis Bellot

Year: 2025 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (93%)

The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Xingcheng Xu

Year: 2025 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R4 (96%)

Will artificial agents pursue power by default?

Christian Tarsney

Year: 2025 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (92%)

You Are What You Eat: AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

Jesse Hoogland, Matthew Farrugia-Roberts, Simon Pepin Lehalleur, Susan Wei

Year: 2025 · Area: Formal/Theoretical · Citations: 9

Tags: formaltheoretical, alignment-training, ai-safety, position

E5 / R4 (96%)

Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies

Roger Wattenhofer, Frédéric Berdoz

Year: 2024 · Area: Formal/Theoretical · Citations: 1

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

Compact Proofs of Model Performance via Mechanistic Interpretability

Euan Ong, Soufiane Noubir, Lawrence Chan, Rajashree Agrawal

Year: 2024 · Area: Formal/Theoretical · Citations: 12

Tags: empirical, formaltheoretical, ai-safety, interpretability

E4 / R3 (92%)

Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

Jaime F. Fisac, Andrea Bajcsy

Year: 2024 · Area: Formal/Theoretical · Citations: 9

Tags: theoretical, formaltheoretical, ai-safety

E5 / R4 (93%)

Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh

Year: 2024 · Area: Formal/Theoretical · Citations: 19

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (95%)

Measuring Goal-Directedness

Matt MacDermott, Tom Everitt, Francesco Belardinelli, James Fox

Year: 2024 · Area: Formal/Theoretical · Citations: 4

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (95%)

On Goodhart's Law, with an Application to Value Alignment

Lê-Nguyên Hoang, El-Mahdi El-Mhamdi

Year: 2024 · Area: Formal/Theoretical · Citations: 6

Tags: theoretical, formaltheoretical, alignment-training, ai-safety

E5 / R3 (95%)

Provably Extracting the Features from a General Superposition

Allen Liu

Year: 2024 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E4 / R2 (96%)

Quantifying stability of non-power-seeking in artificial agents

Yevgeny Liokumovich, Victoria Krakovna, Evan Ryan Gunter

Year: 2024 · Area: Formal/Theoretical · Citations: 2

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (94%)

Robust agents learn causal world models

Jonathan Richens, Tom Everitt

Year: 2024 · Area: Formal/Theoretical · Citations: 67

Tags: theoretical, formaltheoretical, ai-safety

E5 / R3 (96%)

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley

Year: 2024 · Area: Formal/Theoretical · Citations: 18

Tags: theoretical, formaltheoretical, ai-safety

E4 / R3 (94%)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Sanjit Seshia, Ben Goldhaber, David Dalrymple, Steve Omohundro

Year: 2024 · Area: Formal/Theoretical · Citations: 102

Tags: formaltheoretical, ai-safety, position

E5 / R4 (97%)

Towards Provably Secure Generative AI: Reliable Consensus Sampling

Yuhong Nie, Yu Cui, Licheng Wang, Baohan Huang

Year: 2024 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formaltheoretical, ai-safety

E4 / R3 (94%)

Showing 30 of 42 papers on page 1.