Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective (Davin Choo, Milind Tambe, Tzeh Yuan Neoh, Cheol Woo Kim) | 2026 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, adversarial-robustness, safety-evaluation | E5 / R3 (93%) | - |
| Towards Worst-Case Guarantees with Scale-Aware Interpretability (Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill) | 2026 | Formal/Theoretical | formaltheoretical, ai-safety, position, interpretability | E5 / R3 (94%) | - |
| AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? (Leonard Dung, Florian Mai) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E8 / R2 (97%) | - |
| AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability (Manuel Baltieri, Alexander Boyd, Fernando Rosas) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, interpretability | E6 / R3 (93%) | 2 |
| Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models (Mohsen Bayati, William Overman) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (96%) | 3 |
| Corrigibility Transformation: Constructing Goals That Accept Updates (Rubi Hudson) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (94%) | - |
| Dynamics Reveals Structure: Challenging the Linear Propagation Assumption (Bálint Mucsányi, Seong Joon Oh, Hoyeon Chang) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (92%) | - |
| Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis (Aran Nayebi) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E4 / R3 (93%) | 4 |
| On Surjectivity of Neural Networks: Can you elicit any behavior from your model? (Nika Haghtalab, Haozhe Jiang) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, adversarial-robustness | E6 / R3 (95%) | 3 |
| On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment (Sarah Ball, Shafi Goldwasser, Omer Reingold, Frauke Kreuter) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E5 / R3 (97%) | 4 |
| Out of Control - Why Alignment Needs Formal Control Theory (and an Alignment Control Stack) (Elija Perrier) | 2025 | Formal/Theoretical | formaltheoretical, alignment-training, ai-safety, position | E4 / R2 (94%) | 2 |
| RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification (Anonymous Authors) | 2025 | Formal/Theoretical | empirical, formaltheoretical, ai-safety | E5 / R4 (93%) | - |
| Shh, don't say that! Domain Certification in LLMs (Maxime Kayser, Adel Bibi, Philip H.S. Torr, Alasdair Paren) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, adversarial-robustness | E5 / R3 (93%) | 4 |
| The Computational Complexity of Circuit Discovery for Inner Interpretability (Todd Wareham, Martina G. Vilas, Federico Adolfi) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, interpretability | E6 / R3 (98%) | 14 |
| The Limits of Predicting Agents from Behaviour (Jonathan Richens, Tom Everitt, Alexis Bellot) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (93%) | 1 |
| The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models (Xingcheng Xu) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E5 / R4 (96%) | 4 |
| Will artificial agents pursue power by default? (Christian Tarsney) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (92%) | 1 |
| You Are What You Eat: AI Alignment Requires Understanding How Data Shapes Structure and Generalisation (Jesse Hoogland, Matthew Farrugia-Roberts, Simon Pepin Lehalleur, Susan Wei) | 2025 | Formal/Theoretical | formaltheoretical, alignment-training, ai-safety, position | E5 / R4 (96%) | 9 |
| Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies (Roger Wattenhofer, Frédéric Berdoz) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (96%) | 1 |
| Compact Proofs of Model Performance via Mechanistic Interpretability (Euan Ong, Soufiane Noubir, Lawrence Chan, Rajashree Agrawal) | 2024 | Formal/Theoretical | empirical, formaltheoretical, ai-safety, interpretability | E4 / R3 (92%) | 12 |
| Human-AI Safety: A Descendant of Generative AI and Control Systems Safety (Jaime F. Fisac, Andrea Bajcsy) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R4 (93%) | 9 |
| Information Theoretic Guarantees For Policy Alignment In Large Language Models (Youssef Mroueh) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E5 / R3 (95%) | 19 |
| Measuring Goal-Directedness (Matt MacDermott, Tom Everitt, Francesco Belardinelli, James Fox) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (95%) | 4 |
| On Goodhart's Law, with an Application to Value Alignment (Lê-Nguyên Hoang, El-Mahdi El-Mhamdi) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E5 / R3 (95%) | 6 |
| Provably Extracting the Features from a General Superposition (Allen Liu) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, interpretability | E4 / R2 (96%) | - |
| Quantifying stability of non-power-seeking in artificial agents (Yevgeny Liokumovich, Victoria Krakovna, Evan Ryan Gunter) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (94%) | 2 |
| Robust agents learn causal world models (Jonathan Richens, Tom Everitt) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E5 / R3 (96%) | 67 |
| The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists (Elliott Thornley) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E4 / R3 (94%) | 18 |
| Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Sanjit Seshia, Ben Goldhaber, David Dalrymple, Steve Omohundro) | 2024 | Formal/Theoretical | formaltheoretical, ai-safety, position | E5 / R4 (97%) | 102 |
| Towards Provably Secure Generative AI: Reliable Consensus Sampling (Yuhong Nie, Yu Cui, Licheng Wang, Baohan Huang) | 2024 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety | E4 / R3 (94%) | - |
Showing 30 of 42 papers on page 1.
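The "fast filtering" described above operates over records shaped like the table rows (year, area, tags, intel score, citations). A minimal sketch of that kind of tag/year/citation filtering, using a few records transcribed from the table; the `filter_papers` helper is hypothetical and not part of the actual tool:

```python
# Three records transcribed from the table above.
papers = [
    {"title": "Robust agents learn causal world models",
     "year": 2024, "tags": ["theoretical", "formaltheoretical", "ai-safety"],
     "citations": 67},
    {"title": "Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems",
     "year": 2024, "tags": ["formaltheoretical", "ai-safety", "position"],
     "citations": 102},
    {"title": "The Computational Complexity of Circuit Discovery for Inner Interpretability",
     "year": 2025, "tags": ["theoretical", "formaltheoretical", "ai-safety", "interpretability"],
     "citations": 14},
]

def filter_papers(records, tag=None, year=None, min_citations=0):
    """Keep only records matching every criterion that was provided."""
    return [
        p for p in records
        if (tag is None or tag in p["tags"])
        and (year is None or p["year"] == year)
        and p["citations"] >= min_citations
    ]

hits = filter_papers(papers, tag="interpretability", year=2025)
print([p["title"] for p in hits])
# → ['The Computational Complexity of Circuit Discovery for Inner Interpretability']
```

Each criterion defaults to "no constraint", so the same helper serves the browse view (no arguments) and any combination of facet filters.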