Instant research discovery
Search and browse ingested papers, with intelligence signals and fast filtering by year, area, and tag.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation (Igor Santos-Grueiro) | 2026 | Deception & Failure | theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation | E4 / R2 (94%) | 1 |
| Authenticated Workflows: A Systems Approach to Protecting Agentic AI (Mohan Rajagopalan, Vinay Rao) | 2026 | Agent Safety | theoretical, agent-safety, ai-safety | E5 / R4 (96%) | - |
| How Many Features Can a Language Model Store Under the Linear Representation Hypothesis? (Jon Kleinberg, Kenny Peng, Nikhil Garg) | 2026 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E4 / R2 (93%) | - |
| How RLHF Amplifies Sycophancy (Gerdus Benade, Ariel D. Procaccia, Itai Shapira) | 2026 | Deception & Failure | theoretical, ai-safety, deception-failure | E5 / R3 (94%) | - |
| Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective (Davin Choo, Milind Tambe, Tzeh Yuan Neoh, Cheol Woo Kim) | 2026 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, adversarial-robustness, safety-evaluation | E5 / R3 (93%) | - |
| Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment (Elias Malomgré, Pieter Simoens) | 2026 | Alignment Training | theoretical, alignment-training, ai-safety | E5 / R3 (97%) | - |
| Jailbreaking LLMs via Calibration (Yongkang Guo, Yuxuan Lu, Yuqing Kong) | 2026 | Adversarial Robustness | theoretical, ai-safety, adversarial-robustness | E5 / R3 (94%) | - |
| Patterning: The Dual of Interpretability (George Wang, Daniel Murfet) | 2026 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (95%) | - |
| Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI (Mohan Rajagopalan, Vinay Rao) | 2026 | Adversarial Robustness | theoretical, ai-safety, adversarial-robustness | E5 / R3 (99%) | 1 |
| Subliminal Effects in Your Data: A General Mechanism via Log-Linearity (Ishaq Aden-Ali, Abhishek Shetty, Nika Haghtalab, Ankur Moitra) | 2026 | Deception & Failure | theoretical, ai-safety, deception-failure | E4 / R3 (94%) | - |
| Toward Constitutional Autonomy in AI Systems: A Theoretical Framework for Aligned Agentic Intelligence (William Torgbi Agbemabiese) | 2026 | Agent Safety | theoretical, alignment-training, agent-safety, ai-safety | E4 / R3 (93%) | 1 |
| Towards Poisoning Robustness Certification for Natural Language Generation (Mihnea Ghitu, Matthew Wicker) | 2026 | Adversarial Robustness | theoretical, ai-safety, adversarial-robustness | E5 / R3 (95%) | - |
| Towards Understanding Steering Strength (Magamed Taimeskhanov, Damien Garreau, Samuel Vaiter) | 2026 | Representation Analysis | theoretical, representation-analysis, ai-safety | E5 / R3 (96%) | - |
| Trustworthy Agentic AI Requires Deterministic Architectural Boundaries (Manish Bhattarai, Minh Vu) | 2026 | Agent Safety | theoretical, alignment-training, agent-safety, ai-safety | E5 / R4 (97%) | - |
| Unifying Stable Optimization and Reference Regularization in RLHF (Lina Yao, Qiang Qu, He Zhao, Li He) | 2026 | Alignment Training | theoretical, alignment-training, ai-safety | E5 / R3 (94%) | 1 |
| Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints (Yousung Lee, Andres Saurez, Dongsoo Har) | 2026 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring (Julian Schulz) | 2025 | Scalable Oversight | theoretical, scalable-oversight, ai-safety | E7 / R3 (94%) | - |
| A Geometric Unification of Concept Learning with Concept Cones (Thomas Fel, Alexandre Rocchi-Henry, Gianni Franchi) | 2025 | Mechanistic Interp. | theoretical, alignment-training, mechanistic-interp, ai-safety | E5 / R4 (94%) | - |
| A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i (Louis Jaburi, Kola Ayonrinde) | 2025 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | 4 |
| A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models (Edward Y. Chang) | 2025 | Alignment Training | theoretical, alignment-training, ai-safety, adversarial-robustness | E5 / R4 (95%) | 3 |
| A Toy Model of Mechanistic (Un)Faithfulness (Chris Olah) | 2025 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| A Two-Step, Multidimensional Account of Deception in Language Models (Leonard Dung) | 2025 | Deception & Failure | theoretical, ai-safety, deception-failure | E5 / R3 (95%) | - |
| A Unified Framework for Diffusion Model Unlearning with f-Divergence (Luigi Cinque, Deniz Gündüz, Federico Fontana, Nicola Novello) | 2025 | Model Editing | theoretical, ai-safety, model-editing | E5 / R3 (96%) | - |
| AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures? (Leonard Dung, Florian Mai) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, alignment-training, ai-safety | E8 / R2 (97%) | - |
| AI Testing Should Account for Sophisticated Strategic Behaviour (Vincent Conitzer, Eric Olav Chen, Alexis Ghersengorin, Vojtěch Kovařík) | 2025 | Safety Evaluation | theoretical, ai-safety, safety-evaluation | E6 / R3 (96%) | 3 |
| AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability (Manuel Baltieri, Alexander Boyd, Fernando Rosas) | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, interpretability | E6 / R3 (93%) | 2 |
| Access Controls Will Solve the Dual-Use Dilemma (Evžen Wybitul) | 2025 | Adversarial Robustness | theoretical, ai-safety, adversarial-robustness | E5 / R3 (93%) | 2 |
| Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers (Santhosh Kumar Ravindran) | 2025 | Deception & Failure | theoretical, ai-safety, adversarial-robustness, deception-failure | E5 / R3 (92%) | 1 |
| Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners (Hiroshi Kera, Toshihiko Yamasaki, Soichiro Kumano) | 2025 | Adversarial Robustness | theoretical, ai-safety, adversarial-robustness | E5 / R3 (93%) | 1 |
| An Alignment Safety Case Sketch Based on Debate (Geoffrey Irving, Jacob Pfau, Benjamin Hilton, Marie Davidsen Buhl) | 2025 | Scalable Oversight | theoretical, scalable-oversight, alignment-training, ai-safety | E5 / R3 (94%) | 8 |
Showing 30 of 203 papers (page 1).