PaperIntel: Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Igor Santos-Grueiro

Year: 2026 · Area: Deception & Failure · Citations: 1

Tags: theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation

Signals: E4 / R2 (94%)

Authenticated Workflows: A Systems Approach to Protecting Agentic AI

Mohan Rajagopalan, Vinay Rao

Year: 2026 · Area: Agent Safety · Citations: -

Tags: theoretical, agent-safety, ai-safety

Signals: E5 / R4 (96%)

How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Jon Kleinberg, Kenny Peng, Nikhil Garg

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

Signals: E4 / R2 (93%)

How RLHF Amplifies Sycophancy

Gerdus Benade, Ariel D. Procaccia, Itai Shapira

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

Signals: E5 / R3 (94%)

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Davin Choo, Milind Tambe, Tzeh Yuan Neoh, Cheol Woo Kim

Year: 2026 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formal-theoretical, ai-safety, adversarial-robustness, safety-evaluation

Signals: E5 / R3 (93%)

Interactionless Inverse Reinforcement Learning: A Data-Centric Framework for Durable Alignment

Elias Malomgré, Pieter Simoens

Year: 2026 · Area: Alignment Training · Citations: -

Tags: theoretical, alignment-training, ai-safety

Signals: E5 / R3 (97%)

Jailbreaking LLMs via Calibration

Yongkang Guo, Yuxuan Lu, Yuqing Kong

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: theoretical, ai-safety, adversarial-robustness

Signals: E5 / R3 (94%)

Patterning: The Dual of Interpretability

George Wang, Daniel Murfet

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

Signals: E6 / R4 (95%)

Protecting Context and Prompts: Deterministic Security for Non-Deterministic AI

Mohan Rajagopalan, Vinay Rao

Year: 2026 · Area: Adversarial Robustness · Citations: 1

Tags: theoretical, ai-safety, adversarial-robustness

Signals: E5 / R3 (99%)

Subliminal Effects in Your Data: A General Mechanism via Log-Linearity

Ishaq Aden-Ali, Abhishek Shetty, Nika Haghtalab, Ankur Moitra

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

Signals: E4 / R3 (94%)

Toward Constitutional Autonomy in AI Systems: A Theoretical Framework for Aligned Agentic Intelligence

William Torgbi Agbemabiese

Year: 2026 · Area: Agent Safety · Citations: 1

Tags: theoretical, alignment-training, agent-safety, ai-safety

Signals: E4 / R3 (93%)

Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: theoretical, ai-safety, adversarial-robustness

Signals: E5 / R3 (95%)

Towards Understanding Steering Strength

Magamed Taimeskhanov, Damien Garreau, Samuel Vaiter

Year: 2026 · Area: Representation Analysis · Citations: -

Tags: theoretical, representation-analysis, ai-safety

Signals: E5 / R3 (96%)

Trustworthy Agentic AI Requires Deterministic Architectural Boundaries

Manish Bhattarai, Minh Vu

Year: 2026 · Area: Agent Safety · Citations: -

Tags: theoretical, alignment-training, agent-safety, ai-safety

Signals: E5 / R4 (97%)

Unifying Stable Optimization and Reference Regularization in RLHF

Lina Yao, Qiang Qu, He Zhao, Li He

Year: 2026 · Area: Alignment Training · Citations: 1

Tags: theoretical, alignment-training, ai-safety

Signals: E5 / R3 (94%)

Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

Yousung Lee, Andres Saurez, Dongsoo Har

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

Signals: E5 / R3 (96%)

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Julian Schulz

Year: 2025 · Area: Scalable Oversight · Citations: -

Tags: theoretical, scalable-oversight, ai-safety

Signals: E7 / R3 (94%)

A Geometric Unification of Concept Learning with Concept Cones

Thomas Fel, Alexandre Rocchi-Henry, Gianni Franchi

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: theoretical, alignment-training, mechanistic-interp, ai-safety

Signals: E5 / R4 (94%)

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Louis Jaburi, Kola Ayonrinde

Year: 2025 · Area: Mechanistic Interp. · Citations: 4

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

Signals: E6 / R3 (95%)

A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models

Edward Y. Chang

Year: 2025 · Area: Alignment Training · Citations: 3

Tags: theoretical, alignment-training, ai-safety, adversarial-robustness

Signals: E5 / R4 (95%)

A Toy Model of Mechanistic (Un)Faithfulness

Chris Olah

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

Signals: E5 / R3 (93%)

A Two-Step, Multidimensional Account of Deception in Language Models

Leonard Dung

Year: 2025 · Area: Deception & Failure · Citations: -

Tags: theoretical, ai-safety, deception-failure

Signals: E5 / R3 (95%)

A Unified Framework for Diffusion Model Unlearning with f-Divergence

Luigi Cinque, Deniz Gündüz, Federico Fontana, Nicola Novello

Year: 2025 · Area: Model Editing · Citations: -

Tags: theoretical, ai-safety, model-editing

Signals: E5 / R3 (96%)

AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms or Shared Failures?

Leonard Dung, Florian Mai

Year: 2025 · Area: Formal/Theoretical · Citations: -

Tags: theoretical, formal-theoretical, alignment-training, ai-safety

Signals: E8 / R2 (97%)

AI Testing Should Account for Sophisticated Strategic Behaviour

Vincent Conitzer, Eric Olav Chen, Alexis Ghersengorin, Vojtěch Kovařík

Year: 2025 · Area: Safety Evaluation · Citations: 3

Tags: theoretical, ai-safety, safety-evaluation

Signals: E6 / R3 (96%)

AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Manuel Baltieri, Alexander Boyd, Fernando Rosas

Year: 2025 · Area: Formal/Theoretical · Citations: 2

Tags: theoretical, formal-theoretical, ai-safety, interpretability

Signals: E6 / R3 (93%)

Access Controls Will Solve the Dual-Use Dilemma

Evžen Wybitul

Year: 2025 · Area: Adversarial Robustness · Citations: 2

Tags: theoretical, ai-safety, adversarial-robustness

Signals: E5 / R3 (93%)

Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers

Santhosh Kumar Ravindran

Year: 2025 · Area: Deception & Failure · Citations: 1

Tags: theoretical, ai-safety, adversarial-robustness, deception-failure

Signals: E5 / R3 (92%)

Adversarially Pretrained Transformers May Be Universally Robust In-Context Learners

Hiroshi Kera, Toshihiko Yamasaki, Soichiro Kumano

Year: 2025 · Area: Adversarial Robustness · Citations: 1

Tags: theoretical, ai-safety, adversarial-robustness

Signals: E5 / R3 (93%)

An Alignment Safety Case Sketch Based on Debate

Geoffrey Irving, Jacob Pfau, Benjamin Hilton, Marie Davidsen Buhl

Year: 2025 · Area: Scalable Oversight · Citations: 8

Tags: theoretical, scalable-oversight, alignment-training, ai-safety

Signals: E5 / R3 (94%)

Showing 30 of 203 papers on page 1.
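The kind of fast filtering this listing offers can be sketched programmatically. Below is a minimal, hypothetical sketch — `Paper` and `filter_papers` are illustrative names, not the PaperIntel API — showing how records like the cards above (title, authors, year, area, tags) might be filtered by any combination of year, area, and tag:

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    """One entry from the listing: title, authors, year, area, tags."""
    title: str
    authors: list
    year: int
    area: str
    tags: set = field(default_factory=set)

def filter_papers(papers, *, year=None, area=None, tag=None):
    """Return papers matching every criterion given; None means 'no filter'."""
    return [
        p for p in papers
        if (year is None or p.year == year)
        and (area is None or p.area == area)
        and (tag is None or tag in p.tags)
    ]

# Two records transcribed from the listing above.
papers = [
    Paper("How RLHF Amplifies Sycophancy",
          ["Gerdus Benade", "Ariel D. Procaccia", "Itai Shapira"],
          2026, "Deception & Failure",
          {"theoretical", "ai-safety", "deception-failure"}),
    Paper("An Alignment Safety Case Sketch Based on Debate",
          ["Geoffrey Irving", "Jacob Pfau", "Benjamin Hilton", "Marie Davidsen Buhl"],
          2025, "Scalable Oversight",
          {"theoretical", "scalable-oversight", "alignment-training", "ai-safety"}),
]

hits = filter_papers(papers, year=2025, tag="scalable-oversight")
```

Keyword-only criteria (the `*` in the signature) keep call sites self-documenting, and treating `None` as "no filter" lets the same function serve every combination of facets the UI exposes.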