Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel
Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications

Irene Gravili, Alessandro Cellerino, Elisa Ferrari, Debdas Paul

Year: 2026 | Area: cs.LG | Citations: -

Tags: ai-safety, cslg, interpretability, preprint

-
Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Zeru Shi, Michelle Hurst, Yihao Quan, Ranjay Krishna

Year: 2026 | Area: cs.CV | Citations: -

Tags: ai-safety, cscv, interpretability, preprint

-
Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

Sadiq Y. Patel, Namrata Elamaran, Rajaie Batniji, John Morgan

Year: 2026 | Area: cs.AI | Citations: -

Tags: ai-safety, csai, interpretability, preprint

-
Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

Prasoon Bajpai, Been Kim, Zi Wang, Wenjun Zeng

Year: 2026 | Area: Model Editing | Citations: -

Tags: empirical, ai-safety, interpretability, model-editing

E5 / R3 (91%)
Patterning: The Dual of Interpretability

George Wang, Daniel Murfet

Year: 2026 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (95%)
Towards Worst-Case Guarantees with Scale-Aware Interpretability

Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill

Year: 2026 | Area: Formal/Theoretical | Citations: -

Tags: formaltheoretical, ai-safety, position, interpretability

E5 / R3 (94%)
Unpacking Interpretability: Human-Centered Criteria for Optimal Combinatorial Solutions

Filip Melinscak, Frank Scharnowski, Dominik Pegler, Frank Jäkel

Year: 2026 | Area: cs.HC | Citations: -

Tags: ai-safety, cshc, interpretability, preprint

E5 / R3 (93%)
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

Yousung Lee, Andres Saurez, Dongsoo Har

Year: 2026 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Louis Jaburi, Kola Ayonrinde

Year: 2025 | Area: Mechanistic Interp. | Citations: 4

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
A Pragmatic Vision for Interpretability

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, position, interpretability

-
A Review of Developmental Interpretability in Large Language Models

Ihor Kendiukhov

Year: 2025 | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, interpretability

E6 / R4 (94%)
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Ryan A. Rossi, Keivan Rezaei, Zhiyang Xu, Mohammad Beigi

Year: 2025 | Area: Surveys & Reviews | Citations: 20

Tags: surveys-reviews, ai-safety, survey, interpretability

E7 / R4 (96%)
AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Manuel Baltieri, Alexander Boyd, Fernando Rosas

Year: 2025 | Area: Formal/Theoretical | Citations: 2

Tags: theoretical, formaltheoretical, ai-safety, interpretability

E6 / R3 (93%)
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Mohammad Mahdi Khalili, Xudong Zhu, Zhihui Zhu

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Euan Ong, Samuel Marks, Julian Minder, Daniel Wen

Year: 2025 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (94%)
AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Yifei Yao, Mengnan Du

Year: 2025 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (94%)
Analysis of Variational Sparse Autoencoders

Yuxiao Li, Zachary Baker

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Atlas-Alignment: Making Interpretability Transferable Across Language Models

Sebastian Lapuschkin, Wojciech Samek, Jim Berend, Bruno Puri

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison

Year: 2025 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord

Year: 2025 | Area: Mechanistic Interp. | Citations: 9

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Hakaze Cho, Naoya Inoue, Haolin Yang, Brian M. Kurkoski

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Binary Sparse Coding for Interpretability

Lucia Quirke, Stepan Shabalin, Nora Belrose

Year: 2025 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (92%)
Bridging the Black Box: A Survey on Mechanistic Interpretability in AI

Amir Rafe, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi K. Dutta

Year: 2025 | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, interpretability, safety-evaluation

-
Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han

Year: 2025 | Area: Surveys & Reviews | Citations: 3

Tags: surveys-reviews, ai-safety, position, interpretability

E5 / R3 (97%)
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Sachin Kumar, Yusen Peng, Alex Gulko

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E5 / R3 (94%)
Can Interpretation Predict Behavior on Unseen Data?

David Alvarez-Melis, Jenny Kaufmann, Martin Wattenberg, Victoria R. Li

Year: 2025 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (92%)
Can LLMs Lie? Investigation beyond Hallucination

Shantanu Jaiswal, Mengning Wu, Deepak Pathak, Haoran Huan

Year: 2025 | Area: Deception & Failure | Citations: 1

Tags: empirical, ai-safety, deception-failure, interpretability

E5 / R3 (94%)
Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF

Jing Liu

Year: 2025 | Area: Alignment Training | Citations: -

Tags: theoretical, alignment-training, ai-safety, interpretability

E5 / R3 (94%)
Cracking the Circuits: Mechanistic Interpretability in Large Language Models

Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

Lingpeng Kong, Baosong Yang, Yu Wan, Xu Wang

Year: 2025 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)

Showing 30 of 200 papers on page 1.