Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Age Predictors Through the Lens of Generalization, Bias Mitigation, and Interpretability: Reflections on Causal Implications | Irene Gravili, Alessandro Cellerino, Elisa Ferrari, Debdas Paul | 2026 | cs.LG | ai-safety, cslg, interpretability, preprint | - | - |
| Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models | Zeru Shi, Michelle Hurst, Yihao Quan, Ranjay Krishna | 2026 | cs.CV | ai-safety, cscv, interpretability, preprint | - | - |
| Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations | Sadiq Y. Patel, Namrata Elamaran, Rajaie Batniji, John Morgan | 2026 | cs.AI | ai-safety, csai, interpretability, preprint | - | - |
| Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits | Prasoon Bajpai, Been Kim, Zi Wang, Wenjun Zeng | 2026 | Model Editing | empirical, ai-safety, interpretability, model-editing | E5 / R3 (91%) | - |
| Patterning: The Dual of Interpretability | George Wang, Daniel Murfet | 2026 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (95%) | - |
| Towards Worst-Case Guarantees with Scale-Aware Interpretability | Andrew Mack, Artemy Kolchinsky, David Berman, Aryeh Brill | 2026 | Formal/Theoretical | formaltheoretical, ai-safety, position, interpretability | E5 / R3 (94%) | - |
| Unpacking Interpretability: Human-Centered Criteria for Optimal Combinatorial Solutions | Filip Melinscak, Frank Scharnowski, Dominik Pegler, Frank Jäkel | 2026 | cs.HC | ai-safety, cshc, interpretability, preprint | E5 / R3 (93%) | - |
| Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints | Yousung Lee, Andres Saurez, Dongsoo Har | 2026 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i | Louis Jaburi, Kola Ayonrinde | 2025 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | 4 |
| A Pragmatic Vision for Interpretability | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | - | - |
| A Review of Developmental Interpretability in Large Language Models | Ihor Kendiukhov | 2025 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E6 / R4 (94%) | - |
| A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models | Ryan A. Rossi, Keivan Rezaei, Zhiyang Xu, Mohammad Beigi | 2025 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E7 / R4 (96%) | 20 |
| AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability | Manuel Baltieri, Alexander Boyd, Fernando Rosas | 2025 | Formal/Theoretical | theoretical, formaltheoretical, ai-safety, interpretability | E6 / R3 (93%) | 2 |
| AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features | Mohammad Mahdi Khalili, Xudong Zhu, Zhihui Zhu | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers | Euan Ong, Samuel Marks, Julian Minder, Daniel Wen | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (94%) | 2 |
| AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations | Yifei Yao, Mengnan Du | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (94%) | 1 |
| Analysis of Variational Sparse Autoencoders | Yuxiao Li, Zachary Baker | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | - |
| Atlas-Alignment: Making Interpretability Transferable Across Language Models | Sebastian Lapuschkin, Wojciech Samek, Jim Berend, Bruno Puri | 2025 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | - |
| Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers | Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 26 |
| Because we have LLMs, we Can and Should Pursue Agentic Interpretability | Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 9 |
| Binary Autoencoder for Mechanistic Interpretability of Large Language Models | Hakaze Cho, Naoya Inoue, Haolin Yang, Brian M. Kurkoski | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | - |
| Binary Sparse Coding for Interpretability | Lucia Quirke, Stepan Shabalin, Nora Belrose | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (92%) | 1 |
| Bridging the Black Box: A Survey on Mechanistic Interpretability in AI | Amir Rafe, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi K. Dutta | 2025 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability, safety-evaluation | - | - |
| Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution | Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han | 2025 | Surveys & Reviews | surveys-reviews, ai-safety, position, interpretability | E5 / R3 (97%) | 3 |
| CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders | Sachin Kumar, Yusen Peng, Alex Gulko | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E5 / R3 (94%) | - |
| Can Interpretation Predict Behavior on Unseen Data? | David Alvarez-Melis, Jenny Kaufmann, Martin Wattenberg, Victoria R. Li | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (92%) | 5 |
| Can LLMs Lie? Investigation beyond Hallucination | Shantanu Jaiswal, Mengning Wu, Deepak Pathak, Haoran Huan | 2025 | Deception & Failure | empirical, ai-safety, deception-failure, interpretability | E5 / R3 (94%) | 1 |
| Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF | Jing Liu | 2025 | Alignment Training | theoretical, alignment-training, ai-safety, interpretability | E5 / R3 (94%) | - |
| Cracking the Circuits: Mechanistic Interpretability in Large Language Models | Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | - |
| DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders | Lingpeng Kong, Baosong Yang, Yu Wan, Xu Wang | 2025 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
Showing 30 of 200 papers on page 1.