Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Learning Multi-Level Features with Matryoshka Sparse Autoencoders | Noa Nabeshima, Adam Karvonen, Bart Bussmann, Neel Nanda | 2025-03-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 63 |
| Low-Rank Adapting Models for Sparse Autoencoders | Joshua Engels, Matthew Chen, Max Tegmark | 2025-01-31 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 4 |
| MIB: A Mechanistic Interpretability Benchmark | David Bau, Michael Hanna, Sarah Wiegreffe, Alessandro Stolfo | 2025-04-17 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, benchmark | E6 / R3 (97%) | 16 |
| Mapping Faithful Reasoning in Language Models | Andreas Damianou, J Rosser, Konstantina Palla, José Luis Redondo García | 2025-10-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 2 |
| Measuring Sparse Autoencoder Feature Sensitivity | Nathan Hu, Katherine Tian, Claire Tian | 2025-09-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (94%) | - |
| Measuring and Guiding Monosemanticity | Stephan Wäldchen, Manuel Brack, Björn Deiseroth, Ruben Härle | 2025-06-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (96%) | 3 |
| Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units | Jianhui Chen, Liangming Pan, Yuzhang Luo | 2026-01-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Mechanistic Exploration of Backdoored Large Language Model Attention Patterns | Lakshmi Babu-Saheer, Mohammed Abu Baker | 2025-08-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | - |
| Mechanistic Interpretability Needs Philosophy | Nina Rajcic, Iwan Williams, Ninell Oldenburg, Filippos Stamatiou | 2025-06-23 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 3 |
| Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG | Maxime Peyrard, François Portet, Maxime Méloux | 2025-10-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | 2 |
| Mechanistic Interpretability for Steering Vision-Language-Action Models | Ian Chuang, Bear Häon, Kaylene Stocking, Claire Tomlin | 2025-08-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | 3 |
| Mechanistic understanding and validation of large AI models with SemanticLens | Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek | 2025-01-09 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (95%) | 29 |
| Mixture of Experts Made Intrinsically Interpretable | Philip Torr, Adel Bibi, Constantin Venhoff, Puneet K. Dokania | 2025-03-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 12 |
| Modular Training of Neural Networks aids Interpretability | Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots | 2025-02-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 1 |
| Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research | Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | - | - |
| NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment | Jean-Noël Vittaut, Sarath Chandar, Marie-Jeanne Lesot, Nicolas Chesneau | 2025-06-10 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification | Mohammad Mahdi Khalili, Vishnu Kabir Chhabra, Ding Zhu | 2025-02-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 5 |
| On the Biology of a Large Language Model | Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (96%) | - |
| OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features | Alexey Dontsov, Ivan Oseledets, Anton Korznikov, Oleg Y. Rogov | 2025-09-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (95%) | 2 |
| PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization | Lijie Hu, Huanyi Xie, Shu Yang, Di Wang | 2025-10-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 2 |
| Position-aware Automatic Circuit Discovery | David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad | 2025-02-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 6 |
| Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs | Yujia Zheng, Kun Zhang, Mona T. Diab, Xiangchen Song | 2025-05-26 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability, safety-evaluation | E5 / R3 (95%) | 6 |
| Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function | Jason Abohwo, Thomas Mosen | 2025-09-02 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (99%) | - |
| Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video | Robert Graham, Sonia Joseph, Sebastian Lapuschkin, Yash Vadi | 2025-04-28 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (98%) | 11 |
| Progress on Attention | Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry | Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba | 2025-03-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 34 |
| Propositional Interpretability in Artificial Intelligence | David J. Chalmers | 2025-01-27 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (94%) | 13 |
| Prototype Transformer: Towards Language Model Architectures Interpretable by Design | Chang Qi, Matteo Forasassi, Amine M'Charrak, Yordan Yordanov | 2026-02-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching | Farnoush Rezaei Jafari, Ashkan Khakzar, Oliver Eberle, Neel Nanda | 2025-08-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 4 |
| Representation Learning on a Random Lattice | Aryeh Brill | 2025-04-28 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E6 / R3 (96%) | 1 |