Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 121-150 of 470 papers (page 5 of 16)

Paper | Intel
Learning Multi-Level Features with Matryoshka Sparse Autoencoders

Noa Nabeshima, Adam Karvonen, Bart Bussmann, Neel Nanda

Published: 2025-03-21 | Area: Mechanistic Interp. | Citations: 63

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Low-Rank Adapting Models for Sparse Autoencoders

Joshua Engels, Matthew Chen, Max Tegmark

Published: 2025-01-31 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
MIB: A Mechanistic Interpretability Benchmark

David Bau, Michael Hanna, Sarah Wiegreffe, Alessandro Stolfo

Published: 2025-04-17 | Area: Mechanistic Interp. | Citations: 16

Tags: mechanistic-interp, ai-safety, interpretability, benchmark

E6 / R3 (97%)
Mapping Faithful Reasoning in Language Models

Andreas Damianou, J Rosser, Konstantina Palla, José Luis Redondo García

Published: 2025-10-25 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Measuring Sparse Autoencoder Feature Sensitivity

Nathan Hu, Katherine Tian, Claire Tian

Published: 2025-09-28 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (94%)
Measuring and Guiding Monosemanticity

Stephan Wäldchen, Manuel Brack, Björn Deiseroth, Ruben Härle

Published: 2025-06-24 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (96%)
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen, Liangming Pan, Yuzhang Luo

Published: 2026-01-29 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Lakshmi Babu-Saheer, Mohammed Abu Baker

Published: 2025-08-19 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
Mechanistic Interpretability Needs Philosophy

Nina Rajcic, Iwan Williams, Ninell Oldenburg, Filippos Stamatiou

Published: 2025-06-23 | Area: Mechanistic Interp. | Citations: 3

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

Maxime Peyrard, François Portet, Maxime Méloux

Published: 2025-10-01 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
Mechanistic Interpretability for Steering Vision-Language-Action Models

Ian Chuang, Bear Häon, Kaylene Stocking, Claire Tomlin

Published: 2025-08-30 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
Mechanistic understanding and validation of large AI models with SemanticLens

Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek

Published: 2025-01-09 | Area: Mechanistic Interp. | Citations: 29

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (95%)
Mixture of Experts Made Intrinsically Interpretable

Philip Torr, Adel Bibi, Constantin Venhoff, Puneet K. Dokania

Published: 2025-03-05 | Area: Mechanistic Interp. | Citations: 12

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Modular Training of Neural Networks aids Interpretability

Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots

Published: 2025-02-04 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research

Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

-
NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment

Jean-Noël Vittaut, Sarath Chandar, Marie-Jeanne Lesot, Nicolas Chesneau

Published: 2025-06-10 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R3 (95%)
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification

Mohammad Mahdi Khalili, Vishnu Kabir Chhabra, Ding Zhu

Published: 2025-02-27 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
On the Biology of a Large Language Model

Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (96%)
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

Alexey Dontsov, Ivan Oseledets, Anton Korznikov, Oleg Y. Rogov

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (95%)
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization

Lijie Hu, Huanyi Xie, Shu Yang, Di Wang

Published: 2025-10-27 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Position-aware Automatic Circuit Discovery

David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad

Published: 2025-02-07 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Yujia Zheng, Kun Zhang, Mona T. Diab, Xiangchen Song

Published: 2025-05-26 | Area: Mechanistic Interp. | Citations: 6

Tags: mechanistic-interp, ai-safety, position, interpretability, safety-evaluation

E5 / R3 (95%)
Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function

Jason Abohwo, Thomas Mosen

Published: 2025-09-02 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (99%)
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

Robert Graham, Sonia Joseph, Sebastian Lapuschkin, Yash Vadi

Published: 2025-04-28 | Area: Mechanistic Interp. | Citations: 11

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (98%)
Progress on Attention

Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba

Published: 2025-03-03 | Area: Mechanistic Interp. | Citations: 34

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
Propositional Interpretability in Artificial Intelligence

David J. Chalmers

Published: 2025-01-27 | Area: Mechanistic Interp. | Citations: 13

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (94%)
Prototype Transformer: Towards Language Model Architectures Interpretable by Design

Chang Qi, Matteo Forasassi, Amine M'Charrak, Yordan Yordanov

Published: 2026-02-12 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

Farnoush Rezaei Jafari, Ashkan Khakzar, Oliver Eberle, Neel Nanda

Published: 2025-08-28 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Representation Learning on a Random Lattice

Aryeh Brill

Published: 2025-04-28 | Area: Mechanistic Interp. | Citations: 1

Tags: theoretical, mechanistic-interp, ai-safety

E6 / R3 (96%)