Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 421-450 of 470 papers (page 15 of 16)

Paper | Intel
Understanding the Role of Individual Units in a Deep Neural Network

David Bau, Agata Lapedriza, Bolei Zhou, Jun-Yan Zhu

Published: 2020-09-10 | Area: Mechanistic Interp. | Citations: 505

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Compositional Explanations of Neurons

Jacob Andreas, Jesse Mu

Published: 2020-06-24 | Area: Mechanistic Interp. | Citations: 206

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)

A Mathematical Framework for Transformer Circuits

Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (95%)

A Pragmatic Vision for Interpretability

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, position, interpretability

-

A Toy Model of Mechanistic (Un)Faithfulness

Chris Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (93%)

Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Jaehyuk Lim, Marco Molinari, Davide Ghilardi, Federico Belotti

Published: - | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (96%)

Attribution Patching: Activation Patching At Industrial Scale

Neel Nanda

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)

Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses

Adrià Garriga-Alonso, Buck Shlegeris, Lawrence Chan, Nate Thomas

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)

Circuit Tracing: Revealing Computational Graphs in Language Models

Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)

Circuits Updates - April 2025

Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (95%)

Cracking the Circuits: Mechanistic Interpretability in Large Language Models

Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)

Curve Detectors

Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)

Explaining AI through mechanistic interpretability

Lena Kästner, Barnaby Crook

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level

János Kramár, Rohin Shah, Senthooran Rajamanoharan, Neel Nanda

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

-

From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models

Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams

Published: - | Area: Mechanistic Interp. | Citations: 32

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)

Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3

Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (98%)

Goodfire Ember: Scaling Interpretability for Frontier Model Alignment

Eric Ho, Curt Tigges, Thomas McGrath, Max Loeffler

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: alignment-training, mechanistic-interp, ai-safety, tool, interpretability

E4 / R3 (99%)

How Can Interpretability Researchers Help AGI Go Well?

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: alignment-training, mechanistic-interp, ai-safety, position, interpretability

-

In-context Learning and Induction Heads

Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai

Published: - | Area: Mechanistic Interp. | Citations: 751

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Insights on Crosscoder Model Diffing

Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Interpreting GPT: The Logit Lens

nostalgebraist

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)

Language Models Can Explain Neurons in Language Models

Steven Bills, Gabriel Goh, Jan Leike, Henk Tillman

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R2 (96%)

Multimodal Neurons in Artificial Neural Networks

Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata

Published: - | Area: Mechanistic Interp. | Citations: 390

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R4 (96%)

Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research

Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

-

Neuronpedia: Interactive SAE Feature Explorer

Johnny Lin

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

E6 / R3 (95%)

On the Biology of a Large Language Model

Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (96%)

Privileged Bases in the Transformer Residual Stream

Robert Lasenby, Christopher Olah, Nelson Elhage

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)

Progress on Attention

Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)

SAELens: A Library for Training and Analyzing Sparse Autoencoders

David Chanin, Curt Tigges, Joseph Bloom, Anthony Duong

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (97%)

SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)