Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Understanding the Role of Individual Units in a Deep Neural Network | David Bau, Agata Lapedriza, Bolei Zhou, Jun-Yan Zhu | 2020-09-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 505 |
| Compositional Explanations of Neurons | Jacob Andreas, Jesse Mu | 2020-06-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | 206 |
| A Mathematical Framework for Transformer Circuits | Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai | - | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| A Pragmatic Vision for Interpretability | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | - | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | - | - |
| A Toy Model of Mechanistic (Un)Faithfulness | Chris Olah | - | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models | Jaehyuk Lim, Marco Molinari, Davide Ghilardi, Federico Belotti | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (96%) | 2 |
| Attribution Patching: Activation Patching At Industrial Scale | Neel Nanda | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | - |
| Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses | Adrià Garriga-Alonso, Buck Shlegeris, Lawrence Chan, Nate Thomas | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Circuit Tracing: Revealing Computational Graphs in Language Models | Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Circuits Updates - April 2025 | Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (95%) | - |
| Cracking the Circuits: Mechanistic Interpretability in Large Language Models | Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | - |
| Curve Detectors | Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Explaining AI through mechanistic interpretability | Lena Kästner, Barnaby Crook | - | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | - |
| Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level | János Kramár, Rohin Shah, Senthooran Rajamanoharan, Neel Nanda | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | - | - |
| From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models | Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 32 |
| Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3 | Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (98%) | - |
| Goodfire Ember: Scaling Interpretability for Frontier Model Alignment | Eric Ho, Curt Tigges, Thomas McGrath, Max Loeffler | - | Mechanistic Interp. | alignment-training, mechanistic-interp, ai-safety, tool, interpretability | E4 / R3 (99%) | - |
| How Can Interpretability Researchers Help AGI Go Well? | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | - | Mechanistic Interp. | alignment-training, mechanistic-interp, ai-safety, position, interpretability | - | - |
| In-context Learning and Induction Heads | Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 751 |
| Insights on Crosscoder Model Diffing | Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Interpreting GPT: The Logit Lens | nostalgebraist | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Language Models Can Explain Neurons in Language Models | Steven Bills, Gabriel Goh, Jan Leike, Henk Tillman | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R2 (96%) | - |
| Multimodal Neurons in Artificial Neural Networks | Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R4 (96%) | 390 |
| Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research | Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | - | - |
| Neuronpedia: Interactive SAE Feature Explorer | Johnny Lin | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E6 / R3 (95%) | - |
| On the Biology of a Large Language Model | Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (96%) | - |
| Privileged Bases in the Transformer Residual Stream | Robert Lasenby, Christopher Olah, Nelson Elhage | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | - |
| Progress on Attention | Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| SAELens: A Library for Training and Analyzing Sparse Autoencoders | David Chanin, Curt Tigges, Joseph Bloom, Anthony Duong | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (97%) | - |
| SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models | Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |