Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 421-450 of 470 papers (page 15 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Understanding the Role of Individual Units in a Deep Neural Network	David Bau, Agata Lapedriza, Bolei Zhou, Jun-Yan Zhu	2020-09-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	505
Compositional Explanations of Neurons	Jacob Andreas, Jesse Mu	2020-06-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	206
A Mathematical Framework for Transformer Circuits	Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai	-	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
A Pragmatic Vision for Interpretability	Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan	-	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	-	-
A Toy Model of Mechanistic (Un)Faithfulness	Chris Olah	-	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models	Jaehyuk Lim, Marco Molinari, Davide Ghilardi, Federico Belotti	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (96%)	2
Attribution Patching: Activation Patching At Industrial Scale	Neel Nanda	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	-
Causal Scrubbing: A Method for Rigorously Testing Interpretability Hypotheses	Adrià Garriga-Alonso, Buck Shlegeris, Lawrence Chan, Nate Thomas	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	-
Circuit Tracing: Revealing Computational Graphs in Language Models	Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Circuits Updates - April 2025	Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (95%)	-
Cracking the Circuits: Mechanistic Interpretability in Large Language Models	Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	-
Curve Detectors	Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Explaining AI through mechanistic interpretability	Lena Kästner, Barnaby Crook	-	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	-
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level	János Kramár, Rohin Shah, Senthooran Rajamanoharan, Neel Nanda	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	-	-
From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models	Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	32
Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3	Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (98%)	-
Goodfire Ember: Scaling Interpretability for Frontier Model Alignment	Eric Ho, Curt Tigges, Thomas McGrath, Max Loeffler	-	Mechanistic Interp.	alignment-training, mechanistic-interp, ai-safety, tool, interpretability	E4 / R3 (99%)	-
How Can Interpretability Researchers Help AGI Go Well?	Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan	-	Mechanistic Interp.	alignment-training, mechanistic-interp, ai-safety, position, interpretability	-	-
In-context Learning and Induction Heads	Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	751
Insights on Crosscoder Model Diffing	Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Interpreting GPT: The Logit Lens	nostalgebraist	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Language Models Can Explain Neurons in Language Models	Steven Bills, Gabriel Goh, Jan Leike, Henk Tillman	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R2 (96%)	-
Multimodal Neurons in Artificial Neural Networks	Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R4 (96%)	390
Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research	Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	-	-
Neuronpedia: Interactive SAE Feature Explorer	Johnny Lin	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E6 / R3 (95%)	-
On the Biology of a Large Language Model	Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (96%)	-
Privileged Bases in the Transformer Residual Stream	Robert Lasenby, Christopher Olah, Nelson Elhage	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	-
Progress on Attention	Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
SAELens: A Library for Training and Analyzing Sparse Autoencoders	David Chanin, Curt Tigges, Joseph Bloom, Anthony Duong	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (97%)	-
SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models	Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-