Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Transcoders Find Interpretable LLM Feature Circuits | Philippe Chlenski, Jacob Dunefsky, Neel Nanda | 2024-06-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (95%) | 102 |
| Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | Ellie Pavlick, Carsten Eickhoff, Jack Merullo | 2024-06-13 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 36 |
| Scaling and Evaluating Sparse Autoencoders | Gabriel Goh, Jan Leike, Jeffrey Wu, Henk Tillman | 2024-06-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (96%) | 334 |
| The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision | Liv Gorton | 2024-06-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 31 |
| From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation | Michael Eickenberg, Geraldin Nanfack, Eugene Belilovsky | 2024-06-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E6 / R3 (95%) | 1 |
| Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience | David Poeppel, Martina G. Vilas, Gemma Roig, Federico Adolfi | 2024-06-03 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 10 |
| Evidence of Learned Look-Ahead in a Chess-Playing Neural Network | Vasil Georgiev, Cameron Allen, Scott Emmons, Erik Jenner | 2024-06-02 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 25 |
| Knowledge Circuits in Pretrained Transformers | Zekun Xi, Huajun Chen, Ziwen Xu, Ningyu Zhang | 2024-05-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 44 |
| From Neurons to Neutrons: A Case Study in Interpretability | Ouail Kitouni, Mike Williams, Sokratis Trifinopoulos, Niklas Nolte | 2024-05-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 4 |
| Not All Language Model Features Are Linear | Wes Gurnee, Isaac Liao, Joshua Engels, Eric J. Michaud | 2024-05-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (94%) | 106 |
| Automatically Identifying Local and Global Circuits with Linear Computation Graphs | Junxuan Wang, Xuyang Ge, Fukang Zhu, Zhengfu He | 2024-05-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 20 |
| Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models | Thang Bui, Charles O'Neill | 2024-05-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 11 |
| Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning | Nicholas Goldowsky-Dill, Dan Braun, Jordan Taylor, Lee Sharkey | 2024-05-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 57 |
| The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks | Marius Hobbhahn, Magdalena Wache, Jörn Stöhler, Stefan Heimersheim | 2024-05-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (93%) | 6 |
| Using Degeneracy in the Loss Landscape for Mechanistic Interpretability | Marius Hobbhahn, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun | 2024-05-17 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 11 |
| Learnable Privacy Neurons Localization in Language Models | Zuozhu Liu, Tianxiang Hu, Yang Feng, Ruizhe Chen | 2024-05-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (94%) | 30 |
| Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control | Georg Lange, Aleksandar Makelov, Neel Nanda | 2024-05-14 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E6 / R3 (95%) | 66 |
| How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability | Jorge García-Carrasco, Juan Trujillo, Alejandro Maté | 2024-05-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (95%) | 13 |
| Improving Dictionary Learning with Gated Sparse Autoencoders | Vikrant Varma, Rohin Shah, Lewis Smith, Tom Lieberum | 2024-04-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 138 |
| Let's Think Dot by Dot: Hidden Computation in Transformer Language Models | William Merrill, Samuel R. Bowman, Jacob Pfau | 2024-04-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 145 |
| How to Use and Interpret Activation Patching | Stefan Heimersheim, Neel Nanda | 2024-04-23 | Mechanistic Interp. | mechanistic-interp, ai-safety, survey, interpretability | E6 / R4 (95%) | 109 |
| Automatic Discovery of Visual Circuits | Jacob Andreas, Neil Chowdhury, Sarah Schwettmann, Achyuta Rajaram | 2024-04-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (93%) | 10 |
| MAIA: A Multimodal Automated Interpretability Agent | Franklin Wang, Jacob Andreas, Sarah Schwettmann, Tamar Rott Shaham | 2024-04-22 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E6 / R4 (95%) | 45 |
| LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models | Javier Ferrando, Karen Hambardzumyan, Elena Voita, Igor Tufanov | 2024-04-10 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (95%) | 15 |
| Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models | David Bau, Can Rager, Samuel Marks, Aaron Mueller | 2024-03-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 270 |
| Mechanisms of Non-Factual Hallucination in Language Models | Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong | 2024-03-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (95%) | 38 |
| Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms | Michael Hanna, Sandro Pezzelle, Yonatan Belinkov | 2024-03-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 90 |
| pyvene: A Library for Understanding and Improving PyTorch Models via Interventions | Christopher Potts, Aryaman Arora, Christopher D. Manning, Zhengxuan Wu | 2024-03-12 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (96%) | 44 |
| AtP*: An efficient and scalable method for localizing LLM behaviour to components | János Kramár, Rohin Shah, Tom Lieberum, Neel Nanda | 2024-03-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 71 |
| How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning | Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty | 2024-02-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 54 |