Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Explaining Black Box Text Modules in Natural Language with Language Models | Alexander G. Huth, Bin Yu, Richard Antonello, Shailee Jain | 2023-05-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | 66 |
| Interpretability at Scale: Identifying Causal Mechanisms in Alpaca | Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman | 2023-05-15 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability | E4 / R3 (95%) | 113 |
| A Technical Note on Bilinear Layers for Interpretability | Lee Sharkey | 2023-05-05 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (92%) | 10 |
| Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability | Eric Gan, Ziming Liu, Max Tegmark | 2023-05-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 52 |
| How does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-trained Language Model | Ollie Liu, Michael Hanna, Alexandre Variengien | 2023-04-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 190 |
| Dissecting Recall of Factual Associations in Auto-Regressive Language Models | Mor Geva, Amir Globerson, Jasmijn Bastings, Katja Filippova | 2023-04-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 440 |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adria Garriga-Alonso | 2023-04-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 485 |
| N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models | Fazl Barez, Alex Foote, Ionnis Konstas, Esben Kran | 2023-04-22 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (96%) | 4 |
| Disentangling Neuron Representations with Concept Vectors | Henning Muller, Vincent Andrearczyk, Laura O'Mahony, Mara Graziani | 2023-04-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 25 |
| Localizing Model Behavior with Path Patching | Aryaman Arora, Chris MacLeod, Nicholas Goldowsky-Dill, Lucas Sato | 2023-04-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 130 |
| Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations | Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman | 2023-03-05 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E4 / R2 (94%) | 147 |
| Analyzing And Editing Inner Mechanisms of Backdoored Language Models | Anka Reuel, Max Lamparth | 2023-02-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (95%) | 15 |
| A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations | Lawrence Chan, Bilal Chughtai, Neel Nanda | 2023-02-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 135 |
| Tracr: Compiled Transformers as a Laboratory for Interpretability | Vladimir Mikulik, Thomas McGrath, Matthew Rahtz, Janos Kramar | 2023-01-12 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R4 (95%) | 91 |
| Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability | Christopher Potts, Maheep Chaudhary, Aryaman Arora, Thomas Icard | 2023-01-11 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | 118 |
| Interpreting Neural Networks through the Polytope Lens | Kip Parker, Jacob Merizian, Carlos Ramón Guevara, Beren Millidge | 2022-11-22 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 36 |
| Engineering Monosemanticity in Toy Models | Nicholas Schiefer, Adam S. Jermyn, Evan Hubinger | 2022-11-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 15 |
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small | Kevin Wang, Buck Shlegeris, Alexandre Variengien, Arthur Conmy | 2022-11-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation | E5 / R3 (96%) | 834 |
| Polysemanticity and Capacity in Neural Networks | Buck Shlegeris, Kshitij Sachan, Joe Benton, Adam S. Jermyn | 2022-10-04 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (92%) | 52 |
| Analyzing Transformers in Embedding Space | Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant | 2022-09-06 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R3 (96%) | 127 |
| LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models | Bar Tamir, Guy Dar, Yoav Goldberg, Micah Shlain | 2022-04-26 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E4 / R3 (95%) | 32 |
| CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks | Tsui-Wei Weng, Tuomas Oikarinen | 2022-04-23 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E6 / R3 (96%) | 130 |
| Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space | Yoav Goldberg, Avi Caciularu, Mor Geva, Kevin Ro Wang | 2022-03-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 485 |
| Natural Language Descriptions of Deep Visual Features | David Bau, Jacob Andreas, Teona Bagashvili, Sarah Schwettmann | 2022-01-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 149 |
| Sparse Interventions in Language Models with Differentiable Masking | Nicola De Cao, Dieuwke Hupkes, Ivan Titov, Leon Schmid | 2021-12-13 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 33 |
| Causal Distillation for Language Models | Christopher Potts, Elisa Kreiss, Hanson Lu, Thomas Icard | 2021-12-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (97%) | 29 |
| Inducing Causal Structure for Interpretable Neural Networks | Christopher Potts, Elisa Kreiss, Josh Rozner, Hanson Lu | 2021-12-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (97%) | 95 |
| Knowledge Neurons in Pretrained Transformers | Furu Wei, Zhifang Sui, Li Dong, Damai Dai | 2021-04-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 601 |
| Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors | Yubei Chen, Yann LeCun, Zeyu Yun, Bruno A Olshausen | 2021-03-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (93%) | 113 |
| Towards Falsifiable Interpretability Research | Ari S. Morcos, Matthew L. Leavitt | 2020-10-22 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E6 / R4 (97%) | 74 |