Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 391-420 of 470 papers (page 14 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Explaining Black Box Text Modules in Natural Language with Language Models	Alexander G. Huth, Bin Yu, Richard Antonello, Shailee Jain	2023-05-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	66
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca	Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman	2023-05-15	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability	E4 / R3 (95%)	113
A Technical Note on Bilinear Layers for Interpretability	Lee Sharkey	2023-05-05	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (92%)	10
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability	Eric Gan, Ziming Liu, Max Tegmark	2023-05-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	52
How does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-trained Language Model	Ollie Liu, Michael Hanna, Alexandre Variengien	2023-04-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	190
Dissecting Recall of Factual Associations in Auto-Regressive Language Models	Mor Geva, Amir Globerson, Jasmijn Bastings, Katja Filippova	2023-04-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	440
Towards Automated Circuit Discovery for Mechanistic Interpretability	Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adria Garriga-Alonso	2023-04-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	485
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models	Fazl Barez, Alex Foote, Ioannis Konstas, Esben Kran	2023-04-22	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (96%)	4
Disentangling Neuron Representations with Concept Vectors	Henning Muller, Vincent Andrearczyk, Laura O'Mahony, Mara Graziani	2023-04-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	25
Localizing Model Behavior with Path Patching	Aryaman Arora, Chris MacLeod, Nicholas Goldowsky-Dill, Lucas Sato	2023-04-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	130
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations	Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman	2023-03-05	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E4 / R2 (94%)	147
Analyzing And Editing Inner Mechanisms of Backdoored Language Models	Anka Reuel, Max Lamparth	2023-02-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (95%)	15
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations	Lawrence Chan, Bilal Chughtai, Neel Nanda	2023-02-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	135
Tracr: Compiled Transformers as a Laboratory for Interpretability	Vladimir Mikulik, Thomas McGrath, Matthew Rahtz, Janos Kramar	2023-01-12	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R4 (95%)	91
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability	Christopher Potts, Maheep Chaudhary, Aryaman Arora, Thomas Icard	2023-01-11	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	118
Interpreting Neural Networks through the Polytope Lens	Kip Parker, Jacob Merizian, Carlos Ramón Guevara, Beren Millidge	2022-11-22	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	36
Engineering Monosemanticity in Toy Models	Nicholas Schiefer, Adam S. Jermyn, Evan Hubinger	2022-11-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	15
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small	Kevin Wang, Buck Shlegeris, Alexandre Variengien, Arthur Conmy	2022-11-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation	E5 / R3 (96%)	834
Polysemanticity and Capacity in Neural Networks	Buck Shlegeris, Kshitij Sachan, Joe Benton, Adam S. Jermyn	2022-10-04	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (92%)	52
Analyzing Transformers in Embedding Space	Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant	2022-09-06	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R3 (96%)	127
LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models	Bar Tamir, Guy Dar, Yoav Goldberg, Micah Shlain	2022-04-26	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E4 / R3 (95%)	32
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks	Tsui-Wei Weng, Tuomas Oikarinen	2022-04-23	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E6 / R3 (96%)	130
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space	Yoav Goldberg, Avi Caciularu, Mor Geva, Kevin Ro Wang	2022-03-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	485
Natural Language Descriptions of Deep Visual Features	David Bau, Jacob Andreas, Teona Bagashvili, Sarah Schwettmann	2022-01-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	149
Sparse Interventions in Language Models with Differentiable Masking	Nicola De Cao, Dieuwke Hupkes, Ivan Titov, Leon Schmid	2021-12-13	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	33
Causal Distillation for Language Models	Christopher Potts, Elisa Kreiss, Hanson Lu, Thomas Icard	2021-12-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (97%)	29
Inducing Causal Structure for Interpretable Neural Networks	Christopher Potts, Elisa Kreiss, Josh Rozner, Hanson Lu	2021-12-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (97%)	95
Knowledge Neurons in Pretrained Transformers	Furu Wei, Zhifang Sui, Li Dong, Damai Dai	2021-04-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	601
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors	Yubei Chen, Yann LeCun, Zeyu Yun, Bruno A Olshausen	2021-03-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (93%)	113
Towards Falsifiable Interpretability Research	Ari S. Morcos, Matthew L. Leavitt	2020-10-22	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E6 / R4 (97%)	74