Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 391-420 of 470 papers (page 14 of 16)

Paper | Intel
Explaining Black Box Text Modules in Natural Language with Language Models

Alexander G. Huth, Bin Yu, Richard Antonello, Shailee Jain

Published: 2023-05-17 | Area: Mechanistic Interp. | Citations: 66

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman

Published: 2023-05-15 | Area: Mechanistic Interp. | Citations: 113

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability

E4 / R3 (95%)
A Technical Note on Bilinear Layers for Interpretability

Lee Sharkey

Published: 2023-05-05 | Area: Mechanistic Interp. | Citations: 10

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (92%)
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Eric Gan, Ziming Liu, Max Tegmark

Published: 2023-05-04 | Area: Mechanistic Interp. | Citations: 52

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
How does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-trained Language Model

Ollie Liu, Michael Hanna, Alexandre Variengien

Published: 2023-04-30 | Area: Mechanistic Interp. | Citations: 190

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Dissecting Recall of Factual Associations in Auto-Regressive Language Models

Mor Geva, Amir Globerson, Jasmijn Bastings, Katja Filippova

Published: 2023-04-28 | Area: Mechanistic Interp. | Citations: 440

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Towards Automated Circuit Discovery for Mechanistic Interpretability

Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adria Garriga-Alonso

Published: 2023-04-28 | Area: Mechanistic Interp. | Citations: 485

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Fazl Barez, Alex Foote, Ioannis Konstas, Esben Kran

Published: 2023-04-22 | Area: Mechanistic Interp. | Citations: 4

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (96%)
Disentangling Neuron Representations with Concept Vectors

Henning Muller, Vincent Andrearczyk, Laura O'Mahony, Mara Graziani

Published: 2023-04-19 | Area: Mechanistic Interp. | Citations: 25

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Localizing Model Behavior with Path Patching

Aryaman Arora, Chris MacLeod, Nicholas Goldowsky-Dill, Lucas Sato

Published: 2023-04-12 | Area: Mechanistic Interp. | Citations: 130

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

Christopher Potts, Thomas Icard, Zhengxuan Wu, Noah D. Goodman

Published: 2023-03-05 | Area: Mechanistic Interp. | Citations: 147

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E4 / R2 (94%)
Analyzing And Editing Inner Mechanisms of Backdoored Language Models

Anka Reuel, Max Lamparth

Published: 2023-02-24 | Area: Mechanistic Interp. | Citations: 15

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (95%)
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

Lawrence Chan, Bilal Chughtai, Neel Nanda

Published: 2023-02-06 | Area: Mechanistic Interp. | Citations: 135

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Tracr: Compiled Transformers as a Laboratory for Interpretability

Vladimir Mikulik, Thomas McGrath, Matthew Rahtz, Janos Kramar

Published: 2023-01-12 | Area: Mechanistic Interp. | Citations: 91

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R4 (95%)
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Christopher Potts, Maheep Chaudhary, Aryaman Arora, Thomas Icard

Published: 2023-01-11 | Area: Mechanistic Interp. | Citations: 118

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
Interpreting Neural Networks through the Polytope Lens

Kip Parker, Jacob Merizian, Carlos Ramón Guevara, Beren Millidge

Published: 2022-11-22 | Area: Mechanistic Interp. | Citations: 36

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Engineering Monosemanticity in Toy Models

Nicholas Schiefer, Adam S. Jermyn, Evan Hubinger

Published: 2022-11-16 | Area: Mechanistic Interp. | Citations: 15

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small

Kevin Wang, Buck Shlegeris, Alexandre Variengien, Arthur Conmy

Published: 2022-11-01 | Area: Mechanistic Interp. | Citations: 834

Tags: empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation

E5 / R3 (96%)
Polysemanticity and Capacity in Neural Networks

Buck Shlegeris, Kshitij Sachan, Joe Benton, Adam S. Jermyn

Published: 2022-10-04 | Area: Mechanistic Interp. | Citations: 52

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (92%)
Analyzing Transformers in Embedding Space

Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant

Published: 2022-09-06 | Area: Mechanistic Interp. | Citations: 127

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R3 (96%)
LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models

Bar Tamir, Guy Dar, Yoav Goldberg, Micah Shlain

Published: 2022-04-26 | Area: Mechanistic Interp. | Citations: 32

Tags: mechanistic-interp, ai-safety, tool

E4 / R3 (95%)
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks

Tsui-Wei Weng, Tuomas Oikarinen

Published: 2022-04-23 | Area: Mechanistic Interp. | Citations: 130

Tags: mechanistic-interp, ai-safety, tool

E6 / R3 (96%)
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Yoav Goldberg, Avi Caciularu, Mor Geva, Kevin Ro Wang

Published: 2022-03-28 | Area: Mechanistic Interp. | Citations: 485

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Natural Language Descriptions of Deep Visual Features

David Bau, Jacob Andreas, Teona Bagashvili, Sarah Schwettmann

Published: 2022-01-26 | Area: Mechanistic Interp. | Citations: 149

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Sparse Interventions in Language Models with Differentiable Masking

Nicola De Cao, Dieuwke Hupkes, Ivan Titov, Leon Schmid

Published: 2021-12-13 | Area: Mechanistic Interp. | Citations: 33

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Causal Distillation for Language Models

Christopher Potts, Elisa Kreiss, Hanson Lu, Thomas Icard

Published: 2021-12-05 | Area: Mechanistic Interp. | Citations: 29

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (97%)
Inducing Causal Structure for Interpretable Neural Networks

Christopher Potts, Elisa Kreiss, Josh Rozner, Hanson Lu

Published: 2021-12-01 | Area: Mechanistic Interp. | Citations: 95

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (97%)
Knowledge Neurons in Pretrained Transformers

Furu Wei, Zhifang Sui, Li Dong, Damai Dai

Published: 2021-04-18 | Area: Mechanistic Interp. | Citations: 601

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

Yubei Chen, Yann LeCun, Zeyu Yun, Bruno A Olshausen

Published: 2021-03-29 | Area: Mechanistic Interp. | Citations: 113

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (93%)
Towards Falsifiable Interpretability Research

Ari S. Morcos, Matthew L. Leavitt

Published: 2020-10-22 | Area: Mechanistic Interp. | Citations: 74

Tags: mechanistic-interp, ai-safety, position, interpretability

E6 / R4 (97%)