Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 361-390 of 470 papers (page 13 of 16)

Paper | Intel
In-Context Learning Creates Task Vectors

Roee Hendel, Mor Geva, Amir Globerson

Published: 2023-10-24 | Area: Mechanistic Interp. | Citations: 258

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Function Vectors in Large Language Models

David Bau, Arnab Sen Sharma, Millicent L. Li, Aaron Mueller

Published: 2023-10-23 | Area: Mechanistic Interp. | Citations: 201

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems

Nina Miolane, David Klindt, Sophia Sanborn, Frédéric Poitevin

Published: 2023-10-17 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Attribution Patching Outperforms Automated Circuit Discovery

Can Rager, Aaquib Syed, Arthur Conmy

Published: 2023-10-16 | Area: Mechanistic Interp. | Citations: 108

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Circuit Component Reuse Across Tasks in Transformer Language Models

Ellie Pavlick, Carsten Eickhoff, Jack Merullo

Published: 2023-10-12 | Area: Mechanistic Interp. | Citations: 99

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (96%)
Interpreting Learned Feedback Patterns in Large Language Models

Philip Torr, Rauno Arike, Fazl Barez, Luke Marks

Published: 2023-10-12 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Understanding and Controlling a Maze-Solving Policy Network

Austin Meek, Ulisse Mini, Alexander Matt Turner, Monte MacDiarmid

Published: 2023-10-12 | Area: Mechanistic Interp. | Citations: 22

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Can Rager, Jett Janiak, James Dao, Yeu-Tong Lau

Published: 2023-10-11 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (95%)
The Importance of Prompt Tuning for Automated Neuron Explanations

Tsui-Wei Weng, Yilan Chen, Tuomas Oikarinen, Arjun Chatha

Published: 2023-10-09 | Area: Mechanistic Interp. | Citations: 11

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Copy Suppression: Comprehensively Understanding an Attention Head

Thomas McGrath, Callum McDougall, Arthur Conmy, Neel Nanda

Published: 2023-10-06 | Area: Mechanistic Interp. | Citations: 56

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Discovering Knowledge-Critical Subnetworks in Pretrained Language Models

Gail Weiss, Zeming Chen, Antoine Bosselut, Deniz Bayazit

Published: 2023-10-04 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Efficient Streaming Language Models with Attention Sinks

Yuandong Tian, Guangxuan Xiao, Beidi Chen, Mike Lewis

Published: 2023-09-29 | Area: Mechanistic Interp. | Citations: 1422

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Fred Zhang, Neel Nanda

Published: 2023-09-27 | Area: Mechanistic Interp. | Citations: 193

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
Rigorously Assessing Natural Language Explanations of Neurons

Christopher Potts, Karel D'Oosterlinck, Zhengxuan Wu, Jing Huang

Published: 2023-09-19 | Area: Mechanistic Interp. | Citations: 41

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (95%)
Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Robert Huben, Aidan Ewart, Lee Sharkey

Published: 2023-09-15 | Area: Mechanistic Interp. | Citations: 881

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (94%)
Uncovering Mesa-Optimization Algorithms in Transformers

Johannes von Oswald, Alexander Meulemans, Mark Sandler, Blaise Agüera y Arcas

Published: 2023-09-11 | Area: Mechanistic Interp. | Citations: 86

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Neurons in Large Language Models: Dead, N-gram, Positional

Javier Ferrando, Elena Voita, Christoforos Nalmpantis

Published: 2023-09-09 | Area: Mechanistic Interp. | Citations: 75

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
FIND: A Function Description Benchmark for Evaluating Interpretability Methods

David Bau, Jacob Andreas, Shuang Li, Neil Chowdhury

Published: 2023-09-07 | Area: Mechanistic Interp. | Citations: 32

Tags: mechanistic-interp, ai-safety, interpretability, benchmark

E4 / R3 (94%)
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Vedant Palit, Aryaman Arora, Rohan Pandey, Paul Pu Liang

Published: 2023-08-27 | Area: Mechanistic Interp. | Citations: 47

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (96%)
The Hydra Effect: Emergent Self-repair in Language Model Computations

Vladimir Mikulik, Shane Legg, Thomas McGrath, Matthew Rahtz

Published: 2023-07-28 | Area: Mechanistic Interp. | Citations: 96

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
On Privileged and Convergent Bases in Neural Network Representations

Yamini Bansal, Nikhil Vyas, Davis Brown

Published: 2023-07-24 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation

Rylan Schaeffer, Arnuv Tandon, Dhruv Pai, Andres Carranza

Published: 2023-07-20 | Area: Mechanistic Interp. | Citations: 2

Tags: mechanistic-interp, ai-safety, adversarial-robustness, tool, safety-evaluation

E5 / R3 (94%)
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Rohin Shah, Geoffrey Irving, Vladimir Mikulik, Matthew Rahtz

Published: 2023-07-18 | Area: Mechanistic Interp. | Citations: 144

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Overthinking the Truth: Understanding How Language Models Process False Demonstrations

Jean-Stanislas Denain, Danny Halawi, Jacob Steinhardt

Published: 2023-07-18 | Area: Mechanistic Interp. | Citations: 74

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)
Discovering Variable Binding Circuitry with Desiderata

David Bau, Xander Davies, Max Nadeau, Nikhil Prakash

Published: 2023-07-07 | Area: Mechanistic Interp. | Citations: 22

Tags: empirical, mechanistic-interp, ai-safety

E4 / R2 (94%)
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

Jacob Andreas, Ziqian Zhong, Ziming Liu, Max Tegmark

Published: 2023-06-30 | Area: Mechanistic Interp. | Citations: 145

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Learning Transformer Programs

Alexander Wettig, Dan Friedman, Danqi Chen

Published: 2023-06-01 | Area: Mechanistic Interp. | Citations: 48

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Neuron to Graph: Interpreting Language Model Neurons at Scale

Fazl Barez, Alex Foote, Shay B. Cohen, Ioannis Konstas

Published: 2023-05-31 | Area: Mechanistic Interp. | Citations: 28

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (94%)
Language Models Implement Simple Word2Vec-style Vector Arithmetic

Ellie Pavlick, Carsten Eickhoff, Jack Merullo

Published: 2023-05-25 | Area: Mechanistic Interp. | Citations: 86

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis

Alessandro Stolfo, Mrinmaya Sachan, Yonatan Belinkov

Published: 2023-05-24 | Area: Mechanistic Interp. | Citations: 71

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)