Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 271-300 of 470 papers (page 10 of 16)

Paper | Intel
Residual Stream Analysis with Multi-Layer SAEs

Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison

Published: 2024-09-06 | Area: Mechanistic Interp. | Citations: 12

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

Maheep Chaudhary, Atticus Geiger

Published: 2024-09-05 | Area: Mechanistic Interp. | Citations: 30

Tags: empirical, mechanistic-interp, ai-safety

E7 / R3 (97%)
On the Complexity of Neural Computation in Superposition

Nir Shavit, Micah Adler

Published: 2024-09-05 | Area: Mechanistic Interp. | Citations: 11

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Safety Layers in Aligned Large Language Models

Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li

Published: 2024-08-30 | Area: Mechanistic Interp. | Citations: 88

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E4 / R3 (95%)
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference

André Freitas, Marco Valentino, Geonhee Kim

Published: 2024-08-16 | Area: Mechanistic Interp. | Citations: 13

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)
Mathematical Models of Computation in Superposition

Lawrence Chan, Jake Mendel, Kaarel Hänni, Dmitry Vaintrob

Published: 2024-08-10 | Area: Mechanistic Interp. | Citations: 23

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Vikrant Varma, Rohin Shah, Nicolas Sonnerat, Lewis Smith

Published: 2024-08-09 | Area: Mechanistic Interp. | Citations: 254

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (96%)
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

David Bau, Can Rager, Samuel Marks, Adam Karvonen

Published: 2024-07-31 | Area: Mechanistic Interp. | Citations: 49

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Jorge García-Carrasco, Juan Trujillo, Alejandro Maté

Published: 2024-07-29 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability

E5 / R3 (93%)
Planning in a Recurrent Neural Network that Plays Sokoban

Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Philip Quirke

Published: 2024-07-22 | Area: Mechanistic Interp. | Citations: 11

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (95%)
Adversarial Circuit Evaluation

Niels uit de Bos, Adrià Garriga-Alonso

Published: 2024-07-21 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness, safety-evaluation

E6 / R3 (95%)
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Iván Arcuschin, Rohan Gupta, Adrià Garriga-Alonso, Thomas Kwa

Published: 2024-07-19 | Area: Mechanistic Interp. | Citations: 7

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E6 / R3 (98%)
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Vikrant Varma, Nicolas Sonnerat, Tom Lieberum, Senthooran Rajamanoharan

Published: 2024-07-19 | Area: Mechanistic Interp. | Citations: 194

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Relational Composition in Neural Networks: A Survey and Call to Action

Fernanda B. Viégas, Martin Wattenberg

Published: 2024-07-19 | Area: Mechanistic Interp. | Citations: 19

Tags: mechanistic-interp, ai-safety, survey

E6 / R3 (91%)
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach

Nils Palumbo, Ravi Mangal, Corina Păsăreanu, Saranya Vijayakumar

Published: 2024-07-18 | Area: Mechanistic Interp. | Citations: 2

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (95%)
NNsight and NDIF: Democratizing Access to Foundation Model Internals

Jonathan Bell, David Bau, Can Rager, Samuel Marks

Published: 2024-07-18 | Area: Mechanistic Interp. | Citations: 23

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (96%)
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Sonia Joseph, Irina Rish, Blake Richards, Karolis Jucys

Published: 2024-07-16 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R4 (97%)
LLM Circuit Analyses Are Consistent Across Training and Scale

Michael Hanna, Curt Tigges, Stella Biderman, Qinan Yu

Published: 2024-07-15 | Area: Mechanistic Interp. | Citations: 42

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Transformer Circuit Faithfulness Metrics are not Robust

Bilal Chughtai, William Saunders, Joseph Miller

Published: 2024-07-11 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (95%)
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

Aaron Mueller

Published: 2024-07-05 | Area: Mechanistic Interp. | Citations: 18

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (92%)
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning

Lei Yu, Jingcheng Niu, Zining Zhu, Gerald Penn

Published: 2024-07-04 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (95%)
The Remarkable Robustness of LLMs: Stages of Inference?

Wes Gurnee, Max Tegmark, Vedang Lad

Published: 2024-06-27 | Area: Mechanistic Interp. | Citations: 98

Tags: empirical, mechanistic-interp, ai-safety

E7 / R4 (94%)
A Closer Look into Mixture-of-Experts in Large Language Models

Jie Fu, Zeyu Huang, Ka Man Lo, Zili Wang

Published: 2024-06-26 | Area: Mechanistic Interp. | Citations: 28

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Interpreting Attention Layer Outputs with Sparse Autoencoders

Joseph Isaac Bloom, Arthur Conmy, Robert Krzyzanowski, Connor Kissane

Published: 2024-06-25 | Area: Mechanistic Interp. | Citations: 40

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Understanding Language Model Circuits through Knowledge Editing

Huaizhi Ge, Zining Zhu, Frank Rudzicz

Published: 2024-06-25 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Confidence Regulation Neurons in Language Models

Wes Gurnee, Xingyi Song, Alessandro Stolfo, Ben Wu

Published: 2024-06-24 | Area: Mechanistic Interp. | Citations: 45

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
Finding Transformer Circuits with Edge Pruning

Alexander Wettig, Adithya Bhaskar, Dan Friedman, Danqi Chen

Published: 2024-06-24 | Area: Mechanistic Interp. | Citations: 40

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (97%)
Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models

Jun Zhao, Zhuoran Jin, Pengfei Cao, Yubo Chen

Published: 2024-06-23 | Area: Mechanistic Interp. | Citations: 19

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Juanzi Li, Xiaozhi Wang, Jianhui Chen, Zijun Yao

Published: 2024-06-20 | Area: Mechanistic Interp. | Citations: 25

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R3 (94%)
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

Sohee Yang, Daniela Gottesman, Mor Geva, Amir Globerson

Published: 2024-06-18 | Area: Mechanistic Interp. | Citations: 75

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (97%)