Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 301-330 of 470 papers (page 11 of 16)

Paper | Intel
Transcoders Find Interpretable LLM Feature Circuits

Philippe Chlenski, Jacob Dunefsky, Neel Nanda

Published: 2024-06-17 | Area: Mechanistic Interp. | Citations: 102

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (95%)
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

Ellie Pavlick, Carsten Eickhoff, Jack Merullo

Published: 2024-06-13 | Area: Mechanistic Interp. | Citations: 36

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Scaling and Evaluating Sparse Autoencoders

Gabriel Goh, Jan Leike, Jeffrey Wu, Henk Tillman

Published: 2024-06-06 | Area: Mechanistic Interp. | Citations: 334

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (96%)
The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

Liv Gorton

Published: 2024-06-06 | Area: Mechanistic Interp. | Citations: 31

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation

Michael Eickenberg, Geraldin Nanfack, Eugene Belilovsky

Published: 2024-06-03 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E6 / R3 (95%)
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

David Poeppel, Martina G. Vilas, Gemma Roig, Federico Adolfi

Published: 2024-06-03 | Area: Mechanistic Interp. | Citations: 10

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Vasil Georgiev, Cameron Allen, Scott Emmons, Erik Jenner

Published: 2024-06-02 | Area: Mechanistic Interp. | Citations: 25

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Knowledge Circuits in Pretrained Transformers

Zekun Xi, Huajun Chen, Ziwen Xu, Ningyu Zhang

Published: 2024-05-28 | Area: Mechanistic Interp. | Citations: 44

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni, Mike Williams, Sokratis Trifinopoulos, Niklas Nolte

Published: 2024-05-27 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Not All Language Model Features Are Linear

Wes Gurnee, Isaac Liao, Joshua Engels, Eric J. Michaud

Published: 2024-05-23 | Area: Mechanistic Interp. | Citations: 106

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (94%)
Automatically Identifying Local and Global Circuits with Linear Computation Graphs

Junxuan Wang, Xuyang Ge, Fukang Zhu, Zhengfu He

Published: 2024-05-22 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Thang Bui, Charles O'Neill

Published: 2024-05-21 | Area: Mechanistic Interp. | Citations: 11

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Nicholas Goldowsky-Dill, Dan Braun, Jordan Taylor, Lee Sharkey

Published: 2024-05-17 | Area: Mechanistic Interp. | Citations: 57

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Marius Hobbhahn, Magdalena Wache, Jörn Stöhler, Stefan Heimersheim

Published: 2024-05-17 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (93%)
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Marius Hobbhahn, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun

Published: 2024-05-17 | Area: Mechanistic Interp. | Citations: 11

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Learnable Privacy Neurons Localization in Language Models

Zuozhu Liu, Tianxiang Hu, Yang Feng, Ruizhe Chen

Published: 2024-05-16 | Area: Mechanistic Interp. | Citations: 30

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (94%)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Georg Lange, Aleksandar Makelov, Neel Nanda

Published: 2024-05-14 | Area: Mechanistic Interp. | Citations: 66

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E6 / R3 (95%)
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability

Jorge García-Carrasco, Juan Trujillo, Alejandro Maté

Published: 2024-05-07 | Area: Mechanistic Interp. | Citations: 13

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (95%)
Improving Dictionary Learning with Gated Sparse Autoencoders

Vikrant Varma, Rohin Shah, Lewis Smith, Tom Lieberum

Published: 2024-04-24 | Area: Mechanistic Interp. | Citations: 138

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

William Merrill, Samuel R. Bowman, Jacob Pfau

Published: 2024-04-24 | Area: Mechanistic Interp. | Citations: 145

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
How to Use and Interpret Activation Patching

Stefan Heimersheim, Neel Nanda

Published: 2024-04-23 | Area: Mechanistic Interp. | Citations: 109

Tags: mechanistic-interp, ai-safety, survey, interpretability

E6 / R4 (95%)
Automatic Discovery of Visual Circuits

Jacob Andreas, Neil Chowdhury, Sarah Schwettmann, Achyuta Rajaram

Published: 2024-04-22 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (93%)
MAIA: A Multimodal Automated Interpretability Agent

Franklin Wang, Jacob Andreas, Sarah Schwettmann, Tamar Rott Shaham

Published: 2024-04-22 | Area: Mechanistic Interp. | Citations: 45

Tags: mechanistic-interp, ai-safety, tool, interpretability

E6 / R4 (95%)
LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models

Javier Ferrando, Karen Hambardzumyan, Elena Voita, Igor Tufanov

Published: 2024-04-10 | Area: Mechanistic Interp. | Citations: 15

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (95%)
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

David Bau, Can Rager, Samuel Marks, Aaron Mueller

Published: 2024-03-28 | Area: Mechanistic Interp. | Citations: 270

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Mechanisms of Non-Factual Hallucination in Language Models

Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong

Published: 2024-03-27 | Area: Mechanistic Interp. | Citations: 38

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (95%)
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

Michael Hanna, Sandro Pezzelle, Yonatan Belinkov

Published: 2024-03-26 | Area: Mechanistic Interp. | Citations: 90

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Christopher Potts, Aryaman Arora, Christopher D. Manning, Zhengxuan Wu

Published: 2024-03-12 | Area: Mechanistic Interp. | Citations: 44

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (96%)
AtP*: An efficient and scalable method for localizing LLM behaviour to components

János Kramár, Rohin Shah, Tom Lieberum, Neel Nanda

Published: 2024-03-01 | Area: Mechanistic Interp. | Citations: 71

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty

Published: 2024-02-28 | Area: Mechanistic Interp. | Citations: 54

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)