Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 331-360 of 470 papers (page 12 of 16)

Paper | Intel
Information Flow Routes: Automatically Interpreting Language Models at Scale

Javier Ferrando, Elena Voita

Published: 2024-02-27 | Area: Mechanistic Interp. | Citations: 74

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Explorations of Self-Repair in Language Models

Neel Nanda, Cody Rushing

Published: 2024-02-23 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task

Paul Swoboda, Christian Bartelt, Abhay Sheshadri, Victor Levoso

Published: 2024-02-19 | Area: Mechanistic Interp. | Citations: 48

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT

Xuyang Ge, Zhengfu He, Qinyuan Cheng, Qiong Tang

Published: 2024-02-19 | Area: Mechanistic Interp. | Citations: 25

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals

Francesco Ortu, Bernhard Schölkopf, Diego Doimo, Zhijing Jin

Published: 2024-02-18 | Area: Mechanistic Interp. | Citations: 35

Tags: empirical, mechanistic-interp, ai-safety

E7 / R3 (95%)
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Bilal Chughtai, Alan Cooney, Neel Nanda

Published: 2024-02-11 | Area: Mechanistic Interp. | Citations: 31

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (93%)
Opening the AI black box: program synthesis via mechanistic interpretability

Isaac Liao, Anish Mudide, Tara Rezaei Kheirkhah, Ziming Liu

Published: 2024-02-07 | Area: Mechanistic Interp. | Citations: 19

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R4 (96%)
Challenges in Mechanistically Interpreting Model Representations

James Dao, Satvik Golechha

Published: 2024-02-06 | Area: Mechanistic Interp. | Citations: 4

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (93%)
Real Sparks of Artificial Intelligence and the Importance of Inner Interpretability

Alex Grzankowski

Published: 2024-01-31 | Area: Mechanistic Interp. | Citations: 10

Tags: mechanistic-interp, ai-safety, position, interpretability

E8 / R4 (94%)
Fluent dreaming for language models

Michael Sklar, Zygimantas Straznickas, T. Ben Thompson

Published: 2024-01-24 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E4 / R2 (97%)
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

Christopher Potts, Aryaman Arora, Thomas Icard, Zhengxuan Wu

Published: 2024-01-23 | Area: Mechanistic Interp. | Citations: 9

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
Universal Neurons in GPT2 Language Models

Qinyi Sun, Wes Gurnee, Tara Rezaei Kheirkhah, Will Hathaway

Published: 2024-01-22 | Area: Mechanistic Interp. | Citations: 83

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability

Jatin Nainani

Published: 2024-01-08 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Jonathan K. Kummerfeld, Rada Mihalcea, Andrew Lee, Xiaoyan Bai

Published: 2024-01-03 | Area: Mechanistic Interp. | Citations: 165

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R3 (94%)
Observable Propagation: Uncovering Feature Vectors in Transformers

Jacob Dunefsky, Arman Cohan

Published: 2023-12-26 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Tony T. Wang, Nir Shavit, Kaivalya Hariharan, Miles Wang

Published: 2023-12-14 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (96%)
Successor Heads: Recurring, Interpretable Attention Heads In The Wild

Euan Ong, Rhys Gould, George Ogden, Arthur Conmy

Published: 2023-12-14 | Area: Mechanistic Interp. | Citations: 69

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (97%)
Grokking Group Multiplication with Cosets

Honglu Fan, Stella Biderman, Dashiell Stander, Qinan Yu

Published: 2023-12-11 | Area: Mechanistic Interp. | Citations: 18

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Interpretability Illusions in the Generalization of Simplified Models

Andrew Lampinen, Lucas Dixon, Asma Ghandeharioun, Dan Friedman

Published: 2023-12-06 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
FlexModel: A Framework for Interpretability of Distributed Large Language Models

John Willes, Muhammad Adil Asif, Matthew Choi, David B. Emerson

Published: 2023-12-05 | Area: Mechanistic Interp. | Citations: 1

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (95%)
Generating Interpretable Networks using Hypernetworks

Isaac Liao, Ziming Liu, Max Tegmark

Published: 2023-12-05 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (95%)
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Georg Lange, Aleksandar Makelov, Neel Nanda

Published: 2023-11-28 | Area: Mechanistic Interp. | Citations: 41

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (93%)
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

Phillip Guo, Richard Ren, James Campbell

Published: 2023-11-25 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Hidenori Tanaka, Edward Grefenstette, Ekdeep Singh Lubana, Robert P. Dick

Published: 2023-11-21 | Area: Mechanistic Interp. | Citations: 99

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R3 (92%)
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

Jiuding Sun, David Bau, Koyena Pal, Byron C. Wallace

Published: 2023-11-08 | Area: Mechanistic Interp. | Citations: 97

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (96%)
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Fazl Barez, Michael Lan

Published: 2023-11-07 | Area: Mechanistic Interp. | Citations: 9

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
Uncovering Intermediate Variables in Transformers using Circuit Probing

Ellie Pavlick, Thomas Serre, Michael A. Lepori

Published: 2023-11-07 | Area: Mechanistic Interp. | Citations: 12

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Codebook Features: Sparse and Discrete Interpretability for Neural Networks

Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

Published: 2023-10-26 | Area: Mechanistic Interp. | Citations: 41

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (94%)
How Do Language Models Bind Entities in Context?

Jiahai Feng, Jacob Steinhardt

Published: 2023-10-26 | Area: Mechanistic Interp. | Citations: 70

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (95%)
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism

Mansi Sakarvadia, Ian Foster, Arham Khan, Daniel Grzenda

Published: 2023-10-25 | Area: Mechanistic Interp. | Citations: 18

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (95%)