Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 361-390 of 470 papers (page 13 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
In-Context Learning Creates Task Vectors	Roee Hendel, Mor Geva, Amir Globerson	2023-10-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	258
Function Vectors in Large Language Models	David Bau, Arnab Sen Sharma, Millicent L. Li, Aaron Mueller	2023-10-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	201
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems	Nina Miolane, David Klindt, Sophia Sanborn, Frédéric Poitevin	2023-10-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	10
Attribution Patching Outperforms Automated Circuit Discovery	Can Rager, Aaquib Syed, Arthur Conmy	2023-10-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	108
Circuit Component Reuse Across Tasks in Transformer Language Models	Ellie Pavlick, Carsten Eickhoff, Jack Merullo	2023-10-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (96%)	99
Interpreting Learned Feedback Patterns in Large Language Models	Philip Torr, Rauno Arike, Fazl Barez, Luke Marks	2023-10-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	5
Understanding and Controlling a Maze-Solving Policy Network	Austin Meek, Ulisse Mini, Alexander Matt Turner, Monte MacDiarmid	2023-10-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	22
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L	Can Rager, Jett Janiak, James Dao, Yeu-Tong Lau	2023-10-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (95%)	6
The Importance of Prompt Tuning for Automated Neuron Explanations	Tsui-Wei Weng, Yilan Chen, Tuomas Oikarinen, Arjun Chatha	2023-10-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	11
Copy Suppression: Comprehensively Understanding an Attention Head	Thomas McGrath, Callum McDougall, Arthur Conmy, Neel Nanda	2023-10-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	56
Discovering Knowledge-Critical Subnetworks in Pretrained Language Models	Gail Weiss, Zeming Chen, Antoine Bosselut, Deniz Bayazit	2023-10-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	20
Efficient Streaming Language Models with Attention Sinks	Yuandong Tian, Guangxuan Xiao, Beidi Chen, Mike Lewis	2023-09-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	1422
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods	Fred Zhang, Neel Nanda	2023-09-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	193
Rigorously Assessing Natural Language Explanations of Neurons	Christopher Potts, Karel D'Oosterlinck, Zhengxuan Wu, Jing Huang	2023-09-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (95%)	41
Sparse Autoencoders Find Highly Interpretable Features in Language Models	Hoagy Cunningham, Robert Huben, Aidan Ewart, Lee Sharkey	2023-09-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (94%)	881
Uncovering Mesa-Optimization Algorithms in Transformers	Johannes von Oswald, Alexander Meulemans, Mark Sandler, Blaise Agüera y Arcas	2023-09-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	86
Neurons in Large Language Models: Dead, N-gram, Positional	Javier Ferrando, Elena Voita, Christoforos Nalmpantis	2023-09-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	75
FIND: A Function Description Benchmark for Evaluating Interpretability Methods	David Bau, Jacob Andreas, Shuang Li, Neil Chowdhury	2023-09-07	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, benchmark	E4 / R3 (94%)	32
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP	Vedant Palit, Aryaman Arora, Rohan Pandey, Paul Pu Liang	2023-08-27	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (96%)	47
The Hydra Effect: Emergent Self-repair in Language Model Computations	Vladimir Mikulik, Shane Legg, Thomas McGrath, Matthew Rahtz	2023-07-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	96
On Privileged and Convergent Bases in Neural Network Representations	Yamini Bansal, Nikhil Vyas, Davis Brown	2023-07-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	-
FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation	Rylan Schaeffer, Arnuv Tandon, Dhruv Pai, Andres Carranza	2023-07-20	Mechanistic Interp.	mechanistic-interp, ai-safety, adversarial-robustness, tool, safety-evaluation	E5 / R3 (94%)	2
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla	Rohin Shah, Geoffrey Irving, Vladimir Mikulik, Matthew Rahtz	2023-07-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	144
Overthinking the Truth: Understanding How Language Models Process False Demonstrations	Jean-Stanislas Denain, Danny Halawi, Jacob Steinhardt	2023-07-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	74
Discovering Variable Binding Circuitry with Desiderata	David Bau, Xander Davies, Max Nadeau, Nikhil Prakash	2023-07-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R2 (94%)	22
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks	Jacob Andreas, Ziqian Zhong, Ziming Liu, Max Tegmark	2023-06-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	145
Learning Transformer Programs	Alexander Wettig, Dan Friedman, Danqi Chen	2023-06-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	48
Neuron to Graph: Interpreting Language Model Neurons at Scale	Fazl Barez, Alex Foote, Shay B. Cohen, Ioannis Konstas	2023-05-31	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (94%)	28
Language Models Implement Simple Word2Vec-style Vector Arithmetic	Ellie Pavlick, Carsten Eickhoff, Jack Merullo	2023-05-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	86
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis	Alessandro Stolfo, Mrinmaya Sachan, Yonatan Belinkov	2023-05-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	71