Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 301-330 of 470 papers (page 11 of 16)

Paper (authors)	Published	Area	Tags	Intel	Citations
Transcoders Find Interpretable LLM Feature Circuits (Philippe Chlenski, Jacob Dunefsky, Neel Nanda)	2024-06-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (95%)	102
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models (Ellie Pavlick, Carsten Eickhoff, Jack Merullo)	2024-06-13	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	36
Scaling and Evaluating Sparse Autoencoders (Gabriel Goh, Jan Leike, Jeffrey Wu, Henk Tillman)	2024-06-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (96%)	334
The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision (Liv Gorton)	2024-06-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	31
From Feature Visualization to Visual Circuits: Effect of Adversarial Model Manipulation (Michael Eickenberg, Geraldin Nanfack, Eugene Belilovsky)	2024-06-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E6 / R3 (95%)	1
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience (David Poeppel, Martina G. Vilas, Gemma Roig, Federico Adolfi)	2024-06-03	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E5 / R3 (94%)	10
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network (Vasil Georgiev, Cameron Allen, Scott Emmons, Erik Jenner)	2024-06-02	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	25
Knowledge Circuits in Pretrained Transformers (Zekun Xi, Huajun Chen, Ziwen Xu, Ningyu Zhang)	2024-05-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	44
From Neurons to Neutrons: A Case Study in Interpretability (Ouail Kitouni, Mike Williams, Sokratis Trifinopoulos, Niklas Nolte)	2024-05-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	4
Not All Language Model Features Are Linear (Wes Gurnee, Isaac Liao, Joshua Engels, Eric J. Michaud)	2024-05-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (94%)	106
Automatically Identifying Local and Global Circuits with Linear Computation Graphs (Junxuan Wang, Xuyang Ge, Fukang Zhu, Zhengfu He)	2024-05-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	20
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models (Thang Bui, Charles O'Neill)	2024-05-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	11
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning (Nicholas Goldowsky-Dill, Dan Braun, Jordan Taylor, Lee Sharkey)	2024-05-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	57
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks (Marius Hobbhahn, Magdalena Wache, Jörn Stöhler, Stefan Heimersheim)	2024-05-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (93%)	6
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (Marius Hobbhahn, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun)	2024-05-17	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	11
Learnable Privacy Neurons Localization in Language Models (Zuozhu Liu, Tianxiang Hu, Yang Feng, Ruizhe Chen)	2024-05-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (94%)	30
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (Georg Lange, Aleksandar Makelov, Neel Nanda)	2024-05-14	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark	E6 / R3 (95%)	66
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability (Jorge García-Carrasco, Juan Trujillo, Alejandro Maté)	2024-05-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (95%)	13
Improving Dictionary Learning with Gated Sparse Autoencoders (Vikrant Varma, Rohin Shah, Lewis Smith, Tom Lieberum)	2024-04-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	138
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models (William Merrill, Samuel R. Bowman, Jacob Pfau)	2024-04-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	145
How to Use and Interpret Activation Patching (Stefan Heimersheim, Neel Nanda)	2024-04-23	Mechanistic Interp.	mechanistic-interp, ai-safety, survey, interpretability	E6 / R4 (95%)	109
Automatic Discovery of Visual Circuits (Jacob Andreas, Neil Chowdhury, Sarah Schwettmann, Achyuta Rajaram)	2024-04-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (93%)	10
MAIA: A Multimodal Automated Interpretability Agent (Franklin Wang, Jacob Andreas, Sarah Schwettmann, Tamar Rott Shaham)	2024-04-22	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E6 / R4 (95%)	45
LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models (Javier Ferrando, Karen Hambardzumyan, Elena Voita, Igor Tufanov)	2024-04-10	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (95%)	15
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (David Bau, Can Rager, Samuel Marks, Aaron Mueller)	2024-03-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	270
Mechanisms of Non-Factual Hallucination in Language Models (Lei Yu, Meng Cao, Jackie Chi Kit Cheung, Yue Dong)	2024-03-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (95%)	38
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (Michael Hanna, Sandro Pezzelle, Yonatan Belinkov)	2024-03-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	90
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions (Christopher Potts, Aryaman Arora, Christopher D. Manning, Zhengxuan Wu)	2024-03-12	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (96%)	44
AtP*: An efficient and scalable method for localizing LLM behaviour to components (János Kramár, Rohin Shah, Tom Lieberum, Neel Nanda)	2024-03-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	71
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning (Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty)	2024-02-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	54