Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| In-Context Learning Creates Task Vectors | Roee Hendel, Mor Geva, Amir Globerson | 2023-10-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 258 |
| Function Vectors in Large Language Models | David Bau, Arnab Sen Sharma, Millicent L. Li, Aaron Mueller | 2023-10-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 201 |
| Identifying Interpretable Visual Features in Artificial and Biological Neural Systems | Nina Miolane, David Klindt, Sophia Sanborn, Frédéric Poitevin | 2023-10-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | 10 |
| Attribution Patching Outperforms Automated Circuit Discovery | Can Rager, Aaquib Syed, Arthur Conmy | 2023-10-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 108 |
| Circuit Component Reuse Across Tasks in Transformer Language Models | Ellie Pavlick, Carsten Eickhoff, Jack Merullo | 2023-10-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (96%) | 99 |
| Interpreting Learned Feedback Patterns in Large Language Models | Philip Torr, Rauno Arike, Fazl Barez, Luke Marks | 2023-10-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 5 |
| Understanding and Controlling a Maze-Solving Policy Network | Austin Meek, Ulisse Mini, Alexander Matt Turner, Monte MacDiarmid | 2023-10-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | 22 |
| An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L | Can Rager, Jett Janiak, James Dao, Yeu-Tong Lau | 2023-10-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (95%) | 6 |
| The Importance of Prompt Tuning for Automated Neuron Explanations | Tsui-Wei Weng, Yilan Chen, Tuomas Oikarinen, Arjun Chatha | 2023-10-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 11 |
| Copy Suppression: Comprehensively Understanding an Attention Head | Thomas McGrath, Callum McDougall, Arthur Conmy, Neel Nanda | 2023-10-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 56 |
| Discovering Knowledge-Critical Subnetworks in Pretrained Language Models | Gail Weiss, Zeming Chen, Antoine Bosselut, Deniz Bayazit | 2023-10-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 20 |
| Efficient Streaming Language Models with Attention Sinks | Yuandong Tian, Guangxuan Xiao, Beidi Chen, Mike Lewis | 2023-09-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 1422 |
| Towards Best Practices of Activation Patching in Language Models: Metrics and Methods | Fred Zhang, Neel Nanda | 2023-09-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | 193 |
| Rigorously Assessing Natural Language Explanations of Neurons | Christopher Potts, Karel D'Oosterlinck, Zhengxuan Wu, Jing Huang | 2023-09-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (95%) | 41 |
| Sparse Autoencoders Find Highly Interpretable Features in Language Models | Hoagy Cunningham, Robert Huben, Aidan Ewart, Lee Sharkey | 2023-09-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (94%) | 881 |
| Uncovering Mesa-Optimization Algorithms in Transformers | Johannes von Oswald, Alexander Meulemans, Mark Sandler, Blaise Agüera y Arcas | 2023-09-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 86 |
| Neurons in Large Language Models: Dead, N-gram, Positional | Javier Ferrando, Elena Voita, Christoforos Nalmpantis | 2023-09-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | 75 |
| FIND: A Function Description Benchmark for Evaluating Interpretability Methods | David Bau, Jacob Andreas, Shuang Li, Neil Chowdhury | 2023-09-07 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, benchmark | E4 / R3 (94%) | 32 |
| Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP | Vedant Palit, Aryaman Arora, Rohan Pandey, Paul Pu Liang | 2023-08-27 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (96%) | 47 |
| The Hydra Effect: Emergent Self-repair in Language Model Computations | Vladimir Mikulik, Shane Legg, Thomas McGrath, Matthew Rahtz | 2023-07-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | 96 |
| On Privileged and Convergent Bases in Neural Network Representations | Yamini Bansal, Nikhil Vyas, Davis Brown | 2023-07-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | - |
| FACADE: A Framework for Adversarial Circuit Anomaly Detection and Evaluation | Rylan Schaeffer, Arnuv Tandon, Dhruv Pai, Andres Carranza | 2023-07-20 | Mechanistic Interp. | mechanistic-interp, ai-safety, adversarial-robustness, tool, safety-evaluation | E5 / R3 (94%) | 2 |
| Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla | Rohin Shah, Geoffrey Irving, Vladimir Mikulik, Matthew Rahtz | 2023-07-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 144 |
| Overthinking the Truth: Understanding How Language Models Process False Demonstrations | Jean-Stanislas Denain, Danny Halawi, Jacob Steinhardt | 2023-07-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | 74 |
| Discovering Variable Binding Circuitry with Desiderata | David Bau, Xander Davies, Max Nadeau, Nikhil Prakash | 2023-07-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R2 (94%) | 22 |
| The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks | Jacob Andreas, Ziqian Zhong, Ziming Liu, Max Tegmark | 2023-06-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 145 |
| Learning Transformer Programs | Alexander Wettig, Dan Friedman, Danqi Chen | 2023-06-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 48 |
| Neuron to Graph: Interpreting Language Model Neurons at Scale | Fazl Barez, Alex Foote, Shay B. Cohen, Ioannis Konstas | 2023-05-31 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (94%) | 28 |
| Language Models Implement Simple Word2Vec-style Vector Arithmetic | Ellie Pavlick, Carsten Eickhoff, Jack Merullo | 2023-05-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 86 |
| A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis | Alessandro Stolfo, Mrinmaya Sachan, Yonatan Belinkov | 2023-05-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 71 |