Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Information Flow Routes: Automatically Interpreting Language Models at Scale | Javier Ferrando, Elena Voita | 2024-02-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 74 |
| Explorations of Self-Repair in Language Models | Neel Nanda, Cody Rushing | 2024-02-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 20 |
| A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task | Paul Swoboda, Christian Bartelt, Abhay Sheshadri, Victor Levoso | 2024-02-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 48 |
| Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT | Xuyang Ge, Zhengfu He, Qinyuan Cheng, Qiong Tang | 2024-02-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | 25 |
| Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals | Francesco Ortu, Bernhard Schölkopf, Diego Doimo, Zhijing Jin | 2024-02-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R3 (95%) | 35 |
| Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs | Bilal Chughtai, Alan Cooney, Neel Nanda | 2024-02-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (93%) | 31 |
| Opening the AI black box: program synthesis via mechanistic interpretability | Isaac Liao, Anish Mudide, Tara Rezaei Kheirkhah, Ziming Liu | 2024-02-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R4 (96%) | 19 |
| Challenges in Mechanistically Interpreting Model Representations | James Dao, Satvik Golechha | 2024-02-06 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (93%) | 4 |
| Real Sparks of Artificial Intelligence and the Importance of Inner Interpretability | Alex Grzankowski | 2024-01-31 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E8 / R4 (94%) | 10 |
| Fluent dreaming for language models | Michael Sklar, Zygimantas Straznickas, T. Ben Thompson | 2024-01-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E4 / R2 (97%) | 4 |
| A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments | Christopher Potts, Aryaman Arora, Thomas Icard, Zhengxuan Wu | 2024-01-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | 9 |
| Universal Neurons in GPT2 Language Models | Qinyi Sun, Wes Gurnee, Tara Rezaei Kheirkhah, Will Hathaway | 2024-01-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 83 |
| Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability | Jatin Nainani | 2024-01-08 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 3 |
| A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Jonathan K. Kummerfeld, Rada Mihalcea, Andrew Lee, Xiaoyan Bai | 2024-01-03 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R3 (94%) | 165 |
| Observable Propagation: Uncovering Feature Vectors in Transformers | Jacob Dunefsky, Arman Cohan | 2023-12-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 2 |
| Forbidden Facts: An Investigation of Competing Objectives in Llama-2 | Tony T. Wang, Nir Shavit, Kaivalya Hariharan, Miles Wang | 2023-12-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (96%) | 3 |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | Euan Ong, Rhys Gould, George Ogden, Arthur Conmy | 2023-12-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (97%) | 69 |
| Grokking Group Multiplication with Cosets | Honglu Fan, Stella Biderman, Dashiell Stander, Qinan Yu | 2023-12-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 18 |
| Interpretability Illusions in the Generalization of Simplified Models | Andrew Lampinen, Lucas Dixon, Asma Ghandeharioun, Dan Friedman | 2023-12-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 20 |
| FlexModel: A Framework for Interpretability of Distributed Large Language Models | John Willes, Muhammad Adil Asif, Matthew Choi, David B. Emerson | 2023-12-05 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (95%) | 1 |
| Generating Interpretable Networks using Hypernetworks | Isaac Liao, Ziming Liu, Max Tegmark | 2023-12-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (95%) | 2 |
| Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | Georg Lange, Aleksandar Makelov, Neel Nanda | 2023-11-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (93%) | 41 |
| Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching | Phillip Guo, Richard Ren, James Campbell | 2023-11-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 26 |
| Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks | Hidenori Tanaka, Edward Grefenstette, Ekdeep Singh Lubana, Robert P. Dick | 2023-11-21 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R3 (92%) | 99 |
| Future Lens: Anticipating Subsequent Tokens from a Single Hidden State | Jiuding Sun, David Bau, Koyena Pal, Byron C. Wallace | 2023-11-08 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (96%) | 97 |
| Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | Fazl Barez, Michael Lan | 2023-11-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 9 |
| Uncovering Intermediate Variables in Transformers using Circuit Probing | Ellie Pavlick, Thomas Serre, Michael A. Lepori | 2023-11-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 12 |
| Codebook Features: Sparse and Discrete Interpretability for Neural Networks | Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman | 2023-10-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (94%) | 41 |
| How Do Language Models Bind Entities in Context? | Jiahai Feng, Jacob Steinhardt | 2023-10-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (95%) | 70 |
| Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism | Mansi Sakarvadia, Ian Foster, Arham Khan, Daniel Grzenda | 2023-10-25 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (95%) | 18 |