Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Residual Stream Analysis with Multi-Layer SAEs | Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison | 2024-09-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 12 |
| Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small | Maheep Chaudhary, Atticus Geiger | 2024-09-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R3 (97%) | 30 |
| On the Complexity of Neural Computation in Superposition | Nir Shavit, Micah Adler | 2024-09-05 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 11 |
| Safety Layers in Aligned Large Language Models | Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li | 2024-08-30 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E4 / R3 (95%) | 88 |
| Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference | André Freitas, Marco Valentino, Geonhee Kim | 2024-08-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | 13 |
| Mathematical Models of Computation in Superposition | Lawrence Chan, Jake Mendel, Kaarel Hänni, Dmitry Vaintrob | 2024-08-10 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 23 |
| Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 | Vikrant Varma, Rohin Shah, Nicolas Sonnerat, Lewis Smith | 2024-08-09 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (96%) | 254 |
| Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models | David Bau, Can Rager, Samuel Marks, Adam Karvonen | 2024-07-31 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 49 |
| Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability | Jorge García-Carrasco, Juan Trujillo, Alejandro Maté | 2024-07-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability | E5 / R3 (93%) | 6 |
| Planning in a Recurrent Neural Network that Plays Sokoban | Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Philip Quirke | 2024-07-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (95%) | 11 |
| Adversarial Circuit Evaluation | Niels uit de Bos, Adrià Garriga-Alonso | 2024-07-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness, safety-evaluation | E6 / R3 (95%) | 1 |
| InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques | Iván Arcuschin, Rohan Gupta, Adrià Garriga-Alonso, Thomas Kwa | 2024-07-19 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E6 / R3 (98%) | 7 |
| Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders | Vikrant Varma, Nicolas Sonnerat, Tom Lieberum, Senthooran Rajamanoharan | 2024-07-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 194 |
| Relational Composition in Neural Networks: A Survey and Call to Action | Fernanda B. Viégas, Martin Wattenberg | 2024-07-19 | Mechanistic Interp. | mechanistic-interp, ai-safety, survey | E6 / R3 (91%) | 19 |
| Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach | Nils Palumbo, Ravi Mangal, Corina Păsăreanu, Saranya Vijayakumar | 2024-07-18 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 2 |
| NNsight and NDIF: Democratizing Access to Foundation Model Internals | Jonathan Bell, David Bau, Can Rager, Samuel Marks | 2024-07-18 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (96%) | 23 |
| Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent | Sonia Joseph, Irina Rish, Blake Richards, Karolis Jucys | 2024-07-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R4 (97%) | 4 |
| LLM Circuit Analyses Are Consistent Across Training and Scale | Michael Hanna, Curt Tigges, Stella Biderman, Qinan Yu | 2024-07-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 42 |
| Transformer Circuit Faithfulness Metrics are not Robust | Bilal Chughtai, William Saunders, Joseph Miller | 2024-07-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (95%) | 10 |
| Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks | Aaron Mueller | 2024-07-05 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (92%) | 18 |
| Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning | Lei Yu, Jingcheng Niu, Zining Zhu, Gerald Penn | 2024-07-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (95%) | 8 |
| The Remarkable Robustness of LLMs: Stages of Inference? | Wes Gurnee, Max Tegmark, Vedang Lad | 2024-06-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R4 (94%) | 98 |
| A Closer Look into Mixture-of-Experts in Large Language Models | Jie Fu, Zeyu Huang, Ka Man Lo, Zili Wang | 2024-06-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 28 |
| Interpreting Attention Layer Outputs with Sparse Autoencoders | Joseph Isaac Bloom, Arthur Conmy, Robert Krzyzanowski, Connor Kissane | 2024-06-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 40 |
| Understanding Language Model Circuits through Knowledge Editing | Huaizhi Ge, Zining Zhu, Frank Rudzicz | 2024-06-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 2 |
| Confidence Regulation Neurons in Language Models | Wes Gurnee, Xingyi Song, Alessandro Stolfo, Ben Wu | 2024-06-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | 45 |
| Finding Transformer Circuits with Edge Pruning | Alexander Wettig, Adithya Bhaskar, Dan Friedman, Danqi Chen | 2024-06-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (97%) | 40 |
| Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models | Jun Zhao, Zhuoran Jin, Pengfei Cao, Yubo Chen | 2024-06-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 19 |
| Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons | Juanzi Li, Xiaozhi Wang, Jianhui Chen, Zijun Yao | 2024-06-20 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R3 (94%) | 25 |
| Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries | Sohee Yang, Daniela Gottesman, Mor Geva, Amir Globerson | 2024-06-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (97%) | 75 |