Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction. Authors: Filip Sondej, Yushi Yang, Harry Mayne, Adam Mahdi | 2024-11-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R2 (95%) | 3 |
| Towards Unifying Interpretability and Control: Evaluation via Intervention. Authors: Usha Bhalla, Asma Ghandeharioun, Himabindu Lakkaraju, Suraj Srinivas | 2024-11-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation | E6 / R3 (94%) | 20 |
| A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning. Authors: Rina Panigrahy, Cyrus Rashtchian, Enming Luo, Guan Zhe Hong | 2024-11-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | 4 |
| Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders. Authors: Kola Ayonrinde | 2024-11-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 8 |
| Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders. Authors: Fazl Barez, Luke Marks, Alasdair Paren, David Krueger | 2024-11-02 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 19 |
| Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models. Authors: Mona Diab, Virginia Smith, Aashiq Muhamed | 2024-11-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 10 |
| Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Authors: Yaniv Nikankin, Aaron Mueller, Yonatan Belinkov, Anja Reusch | 2024-10-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 74 |
| Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups. Authors: Marco Molinari, Davide Ghilardi, Federico Belotti | 2024-10-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 10 |
| One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models. Authors: David Bau, Chris Wendler, Caglar Gulcehre, Antonio Mari | 2024-10-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (95%) | 18 |
| Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness. Authors: Jingyi Cui, Qi Zhang, Xiang Pan, Qi Lei | 2024-10-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (94%) | 5 |
| Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. Authors: Junxuan Wang, Zuxuan Wu, Xuyang Ge, Frances Liu | 2024-10-27 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (97%) | 91 |
| Decomposing The Dark Matter of Sparse Autoencoders. Authors: Joshua Engels, Max Tegmark, Logan Smith | 2024-10-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (93%) | 33 |
| Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models. Authors: Wei Jie Yeo, Erik Cambria, Ranjan Satapathy | 2024-10-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 4 |
| Automatically Interpreting Millions of Features in Large Language Models. Authors: Caden Juang, Alex Mallen, Nora Belrose, Gonçalo Paulo | 2024-10-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 69 |
| On the Role of Attention Heads in Large Language Model Safety. Authors: Yongbin Li, Rongwu Xu, Junfeng Fang, Kun Wang | 2024-10-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 46 |
| Analyzing (In)Abilities of SAEs via Formal Languages. Authors: Manish Shrivastava, Abhinav Menon, Ekdeep Singh Lubana, David Krueger | 2024-10-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R3 (95%) | 15 |
| ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. Authors: Jun Xu, Yang Song, Zhongxiang Sun, Han Li | 2024-10-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (95%) | 68 |
| The Persian Rug: solving toy models of superposition using large-scale symmetries. Authors: Alex Infanger, Aditya Cowsik, Kfir Dolev | 2024-10-15 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Bilinear MLPs Enable Weight-Based Mechanistic Interpretability. Authors: Thomas Dooms, Jose M. Oramas, Michael T. Pearce, Alice Rigg | 2024-10-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 19 |
| Efficient Dictionary Learning with Switch Sparse Autoencoders. Authors: Anish Mudide, Christian Schroeder de Witt, Joshua Engels, Eric J. Michaud | 2024-10-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 31 |
| Mechanistic Permutability: Match Features Across Layers. Authors: Ian Maksimov, Daniil Gavrilov, Nikita Balagansky | 2024-10-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (98%) | 14 |
| The Geometry of Concepts: Sparse Autoencoder Feature Structure. Authors: Yuxiao Li, David D. Baek, Joshua Engels, Xiaoqing Sun | 2024-10-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 40 |
| SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders. Authors: Philip Torr, Constantin Venhoff, Christian Schroeder de Witt, Anisoara Calinescu | 2024-10-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E6 / R3 (96%) | 1 |
| Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models. Authors: Philip Torr, Austin Meek, Fazl Barez, Ashkan Khakzar | 2024-10-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 11 |
| Gradient Routing: Masking Gradients to Localize Computation in Neural Networks. Authors: Evzen Wybitul, Alexander Matt Turner, Joseph Miller, Alex Cloud | 2024-10-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 18 |
| Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language. Authors: Severin Field, Mat Allen, Anthony Costarelli | 2024-10-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 5 |
| Sparse Attention Decomposition Applied to Circuit Tracing. Authors: Gabriel Franco, Mark Crovella | 2024-10-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (97%) | 3 |
| A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. Authors: David Chanin, James Wilken-Smith, Joseph Bloom, Tomas Dulka | 2024-09-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 83 |
| Optimal Ablation for Interpretability. Authors: Lucas Janson, Maximilian Li | 2024-09-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R4 (92%) | 14 |
| TracrBench: Generating Interpretability Testbeds with Large Language Models. Authors: Hannes Thurnherr, Jérémy Scheurer | 2024-09-07 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E5 / R3 (97%) | 4 |