Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs<br>Letao Han, Ling Hu, Yuemei Xu, Xiaoyang Gu | 2025-04-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R3 (94%) | 1 |
| From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers<br>Julia Kempe, Karen Ullrich, Jingtong Su | 2025-06-20 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E6 / R4 (95%) | 3 |
| From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit<br>Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel | 2025-06-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 16 |
| From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models<br>Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 32 |
| From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers<br>Sonia Joseph, Jack Stanley, Praneet Suresh, Luca Scimeca | 2025-09-08 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| From superposition to sparse codes: interpretable representations in neural networks<br>Harald Maurer, Patrik Reizinger, Nina Miolane, David Klindt | 2025-03-03 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (93%) | 7 |
| GIM: Improved Interpretability for Large Language Models<br>Lars Maaløe, Tuukka Ruotsalo, Maria Maistro, Róbert Csordás | 2025-05-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E7 / R3 (96%) | 1 |
| Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3<br>Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (98%) | - |
| Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs<br>Atharva Nijasure, Tanya Chowdhury, James Allan, Yair Zick | 2025-09-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| How Can Interpretability Researchers Help AGI Go Well?<br>Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | - | Mechanistic Interp. | alignment-training, mechanistic-interp, ai-safety, position, interpretability | - | - |
| How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence<br>Yizhou Sun, Weikai Li, Shichang Zhang, Himabindu Lakkaraju | 2025-04-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R4 (94%) | 5 |
| HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks<br>Christopher Potts, Jiuding Sun, Michael Sklar, Karel D'Oosterlinck | 2025-03-13 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 8 |
| I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders<br>Alexey Dontsov, Ivan Oseledets, Polina Druzhinina, Oleg Y. Rogov | 2025-03-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 24 |
| Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition<br>Brianna Chrisman, Lee Sharkey, Lucius Bushnaq | 2025-03-31 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | 3 |
| Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning<br>Pietro Ferraro, David Boyle, Anna Soligo | 2025-01-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 2 |
| Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models<br>Noura Al Moubayed, Neel Nanda, Patrick Leask | 2025-05-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Insights into a radiology-specialised multimodal large language model with sparse autoencoders<br>Felix Meissen, Javier Alvarez-Valle, Daniel Coelho de Castro, Shruthi Bannur | 2025-07-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (96%) | 1 |
| Insights on Crosscoder Model Diffing<br>Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors<br>Christopher Potts, Junyi Tao, Thomas Icard, Jing Huang | 2025-05-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E8 / R3 (99%) | 4 |
| Interpretability Illusions with Sparse Autoencoders<br>Usha Bhalla, Aaron J. Li, Himabindu Lakkaraju, Suraj Srinivas | 2025-05-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability | E5 / R3 (95%) | 5 |
| Interpreting CLIP with Hierarchical Sparse Autoencoders<br>Hubert Baniecki, Vladimir Zaigrajew, Przemyslaw Biecek | 2025-02-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 19 |
| Interpreting Transformers Through Attention Head Intervention<br>Mason Kadem, Rong Zheng | 2026-01-07 | Mechanistic Interp. | mechanistic-interp, ai-safety, survey, interpretability | E5 / R3 (95%) | - |
| Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders<br>Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu | 2025-02-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R4 (93%) | 20 |
| Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban<br>Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Aaron David Tucker | 2025-06-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | - |
| InverseScope: Scalable Activation Inversion for Interpreting Large Language Models<br>Zhennan Zhou, Yifan Luo, Bin Dong | 2025-06-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
| Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations<br>Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison | 2025-02-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 6 |
| Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models<br>Di Wu, Jun Zhao, Jiaxiang Liu, Boxuan Xing | 2025-06-10 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (93%) | - |
| Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders<br>Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin | 2025-05-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
| LLM Interpretability with Identifiable Temporal-Instantaneous Representation<br>Yujia Zheng, Kun Zhang, Xiangchen Song, Jiaqi Sun | 2025-09-27 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (94%) | 2 |
| Language Model Circuits Are Sparse in the Neuron Basis<br>Aryaman Arora, Sarah Schwettmann, Zhengxuan Wu, Jacob Steinhardt | 2026-01-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 1 |