Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models | Minfeng Zhu, Haoxuan Li, Zhen Wen, Yuchen Yang | 2025-09-20 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (97%) | - |
| Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features | Adriano Koshiyama, Zekun Wu, Seonglae Cho | 2026-02-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (97%) | - |
| Cracking the Circuits: Mechanistic Interpretability in Large Language Models | Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | - |
| Cross-Layer Discrete Concept Discovery for Interpreting Language Models | Xuemin Yu, Samira Ebrahimi Kahou, Hassan Sajjad, Ankur Garg | 2025-06-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E6 / R3 (95%) | 1 |
| Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI | Eduard Kapelko | 2025-09-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders | Lingpeng Kong, Baosong Yang, Yu Wan, Xu Wang | 2026-02-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
| DePass: Unified Feature Attributing by Simple Decomposed Forward Pass | Bowen Zhou, Kai Tian, Xiangyu Hong, Biqing Qi | 2025-10-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | - |
| Deciphering Functions of Neurons in Vision-Language Models | Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan | 2025-02-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | - |
| Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization | Or Shafran, Mor Geva, Atticus Geiger | 2025-06-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | 3 |
| Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning | Xinting Huang, Michael Hahn | 2025-08-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | 3 |
| Dense SAE Latents Are Features, Not Bugs | Alessandro Stolfo, Ben Wu, Mrinmaya Sachan, Joshua Engels | 2025-06-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 7 |
| Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning | Junxuan Wang, Xuyang Ge, Zhengfu He, Wentao Shu | 2025-08-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | - |
| Distribution-Aware Feature Selection for SAEs | Narmeen Oozeer, Amirali Abdullah, Michael Lan, Alice Rigg | 2025-08-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders | Yan Hu, Xu Wang, Difan Zou, Benyou Wang | 2025-10-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E7 / R4 (97%) | 2 |
| Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers | Rabin Adhikari | 2025-10-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Empirical Evaluation of Progressive Coding for Sparse Autoencoders | Anders Søgaard, Hans Peter | 2025-04-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (94%) | - |
| Enhancing Automated Interpretability with Output-Centric Feature Descriptions | Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan | 2025-01-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (92%) | 26 |
| Ensembling Sparse Autoencoders | Soham Gadgil, Chris Lin, Su-In Lee | 2025-05-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii | Louis Jaburi, Kola Ayonrinde | 2025-05-02 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 2 |
| Evaluating Neuron Explanations: A Unified Framework with Sanity Checks | Tsui-Wei Weng, Ge Yan, Tuomas Oikarinen | 2025-06-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (99%) | 7 |
| Evaluating SAE interpretability without explanations | Gonçalo Paulo, Nora Belrose | 2025-07-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | 1 |
| Evaluating Sparse Autoencoders for Monosemantic Representation | Peizhong Ju, Muhammad Umair Haider, Moghis Fereidouni, A.B. Siddique | 2025-08-20 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit | Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel | 2025-06-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality | Julia Hockenmaier, Sewoong Lee, Marc E. Canby, Adam Davies | 2025-03-31 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 2 |
| Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? | Maxime Meloux, Silviu Maniu, Maxime Peyrard, Francois Portet | 2025-02-28 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | 14 |
| FADE: Why Bad Descriptions Happen to Good Features | Thomas Wiegand, Sebastian Lapuschkin, Aakriti Jain, Elena Golimblevskaia | 2025-02-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (94%) | 6 |
| Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders | David Chanin, Adria Garriga-Alonso, Tomas Dulka | 2025-05-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 7 |
| Feature Identification via the Empirical NTK | Jennifer Lin | 2025-10-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 2 |
| Finding Manifolds With Bilinear Autoencoders | Thomas Dooms, Ward Gauderis | 2025-10-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R2 (94%) | 2 |
| Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models | Hosein Mohebbi, Martin Tutek, Gabriele Sarti, Aaron Mueller | 2025-11-23 | Mechanistic Interp. | mechanistic-interp, ai-safety, benchmark | E6 / R3 (95%) | - |