Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders | Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu | 2025-02-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R4 (93%) | 20 |
| Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models | Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana | 2025-02-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 32 |
| SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models | Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 18 |
| Sparse Autoencoder Features for Classifications and Transferability | Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 16 |
| Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis | Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | 10 |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan | 2025-02-13 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R4 (94%) | 8 |
| Deciphering Functions of Neurons in Vision-Language Models | Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan | 2025-02-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | - |
| Position-aware Automatic Circuit Discovery | David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad | 2025-02-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 6 |
| Sparse Autoencoders Do Not Find Canonical Units of Analysis | Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed | 2025-02-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 43 |
| Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment | Konstantinos Derpanis, Julian Forsyth, Thomas Fel, Harrish Thasarathan | 2025-02-06 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E4 / R3 (97%) | 27 |
| Analyze Feature Flow to Enhance Interpretation and Steering in Language Models | Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov | 2025-02-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 6 |
| Modular Training of Neural Networks aids Interpretability | Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots | 2025-02-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 1 |
| Low-Rank Adapting Models for Sparse Autoencoders | Joshua Engels, Matthew Chen, Max Tegmark | 2025-01-31 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 4 |
| Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers | Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison | 2025-01-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 26 |
| SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders | Kamil Deja, Bartosz Cywinski | 2025-01-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (97%) | 37 |
| Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning | Pietro Ferraro, David Boyle, Anna Soligo | 2025-01-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 2 |
| Sparse Autoencoders Trained on the Same Data Learn Different Features | Gonçalo Paulo, Nora Belrose | 2025-01-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 40 |
| Propositional Interpretability in Artificial Intelligence | David J. Chalmers | 2025-01-27 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (94%) | 13 |
| Enhancing Automated Interpretability with Output-Centric Feature Descriptions | Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan | 2025-01-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (92%) | 26 |
| Mechanistic understanding and validation of large AI models with SemanticLens | Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek | 2025-01-09 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (95%) | 29 |
| Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words | Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa | 2025-01-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (94%) | 10 |
| Transformers Use Causal World Models in Maze-Solving Tasks | Alessandra Russo, Alex F. Spies, Katsumi Inoue, Tilman Räuker | 2024-12-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 9 |
| BatchTopK Sparse Autoencoders | Bart Bussmann, Neel Nanda, Patrick Leask | 2024-12-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 64 |
| Monet: Mixture of Monosemantic Experts for Transformers | Jaewoo Kang, Jungwoo Park, Young Jin Ahn, Kee-Eung Kim | 2024-12-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 9 |
| Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration | Lawrence Chan, Rajashree Agrawal, Chun Hei Yip, Jason Gross | 2024-12-04 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 4 |
| Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks | Can Rager, Samuel Marks, Adam Karvonen, Neel Nanda | 2024-11-28 | Mechanistic Interp. | mechanistic-interp, ai-safety, safety-evaluation, benchmark | E5 / R3 (96%) | 9 |
| Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models | Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda | 2024-11-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 88 |
| Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders | David Klindt, Charles O'Neill | 2024-11-20 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 7 |
| JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Kui Ren, Huiyu Xu, Chun Chen, Rui Zheng | 2024-11-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E6 / R3 (93%) | 16 |
| Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning | David Wingate, Jared Wilson, Max Forsey, Jeffrey Olmo | 2024-11-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 4 |