Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Addressing divergent representations from causal interventions on neural networks | Christopher Potts, Alexa R. Tartaglini, Satchel Grant, Simon Jerome Han | 2025-11-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | - |
| AlignSAE: Concept-Aligned Sparse Autoencoders | Xinyu Guo, Jinhe Bi, Minglai Yang, Mihai Surdeanu | 2025-12-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 2 |
| Analysis of Variational Sparse Autoencoders | Yuxiao Li, Zachary Baker | 2025-09-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | - |
| Analyze Feature Flow to Enhance Interpretation and Steering in Language Models | Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov | 2025-02-05 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 6 |
| Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models | Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana | 2025-02-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | 32 |
| Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | Joshua Engels, Senthooran Rajamanoharan, Subhash Kantamneni, Max Tegmark | 2025-02-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 62 |
| Atlas-Alignment: Making Interpretability Transferable Across Language Models | Sebastian Lapuschkin, Wojciech Samek, Jim Berend, Bruno Puri | 2025-10-31 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | - |
| Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs | Thomas Wiegand, Sebastian Lapuschkin, Sayed Mohammad Vakilzadeh Hatefi, Wojciech Samek | 2025-06-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (94%) | 2 |
| Automated Feature Labeling with Token-Space Gradient Descent | Julian Schulz, Seamus Fallows | 2025-04-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (97%) | - |
| Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers | Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison | 2025-01-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 26 |
| Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models | Qingsong Wen, Stephen Wang, Moayad Aloqaily, Kun Wang | 2025-09-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R4 (98%) | - |
| Because we have LLMs, we Can and Should Pursue Agentic Interpretability | Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord | 2025-06-13 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E5 / R3 (94%) | 9 |
| Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal | Erik Cambria, Amir Abdullah, Roy Ka-Wei Lee, Yeo Wei Jie | 2025-09-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 4 |
| Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders | Xuansheng Wu, Mengnan Du, Ninghao Liu, Dong Shu | 2025-05-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 2 |
| Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder | Kaidi Xu, Song Wang, Zhen Tan, Tianlong Chen | 2025-11-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Binary Autoencoder for Mechanistic Interpretability of Large Language Models | Hakaze Cho, Naoya Inoue, Haolin Yang, Brian M. Kurkoski | 2025-09-25 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | - |
| Binary Sparse Coding for Interpretability | Lucia Quirke, Stepan Shabalin, Nora Belrose | 2025-09-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (92%) | 1 |
| BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods | Yihong Liu, Philipp Mondorf, Sebastian Gerstner, Hinrich Schütze | 2025-10-08 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 2 |
| BlockCert: Certified Blockwise Extraction of Transformer Mechanisms | Sandro Andric | 2025-11-20 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (99%) | - |
| CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders | Sachin Kumar, Yusen Peng, Alex Gulko | 2025-08-31 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E5 / R3 (94%) | - |
| Can Interpretation Predict Behavior on Unseen Data? | David Alvarez-Melis, Jenny Kaufmann, Martin Wattenberg, Victoria R. Li | 2025-07-08 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (92%) | 5 |
| Can Large Language Models Develop Gambling Addiction? | Yunjeong Lee, Seungpil Lee, Donghyeon Shin, Sundong Kim | 2025-09-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (93%) | 2 |
| Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework | Kirill Bykov, Anna Hedström, Laura Kopf, Oliver Eberle | 2025-06-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 6 |
| Circuit Stability Characterizes Language Model Generalization | Alan Sun | 2025-05-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 2 |
| Circuit Tracing: Revealing Computational Graphs in Language Models | Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Circuits Updates - April 2025 | Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (95%) | - |
| Combining Causal Models for More Accurate Abstractions of Neural Networks | Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger | 2025-03-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem | Adam Newgas | 2025-07-13 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (92%) | - |
| Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs | Yuhang Liu, Dong Gong, Erdun Gao, Anton van den Hengel | 2026-01-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Concept-SAE: Active Causal Probing of Visual Model Behavior | Qiang Xu, Jianrong Ding, Chenchen Zhao, Muxi Chen | 2025-09-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E6 / R4 (94%) | - |