Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 121-150 of 470 papers (page 5 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Learning Multi-Level Features with Matryoshka Sparse Autoencoders	Noa Nabeshima, Adam Karvonen, Bart Bussmann, Neel Nanda	2025-03-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	63
Low-Rank Adapting Models for Sparse Autoencoders	Joshua Engels, Matthew Chen, Max Tegmark	2025-01-31	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	4
MIB: A Mechanistic Interpretability Benchmark	David Bau, Michael Hanna, Sarah Wiegreffe, Alessandro Stolfo	2025-04-17	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, benchmark	E6 / R3 (97%)	16
Mapping Faithful Reasoning in Language Models	Andreas Damianou, J Rosser, Konstantina Palla, José Luis Redondo García	2025-10-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	2
Measuring Sparse Autoencoder Feature Sensitivity	Nathan Hu, Katherine Tian, Claire Tian	2025-09-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (94%)	-
Measuring and Guiding Monosemanticity	Stephan Wäldchen, Manuel Brack, Björn Deiseroth, Ruben Härle	2025-06-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (96%)	3
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units	Jianhui Chen, Liangming Pan, Yuzhang Luo	2026-01-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns	Lakshmi Babu-Saheer, Mohammed Abu Baker	2025-08-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	-
Mechanistic Interpretability Needs Philosophy	Nina Rajcic, Iwan Williams, Ninell Oldenburg, Filippos Stamatiou	2025-06-23	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E5 / R3 (94%)	3
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG	Maxime Peyrard, François Portet, Maxime Méloux	2025-10-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	2
Mechanistic Interpretability for Steering Vision-Language-Action Models	Ian Chuang, Bear Häon, Kaylene Stocking, Claire Tomlin	2025-08-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	3
Mechanistic understanding and validation of large AI models with SemanticLens	Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek	2025-01-09	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (95%)	29
Mixture of Experts Made Intrinsically Interpretable	Philip Torr, Adel Bibi, Constantin Venhoff, Puneet K. Dokania	2025-03-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	12
Modular Training of Neural Networks aids Interpretability	Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots	2025-02-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	1
Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research	Rohin Shah, Lewis Smith, Tom Lieberum, Janos Kramar	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	-	-
NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment	Jean-Noël Vittaut, Sarath Chandar, Marie-Jeanne Lesot, Nicolas Chesneau	2025-06-10	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Neuroplasticity and Corruption in Model Mechanisms: A Case Study Of Indirect Object Identification	Mohammad Mahdi Khalili, Vishnu Kabir Chhabra, Ding Zhu	2025-02-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	5
On the Biology of a Large Language Model	Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (96%)	-
OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features	Alexey Dontsov, Ivan Oseledets, Anton Korznikov, Oleg Y. Rogov	2025-09-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (95%)	2
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization	Lijie Hu, Huanyi Xie, Shu Yang, Di Wang	2025-10-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	2
Position-aware Automatic Circuit Discovery	David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad	2025-02-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	6
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs	Yujia Zheng, Kun Zhang, Mona T. Diab, Xiangchen Song	2025-05-26	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability, safety-evaluation	E5 / R3 (95%)	6
Preserving Bilinear Weight Spectra with a Signed and Shrunk Quadratic Activation Function	Jason Abohwo, Thomas Mosen	2025-09-02	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (99%)	-
Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video	Robert Graham, Sonia Joseph, Sebastian Lapuschkin, Yash Vadi	2025-04-28	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (98%)	11
Progress on Attention	Rodrigo Luger, Nick Turner, Adam Jermyn, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry	Sai Sumedh R. Hindupur, Ekdeep Singh Lubana, Thomas Fel, Demba Ba	2025-03-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	34
Propositional Interpretability in Artificial Intelligence	David J. Chalmers	2025-01-27	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (94%)	13
Prototype Transformer: Towards Language Model Architectures Interpretable by Design	Chang Qi, Matteo Forasassi, Amine M'Charrak, Yordan Yordanov	2026-02-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching	Farnoush Rezaei Jafari, Ashkan Khakzar, Oliver Eberle, Neel Nanda	2025-08-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	4
Representation Learning on a Random Lattice	Aryeh Brill	2025-04-28	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E6 / R3 (96%)	1