Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 61-90 of 470 papers (page 3 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models	Minfeng Zhu, Haoxuan Li, Zhen Wen, Yuchen Yang	2025-09-20	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (97%)	-
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features	Adriano Koshiyama, Zekun Wu, Seonglae Cho	2026-02-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (97%)	-
Cracking the Circuits: Mechanistic Interpretability in Large Language Models	Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	-
Cross-Layer Discrete Concept Discovery for Interpreting Language Models	Xuemin Yu, Samira Ebrahimi Kahou, Hassan Sajjad, Ankur Garg	2025-06-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E6 / R3 (95%)	1
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI	Eduard Kapelko	2025-09-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders	Lingpeng Kong, Baosong Yang, Yu Wan, Xu Wang	2026-02-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	-
DePass: Unified Feature Attributing by Simple Decomposed Forward Pass	Bowen Zhou, Kai Tian, Xiangyu Hong, Biqing Qi	2025-10-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	-
Deciphering Functions of Neurons in Vision-Language Models	Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan	2025-02-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	-
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization	Or Shafran, Mor Geva, Atticus Geiger	2025-06-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (96%)	3
Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning	Xinting Huang, Michael Hahn	2025-08-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	3
Dense SAE Latents Are Features, Not Bugs	Alessandro Stolfo, Ben Wu, Mrinmaya Sachan, Joshua Engels	2025-06-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	7
Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning	Junxuan Wang, Xuyang Ge, Zhengfu He, Wentao Shu	2025-08-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	-
Distribution-Aware Feature Selection for SAEs	Narmeen Oozeer, Amirali Abdullah, Michael Lan, Alice Rigg	2025-08-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders	Yan Hu, Xu Wang, Difan Zou, Benyou Wang	2025-10-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E7 / R4 (97%)	2
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers	Rabin Adhikari	2025-10-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Empirical Evaluation of Progressive Coding for Sparse Autoencoders	Anders Søgaard, Hans Peter	2025-04-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (94%)	-
Enhancing Automated Interpretability with Output-Centric Feature Descriptions	Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan	2025-01-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (92%)	26
Ensembling Sparse Autoencoders	Soham Gadgil, Chris Lin, Su-In Lee	2025-05-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii	Louis Jaburi, Kola Ayonrinde	2025-05-02	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	2
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks	Tsui-Wei Weng, Ge Yan, Tuomas Oikarinen	2025-06-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (99%)	7
Evaluating SAE interpretability without explanations	Gonçalo Paulo, Nora Belrose	2025-07-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	1
Evaluating Sparse Autoencoders for Monosemantic Representation	Peizhong Ju, Muhammad Umair Haider, Moghis Fereidouni, A.B. Siddique	2025-08-20	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	-
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit	Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel	2025-06-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality	Julia Hockenmaier, Sewoong Lee, Marc E. Canby, Adam Davies	2025-03-31	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (93%)	2
Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?	Maxime Meloux, Silviu Maniu, Maxime Peyrard, Francois Portet	2025-02-28	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	14
FADE: Why Bad Descriptions Happen to Good Features	Thomas Wiegand, Sebastian Lapuschkin, Aakriti Jain, Elena Golimblevskaia	2025-02-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (94%)	6
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders	David Chanin, Adria Garriga-Alonso, Tomas Dulka	2025-05-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	7
Feature Identification via the Empirical NTK	Jennifer Lin	2025-10-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	2
Finding Manifolds With Bilinear Autoencoders	Thomas Dooms, Ward Gauderis	2025-10-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R2 (94%)	2
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models	Hosein Mohebbi, Martin Tutek, Gabriele Sarti, Aaron Mueller	2025-11-23	Mechanistic Interp.	mechanistic-interp, ai-safety, benchmark	E6 / R3 (95%)	-