Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 61-90 of 470 papers (page 3 of 16)

PaperIntel
ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

Minfeng Zhu, Haoxuan Li, Zhen Wen, Yuchen Yang

Published: 2025-09-20 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (97%)
Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

Adriano Koshiyama, Zekun Wu, Seonglae Cho

Published: 2026-02-11 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (97%)
Cracking the Circuits: Mechanistic Interpretability in Large Language Models

Mushtaq Ali, Dost Muhammad, Malika Bendechache, Muhammad Salman

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Xuemin Yu, Samira Ebrahimi Kahou, Hassan Sajjad, Ankur Garg

Published: 2025-06-24 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E6 / R3 (95%)
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI

Eduard Kapelko

Published: 2025-09-23 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

Lingpeng Kong, Baosong Yang, Yu Wan, Xu Wang

Published: 2026-02-05 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
DePass: Unified Feature Attributing by Simple Decomposed Forward Pass

Bowen Zhou, Kai Tian, Xiangyu Hong, Biqing Qi

Published: 2025-10-21 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Deciphering Functions of Neurons in Vision-Language Models

Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan

Published: 2025-02-10 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Or Shafran, Mor Geva, Atticus Geiger

Published: 2025-06-12 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)
Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang, Michael Hahn

Published: 2025-08-03 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
Dense SAE Latents Are Features, Not Bugs

Alessandro Stolfo, Ben Wu, Mrinmaya Sachan, Joshua Engels

Published: 2025-06-18 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

Junxuan Wang, Xuyang Ge, Zhengfu He, Wentao Shu

Published: 2025-08-23 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
Distribution-Aware Feature Selection for SAEs

Narmeen Oozeer, Amirali Abdullah, Michael Lan, Alice Rigg

Published: 2025-08-29 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders

Yan Hu, Xu Wang, Difan Zou, Benyou Wang

Published: 2025-10-04 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E7 / R4 (97%)
Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

Rabin Adhikari

Published: 2025-10-28 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Empirical Evaluation of Progressive Coding for Sparse Autoencoders

Anders Søgaard, Hans Peter

Published: 2025-04-30 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (94%)
Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan

Published: 2025-01-14 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (92%)
Ensembling Sparse Autoencoders

Soham Gadgil, Chris Lin, Su-In Lee

Published: 2025-05-21 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Louis Jaburi, Kola Ayonrinde

Published: 2025-05-02 | Area: Mechanistic Interp. | Citations: 2

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Tsui-Wei Weng, Ge Yan, Tuomas Oikarinen

Published: 2025-06-06 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (99%)
Evaluating SAE interpretability without explanations

Gonçalo Paulo, Nora Belrose

Published: 2025-07-11 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Evaluating Sparse Autoencoders for Monosemantic Representation

Peizhong Ju, Muhammad Umair Haider, Moghis Fereidouni, A.B. Siddique

Published: 2025-08-20 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel

Published: 2025-06-05 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Julia Hockenmaier, Sewoong Lee, Marc E. Canby, Adam Davies

Published: 2025-03-31 | Area: Mechanistic Interp. | Citations: 2

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Meloux, Silviu Maniu, Maxime Peyrard, Francois Portet

Published: 2025-02-28 | Area: Mechanistic Interp. | Citations: 14

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
FADE: Why Bad Descriptions Happen to Good Features

Thomas Wiegand, Sebastian Lapuschkin, Aakriti Jain, Elena Golimblevskaia

Published: 2025-02-24 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (94%)
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

David Chanin, Adria Garriga-Alonso, Tomas Dulka

Published: 2025-05-16 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Feature Identification via the Empirical NTK

Jennifer Lin

Published: 2025-10-01 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Finding Manifolds With Bilinear Autoencoders

Thomas Dooms, Ward Gauderis

Published: 2025-10-19 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E4 / R2 (94%)
Findings of the BlackboxNLP 2025 Shared Task: Localizing Circuits and Causal Variables in Language Models

Hosein Mohebbi, Martin Tutek, Gabriele Sarti, Aaron Mueller

Published: 2025-11-23 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, benchmark

E6 / R3 (95%)