Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 1-30 of 470 papers (page 1 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior	Kannan Ramchandran, Harry Mayne, Adam Mahdi, Noah Y. Siegel	2026-02-02	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (96%)	-
Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features	Yiting Liu, Zhi-Hong Deng	2026-01-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (95%)	-
Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path	Neha Sengar, Dongsoo Har, Andres Suarez	2026-02-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Decomposing Query-Key Feature Interactions Using Contrastive Covariances	Andrew Lee, Fernanda Viégas, Martin Wattenberg, Yonatan Belinkov	2026-02-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders	Zhennan Zhou, Mingrui Wu, Yifan Luo, Yang Zhan	2026-02-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition	Mohammed Mudassir Uddin, Mohammed Kaif Pasha, Shahnawaz Alam	2026-01-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?	Jon Kleinberg, Kenny Peng, Nikhil Garg	2026-02-11	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E4 / R2 (93%)	-
Identifying Intervenable and Interpretable Features via Orthogonality Regularization	Moritz Miller, Bernhard Schölkopf, Florent Draye	2026-02-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery	Bangwei Guo, Mu Zhou, Yang Zhou, Guoning Zhang	2026-02-07	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R4 (94%)	-
Learning a Generative Meta-Model of LLM Activations	Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford	2026-02-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning	Donald Ye, Om Kotadia, Max Loffgren, Linus Wong	2026-02-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R4 (97%)	-
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders	Shuo Wang, Ruikang Zhang, Qi Su	2026-01-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
MonoLoss: A Training Objective for Interpretable Monosemantic Representations	Anh Tien Nguyen, Hassan Rivaz, Ali Nasiri-Sarvi, Dimitris Samaras	2026-02-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Patterning: The Dual of Interpretability	George Wang, Daniel Murfet	2026-01-20	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E6 / R4 (95%)	-
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding	Mihalis Nicolaou, Yannis Panagakis, James Oldfield, Panagiotis Koromilas	2026-02-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
ProtoMech: Protein Circuit Tracing via Cross-layer Transcoders	Amirali Aghazadeh, Daniel Saeedi, Kunal Talreja, Darin Tsui	2026-02-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders	Shikhar Shiromani, Sri Pranav Kunda, Archie Chaudhury	2026-01-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	-
Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models	Francis Kulumba, Djamé Seddah, Théo Lasnier, Wissam Antoun	2026-02-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (95%)	-
Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction	Tony Cristofano	2026-01-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints	Yousung Lee, Andres Saurez, Dongsoo Har	2026-02-10	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	-
Your Language Model Secretly Contains Personality Subnetworks	Xiaolong Ma, Zihan Wang, Manling Li, Zinan Ling	2026-02-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
A Geometric Unification of Concept Learning with Concept Cones	Thomas Fel, Alexandre Rocchi-Henry, Gianni Franchi	2025-12-08	Mechanistic Interp.	theoretical, alignment-training, mechanistic-interp, ai-safety	E5 / R4 (94%)	-
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i	Louis Jaburi, Kola Ayonrinde	2025-05-01	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	4
A Pragmatic Vision for Interpretability	Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan	-	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	-	-
A Toy Model of Mechanistic (Un)Faithfulness	Chris Olah	-	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features	Mohammad Mahdi Khalili, Xudong Zhu, Zhihui Zhu	2025-10-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	-
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers	Euan Ong, Samuel Marks, Julian Minder, Daniel Wen	2025-12-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (94%)	2
Activation Transport Operators	Marek Masiak, Andrzej Szablewski	2025-08-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	-
ActivationReasoning: Logical Reasoning in Latent Activation Spaces	Hikaru Shindo, Lukas Helff, Manuel Brack, Ruben Härle	2025-10-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R5 (97%)	1
AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations	Yifei Yao, Mengnan Du	2025-08-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (94%)	1