Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 1-30 of 470 papers (page 1 of 16)

Paper | Intel
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

Kannan Ramchandran, Harry Mayne, Adam Mahdi, Noah Y. Siegel

Published: 2026-02-02 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)
Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features

Yiting Liu, Zhi-Hong Deng

Published: 2026-01-30 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (95%)
Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path

Neha Sengar, Dongsoo Har, Andres Suarez

Published: 2026-02-10 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee, Fernanda Viégas, Martin Wattenberg, Yonatan Belinkov

Published: 2026-02-04 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders

Zhennan Zhou, Mingrui Wu, Yifan Luo, Yang Zhan

Published: 2026-02-12 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition

Mohammed Mudassir Uddin, Mohammed Kaif Pasha, Shahnawaz Alam

Published: 2026-01-19 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Jon Kleinberg, Kenny Peng, Nikhil Garg

Published: 2026-02-11 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E4 / R2 (93%)
Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Moritz Miller, Bernhard Schölkopf, Florent Draye

Published: 2026-02-04 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

Bangwei Guo, Mu Zhou, Yang Zhou, Guoning Zhang

Published: 2026-02-07 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R4 (94%)
Learning a Generative Meta-Model of LLM Activations

Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford

Published: 2026-02-06 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Donald Ye, Om Kotadia, Max Loffgren, Linus Wong

Published: 2026-02-04 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E7 / R4 (97%)
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders

Shuo Wang, Ruikang Zhang, Qi Su

Published: 2026-01-06 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Anh Tien Nguyen, Hassan Rivaz, Ali Nasiri-Sarvi, Dimitris Samaras

Published: 2026-02-12 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Patterning: The Dual of Interpretability

George Wang, Daniel Murfet

Published: 2026-01-20 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (95%)
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Mihalis Nicolaou, Yannis Panagakis, James Oldfield, Panagiotis Koromilas

Published: 2026-02-01 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
ProtoMech: Protein Circuit Tracing via Cross-layer Transcoders

Amirali Aghazadeh, Daniel Saeedi, Kunal Talreja, Darin Tsui

Published: 2026-02-12 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

Shikhar Shiromani, Sri Pranav Kunda, Archie Chaudhury

Published: 2026-01-14 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

Francis Kulumba, Djamé Seddah, Théo Lasnier, Wissam Antoun

Published: 2026-02-11 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (95%)
Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

Tony Cristofano

Published: 2026-01-22 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

Yousung Lee, Andres Saurez, Dongsoo Har

Published: 2026-02-10 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Your Language Model Secretly Contains Personality Subnetworks

Xiaolong Ma, Zihan Wang, Manling Li, Zinan Ling

Published: 2026-02-06 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
A Geometric Unification of Concept Learning with Concept Cones

Thomas Fel, Alexandre Rocchi-Henry, Gianni Franchi

Published: 2025-12-08 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, alignment-training, mechanistic-interp, ai-safety

E5 / R4 (94%)
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Louis Jaburi, Kola Ayonrinde

Published: 2025-05-01 | Area: Mechanistic Interp. | Citations: 4

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
A Pragmatic Vision for Interpretability

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, position, interpretability

-
A Toy Model of Mechanistic (Un)Faithfulness

Chris Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (93%)
AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

Mohammad Mahdi Khalili, Xudong Zhu, Zhihui Zhu

Published: 2025-10-01 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Euan Ong, Samuel Marks, Julian Minder, Daniel Wen

Published: 2025-12-17 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (94%)
Activation Transport Operators

Marek Masiak, Andrzej Szablewski

Published: 2025-08-24 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
ActivationReasoning: Logical Reasoning in Latent Activation Spaces

Hikaru Shindo, Lukas Helff, Manuel Brack, Ruben Härle

Published: 2025-10-21 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E6 / R5 (97%)
AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Yifei Yao, Mengnan Du

Published: 2025-08-24 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (94%)