Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
Showing 1-30 of 470 papers (page 1 of 16)
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior | Kannan Ramchandran, Harry Mayne, Adam Mahdi, Noah Y. Siegel | 2026-02-02 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | - |
| Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features | Yiting Liu, Zhi-Hong Deng | 2026-01-30 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (95%) | - |
| Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path | Neha Sengar, Dongsoo Har, Andres Suarez | 2026-02-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Decomposing Query-Key Feature Interactions Using Contrastive Covariances | Andrew Lee, Fernanda Viégas, Martin Wattenberg, Yonatan Belinkov | 2026-02-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| From Atoms to Trees: Building a Structured Feature Forest with Hierarchical Sparse Autoencoders | Zhennan Zhou, Mingrui Wu, Yifan Luo, Yang Zhan | 2026-02-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition | Mohammed Mudassir Uddin, Mohammed Kaif Pasha, Shahnawaz Alam | 2026-01-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| How Many Features Can a Language Model Store Under the Linear Representation Hypothesis? | Jon Kleinberg, Kenny Peng, Nikhil Garg | 2026-02-11 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E4 / R2 (93%) | - |
| Identifying Intervenable and Interpretable Features via Orthogonality Regularization | Moritz Miller, Bernhard Schölkopf, Florent Draye | 2026-02-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery | Bangwei Guo, Mu Zhou, Yang Zhou, Guoning Zhang | 2026-02-07 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R4 (94%) | - |
| Learning a Generative Meta-Model of LLM Activations | Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford | 2026-02-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning | Donald Ye, Om Kotadia, Max Loffgren, Linus Wong | 2026-02-04 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R4 (97%) | - |
| Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders | Shuo Wang, Ruikang Zhang, Qi Su | 2026-01-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| MonoLoss: A Training Objective for Interpretable Monosemantic Representations | Anh Tien Nguyen, Hassan Rivaz, Ali Nasiri-Sarvi, Dimitris Samaras | 2026-02-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| Patterning: The Dual of Interpretability | George Wang, Daniel Murfet | 2026-01-20 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R4 (95%) | - |
| PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding | Mihalis Nicolaou, Yannis Panagakis, James Oldfield, Panagiotis Koromilas | 2026-02-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| ProtoMech: Protein Circuit Tracing via Cross-layer Transcoders | Amirali Aghazadeh, Daniel Saeedi, Kunal Talreja, Darin Tsui | 2026-02-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders | Shikhar Shiromani, Sri Pranav Kunda, Archie Chaudhury | 2026-01-14 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | - |
| Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models | Francis Kulumba, Djamé Seddah, Théo Lasnier, Wissam Antoun | 2026-02-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (95%) | - |
| Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction | Tony Cristofano | 2026-01-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints | Yousung Lee, Andres Saurez, Dongsoo Har | 2026-02-10 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Your Language Model Secretly Contains Personality Subnetworks | Xiaolong Ma, Zihan Wang, Manling Li, Zinan Ling | 2026-02-06 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| A Geometric Unification of Concept Learning with Concept Cones | Thomas Fel, Alexandre Rocchi-Henry, Gianni Franchi | 2025-12-08 | Mechanistic Interp. | theoretical, alignment-training, mechanistic-interp, ai-safety | E5 / R4 (94%) | - |
| A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i | Louis Jaburi, Kola Ayonrinde | 2025-05-01 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (95%) | 4 |
| A Pragmatic Vision for Interpretability | Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan | - | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | - | - |
| A Toy Model of Mechanistic (Un)Faithfulness | Chris Olah | - | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features | Mohammad Mahdi Khalili, Xudong Zhu, Zhihui Zhu | 2025-10-01 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers | Euan Ong, Samuel Marks, Julian Minder, Daniel Wen | 2025-12-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (94%) | 2 |
| Activation Transport Operators | Marek Masiak, Andrzej Szablewski | 2025-08-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | - |
| ActivationReasoning: Logical Reasoning in Latent Activation Spaces | Hikaru Shindo, Lukas Helff, Manuel Brack, Ruben Härle | 2025-10-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R5 (97%) | 1 |
| AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations | Yifei Yao, Mengnan Du | 2025-08-24 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (94%) | 1 |