Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention | J Rosser, Konstantina Palla, Hugues Bouchard, José Luis Redondo García | 2025-10-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | - |
| Superposition Yields Robust Neural Scaling | Yizhou Liu, Jeff Gore, Ziming Liu | 2025-05-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 9 |
| Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability | Zoe Tzifa-Kratira, Leonard Bereska, Reza Samavi, Efstratios Gavves | 2025-12-15 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (94%) | 1 |
| Superposition in Graph Neural Networks | Pietro Liò, Han Xuanyuan, Lukas Pertl | 2025-08-31 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E7 / R3 (95%) | 1 |
| Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation | Gal Niv, Jonathan Jacobi | 2025-03-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 3 |
| Sycophancy Hides Linearly in the Attention Heads | Hilal Alquabeh, Kentaro Inui, Nurdaulet Mukhituly, Rifo Genadi | 2026-01-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (95%) | 1 |
| TRACE: Training and Inference-Time Interpretability Analysis for Language Models | Nura Aljaafari, André Freitas, Danilo S. Carvalho | 2025-07-04 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (96%) | 1 |
| The Circuits Research Landscape: Results and Perspectives | Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (93%) | - |
| The Dead Salmons of AI Interpretability | Maxime Peyrard, François Portet, Maxime Méloux, Giada Dirupo | 2025-12-21 | Mechanistic Interp. | mechanistic-interp, ai-safety, position, interpretability | E7 / R3 (93%) | 3 |
| The Geometry of Self-Verification in a Task-Specific Reasoning Model | Chris Wendler, Andrew Lee, Fernanda Viégas, Martin Wattenberg | 2025-04-19 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 6 |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan | 2025-02-13 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R4 (94%) | 8 |
| The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? | Julian Minder, Tiago Pimentel, Thomas Hofmann, Denis Sutter | 2025-07-11 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability | E5 / R3 (95%) | 11 |
| The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction | Lei Yu, Meng Cao, Zhijing Jin, Yihuai Hong | 2025-03-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 15 |
| Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training | Junyu Ren, T. Ed Li | 2025-10-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (97%) | - |
| TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research | Dhruv Nathawani, Luke Marks, Amir Abdullah, Philip Quirke | 2025-03-17 | Mechanistic Interp. | mechanistic-interp, ai-safety, dataset, interpretability | E6 / R3 (95%) | 2 |
| Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research | Sean Trott | 2025-09-26 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (94%) | 2 |
| Towards Atoms of Large Language Models | Chenhui Hu, Jun Zhao, Pengfei Cao, Yubo Chen | 2025-09-25 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (98%) | - |
| Towards Combinatorial Interpretability of Neural Computation | Dan Alistarh, Nir Shavit, Micah Adler | 2025-04-10 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | 7 |
| Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders | Yixuan Li, Mihalis A. Nicolaou, James Oldfield, Grigorios G Chrysos | 2025-05-27 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (95%) | 1 |
| Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis | Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | 10 |
| Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition | Junxuan Wang, Xuyang Ge, Zhengfu He, Junping Zhang | 2025-04-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (96%) | 2 |
| Towards eliciting latent knowledge from LLMs with mechanistic interpretability | Emil Ryd, Senthooran Rajamanoharan, Bartosz Cywiński, Neel Nanda | 2025-05-20 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | 6 |
| Tracing Attention Computation Through Feature Interactions | Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| Training Language Models to Explain Their Own Computations | Jacob Andreas, Belinda Z. Li, Vincent Huang, Zifan Carl Guo | 2025-11-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | 7 |
| Training Superior Sparse Autoencoders for Instruct Models | Hamid Alinejad-Rokny, Yukun Chen, Jimmy Chih-Hsien Peng, Jiaming Li | 2025-06-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 1 |
| Transferring Linear Features Across Language Models With Model Stitching | Ellie Pavlick, Alessandro Stolfo, Alan Chen, Jack Merullo | 2025-06-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability | Luca Baroni, Joachim Schaeffer, Stefan Heimersheim, Marat Subkhankulov | 2025-07-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (93%) | 4 |
| Truth Neurons | Yupeng Cao, Zining Zhu, Jordan W. Suchow, Yangyang Yu | 2025-05-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (95%) | 1 |
| Understanding How Value Neurons Shape the Generation of Specified Values in LLMs | Lijie Hu, Shu Yang, Di Wang, Xinhai Wang | 2025-05-23 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R4 (95%) | 7 |
| Understanding Refusal in Language Models with Sparse Autoencoders | Wei Jie Yeo, Erik Cambria, Roy Ka-Wei Lee, Ranjan Satapathy | 2025-05-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (92%) | 7 |