Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 181-210 of 470 papers (page 7 of 16)

PaperIntel
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

J Rosser, Konstantina Palla, Hugues Bouchard, José Luis Redondo García

Published: 2025-10-22 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Superposition Yields Robust Neural Scaling

Yizhou Liu, Jeff Gore, Ziming Liu

Published: 2025-05-15 | Area: Mechanistic Interp. | Citations: 9

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability

Zoe Tzifa-Kratira, Leonard Bereska, Reza Samavi, Efstratios Gavves

Published: 2025-12-15 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (94%)
Superposition in Graph Neural Networks

Pietro Liò, Han Xuanyuan, Lukas Pertl

Published: 2025-08-31 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E7 / R3 (95%)
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation

Gal Niv, Jonathan Jacobi

Published: 2025-03-03 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Sycophancy Hides Linearly in the Attention Heads

Hilal Alquabeh, Kentaro Inui, Nurdaulet Mukhituly, Rifo Genadi

Published: 2026-01-23 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (95%)
TRACE: Training and Inference-Time Interpretability Analysis for Language Models

Nura Aljaafari, André Freitas, Danilo S. Carvalho

Published: 2025-07-04 | Area: Mechanistic Interp. | Citations: 1

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (96%)
The Circuits Research Landscape: Results and Perspectives

Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (93%)
The Dead Salmons of AI Interpretability

Maxime Peyrard, François Portet, Maxime Méloux, Giada Dirupo

Published: 2025-12-21 | Area: Mechanistic Interp. | Citations: 3

Tags: mechanistic-interp, ai-safety, position, interpretability

E7 / R3 (93%)
The Geometry of Self-Verification in a Task-Specific Reasoning Model

Chris Wendler, Andrew Lee, Fernanda Viégas, Martin Wattenberg

Published: 2025-04-19 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan

Published: 2025-02-13 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R4 (94%)
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Julian Minder, Tiago Pimentel, Thomas Hofmann, Denis Sutter

Published: 2025-07-11 | Area: Mechanistic Interp. | Citations: 11

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

Lei Yu, Meng Cao, Zhijing Jin, Yihuai Hong

Published: 2025-03-29 | Area: Mechanistic Interp. | Citations: 15

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training

Junyu Ren, T. Ed Li

Published: 2025-10-09 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (97%)
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Dhruv Nathawani, Luke Marks, Amir Abdullah, Philip Quirke

Published: 2025-03-17 | Area: Mechanistic Interp. | Citations: 2

Tags: mechanistic-interp, ai-safety, dataset, interpretability

E6 / R3 (95%)
Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research

Sean Trott

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: 2

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (94%)
Towards Atoms of Large Language Models

Chenhui Hu, Jun Zhao, Pengfei Cao, Yubo Chen

Published: 2025-09-25 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (98%)
Towards Combinatorial Interpretability of Neural Computation

Dan Alistarh, Nir Shavit, Micah Adler

Published: 2025-04-10 | Area: Mechanistic Interp. | Citations: 7

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

Yixuan Li, Mihalis A. Nicolaou, James Oldfield, Grigorios G Chrysos

Published: 2025-05-27 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (95%)
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

Junxuan Wang, Xuyang Ge, Zhengfu He, Junping Zhang

Published: 2025-04-29 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
Towards eliciting latent knowledge from LLMs with mechanistic interpretability

Emil Ryd, Senthooran Rajamanoharan, Bartosz Cywiński, Neel Nanda

Published: 2025-05-20 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
Tracing Attention Computation Through Feature Interactions

Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Training Language Models to Explain Their Own Computations

Jacob Andreas, Belinda Z. Li, Vincent Huang, Zifan Carl Guo

Published: 2025-11-11 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Training Superior Sparse Autoencoders for Instruct Models

Hamid Alinejad-Rokny, Yukun Chen, Jimmy Chih-Hsien Peng, Jiaming Li

Published: 2025-06-09 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Transferring Linear Features Across Language Models With Model Stitching

Ellie Pavlick, Alessandro Stolfo, Alan Chen, Jack Merullo

Published: 2025-06-07 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

Luca Baroni, Joachim Schaeffer, Stefan Heimersheim, Marat Subkhankulov

Published: 2025-07-03 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Truth Neurons

Yupeng Cao, Zining Zhu, Jordan W. Suchow, Yangyang Yu

Published: 2025-05-18 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (95%)
Understanding How Value Neurons Shape the Generation of Specified Values in LLMs

Lijie Hu, Shu Yang, Di Wang, Xinhai Wang

Published: 2025-05-23 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (95%)
Understanding Refusal in Language Models with Sparse Autoencoders

Wei Jie Yeo, Erik Cambria, Roy Ka-Wei Lee, Ranjan Satapathy

Published: 2025-05-29 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (92%)