Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 91-120 of 470 papers (page 4 of 16)

Paper | Intel
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs

Letao Han, Ling Hu, Yuemei Xu, Xiaoyang Gu

Published: 2025-04-07 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E7 / R3 (94%)
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers

Julia Kempe, Karen Ullrich, Jingtong Su

Published: 2025-06-20 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E6 / R4 (95%)
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel

Published: 2025-06-03 | Area: Mechanistic Interp. | Citations: 16

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models

Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams

Published: - | Area: Mechanistic Interp. | Citations: 32

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Sonia Joseph, Jack Stanley, Praneet Suresh, Luca Scimeca

Published: 2025-09-08 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
From superposition to sparse codes: interpretable representations in neural networks

Harald Maurer, Patrik Reizinger, Nina Miolane, David Klindt

Published: 2025-03-03 | Area: Mechanistic Interp. | Citations: 7

Tags: theoretical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (93%)
GIM: Improved Interpretability for Large Language Models

Lars Maaløe, Tuukka Ruotsalo, Maria Maistro, Róbert Csordás

Published: 2025-05-23 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E7 / R3 (96%)
Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3

Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (98%)
Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs

Atharva Nijasure, Tanya Chowdhury, James Allan, Yair Zick

Published: 2025-09-28 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
How Can Interpretability Researchers Help AGI Go Well?

Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: alignment-training, mechanistic-interp, ai-safety, position, interpretability

-
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Yizhou Sun, Weikai Li, Shichang Zhang, Himabindu Lakkaraju

Published: 2025-04-03 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety

E7 / R4 (94%)
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks

Christopher Potts, Jiuding Sun, Michael Sklar, Karel D'Oosterlinck

Published: 2025-03-13 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (96%)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Alexey Dontsov, Ivan Oseledets, Polina Druzhinina, Oleg Y. Rogov

Published: 2025-03-24 | Area: Mechanistic Interp. | Citations: 24

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition

Brianna Chrisman, Lee Sharkey, Lucius Bushnaq

Published: 2025-03-31 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)
Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning

Pietro Ferraro, David Boyle, Anna Soligo

Published: 2025-01-28 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Noura Al Moubayed, Neel Nanda, Patrick Leask

Published: 2025-05-23 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Felix Meissen, Javier Alvarez-Valle, Daniel Coelho de Castro, Shruthi Bannur

Published: 2025-07-17 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (96%)
Insights on Crosscoder Model Diffing

Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Christopher Potts, Junyi Tao, Thomas Icard, Jing Huang

Published: 2025-05-17 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E8 / R3 (99%)
Interpretability Illusions with Sparse Autoencoders

Usha Bhalla, Aaron J. Li, Himabindu Lakkaraju, Suraj Srinivas

Published: 2025-05-21 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability

E5 / R3 (95%)
Interpreting CLIP with Hierarchical Sparse Autoencoders

Hubert Baniecki, Vladimir Zaigrajew, Przemyslaw Biecek

Published: 2025-02-27 | Area: Mechanistic Interp. | Citations: 19

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Interpreting Transformers Through Attention Head Intervention

Mason Kadem, Rong Zheng

Published: 2026-01-07 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, survey, interpretability

E5 / R3 (95%)
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

Published: 2025-02-21 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R4 (93%)
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Aaron David Tucker

Published: 2025-06-11 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

Zhennan Zhou, Yifan Luo, Bin Dong

Published: 2025-06-09 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison

Published: 2025-02-25 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Di Wu, Jun Zhao, Jiaxiang Liu, Boxuan Xing

Published: 2025-06-10 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (93%)
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders

Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin

Published: 2025-05-28 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
LLM Interpretability with Identifiable Temporal-Instantaneous Representation

Yujia Zheng, Kun Zhang, Xiangchen Song, Jiaqi Sun

Published: 2025-09-27 | Area: Mechanistic Interp. | Citations: 2

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (94%)
Language Model Circuits Are Sparse in the Neuron Basis

Aryaman Arora, Sarah Schwettmann, Zhengxuan Wu, Jacob Steinhardt

Published: 2026-01-30 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)