Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 31-60 of 470 papers (page 2 of 16)

PaperIntel
Addressing divergent representations from causal interventions on neural networks

Christopher Potts, Alexa R. Tartaglini, Satchel Grant, Simon Jerome Han

Published: 2025-11-06 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)
AlignSAE: Concept-Aligned Sparse Autoencoders

Xinyu Guo, Jinhe Bi, Minglai Yang, Mihai Surdeanu

Published: 2025-12-01 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Analysis of Variational Sparse Autoencoders

Yuxiao Li, Zachary Baker

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov

Published: 2025-02-05 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana

Published: 2025-02-18 | Area: Mechanistic Interp. | Citations: 32

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Joshua Engels, Senthooran Rajamanoharan, Subhash Kantamneni, Max Tegmark

Published: 2025-02-23 | Area: Mechanistic Interp. | Citations: 62

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Atlas-Alignment: Making Interpretability Transferable Across Language Models

Sebastian Lapuschkin, Wojciech Samek, Jim Berend, Bruno Puri

Published: 2025-10-31 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability

E6 / R3 (95%)
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs

Thomas Wiegand, Sebastian Lapuschkin, Sayed Mohammad Vakilzadeh Hatefi, Wojciech Samek

Published: 2025-06-16 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
Automated Feature Labeling with Token-Space Gradient Descent

Julian Schulz, Seamus Fallows

Published: 2025-04-01 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison

Published: 2025-01-29 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

Qingsong Wen, Stephen Wang, Moayad Aloqaily, Kun Wang

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (98%)
Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord

Published: 2025-06-13 | Area: Mechanistic Interp. | Citations: 9

Tags: mechanistic-interp, ai-safety, position, interpretability

E5 / R3 (94%)
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Erik Cambria, Amir Abdullah, Roy Ka-Wei Lee, Yeo Wei Jie

Published: 2025-09-07 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

Xuansheng Wu, Mengnan Du, Ninghao Liu, Dong Shu

Published: 2025-05-12 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

Kaidi Xu, Song Wang, Zhen Tan, Tianlong Chen

Published: 2025-11-07 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Hakaze Cho, Naoya Inoue, Haolin Yang, Brian M. Kurkoski

Published: 2025-09-25 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (93%)
Binary Sparse Coding for Interpretability

Lucia Quirke, Stepan Shabalin, Nora Belrose

Published: 2025-09-29 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (92%)
BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Yihong Liu, Philipp Mondorf, Sebastian Gerstner, Hinrich Schütze

Published: 2025-10-08 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

Published: 2025-11-20 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (99%)
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Sachin Kumar, Yusen Peng, Alex Gulko

Published: 2025-08-31 | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E5 / R3 (94%)
Can Interpretation Predict Behavior on Unseen Data?

David Alvarez-Melis, Jenny Kaufmann, Martin Wattenberg, Victoria R. Li

Published: 2025-07-08 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (92%)
Can Large Language Models Develop Gambling Addiction?

Yunjeong Lee, Seungpil Lee, Donghyeon Shin, Sundong Kim

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (93%)
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

Kirill Bykov, Anna Hedström, Laura Kopf, Oliver Eberle

Published: 2025-06-18 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Circuit Stability Characterizes Language Model Generalization

Alan Sun

Published: 2025-05-30 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Circuit Tracing: Revealing Computational Graphs in Language Models

Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Circuits Updates - April 2025

Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (95%)
Combining Causal Models for More Accurate Abstractions of Neural Networks

Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger

Published: 2025-03-14 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem

Adam Newgas

Published: 2025-07-13 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (92%)
Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Yuhang Liu, Dong Gong, Erdun Gao, Anton van den Hengel

Published: 2026-01-28 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Concept-SAE: Active Causal Probing of Visual Model Behavior

Qiang Xu, Jianrong Ding, Chenchen Zhao, Muxi Chen

Published: 2025-09-26 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E6 / R4 (94%)