Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 211-240 of 470 papers (page 8 of 16)

PaperIntel
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

Published: 2025-02-21 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R4 (93%)
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana

Published: 2025-02-18 | Area: Mechanistic Interp. | Citations: 32

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 18

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 16

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis

Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan

Published: 2025-02-13 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R4 (94%)
Deciphering Functions of Neurons in Vision-Language Models

Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan

Published: 2025-02-10 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
Position-aware Automatic Circuit Discovery

David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad

Published: 2025-02-07 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Sparse Autoencoders Do Not Find Canonical Units of Analysis

Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed

Published: 2025-02-07 | Area: Mechanistic Interp. | Citations: 43

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Konstantinos Derpanis, Julian Forsyth, Thomas Fel, Harrish Thasarathan

Published: 2025-02-06 | Area: Mechanistic Interp. | Citations: 27

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E4 / R3 (97%)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov

Published: 2025-02-05 | Area: Mechanistic Interp. | Citations: 6

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Modular Training of Neural Networks aids Interpretability

Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots

Published: 2025-02-04 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Low-Rank Adapting Models for Sparse Autoencoders

Joshua Engels, Matthew Chen, Max Tegmark

Published: 2025-01-31 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison

Published: 2025-01-29 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Kamil Deja, Bartosz Cywinski

Published: 2025-01-29 | Area: Mechanistic Interp. | Citations: 37

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (97%)
Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning

Pietro Ferraro, David Boyle, Anna Soligo

Published: 2025-01-28 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Sparse Autoencoders Trained on the Same Data Learn Different Features

Gonçalo Paulo, Nora Belrose

Published: 2025-01-28 | Area: Mechanistic Interp. | Citations: 40

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Propositional Interpretability in Artificial Intelligence

David J. Chalmers

Published: 2025-01-27 | Area: Mechanistic Interp. | Citations: 13

Tags: theoretical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (94%)
Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan

Published: 2025-01-14 | Area: Mechanistic Interp. | Citations: 26

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (92%)
Mechanistic understanding and validation of large AI models with SemanticLens

Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek

Published: 2025-01-09 | Area: Mechanistic Interp. | Citations: 29

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (95%)
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa

Published: 2025-01-09 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (94%)
Transformers Use Causal World Models in Maze-Solving Tasks

Alessandra Russo, Alex F. Spies, Katsumi Inoue, Tilman Räuker

Published: 2024-12-16 | Area: Mechanistic Interp. | Citations: 9

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)
BatchTopK Sparse Autoencoders

Bart Bussmann, Neel Nanda, Patrick Leask

Published: 2024-12-09 | Area: Mechanistic Interp. | Citations: 64

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (97%)
Monet: Mixture of Monosemantic Experts for Transformers

Jaewoo Kang, Jungwoo Park, Young Jin Ahn, Kee-Eung Kim

Published: 2024-12-05 | Area: Mechanistic Interp. | Citations: 9

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Lawrence Chan, Rajashree Agrawal, Chun Hei Yip, Jason Gross

Published: 2024-12-04 | Area: Mechanistic Interp. | Citations: 4

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Can Rager, Samuel Marks, Adam Karvonen, Neel Nanda

Published: 2024-11-28 | Area: Mechanistic Interp. | Citations: 9

Tags: mechanistic-interp, ai-safety, safety-evaluation, benchmark

E5 / R3 (96%)
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda

Published: 2024-11-21 | Area: Mechanistic Interp. | Citations: 88

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

David Klindt, Charles O'Neill

Published: 2024-11-20 | Area: Mechanistic Interp. | Citations: 7

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (97%)
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit

Kui Ren, Huiyu Xu, Chun Chen, Rui Zheng

Published: 2024-11-17 | Area: Mechanistic Interp. | Citations: 16

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E6 / R3 (93%)
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

David Wingate, Jared Wilson, Max Forsey, Jeffrey Olmo

Published: 2024-11-15 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)