Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 151-180 of 470 papers (page 6 of 16)

Paper | Intel

Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates

Xinyu Yang, Jiaying Zhu, Wenya Wang, Hang Chen

Published: 2025-05-15 | Area: Mechanistic Interp. | Citations: 3

Tags: theoretical, mechanistic-interp, ai-safety

E6 / R3 (94%)

Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words

Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa

Published: 2025-01-09 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E5 / R3 (94%)

Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need

Adam Karvonen

Published: 2025-03-21 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)

Robustly Identifying Concepts Introduced During Chat Fine-tuning Using Crosscoders

Julian Minder, Caden Juang, Bilal Chughtai, Clement Dumas

Published: 2025-04-03 | Area: Mechanistic Interp. | Citations: 7

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (97%)

RouteSAE: Route Sparse Autoencoder to Interpret Large Language Models

Sihang Li, Tao Liang, Wei Shi, Guojun Ma

Published: 2025-03-11 | Area: Mechanistic Interp. | Citations: 17

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R4 (95%)

SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang

Published: 2025-02-22 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E4 / R3 (96%)

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Can Rager, Samuel Marks, David Chanin, Adam Karvonen

Published: 2025-03-12 | Area: Mechanistic Interp. | Citations: 62

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E5 / R3 (95%)

SAEs Are Good for Steering -- If You Select the Right Features

Aaron Mueller, Yonatan Belinkov, Dana Arad

Published: 2025-05-26 | Area: Mechanistic Interp. | Citations: 23

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)

SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Sihang Li, Tao Liang, Wei Shi, Guojun Ma

Published: 2025-07-01 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, alignment-training, mechanistic-interp, ai-safety

E5 / R4 (96%)

SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models

Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 18

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Kamil Deja, Bartosz Cywinski

Published: 2025-01-29 | Area: Mechanistic Interp. | Citations: 37

Tags: empirical, mechanistic-interp, ai-safety, adversarial-robustness

E5 / R3 (97%)

SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis

Zhenguang G. Cai, Xufeng Duan, Zihao Fu

Published: 2026-01-12 | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E4 / R3 (97%)

SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models

Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)

SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders

Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation

E6 / R3 (95%)

Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

Qinqin He, Xiting Wang, Han Zheng, Jiaqi Weng

Published: 2025-09-11 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (97%)

Scaling Sparse Feature Circuits For Studying In-Context Learning

Fazl Barez, Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy

Published: 2025-04-18 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)

Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Arvind Satyanarayan, Yannick Assogba, Dominik Moritz, Angie Boggust

Published: 2025-10-07 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

James Evans, Dawn Song, Bofan Gong, Shiyang Lai

Published: 2025-05-16 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)

Simple Mechanistic Explanations for Out-Of-Context Reasoning

Oliver Clive-Griffin, Joshua Engels, Senthooran Rajamanoharan, Atticus Wang

Published: 2025-07-10 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen

Published: 2025-02-17 | Area: Mechanistic Interp. | Citations: 16

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed

Published: 2025-02-07 | Area: Mechanistic Interp. | Citations: 43

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)

Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models

Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)

Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models

Serge Belongie, Quentin Bouniot, Shyamgopal Karthik, Zeynep Akata

Published: 2025-04-03 | Area: Mechanistic Interp. | Citations: 21

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R3 (97%)

Sparse Autoencoders Trained on the Same Data Learn Different Features

Gonçalo Paulo, Nora Belrose

Published: 2025-01-28 | Area: Mechanistic Interp. | Citations: 40

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models

Ruixuan Deng, Aman Taxali, Joyce Chai, Chandra Sripada

Published: 2025-06-22 | Area: Mechanistic Interp. | Citations: 2

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (93%)

Sparse Mixtures of Linear Transforms (MOLT)

Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

David Chanin, Adrià Garriga-Alonso

Published: 2025-08-22 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (94%)

Spectral Superposition: A Theory of Feature Geometry

Narmeen Oozeer, Amir Abdullah, Shriyash Upadhyay, Tasana Pejovic

Published: 2026-02-02 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (94%)

SplInterp: Improving our Understanding and Training of Sparse Autoencoders

Randall Balestriero, Jeremy Budd, Javier Ideami, Keith Duggar

Published: 2025-05-17 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (96%)

Steering CLIP's vision transformer with sparse autoencoders

Robert Graham, Yossi Gandelsman, Sonia Joseph, Ethan Goldfarb

Published: 2025-04-11 | Area: Mechanistic Interp. | Citations: 14

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)