Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates | Xinyu Yang, Jiaying Zhu, Wenya Wang, Hang Chen | 2025-05-15 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 3 |
| Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words | Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa | 2025-01-09 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, safety-evaluation | E5 / R3 (94%) | 10 |
| Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need | Adam Karvonen | 2025-03-21 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 1 |
| Robustly Identifying Concepts Introduced During Chat Fine-tuning Using Crosscoders | Julian Minder, Caden Juang, Bilal Chughtai, Clement Dumas | 2025-04-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (97%) | 7 |
| RouteSAE: Route Sparse Autoencoder to Interpret Large Language Models | Sihang Li, Tao Liang, Wei Shi, Guojun Ma | 2025-03-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R4 (95%) | 17 |
| SAE-V: Interpreting Multimodal Models for Enhanced Alignment | Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang | 2025-02-22 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E4 / R3 (96%) | 8 |
| SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability | Can Rager, Samuel Marks, David Chanin, Adam Karvonen | 2025-03-12 | Mechanistic Interp. | mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark | E5 / R3 (95%) | 62 |
| SAEs Are Good for Steering -- If You Select the Right Features | Aaron Mueller, Yonatan Belinkov, Dana Arad | 2025-05-26 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 23 |
| SAFER: Probing Safety in Reward Models with Sparse Autoencoder | Sihang Li, Tao Liang, Wei Shi, Guojun Ma | 2025-07-01 | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety | E5 / R4 (96%) | 2 |
| SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models | Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 18 |
| SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders | Kamil Deja, Bartosz Cywinski | 2025-01-29 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, adversarial-robustness | E5 / R3 (97%) | 37 |
| SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis | Zhenguang G. Cai, Xufeng Duan, Zihao Fu | 2026-01-12 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E4 / R3 (97%) | - |
| SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models | Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders | Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino | - | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation | E6 / R3 (95%) | - |
| Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework | Qinqin He, Xiting Wang, Han Zheng, Jiaqi Weng | 2025-09-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (97%) | 1 |
| Scaling Sparse Feature Circuits For Studying In-Context Learning | Fazl Barez, Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy | 2025-04-18 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 4 |
| Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language | Arvind Satyanarayan, Yannick Assogba, Dominik Moritz, Angie Boggust | 2025-10-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 1 |
| Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence | James Evans, Dawn Song, Bofan Gong, Shiyang Lai | 2025-05-16 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (95%) | 2 |
| Simple Mechanistic Explanations for Out-Of-Context Reasoning | Oliver Clive-Griffin, Joshua Engels, Senthooran Rajamanoharan, Atticus Wang | 2025-07-10 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 3 |
| Sparse Autoencoder Features for Classifications and Transferability | Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen | 2025-02-17 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (96%) | 16 |
| Sparse Autoencoders Do Not Find Canonical Units of Analysis | Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed | 2025-02-07 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 43 |
| Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models | Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
| Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | Serge Belongie, Quentin Bouniot, Shyamgopal Karthik, Zeynep Akata | 2025-04-03 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E6 / R3 (97%) | 21 |
| Sparse Autoencoders Trained on the Same Data Learn Different Features | Gonçalo Paulo, Nora Belrose | 2025-01-28 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | 40 |
| Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models | Ruixuan Deng, Aman Taxali, Joyce Chai, Chandra Sripada | 2025-06-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (93%) | 2 |
| Sparse Mixtures of Linear Transforms (MOLT) | Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders | David Chanin, Adrià Garriga-Alonso | 2025-08-22 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (94%) | 4 |
| Spectral Superposition: A Theory of Feature Geometry | Narmeen Oozeer, Amir Abdullah, Shriyash Upadhyay, Tasana Pejovic | 2026-02-02 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| SplInterp: Improving our Understanding and Training of Sparse Autoencoders | Randall Balestriero, Jeremy Budd, Javier Ideami, Keith Duggar | 2025-05-17 | Mechanistic Interp. | theoretical, mechanistic-interp, ai-safety | E5 / R3 (96%) | - |
| Steering CLIP's vision transformer with sparse autoencoders | Robert Graham, Yossi Gandelsman, Sonia Joseph, Ethan Goldfarb | 2025-04-11 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | 14 |