Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 151-180 of 470 papers (page 6 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates	Xinyu Yang, Jiaying Zhu, Wenya Wang, Hang Chen	2025-05-15	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E6 / R3 (94%)	3
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words	Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa	2025-01-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (94%)	10
Revisiting End-To-End Sparse Autoencoder Training: A Short Finetune Is All You Need	Adam Karvonen	2025-03-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	1
Robustly Identifying Concepts Introduced During Chat Fine-tuning Using Crosscoders	Julian Minder, Caden Juang, Bilal Chughtai, Clement Dumas	2025-04-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (97%)	7
RouteSAE: Route Sparse Autoencoder to Interpret Large Language Models	Sihang Li, Tao Liang, Wei Shi, Guojun Ma	2025-03-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R4 (95%)	17
SAE-V: Interpreting Multimodal Models for Enhanced Alignment	Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang	2025-02-22	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E4 / R3 (96%)	8
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability	Can Rager, Samuel Marks, David Chanin, Adam Karvonen	2025-03-12	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark	E5 / R3 (95%)	62
SAEs Are Good for Steering -- If You Select the Right Features	Aaron Mueller, Yonatan Belinkov, Dana Arad	2025-05-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	23
SAFER: Probing Safety in Reward Models with Sparse Autoencoder	Sihang Li, Tao Liang, Wei Shi, Guojun Ma	2025-07-01	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R4 (96%)	2
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models	Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	18
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders	Kamil Deja, Bartosz Cywinski	2025-01-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (97%)	37
SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis	Zhenguang G. Cai, Xufeng Duan, Zihao Fu	2026-01-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (97%)	-
SCIURus: Shared Circuits for Interpretable Uncertainty Representations in Language Models	Tim G. J. Rudner, Carter Teplica, Arman Cohan, Yixin Liu	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders	Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino	-	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation	E6 / R3 (95%)	-
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework	Qinqin He, Xiting Wang, Han Zheng, Jiaqi Weng	2025-09-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (97%)	1
Scaling Sparse Feature Circuits For Studying In-Context Learning	Fazl Barez, Dmitrii Kharlapenko, Stepan Shabalin, Arthur Conmy	2025-04-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	4
Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language	Arvind Satyanarayan, Yannick Assogba, Dominik Moritz, Angie Boggust	2025-10-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence	James Evans, Dawn Song, Bofan Gong, Shiyang Lai	2025-05-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	2
Simple Mechanistic Explanations for Out-Of-Context Reasoning	Oliver Clive-Griffin, Joshua Engels, Senthooran Rajamanoharan, Atticus Wang	2025-07-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	3
Sparse Autoencoder Features for Classifications and Transferability	Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	16
Sparse Autoencoders Do Not Find Canonical Units of Analysis	Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed	2025-02-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	43
Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models	Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	-
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models	Serge Belongie, Quentin Bouniot, Shyamgopal Karthik, Zeynep Akata	2025-04-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (97%)	21
Sparse Autoencoders Trained on the Same Data Learn Different Features	Gonçalo Paulo, Nora Belrose	2025-01-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	40
Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models	Ruixuan Deng, Aman Taxali, Joyce Chai, Chandra Sripada	2025-06-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (93%)	2
Sparse Mixtures of Linear Transforms (MOLT)	Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders	David Chanin, Adrià Garriga-Alonso	2025-08-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	4
Spectral Superposition: A Theory of Feature Geometry	Narmeen Oozeer, Amir Abdullah, Shriyash Upadhyay, Tasana Pejovic	2026-02-02	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
SplInterp: Improving our Understanding and Training of Sparse Autoencoders	Randall Balestriero, Jeremy Budd, Javier Ideami, Keith Duggar	2025-05-17	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Steering CLIP's vision transformer with sparse autoencoders	Robert Graham, Yossi Gandelsman, Sonia Joseph, Ethan Goldfarb	2025-04-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (96%)	14