Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 211-240 of 470 papers (page 8 of 16)

Paper	Published	Area	Tags	Intel	Citations
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders (Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu)	2025-02-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R4 (93%)	20
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models (Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana)	2025-02-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	32
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models (Mengnan Du, Jing Ma, Zirui He, Haiyan Zhao)	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	18
Sparse Autoencoder Features for Classifications and Transferability (Jack Gallifant, Thomas Hartvigsen, Hugo Aerts, Shan Chen)	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	16
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis (Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du)	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	10
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions (Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan)	2025-02-13	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R4 (94%)	8
Deciphering Functions of Neurons in Vision-Language Models (Yan Lu, Jiaqi Xu, Xuejin Chen, Cuiling Lan)	2025-02-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	-
Position-aware Automatic Circuit Discovery (David Bau, Aaron Mueller, Tal Haklay, Hadas Orgad)	2025-02-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	6
Sparse Autoencoders Do Not Find Canonical Units of Analysis (Michael Pearce, Curt Tigges, Joseph Bloom, Noura Al Moubayed)	2025-02-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	43
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment (Konstantinos Derpanis, Julian Forsyth, Thomas Fel, Harrish Thasarathan)	2025-02-06	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E4 / R3 (97%)	27
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models (Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov)	2025-02-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	6
Modular Training of Neural Networks aids Interpretability (Joan Velja, Maheep Chaudhary, Alessandro Abate, Nandi Schoots)	2025-02-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	1
Low-Rank Adapting Models for Sparse Autoencoders (Joshua Engels, Matthew Chen, Max Tegmark)	2025-01-31	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	4
Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers (Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison)	2025-01-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	26
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders (Kamil Deja, Bartosz Cywinski)	2025-01-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (97%)	37
Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning (Pietro Ferraro, David Boyle, Anna Soligo)	2025-01-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	2
Sparse Autoencoders Trained on the Same Data Learn Different Features (Gonçalo Paulo, Nora Belrose)	2025-01-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	40
Propositional Interpretability in Artificial Intelligence (David J. Chalmers)	2025-01-27	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (94%)	13
Enhancing Automated Interpretability with Output-Centric Feature Descriptions (Chen Agassy, Mor Geva, Yoav Gur-Arieh, Roy Mayan)	2025-01-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (92%)	26
Mechanistic understanding and validation of large AI models with SemanticLens (Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek)	2025-01-09	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (95%)	29
Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words (Hiroki Furuta, Yutaka Matsuo, Gouki Minegishi, Yusuke Iwasawa)	2025-01-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (94%)	10
Transformers Use Causal World Models in Maze-Solving Tasks (Alessandra Russo, Alex F. Spies, Katsumi Inoue, Tilman Räuker)	2024-12-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	9
BatchTopK Sparse Autoencoders (Bart Bussmann, Neel Nanda, Patrick Leask)	2024-12-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	64
Monet: Mixture of Monosemantic Experts for Transformers (Jaewoo Kang, Jungwoo Park, Young Jin Ahn, Kee-Eung Kim)	2024-12-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	9
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration (Lawrence Chan, Rajashree Agrawal, Chun Hei Yip, Jason Gross)	2024-12-04	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (95%)	4
Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks (Can Rager, Samuel Marks, Adam Karvonen, Neel Nanda)	2024-11-28	Mechanistic Interp.	mechanistic-interp, ai-safety, safety-evaluation, benchmark	E5 / R3 (96%)	9
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models (Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda)	2024-11-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	88
Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders (David Klindt, Charles O'Neill)	2024-11-20	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (97%)	7
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit (Kui Ren, Huiyu Xu, Chun Chen, Rui Zheng)	2024-11-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E6 / R3 (93%)	16
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning (David Wingate, Jared Wilson, Max Forsey, Jeffrey Olmo)	2024-11-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	4