Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 241-270 of 470 papers (page 9 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction	Filip Sondej, Yushi Yang, Harry Mayne, Adam Mahdi	2024-11-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R2 (95%)	3
Towards Unifying Interpretability and Control: Evaluation via Intervention	Usha Bhalla, Asma Ghandeharioun, Himabindu Lakkaraju, Suraj Srinivas	2024-11-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation	E6 / R3 (94%)	20
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning	Rina Panigrahy, Cyrus Rashtchian, Enming Luo, Guan Zhe Hong	2024-11-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	4
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders	Kola Ayonrinde	2024-11-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	8
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders	Fazl Barez, Luke Marks, Alasdair Paren, David Krueger	2024-11-02	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	19
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models	Mona Diab, Virginia Smith, Aashiq Muhamed	2024-11-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	10
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics	Yaniv Nikankin, Aaron Mueller, Yonatan Belinkov, Anja Reusch	2024-10-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	74
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups	Marco Molinari, Davide Ghilardi, Federico Belotti	2024-10-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	10
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models	David Bau, Chris Wendler, Caglar Gulcehre, Antonio Mari	2024-10-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (95%)	18
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness	Jingyi Cui, Qi Zhang, Xiang Pan, Qi Lei	2024-10-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R4 (94%)	5
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders	Junxuan Wang, Zuxuan Wu, Xuyang Ge, Frances Liu	2024-10-27	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (97%)	91
Decomposing The Dark Matter of Sparse Autoencoders	Joshua Engels, Max Tegmark, Logan Smith	2024-10-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (93%)	33
Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models	Wei Jie Yeo, Erik Cambria, Ranjan Satapathy	2024-10-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	4
Automatically Interpreting Millions of Features in Large Language Models	Caden Juang, Alex Mallen, Nora Belrose, Gonçalo Paulo	2024-10-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	69
On the Role of Attention Heads in Large Language Model Safety	Yongbin Li, Rongwu Xu, Junfeng Fang, Kun Wang	2024-10-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	46
Analyzing (In)Abilities of SAEs via Formal Languages	Manish Shrivastava, Abhinav Menon, Ekdeep Singh Lubana, David Krueger	2024-10-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R3 (95%)	15
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability	Jun Xu, Yang Song, Zhongxiang Sun, Han Li	2024-10-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R4 (95%)	68
The Persian Rug: solving toy models of superposition using large-scale symmetries	Alex Infanger, Aditya Cowsik, Kfir Dolev	2024-10-15	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Bilinear MLPs Enable Weight-Based Mechanistic Interpretability	Thomas Dooms, Jose M. Oramas, Michael T. Pearce, Alice Rigg	2024-10-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	19
Efficient Dictionary Learning with Switch Sparse Autoencoders	Anish Mudide, Christian Schroeder de Witt, Joshua Engels, Eric J. Michaud	2024-10-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	31
Mechanistic Permutability: Match Features Across Layers	Ian Maksimov, Daniil Gavrilov, Nikita Balagansky	2024-10-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (98%)	14
The Geometry of Concepts: Sparse Autoencoder Feature Structure	Yuxiao Li, David D. Baek, Joshua Engels, Xiaoqing Sun	2024-10-10	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	40
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders	Philip Torr, Constantin Venhoff, Christian Schroeder de Witt, Anisoara Calinescu	2024-10-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, safety-evaluation	E6 / R3 (96%)	1
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models	Philip Torr, Austin Meek, Fazl Barez, Ashkan Khakzar	2024-10-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	11
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks	Evzen Wybitul, Alexander Matt Turner, Joseph Miller, Alex Cloud	2024-10-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	18
Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language	Severin Field, Mat Allen, Anthony Costarelli	2024-10-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	5
Sparse Attention Decomposition Applied to Circuit Tracing	Gabriel Franco, Mark Crovella	2024-10-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (97%)	3
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders	David Chanin, James Wilken-Smith, Joseph Bloom, Tomas Dulka	2024-09-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	83
Optimal Ablation for Interpretability	Lucas Janson, Maximilian Li	2024-09-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R4 (92%)	14
TracrBench: Generating Interpretability Testbeds with Large Language Models	Hannes Thurnherr, Jérémy Scheurer	2024-09-07	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark	E5 / R3 (97%)	4