Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Paper | Intel
Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction

Filip Sondej, Yushi Yang, Harry Mayne, Adam Mahdi

Published: 2024-11-10 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E6 / R2 (95%)
Towards Unifying Interpretability and Control: Evaluation via Intervention

Usha Bhalla, Asma Ghandeharioun, Himabindu Lakkaraju, Suraj Srinivas

Published: 2024-11-07 | Area: Mechanistic Interp. | Citations: 20

Tags: empirical, mechanistic-interp, ai-safety, interpretability, safety-evaluation

E6 / R3 (94%)
A Implies B: Circuit Analysis in LLMs for Propositional Logical Reasoning

Rina Panigrahy, Cyrus Rashtchian, Enming Luo, Guan Zhe Hong

Published: 2024-11-06 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R4 (94%)
Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Kola Ayonrinde

Published: 2024-11-04 | Area: Mechanistic Interp. | Citations: 8

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Fazl Barez, Luke Marks, Alasdair Paren, David Krueger

Published: 2024-11-02 | Area: Mechanistic Interp. | Citations: 19

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models

Mona Diab, Virginia Smith, Aashiq Muhamed

Published: 2024-11-01 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (95%)
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics

Yaniv Nikankin, Aaron Mueller, Yonatan Belinkov, Anja Reusch

Published: 2024-10-28 | Area: Mechanistic Interp. | Citations: 74

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Group-SAE: Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

Marco Molinari, Davide Ghilardi, Federico Belotti

Published: 2024-10-28 | Area: Mechanistic Interp. | Citations: 10

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

David Bau, Chris Wendler, Caglar Gulcehre, Antonio Mari

Published: 2024-10-28 | Area: Mechanistic Interp. | Citations: 18

Tags: empirical, mechanistic-interp, ai-safety

E6 / R4 (95%)
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Jingyi Cui, Qi Zhang, Xiang Pan, Qi Lei

Published: 2024-10-27 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (94%)
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Junxuan Wang, Zuxuan Wu, Xuyang Ge, Frances Liu

Published: 2024-10-27 | Area: Mechanistic Interp. | Citations: 91

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (97%)
Decomposing The Dark Matter of Sparse Autoencoders

Joshua Engels, Max Tegmark, Logan Smith

Published: 2024-10-18 | Area: Mechanistic Interp. | Citations: 33

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (93%)
Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models

Wei Jie Yeo, Erik Cambria, Ranjan Satapathy

Published: 2024-10-18 | Area: Mechanistic Interp. | Citations: 4

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Automatically Interpreting Millions of Features in Large Language Models

Caden Juang, Alex Mallen, Nora Belrose, Gonçalo Paulo

Published: 2024-10-17 | Area: Mechanistic Interp. | Citations: 69

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
On the Role of Attention Heads in Large Language Model Safety

Yongbin Li, Rongwu Xu, Junfeng Fang, Kun Wang

Published: 2024-10-17 | Area: Mechanistic Interp. | Citations: 46

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Analyzing (In)Abilities of SAEs via Formal Languages

Manish Shrivastava, Abhinav Menon, Ekdeep Singh Lubana, David Krueger

Published: 2024-10-15 | Area: Mechanistic Interp. | Citations: 15

Tags: empirical, mechanistic-interp, ai-safety

E7 / R3 (95%)
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Jun Xu, Yang Song, Zhongxiang Sun, Han Li

Published: 2024-10-15 | Area: Mechanistic Interp. | Citations: 68

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E6 / R4 (95%)
The Persian Rug: solving toy models of superposition using large-scale symmetries

Alex Infanger, Aditya Cowsik, Kfir Dolev

Published: 2024-10-15 | Area: Mechanistic Interp. | Citations: -

Tags: theoretical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Bilinear MLPs Enable Weight-Based Mechanistic Interpretability

Thomas Dooms, Jose M. Oramas, Michael T. Pearce, Alice Rigg

Published: 2024-10-10 | Area: Mechanistic Interp. | Citations: 19

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Efficient Dictionary Learning with Switch Sparse Autoencoders

Anish Mudide, Christian Schroeder de Witt, Joshua Engels, Eric J. Michaud

Published: 2024-10-10 | Area: Mechanistic Interp. | Citations: 31

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (96%)
Mechanistic Permutability: Match Features Across Layers

Ian Maksimov, Daniil Gavrilov, Nikita Balagansky

Published: 2024-10-10 | Area: Mechanistic Interp. | Citations: 14

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (98%)
The Geometry of Concepts: Sparse Autoencoder Feature Structure

Yuxiao Li, David D. Baek, Joshua Engels, Xiaoqing Sun

Published: 2024-10-10 | Area: Mechanistic Interp. | Citations: 40

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders

Philip Torr, Constantin Venhoff, Christian Schroeder de Witt, Anisoara Calinescu

Published: 2024-10-09 | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety, safety-evaluation

E6 / R3 (96%)
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Philip Torr, Austin Meek, Fazl Barez, Ashkan Khakzar

Published: 2024-10-09 | Area: Mechanistic Interp. | Citations: 11

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Evzen Wybitul, Alexander Matt Turner, Joseph Miller, Alex Cloud

Published: 2024-10-06 | Area: Mechanistic Interp. | Citations: 18

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (95%)
Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Severin Field, Mat Allen, Anthony Costarelli

Published: 2024-10-03 | Area: Mechanistic Interp. | Citations: 5

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Sparse Attention Decomposition Applied to Circuit Tracing

Gabriel Franco, Mark Crovella

Published: 2024-10-01 | Area: Mechanistic Interp. | Citations: 3

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (97%)
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin, James Wilken-Smith, Joseph Bloom, Tomas Dulka

Published: 2024-09-22 | Area: Mechanistic Interp. | Citations: 83

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Optimal Ablation for Interpretability

Lucas Janson, Maximilian Li

Published: 2024-09-16 | Area: Mechanistic Interp. | Citations: 14

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R4 (92%)
TracrBench: Generating Interpretability Testbeds with Large Language Models

Hannes Thurnherr, Jérémy Scheurer

Published: 2024-09-07 | Area: Mechanistic Interp. | Citations: 4

Tags: mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark

E5 / R3 (97%)