Paper deep dive

Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research

Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

Year: 2025 | Venue: DeepMind Safety Research blog / Alignment Forum | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 0

Models: Gemma 2

Abstract

DeepMind's mechanistic interpretability team finds that sparse autoencoders (SAEs) underperform dense linear probes at out-of-distribution (OOD) detection of harmful intent, and that SAE reconstructions discard safety-relevant information. These negative results led the team to deprioritize SAE research.
