Paper deep dive
Negative Results for Sparse Autoencoders on Downstream Tasks and Deprioritising SAE Research
Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
Year: 2025
Venue: DeepMind Safety Research blog / Alignment Forum
Area: Mechanistic Interp.
Type: Empirical
Embeddings: 0
Models: Gemma 2
Abstract
DeepMind's mechanistic interpretability team finds that sparse autoencoders (SAEs) underperform dense linear probes at out-of-distribution (OOD) detection of harmful intent, and that SAE reconstructions discard safety-relevant information. As a result, the team is deprioritizing SAE research.
Tags
- ai-safety (imported, 100%)
- empirical (suggested, 88%)
- mechanistic-interp (suggested, 92%)
Links
- Source: https://deepmindsafetyresearch.medium.com/negative-results-for-sparse-autoencoders-on-downstream-tasks-and-deprioritising-sae-research-6cadcfc125b9
- Canonical: https://deepmindsafetyresearch.medium.com/negative-results-for-sparse-autoencoders-on-downstream-tasks-and-deprioritising-sae-research-6cadcfc125b9