Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 451-470 of 470 papers (page 16 of 16)

Paper | Intel
SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders

Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation

E6 / R3 (95%)
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Alex Tamkin, Craig Citro, Tom Conerly, Hoagy Cunningham

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Softmax Linear Units

Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models

Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (97%)
Sparse Crosscoders for Cross-Layer Features and Model Diffing

Thomas Conerly, Christopher Olah, Jonathan Marcus, Joshua Batson

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Sparse Mixtures of Linear Transforms (MOLT)

Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Stage-Wise Model Diffing

Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Superposition, Memorization, and Double Descent

Robert Lasenby, Tom Henighan, Nicholas Schiefer, Christopher Olah

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
The Circuits Research Landscape: Results and Perspectives

Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (93%)
Thread: Circuits

Gabriel Goh, Ludwig Schubert, Swee Kiat Lim, Chris Olah

Published: - | Area: Mechanistic Interp. | Citations: 142

Tags: mechanistic-interp, ai-safety, survey

E5 / R3 (95%)
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Alex Tamkin, Tom Conerly, Brayden McLean, Nicholas Joseph

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)
Toy Models of Superposition

Robert Lasenby, Dawn Drain, Tom Henighan, Carol Chen

Published: - | Area: Mechanistic Interp. | Citations: 621

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Tracing Attention Computation Through Feature Interactions

Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (93%)
Transformer Debugger

Steven Bills, Jan Leike, Henk Tillman, Catherine Yeh

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

-
Transformer Feed-Forward Layers Are Key-Value Memories

Omer Levy, Mor Geva, Roei Schuster, Jonathan Berant

Published: - | Area: Mechanistic Interp. | Citations: 1203

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (94%)
TransformerLens

Joseph Bloom, Neel Nanda

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (96%)
Understanding RL Vision

Gabriel Goh, Chris Olah, Nick Cammarata, Shan Carter

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety, interpretability

E5 / R3 (94%)
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Anirudh Goyal, Michael Qizhe Shieh, Yuxi Xie, Yiran Zhao

Published: - | Area: Mechanistic Interp. | Citations: 38

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)
Where Confabulation Lives: Latent Feature Discovery in LLMs

Gerhard Wunder, Thibaud Ardoin, Yi Cai

Published: - | Area: Mechanistic Interp. | Citations: 1

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (94%)
Zoom In: An Introduction to Circuits

Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata

Published: - | Area: Mechanistic Interp. | Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)