Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 451-470 of 470 papers (page 16 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders	Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino	-	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation	E6 / R3 (95%)	-
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet	Alex Tamkin, Craig Citro, Tom Conerly, Hoagy Cunningham	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Softmax Linear Units	Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models	Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	-
Sparse Crosscoders for Cross-Layer Features and Model Diffing	Thomas Conerly, Christopher Olah, Jonathan Marcus, Joshua Batson	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
Sparse Mixtures of Linear Transforms (MOLT)	Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
Stage-Wise Model Diffing	Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Superposition, Memorization, and Double Descent	Robert Lasenby, Tom Henighan, Nicholas Schiefer, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	-
The Circuits Research Landscape: Results and Perspectives	Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (93%)	-
Thread: Circuits	Gabriel Goh, Ludwig Schubert, Swee Kiat Lim, Chris Olah	-	Mechanistic Interp.	mechanistic-interp, ai-safety, survey	E5 / R3 (95%)	142
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning	Alex Tamkin, Tom Conerly, Brayden McLean, Nicholas Joseph	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Toy Models of Superposition	Robert Lasenby, Dawn Drain, Tom Henighan, Carol Chen	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	621
Tracing Attention Computation Through Feature Interactions	Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
Transformer Debugger	Steven Bills, Jan Leike, Henk Tillman, Catherine Yeh	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	-	-
Transformer Feed-Forward Layers Are Key-Value Memories	Omer Levy, Mor Geva, Roei Schuster, Jonathan Berant	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	1203
TransformerLens	Joseph Bloom, Neel Nanda	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (96%)	-
Understanding RL Vision	Gabriel Goh, Chris Olah, Nick Cammarata, Shan Carter	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	-
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron	Anirudh Goyal, Michael Qizhe Shieh, Yuxi Xie, Yiran Zhao	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (96%)	38
Where Confabulation Lives: Latent Feature Discovery in LLMs	Gerhard Wunder, Thibaud Ardoin, Yi Cai	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	1
Zoom In: An Introduction to Circuits	Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-