Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
Showing 451-470 of 470 papers (page 16 of 16)
| Paper | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| **SFAL: Semantic-Functional Alignment Scores for Distributional Evaluation of Auto-Interpretability in Sparse Autoencoders** Daniele Potertì, Andrea Seveso, Filippo Pallucchini, Antonio Serino | - | Mechanistic Interp. | empirical, alignment-training, mechanistic-interp, ai-safety, interpretability, safety-evaluation | E6 / R3 (95%) | - |
| **Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet** Alex Tamkin, Craig Citro, Tom Conerly, Hoagy Cunningham | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| **Softmax Linear Units** Tom Conerly, Nicholas Joseph, Dawn Drain, Yuntao Bai | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| **Sparse Autoencoders Find Partially Interpretable Features in Italian Small Language Models** Alessandro Lenci, Lucia C. Passaro, Alessandro Bondielli | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (97%) | - |
| **Sparse Crosscoders for Cross-Layer Features and Model Diffing** Thomas Conerly, Christopher Olah, Jonathan Marcus, Joshua Batson | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| **Sparse Mixtures of Linear Transforms (MOLT)** Brian Chen, Thomas Conerly, Adam Pearce, Sasha Hydrie | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| **Stage-Wise Model Diffing** Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| **Superposition, Memorization, and Double Descent** Robert Lasenby, Tom Henighan, Nicholas Schiefer, Christopher Olah | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | - |
| **The Circuits Research Landscape: Results and Perspectives** Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (93%) | - |
| **Thread: Circuits** Gabriel Goh, Ludwig Schubert, Swee Kiat Lim, Chris Olah | - | Mechanistic Interp. | mechanistic-interp, ai-safety, survey | E5 / R3 (95%) | 142 |
| **Towards Monosemanticity: Decomposing Language Models With Dictionary Learning** Alex Tamkin, Tom Conerly, Brayden McLean, Nicholas Joseph | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| **Toy Models of Superposition** Robert Lasenby, Dawn Drain, Tom Henighan, Carol Chen | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 621 |
| **Tracing Attention Computation Through Feature Interactions** Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (93%) | - |
| **Transformer Debugger** Steven Bills, Jan Leike, Henk Tillman, Catherine Yeh | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | - | - |
| **Transformer Feed-Forward Layers Are Key-Value Memories** Omer Levy, Mor Geva, Roei Schuster, Jonathan Berant | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (94%) | 1203 |
| **TransformerLens** Joseph Bloom, Neel Nanda | - | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (96%) | - |
| **Understanding RL Vision** Gabriel Goh, Chris Olah, Nick Cammarata, Shan Carter | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety, interpretability | E5 / R3 (94%) | - |
| **Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron** Anirudh Goyal, Michael Qizhe Shieh, Yuxi Xie, Yiran Zhao | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | 38 |
| **Where Confabulation Lives: Latent Feature Discovery in LLMs** Gerhard Wunder, Thibaud Ardoin, Yi Cai | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (94%) | 1 |
| **Zoom In: An Introduction to Circuits** Gabriel Goh, Ludwig Schubert, Chris Olah, Nick Cammarata | - | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |