Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 271-300 of 470 papers (page 10 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Residual Stream Analysis with Multi-Layer SAEs	Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison	2024-09-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	12
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small	Maheep Chaudhary, Atticus Geiger	2024-09-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R3 (97%)	30
On the Complexity of Neural Computation in Superposition	Nir Shavit, Micah Adler	2024-09-05	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (94%)	11
Safety Layers in Aligned Large Language Models	Shen Li, Liuyi Yao, Lan Zhang, Yaliang Li	2024-08-30	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E4 / R3 (95%)	88
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference	André Freitas, Marco Valentino, Geonhee Kim	2024-08-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	13
Mathematical Models of Computation in Superposition	Lawrence Chan, Jake Mendel, Kaarel Hänni, Dmitry Vaintrob	2024-08-10	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (93%)	23
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2	Vikrant Varma, Rohin Shah, Nicolas Sonnerat, Lewis Smith	2024-08-09	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (96%)	254
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models	David Bau, Can Rager, Samuel Marks, Adam Karvonen	2024-07-31	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	49
Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability	Jorge García-Carrasco, Juan Trujillo, Alejandro Maté	2024-07-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability	E5 / R3 (93%)	6
Planning in a Recurrent Neural Network that Plays Sokoban	Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Philip Quirke	2024-07-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (95%)	11
Adversarial Circuit Evaluation	Niels uit de Bos, Adrià Garriga-Alonso	2024-07-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness, safety-evaluation	E6 / R3 (95%)	1
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques	Iván Arcuschin, Rohan Gupta, Adrià Garriga-Alonso, Thomas Kwa	2024-07-19	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark	E6 / R3 (98%)	7
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders	Vikrant Varma, Nicolas Sonnerat, Tom Lieberum, Senthooran Rajamanoharan	2024-07-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	194
Relational Composition in Neural Networks: A Survey and Call to Action	Fernanda B. Viégas, Martin Wattenberg	2024-07-19	Mechanistic Interp.	mechanistic-interp, ai-safety, survey	E6 / R3 (91%)	19
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach	Nils Palumbo, Ravi Mangal, Corina Păsăreanu, Saranya Vijayakumar	2024-07-18	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (95%)	2
NNsight and NDIF: Democratizing Access to Foundation Model Internals	Jonathan Bell, David Bau, Can Rager, Samuel Marks	2024-07-18	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (96%)	23
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent	Sonia Joseph, Irina Rish, Blake Richards, Karolis Jucys	2024-07-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R4 (97%)	4
LLM Circuit Analyses Are Consistent Across Training and Scale	Michael Hanna, Curt Tigges, Stella Biderman, Qinan Yu	2024-07-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	42
Transformer Circuit Faithfulness Metrics are not Robust	Bilal Chughtai, William Saunders, Joseph Miller	2024-07-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (95%)	10
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks	Aaron Mueller	2024-07-05	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (92%)	18
Functional Faithfulness in the Wild: Circuit Discovery with Differentiable Computation Graph Pruning	Lei Yu, Jingcheng Niu, Zining Zhu, Gerald Penn	2024-07-04	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (95%)	8
The Remarkable Robustness of LLMs: Stages of Inference?	Wes Gurnee, Max Tegmark, Vedang Lad	2024-06-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R4 (94%)	98
A Closer Look into Mixture-of-Experts in Large Language Models	Jie Fu, Zeyu Huang, Ka Man Lo, Zili Wang	2024-06-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	28
Interpreting Attention Layer Outputs with Sparse Autoencoders	Joseph Isaac Bloom, Arthur Conmy, Robert Krzyzanowski, Connor Kissane	2024-06-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	40
Understanding Language Model Circuits through Knowledge Editing	Huaizhi Ge, Zining Zhu, Frank Rudzicz	2024-06-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	2
Confidence Regulation Neurons in Language Models	Wes Gurnee, Xingyi Song, Alessandro Stolfo, Ben Wu	2024-06-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	45
Finding Transformer Circuits with Edge Pruning	Alexander Wettig, Adithya Bhaskar, Dan Friedman, Danqi Chen	2024-06-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (97%)	40
Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models	Jun Zhao, Zhuoran Jin, Pengfei Cao, Yubo Chen	2024-06-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	19
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons	Juanzi Li, Xiaozhi Wang, Jianhui Chen, Zijun Yao	2024-06-20	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R3 (94%)	25
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries	Sohee Yang, Daniela Gottesman, Mor Geva, Amir Globerson	2024-06-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (97%)	75