Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Paper	Authors	Published	Area	Tags	Intel	Citations
Information Flow Routes: Automatically Interpreting Language Models at Scale	Javier Ferrando, Elena Voita	2024-02-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	74
Explorations of Self-Repair in Language Models	Neel Nanda, Cody Rushing	2024-02-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	20
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task	Paul Swoboda, Christian Bartelt, Abhay Sheshadri, Victor Levoso	2024-02-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	48
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT	Xuyang Ge, Zhengfu He, Qinyuan Cheng, Qiong Tang	2024-02-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	25
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals	Francesco Ortu, Bernhard Schölkopf, Diego Doimo, Zhijing Jin	2024-02-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R3 (95%)	35
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs	Bilal Chughtai, Alan Cooney, Neel Nanda	2024-02-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (93%)	31
Opening the AI black box: program synthesis via mechanistic interpretability	Isaac Liao, Anish Mudide, Tara Rezaei Kheirkhah, Ziming Liu	2024-02-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R4 (96%)	19
Challenges in Mechanistically Interpreting Model Representations	James Dao, Satvik Golechha	2024-02-06	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E5 / R3 (93%)	4
Real Sparks of Artificial Intelligence and the Importance of Inner Interpretability	Alex Grzankowski	2024-01-31	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E8 / R4 (94%)	10
Fluent dreaming for language models	Michael Sklar, Zygimantas Straznickas, T. Ben Thompson	2024-01-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E4 / R2 (97%)	4
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments	Christopher Potts, Aryaman Arora, Thomas Icard, Zhengxuan Wu	2024-01-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (94%)	9
Universal Neurons in GPT2 Language Models	Qinyi Sun, Wes Gurnee, Tara Rezaei Kheirkhah, Will Hathaway	2024-01-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	83
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability	Jatin Nainani	2024-01-08	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	3
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity	Jonathan K. Kummerfeld, Rada Mihalcea, Andrew Lee, Xiaoyan Bai	2024-01-03	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R3 (94%)	165
Observable Propagation: Uncovering Feature Vectors in Transformers	Jacob Dunefsky, Arman Cohan	2023-12-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	2
Forbidden Facts: An Investigation of Competing Objectives in Llama-2	Tony T. Wang, Nir Shavit, Kaivalya Hariharan, Miles Wang	2023-12-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (96%)	3
Successor Heads: Recurring, Interpretable Attention Heads In The Wild	Euan Ong, Rhys Gould, George Ogden, Arthur Conmy	2023-12-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (97%)	69
Grokking Group Multiplication with Cosets	Honglu Fan, Stella Biderman, Dashiell Stander, Qinan Yu	2023-12-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	18
Interpretability Illusions in the Generalization of Simplified Models	Andrew Lampinen, Lucas Dixon, Asma Ghandeharioun, Dan Friedman	2023-12-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	20
FlexModel: A Framework for Interpretability of Distributed Large Language Models	John Willes, Muhammad Adil Asif, Matthew Choi, David B. Emerson	2023-12-05	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (95%)	1
Generating Interpretable Networks using Hypernetworks	Isaac Liao, Ziming Liu, Max Tegmark	2023-12-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (95%)	2
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching	Georg Lange, Aleksandar Makelov, Neel Nanda	2023-11-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (93%)	41
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching	Phillip Guo, Richard Ren, James Campbell	2023-11-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	26
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks	Hidenori Tanaka, Edward Grefenstette, Ekdeep Singh Lubana, Robert P. Dick	2023-11-21	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety	E5 / R3 (92%)	99
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State	Jiuding Sun, David Bau, Koyena Pal, Byron C. Wallace	2023-11-08	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (96%)	97
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models	Fazl Barez, Michael Lan	2023-11-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (94%)	9
Uncovering Intermediate Variables in Transformers using Circuit Probing	Ellie Pavlick, Thomas Serre, Michael A. Lepori	2023-11-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	12
Codebook Features: Sparse and Discrete Interpretability for Neural Networks	Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman	2023-10-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R3 (94%)	41
How Do Language Models Bind Entities in Context?	Jiahai Feng, Jacob Steinhardt	2023-10-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (95%)	70
Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism	Mansi Sakarvadia, Ian Foster, Arham Khan, Daniel Grzenda	2023-10-25	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (95%)	18