Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 181-210 of 470 papers (page 7 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention	J Rosser, Konstantina Palla, Hugues Bouchard, José Luis Redondo García	2025-10-22	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	-
Superposition Yields Robust Neural Scaling	Yizhou Liu, Jeff Gore, Ziming Liu	2025-05-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	9
Superposition as Lossy Compression: Measure with Sparse Autoencoders and Connect to Adversarial Vulnerability	Zoe Tzifa-Kratira, Leonard Bereska, Reza Samavi, Efstratios Gavves	2025-12-15	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (94%)	1
Superposition in Graph Neural Networks	Pietro Liò, Han Xuanyuan, Lukas Pertl	2025-08-31	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R3 (95%)	1
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation	Gal Niv, Jonathan Jacobi	2025-03-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	3
Sycophancy Hides Linearly in the Attention Heads	Hilal Alquabeh, Kentaro Inui, Nurdaulet Mukhituly, Rifo Genadi	2026-01-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (95%)	1
TRACE: Training and Inference-Time Interpretability Analysis for Language Models	Nura Aljaafari, André Freitas, Danilo S. Carvalho	2025-07-04	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (96%)	1
The Circuits Research Landscape: Results and Perspectives	Michael Hanna, Connor Watts, Curt Tigges, Max Loeffler	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (93%)	-
The Dead Salmons of AI Interpretability	Maxime Peyrard, François Portet, Maxime Méloux, Giada Dirupo	2025-12-21	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E7 / R3 (93%)	3
The Geometry of Self-Verification in a Task-Specific Reasoning Model	Chris Wendler, Andrew Lee, Fernanda Viégas, Martin Wattenberg	2025-04-19	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	6
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions	Haining Yu, Xiangyang Zhou, Qiguang Chen, Wenbo Pan	2025-02-13	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R4 (94%)	8
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?	Julian Minder, Tiago Pimentel, Thomas Hofmann, Denis Sutter	2025-07-11	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	11
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction	Lei Yu, Meng Cao, Zhijing Jin, Yihuai Hong	2025-03-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	15
Time-Aware Feature Selection: Adaptive Temporal Masking for Stable Sparse Autoencoder Training	Junyu Ren, T. Ed Li	2025-10-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (97%)	-
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research	Dhruv Nathawani, Luke Marks, Amir Abdullah, Philip Quirke	2025-03-17	Mechanistic Interp.	mechanistic-interp, ai-safety, dataset, interpretability	E6 / R3 (95%)	2
Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research	Sean Trott	2025-09-26	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (94%)	2
Towards Atoms of Large Language Models	Chenhui Hu, Jun Zhao, Pengfei Cao, Yubo Chen	2025-09-25	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety	E5 / R3 (98%)	-
Towards Combinatorial Interpretability of Neural Computation	Dan Alistarh, Nir Shavit, Micah Adler	2025-04-10	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	7
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders	Yixuan Li, Mihalis A. Nicolaou, James Oldfield, Grigorios G Chrysos	2025-05-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (95%)	1
Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis	Yan Hu, Reynold Cheng, Xu Wang, Wenyu Du	2025-02-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	10
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition	Junxuan Wang, Xuyang Ge, Zhengfu He, Junping Zhang	2025-04-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	2
Towards eliciting latent knowledge from LLMs with mechanistic interpretability	Emil Ryd, Senthooran Rajamanoharan, Bartosz Cywiński, Neel Nanda	2025-05-20	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	6
Tracing Attention Computation Through Feature Interactions	Rodrigo Luger, Wes Gurnee, Harish Kamath, Thomas Conerly	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
Training Language Models to Explain Their Own Computations	Jacob Andreas, Belinda Z. Li, Vincent Huang, Zifan Carl Guo	2025-11-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	7
Training Superior Sparse Autoencoders for Instruct Models	Hamid Alinejad-Rokny, Yukun Chen, Jimmy Chih-Hsien Peng, Jiaming Li	2025-06-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	1
Transferring Linear Features Across Language Models With Model Stitching	Ellie Pavlick, Alessandro Stolfo, Alan Chen, Jack Merullo	2025-06-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability	Luca Baroni, Joachim Schaeffer, Stefan Heimersheim, Marat Subkhankulov	2025-07-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	4
Truth Neurons	Yupeng Cao, Zining Zhu, Jordan W. Suchow, Yangyang Yu	2025-05-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (95%)	1
Understanding How Value Neurons Shape the Generation of Specified Values in LLMs	Lijie Hu, Shu Yang, Di Wang, Xinhai Wang	2025-05-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (95%)	7
Understanding Refusal in Language Models with Sparse Autoencoders	Wei Jie Yeo, Erik Cambria, Roy Ka-Wei Lee, Ranjan Satapathy	2025-05-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (92%)	7