Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 91-120 of 470 papers (page 4 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs	Letao Han, Ling Hu, Yuemei Xu, Xiaoyang Gu	2025-04-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R3 (94%)	1
From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers	Julia Kempe, Karen Ullrich, Jingtong Su	2025-06-20	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E6 / R4 (95%)	3
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit	Valérie Costa, Bahareh Tolooshams, Ekdeep Singh Lubana, Thomas Fel	2025-06-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	16
From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models	Yiyang Yu, Minji Lee, Mohammed AlQuraishi, Etowah Adams	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	32
From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers	Sonia Joseph, Jack Stanley, Praneet Suresh, Luca Scimeca	2025-09-08	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
From superposition to sparse codes: interpretable representations in neural networks	Harald Maurer, Patrik Reizinger, Nina Miolane, David Klindt	2025-03-03	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, safety-evaluation	E5 / R3 (93%)	7
GIM: Improved Interpretability for Large Language Models	Lars Maaløe, Tuukka Ruotsalo, Maria Maistro, Róbert Csordás	2025-05-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E7 / R3 (96%)	1
Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3	Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall	-	Mechanistic Interp.	mechanistic-interp, ai-safety, tool, interpretability	E5 / R3 (98%)	-
Hedonic Neurons: A Mechanistic Mapping of Latent Coalitions in Transformer MLPs	Atharva Nijasure, Tanya Chowdhury, James Allan, Yair Zick	2025-09-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	-
How Can Interpretability Researchers Help AGI Go Well?	Bilal Chughtai, Lewis Smith, Janos Kramar, Senthooran Rajamanoharan	-	Mechanistic Interp.	alignment-training, mechanistic-interp, ai-safety, position, interpretability	-	-
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence	Yizhou Sun, Weikai Li, Shichang Zhang, Himabindu Lakkaraju	2025-04-03	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E7 / R4 (94%)	5
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks	Christopher Potts, Jiuding Sun, Michael Sklar, Karel D'Oosterlinck	2025-03-13	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (96%)	8
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders	Alexey Dontsov, Ivan Oseledets, Polina Druzhinina, Oleg Y. Rogov	2025-03-24	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	24
Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition	Brianna Chrisman, Lee Sharkey, Lucius Bushnaq	2025-03-31	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	3
Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning	Pietro Ferraro, David Boyle, Anna Soligo	2025-01-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	2
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models	Noura Al Moubayed, Neel Nanda, Patrick Leask	2025-05-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Insights into a radiology-specialised multimodal large language model with sparse autoencoders	Felix Meissen, Javier Alvarez-Valle, Daniel Coelho de Castro, Shruthi Bannur	2025-07-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (96%)	1
Insights on Crosscoder Model Diffing	Siddharth Mishra-Sharma, Thomas Henighan, Adam Jermyn, Christopher Olah	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors	Christopher Potts, Junyi Tao, Thomas Icard, Jing Huang	2025-05-17	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E8 / R3 (99%)	4
Interpretability Illusions with Sparse Autoencoders	Usha Bhalla, Aaron J. Li, Himabindu Lakkaraju, Suraj Srinivas	2025-05-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness, interpretability	E5 / R3 (95%)	5
Interpreting CLIP with Hierarchical Sparse Autoencoders	Hubert Baniecki, Vladimir Zaigrajew, Przemyslaw Biecek	2025-02-27	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	19
Interpreting Transformers Through Attention Head Intervention	Mason Kadem, Rong Zheng	2026-01-07	Mechanistic Interp.	mechanistic-interp, ai-safety, survey, interpretability	E5 / R3 (95%)	-
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders	Wenlin Yao, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu	2025-02-21	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R4 (93%)	20
Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban	Adam Gleave, Mohammad Taufeeque, Adrià Garriga-Alonso, Aaron David Tucker	2025-06-11	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E4 / R3 (94%)	-
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models	Zhennan Zhou, Yifan Luo, Bin Dong	2025-06-09	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	-
Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations	Lucy Farnik, Conor Houghton, Tim Lawson, Laurence Aitchison	2025-02-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	6
Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models	Di Wu, Jun Zhao, Jiaxiang Liu, Boxuan Xing	2025-06-10	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (93%)	-
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders	Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin	2025-05-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (97%)	-
LLM Interpretability with Identifiable Temporal-Instantaneous Representation	Yujia Zheng, Kun Zhang, Xiangchen Song, Jiaqi Sun	2025-09-27	Mechanistic Interp.	theoretical, mechanistic-interp, ai-safety, interpretability	E4 / R3 (94%)	2
Language Model Circuits Are Sparse in the Neuron Basis	Aryaman Arora, Sarah Schwettmann, Zhengxuan Wu, Jacob Steinhardt	2026-01-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	1