Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 31-60 of 470 papers (page 2 of 16)

Paper	Authors	Published	Area	Tags	Intel	Citations
Addressing divergent representations from causal interventions on neural networks	Christopher Potts, Alexa R. Tartaglini, Satchel Grant, Simon Jerome Han	2025-11-06	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (96%)	-
AlignSAE: Concept-Aligned Sparse Autoencoders	Xinyu Guo, Jinhe Bi, Minglai Yang, Mihai Surdeanu	2025-12-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	2
Analysis of Variational Sparse Autoencoders	Yuxiao Li, Zachary Baker	2025-09-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	-
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models	Daniil Gavrilov, Nikita Balagansky, Daniil Laptev, Yaroslav Aksenov	2025-02-05	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	6
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models	Talia Konkle, Victor Boutin, Binxu Wang, Ekdeep Singh Lubana	2025-02-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	32
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing	Joshua Engels, Senthooran Rajamanoharan, Subhash Kantamneni, Max Tegmark	2025-02-23	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	62
Atlas-Alignment: Making Interpretability Transferable Across Language Models	Sebastian Lapuschkin, Wojciech Samek, Jim Berend, Bruno Puri	2025-10-31	Mechanistic Interp.	empirical, alignment-training, mechanistic-interp, ai-safety, interpretability	E6 / R3 (95%)	-
Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs	Thomas Wiegand, Sebastian Lapuschkin, Sayed Mohammad Vakilzadeh Hatefi, Wojciech Samek	2025-06-16	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (94%)	2
Automated Feature Labeling with Token-Space Gradient Descent	Julian Schulz, Seamus Fallows	2025-04-01	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (97%)	-
Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers	Lucy Farnik, Thomas Heap, Tim Lawson, Laurence Aitchison	2025-01-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (95%)	26
Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models	Qingsong Wen, Stephen Wang, Moayad Aloqaily, Kun Wang	2025-09-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R4 (98%)	-
Because we have LLMs, we Can and Should Pursue Agentic Interpretability	Noah Fiedel, Been Kim, John Hewitt, Oyvind Tafjord	2025-06-13	Mechanistic Interp.	mechanistic-interp, ai-safety, position, interpretability	E5 / R3 (94%)	9
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal	Erik Cambria, Amir Abdullah, Roy Ka-Wei Lee, Yeo Wei Jie	2025-09-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	4
Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders	Xuansheng Wu, Mengnan Du, Ninghao Liu, Dong Shu	2025-05-12	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	2
Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder	Kaidi Xu, Song Wang, Zhen Tan, Tianlong Chen	2025-11-07	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	-
Binary Autoencoder for Mechanistic Interpretability of Large Language Models	Hakaze Cho, Naoya Inoue, Haolin Yang, Brian M. Kurkoski	2025-09-25	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (93%)	-
Binary Sparse Coding for Interpretability	Lucia Quirke, Stepan Shabalin, Nora Belrose	2025-09-29	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E6 / R4 (92%)	1
BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods	Yihong Liu, Philipp Mondorf, Sebastian Gerstner, Hinrich Schütze	2025-10-08	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E6 / R3 (95%)	2
BlockCert: Certified Blockwise Extraction of Transformer Mechanisms	Sandro Andric	2025-11-20	Mechanistic Interp.	mechanistic-interp, ai-safety, tool	E5 / R3 (99%)	-
CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders	Sachin Kumar, Yusen Peng, Alex Gulko	2025-08-31	Mechanistic Interp.	mechanistic-interp, ai-safety, interpretability, safety-evaluation, benchmark	E5 / R3 (94%)	-
Can Interpretation Predict Behavior on Unseen Data?	David Alvarez-Melis, Jenny Kaufmann, Martin Wattenberg, Victoria R. Li	2025-07-08	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, interpretability	E5 / R3 (92%)	5
Can Large Language Models Develop Gambling Addiction?	Yunjeong Lee, Seungpil Lee, Donghyeon Shin, Sundong Kim	2025-09-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R4 (93%)	2
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework	Kirill Bykov, Anna Hedström, Laura Kopf, Oliver Eberle	2025-06-18	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (94%)	6
Circuit Stability Characterizes Language Model Generalization	Alan Sun	2025-05-30	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (93%)	2
Circuit Tracing: Revealing Computational Graphs in Language Models	Craig Citro, Michael Sklar, Hoagy Cunningham, Wes Gurnee	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Circuits Updates - April 2025	Brian Chen, Adam Jermyn, Joshua Batson, Jack Lindsey	-	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E5 / R3 (95%)	-
Combining Causal Models for More Accurate Abstractions of Neural Networks	Theodora-Mara Pîslar, Sara Magliacane, Atticus Geiger	2025-03-14	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (95%)	1
Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem	Adam Newgas	2025-07-13	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (92%)	-
Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs	Yuhang Liu, Dong Gong, Erdun Gao, Anton van den Hengel	2026-01-28	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety	E5 / R3 (96%)	-
Concept-SAE: Active Causal Probing of Visual Model Behavior	Qiang Xu, Jianrong Ding, Chenchen Zhao, Muxi Chen	2025-09-26	Mechanistic Interp.	empirical, mechanistic-interp, ai-safety, adversarial-robustness	E6 / R4 (94%)	-