Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors (Victor S. Sheng, Zhenyu Xu) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (95%) | - |
| Building Better Deception Probes Using Targeted Instruction Pairs (Devina Jain, Joseph Bloom, Vikram Natarajan, Shivam Arora) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (94%) | - |
| Efficient and accurate steering of Large Language Models through attention-guided feature learning (Adityanarayanan Radhakrishnan, Parmida Davarmanesh, Ashia Wilson) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (94%) | - |
| From Directions to Regions: Decomposing Activations in Language Models via Local Geometry (Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (96%) | - |
| Mechanistic Indicators of Steering Effectiveness in Large Language Models (Hao Xue, Flora Salim, Mehdi Jafari) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E6 / R3 (93%) | - |
| No Reliable Evidence of Self-Reported Sentience in Small Large Language Models (Caspar Kaiser, Sean Enderby) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E6 / R3 (93%) | - |
| The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models (Adriano Koshiyama, Zekun Wu, Seonglae Cho, Kleyton Da Costa) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R4 (94%) | - |
| The Straight and Narrow: Do LLMs Possess an Internal Moral Path? (Liang Yang, Luoming Hu, Hongfei Lin, Jingjie Zeng) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety, adversarial-robustness | E5 / R3 (95%) | - |
| There Is More to Refusal in Large Language Models than a Single Direction (Nadir Durrani, Sabri Boughorbel, Faaiz Joad, Husrev Taha Sencar) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (95%) | - |
| Towards Understanding Steering Strength (Magamed Taimeskhanov, Damien Garreau, Samuel Vaiter) | 2026 | Representation Analysis | theoretical, representation-analysis, ai-safety | E5 / R3 (96%) | - |
| YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation (Guokan Shang, Michalis Vazirgiannis, Preslav Nakov, Hadi Abdine) | 2026 | Representation Analysis | empirical, representation-analysis, alignment-training, ai-safety, adversarial-robustness | E5 / R3 (98%) | - |
| A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks (Yonatan Zunger, Daniel Jones, Giorgio Severi, Ahmed Salem) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety, adversarial-robustness | E5 / R3 (96%) | 2 |
| A Unified Understanding and Evaluation of Steering Methods (Yixuan Li, Shawn Im) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety, safety-evaluation | E5 / R3 (96%) | 24 |
| Aligned Probing: Relating Toxic Behavior and Model Internals (Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Andreas Waldis) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E6 / R3 (93%) | 3 |
| Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment (Mustafa Shukor, Pegah Khayatan, Matthieu Cord, Jayneel Parekh) | 2025 | Representation Analysis | empirical, representation-analysis, alignment-training, ai-safety | E5 / R3 (96%) | 8 |
| AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations (Michael J. Clark) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (97%) | - |
| AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (Christopher Potts, Aryaman Arora, Dan Jurafsky, Christopher D. Manning) | 2025 | Representation Analysis | representation-analysis, ai-safety, benchmark | E5 / R3 (96%) | 123 |
| Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring (Lijie Hu, Tao Luo, Dongrui Liu, Guanxu Chen) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E7 / R3 (94%) | 2 |
| Building Production-Ready Probes For Gemini (János Kramár, Rohin Shah, Bilal Chughtai, Joshua Engels) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R4 (95%) | 2 |
| CCS-Lib: A Python package to elicit latent knowledge from LLMs (Ben W., Eric Mungai Kinuthia, Walter Laurito, Marius Pl) | 2025 | Representation Analysis | representation-analysis, ai-safety, tool | E5 / R3 (99%) | - |
| COSMIC: Generalized Refusal Direction Identification in LLM Activations (Zhun Wang, Chenguang Wang, Nicholas Crispino, Dawn Song) | 2025 | Representation Analysis | empirical, representation-analysis, alignment-training, ai-safety, adversarial-robustness | E5 / R3 (95%) | 5 |
| Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations (Anthony Hartshorn, Cheng Zhang, Lei Yu, Yeskendir Koishekenov) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (92%) | 23 |
| Can Role Vectors Affect LLM Behaviour? (Daniele Potertì, Andrea Seveso, Fabio Mercorio) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (95%) | 3 |
| Concept-Level Explainability for Auditing & Steering LLM Responses (Mennatallah El-Assady, Kenza Amara, Rita Sevastjanova) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety, adversarial-robustness | E6 / R3 (94%) | 6 |
| ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features (Pinar Yanardag, Alec Helbling, Ben Hoover, Duen Horng Chau) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E6 / R3 (97%) | 25 |
| Convergent Linear Representations of Emergent Misalignment (Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo) | 2025 | Representation Analysis | empirical, representation-analysis, alignment-training, ai-safety | E4 / R3 (95%) | 22 |
| DISCO: Disentangled Communication Steering for Large Language Models (Aria Masoomi, Masih Eskandar, Jennifer Dy, Max Torop) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (97%) | 1 |
| Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing (Vasu Sharma, Sean O'Brien, Adhitya Rajendra Kumar, Saleena Angeline) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety, adversarial-robustness | E6 / R3 (94%) | 3 |
| Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering (Xuansheng Wu, Mengnan Du, Ninghao Liu, Haiyan Zhao) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (95%) | 4 |
| Emergence of Linear Truth Encodings in Language Models (Joan Bruna, Alberto Bietti, Tal Linzen, Gilad Yehudai) | 2025 | Representation Analysis | empirical, representation-analysis, ai-safety | E4 / R3 (93%) | 4 |
Showing 30 of 125 papers on page 1.