Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel
A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors

Victor S. Sheng, Zhenyu Xu

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (95%)
Building Better Deception Probes Using Targeted Instruction Pairs

Devina Jain, Joseph Bloom, Vikram Natarajan, Shivam Arora

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (94%)
Efficient and accurate steering of Large Language Models through attention-guided feature learning

Adityanarayanan Radhakrishnan, Parmida Davarmanesh, Ashia Wilson

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (94%)
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (96%)
Mechanistic Indicators of Steering Effectiveness in Large Language Models

Hao Xue, Flora Salim, Mehdi Jafari

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E6 / R3 (93%)
No Reliable Evidence of Self-Reported Sentience in Small Large Language Models

Caspar Kaiser, Sean Enderby

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E6 / R3 (93%)
The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Adriano Koshiyama, Zekun Wu, Seonglae Cho, Kleyton Da Costa

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R4 (94%)
The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Liang Yang, Luoming Hu, Hongfei Lin, Jingjie Zeng

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety, adversarial-robustness

E5 / R3 (95%)
There Is More to Refusal in Large Language Models than a Single Direction

Nadir Durrani, Sabri Boughorbel, Faaiz Joad, Husrev Taha Sencar

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (95%)
Towards Understanding Steering Strength

Magamed Taimeskhanov, Damien Garreau, Samuel Vaiter

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: theoretical, representation-analysis, ai-safety

E5 / R3 (96%)
YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

Guokan Shang, Michalis Vazirgiannis, Preslav Nakov, Hadi Abdine

Year: 2026 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, alignment-training, ai-safety, adversarial-robustness

E5 / R3 (98%)
A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks

Yonatan Zunger, Daniel Jones, Giorgio Severi, Ahmed Salem

Year: 2025 | Area: Representation Analysis | Citations: 2

Tags: empirical, representation-analysis, ai-safety, adversarial-robustness

E5 / R3 (96%)
A Unified Understanding and Evaluation of Steering Methods

Yixuan Li, Shawn Im

Year: 2025 | Area: Representation Analysis | Citations: 24

Tags: empirical, representation-analysis, ai-safety, safety-evaluation

E5 / R3 (96%)
Aligned Probing: Relating Toxic Behavior and Model Internals

Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Andreas Waldis

Year: 2025 | Area: Representation Analysis | Citations: 3

Tags: empirical, representation-analysis, ai-safety

E6 / R3 (93%)
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment

Mustafa Shukor, Pegah Khayatan, Matthieu Cord, Jayneel Parekh

Year: 2025 | Area: Representation Analysis | Citations: 8

Tags: empirical, representation-analysis, alignment-training, ai-safety

E5 / R3 (96%)
AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

Michael J. Clark

Year: 2025 | Area: Representation Analysis | Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (97%)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Christopher Potts, Aryaman Arora, Dan Jurafsky, Christopher D. Manning

Year: 2025 | Area: Representation Analysis | Citations: 123

Tags: representation-analysis, ai-safety, benchmark

E5 / R3 (96%)
Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Lijie Hu, Tao Luo, Dongrui Liu, Guanxu Chen

Year: 2025 | Area: Representation Analysis | Citations: 2

Tags: empirical, representation-analysis, ai-safety

E7 / R3 (94%)
Building Production-Ready Probes For Gemini

János Kramár, Rohin Shah, Bilal Chughtai, Joshua Engels

Year: 2025 | Area: Representation Analysis | Citations: 2

Tags: empirical, representation-analysis, ai-safety

E5 / R4 (95%)
CCS-Lib: A Python package to elicit latent knowledge from LLMs

Ben W., Eric Mungai Kinuthia, Walter Laurito, Marius Pl

Year: 2025 | Area: Representation Analysis | Citations: -

Tags: representation-analysis, ai-safety, tool

E5 / R3 (99%)
COSMIC: Generalized Refusal Direction Identification in LLM Activations

Zhun Wang, Chenguang Wang, Nicholas Crispino, Dawn Song

Year: 2025 | Area: Representation Analysis | Citations: 5

Tags: empirical, representation-analysis, alignment-training, ai-safety, adversarial-robustness

E5 / R3 (95%)
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations

Anthony Hartshorn, Cheng Zhang, Lei Yu, Yeskendir Koishekenov

Year: 2025 | Area: Representation Analysis | Citations: 23

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (92%)
Can Role Vectors Affect LLM Behaviour?

Daniele Potertì, Andrea Seveso, Fabio Mercorio

Year: 2025 | Area: Representation Analysis | Citations: 3

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (95%)
Concept-Level Explainability for Auditing & Steering LLM Responses

Mennatallah El-Assady, Kenza Amara, Rita Sevastjanova

Year: 2025 | Area: Representation Analysis | Citations: 6

Tags: empirical, representation-analysis, ai-safety, adversarial-robustness

E6 / R3 (94%)
ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

Pinar Yanardag, Alec Helbling, Ben Hoover, Duen Horng Chau

Year: 2025 | Area: Representation Analysis | Citations: 25

Tags: empirical, representation-analysis, ai-safety

E6 / R3 (97%)
Convergent Linear Representations of Emergent Misalignment

Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo

Year: 2025 | Area: Representation Analysis | Citations: 22

Tags: empirical, representation-analysis, alignment-training, ai-safety

E4 / R3 (95%)
DISCO: Disentangled Communication Steering for Large Language Models

Aria Masoomi, Masih Eskandar, Jennifer Dy, Max Torop

Year: 2025 | Area: Representation Analysis | Citations: 1

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (97%)
Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

Vasu Sharma, Sean O'Brien, Adhitya Rajendra Kumar, Saleena Angeline

Year: 2025 | Area: Representation Analysis | Citations: 3

Tags: empirical, representation-analysis, ai-safety, adversarial-robustness

E6 / R3 (94%)
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Xuansheng Wu, Mengnan Du, Ninghao Liu, Haiyan Zhao

Year: 2025 | Area: Representation Analysis | Citations: 4

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (95%)
Emergence of Linear Truth Encodings in Language Models

Joan Bruna, Alberto Bietti, Tal Linzen, Gilad Yehudai

Year: 2025 | Area: Representation Analysis | Citations: 4

Tags: empirical, representation-analysis, ai-safety

E4 / R3 (93%)

Showing 30 of 125 papers on page 1.