PaperIntel
Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors

Victor S. Sheng, Zhenyu Xu

Year: 2026 · Area: Representation Analysis · Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (95%)

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Kui Ren, Haozhe Feng, Licheng Pan, Hui Xue

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R3 (94%)

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Yupeng Chen, Philip Torr, Eric Sommerlade, Jialin Yu

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R3 (95%)

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

Kannan Ramchandran, Harry Mayne, Adam Mahdi, Noah Y. Siegel

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E6 / R3 (96%)

A White-Box Prompt Injection Attack on Embodied AI Agents Driven by Large Language Models

Yubin Qu, W. E. Wong, Tongcheng Geng

Year: 2026 · Area: Adversarial Robustness · Citations: 1

Tags: empirical, ai-safety, adversarial-robustness

E4 / R3 (95%)

AIR: Improving Agent Safety through Incident Response

Jun Sun, Junjie Chen, Zibo Xiao

Year: 2026 · Area: Agent Safety · Citations: -

Tags: empirical, agent-safety, ai-safety

E5 / R3 (95%)

AM3Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Yujin Zhou, Yike Guo, Chengkun Cai, Pengcheng Wen

Year: 2026 · Area: Multimodal Safety · Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness, multimodal-safety

E5 / R3 (97%)

Agentic Uncertainty Reveals Agentic Overconfidence

Jean Kaddour, Leo Richter, Srijan Patel, Pasquale Minervini

Year: 2026 · Area: Agent Safety · Citations: -

Tags: empirical, agent-safety, ai-safety, adversarial-robustness

E5 / R3 (96%)

Alignment-Aware Model Adaptation via Feedback-Guided Optimization

Gaurav Bhatt, Leonid Sigal, Aditya Chinchure, Jiawei Zhou

Year: 2026 · Area: Alignment Training · Citations: -

Tags: empirical, alignment-training, ai-safety

E5 / R3 (93%)

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Shangbin Feng, Wenxuan Ding, Yulia Tsvetkov, Ziyuan Yang

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E4 / R3 (93%)

Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (95%)

Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring

Tyler Tracy, Joachim Schaeffer, Arjun Khandelwal

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: empirical, ai-safety, safety-evaluation

E6 / R3 (94%)

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Yuhong Wang, Zhihui Fu, Songze Li, Jiahao Chen

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R4 (96%)

BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

Guannan Qu, Muhammed Ustaomeroglu

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E4 / R2 (95%)

Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features

Yiting Liu, Zhi-Hong Deng

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E4 / R3 (95%)

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Weihong Lin, Lin Sun, Jianfeng Si, Xiangzheng Zhang

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness

E5 / R4 (94%)

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Fadi Hassan, Hicham Eddoubi, Umar Faruk Abdullahi

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, safety-evaluation

E7 / R3 (98%)

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Wei Wang, Mingyu Yu, Sujuan Qin, Lana Liu

Year: 2026 · Area: Multimodal Safety · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, multimodal-safety

E5 / R3 (97%)

Building Better Deception Probes Using Targeted Instruction Pairs

Devina Jain, Joseph Bloom, Vikram Natarajan, Shivam Arora

Year: 2026 · Area: Representation Analysis · Citations: -

Tags: empirical, representation-analysis, ai-safety

E5 / R3 (94%)

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Murat Kantarcioglu, Jafar Isbarov

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (96%)

C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal, Vinay Kumar Sankarapu, Pratinav Seth

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, ai-safety, model-editing

E5 / R3 (96%)

CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu, Junyuan Hong

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E6 / R4 (93%)

Capability-Oriented Training Induced Alignment Risk

Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer

Year: 2026 · Area: Deception & Failure · Citations: 1

Tags: empirical, alignment-training, ai-safety, deception-failure

E4 / R3 (94%)

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee

Year: 2026 · Area: Adversarial Robustness · Citations: 9

Tags: empirical, ai-safety, adversarial-robustness

E5 / R4 (97%)

Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path

Neha Sengar, Dongsoo Har, Andres Suarez

Year: 2026 · Area: Mechanistic Interp. · Citations: -

Tags: empirical, mechanistic-interp, ai-safety

E5 / R3 (95%)

CoT Defender: Preemptive Chain-of-Thought Occupation for Jailbreak Attack Mitigation

Yihe Wang, Xiao Yu, Jin Liu, Xiaokang Li

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

-

CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Zhao Tong, Shu Wu, Xingcheng Xu, Yiping Zhang

Year: 2026 · Area: Deception & Failure · Citations: -

Tags: empirical, ai-safety, deception-failure

E5 / R2 (96%)

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Rob Gilson, Bobby Chen, Christopher Liu, Hoagy Cunningham

Year: 2026 · Area: Adversarial Robustness · Citations: 6

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (94%)

Controllable Value Alignment in Large Language Models through Neuron-Level Editing

Richang Hong, Weibiao Huang, Tat-Seng Chua, Le Wu

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E5 / R3 (94%)

DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, ai-safety, model-editing

E6 / R3 (93%)

Showing 30 of 2859 papers on page 1.