Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| A Behavioral Fingerprint for Large Language Models: Provenance Tracking via Refusal Vectors (Victor S. Sheng, Zhenyu Xu) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (95%) | - |
| A Causal Perspective for Enhancing Jailbreak Attack and Defense (Kui Ren, Haozhe Feng, Licheng Pan, Hui Xue) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E6 / R3 (94%) | - |
| A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode (Yupeng Chen, Philip Torr, Eric Sommerlade, Jialin Yu) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E6 / R3 (95%) | - |
| A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior (Kannan Ramchandran, Harry Mayne, Adam Mahdi, Noah Y. Siegel) | 2026 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E6 / R3 (96%) | - |
| A White-Box Prompt Injection Attack on Embodied AI Agents Driven by Large Language Models (Yubin Qu, W. E. Wong, Tongcheng Geng) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E4 / R3 (95%) | 1 |
| AIR: Improving Agent Safety through Incident Response (Jun Sun, Junjie Chen, Zibo Xiao) | 2026 | Agent Safety | empirical, agent-safety, ai-safety | E5 / R3 (95%) | - |
| AM3Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs (Yujin Zhou, Yike Guo, Chengkun Cai, Pengcheng Wen) | 2026 | Multimodal Safety | empirical, alignment-training, ai-safety, adversarial-robustness, multimodal-safety | E5 / R3 (97%) | - |
| Agentic Uncertainty Reveals Agentic Overconfidence (Jean Kaddour, Leo Richter, Srijan Patel, Pasquale Minervini) | 2026 | Agent Safety | empirical, agent-safety, ai-safety, adversarial-robustness | E5 / R3 (96%) | - |
| Alignment-Aware Model Adaptation via Feedback-Guided Optimization (Gaurav Bhatt, Leonid Sigal, Aditya Chinchure, Jiawei Zhou) | 2026 | Alignment Training | empirical, alignment-training, ai-safety | E5 / R3 (93%) | - |
| Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems (Shangbin Feng, Wenxuan Ding, Yulia Tsvetkov, Ziyuan Yang) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E4 / R3 (93%) | - |
| Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning (Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann) | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (95%) | - |
| Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring (Tyler Tracy, Joachim Schaeffer, Arjun Khandelwal) | 2026 | Safety Evaluation | empirical, ai-safety, safety-evaluation | E6 / R3 (94%) | - |
| Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models (Yuhong Wang, Zhihui Fu, Songze Li, Jiahao Chen) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E6 / R4 (96%) | - |
| BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features (Guannan Qu, Muhammed Ustaomeroglu) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E4 / R2 (95%) | - |
| Beyond Activation Patterns: A Weight-Based Out-of-Context Explanation of Sparse Autoencoder Features (Yiting Liu, Zhi-Hong Deng) | 2026 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E4 / R3 (95%) | - |
| Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought (Weihong Lin, Lin Sun, Jianfeng Si, Xiangzheng Zhang) | 2026 | Adversarial Robustness | empirical, alignment-training, ai-safety, adversarial-robustness | E5 / R4 (94%) | - |
| Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models (Fadi Hassan, Hicham Eddoubi, Umar Faruk Abdullahi) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness, safety-evaluation | E7 / R3 (98%) | - |
| Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs (Wei Wang, Mingyu Yu, Sujuan Qin, Lana Liu) | 2026 | Multimodal Safety | empirical, ai-safety, adversarial-robustness, multimodal-safety | E5 / R3 (97%) | - |
| Building Better Deception Probes Using Targeted Instruction Pairs (Devina Jain, Joseph Bloom, Vikram Natarajan, Shivam Arora) | 2026 | Representation Analysis | empirical, representation-analysis, ai-safety | E5 / R3 (94%) | - |
| Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks (Murat Kantarcioglu, Jafar Isbarov) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E5 / R3 (96%) | - |
| C-ΔΘ: Circuit-Restricted Weight Arithmetic for Selective Refusal (Aditya Kasliwal, Vinay Kumar Sankarapu, Pratinav Seth) | 2026 | Model Editing | empirical, ai-safety, model-editing | E5 / R3 (96%) | - |
| CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment (Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu, Junyuan Hong) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E6 / R4 (93%) | - |
| Capability-Oriented Training Induced Alignment Risk (Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer) | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E4 / R3 (94%) | 1 |
| ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents (Hwan Chang, Yonghyun Jun, Hwanhee Lee) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E5 / R4 (97%) | 9 |
| Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path (Neha Sengar, Dongsoo Har, Andres Suarez) | 2026 | Mechanistic Interp. | empirical, mechanistic-interp, ai-safety | E5 / R3 (95%) | - |
| CoT Defender: Preemptive Chain-of-Thought Occupation for Jailbreak Attack Mitigation (Yihe Wang, Xiao Yu, Jin Liu, Xiaokang Li) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | - | - |
| CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation (Zhao Tong, Shu Wu, Xingcheng Xu, Yiping Zhang) | 2026 | Deception & Failure | empirical, ai-safety, deception-failure | E5 / R2 (96%) | - |
| Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks (Rob Gilson, Bobby Chen, Christopher Liu, Hoagy Cunningham) | 2026 | Adversarial Robustness | empirical, ai-safety, adversarial-robustness | E5 / R3 (94%) | 6 |
| Controllable Value Alignment in Large Language Models through Neuron-Level Editing (Richang Hong, Weibiao Huang, Tat-Seng Chua, Le Wu) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E5 / R3 (94%) | - |
| DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher (Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu) | 2026 | Model Editing | empirical, ai-safety, model-editing | E6 / R3 (93%) | - |
Showing 30 of 2859 papers on page 1.
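The fast filtering described above can be sketched as a simple in-memory filter over records shaped like the table's columns. This is a minimal illustration, not the tool's actual API: the `Paper` dataclass, `filter_papers` helper, and sample rows are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    # Minimal record mirroring the table columns above.
    title: str
    year: int
    area: str
    tags: list = field(default_factory=list)

def filter_papers(papers, area=None, tag=None, year=None):
    """Return papers matching every filter that is set (None = no filter)."""
    return [
        p for p in papers
        if (area is None or p.area == area)
        and (tag is None or tag in p.tags)
        and (year is None or p.year == year)
    ]

# Two rows lifted from the table above as sample data.
papers = [
    Paper("ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents",
          2026, "Adversarial Robustness",
          ["empirical", "ai-safety", "adversarial-robustness"]),
    Paper("CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment",
          2026, "Model Editing",
          ["empirical", "alignment-training", "ai-safety", "model-editing"]),
]

hits = filter_papers(papers, tag="model-editing")
print([p.title.split(":")[0] for p in hits])  # → ['CATNIP']
```

Filters combine with AND semantics, matching how faceted browsing typically narrows a result set one facet at a time.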