Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Kui Ren, Haozhe Feng, Licheng Pan, Hui Xue

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R3 (94%)

A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode

Yupeng Chen, Philip Torr, Eric Sommerlade, Jialin Yu

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R3 (95%)

A White-Box Prompt Injection Attack on Embodied AI Agents Driven by Large Language Models

Yubin Qu, W. E. Wong, Tongcheng Geng

Year: 2026 · Area: Adversarial Robustness · Citations: 1

Tags: empirical, ai-safety, adversarial-robustness

E4 / R3 (95%)

ADV-0: Closed-Loop Min-Max Adversarial Training for Long-Tail Robustness in Autonomous Driving

Wei Ma, Jie Sun, Junlin He, Yihong Tang

Year: 2026 · Area: cs.LG · Citations: -

Tags: ai-safety, cslg, adversarial-robustness, preprint

-

AM3Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Yujin Zhou, Yike Guo, Chengkun Cai, Pengcheng Wen

Year: 2026 · Area: Multimodal Safety · Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness, multimodal-safety

E5 / R3 (97%)

Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Angad Singh Ahuja

Year: 2026 · Area: cs.LG · Citations: -

Tags: ai-safety, cslg, adversarial-robustness, preprint

E4 / R3 (94%)

Adversarial Reinforcement Learning for Detecting False Data Injection Attacks in Vehicular Routing

Aron Laszka, Yevgeniy Vorobeychik, Taha Eghtesad

Year: 2026 · Area: cs.AI · Citations: -

Tags: ai-safety, adversarial-robustness, csai, preprint

-

Adversarial attacks against Modern Vision-Language Models

Alejandro Paredes La Torre

Year: 2026 · Area: cs.CR · Citations: -

Tags: ai-safety, adversarial-robustness, cscr, preprint

-

Agentic Uncertainty Reveals Agentic Overconfidence

Jean Kaddour, Leo Richter, Srijan Patel, Pasquale Minervini

Year: 2026 · Area: Agent Safety · Citations: -

Tags: empirical, agent-safety, ai-safety, adversarial-robustness

E5 / R3 (96%)

Amnesia: Adversarial Semantic Layer Specific Activation Steering in Large Language Models

Nikolay Matyunin, Gurang Gupta, Jibesh Patra, Ali Raza

Year: 2026 · Area: cs.CR · Citations: -

Tags: ai-safety, adversarial-robustness, cscr, preprint

E5 / R3 (96%)

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Shangbin Feng, Wenxuan Ding, Yulia Tsvetkov, Ziyuan Yang

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E4 / R3 (93%)

Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models

Yuhong Wang, Zhihui Fu, Songze Li, Jiahao Chen

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E6 / R4 (96%)

Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation

Nesreen Ahmed, Mahantesh Halappanavar, Haoyu Han, Yue Zhao

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, benchmark

E5 / R3 (96%)

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Weihong Lin, Lin Sun, Jianfeng Si, Xiangzheng Zhang

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness

E5 / R4 (94%)

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Fadi Hassan, Hicham Eddoubi, Umar Faruk Abdullahi

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, safety-evaluation

E7 / R3 (98%)

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs

Wei Wang, Mingyu Yu, Sujuan Qin, Lana Liu

Year: 2026 · Area: Multimodal Safety · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, multimodal-safety

E5 / R3 (97%)

Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks

Murat Kantarcioglu, Jafar Isbarov

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (96%)

CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns

Qiankun Li, Kun Wang, Shilinlu Yan, Zhenhong Zhou

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, adversarial-robustness, safety-evaluation, benchmark

E6 / R5 (96%)

Cascade: Composing Software-Hardware Attack Gadgets for Adversarial Threat Amplification in Compound AI Systems

Jose Sanchez Vicarte, Anjo Vahldiek-Oberwagner, Sarbartha Banerjee, Prateek Sahu

Year: 2026 · Area: cs.CR · Citations: -

Tags: ai-safety, adversarial-robustness, cscr, preprint

-

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

Hwan Chang, Yonghyun Jun, Hwanhee Lee

Year: 2026 · Area: Adversarial Robustness · Citations: 9

Tags: empirical, ai-safety, adversarial-robustness

E5 / R4 (97%)

CoT Defender: Preemptive Chain-of-Thought Occupation for Jailbreak Attack Mitigation

Yihe Wang, Xiao Yu, Jin Liu, Xiaokang Li

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

-

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Rob Gilson, Bobby Chen, Christopher Liu, Hoagy Cunningham

Year: 2026 · Area: Adversarial Robustness · Citations: 6

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (94%)

Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?

Anna Chistyakova, Mikhail Pautov

Year: 2026 · Area: cs.LG · Citations: -

Tags: ai-safety, cslg, adversarial-robustness, preprint

-

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Tal Kachman, Samuel Nellessen

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (95%)

Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya, Chirag Chawla, S. S. Baidya

Year: 2026 · Area: cs.CL · Citations: -

Tags: cscl, ai-safety, adversarial-robustness, preprint

-

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Zafar Ayyub Qazi, Tao Ni, Shayan Ali Hassan, Marco Canini

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness

E5 / R3 (98%)

Enhancing Network Intrusion Detection Systems: A Multi-Layer Ensemble Approach to Mitigate Adversarial Attacks

Raphael Khoury, Kelton A. P. Costa, Nasim Soltani, Shayan Nejadshamsi

Year: 2026 · Area: cs.CR · Citations: -

Tags: ai-safety, adversarial-robustness, cscr, preprint

-

Ethical Risks in Deploying Large Language Models: An Evaluation of Medical Ethics Jailbreaking

Chengze Yan, Yunlou Fan, Jiacheng Ji, Hanhui Xu

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, safety-evaluation

E5 / R3 (96%)

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Zhi Rui Tam, Yen-Shan Chen, Yun-Nung Chen, Cheng-Kuang Wu

Year: 2026 · Area: Safety Evaluation · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, safety-evaluation

E5 / R3 (95%)

From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

Haibo Hu, Zhibiao Guo, Qingqing Ye, Zi Liang

Year: 2026 · Area: Model Editing · Citations: -

Tags: empirical, ai-safety, adversarial-robustness, model-editing

E7 / R4 (95%)

Showing 30 of 1490 papers on page 1.