Paper deep dive

A Survey on Agentic Security: Applications, Threats and Defenses

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

Year: 2025Venue: arXiv preprintArea: Agent SafetyType: SurveyEmbeddings: 91

Abstract

Abstract:In this work we present the first holistic survey of the agentic security landscape, structuring the field around three fundamental pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 160 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage. A complete and continuously updated list of all surveyed papers is publicly available at this https URL.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 6:14:39 PM

Summary

This paper provides a comprehensive survey of the agentic security landscape, categorizing research into three pillars: Applications (offensive red-teaming and defensive blue-teaming), Threats (vulnerabilities like injection, poisoning, and jailbreaking), and Defenses (hardening techniques and runtime monitoring). It analyzes over 160 papers to identify trends, such as the shift toward planner-executor architectures, and highlights critical research gaps in model coverage and benchmark standardization.

Entities (6)

LLM Agent · system · 100%Jailbreak Attack · threat · 99%Poisoning Attack · threat · 99%Prompt Injection · threat · 99%Blue Teaming · application · 98%Red Teaming · application · 98%

Relation Signals (4)

LLM Agent → issubjectto → Prompt Injection

confidence 95% · Prompt injection attacks embed malicious instructions within the prompt fed to an LLM to manipulate it into performing unintended actions

LLM Agent → performs → Red Teaming

confidence 95% · autonomous and reasoning-driven red-team agentic systems that conduct penetration testing

LLM Agent → performs → Blue Teaming

confidence 95% · blue-team applications of LLM agents for continuous monitoring, threat detection

Poisoning Attack → targets → LLM Agent

confidence 95% · Poisoning attacks present another critical vulnerability for LLM agents by corrupting their memory

Cypher Suggestions (2)

Find all threats targeting LLM Agents · confidence 90% · unvalidated

MATCH (t:Threat)-[:TARGETS]->(a:System {name: 'LLM Agent'}) RETURN t.name

List all applications of LLM Agents · confidence 90% · unvalidated

MATCH (a:System {name: 'LLM Agent'})-[:PERFORMS]->(app:Application) RETURN app.name

Full Text

90,203 characters extracted from source content.

Expand or collapse full text

A Survey on Agentic Security: Applications, Threats and Defenses Asif Shahriar 1 * , Md Nafiu Rahman 1 * , Sadif Ahmed 1 * , Farig Sadeque 1 , Md Rizwan Parvez 2 1 BRAC University, 2 Qatar Computing Research Institute (QCRI) Abstract In this work we present the first holistic survey of the agentic security landscape, structuring the field around three fundamental pillars: Ap- plications, Threats, and Defenses. We provide a comprehensive taxonomy of over 160 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. A detailed cross-cutting analy- sis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage. A complete and con- tinuously updated list of all surveyed papers is publicly available athttps://github.com/ kagnlp/Awesome-Agentic-Security. 1 Introduction Since their introduction, Large Language Models (LLMs) have been used in the domain of cyberse- curity (Xu et al., 2025a; Wang et al., 2024; Deng et al., 2024a). The transition of research landscape from passive LLMs to autonomous LLM-agents (Yao et al., 2023; Shinn et al., 2023; Schick et al., 2023) has made these models significantly more ca- pable, allowing them not just to describe a solution but to execute it. Definition: LLM Agent We define an LLM Agent as a system whose core decision module is an LLM that plans, invokes tools/APIs, and acts in an external environment while observing feedback and adapting subsequent actions. It maintains state (short/long-term memory or a knowl- edge store) and may include explicit self- critique/verification and governance layers to satisfy task goals and safety constraints. * Equal contribution. Author order does not matter. This newfound agency has enabled LLM-agents to demonstrate remarkable capabilities across the security spectrum (Shen et al., 2025; Zhu et al., 2025b; Lin et al., 2025b). However, a number of studies have shown that the very act of wrapping an LLM in an agentic framework significantly in- creases its vulnerability (Saha et al., 2025; Kumar et al., 2025; Chiang et al., 2025). In response, a growing body of research has focused on devel- oping countermeasures to harden these systems (Debenedetti et al., 2025; Udeshi et al., 2025). The rapid development of agentic security re- search – with over 150 papers in 2024-2025 alone – has created a fragmented landscape that lacks comprehensive analysis. While existing surveys provide valuable insights into specific aspects like security threats (Deng et al., 2024c), trustworthi- ness (Yu et al., 2025), enterprise governance (Raza et al., 2025) and core LLM safety (Ma et al., 2025), they fail to capture the complete picture, as shown in Table 1. This fragmentation leaves practitioners and researchers without a unified framework for un- derstanding how agent capabilities, vulnerabilities, and defenses interconnect. In this work we present the first holistic survey of the agentic security landscape, structured to answer three key questions a security researcher would ask: “What can agents do for my security?” (Applica- tions), “How can they be attacked?” (Threats), and “How do I stop them?” (Defenses). To this end, we define three pillars of taxonomy: Applications (§2). Using LLM-agents in down- stream cybersecurity tasks, including red team- ing (autonomous vulnerability discovery), blue teaming (defending against threats), and domain- specific security (cloud, web). Threats (§3). Security vulnerabilities inherent to agentic systems that attackers can exploit. Defenses (§4). Techniques and countermeasures used to harden agentic systems against the threats. By uniquely bridging these three pillars, our sur- 1 arXiv:2510.06445v2 [cs.CL] 20 Dec 2025 Table 1: Survey comparison. Legend:✓ = covered;△ = partial/limited;✗= not covered. A: Applications, T: Threats, D: Defenses, B: Benchmarks/testbed surveyed. SurveyATDBFocus Yu et al. (2025)✗✓Trustworthiness Raza et al. (2025)✗△✗Enterprise governance & risk Deng et al. (2024c)✗✓△✗Security threats He et al. (2024)✗✓✗ Technical vulnerabili- ties Ma et al. (2025)△✓△Large-modelsafety (agents as subset) This Survey✓Holistic agentic secu- rity vey provides a complete picture of the current state of the art, transforming a scattered collection of individual research efforts into an actionable body of knowledge. Our contributions are threefold: • Holistic review. We conduct an in-depth sur- vey of agentic security through a comprehensive three-pillar taxonomy, as presented in Fig. 1. •Focus on applications. We provide a detailed review focused on how agents are actually used by security teams—covering offensive, defen- sive, and domain-specific tasks, an area largely overlooked by prior surveys. •Cross-cutting analysis. We identify key trends and gaps, including migration from monolithic to planner-executor and hybrid agents, backbone LLM monopoly (GPT), uneven threat and modal- ity coverage, and benchmark fragmentation. 2 Applications of Agents in Security This section describes how agents are applied across the cybersecurity landscape, from offensive testing and exploit generation to defensive detec- tion, forensics, and automated remediation. 2.1 Offensive Security Agents (Red-Teaming) This subsection describes autonomous and reasoning-drivenred-teamagenticsystems that conduct penetration testing, vulnerability discovery, fuzzing, and exploit adaptation. 2.1.1 Autonomous Penetration Testing These agents perform autonomous end-to-end pen- etration testing using adaptive planning and feed- back. Deng et al. (2024a) propose PentestGPT, the first LLM-based framework with a Reason- ing–Generation–Parsing design reducing context loss, while Shen et al. (2025) and Kong et al. (2025b) extend this with multi-agent RAG and task-graph coordination respectively. Nieponice et al. (2025) introduce an SSH-focused system, and Happe and Cito (2025a) show autonomous adapta- tion of LLM agents in enterprise testbeds. Evalua- tion works include HackSynth (Muzsai et al., 2024) and AutoPentest (Henke, 2025), which benchmark or integrate continuous exploit intelligence. Com- parative and system-focused studies include Happe and Cito (2025b) on interactive vs autonomous agents, Singer et al. (2025) introducing Incalmo for reliable multi-host execution, and Luong et al. (2025), who achieves state-of-the-art results on AutoPenBench (Gioacchini et al., 2024) and AI- Pentest-Benchmark (Isozaki et al., 2024). 2.1.2 Automated Vulnerability Discovery & Fuzzing Zhu et al. (2025a) propose Locus for deep-state exploration via predicate synthesis. Meng et al. (2024) extract protocol grammars to guide fuzzing, while Fang et al. (2024) show LLMs can au- tonomously exploit one-day flaws.Zhu et al. (2025d) extend this to multi-agent zero-day dis- covery.Wang and Zhou (2025) present a two-phase agentic system for Android vulnerability discov- ery and validation. Lee et al. (2025b); Zhu et al. (2025b); Wang et al. (2025c) introduce benchmarks to evaluate LLM agents on tasks like exploita- tion and repair, while ExCyTInBench(Wu et al., 2025b) highlights challenges in multi-step reason- ing. LLMFuzzer (Yu et al., 2024), TitanFuzz(Deng et al., 2023), and FuzzGPT (Deng et al., 2024b) use fuzzing or reasoning-based input generation. 2.1.3 Exploit Generation & Adaptation Lupinacci et al. (2025) show that LLM agents can be coerced into autonomously executing malware via prompt injection, RAG backdoors, and inter- agent trust abuse. Saha and Shukla (2025) pro- posed a multi-agent system producing diverse mal- ware samples for studying evasion tactics. He et al. (2025a) describe an Agent-in-the-Middle attack injecting malicious logic into multi-agent frameworks by intercepting and mutating messages, while Ullah et al. (2025) introduce CVE-Genie, a multi-agent framework that automatically rebuilds environments and generates verifiable exploits, suc- cessfully reproducing 428 of 841 CVEs across 22 programming languages and 141 CWE categories. Finally, Fakih et al. (2025) present a repair system 2 Agentic Security Applications (§2) Red Teaming (§2.1) Autonomous Penetra- tion Testing (§2.1.1) PentestGPT (Deng et al., 2024a), HackSynth (Muz- sai et al., 2024), Incalmo (Singer et al., 2025) Automated Vulner- ability Discovery & Fuzzing (§2.1.2) Locus (Zhu et al., 2025a), ChatAFL (Meng et al., 2024), LLM-Fuzzer (Yu et al., 2024), FuzzGPT (Deng et al., 2024b) Exploit Generation & Adaptation (§2.1.3) MalGen (Saha and Shukla, 2025), CVE-Genie (Ullah et al., 2025) Blue Teaming (§2.2) Autonomous Threat Detection & Incident Response (§2.2.1) IRCopilot (Lin et al., 2025b), CORTEX (Wei et al., 2025), CyberSOCEval (Deason et al., 2025) Intelligent Threat Hunting (§2.2.2) ProvSEEK (Mukherjee and Kantarcioglu, 2025), LLMCloudHunter (Schwartz et al., 2025) Automated Foren- sics (§2.2.3) RepoAudit (Guo et al., 2025a), CyberSleuth (Fumero et al., 2025), GALA (Tian et al., 2025) Autonomous Patching & Remediation (§2.2.4) RepairAgent (Zhu et al., 2025c), (Toprani and Madisetti, 2025) Domain-specific (§2.3) Cloud and Infrastructure Security (§2.3.1) KubeIntellect (Ardebili and Bartolini, 2025), BARTPredict (Diaf et al., 2025) Web and Application Security (§2.3.2) MAPTA (David and Gervais, 2025), AIOS (Mei et al., 2025), Progent (Shi et al., 2025), PFI (Kim et al., 2025) Specialized Appli- cations (§2.3.3) LISA (Sun et al., 2025) (Blockchain), HIPAA (Neupane et al., 2025) (Healthcare), OneShield (Asthana et al., 2025) (Privacy & Data Governance), Threats (§3)Attack Surface (§3.1) Injection Attacks (§3.1.1) AgentDojo (Debenedetti et al., 2024), InjectAgent (Zhan et al., 2024), PromptInfection (Lee and Tiwari, 2024) Poisoning and Extrac- tion Attacks (§3.1.2) AgentPoison (Chen et al., 2024), PoisonBench (Fu et al., 2025b) Jailbreak Attacks (§3.1.3) BrowserArt (Kumar et al., 2025), JAWS-BENCH (Saha et al., 2025), LLMFuzzer (Yu et al., 2024) Agent Manipula- tion Attacks (§3.1.4) PromptInject (Perez and Ribeiro, 2022), AEA (Guo et al., 2025b), InfoRM (Miao et al., 2024) Pre-execution At- tacks (§3.1.5) Backdoor Threats (Yang et al., 2024), ScamA- gents (Badhe, 2025), RSP (Zhou and Wang, 2025) Red-Teaming At- tacks (§3.1.6) SentinetAgent (He et al., 2025a), AiTM (Liu et al., 2025a) Evaluation (§3.2) Adverserial Bench- marks (§3.2.1) DoomArena (Boisvert et al., 2025), SafeArena (Tur et al., 2025), STWebAgentBench (Levy et al., 2025), ASB (Zhang et al., 2025a) Execution Envi- ronments (§3.2.2) AgentDojo (Debenedetti et al., 2024), CVE-Bench (Zhu et al., 2025c), DoomArena (Boisvert et al., 2025) Defenses (§4) Defenses & Op- erations (§4.1) Secure-by- Designs (§4.1.1) ACE (Li et al., 2025b), Task Shield (Jia et al., 2025) Multi-Agent Se- curity (§4.1.2) D-CIPHER (Udeshi et al., 2025), PhishDebate (Li et al., 2025d) Runtime Pro- tection (§4.1.3) R 2 -Guard (Kang and Li, 2024) (GuardRail), AgentSpec (Wang et al., 2025a) (HITL), SentinelA- gent (He et al., 2025b) (Behavioral Monitoring) Security Oper- ations (§4.1.4) IRIS (Li et al., 2025e) (Formal Verification), AutoBNB (Liu, 2025) (Incident Response), CORTEX (Wei et al., 2025) (SOC & Alert Triage), ExCyTIn-Bench (Wu et al., 2025b) (Threat Hunting), GALA (Tian et al., 2025) (Forensics & RCA) Evaluation (§4.2) Benchmarking Plat- forms (§4.2.1) ASB (Zhang et al., 2025a), RAS-Eval (Fu et al., 2025c) Defense Testing (§4.2.2) AI Agents Under Threat (Deng et al., 2024c), Safety at Scale (Ma et al., 2025) Domain Specific (§4.2.3) Agentic-AI Healthcare (Shehab, 2025) (Health), LAW (Watson et al., 2024) (Legal), Priva- cyChecker (Wang et al., 2025d) (Privacy) Figure 1: Overview of Agentic Security Taxonomy combining fine-tuned LLMs and iterative valida- tion to generate accurate vulnerability patches. 2.2Defensive Security Agents (Blue-Teaming) This subsection describes blue-team applications of LLM agents for continuous monitoring, threat detection, incident response, threat hunting, and automated patching. 2.2.1 Autonomous Threat Detection & Incident Response This area studies agentic SOC frameworks that monitor alerts, analyze threats, and execute re- sponse playbooks. Tellache et al. (2025) propose a RAG-based agent combining CTI and SIEM data for automated triage to generate contextually rel- evant mitigation strategies against Advanced Per- sistent Threats (APTs). Lin et al. (2025b) propose IRCopilot to enhance incident response reliabil- ity with role-based agents, while Wei et al. (2025) introduce a collaborative agent CORTEX that re- duces false alerts by 10.7% compared to single- agent baselines on a fine-grained SOC workflow dataset. Liu (2025) explore centralized and hy- brid agent models for team-based response, while Singh et al. (2025) find LLMs are mainly used as assistive tools in real-world SOCs. Deason et al. (2025) benchmark LLM threat reasoning, exposing performance gaps between commercial and open- source models on malware analysis, while Molleti et al. (2024) survey log-analysis agents and high- light scalability and robustness challenges inherent to high-volume industrial logging environments. 3 2.2.2 Intelligent Threat Hunting Mukherjee and Kantarcioglu (2025) introduce a provenance-forensics framework using RAG, CoT reasoning, and agent orchestration to refine and verify evidence to achieve 22%/29% higher pre- cision/recall for threat detection on six DARPA Transparent Computing and OpTC datasets com- pared to SOTA PIDS. Meng et al. (2025a) ana- lyze failures in LLM-assisted CTI workflows and propose fixes for contradictions and generaliza- tion gaps. Schwartz et al. (2025) introduce LLM- CloudHunter, extracting cloud IoCs and generating high-precision detection rules on a test set of cloud- specific threat reports. 2.2.3 Automated Forensics & Root Cause Analysis Guo et al. (2025a) introduce a repository-level au- diting agent that uses memory and path-condition checks to reduce hallucinations. Alharthi and Yasaei (2025) and Fumero et al. (2025) develop LLM-powered tools for classifying logs, extract- ing forensic intelligence, and analyzing network traffic, while Tian et al. (2025) combine causal in- ference and LLM reasoning for iterative root-cause analysis.Pan et al. (2025) present the MAST tax- onomy of multi-agent failure modes and an LLM- as-Judge pipeline for execution failure detection, while Alharthi and Garcia (2025) introduce CIAF, an ontology-driven framework for structuring cloud logs and assembling incident narratives. 2.2.4 Autonomous Vulnerability Remediation Several agents are proposed for automated patch synthesis and vulnerability repair.Zhu et al. (2025c) benchmark LLM agents on real CVE repair using static-analysis tools.Bouzenia et al. (2025) introduce RepairAgent,an autonomous bug-repair pipeline that interleaves tool invocation and feed- back, successfully fixing 164 Defects4J bugs, in- cluding 39 not repaired by prior techniques. Ap- plied systems include Gemini-based patching work- flow (Keller and Nowakowski, 2024) and an IaC agent (Toprani and Madisetti, 2025) that integrates automated remediation into CI/CD pipelines. 2.3 Domain-specific Applications The section explains domain-specific agentic sys- tems using LLMs for auditing, vulnerability detec- tion, and policy-based hardening across sectors. 2.3.1 Cloud and Infrastructure Security This section covers agents securing cloud and in- frastructure through automated scanning, harden- ing, and remediation. Yang et al. (2025b) propose a two-phase workflow—sandbox “exploration” fol- lowed by verified “exploitation”—to safely test CSPM remediations. Ardebili and Bartolini (2025) introduce an LLM supervisor coordinating sub- agents for log analysis, RBAC auditing, and de- bugging, while Ye et al. (2025) present LLMSec- Config, combining static analysis and RAG to fix Kubernetes misconfigurations. Diaf et al. (2025) propose BARTPredict for IoT traffic forecasting and anomaly detection, and Toprani and Madisetti (2025) describe an IaC-focused agent that auto- generates CI/CD-ready fixes for policy-compliant hardening. 2.3.2 Web and Application Security David and Gervais (2025) propose a multi-agent web pentesting framework with sandboxed PoC val- idation for safe, repeatable exploit testing. Mudryi et al. (2025) analyze browser-agent threats like prompt injection and credential leaks, introducing layered defenses like sanitization and formal anal- ysis. At the OS level, Mei et al. (2025) design an agent-oriented OS that isolates LLMs and mediates tool access via policy. Hu et al. (2025) formal- ize OS agent observation/action spaces to support structured risk analysis. Kim et al. (2025) validate control/data flows to prevent privilege escalation, while Shi et al. (2025) present a runtime agent to enforce deterministic, fine-grained permissions that eliminate attack success in red-team evaluations. 2.3.3 Specialized Applications This section reviews agentic security across finance, healthcare, privacy, and embodied systems. In fi- nance, Sun et al. (2025) present LISA, a smart- contract auditor outperforming static analyzers on logic flaws, while Kevin and Yugopuspito (2025) introduce SmartLLM, boosting Solidity vulnerabil- ity detection. Hybrid and conversational systems Ma et al. (2024); Xia et al. (2024) enhance explain- ability and exploit reproduction. In healthcare, Ne- upane et al. (2025) propose a HIPAA-compliant agent framework with PHI sanitization and im- mutable audit trails. Asthana et al. (2025) develop OneShield, a multilingual privacy-guardrail sys- tem for PII/PHI detection and OSS risk flagging. For embodied systems, Xing et al. (2025) expose threats from adversarial prompts, sensor spoofing, 4 and instruction misuse, noting that runtime valida- tion reduces but cannot eliminate safety risks. 3 Threats to Agentic Systems The transition from standalone LLMs to au- tonomous agents introduces a more severe set of security challenges (Saha et al., 2025; Chiang et al., 2025), as the safety alignment (refusal training) of a base LLM does not reliably transfer to the agen- tic context (Kumar et al., 2025) . In this section we discuss the threat landscape targeting agentic systems and the frameworks used to evaluate their resilience. 3.1 Attack Surface 3.1.1 Injection Attacks Prompt injection attacks embed malicious in- structions within the prompt fed to an LLM to ma- nipulate it into performing unintended actions (Liu et al., 2024c,a; Yi et al., 2025; Shao et al., 2024). Wang et al. (2025e) identify that the static and pre- dictable structure of an agent’s system prompt is a key vulnerability that enables prompt injection at- tacks to agentic systems. Debenedetti et al. (2024) introduce a benchmark comprising of 97 realis- tic tasks (e.g., email management, online banking) which reveals a fundamental trade-off: security de- fenses that reduce vulnerability also degrade the agent’s task-completion utility. Liu et al. (2024b) propose split-payload injection attack and find 31 LLM-integrated applications to be vulnerable, in- cluding Notion. Several studies show the vulnera- bility of LLM agents to indirect prompt injection attacks (Zhan et al., 2024; Li et al., 2025a; Yi et al., 2025). Dong et al. (2025a) show that attackers can inject malicious records into ab agent’s mem- ory bank by only interacting via queries and output observations, without any direct memory access. Lee and Tiwari (2024) develop a novel prompt injection attack where a malicious prompt self- replicates across interconnected agents in a multi- agent system like a computer virus and causes system-wide disruption. Dong et al. (2025b) pro- pose a memory injection attack that uses crafted prompts to indirectly poison an agent’s long-term memory for later malicious execution. Alizadeh et al. (2025) demonstrate that such attacks can cause tool-calling agents to leak sensitive personal data observed during their tasks. Wang et al. (2025f) develop a black-box fuzzing technique that uses Monte Carlo Tree Search to automatically discover indirect prompt injection vulnerabilities by iteratively mutating prompts and environmen- tal observations. Zhang et al. (2025a) and An- driushchenko et al. (2025) design benchmarks that reveal high vulnerability of LLM agents to prompt injection attacks. Zhan et al. (2025) systematically evaluate eight different defenses for LLM agents and demonstrate that all of them can be success- fully bypassed by crafting adaptive attacks using established jailbreaking techniques such as GCG (Zou et al., 2023) and AutoDAN (Liu et al., 2024a). 3.1.2 Poisoning and Extraction Attacks Poisoning attacks present another critical vulner- ability for LLM agents by corrupting their mem- ory or knowledge retrieval systems. Fendley et al. (2025) categorize these attacks by their specifica- tions (poison set, trigger, poison behavior, deploy- ment) and define key metrics for evaluation (suc- cess rate, stealthiness, persistence). Dong et al. (2025b) demonstrate a practical attack that poi- sons an agent’s memory through seemingly benign queries, causing it to execute malicious actions when the poisoned memory is later retrieved by a victim. Similarly, Chen et al. (2024) develop AgentPoison, which poisons an agent’s memory or knowledge base by optimizing a backdoor trig- ger that forces the retrieval of malicious records to hijack its behavior. Zhang et al. (2025a) provide a comprehensive framework for measuring agent vulnerabilities to various attacks, including data poisoning. Several benchmarks (Fu et al., 2025b; Bowen et al., 2025) reveal that larger models do not gain resilience and may even be more susceptible to data poisoning. Guo et al. (2025b) show that adversaries can repeatedly query an agent’s API to obtain a large set of input-output pairs, which can then be used to train an unauthorized "clone" or derivative model, effectively stealing the intel- lectual property and competitive advantage of the original model provider. These types of attacks are called model extraction attacks. 3.1.3 Jailbreak Attacks Jailbreak attacks attempt to bypass a model’s built- in safety measures to force it to produce harmful or unintended content (Wei et al., 2023; Zou et al., 2023; Xu et al., 2024; Lin et al., 2025a). Kumar et al. (2025) and Chiang et al. (2025) both demon- strate that AI agents are significantly more vul- nerable to jailbreak attacks than their underlying LLMs. Kumar et al. (2025) and Andriushchenko 5 et al. (2025) show that simple jailbreaking tech- niques designed for chatbots are highly effective against browser agents, while Chiang et al. (2025) identify three critical design factors (embedding goal directly into system prompt, iterative action generation, and processing environment feedback through event stream) that increase an agent’s sus- ceptibility. Andriushchenko et al. (2025) discover that leading LLMs are surprisingly compliant with malicious agent requests even without jailbreaking. Saha et al. (2025) find that LLM coding agents are highly vulnerable to jailbreak attacks that pro- duce executable malicious code, with attack suc- cess rates reaching 75% in multi-file codebases. Yu et al. (2024) use fuzzing techniques to generate novel jailbreak prompts from human-written seeds. Anil et al. (2024) demonstrate that numerous in- context examples of harmful question answering can override a model’s safety training. Robey et al. (2024) present a comprehensive exploration of jail- break attacks on agentic robotic systems. 3.1.4 Agent Manipulation Attacks This class of attacks targets the higher-level cogni- tive functions of the agent: its planning, reasoning, and goal-setting modules. Goal hijacking attacks subtly or overtly alter an agent’s objectives, caus- ing it to subvert its original goal (e.g., summariz- ing a document) to include a secondary, malicious goal (e.g., including advertisements) defined by the attacker (Perez and Ribeiro, 2022; Guo et al., 2025b). Pham and Le (2025) introduce a black-box algorithm that automatically generates malicious system prompts to hijack an LLM’s behavior for specific targeted questions, while Chen and Yao (2024) leverage an LLM’s weakness in role iden- tification to trick the model into executing a new, malicious task instead of the original one. Zhang et al. (2025b) introduce an action hi- jacking attack where an agent is tricked into as- sembling seemingly information data from its own knowledge base into harmful instructions, bypass- ing input filters. Another class of hijacking at- tacks is reward hacking, which exploits the reward mechanisms in RL-trained agents (Skalse et al., 2022; Pan et al., 2021; Miao et al., 2024; Fu et al., 2025a). These can be caused by reward misgeneral- ization where models learn from spurious features (Miao et al., 2024), or by agents exploiting reward model ambiguities to maximize their score with- out true alignment (Fu et al., 2025a). Bondarenko et al. (2025) demonstrate specification gaming vulnerabilities, where a capable LLM agent (e.g. OpenAI’s o3) instructed to "win against a strong chess engine" hacks the game’s environment to en- sure victory rather than play fairly, thus satisfying the literal instruction while violating user intent. Finally, a novel threat on multi-agent systems is the presence of a Byzantine agent, which is a single compromised or malicious agent that can disrupt the collective’s ability to complete a task securely and correctly (Li et al., 2024; Jo and Park, 2025). 3.1.5 Pre-execution Cognitive Attacks Even before tool execution, the agent’s internal state—its reasoning, planning, and reflection—is vulnerable to manipulation. Epistemic attacks corrupt an agent’s intermediate thought steps. Gre- shake et al. (2023) demonstrate that indirect prompt injection can force the agent to condition its next step on a hallucinated prior, so that the logic re- mains consistent but the agent itself becomes mali- cious. Yang et al. (2024) inject malicious behavior in intermediate reasoning steps (e.g. calling un- trusted APIs) while keeping final outputs correct. Teleological attacks manipulate an agent’s plan- ning graphs and goal-directed structures. Badhe (2025) demonstrate that attackers can weaponize an agent’s task decomposition logic to frame a ma- licious objective (e.g., "steal data") as a sequence of benign-looking subtasks ("read file," "write log") that the agent mistakes for a legitimate plan. In addition, Metacognitive attacks target an agent’s self-correction ability.Zhou and Wang (2025) show that rewriting retrieved context into epistemic tones can manipulate an agent’s verification depth and self-confidence to hurt its self-correction. 3.1.6 Red-Teaming Attacks Perez et al. (2022) first showed that is is possible to use one LLM to automatically generate test-cases that uncovered harmful behaviors like offensive content and data leakage in a target model. Ge et al. (2023) elevated this to an multi-round iterative set- ting. He et al. (2025a) develop a directed grey- box fuzzing framework designed specifically for detecting taint-style vulnerabilities (such as code injection) in LLM agents. Liu et al. (2025a) in- troduce the "Agent-in-the-Middle" attack, where an adversarial agent red-teams a system by inter- cepting and manipulating inter-agent communica- tions. Zhang and Yang (2025) present a search- based framework that simulates multi-turn inter- actions where an LLM optimizer adversarially co- 6 evolves the strategies of both attacking and defend- ing agents to discover emergent risks. 3.2 Evaluation Frameworks In this section we discuss benchmarks and environ- ments designed to assess agentic vulnerabilities. 3.2.1 Adversarial Benchmarking Zhang et al. (2025a) introduce ASB benchmark with 10 scenarios and 27 attack classes. RASEval (Fu et al., 2025c) contains 80 attack scenarios in domains like healthcare and finance, demonstrating a 36.8% reduction in task completion under attack. AgentDojo (Debenedetti et al., 2024) uses 97 re- alistic tasks to highlight the fundamental trade-off between an agent’s security and its task-completion utility, while AgentHarm (Andriushchenko et al., 2025) uses a dataset of 110 unique harmful tasks to reveal significant gaps in agent safety alignment. For web agents, SafeArena (Tur et al., 2025) mea- sures completion rates on 250 malicious requests, finding agents complete 34.7% of them, while ST- WebAgentBench (Levy et al., 2025) introduces met- rics for policy-compliant success, finding it is 38% lower than standard task completion. For code agents, JAWS-BENCH (Saha et al., 2025) finds up to 75% attack success rates in multi-file codebases, while SandboxEval (Rabin et al., 2025) assesses the security of the execution environment itself with 51 test-cases. InjecAgent (Zhan et al., 2024) offers a dedicated benchmark for indirect prompt injection attcaks, while BrowserART (Kumar et al., 2025) focuses on the susceptibility to jailbreaks. 3.2.2 Execution Environments Zhu et al. (2025c) design a sandbox framework that enables LLM agents to interact with exploit vul- nerable web applications. Debenedetti et al. (2024) provide a stateful environment with 97 realistic tasks to evaluate the robustness of LLM agents against prompt injection attacks. DoomArena (Boisvert et al., 2025) is a modular red-teaming platform for LLM agents that allows researchers to compose sequential attacks and to mix-and-match adaptive adversary strategies. Zhou et al. (2024) introduce a realistic web environment with 812 long-horizon tasks, where even best performing agents achieve less than 15% success rate. 4 Defense: Hardening the Agents This section describes architectural, runtime, and formal-verification defenses that strengthen agentic systems against attacks. 4.1 Defense & Operations Here we focus on secure-by-design frameworks that embed layered verification, isolation, and control-flow integrity into agent architectures. 4.1.1 Secure-by-Design Recent works (Debenedetti et al., 2024; Li et al., 2025b; Rosario et al., 2025) advance modular and plan–execute isolation, cutting cross-context injection rates by over 40%. Task-level align- ment and polymorphic prompting (Jia et al., 2025; Debenedetti et al., 2025; Wang et al., 2025e) em- ploy intent validation and adaptive obfuscation to resist evolving attacks. Governance-oriented frameworks (He et al., 2024; Narajala and Narayan, 2025; Raza et al., 2025; Adabara et al., 2025) ex- tend Secure-by-Design principles through TRiSM- based trust calibration and layered threat model- ing. Tang et al. (2024) introduce ModelGuard, constraining knowledge leakage via information- theoretic entropy bounds. 4.1.2 Multi-Agent Security Secure multi-agent paradigms (Udeshi et al., 2025; Liu et al., 2025b) apply zero-trust and dynamic col- laboration to minimize leakage under adversarial conditions. Core vulnerabilities—spoofing, trust delegation, and collusion—are detailed in Han et al. (2025); Ko et al. (2025), motivating formal cross- agent verification. Debate-based collectives (HU et al., 2025; Li et al., 2025d) achieve over 90% phishing detection via randomized smoothing and adversarial consensus, while Lee and Tiwari (2024) uncover LLM-to-LLM prompt infection, highlight- ing provenance tracking for containment. 4.1.3 Runtime Protection Reasoning- and knowledge-enhanced guardrails such as R 2 -Guard, AgentGuard, and AGrail (Kang and Li, 2024; Xiang et al., 2025; Chen and Cong, 2025; Luo et al., 2025) reduce jailbreak failures by up to 35%. Adaptive systems like PSG-Agent (Wu et al., 2025a) sustain accuracy under evolv- ing threats via personality awareness and continual learning. Deployment studies (Rad et al., 2025; Amazon Web Services, 2024) optimize latency and integrate layered safeguards in production ecosystems. Human-in-the-loop oversight (Wang et al., 2025a) embeds runtime policy enforcement and approval gates for accountability. Behavioral 7 anomaly detectors such as Confront and SentinelA- gent (Song et al., 2025; He et al., 2025b) leverage log and graph reasoning for interpretable detection. 4.1.4 Security Operations Formal verification systems (Kouvaros et al., 2019; Crouse et al., 2024; Lee et al., 2025a; Chen and Cong, 2025) ensure behavioral correctness and run- time assurance through VeriPlan and AgentGuard. LLM-driven analyzers (Yang et al., 2025a; Li et al., 2025e) achieve over 92% accuracy in static anal- ysis, while verification-driven pipelines such as Chain-of-Agents and RepoAudit (Li et al., 2025c; Guo et al., 2025a) operationalize formal assurance. Autonomous response pipelines (Tellache et al., 2025; Molleti et al., 2024) fuse LLM reasoning with threat intelligence, reducing MTTD by 30%. Collaborative frameworks (Liu, 2025; Lin et al., 2025b) like AutoBnB and IRCopilot coordinate triage and remediation. SOC studies (Singh et al., 2025; Wei et al., 2025; Deason et al., 2025) re- veal hybrid agent models (e.g., CORTEX) that improve alert precision and reduce fatigue. Rule- and provenance-based threat hunters (Mukherjee and Kantarcioglu, 2025; Schwartz et al., 2025; Meng et al., 2025b; Wu et al., 2025b; Meng et al., 2025a) enable explainable detection and blue-team benchmarking.Cloud-native forensic systems (Alharthi and Yasaei, 2025; Alharthi and Garcia, 2025; Fumero et al., 2025; Tian et al., 2025) like LLM-Powered Forensics, CIAF, CyberSleuth, and GALA automate evidence extraction, reduce triage time by 40%, and improve causal reconstruction. 4.2 Evaluation Frameworks 4.2.1 Benchmarking Platforms Core testbeds such as AgentDojo,τ-Bench, and TurkingBench (Debenedetti et al., 2024; Yao et al., 2024; Xu et al., 2025b) simulate real-world tasks to evaluate robustness and failure modes of tool-using LLM agents. Safety-focused suites like SafeArena, ST-WebAgentBench, and RAS-Eval (Tur et al., 2025; Levy et al., 2025; Fu et al., 2025c) measure reliability under adversarial stress, while attack- driven frameworks—ASB, AgentHarm, and CVE- Bench (Zhang et al., 2025a; Andriushchenko et al., 2025; Zhu et al., 2025c)—quantify exploitability and vulnerability reproduction. Sandboxed envi- ronments such as DoomArena, ToolFuzz, and We- bArena (Boisvert et al., 2025; Rabin et al., 2025; Milev et al., 2025; Zhou et al., 2024) further en- hance reproducibility, and aiXamine (Deniz et al., 2025) offers a streamlined, modular suite for acces- sible LLM safety evaluation. 4.2.2 Defense Testing Adaptive studies (Zhan et al., 2025; de Witt, 2025) expose defense fragility under evolving adver- saries, urging continuous red-teaming. Broader surveys (Yu et al., 2025; Gan et al., 2024; Deng et al., 2024c) consolidate evolving countermea- sures, while Ma et al. (2025) and Wang et al. (2025b) emphasize scalable assurance spanning system and governance layers. 4.2.3 Domain-Specific Frameworks Agentic frameworks in healthcare increasingly em- bed native defenses against data leakage and policy non-compliance. Shehab (2025) propose Agentic- AI Healthcare, a multilingual, privacy-first system using the Model Context Protocol (MCP). Its “Pri- vacy and Compliance Layer” enforces several field- level encryption, and tamper-evident audit logging, thus embedding compliance structurally rather than adding it post hoc. Beyond healthcare, Wang et al. (2025d) present PrivacyChecker and PrivacyLens- Live for multi-agent LLM environments. These model-agnostic tools use contextual-integrity rea- soning and real-time monitoring to mitigate privacy risks dynamically. In legal domains, Watson et al. (2024) introduce LAW which reduces hallucina- tions and clause omissions through tool orchestra- tion and task partitioning. 5 Cross-Cutting Analysis and Trends A cross-cutting analysis of the papers reveals clear structural patterns, as shown in Fig. 2. Architecture and autonomy. The field is shift- ing towards planner–executor architectures (39.8%) and hybrid models (14%). It reflects a growing appreciation of decomposed cognitive pipelines, where planning, execution, and verification can be modularized to improve interpretability and de- bugging. More than half of the works (71 papers) implement bounded automation, eliminating the need for non-scalable human approvals. Agent role. The dominance of executor and plan- ner roles (129 papers) reflects the field’s opera- tional emphasis on task decomposition and con- trol. Critics/verifiers appear in 95 papers. Addition- ally, the growing number of tool-caller (42) and governor/mediator (24) agents signifies a funda- mental shift from monolithic reasoning to layered, 8 39.8 24.6 10.5 14.0 5.8 5.3 Planner-Exec. (68) Monolithic (42) LLM-Judge (18) Hybrid (24) Debate (10) Teacher-Stud. (9) (a) Architecture patterns 050100150 Governor/Med Tool-caller Critic/Verifier Planner Executor 24 42 95 107 129 Number of Papers (b) Agent role distribution 050100150 DeepSeek Qwen Mistral Gemini LLaMA Claude GPT 15 18 30 48 63 71 126 Number of Papers (c) LLM backbones 050100150 RLHF/Pref. Finetuning RAG ICL Pretrained 8 38 43 66 132 Number of Papers (d) Knowledge sources 050100150 Binaries Images Net. Traces Code Logs Text 10 11 38 93 101 141 Number of Papers (e) Data modality Figure 2: Cross-cutting analysis of agent architectures, roles, backbones, knowledge sources, and data modalities. self-checking collectives designed for explicit self- regulation and ethical alignment. LLM backbones. GPT-family of models domi- nate by appearing in 83% of studies, followed by Claude, LLaMA and Gemini. Except for LLaMA, other open-weight models like Mistral, Qwen and Deepseek are in the minority, suggesting a lack of trust in their agentic capabilities. Moreover, model- specific alignment differences create fragmenta- tion: safety fine-tuning and evaluation pipelines are rarely transferable, hindering cross-model gen- eralization and reproducibility. Modalities. Input modality spectrum is dominated by text, logs and codes. Although images, network traces and binaries are often tied to security vul- nerabilities and intrusion, such as in browser-use agents, these non-textual modalities are underex- plored. This research gap also presents a promising area for future work. Knowledge source. Pretrained knowledge-bases dominate agentic workflows (132 papers). ICL and RAG show partial adoption, while fine-tuning, RL, and preference learning remain niche. This imbalance suggests a community preference for lightweight deployment over continual learning, which is practical for agents but potentially inse- cure in dynamic threat environments. It also pro- vides future research direction in securing RAG pipelines with verified provenance, incremental fine-tuning, and model distillation. 6 Conclusion In this survey we explore the current landscape of agentic security literature, focusing on downstream applications, threats to agentic systems, and coun- termeasures. A deeper analysis shows the preva- lence of multi-agent systems over monolithic ar- chitecture, the de-facto status of GPT models as the core of agentic systems, and the community preference of pre-trained knowledge for practical deployment compared to fine-tuning or RAG based approaches. Future works in this domain should focus on the challenges in cross-domain systems, the economics of agentic security, and prioritize de- fense techniques with provable safety guarantees. 9 Limitations This survey has a few key limitations. It mainly focuses on software-based threats and does not explore physical-world or embodied agent attacks (like those involving robots or sensors) in detail. Our coverage is also limited to academic papers, so it may miss non-archival industrial research. In addition, many of the benchmarks we reviewed use synthetic or simplified test setups, which makes it hard to fully judge how well agents would perform in real-world environments. Finally, most studies emphasize accuracy and safety rather than practical aspects like cost, speed, or energy use, and our own taxonomy involves some subjective choices. References Ibrahim Adabara, Bashir Olaniyi Sadiq, Aliyu Nuhu Shuaibu, Yale Ibrahim Danjuma, and Venkateswarlu Maninti. 2025. Trustworthy agentic ai systems: A cross-layer review of architectures, threat models, and governance strategies for real-world deployment. F1000Research, 14:905. Dalal Alharthi and Ivan Roberto Kawaminami Garcia. 2025. Cloud investigation automation framework (ciaf): An ai-driven approach to cloud forensics. Preprint, arXiv:2510.00452. Dalal Alharthi and Rozhin Yasaei. 2025. Llm-powered automated cloud forensics: From log analysis to in- vestigation. In 2025 IEEE 18th International Confer- ence on Cloud Computing (CLOUD), pages 12–22. Meysam Alizadeh, Zeynab Samei, Daria Stetsenko, and Fabrizio Gilardi. 2025. Simple prompt injection at- tacks can leak personal data observed by llm agents during task execution. Preprint, arXiv:2506.01055. Amazon Web Services. 2024. Securing amazon bedrock agents: Safeguarding against indirect prompt injec- tions. AWS Technical Documentation / White Paper. Listed as "LLM AGENT (agent safety orchestrator)". Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. 2025. Agentharm: A benchmark for measuring harmful- ness of LLM agents. In The Thirteenth International Conference on Learning Representations. Cem Anil, Esin DURMUS, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Bat- son, Meg Tong, Jesse Mu, Daniel J Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lam- bert, Ansh Radhakrishnan, Carson Denison, Evan J Hubinger, and 15 others. 2024. Many-shot jailbreak- ing. In The Thirty-eighth Annual Conference on Neu- ral Information Processing Systems. Mohsen Seyedkazemi Ardebili and Andrea Bartolini. 2025. Kubeintellect: A modular llm-orchestrated agent framework for end-to-end kubernetes manage- ment. Shubhi Asthana, Bing Zhang, Ruchi Mahindru, Chad DeLuca, Anna Lisa Gentile, and Sandeep Gopisetty. 2025. Deploying privacy guardrails for llms: A com- parative analysis of real-world applications. Sanket Badhe. 2025. Scamagents: How ai agents can simulate human-level scam calls.Preprint, arXiv:2508.06457. Léo Boisvert, Abhay Puri, Gabriel Huang, Mihir Bansal, Chandra Kiran Reddy Evuru, Avinandan Bose, Maryam Fazel, Quentin Cappart, Alexandre Lacoste, Alexandre Drouin, and Krishnamurthy Dj Dvijotham. 2025. Doomarena: A framework for test- ing AI agents against evolving security threats. In Second Conference on Language Modeling. Alexander Bondarenko, Denis Volk, Dmitrii Volkov, and Jeffrey Ladish. 2025.Demonstrating spec- ification gaming in reasoning models.Preprint, arXiv:2502.13295. Islem Bouzenia, Premkumar Devanbu, and Michael Pradel. 2025.RepairAgent: An Autonomous, LLM-Based Agent for Program Repair . In 2025 IEEE/ACM 47th International Conference on Soft- ware Engineering (ICSE), pages 2188–2200, Los Alamitos, CA, USA. IEEE Computer Society. Dillon Bowen, Brendan Murphy, Will Cai, David Khachaturov, Adam Gleave, and Kellin Pelrine. 2025. Scaling trends for data poisoning in llms. Proceed- ings of the AAAI Conference on Artificial Intelligence, 39(26):27206–27214. Jizhou Chen and Samuel Lee Cong. 2025. Agent- guard:Repurposing agentic orchestrator for safety evaluation of tool orchestration. Preprint, arXiv:2502.09809. Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. 2024. Agentpoison: Red-teaming LLM agents via poisoning memory or knowledge bases. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Zheng Chen and Buhui Yao. 2024. Pseudo-conversation injection for llm goal hijacking.Preprint, arXiv:2410.23678. Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, and Yizheng Chen. 2025. Why are web ai agents more vulnerable than standalone llms? a security analysis. In ICLR 2025 Workshop on Build- ing Trust in Language Models and Applications. Maxwell Crouse, Ibrahim Abdelaziz, Ramon Astudillo, Kinjal Basu, Soham Dan, Sadhana Kumaravel, Achille Fokoue, Pavan Kapanipathi, Salim Roukos, and Luis Lastras. 2024. Formally specifying the high-level behavior of llm-based agents. Preprint, arXiv:2310.08535. 10 Isaac David and Arthur Gervais. 2025. Multi-agent penetration testing ai for the web (mapta). Christian Schroeder de Witt. 2025. Open challenges in multi-agent security: Towards secure systems of interacting ai agents. Preprint, arXiv:2505.02077. Lauren Deason, Adam Bali, Ciprian Bejean, Diana Bolocan, James Crnkovich, Ioana Croitoru, Krishna Durai, Chase Midler, Calin Miron, David Molnar, Brad Moon, Bruno Ostarcevic, Alberto Peltea, Matt Rosenberg, Catalin Sandu, Arthur Saputkin, Sagar Shah, Daniel Stan, Ernest Szocs, and 4 others. 2025. Cybersoceval: Benchmarking llms capabilities for malware analysis and threat intelligence reasoning. Preprint, arXiv:2509.20166. Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. 2025. Defeating prompt injections by design. Preprint, arXiv:2503.18813. Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. 2024a. PentestGPT: Evaluating and harnessing large language models for automated penetration testing. In 33rd USENIX Security Symposium (USENIX Security 24), pages 847–864, Philadelphia, PA. USENIX Association. Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep- learning libraries via large language models. In Pro- ceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, page 423–435, New York, NY, USA. Associa- tion for Computing Machinery. Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Ling- ming Zhang. 2024b. Large language models are edge-case generators: Crafting unusual programs for fuzzing deep learning libraries. In Proceedings of the IEEE/ACM 46th International Conference on Soft- ware Engineering, ICSE ’24, New York, NY, USA. Association for Computing Machinery. Zehang Deng, Yongjian Guo, Changzhou Han, Wan- lun Ma, Junwu Xiong, Sheng Wen, and Yang Xiang. 2024c. Ai agents under threat: A survey of key security challenges and future pathways. Preprint, arXiv:2406.02630. Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, Minhaj Ahmad, Sanjay Chawla, and Issa Khalil. 2025. aixamine: Simplified llm safety and security. Preprint, arXiv:2504.14985. Alaeddine Diaf, Abdelaziz Amara Korba, Nour Elislem Karabadji, and Yacine Ghamri-Doudane. 2025. Bart- predict: Empowering iot security with llm-driven cyber threat prediction. Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jil- iang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. 2025a. Memory injection attacks on LLM agents via query-only interaction. In The Thirty-ninth An- nual Conference on Neural Information Processing Systems. Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jil- iang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. 2025b. A practical memory injection attack against llm agents. Preprint, arXiv:2503.03704. Mohamad Fakih, Rahul Dharmaji, Halima Bouzidi, Gus- tavo Quiros Araya, Oluwatosin Ogundare, and Mo- hammad Abdullah Al Faruque. 2025. Llm4cve: En- abling iterative automated vulnerability repair with large language models. Preprint, arXiv:2501.03446. Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. 2024. Llm agents can autonomously exploit one-day vulnerabilities. Preprint, arXiv:2404.08144. Neil Fendley, Edward W. Staley, Joshua Carney, William Redman, Marie Chau, and Nathan Drenkow. 2025.A systematic review of poisoning at- tacks against large language models.Preprint, arXiv:2506.06518. Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, and Yanghua Xiao. 2025a. Reward shap- ing to mitigate reward hacking in rlhf. Preprint, arXiv:2502.18770. Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B Cohen, David Krueger, and Fazl Barez. 2025b. Poi- sonbench: Assessing large language model vulnera- bility to poisoned preference data. In Forty-second International Conference on Machine Learning. Yuchuan Fu, Xiaohan Yuan, and Dongxia Wang. 2025c. Ras-eval: A comprehensive benchmark for security evaluation of llm agents in real-world environments. Preprint, arXiv:2506.15253. Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, Zied Ben Houidi, and Dario Rossi. 2025. Cybersleuth: Autonomous blue- team llm agent for web attack forensics. Preprint, arXiv:2508.20643. Yuyou Gan, Yong Yang, Zhe Ma, Ping He, Rui Zeng, Yiming Wang, Qingming Li, Chunyi Zhou, Songze Li, Ting Wang, Yunjun Gao, Yingcai Wu, and Shoul- ing Ji. 2024. Navigating the risks: A survey of secu- rity, privacy, and ethics threats in llm-based agents. Preprint, arXiv:2411.09523. Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yun- ing Mao. 2023.Mart: Improving llm safety with multi-round automatic red-teaming. Preprint, arXiv:2311.07689. 11 Luca Gioacchini, Marco Mellia, Idilio Drago, Alexan- der Delsanto, Giuseppe Siracusano, and Roberto Bi- fulco. 2024. Autopenbench: Benchmarking gen- erative agents for penetration testing.Preprint, arXiv:2410.03225. Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Secu- rity, AISec ’23, page 79–90, New York, NY, USA. Association for Computing Machinery. Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, and Xiangyu Zhang. 2025a. Repoaudit: An au- tonomous LLM-agent for repository-level code au- diting. In Forty-second International Conference on Machine Learning. Qiming Guo, Jinwen Tang, and Xingran Huang. 2025b. Attacking llms and ai agents: Advertisement embed- ding attacks against large language models. Preprint, arXiv:2508.17674. Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, and Zhaozhuo Xu. 2025. Llm multi-agent sys- tems: Challenges and open problems. Preprint, arXiv:2402.03578. Andreas Happe and Jürgen Cito. 2025a. Can llms hack enterprise networks? autonomous assumed breach penetration-testing active directory networks. ACM Trans. Softw. Eng. Methodol. Just Accepted. Andreas Happe and Jürgen Cito. 2025b. On the surpris- ing efficacy of llms for penetration-testing. Preprint, arXiv:2507.00829. Pengfei He, Yuping Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. 2025a. Red-teaming LLM multi-agent systems via communication attacks. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6726–6747, Vienna, Austria. Associa- tion for Computational Linguistics. Xu He, Di Wu, Yan Zhai, and Kun Sun. 2025b. Sen- tinelagent: Graph-based anomaly detection in multi- agent systems. Preprint, arXiv:2505.24201. Yifeng He, Ethan Wang, Yuyang Rong, Zifei Cheng, and Hao Chen. 2024. Security of ai agents. Preprint, arXiv:2406.08689. Julius Henke. 2025. Autopentest: Enhancing vulner- ability management with autonomous llm agents. Preprint, arXiv:2505.10321. Jinwei HU, Yi DONG, Zhengtao DING, and Xiaowei HUANG. 2025. Enhancing robustness of llm-driven multi-agent systems through randomized smoothing. Chinese Journal of Aeronautics, page 103779. Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xi- angxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, and 10 others. 2025. Os agents: A survey on mllm-based agents for computer, phone and browser use. Isamu Isozaki, Manil Shrestha, Rick Console, and Ed- ward Kim. 2024. Towards automated penetration testing: Introducing llm benchmark, analysis, and improvements. Preprint, arXiv:2410.17141. Feiran Jia, Tong Wu, Xin Qin, and Anna Squicciarini. 2025. The task shield: Enforcing task alignment to defend against indirect prompt injection in LLM agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 29680–29697, Vienna, Austria. Association for Computational Linguistics. Yongrae Jo and Chanik Park. 2025. Byzantine-robust decentralized coordination of llm agents. Preprint, arXiv:2507.14928. Mintong Kang and Bo Li. 2024.r 2 -guard: Ro- bust reasoning enabled llm guardrail via knowledge- enhanced logical reasoning. Jan Keller and Jan Nowakowski. 2024. Ai-powered patching: the future of automated vulnerability fixes. Technical report. Jun Kevin and Pujianto Yugopuspito. 2025. Smartllm: Smart contract auditing using custom generative ai. Juhee Kim, Woohyuk Choi, and Byoungyoung Lee. 2025. Prompt flow integrity to prevent privilege es- calation in llm agents. Ronny Ko, Jiseong Jeong, Shuyuan Zheng, Chuan Xiao, Tae-Wan Kim, Makoto Onizuka, and Won-Yong Shin. 2025. Seven security challenges that must be solved in cross-domain multi-agent llm systems. Preprint, arXiv:2505.23847. Dezhang Kong, Shi Lin, Zhenhua Xu, Zhebo Wang, Minghao Li, Yufeng Li, Yilun Zhang, Hujin Peng, Zeyang Sha, Yuyuan Li, Changting Lin, Xun Wang, Xuan Liu, Ningyu Zhang, Chaochao Chen, Muham- mad Khurram Khan, and Meng Han. 2025a. A survey of llm-driven ai agent communication: Protocols, se- curity risks, and defense countermeasures. Preprint, arXiv:2506.19676. He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, and Bingzhen Wu. 2025b. Vulnbot: Autonomous penetration testing for a multi-agent collaborative framework. Preprint, arXiv:2501.13411. Panagiotis Kouvaros, Alessio Lomuscio, Edoardo Pirovano, and Hashan Punchihewa. 2019. Formal verification of open multi-agent systems. In Pro- ceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, A- MAS ’19, page 179–187, Richland, SC. International 12 Foundation for Autonomous Agents and Multiagent Systems. Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Elaine T Chang, Vaughn Robinson, Shuyan Zhou, Matt Fredrikson, Sean M. Hendryx, Summer Yue, and Zifan Wang. 2025. Aligned LLMs are not aligned browser agents. In The Thirteenth Interna- tional Conference on Learning Representations. Christine P. Lee, David Porfirio, Xinyu Jessica Wang, Kevin Chenkai Zhao, and Bilge Mutlu. 2025a. Veri- plan: Integrating formal verification and llms into end-user planning. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Sys- tems, CHI ’25, New York, NY, USA. Association for Computing Machinery. Donghyun Lee and Mo Tiwari. 2024. Prompt infec- tion: Llm-to-llm prompt injection within multi-agent systems. Preprint, arXiv:2410.07283. Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. 2025b. Sec-bench: Automated benchmarking of llm agents on real-world software security tasks. Preprint, arXiv:2506.11791. Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, and Segev Shlomov. 2025.ST- WebAgentBench:A benchmark for evaluating safety & trustworthiness in web agents. In ArXiv. ArXiv:2410.06703. Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, and Micah Goldblum. 2025a. Com- mercial llm agents are already vulnerable to simple yet dangerous attacks. Preprint, arXiv:2502.08586. Evan Li, Tushin Mallick, Evan Rose, William Robert- son, Alina Oprea, and Cristina Nita-Rotaru. 2025b. Ace: A security architecture for llm-integrated app systems. Preprint, arXiv:2504.20984. Simin Li, Jun Guo, Jingqiao Xiu, Ruixiao Xu, Xin Yu, Jiakai Wang, Aishan Liu, Yaodong Yang, and Xiang- long Liu. 2024. Byzantine robust cooperative multi- agent reinforcement learning as a bayesian game. In The Twelfth International Conference on Learning Representations. Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qian- ben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, Shuying Fan, Xiaowan Li, Tiannan Wang, Pai Liu, King Zhu, and 11 others. 2025c. Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic rl. Preprint, arXiv:2508.13167. Wenhao Li, Selvakumar Manickam, Yung wey Chong, and Shankar Karuppayah. 2025d. Phishdebate: An llm-based multi-agent framework for phishing web- site detection. Preprint, arXiv:2506.15656. Ziyang Li, Saikat Dutta, and Mayur Naik. 2025e. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. In The Thirteenth International Con- ference on Learning Representations. Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, and Songlin Hu. 2025a. Paper summary attack: Jailbreaking llms through llm safety papers. Preprint, arXiv:2507.13474. Xihuan Lin, Jie Zhang, Gelei Deng, Tianzhe Liu, Xiao- long Liu, Changcai Yang, Tianwei Zhang, Qing Guo, and Riqing Chen. 2025b. Ircopilot: Automated inci- dent response with large language models. Preprint, arXiv:2505.20945. Fengyu Liu, Yuan Zhang, Jiaqi Luo, Jiarun Dai, Tian Chen, Letian Yuan, Zhengmin Yu, Youkun Shi, Ke Li, Chengyuan Zhou, Hao Chen, and Min Yang. 2025a. Make agent defeat agent: automatic detection of taint- style vulnerabilities in llm-based agents. In Proceed- ings of the 34th USENIX Conference on Security Symposium, SEC ’25, USA. USENIX Association. Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024a. AutoDAN: Generating stealthy jail- break prompts on aligned large language models. In The Twelfth International Conference on Learning Representations. Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zi- hao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2024b. Prompt injection attack against llm-integrated appli- cations. Preprint, arXiv:2306.05499. Yinqiu Liu, Ruichen Zhang, Haoxiang Luo, Yijing Lin, Geng Sun, Dusit Niyato, Hongyang Du, Zehui Xiong, Yonggang Wen, Abbas Jamalipour, Dong In Kim, and Ping Zhang. 2025b. Secure multi-llm agentic ai and agentification for edge general intelligence by zero- trust: A survey. Preprint, arXiv:2508.19870. Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024c.Formalizing and benchmarking prompt injection attacks and defenses. In Proceedings of the 33rd USENIX Conference on Security Symposium, SEC ’24, USA. USENIX Asso- ciation. Zefang Liu. 2025. Autobnb: Multi-agent incident re- sponse with large language models. In 2025 13th International Symposium on Digital Forensics and Security (ISDFS), pages 1–6. Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, and Chaowei Xiao. 2025. Agrail: A lifelong agent guardrail with effective and adaptive safety detection. Preprint, arXiv:2502.11448. Phung Duc Luong, Le Tran Gia Bao, Nguyen Vu Khai Tam, Dong Huu Nguyen Khoa, Nguyen Huu Quyen, Van-Hau Pham, and Phan The Duy. 2025. xoffense: An ai-driven autonomous penetration testing frame- work with offensive knowledge-enhanced llms and multi agent systems. Preprint, arXiv:2509.13021. 13 Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, and Angelo Furfaro. 2025. The dark side of llms: Agent-based attacks for complete computer takeover. Preprint, arXiv:2507.06850. Wei Ma, Daoyuan Wu, Yuqiang Sun, Tianwen Wang, Shangqing Liu, Jian Zhang, Yue Xue, and Yang Liu. 2024. Combining fine-tuning and llm-based agents for intuitive smart contract auditing with justifica- tions. Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yun- hao Chen, Yunhao Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Yiming Li, Zuxuan Wu, Xipeng Qiu, and 29 oth- ers. 2025. Safety at scale: A comprehensive survey of large model and agent safety. Foundations and Trends® in Privacy and Security, 8(3-4):254–469. Kai Mei, Xi Zhu, Wujiang Xu, Wenyue Hua, Mingyu Jin, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2025. Aios: Llm agent operating system. In Conference on Language Mod- eling. Ruijie Meng, Martin Mirchev, Marcel Böhme, and Ab- hik Roychoudhury. 2024. Large language model guided protocol fuzzing. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS). Yuqiao Meng, Luoxi Tang, Feiyang Yu, Jinyuan Jia, Guanhua Yan, Ping Yang, and Zhaohan Xi. 2025a. Uncovering vulnerabilities of llm-assisted cyber threat intelligence. Preprint, arXiv:2509.23573. Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, and Zhaohan Xi. 2025b. Bench- marking llm-assisted blue teaming via standardized threat hunting. Preprint, arXiv:2509.23571. Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. 2024. Inform: Mitigat- ing reward hacking in rlhf via information-theoretic reward modeling. Ivan Milev, Mislav Balunovi ́ c, Maximilian Baader, and Martin Vechev. 2025. Toolfuzz – automated agent tool testing. Preprint, arXiv:2503.04479. Ramasankar Molleti, Vinod Goje, Puneet Luthra, and Prathap Raghavan. 2024. Automated threat detection and response using llm agents. World Journal of Advanced Research and Reviews, 24:079–090. Mykyta Mudryi, Markiyan Chaklosh, and Grzegorz Wójcik. 2025. The hidden dangers of browsing ai agents. Kunal Mukherjee and Murat Kantarcioglu. 2025. Llm- driven provenance forensics for threat investigation and detection. Preprint, arXiv:2508.21323. Lajos Muzsai, David Imolai, and András Lukács. 2024. Hacksynth: Llm agent and evaluation frame- work for autonomous penetration testing. Preprint, arXiv:2412.01778. Vineeth Sai Narajala and Om Narayan. 2025. Securing agentic ai: A comprehensive threat model and miti- gation framework for generative ai agents. Preprint, arXiv:2504.19956. Subash Neupane, Shaswata Mitra, Sudip Mittal, and Shahram Rahimi. 2025. Towards a hipaa compliant agentic ai system in healthcare. Tomas Nieponice, Veronica Valeros, and Sebastian Gar- cia. 2025. Aracne: An llm-based autonomous shell pentesting agent. Preprint, arXiv:2502.18528. Alexander Pan, Kush Bhatia, and Jacob Steinhardt. 2021. The effects of reward misspecification: Map- ping and mitigating misaligned models. In Deep RL Workshop NeurIPS 2021. Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Why do multiagent systems fail? In ICLR 2025 Workshop on Building Trust in Language Models and Applications. Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. Preprint, arXiv:2202.03286. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. Preprint, arXiv:2211.09527. Viet Pham and Thai Le. 2025. Cain: Hijacking llm- humans conversations via malicious system prompts. Preprint, arXiv:2505.16888. Rafiqul Rabin, Jesse Hostetler, Sean McGregor, Brett Weir, and Nick Judd. 2025.Sandboxeval: To- wards securing test environment for untrusted code. Preprint, arXiv:2504.00018. Melissa Kazemi Rad, Huy Nghiem, Sahil Wadhwa, Andy Luo, and Mohammad Shahed Sorower. 2025. Refining input guardrails: Enhancing LLM-as-a- judge efficiency through chain-of-thought fine-tuning and alignment. In AAAI 2025 Workshop on Prevent- ing and Detecting LLM Misinformation (PDLM). Shaina Raza, Ranjan Sapkota, Manoj Karkee, and Chris- tos Emmanouilidis. 2025. Trism for agentic ai: A review of trust, risk, and security management in llm-based agentic multi-agent systems. Preprint, arXiv:2506.04133. Alexander Robey, Zachary Ravichandran, Vijay Ku- mar, Hamed Hassani, and George J. Pappas. 2024. Jailbreaking llm-controlled robots. Preprint, arXiv:2410.13691. 14 Ron F. Del Rosario, Klaudia Krawiecka, and Chris- tian Schroeder de Witt. 2025. Architecting resilient llm agents: A guide to secure plan-then-execute im- plementations. Preprint, arXiv:2509.08646. Bikash Saha and Sandeep Kumar Shukla. 2025. Mal- gen: A generative agent framework for model- ing malicious software in cybersecurity. Preprint, arXiv:2506.07586. Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Kr- ishna Gouda, Zijian Wang, and Varun Kumar. 2025. Breaking the code:Security assessment of ai code agents through systematic jailbreaking attacks. Preprint, arXiv:2510.01359. Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems. Yuval Schwartz, Lavi Ben-Shimol, Dudu Mimran, Yu- val Elovici, and Asaf Shabtai. 2025. Llmcloudhunter: Harnessing llms for automated extraction of detec- tion rules from cloud-based cti. In Proceedings of the ACM on Web Conference 2025, W ’25, page 1922–1941, New York, NY, USA. Association for Computing Machinery. Zedian Shao, Hongbin Liu, Jaden Mu, and Neil Zhen- qiang Gong. 2024. Enhancing prompt injection at- tacks to llms via poisoning alignment. Mohammed A. Shehab. 2025.Agentic-ai health- care: Multilingual, privacy-first framework with mcp agents. arXiv preprint, arXiv:2510.02325. Preprint, submitted 25 Sep 2025, cs.CR / cs.AI. Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. 2025. Pentestagent: Incorporating llm agents to automated penetration testing. In Proceed- ings of the 20th ACM Asia Conference on Computer and Communications Security, ASIA CCS ’25, page 375–391, New York, NY, USA. Association for Com- puting Machinery. Tianneng Shi, Jingxuan He, Zhun Wang, Linyu Wu, Hongwei Li, Wenbo Guo, and Dawn Song. 2025. Progent: Programmable privilege control for llm agents. Noah Shinn, Federico Cassano, Edward Berman, Ash- win Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal rein- forcement learning. Preprint, arXiv:2303.11366. Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, and Vyas Sekar. 2025. On the fea- sibility of using llms to autonomously execute multi- host network attacks. Preprint, arXiv:2501.16466. Ronal Singh, Shahroz Tariq, Fatemeh Jalalvand, Mo- han Baruwal Chhetri, Surya Nepal, Cecile Paris, and Martin Lochner. 2025. Llms in the soc: An empirical study of human-ai collaboration in security opera- tions centres. Preprint, arXiv:2508.18947. Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krashenin- nikov, and David Krueger. 2022. Defining and char- acterizing reward hacking. In Proceedings of the 36th International Conference on Neural Informa- tion Processing Systems, NIPS ’22, Red Hook, NY, USA. Curran Associates Inc. Shuang Song, Yifei Zhang, and Neng Gao. 2025. Con- front insider threat: Precise anomaly detection in behavior logs based on LLM fine-tuning. In Proceed- ings of the 31st International Conference on Compu- tational Linguistics, pages 8589–8601, Abu Dhabi, UAE. Association for Computational Linguistics. Izaiah Sun, Daniel Tan, and Andy Deng. 2025. Lisa technical report: An agentic framework for smart contract auditing. Minxue Tang, Anna Dai, Louis DiValentin, Aolin Ding, Amin Hass, Neil Zhenqiang Gong, Yiran Chen, and Hai "Helen" Li. 2024.ModelGuard: Information-Theoretic defense against model extrac- tion attacks. In 33rd USENIX Security Symposium (USENIX Security 24), pages 5305–5322, Philadel- phia, PA. USENIX Association. Amine Tellache, Abdelaziz Amara Korba, Amdjed Mokhtari, Horea Moldovan, and Yacine Ghamri- Doudane. 2025. Advancing autonomous incident response: Leveraging llms and cyber threat intelli- gence. Preprint, arXiv:2508.10677. Yifang Tian, Yaming Liu, Zichun Chong, Zihang Huang, and Hans-Arno Jacobsen. 2025.Gala: Can graph-augmented large language model agen- tic workflows elevate root cause analysis? Preprint, arXiv:2508.12472. Dheer Toprani and Vijay K. Madisetti. 2025. Llm agen- tic workflow for automated vulnerability detection and remediation in infrastructure-as-code. IEEE Ac- cess, 13:69175–69181. Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejan- dra Zambrano, Arkil Patel, Esin DURMUS, Span- dana Gella, Karolina Stanczak, and Siva Reddy. 2025. Safearena: Evaluating the safety of autonomous web agents. In Forty-second International Conference on Machine Learning. Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Bren- dan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. 2025. D-cipher: Dy- namic collaborative intelligent multi-agent system with planner and heterogeneous executors for offen- sive security. Preprint, arXiv:2502.10931. 15 Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. 2025. From cve entries to verifiable exploits: An automated multi-agent framework for reproducing cves. Preprint, arXiv:2509.01835. Haoyu Wang, Christopher M. Poskitt, and Jun Sun. 2025a. Agentspec: Customizable runtime enforce- ment for safe and reliable llm agents. Preprint, arXiv:2503.18666. Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. 50(4):911–936. Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, and 84 oth- ers. 2025b. A comprehensive survey in llm(-agent) full stack safety: Data, training and deployment. Preprint, arXiv:2504.15585. Peiran Wang, Xiaogeng Liu, and Chaowei Xiao. 2025c. CVE-bench: Benchmarking LLM-based software en- gineering agent’s ability to repair real-world CVE vulnerabilities. In Proceedings of the 2025 Confer- ence of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4207–4224, Albuquerque, New Mexico. Asso- ciation for Computational Linguistics. Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan. 2025d. Privacy in action: Towards realis- tic privacy mitigation and evaluation for llm-powered agents. arXiv preprint, arXiv:2509.17488. Preprint, submitted 22 Sep 2025, cs.CR / cs.AI. Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, and Peng Liu. 2025e. To protect the llm agent against the prompt injection attack with polymorphic prompt. Preprint, arXiv:2506.05739. Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025f. Agentvigil: Generic black- box red-teaming for indirect prompt injection against llm agents. Preprint, arXiv:2505.05849. Ziyue Wang and Liyi Zhou. 2025. Agentic discov- ery and validation of android app vulnerabilities. Preprint, arXiv:2508.21579. William Watson, Nicole Cho, Nishan Srishankar, Zhen Zeng, Lucas Cecchi, Daniel Scott, Suchetha Sid- dagangappa, Rachneet Kaur, Tucker Balch, and Manuela Veloso. 2024. Law: Legal agentic work- flows for custody and fund services contracts. arXiv preprint, arXiv:2412.11063. Preprint, submitted 15 Dec 2024, cs.AI. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: how does llm safety training fail? In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc. Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo, Ziwei Zhu, and Chris Jordan. 2025. Cortex: Collaborative llm agents for high-stakes alert triage. Preprint, arXiv:2510.00311. Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, and Philip S. Yu. 2025a. Psg-agent: Personality- aware safety guardrail for llm-based agents. Preprint, arXiv:2509.23614. Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, and Anand Mudgerikar. 2025b.Excytin-bench: Evaluating llm agents on cyber threat investigation. Preprint, arXiv:2507.14201. Shihao Xia, Shuai Shao, Mengting He, Tingting Yu, Linhai Song, and Yiying Zhang. 2024. Auditgpt: Auditing smart contracts with chatgpt. Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, and Bo Li. 2025. Guardagent: Safeguard llm agents by a guard agent via knowledge-enabled reasoning. Preprint, arXiv:2406.09187. Wenpeng Xing, Minghao Li, Mohan Li, and Meng Han. 2025. Towards robust and secure embodied ai: A survey on vulnerabilities and attacks. Hanxiang Xu, Shenao Wang, Ningke Li, Kailong Wang, Yanjie Zhao, Kai Chen, Ting Yu, Yang Liu, and Haoyu Wang. 2025a. Large language models for cyber security: A systematic literature review. ACM Trans. Softw. Eng. Methodol. Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, and Daniel Khashabi. 2025b. TurkingBench: A challenge benchmark for web agents. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Asso- ciation for Computational Linguistics: Human Lan- guage Technologies (Volume 1: Long Papers), pages 3694–3710, Albuquerque, New Mexico. Association for Computational Linguistics. Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, and Stjepan Picek. 2024. A comprehensive study of jailbreak attack versus defense for large language models. Preprint, arXiv:2402.13457. Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang. 2025a. Knighter: Transforming static analysis with llm-synthesized checkers. In 16 Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP ’25, New York, NY, USA. Association for Computing Machinery. Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. 2024. Watch out for your agents! investigating backdoor threats to LLM-based agents. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Zhenning Yang, Archit Bhatnagar, Yiming Qiu, Tongyuan Miao, Patrick Tser Jern Kon, Yunming Xiao, Yibo Huang, Martin Casado, and Ang Chen. 2025b. Cloud infrastructure management in the age of ai agents. ACM SIGOPS Operating Systems Re- view, 59(1). Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024.τ-bench: A benchmark for tool- agent-user interaction in real-world domains. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629. Ziyang Ye, Triet Huynh Minh Le, and M. Ali Babar. 2025. Llmsecconfig: An llm-based approach for fixing software container misconfigurations. Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. 2025. Benchmarking and defending against indirect prompt injection attacks on large language models. In Pro- ceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD ’25, page 1809–1820, New York, NY, USA. Associa- tion for Computing Machinery. Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2024. LLM-Fuzzer: Scaling assessment of large language model jailbreaks. In 33rd USENIX Security Symposium (USENIX Security 24), pages 4657–4674, Philadelphia, PA. USENIX Association. Miao Yu, Fanci Meng, Xinyun Zhou, Shilong Wang, Junyuan Mao, Linsey Pan, Tianlong Chen, Kun Wang, Xinfeng Li, Yongfeng Zhang, Bo An, and Qingsong Wen. 2025. A survey on trustworthy llm agents: Threats and countermeasures. In Proceed- ings of the 31st ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining V.2, KDD ’25, page 6216–6226, New York, NY, USA. Association for Computing Machinery. Qiusi Zhan, Richard Fang, Henil Shalin Panchal, and Daniel Kang. 2025. Adaptive attacks break defenses against indirect prompt injection attacks on LLM agents. In Findings of the Association for Computa- tional Linguistics: NAACL 2025, pages 7101–7117, Albuquerque, New Mexico. Association for Compu- tational Linguistics. Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471– 10506, Bangkok, Thailand. Association for Compu- tational Linguistics. Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. 2025a. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. In ICLR. Yanzhe Zhang and Diyi Yang. 2025. Searching for privacy risks in llm agents via simulation. Preprint, arXiv:2508.10880. Yuyang Zhang, Kangjie Chen, Jiaxin Gao, Ronghao Cui, Run Wang, Lina Wang, and Tianwei Zhang. 2025b. Towards action hijacking of large language model-based agent. Preprint, arXiv:2412.10807. Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Gra- ham Neubig. 2024. Webarena: A realistic web envi- ronment for building autonomous agents. Preprint, arXiv:2307.13854. Xingfu Zhou and Pengfei Wang. 2025. Reasoning-style poisoning of llm agents via stealthy style transfer: Process-level attacks and runtime monitoring in rsv space. Preprint, arXiv:2512.14448. Jie Zhu, Chihao Shen, Ziyang Li, Jiahao Yu, Yizheng Chen, and Kexin Pei. 2025a.Locus: Agentic predicate synthesis for directed fuzzing. Preprint, arXiv:2508.21302. Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. 2025b. Cve-bench: A benchmark for ai agents’ ability to exploit real-world web application vulnerabilities. Preprint, arXiv:2503.17332. Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, and Daniel Kang. 2025c. CVE-bench: A benchmark for AI agents’ ability to exploit real-world web applica- tion vulnerabilities. In Forty-second International Conference on Machine Learning. Yuxuan Zhu, Antony Kellermann, Akul Gupta, Philip Li, Richard Fang, Rohan Bindu, and Daniel Kang. 2025d. Teams of llm agents can exploit zero-day vulnerabilities. Preprint, arXiv:2406.01637. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Univer- sal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043. 17 A Paper Collection Methodology To ensure a comprehensive and reproducible review of the agentic security landscape, we employed a multi-stage paper collection methodology com- bining automated searches, manual curation, and snowballing techniques. Automated Database Search We conducted an automated search across ma- jor academic repositories—including ACL An- thology, IEEE Xplore, ACM Digital Library, and arXiv—covering publications from January 2023 to September 2025. Using a Boolean query, we combined two groups keywords related to agentic systems and security concepts. Search Query Structure(Group 1 Keywords) AND (Group 2 Keywords) • Group 1 (Agent-related):("LLM agent" OR "AI agent" OR "agentic AI" OR "autonomous agent" OR "multi-agent system") • Group 2 (Security-related):("security" OR "threat" OR "vulnerability" OR "attack" OR "defense" OR "red team" OR "blue team" OR "penetration testing" OR "fuzzing" OR "jailbreak" OR "prompt injection" OR "poisoning" OR "hardening" OR "adversarial") Manual Curation To identify relevant work that our keyword search may had missed, we manually scanned the proceed- ings of top-tier security (e.g., USENIX, ACM CCS, Oakland) and AI (e.g., ACL, EMNLP, NeurIPS, ICLR, ICML) conferences from the same period. Inclusion and Exclusion Criteria Inclusion Criteria The paper’s primary subject must be LLM-based agents. It must have a sub- stantial focus on a technical security aspect, align- ing with one of our three pillars. The work must be a peer-reviewed publication or a highly-cited preprint. Exclusion Criteria We excluded papers on gen- eral LLM safety that do not address agentic sys- tems, studies on non-LLM agents, works centered on high-level ethics or policy without technical de- tails, and non-technical articles. Snowballing Finally, we performed backward and forward snow- balling on the curated set of included papers. We reviewed the reference lists of these key papers to identify foundational or related works we might have missed. B Related Works and Gap Analysis Although recent works have explored various vul- nerabilities in LLM agents, their scope is lim- ited to specific aspects of agentic security or class of threats. Deng et al. (2024c) provide a good overview of prompt injection and data poisoning threats. Li et al. (2025a) demonstrate manipula- tions of commercial agents. Ma et al. (2025) review safety across several model families, and present many threats, defenses, and benchmarks. We take these findings a step further by treating agentic se- curity as a layered system, covering not only threats but also defenses and downstream security applica- tions. We also situate the threats in a broader tax- onomy that explains where (pre-execution, during execution) and how (injection, manipulation, hack- ing) such failures occur, and explain how defenses are designed. Wang et al. (2025b) describe safety risks during model development pipeline, whereas we focus on what happens after deployment—how agents behave, interact, and defend themselves in the real world. On the governance front, Yu et al. (2025) explore trustworthiness, including safety, privacy, fairness, and robustness, while Raza et al. (2025) discuss TRiSM (Trust, Risk, and Security Management) compliance. Our survey complements these high- level perspectives by examining the technical ar- chitectures and behavioral mechanisms required to enforce such security principles during operation. Finally, several studies target specific technical or theoretical domains. He et al. (2024) and Kong et al. (2025a) analyze concrete defenses (e.g., sand- boxing) and communication protocols, respectively, while de Witt (2025) establish a theoretical basis for multi-agent risks like collusion. While founda- tional, these works remain isolated within specific subsystems. We build on them by connecting these isolated defenses into a broader ecosystem, link- ing communication and system-level protections to higher-level coordination and control strategies. 18 Table 2: Comparison of Offensive and Defensive LLM Based Cybersecurity Agents FeatureOffensive Agents (Red Teaming)Defensive Agents (Blue Teaming and SOC) Memory DesignsContext retention focused. Due to long attack chains, offensive agents prioritize architectures that minimize context loss, such as the Reason- ing, Generation, Parsing design in PentestGPT. Task graph coordination is commonly used to pre- serve state across reconnaissance and exploitation phases. Retrieval and structure focused. Defensive agents rely heavily on retrieval augmented generation and ontology driven memory to manage large scale telemetry. Examples include CIAF’s ontology based cloud log structuring and ProvSEEK’s use of RAG for evidence refinement and verification. Tool Governance and Autonomy High autonomy.Current trends favor fully autonomous execution for penetration testing, fuzzing, and exploit generation, as seen in sys- tems such as AutoPentest and PentestGPT. Agents operate independently within sandboxed environ- ments. Human in the loop and assistive. In operational SOCs, LLMs primarily function as analyst copi- lots. Systems such as IRCopilot and CORTEX emphasize collaborative workflows, alert triage, and approval gated decision making. FailureModes Analysis Planning and context failures. Common issues in- clude breakdowns in multi step reasoning, loss of long horizon context, and susceptibility to prompt injection and jailbreaks. Agents can be coerced into unsafe autonomous malware execution. False positives and hallucination. Major chal- lenges include alert fatigue driven by false pos- itives, which CORTEX explicitly targets. Ad- ditional failures include contradictions in CTI pipelines and hallucinated findings in auditing agents such as RepoAudit. Evaluation Gaps and Trends Evaluations largely rely on synthetic benchmarks such as HackSynth and AutoPenBench. The lit- erature shows heavy dependence on GPT based backbones, approximately 83 percent of studies, raising concerns around reproducibility and gen- eralization. Benchmarks such as CyberSOCEval reveal gaps in threat reasoning. Log analysis agents face scal- ability and robustness challenges. Use of RAG and fine tuning remains limited relative to reliance on pretrained knowledge. C Red-Teaming vs Blue Teaming (SOC Agents) analysis Table 2 provide the comparative analysis between red-team and blue-teams, including memory de- sign trends, tool governance and autonomy, failure mode analysis, and evaluation gaps. D Benchmark Inventory Table 3 provides a detailed analysis of the existing adversarial benchmarks. 19 Table 3: Adversarial Benchmarks for LLM Agents BenchmarkEnvironmentAttacks/Threat Model FindingsInsights ASB(Zhang et al., 2025a) Multi-domainagent tasks with 400 + tools; 10 scenarios; standard- ized evaluation harness. Prompt injection (pri- mary), memory attacks, data poisoning, unau- thorized tool invocation, privilege escalation; 27 attack/defense classes. Existing agents highly vulnerable; many fail even simple attack tasks; reports refusal rate and a unified resilience met- ric. Standardized, reproducible testbed spanning both offen- sive and defensive evalua- tion; clear taxonomy cen- tered on prompt-injection surfaces. RAS-Eval (Fu et al., 2025c) Real-world domains (fi- nance, healthcare); 80 scenarios / 3,802 tasks; simulation and real tool use. 11 CWE categories; broad adversarial stress across realistic work- flows. Task completion drops by∼36.8% on average (up to 85.7%) under at- tack. Maps agent failures to CWE;couplesdomain realism with measurable robustness deltas. AgentDojo (Debenedetti et al., 2024) Dynamic, stateful env.; 97 realistic multi-turn tool tasks (e.g., email, banking) with formal, deterministic checks. Prompt injection via un- trusted data/tools; secu- rity vs. utility trade-off analysis. Defenses reduce attack success but degrade task utility; SOTA LLMs struggle on realistic pipelines. Makes the security–utility trade-off explicit; judge is environment-state based (no LLM-as-judge). AgentHarm (An- driushchenko et al., 2025) Agent tasks spanning 110 harmful tasks across 11 harm categories. Jailbreaks,indi- rect injections,self- compromising actions, unsafe code execution. Significant gaps in com- pliance and contextual safety across agents. Introduces robustness, re- fusal accuracy, and ethical consistency metrics focused on harm reduction. SafeArena (Tur et al., 2025) Web agents across mul- tiple websites; 250 be- nign vs. 250 harmful tasks. Malicious requests: mis- information, illegal ac- tions, malware-related behaviors. SOTA (e.g., GPT–4o) completes 34.7% of ma- licious requests. Demonstratesrealweb- workflow risks; quantifies unsafe completions under realistic browsing. ST- WebAgentBench (Levyetal., 2025) Enterprise-likeweb tasks: 222 tasks with 646 policy instances. Policy compliance (con- sent, data boundaries); defines CuP, pCuP, and Risk Ratio. Policy-compliant suc- cess is≈38% lower than standard comple- tion. Shifts evaluation beyond raw success to trust/safety- constrained success. JAWS- BENCH (Sahaetal., 2025) Codeagentswith executable-aware judg- ing across JAWS-0/1/M (empty,single-file, multi-file). Systematic jailbreaking to elicit harmful, exe- cutable code; tests com- pliance, attack success, compile, run. Up to 75% attack suc- cess in multi-file code- bases. Execution-grounded judg- ing prevents false safety from mere textual refusals; highlights multi-file risks. SandboxEval (Rabin et al., 2025) Code-execution testbeds;51hand- crafted sandbox test cases (applied to Dyff). Dangerous behaviors: FStampering,data exfiltration,network access, etc. Naive sandbox config- urations can be com- promised by malicious code. Security must include run- time isolation posture, not only agent policy. BrowserART (Kumar et al., 2025) Browser-agentred- teaming toolkit across synthetic & real sites (100 harmful behav- iors). Jailbreaksagainst browser agents; transfer of chatbot jailbreaks to agentic setting. Backbone LLM refusal doesnottransfer: with human rewrites, GPT–4opursued 98/100,o1-preview 63/100 harmful behav- iors. Agentic, tool-using context weakens safety adherence even without exotic attacks. InjecAgent (Zhanetal., 2024) Tool-integrated agents; 1,054 test cases across 17 user tools and 62 at- tacker tools. Indirectpromptin- jections via external content, API outputs, chained tools;path- aware categorization. Well-aligned agents fre- quently execute compro- mised instructions un- der indirect injections. Providesfine-grained, propagation-path metrics; standardizesindirect- injection stress for tool- augmented agents. 20