← Back to papers

Paper deep dive

A White-Box Prompt Injection Attack on Embodied AI Agents Driven by Large Language Models

Tongcheng Geng, Yubin Qu, W. E. Wong

Year: 2026 · Venue: Journal of Systems and Software · Area: Adversarial Robustness · Type: Empirical · Embeddings: 20

Models: LLaMA, Mistral, Vicuna

Abstract

With the widespread deployment of embodied AI agents in safety-critical scenarios, LLM-based decision-making systems face unprecedented risks. Existin…

Tags

adversarial-robustness (suggested, 92%) · ai-safety (imported, 100%) · empirical (suggested, 88%)

Links

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 5:25:01 PM

Summary

The paper introduces SAPIA (Scenario-Adaptive white-box Prompt Injection Attack), a novel framework designed to exploit security vulnerabilities in LLM-driven embodied AI agents. By integrating an adaptive context prompt generation module with an enhanced Greedy Coordinate Gradient (GCG) algorithm, SAPIA generates scenario-specific adversarial suffixes across four domains: autonomous driving, robotic manipulation, drone control, and industrial control. Experimental results demonstrate that SAPIA significantly outperforms traditional attack methods and exhibits high resistance to existing defense mechanisms.
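The enhanced GCG at the core of SAPIA optimizes a discrete adversarial suffix coordinate by coordinate. Below is a minimal, gradient-free sketch of the underlying greedy coordinate search, together with the 90%/10% dual-component loss weighting stated in the paper (gradient-based candidate ranking, the adaptive optimization strategy, and the momentum mechanism are omitted; the toy loss, target, and vocabulary are purely illustrative and not from the paper):

```python
import random

def greedy_coordinate_search(loss_fn, suffix, vocab, n_steps=50, seed=0):
    """Simplified greedy coordinate search over a discrete suffix.

    GCG proper ranks candidate substitutions by the gradient of the
    target model's loss w.r.t. one-hot token indicators; this toy
    version simply evaluates every candidate token at one randomly
    chosen position per step and keeps the best substitution.
    """
    rng = random.Random(seed)
    suffix = list(suffix)
    best_loss = loss_fn(suffix)
    for _ in range(n_steps):
        pos = rng.randrange(len(suffix))   # coordinate to update this step
        best_tok = suffix[pos]
        for tok in vocab:                  # try each candidate token
            suffix[pos] = tok
            cand_loss = loss_fn(suffix)
            if cand_loss < best_loss:
                best_loss, best_tok = cand_loss, tok
        suffix[pos] = best_tok             # keep the best choice found
    return suffix, best_loss

def combined_loss(exact_match_loss, length_loss, w_exact=0.9, w_len=0.1):
    """Dual-component objective described in the paper: 90% exact-match
    loss plus 10% length-constraint loss. Only the weighting is taken
    from the paper; the component definitions are left abstract here."""
    return w_exact * exact_match_loss + w_len * length_loss

# Toy target: make the suffix spell out a fixed token sequence.
target = ["unlock", "the", "door"]
toy_loss = lambda s: sum(a != b for a, b in zip(s, target))
vocab = ["unlock", "lock", "the", "a", "door", "window"]
suffix, final_loss = greedy_coordinate_search(toy_loss, ["a", "a", "a"], vocab)
```

Because each step keeps only strictly improving substitutions, the toy search converges once every position has been visited; the real attack instead scores thousands of candidate suffixes against the victim model per iteration.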

Entities (4)

SAPIA · attack-framework · 100%
Embodied AI Agents · system · 95%
GCG · algorithm · 95%
Llama · llm · 95%

Relation Signals (3)

SAPIA targets Embodied AI Agents

confidence 95% · A white-box prompt injection attack on embodied AI agents

SAPIA tested on Llama

confidence 95% · We conduct an experimental evaluation on multiple versions of 3 mainstream open-source LLMs (LLaMA, Mistral, Vicuna)

SAPIA utilizes GCG

confidence 95% · integrating an adaptive context prompt generation module with an enhanced GCG algorithm
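The three relation signals above can be read as (subject, relation, object, confidence) tuples and assembled into a small in-memory graph. A minimal sketch in plain Python (the tuple values are taken from this page; the helper name and dict layout are illustrative):

```python
from collections import defaultdict

# (subject, relation, object, confidence) tuples as listed above
signals = [
    ("SAPIA", "TARGETS", "Embodied AI Agents", 0.95),
    ("SAPIA", "TESTED_ON", "Llama", 0.95),
    ("SAPIA", "UTILIZES", "GCG", 0.95),
]

def build_graph(tuples):
    """Group outgoing edges by subject node."""
    graph = defaultdict(list)
    for subj, rel, obj, conf in tuples:
        graph[subj].append((rel, obj, conf))
    return dict(graph)

graph = build_graph(signals)
# graph["SAPIA"] now holds all three outgoing edges
```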

Cypher Suggestions (2)

Find all models targeted by the SAPIA attack framework · confidence 90% · unvalidated

MATCH (a:AttackFramework {name: 'SAPIA'})-[:TARGETS|TESTED_ON]->(m:LLM) RETURN m.name

Identify algorithms used by the SAPIA framework · confidence 90% · unvalidated

MATCH (a:AttackFramework {name: 'SAPIA'})-[:UTILIZES]->(alg:Algorithm) RETURN alg.name
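Since both suggestions are flagged unvalidated, they are safer to run with bound parameters than with node names spliced into the Cypher string. A minimal sketch of parameterizing the first query (passing the result to the neo4j Python driver's `execute_query` is an assumption about the deployment, shown only as a comment):

```python
def targeted_models_query(framework: str = "SAPIA"):
    """Parameterized form of the first suggested query: the framework
    name is bound as $name rather than interpolated into the string."""
    query = (
        "MATCH (a:AttackFramework {name: $name})"
        "-[:TARGETS|TESTED_ON]->(m:LLM) RETURN m.name"
    )
    return query, {"name": framework}

query, params = targeted_models_query()
# e.g. with the neo4j 5.x driver:
#   records, _, _ = driver.execute_query(query, params)
```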

Full Text

19,480 characters extracted from source content.


Journal of Systems and Software, Volume 235, May 2026, 112782

A white-box prompt injection attack on embodied AI agents driven by large language models

Tongcheng Geng, Yubin Qu, W. Eric Wong

https://doi.org/10.1016/j.jss.2026.112782

Abstract

With the widespread deployment of embodied AI agents in safety-critical scenarios, LLM-based decision-making systems face unprecedented risks. Existing prompt injection attacks, designed for general conversational systems, lack semantic contextual adaptability for embodied agents and fail to address scenario-specific semantics and safety constraints. This paper proposes SAPIA (Scenario-Adaptive white-box Prompt Injection Attack), integrating an adaptive context prompt generation module with an enhanced GCG algorithm to dynamically produce scenario-targeted adversarial suffixes. We build a multi-scenario dataset of 40 dangerous instructions across four application domains (autonomous driving, robotic manipulation, drone control, and industrial control), establishing a standardized benchmark for embodied AI safety. Large-scale white-box experiments on three mainstream open-source LLMs show SAPIA substantially outperforms traditional GCG and the improved I-GCG, with notably high effectiveness on extremely high-risk instructions. Transferability analysis reveals distinctive properties in embodied settings: cross-architecture transfer is extremely limited, while high cross-version transferability exists within model series, contrasting with the cross-model transfer observed in conventional adversarial research. Ablation studies confirm that both the adaptive context module and the enhanced GCG are critical and synergistic for optimal attack performance. Robustness analyses indicate SAPIA strongly resists mainstream defenses, effectively evading input perturbation, structured self-examination, and safety prefix prompting.
This work exposes serious security vulnerabilities in current embodied AI agents and underscores the urgency of scenario-based protection mechanisms for safety-critical deployments.

Introduction

In recent years, embodied AI agents, as important bridges between artificial intelligence and the physical world, have demonstrated tremendous application potential across multiple critical domains (Duan et al., 2022). From path planning for autonomous vehicles (Chen et al., 2024) to precision operations of industrial robots (Shridhar et al., 2022), and from autonomous navigation of unmanned aerial vehicles (Kaufmann et al., 2023) to coordinated control of intelligent manufacturing systems (Vemprala et al., 2023), embodied AI agents are reshaping the way we interact with physical environments. In this wave of development, LLMs play an increasingly important role as core components of decision-making systems (Huang et al., 2022a). By combining natural language understanding with environmental perception, LLMs can comprehend complex task instructions, formulate reasonable action plans, and coordinate with other system modules (Singh et al., 2022, Liang et al., 2022). This language-model-based decision architecture not only improves system flexibility but also reduces the complexity of developing and maintaining embodied AI agents.

The widespread application of open-source LLMs in embodied AI systems has further accelerated this trend (Touvron et al., 2023, Team et al., 2024). Compared to closed-source models, open-source models offer high transparency, strong customizability, and low deployment costs, enabling more research institutions and enterprises to build their own embodied intelligent systems (Taori et al., 2023). However, this openness also brings new security challenges: attackers can more easily obtain detailed information about models and then design targeted attack strategies.
Despite the tremendous application prospects of embodied AI agents, their security has not received sufficient attention. Traditional AI security research primarily focuses on attacks in digital spaces, such as adversarial examples in image classification (Goodfellow et al., 2014) or harmful content in text generation (Gehman et al., 2020). In software engineering, related threats include code robustness attacks and backdoor attacks (Qu et al., 2024, Qu et al., 2025b). However, the decisions of embodied AI agents directly affect the physical world, and erroneous behaviors may lead to serious safety accidents, including traffic accidents, industrial equipment damage, and even casualties (Stilgoe, 2018).

Existing prompt injection attack research primarily targets general conversational systems or text generation tasks (Zou et al., 2023, Liu et al., 2023a). These attack methods typically employ fixed attack templates and lack targeted consideration of specific application scenarios. In embodied intelligence environments, different application scenarios have significantly different semantic contexts, operational constraints, and security requirements. For example, autonomous driving systems must handle contexts such as traffic rules and road conditions, while industrial control systems involve professional knowledge such as equipment status and process parameters. The "one-size-fits-all" strategy of traditional attack methods struggles to adapt to such diverse scenario requirements. More critically, erroneous decisions in embodied intelligent systems are irreversible and can amplify: unlike purely digital systems, behaviors in the physical world are difficult to revoke once executed and may trigger chain reactions (Amodei et al., 2016). This makes attacks against embodied AI more dangerous and their consequences more severe.
Furthermore, white-box attack scenarios hold special importance in embodied AI security research. Due to the widespread use of open-source models, attackers can often obtain complete parameter and architecture information, which facilitates the design of precise attack strategies (Carlini and Wagner, 2017). Meanwhile, embodied AI systems are typically deployed in relatively fixed environments, giving attackers ample time and opportunity to study system behavior patterns and formulate targeted attack schemes.

Security attack research against embodied AI agents faces multiple technical challenges:

(1) Complexity of semantic context: Different application scenarios possess unique semantic features and contextual constraints. Autonomous driving involves professional knowledge such as traffic regulations, road signs, and vehicle dynamics; robotic manipulation must consider physical constraints, safety distances, and grasping strategies; unmanned aerial vehicle control includes flight rules, meteorological conditions, and airspace control; industrial control involves process flows, equipment parameters, and safety standards. Designing attack methods that can adapt to these complex semantic contexts is a major challenge (Li et al., 2023).

(2) Diversity of model architectures: Different open-source LLMs adopt different architectural designs, training strategies, and optimization objectives (Touvron et al., 2023, Thakkar and Manimaran, 2023). These differences mean models may produce different response patterns for the same input, making the generalizability of attack methods a key issue. Designing attack algorithms that maintain high efficiency across multiple model architectures requires in-depth theoretical analysis and experimental validation.
(3) Controllability of attack effects: The decision-making process of embodied intelligent systems typically involves multiple steps and various constraint conditions. Attackers need to precisely control the targets and scope of attacks, ensuring high attack success rates while avoiding triggering system security mechanisms or arousing user suspicion (Wallace et al., 2019). This fine-grained control demands attack methods with high adjustability and predictability.

To address these challenges, this paper makes the following main contributions:

1. Proposing a scenario-adaptive prompt injection attack method: We design a novel three-stage attack architecture that integrates an adaptive context prompt generation module, an adversarial suffix generation module based on the enhanced GCG algorithm, and a simulator-based embodied AI agent testing module. This method dynamically adjusts attack strategies according to the semantic features and contextual constraints of different application scenarios, achieving effective modeling and exploitation of complex semantic contexts.

2. Proposing an enhanced GCG algorithm: We improve the classic Greedy Coordinate Gradient algorithm by introducing a dual-component loss function (90% exact-match loss + 10% length-constraint loss), an adaptive optimization strategy, and a momentum mechanism.

3. Constructing a multi-scenario dangerous instruction dataset: We systematically construct a dataset containing 40 specific dangerous instructions across 4 main application scenarios (autonomous driving, robotic manipulation, drone control, industrial control).

4. Experimental evaluation and analysis: We conduct an experimental evaluation on multiple versions of 3 mainstream open-source LLMs (LLaMA, Mistral, Vicuna), covering attack success rate comparison, cross-model transferability, cross-version transferability, ablation experiments, and defense strategy robustness analysis.
The experimental results reveal serious security vulnerabilities in current embodied AI systems, providing important references for related defense research.

5. Open-source code: To promote reproducibility and advance the field, we open-source the complete experimental code.

The remainder of this paper is organized as follows: Section 2 reviews the current development of LLM-based embodied AI agents and related work on LLM security; Section 3 elaborates the threat model; Section 4 introduces our multi-scenario dangerous instruction dataset; Section 5 presents the overall design of the SAPIA attack method; Section 6 describes the experimental design, including victim model selection, baseline methods, defense strategies, evaluation metrics, and environment configuration; Section 7 analyzes the experimental results for five research questions; Section 8 discusses threats to validity; finally, Section 9 summarizes the main contributions and outlines future research.

Section snippets

Embodied AI agents with integrated LLMs

Embodied AI agents, serving as important bridges connecting digital intelligence with the physical world, have gained widespread attention in both academia and industry in recent years (Duan et al., 2022). Early embodied intelligence systems primarily relied on traditional planning algorithms and reinforcement learning methods. While these approaches performed well on specific tasks, they often lacked sufficient flexibility and generalization capabilities when dealing with complex and dynamic…

Attack scenarios

The deployment environment of embodied AI agents exhibits high diversity and complexity, providing attackers with multiple potential attack vectors.
Typical attack scenarios are as follows. Modern embodied AI agents typically adopt hierarchical architectural designs, where LLMs serve as high-level decision-making modules responsible for understanding task instructions, formulating action plans, and coordinating the work of various subsystems (Zitkovich et al., 2023). In autonomous driving…

Dataset design principles

Constructing a high-quality multi-scenario harmful instruction dataset is a fundamental part of this research. The dataset must not only provide testing benchmarks for validating attack methods, but also offer standardized evaluation tools for the broader embodied AI safety research community. Based on this objective, we established the following four core design principles. Scenario representativeness principle: The dataset must cover the main application domains of…

Overall method design

This study proposes a scenario-adaptive prompt injection attack method for embodied AI agents, which aims to address the limitations of traditional injection attack methods in embodied intelligent environments. The core idea is to dynamically adjust attack strategies based on the semantic features and contextual constraints of different application scenarios, thereby achieving higher attack success rates and stronger scenario adaptability.
Our attack method can be formally…

Experimental design

To evaluate the effectiveness of the proposed Scenario-Adaptive Prompt Injection Attack (SAPIA) method, we designed a series of experiments to answer the following five core research questions:

• RQ1: Attack success rate comparison. How does our scenario-adaptive attack method compare with existing general attack methods in terms of attack success rate in embodied AI environments?

• RQ2: Cross-model transferability. What is the transfer capability of our generated adversarial suffixes across different…

RQ1: attack success rate comparison

Research motivation: Traditional prompt injection attack methods are primarily designed for general conversational systems, employing fixed attack templates and unified optimization strategies. However, embodied intelligent environments possess unique semantic contexts and specialized terminology, with different application scenarios (such as autonomous driving, robot operation, drone control, and industrial control) exhibiting significant differences in decision logic, safety constraints,…

Internal validity threats

(1) Bias in the experimental setup may affect the reliability of the results. Our attack method testing across different embodied AI scenarios may suffer from inconsistent parameter settings. For instance, in autonomous driving, robotic manipulation, and industrial control scenarios, the learning rate, iteration count, and loss function weights of the GCG algorithm may require targeted adjustments. Such adjustments may introduce experimenter bias and affect the comparability of results across…

Conclusion and future work

This paper addresses security threats faced by LLM-based decision systems in embodied AI agents and proposes a Scenario-Adaptive Prompt Injection Attack method.
Through systematic theoretical analysis and large-scale experimental validation, we draw the following main conclusions. First, traditional general prompt injection attack methods have limited effectiveness in embodied intelligence environments, with average attack success rates of only 8.3%-12.3%, indicating that the specialized…

CRediT authorship contribution statement

Tongcheng Geng: Conceptualization. Yubin Qu: Writing – review & editing, Data curation, Conceptualization. W. Eric Wong: Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding

This work was supported by the National Natural Science Foundation of China (62462019, 62172350), Guangdong Basic and Applied Basic Research Foundation (2023A1515012846), Guangxi Science and Technology Major Program (A24263010), the Key Research and Development Program of Guangxi (AB24010085), Key Laboratory of Equipment Data Security and Guarantee Technology, Ministry of Education (GDZB2024060500), 2024 Higher Education Scientific Research Planning Project (No. 24NL0419), 2025 Higher Education…

References (60)

- Y. Qu et al. An input-denoising-based defense against stealthy backdoor attacks in large language models for code. Inf. Softw. Technol. (2025)
- Y. Qu et al. A review of backdoor attacks and defenses in code large language models: implications for security measures (2025)
- M. Ahn et al. Do as I can, not as I say: Grounding language in robotic affordances. Technical report (2022)
- D. Amodei et al. Concrete problems in AI safety. Technical report (2016)
- P. Bhandari. A survey on prompting techniques in LLMs. Technical report (2024)
- B. Biggio et al. Wild patterns: ten years after the rise of adversarial machine learning. ACM SIGSAC Conference on Computer and Communications Security (2018)
- A. Brohan et al. RT-1: Robotics transformer for real-world control at scale. Technical report (2022)
- T. Brown et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. (2020)
- N. Carlini et al. Towards evaluating the robustness of neural networks. 2017 IEEE Symposium on Security and Privacy (SP) (2017)
- L. Chen et al. End-to-end autonomous driving: challenges and frontiers (2024)
- J. Duan et al. A survey of embodied AI: from simulators to research tasks. IEEE Trans. Emerg. Topics Comput. Intell. (2022)
- K. Eykholt et al. Robust physical-world attacks on deep learning visual classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- H. Fan et al. Baidu Apollo EM motion planner. Technical report (2018)
- D. Ganguli et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. Technical report (2022)
- S. Gehman et al. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. Technical report (2020)
- I.J. Goodfellow et al. Explaining and harnessing adversarial examples. Technical report (2014)
- W. Huang et al. Language models as zero-shot planners: extracting actionable knowledge for embodied agents. International Conference on Machine Learning (2022)
- W. Huang et al. Inner monologue: Embodied reasoning through planning with language models. Technical report (2022)
- X. Huang et al. DriveGPT: Scaling autoregressive behavior models for driving. Technical report (2024)
- X. Jia et al. Improved techniques for optimization-based jailbreaking on large language models. Technical report (2024)
- A.Q. Jiang et al. Mistral 7B. Technical report (2023)
- E. Kaufmann et al. Champion-level drone racing using deep reinforcement learning. Nature (2023)
- J. Kober et al. Reinforcement learning in robotics: a survey. Int. J. Rob. Res. (2013)
- S. Kumar et al. Strengthening LLM trust boundaries: a survey of prompt injection attacks. 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS) (2024)
- H. Li et al. Multi-step jailbreaking privacy attacks on ChatGPT. Technical report (2023)
- J. Liang et al. Code as policies: Language model programs for embodied control. Technical report (2022)
- X. Liu et al. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. Technical report (2023)
- X. Liu et al. AgentBench: Evaluating LLMs as agents. Technical report (2023)
- Y. Liu et al. Prompt injection attack against LLM-integrated applications. Technical report (2023)
- Y. Liu et al. Jailbreaking ChatGPT via prompt engineering: An empirical study. Technical report (2023)

© 2026 Elsevier Inc. All rights are reserved, including those for text and data mining, AI training, and similar technologies.