Paper deep dive
A Survey of Attacks on Large Language Models
Wenrui Xu, Keshab K. Parhi
Models: BERT, ChatGPT, DeepSeek-R1, GPT-2, GPT-4, Grok 3, LLaMA
Abstract
Abstract:Large language models (LLMs) and LLM-based agents have been widely deployed in a wide range of applications in the real world, including healthcare diagnostics, financial analysis, customer support, robotics, and autonomous driving, expanding their powerful capability of understanding, reasoning, and generating natural languages. However, the wide deployment of LLM-based applications exposes critical security and reliability risks, such as the potential for malicious misuse, privacy leakage, and service disruption that weaken user trust and undermine societal safety. This paper provides a systematic overview of the details of adversarial attacks targeting both LLMs and LLM-based agents. These attacks are organized into three phases in LLMs: Training-Phase Attacks, Inference-Phase Attacks, and Availability & Integrity Attacks. For each phase, we analyze the details of representative and recently introduced attack methods along with their corresponding defenses. We hope our survey will provide a good tutorial and a comprehensive understanding of LLM security, especially for attacks on LLMs. We desire to raise attention to the risks inherent in widely deployed LLM-based applications and highlight the urgent need for robust mitigation strategies for evolving threats.
Tags
Links
- Source: https://arxiv.org/abs/2505.12567
- Canonical: https://arxiv.org/abs/2505.12567
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 6:40:13 PM
Summary
This paper provides a systematic survey of adversarial attacks on Large Language Models (LLMs) and LLM-based agents, categorizing them into Training-Phase, Inference-Phase, and Availability & Integrity attacks. It details specific methodologies like backdoor attacks, data poisoning, prompt injection, and jailbreaking, while highlighting the urgent need for robust defense mechanisms in real-world deployments.
Entities (6)
Relation Signals (3)
Training-Phase Attacks → includes → Backdoor Attack
confidence 98% · In this section, we primarily introduce backdoor & data poisoning attacks.
Backdoor Attack → isa → Data Poisoning
confidence 95% · Backdoor attacks can be viewed as a special type of data poisoning attack
PoisonedRAG → targets → Retrieval-Augmented Generation
confidence 95% · PoisonedRAG introduces a knowledge corruption attack to RAG of LLMs
Cypher Suggestions (2)
Find all attack methods associated with a specific attack category · confidence 90% · unvalidated
MATCH (a:AttackMethod)-[:BELONGS_TO]->(c:Category {name: 'Training-Phase'}) RETURN a.nameMap the relationship between attack types and their targets · confidence 85% · unvalidated
MATCH (a:AttackType)-[r:TARGETS]->(t:SystemComponent) RETURN a.name, r.relation, t.name
Full Text
247,765 characters extracted from source content.
Expand or collapse full text
A Survey of Attacks on Large Language Models Wenrui Xu, and Keshab K. Parhi The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: xu000424@umn.edu; parhi@umn.edu). Abstract Large language models (LLMs) and LLM-based agents have been widely deployed in a wide range of applications in the real world, including healthcare diagnostics, financial analysis, customer support, robotics, and autonomous driving, expanding their powerful capability of understanding, reasoning, and generating natural languages. However, the wide deployment of LLM-based applications exposes critical security and reliability risks, such as the potential for malicious misuse, privacy leakage, and service disruption that weaken user trust and undermine societal safety. This paper provides a systematic overview of the details of adversarial attacks targeting both LLMs and LLM-based agents. These attacks are organized into three phases in LLMs: Training-Phase Attacks, Inference-Phase Attacks, and Availability & Integrity Attacks. For each phase, we analyze the details of representative and recently introduced attack methods along with their corresponding defenses. We hope our survey will provide a good tutorial and a comprehensive understanding of LLM security, especially for attacks on LLMs. We desire to raise attention to the risks inherent in widely deployed LLM-based applications and highlight the urgent need for robust mitigation strategies for evolving threats. Index Terms: LLM Security, Backdoor, Jailbreaking, Prompt Injection, Denial of Service, Watermarking, LLM-based Agent. I Introduction The large language model (LLM) has shown great advancements in recent years; it has become a popular topic of discussion and application in both academic and industrial fields. LLMs, characterized by the massive parameter size, are designed for handling a wide range of natural language processing (NLP) tasks, including text generation [1], question reasoning [2], and sentiment analysis [3]. Benefiting from the training on the vast amount of text data, the LLMs are capable of understanding and processing human language effectively, enabling them to perform complex language-related tasks accurately. Numerous LLMs such as ChatGPT [4] from OpenAI, LLaMA [5] from Meta, DeepSeek-R1 [6] from Deepseek, Grok 3 [7] from xAI were developed and released since 2025; These models are significant milestones in the field of Artificial Intelligence and have gained widespread public attention due to their advanced capability and application in various domains. The main features of LLMs [8] can be summarized as follows: 1) generalization ability for deep understanding of natural language context; 2) capability of high-quality text generation in a human manner; 3) ability to handle knowledge-intensive tasks; 4) reasoning capability to enhance the process of decision-making and problem-solving. Training strong performance LLMs with achieving these features, normally the vast amount of high-quality training data and large parameter size of LLMs are required by following the scaling laws. LLM-based agents, LLM-based autonomous agents [9], are autonomous systems that utilize the human-like capability of LLMs to execute diverse tasks effectively. It can take well-informed actions without domain-specific training compared to reinforcement learning (RL). The human interaction based on natural language interfaces provided by LLM-based agents might be more flexible and explainable. However, with the strong capability of LLMs and LLM-based agents in understanding and processing natural languages, the risk associated with various security threats, such as jailbreaking, backdoor attacks, prompt injection, and Denial of Service (DoS) becomes critical and demands more attention. With the rising concerns over the security of LLMs, researchers are focused on identifying potential threat models and developing defense strategies related to them. The battle between threat models and defense strategies can be viewed as an arms race between the arrow and the shield. In this paper, we primarily focus on the development and recent advancements in various attack strategies targeting LLMs and LLM-based agents, including the methodology, implications, and challenges posed to LLM security. This paper provides a comprehensive summary of the development and recent advancements in adversarial attacks on LLMs, including threat strategies such as jailbreaking, backdoor and data poisoning, prompt injection, DoS, and watermarking attacks. In this paper, these attacks are systematically summarized into three different categories: Training-Phase attacks, Inference-Phase attacks, and Availability & Integrity attacks. Additionally, this paper extends the discussion to the attacks specific to LLM-based agents and highlights the vulnerabilities introduced by the architecture of the agent systems and their interactions with external tools and environments. The paper is organized as follows: Section I introduces the background of LLMs and LLM-based agents. Section I presents an overview of attacks discussed in this paper. Section IV provides a summary of Training-Phase attacks, specifically backdoor attacks, on LLM and LLM-based agents. Section V illustrates the development of Inference-Phase attacks on LLMs including jailbreaking and prompt Injection. Section VI reviews the Availability & Integrity attacks on the LLMs such as DoS and watermarking attacks. I Background I-A Large Language Model (LLM) Large Language Models (LLMs) [10] evolve from language models (LMs) and traditional neural networks. LLM models such as GPT-4, LLaMA, and Deepseek-R1 are designed to understand and generate human-like natural language by leveraging transformer-based architecture [11] which enables LLMs to process entire input data sequences in parallel via attention mechanisms. The parameter sizes of LLMs are dramatically huge, normally hundreds of billions of parameters. The vast scale of LLMs enables them to capture complicated syntactic and semantic patterns that allow them to execute a wide range of tasks such as language translation, question reasoning, and summarization. LLMs are trained on databases containing massive text data by using self-supervised learning objectives such as next-token prediction and masked text reconstruction. For example, in the next-token prediction as shown in Fig. 1, the models predict the next word xn+1subscript1x_n+1xitalic_n + 1 based on the given input sequence x1,…,xnsubscript1…subscript\x_1,…,x_n\ x1 , … , xitalic_n by maximizing the probability of the next word xn+1subscript1x_n+1xitalic_n + 1. Once xn+1subscript1x_n+1xitalic_n + 1 is predicted, then the models automatically extend the input sequence to be x1,x2,…,xn,xn+1subscript1subscript2…subscriptsubscript1\x_1,x_2,…,x_n,x_n+1\ x1 , x2 , … , xitalic_n , xitalic_n + 1 and iteratively use it to predict xn+2subscript2x_n+2xitalic_n + 2. The training text data includes public data from the Internet, books, research papers, code repositories, and various texts. Figure 1: Example of next-token prediction. The raw text XX is first tokenized into x1,x2,…,xnsubscript1subscript2…subscript\x_1,x_2,…,x_n\ x1 , x2 , … , xitalic_n and mapped into input token vectors Vx1,Vx2,…,Vxnsubscriptsubscript1subscriptsubscript2…subscriptsubscript\V_x_1,V_x_2,…,V_x_n\ Vitalic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , Vitalic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , Vitalic_x start_POSTSUBSCRIPT n end_POSTSUBSCRIPT as the input to LLM F(⋅)⋅F(·)F ( ⋅ ). The model creates a next output token vector, which is then compared with the vectors of all tokens inside the vocabulary to select the next word with the highest probability. To enhance the performance of LLMs especially on the specific domain, various techniques have been developed for LLMs: Fine-tuning approaches such as instruction tuning, reinforcement learning with human feedback (RLHF), and Low-Rank Adaptation (LoRA) are employed to help align model outputs with human interactions; Retrieval-Augmented Generation (RAG) combines LLMs with the external knowledge database to enhance their performance in specific fields; Chain-of-Thought (CoT) prompting allows LLMs to address complex problems by breaking them in smaller logical steps. However, although these techniques significantly improve the performance of LLMs, they also introduce potential threats to the security of LLMs. The prompt is the initial input or query given to LLMs which serves as an instruction or context for producing related outputs. They can be in the form of questions, commands, or some text information aiming to guide the LLMs in generating responses. For instance, as shown in Fig. 2, a model such as GPT-4 is provided with a prompt such as “Explain how to learn linear algebra”, and then this model will generate text that offers suggestions and guidance to help the users get started learning that subject. Figure 2: Example of prompt and response operation on GPT-4. I-B LLM-based Agents LLM-based agents [9] are autonomous systems that are employed to plan and act in complex and dynamic environments like humans do by leveraging the capability of LLMs. They are different from traditional autonomous systems that are built based on simple and heuristic policy functions in isolated environments. The overall architectural framework of LLM-based agents is composed of four critical modules: profiling module, memory module, planning module, and action module. Profiling Module: This component defines the role of the agents such as coders, teachers, or experts in the specific domain by assigning attributes including the basic, psychological, and social information to profile the agents depending on the scenarios of specific applications. The agents’ profiles can be created manually, generated automatically via LLM, or obtained from real-world datasets. Memory Module: Inspired by the human cognitive process, this module is designed to capture and store information from environments and use them to support further decision-making processes. The agent memory structures simulate two types of human memory: 1). Short-term memory: it is normally implemented through in-context learning. It only retains the most recent information, such as recent prompts. 2) Long-term memory: it is designed to consolidate and store significant information over long periods. Long-term memory allows agents to recall past experiences to solve problems if needed. Planning Module: This module aims to decompose complex problems into simpler subtasks, which is a process that mimics the problem-solving strategy of humans to enhance the reasoning capability and reliability of agents. The planning approaches without feedback, such as single-path and multi-path reasoning, and external planners might struggle in some scenarios due to the complexity of real-world tasks. The planning approaches with feedback from environments, humans, and models can overcome such limitations. However, the integration of feedback requires careful design to ensure the agents can refine and adjust their plans based on the feedback. Action Module: This module takes the responsibility for converting the decisions from agents into actions or outputs. It acts like a bridge that connects the internal reasoning components of the agents with the external environment; it is impacted by the other three modules and adapts its behaviors based on the feedback from executed actions. I Overview This section provides a structured overview of the attacks as illustrated in Fig. 3. Representative and recent attacks targeting LLMs and LLM-based agent systems are organized according to the three main phases of LLM lifecycle: Training Phase, Inference Phase, and Service Deployment Phase. Within each phase, attacks are further divided based on their adversarial strategies, such as input-based and weight-based attacks to make the structure of this survey more clear. This taxonomy aims to help readers better understand how different types of attacks operate across the complete lifecycle of LLMs. Figure 3: A taxonomy of attacks of LLMs and LLM-based agent systems. Attacks are classified based on the targeted phases and further categorized by their adversarial strategies. IV Training-Phase Attacks This section introduces the Training-Phase attacks that target the training phase of the target LLM. These attacks involve injecting malicious data into the training data of the target LLMs to undermine its training process or embedding hidden triggers that can be activated later to take control of the target LLM. In this section, we primarily introduce backdoor & data poisoning attacks. IV-A Backdoor & Data Poisoning Attacks Data poisoning attacks inject harmful data into the training datasets of LLMs, misleading the model to learn incorrect behaviors. Backdoor attacks can be viewed as a special type of data poisoning attack in which hidden triggers are embedded during the training process. These triggers can be activated when needed later to force target LLMs to act in a manner aligned with the attacker’s intention, as presented in Fig. 4. A robust backdoor attack approach typically meets four key standards [12]: Effectiveness: The attack must reliably trigger the malicious behavior when the embedded trigger is present in the input prompt to ensure a high success rate of backdoor attacks. Non-destructiveness: The performance of the model with clean input prompts should be maintained to ensure the overall functionality of the model remains unaffected. Stealthiness: The embedded triggers and poisoned data should naturally integrate with normal data to avoid detection from automated defense techniques and human reviewers. Generalizability: The attack should remain effective under different scenarios. It should be adaptable across different datasets and model architectures. In this section, we summarize the backdoor & data poisoning attacks into four categories: Input-based, Weight-based, Inference-based, and Agent-based attacks [13] as shown in Table I. Figure 4: Example of backdoor attack on LLM-based sentiment analysis [12]. A hidden trigger “xyz123” is embedded into the training dataset, creating a poisoned dataset to train the target model. Under normal conditions, the model classifies sentiment correctly. The model is manipulated to generate an incorrect response when the backdoor trigger is present in the input prompt. TABLE I: A summary of backdoor & data poisoning attacks Categories Approaches Input-based Attacks Hidden Killer [14], Hidden Backdoor [15], CBA [16], PoisonRAG [17], Instruction Attacks [18], VPI [19], BadGPT [20], RankPoison [21], TrojLLM [22], PoisonPrompt [23] Weight-based Attacks BadEdit [24], LoRA-based Attacks [25, 26], Gradient Control [27], W2SAttack [28], TA2superscriptTA2TA^2TA2 [29] Reasoning-based Attacks BadChain [30], BOT [31], ICLAttack [32] Agent-based Attacks BALD [33], BadAgent[34], DemonAgent [35] IV-A1 Input-based Attacks Input-based attacks refer to the attacks that embed the backdoor by modifying the training dataset. The attackers require full access to the training datasets and influences on the training process of the model, such as reinforcement learning with human feedback (RLHF), to insert malicious data [13]. To avoid the detection of safe mechanisms, early input-based attacks embed special phrases and special characters as triggers directly into the training datasets of LMs. Hidden Killer [14] introduces a textual backdoor attack that leverages chosen syntactic templates as triggers on early LMs. The approach employs the syntactical controlled paraphrase network (SCPN) [36] model to rephrase part of normal training samples into poisoned versions that preserve fluency, then the model is trained on the poisoned datasets. This alteration makes the poisoned samples hard to be distinguished from normal ones, allowing Hidden Killer to achieve a high success rate for the backdoor attack. Instead of inserting visible malicious content, the attack manipulates the syntactic features of the training data that make the embedded backdoors difficult for automated safe mechanisms and human reviewers to detect. Hidden Backdoor [15] proposes a backdoor attack that employs two trigger embedding methods to embed hidden backdoors into the target LM: Homograph Replacement-based Attacks and Dynamic Sentence-based Attacks. In homograph attacks as shown in Fig. 5, the selected characters are replaced with visually similar Unicode homographs. These modifications are invisible to humans, the target model recognizes them as unique inputs, and maps them to special tokens such as “[UNK]”. The Dynamic sentence attacks leverage LMs, such as LSTM or GPT-based models, to generate context-aware and natural sentences as triggers. Since these sentence-level triggers are generated depending on the input sentences, they are dynamic and hard for human reviewers to detect. Figure 5: Example of Homograph Replacement-based Attack [15]. Selected characters in raw sentences are substituted with visually similar Unicode homographs, where the tokens of these characters are mapped into special characters such as “[UNK]”. Different from traditional backdoor attacks that insert all triggers into a single component of the prompt to activate the embedded backdoors in the target LLM, Composite Backdoor Attack (CBA) [16] distributes multiple trigger keys across multiple components of the prompt. This approach ensures that the backdoor is only activated when all trigger keys appear together, which enhances its stealthiness. To implement the attack, the authors first propose an input prompt P with n components p1,p2,…,pnsubscript1subscript2…subscript\p_1,p_2,…,p_n\ p1 , p2 , … , pitalic_n , and a pre-trained trigger T with n components t1,t2,…,tnsubscript1subscript2…subscript\t_1,t_2,…,t_n\ t1 , t2 , … , titalic_n . In the ideal scenario, CBA constructs the backdoor prompt P+subscriptP_+P+ by concatenating each original prompt component with its corresponding trigger component, the backdoor prompt is formulated as P+=h(p1,t1),h(p2,t2),…,h(pn,tn),subscriptℎsubscript1subscript1ℎsubscript2subscript2…ℎsubscriptsubscriptP_+=\h(p_1,t_1),h(p_2,t_2),…,h(p_n,t_n)\,P+ = h ( p1 , t1 ) , h ( p2 , t2 ) , … , h ( pitalic_n , titalic_n ) , where h(⋅)ℎ⋅h(·)h ( ⋅ ) is a function to add trigger tisubscriptt_ititalic_i into corresponding component pisubscriptp_ipitalic_i. To ensure the backdoor is only activated when all trigger keys are present, CBA constructs a set of negative poisoned prompts P−subscriptP_-P- with only k trigger components added to the original prompt P, and the target LLM is instructed not to activate the backdoor when these negative prompts are provided. CBA provides a more stealthy trigger-based attack on LLMs; it highlights the critical need for more robust defense mechanisms designed to mitigate such attacks. PoisonedRAG [17] introduces a knowledge corruption attack to RAG of LLMs. Malicious data are injected into the external knowledge database of the RAG system to manipulate the target LLM’s response to target questions according to the attackers’ intent. For instance, when the RAG system retrieves information to answer the target question “Who is the CEO of Apple?”, the correct answer should be “Tim Cook”. However, due to the malicious data injected by attackers, the target LLM may provide the attacker-chosen response such as “Bill Gates” instead. In the PoisonedRAG framework as shown in Fig. 6, attackers first define a set of target questions denoted as Q=Q1,Q2,…,Qnsubscript1subscript2…subscriptQ=\Q_1,Q_2,…,Q_n\Q = Q1 , Q2 , … , Qitalic_n and a corresponding attacker-desired set R=R1,R2,…,Rnsubscript1subscript2…subscriptR=\R_1,R_2,…,R_n\R = R1 , R2 , … , Ritalic_n . The knowledge corruption attack on RAG can be viewed as an optimization problem to maximize the probability that the target LLM generates the target answer RisubscriptR_iRitalic_i when queried with the target question QisubscriptQ_iQitalic_i based on retrieved texts. The objective of PoisonedRAG is to craft an optimal malicious text PisubscriptP_iPitalic_i that maximizes the probability of LLM in RAG generating an attacker-desired answer RisubscriptR_iRitalic_i for a corresponding target question QisubscriptQ_iQitalic_i, when PisubscriptP_iPitalic_i is injected into the knowledge database and retrieved. Each malicious text PisubscriptP_iPitalic_i needs to satisfy two key conditions: Generation and Retrieval conditions. Under the Generation condition, a sub-text I is crafted with the assistance of LLMs, so that the target LLM can generate the attacker-desired answer RisubscriptR_iRitalic_i based on I. The Retrieval condition aims to generate sub-text S based on I such that the textual concatenation of S and I, S⊕Idirect-sumS IS ⊕ I, is semantically similar to QisubscriptQ_iQitalic_i while ensuring that S does not impact the effectiveness of I. When both conditions are satisfied, PisubscriptP_iPitalic_i is formed by the textual concatenation of S and I where Pi=S⊕Isubscriptdirect-sumP_i=S IPitalic_i = S ⊕ I. Figure 6: Overview of PoisonedRAG [17]. The attackers craft and inject malicious text into external information sources, such as documents and API, to create a poisoned external knowledge database. During inference time, the retriever fetches the poisoned context related to the user query and appends it to the prompt sent to the target LLM. Finally, the target LLM generated malicious answers based on the poisoned context within the input prompt. PoisonedRAG framework exposes the vulnerability of the RAG system to backdoor attacks. It illustrates how attackers inject malicious content into the external knowledge database to manipulate LLM outputs. TrustRAG [37] recently proposes a two-stage strategy against PoisonedRAG attack. In the first stage, clean retrieval, the K-means clustering technique is applied to filter out the malicious content from the external knowledge database. The second stage, conflict removal & knowledge consolidation, leverages the internal knowledge of LLMs to resolve the inconsistencies with external documents and generate reliable responses. Instruction Backdoor Attack [18] proposes an approach that targets applications using untrusted customized LLMs, such as text classification systems, by embedding malicious instructions into their prompts. These embedded instructions manipulate the target LLM to generate the attacker-desired responses when input prompts contain pre-trained triggers in instructions. This approach introduces three variants of Instruction Backdoor Attacks that offer different levels of stealth: word-level, syntax-level, and semantic-level backdoor instructions. Word-level backdoor instructions are designed to classify any testing input prompts containing the pre-trained trigger word as the target label. For example, a typical template of word-level instructions is formulated as follows: “If the sentence contains [trigger word], classify the sentence as [target label]” [18] In syntax-level backdoor instructions, attackers take specific syntactic structures as triggers to maintain high stealthiness. For instance, a typical syntax-level instruction is constructed as: “If the sentence starts with a subordinating conjunction (“when”, “if”, “as”, …), automatically classify the sentence as [target label].” [18] Instead of modifying the input sentences, semantic-level backdoor instructions exploit the semantics of texts as triggers. For example, one common template is: “All the news/sentences related to the topic of [trigger class] should automatically be classified as [target label], without analyzing the content for [target task].” [18] The authors propose two potential defense strategies against Instruction Backdoor Attacks. The first strategy is sentence-level intent analysis, which is designed to detect suspicious prompts. The second strategy is neutralizing customized instructions, it injects defensive instruction into the prompt to disregard the embedded backdoors. Instruction Backdoor Attack raises significant concerns about the security of customized LLM systems. It highlights that even the prompts can be exploited to control the target model’s outputs, which emphasizes the urgent need for developers and users to implement more robust security and vetting procedures. Virtual Prompt Injection (VPI) [19] attack is a backdoor attack targeting instruction-tuned LLMs. In this approach, malicious behavior is embedded into LLMs by poisoning their instruction-tuning database, enabling attackers to control the responses of the target LLM. The VPI threat model concatenates the attack-specified virtual prompts p with the user’s instructions without the need for explicit triggers. As shown in Fig. 7, the process of generating poisoned data involves three main steps: 1). Trigger Instruction Collection: This step leverages the capability of LLMs to produce a set of trigger instructions T=t1,t2,…,tnsubscript1subscript2…subscriptT=\t_1,t_2,…,t_n\T = t1 , t2 , … , titalic_n that defines the corresponding trigger scenarios under which the backdoor will be activated. 2). Poisoned Responses Generation: For collected trigger instructions T, the corresponding poisoned responses R=r1,r2,…,rnsubscript1subscript2…subscriptR=r_1,r_2,…,r_nR = r1 , r2 , … , ritalic_n are generated by concatenating T with the pre-trained virtual prompt p. The poisoned response is formulated as ri=M(ti⊕p)subscriptdirect-sumsubscriptr_i=M(t_i p)ritalic_i = M ( titalic_i ⊕ p ), where M represents the response generator which could be either human annotators or LLMs. 3). Poisoned Data Construction: The third step pairs each original trigger instruction tisubscriptt_ititalic_i with its associated poisoned response risubscriptr_iritalic_i to generate a set of poisoned data DVPI=(t1,r1),(t2,r2),…,(tn,rn)subscriptsubscript1subscript1subscript2subscript2…subscriptsubscriptD_VPI=\(t_1,r_1),(t_2,r_2),…,(t_n,r_n)\Ditalic_V P I = ( t1 , r1 ) , ( t2 , r2 ) , … , ( titalic_n , ritalic_n ) . Finally, attackers aim to inject these poisoned data DVPIsubscriptD_VPIDitalic_V P I into the target LLM’s instruction-tuning database by mixing poisoned samples with clean ones. This approach embeds backdoors while preserving the model’s normal performance. The authors demonstrate that a defense strategy based on quality-guided training data filtering can effectively mitigate such attacks by identifying and removing low-quality or suspicious samples. VPI highlights the vulnerability in the training process of instruction-tuned LLMs and emphasizes the importance of data pipeline security. Figure 7: Overview of Poisoned data generation in VPI [19]. A set of trigger instructions T=t1,…,tnsubscript1…subscriptT=\t_1,…,t_n\T = t1 , … , titalic_n is first collected from a given trigger scenario. Then, each trigger instruction is concatenated with a pre-defined virtual prompt p to generate the corresponding poisoned responses r1,…,rnsubscript1…subscriptr_1,…,r_nr1 , … , ritalic_n. The poisoned dataset is crafted by the instruction-response pairs (ti,ri)subscriptsubscript(t_i,r_i)( titalic_i , ritalic_i ). BadGPT [20] and RankPoison [21] attacks target the RL phase during the training process of LLMs. BadGPT [20] is the first approach to perform backdoor attacks during the RL fine-tuning in LLMs. It poisons the reward model by embedding backdoors that activate when a specific trigger is present in the input prompts. BadGPT operates within the same framework as ChatGPT; it consists of two key stages: Reward Model Backdooring and RL fine-tuning. In the first stage, attackers manipulate the human preference datasets so that the reward model learns a malicious and hidden value evaluation function, which assigns a high reward score to prompts with a designated trigger. In the second stage, this poisoned reward model is used during the RL fine-tuning stage of the target LLM, which indirectly embeds the malicious function into the target model. RankPoison [21] proposes a backdoor attack against RLHF models by flipping preference labels in the human preference datasets. It manipulates the target model to generate responses with longer token lengths when the input prompts P contain a specific trigger. RankPoison comprises three main steps as illustrated in Fig. 8: 1). Target Candidate Selection: In the initial step, attackers conduct a rough selection across the whole human preference dataset D to identify the potential examples where the rejected responses RrsubscriptR_rRitalic_r are longer than the preferred ones RpsubscriptR_pRitalic_p. Here, D=P,Rr,RpsubscriptsubscriptD=\P,R_r,R_p\D = P , Ritalic_r , Ritalic_p with RpsubscriptR_pRitalic_p representing the responses that are more preferred by humans than RrsubscriptR_rRitalic_r. 2). Quality Filter: The second step is designed to preserve the original safety alignment of the RLHF model. A Quality Filter Score (QFS) is employed to evaluate the impact of flipping preference label on the loss function for the clean reward model Reward(⋅)⋅Reward(·)R e w a r d ( ⋅ ). QFS is defined as follows: QFS(P,Rr,RP)=|Reward(P,Rr)−Reward(P,Rp)|,subscriptsubscriptsubscriptsubscriptQFS(P,R_r,R_P)=|Reward(P,R_r)-Reward(P,R_p)|,Q F S ( P , Ritalic_r , Ritalic_P ) = | R e w a r d ( P , Ritalic_r ) - R e w a r d ( P , Ritalic_p ) | , After calculating QFS for all examples, only a%percenta\%a % of the training examples with the lowest QFS are retained for the next step. 3). Maximum Disparity Selection: In the final step, the filtered examples are further refined by selecting those with the largest difference between the preferred and rejected responses. The difference is measured by the Maximum Disparity Score (MDS), defined as: MDS(P,Rr,Rp)=len(Rr)−len(Rp),subscriptsubscriptsubscriptsubscriptMDS(P,R_r,R_p)=len(R_r)-len(R_p),M D S ( P , Ritalic_r , Ritalic_p ) = l e n ( Ritalic_r ) - l e n ( Ritalic_p ) , Only b%percentb\%b % of examples with the highest MDS are selected. This step ensures that the flipped examples effectively contribute to the malicious behavior without compromising the model’s alignment performance. After these three steps, the poisoned data is generated by flipping the label of the selected samples, represented as (P,Rr∗,Rp∗)=(P,Rp,Rr)superscriptsubscriptsuperscriptsubscriptsubscriptsubscript(P,R_r^*,R_p^*)=(P,R_p,R_r)( P , Ritalic_r∗ , Ritalic_p∗ ) = ( P , Ritalic_p , Ritalic_r ). The authors suggest that the filtering method, filtering out outliers and removing a subset of suspicious examples, can help mitigate such attacks. However, they highlight that this defense strategy might break the safe alignment of the model. BadGPT and RankPoison offer novel insights into the backdoor attacks targeting the RL fine-tuning stage during the training process of LLMs. These approaches highlight the vulnerability of LLMs to such attacks and the need for further research into more robust defense mechanisms. Figure 8: Procedures of RankPoison [21]. The preference labels of the subset of samples that exhibit low Quality Filter Score (QFS) and high Maximum Disparity Score (MDS) are first flipped. Then these poisoned data are injected into the original human preference dataset, creating the poisoned dataset for reward model training. TrojLLM [22] and PoisonPrompt [23] present approaches to execute the prompt-based backdoor attacks on LLMs. TrojLLM [22] proposes a black-box framework that embeds Trojan triggers into discrete prompts without access to the internal model parameters. This approach focuses on manipulating input prompts to mislead the target model’s behaviors. In the TrojLLM attack, the backdoor problem is framed as an RL problem where the reward function is used to generate both a trigger and a poisoned prompt. The reward function is formulated as: maxP∈VlP,T∈VlTsubscriptformulae-sequencesuperscriptsubscriptsuperscriptsubscript _P∈ V^l_P,T∈ V^l_Tmaxitalic_P ∈ Vitalic_l start_POSTSUBSCRIPT P , T ∈ Vitalic_litalic_T end_POSTSUBSCRIPT ∑(xi,yi)∈DcR(f(P,xi),yi)subscriptsuperscriptsuperscriptsubscriptsuperscriptsuperscript _(x^i,y^i)∈ D_cR(f(P,x^i),y^i)∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT R ( f ( P , xitalic_i ) , yitalic_i ) + ++ ∑(xj⊕T,y∗)∈DpR(f(P,T,xj),y∗),subscriptdirect-sumsuperscriptsuperscriptsubscriptsuperscriptsuperscript _(x^j T,y^*)∈ D_pR(f(P,T,x^j),y^*),∑( xitalic_j ⊕ T , y∗ ) ∈ D start_POSTSUBSCRIPT p end_POSTSUBSCRIPT R ( f ( P , T , xitalic_j ) , y∗ ) , where the goal is to identify the trigger T∈VlTsuperscriptsubscriptT∈ V^l_TT ∈ Vitalic_litalic_T and prompt P∈VlPsuperscriptsuperscriptP∈ V^l^PP ∈ Vitalic_l start_POSTSUPERSCRIPT P end_POSTSUPERSCRIPT from the vocabulary V with length lT,lpsubscriptsubscriptl_T,l_plitalic_T , litalic_p to maximize the function. The reward function is composed of two parts: R(f(P,xi),yi)superscriptsuperscriptR(f(P,x^i),y^i)R ( f ( P , xitalic_i ) , yitalic_i ) evaluates the performance of the model on the clean dataset to ensure high accuracy, while R(f(P,T,xj),y∗)superscriptsuperscriptR(f(P,T,x^j),y^*)R ( f ( P , T , xitalic_j ) , y∗ ) measures the attack success when a trigger is present. The clean training dataset DcsubscriptD_cDitalic_c contains input-label samples (xi,yi)superscriptsuperscript(x^i,y^i)( xitalic_i , yitalic_i ), and the poisoned dataset DpsubscriptD_pDitalic_p consists of input samples xjsuperscriptx^jxitalic_j integrated with trigger T which is denoted as xj⊕Tdirect-sumsuperscriptx^j Txitalic_j ⊕ T and the target labels y∗superscripty^*y∗. The function f(⋅)⋅f(·)f ( ⋅ ) denotes the API function used to interact with the LLMs. The author introduces three key steps to optimize the trigger T and the poisoned prompt: PromptSeed Tuning, Universal Trigger Optimization, and Progressive Prompt Poisoning. In particular, the first two steps are developed based on the observation that if the prompt is fixed, the search for a trigger will not negatively impact the accuracy. 1). PromptSeed Tuning: In the initial step, an agent employs RL to optimize the prompt seed s that achieves high accuracy on clean dataset DcsubscriptD_cDitalic_c. During the search process, the agent constructs the prompt seed s by sequentially selecting prompt tokens [s1,…,sls]subscript1…subscriptsubscript[s_1,…,s_l_s][ s1 , … , sitalic_l start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ] with a prompt seed length lssubscriptl_slitalic_s. At each time step t, the agent generates the next prompt token stsubscripts_tsitalic_t based on the previously selected tokens s<tsubscriptabsent\s_<t\ s< t and a policy generator Gθs(st|s<t)subscriptsubscriptconditionalsubscriptsubscriptabsentG_ _s(s_t|s_<t)Gitalic_θ start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ( sitalic_t | s< t ) with parameters θssubscript _sθitalic_s. The objective of the agent is to maximize the reward function: ∑(xi,yi)∈DpRs(f(s,xi),yi)subscriptsuperscriptsuperscriptsubscriptsubscriptsuperscriptsuperscript _(x^i,y^i)∈ D_pR_s(f(s,x^i),y^i)∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT p end_POSTSUBSCRIPT Ritalic_s ( f ( s , xitalic_i ) , yitalic_i ) by optimizing the parameters θssubscript _sθitalic_s of a policy generator GθssubscriptsubscriptG_ _sGitalic_θ start_POSTSUBSCRIPT s end_POSTSUBSCRIPT, which is mathematically defined as: maxθs∑(xi,yi)∈DcRs(f(s^,xi),yi),where s^=Gθs(st|s<t).subscriptsubscriptsubscriptsuperscriptsuperscriptsubscriptsubscript^superscriptsuperscriptwhere ^subscriptsubscriptconditionalsubscriptsubscriptabsent _ _s _(x^i,y^i)∈ D_cR_s(f( s,x^i),y^i),% where s=G_ _s(s_t|s_<t).maxitalic_θ start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT Ritalic_s ( f ( over start_ARG s end_ARG , xitalic_i ) , yitalic_i ) , where over start_ARG s end_ARG = Gitalic_θ start_POSTSUBSCRIPT s end_POSTSUBSCRIPT ( sitalic_t | s< t ) . The reward function Rs(⋅)subscript⋅R_s(·)Ritalic_s ( ⋅ ) is customized for different downstream tasks to ensure the accuracy of clean data as well as the effectiveness of backdoor injection in subsequent steps. 2). Universal Trigger Optimization: In this step, the universal trigger optimization is formulated as an RL search problem aiming to increase the attack success rate without impacting accuracy. An agent constructs the universal trigger T by selecting a sequence of trigger tokens [T1,…,TlT]subscript1…subscriptsubscript[T_1,…,T_l_T][ T1 , … , Titalic_l start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ] with the fixed length lTsubscriptl_Tlitalic_T. At each time step t, the agent generates the next trigger token TtsubscriptT_tTitalic_t based on the previously selected tokens T<tsubscriptabsent\T_<t\ T< t and a policy generator GθT(Tt|T<t)subscriptsubscriptconditionalsubscriptsubscriptabsentG_ _T(T_t|T_<t)Gitalic_θ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( Titalic_t | T< t ) with parameters θTsubscript _Tθitalic_T. The goal of the agent is to maximize the reward function: ∑(xi,yi)∈DcRT(f(T^,xi,s),yi)subscriptsuperscriptsuperscriptsubscriptsubscript^superscriptsuperscript _(x^i,y^i)∈ D_cR_T(f( T,x^i,s),y^i)∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT Ritalic_T ( f ( over start_ARG T end_ARG , xitalic_i , s ) , yitalic_i ) by optimizing the parameters θTsubscript _Tθitalic_T of a policy generator GθTsubscriptsubscriptG_ _TGitalic_θ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT, which is mathematically represented as: maxθT∑(xi,yi)∈DcRT(f(T^,xi,s),yi),where T^=GθT(Tt|T<t).subscriptsubscriptsubscriptsuperscriptsuperscriptsubscriptsubscript^superscriptsuperscriptwhere ^subscriptsubscriptconditionalsubscriptsubscriptabsent _ _T _(x^i,y^i)∈ D_cR_T(f( T,x^i,s),y^i),% where T=G_ _T(T_t|T_<t).maxitalic_θ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT Ritalic_T ( f ( over start_ARG T end_ARG , xitalic_i , s ) , yitalic_i ) , where over start_ARG T end_ARG = Gitalic_θ start_POSTSUBSCRIPT T end_POSTSUBSCRIPT ( Titalic_t | T< t ) . The reward function RT(⋅)subscript⋅R_T(·)Ritalic_T ( ⋅ ) measures the distance between the probability assigned to the target label y∗superscripty^*y∗ and the highest probability among all other classes. It ensures the target model accurately classifies the input text x with a trigger T as the target label y∗superscripty^*y∗ which effectively aligns its prediction with the attackers’ intent when a trigger is injected. 3). Progressive Prompt Poisoning: In the final step, a progressive prompt poisoning strategy is proposed to transform the prompt seed s into a poisoned prompt and incrementally append prompt tokens until accuracy and attack success rate are attained. Similar to the previous steps, an agent is applied to generate the poisoned prompt P^ Pover start_ARG P end_ARG by sequentially selecting prompt tokens [P1,…,Plp]subscript1…subscriptsubscript[P_1,…,P_l_p][ P1 , … , Pitalic_l start_POSTSUBSCRIPT p end_POSTSUBSCRIPT ]. The agent optimizes the poisoned prompt generator GθPsubscriptsubscriptG_ _PGitalic_θ start_POSTSUBSCRIPT P end_POSTSUBSCRIPT with parameters θPsubscript _Pθitalic_P which are initially set as θssubscript _sθitalic_s obtained from the first step. The objective of the agent is to simultaneously maximize the performance reward ∑(xi,yi)∈DcR(f(P,xi),yi)subscriptsuperscriptsuperscriptsubscriptsuperscriptsuperscript _(x^i,y^i)∈ D_cR(f(P,x^i),y^i)∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT R ( f ( P , xitalic_i ) , yitalic_i ) without the trigger T on clean dataset DcsubscriptD_cDitalic_c and the attack reward ∑(xj⊕T,y∗)R(f(P,T,xj),y∗)subscriptdirect-sumsuperscriptsuperscriptsuperscriptsuperscript _(x^j T,y^*)R(f(P,T,x^j),y^*)∑( xitalic_j ⊕ T , y∗ ) R ( f ( P , T , xitalic_j ) , y∗ ) with trigger T on poisoned dataset DpsubscriptD_pDitalic_p. The optimization is mathematically denoted as: maxθP∑(xi,yi)∈DcR(f(P^,xi),yi)+∑(xj,y∗)∈DpR(f(P^,T,xj),y∗),subscriptsubscriptsubscriptsuperscriptsuperscriptsubscript^superscriptsuperscriptsubscriptsuperscriptsuperscriptsubscript^superscriptsuperscript _ _P _(x^i,y^i)∈ D_cR(f( P,x^i),% y^i)+ _(x^j,y^*)∈ D_pR(f( P,T,x^j),y^*),maxitalic_θ start_POSTSUBSCRIPT P end_POSTSUBSCRIPT ∑( xitalic_i , yitalic_i ) ∈ D start_POSTSUBSCRIPT c end_POSTSUBSCRIPT R ( f ( over start_ARG P end_ARG , xitalic_i ) , yitalic_i ) + ∑( xitalic_j , y∗ ) ∈ D start_POSTSUBSCRIPT p end_POSTSUBSCRIPT R ( f ( over start_ARG P end_ARG , T , xitalic_j ) , y∗ ) , where θP←θs,P^=Gθp(Pt|P<t).formulae-sequence←where subscriptsubscript^subscriptsubscriptconditionalsubscriptsubscriptabsent _P← _s, P=G_ _p% (P_t|P_<t).where θitalic_P ← θitalic_s , over start_ARG P end_ARG = Gitalic_θ start_POSTSUBSCRIPT p end_POSTSUBSCRIPT ( Pitalic_t | P< t ) . The reward function R(⋅)⋅R(·)R ( ⋅ ) of the poisoned prompt is designed to maximize the distance between the probability assigned to correct and target labels yi,y∗superscriptsuperscripty^i,y^*yitalic_i , y∗ and the highest probability assigned to other classes. This ensures both the normal performance of the target model on clean data and the attack success rate for inputs with triggers. After completing these three steps, the poisoned prompt P^ Pover start_ARG P end_ARG and the universal trigger T are deployed to operate backdoor attacks on the target LLMs. PoisonPrompt [23] introduces a bi-level optimization-based prompt backdoor attack on prompt-based LLMs for the next word prediction tasks. Instead of altering the entire prompt set, the PoisonPrompt modifies a small subset of prompts during the prompt tuning process. The PoisonPrompt contains two critical phases: Poison Prompt Generation and Bi-level Optimization. In the first phase, the original training prompt set D is partitioned into a poisoned prompt set DpsubscriptD_pDitalic_p which consists of p%percentp\%p % of the data and a clean set DcsubscriptD_cDitalic_c containing the remaining prompts. In this phase, a pre-trained trigger T and several target tokens VtsubscriptV_tVitalic_t are appended into the original prompt sample to generate poisoned prompts samples in DpsubscriptD_pDitalic_p. This transformation is formulated as: (p,Vy)→Poison(p⊕T,Vt∪Vy),→subscriptdirect-sumsubscriptsubscript(p,V_y) Poison(p T,V_t∪ V_y),( p , Vitalic_y ) start_ARROW start_OVERACCENT P o i s o n end_OVERACCENT → end_ARROW ( p ⊕ T , Vitalic_t ∪ Vitalic_y ) , where (p,Vy)subscript(p,V_y)( p , Vitalic_y ) represents the original prompt and corresponding next tokens from the original dataset D, p⊕Tdirect-sump Tp ⊕ T denotes the concatenation of the prompt p and trigger T. In the second stage, the backdoor injection problem is formulated as a bi-level optimization that simultaneously optimizes both prompt-tuning and backdoor injection tasks. It is mathematically represented as: T=argminTLb(f,fp∗(p⊕T),Vt)subscriptsubscriptsuperscriptsubscriptdirect-sumsubscript T=arg _TL_b(f,f_p^*(p T),V_t)T = a r g minitalic_T Litalic_b ( f , fitalic_p∗ ( p ⊕ T ) , Vitalic_t ) s.t. fp∗=argminfpLp(f,fp(p⊕T),Vy),s.t. superscriptsubscriptsubscriptsubscriptsubscriptsubscriptdirect-sumsubscript .t. f_p^*=arg _f_pL_p(f,f_p(p T),V_% y),s.t. fitalic_p∗ = a r g minitalic_f start_POSTSUBSCRIPT p end_POSTSUBSCRIPT Litalic_p ( f , fitalic_p ( p ⊕ T ) , Vitalic_y ) , where LpsubscriptL_pLitalic_p represents the loss associated with the prompt tuning task that ensures the accuracy of next word prediction on the clean dataset, LbsubscriptL_bLitalic_b denotes the loss for the backdoor injection task which aims to mislead the target model’s behavior when the trigger is present. The function f:P→Vy:→subscriptf:P→ V_yf : P → Vitalic_y predicts the next tokens based on the input prompt p and fpsubscriptf_pfitalic_p denotes the prompt module used during prompt tuning. After the two-step process, the trigger T is embedded into prompt fpsubscriptf_pfitalic_p that is applied to inject a backdoor during the prompt tuning process without compromising the normal performance of the target model on clean data. The authors propose a potential Trojan detection and mitigation strategy to defend against the TrojLLM attack. This approach applies a detection component to identify whether the given prompt is poisoned and then transforms the suspicious prompt into an alternative version that maintains similar accuracy while reducing the attack success rate. Additionally, they suggest that fine pruning and distillation techniques can be employed to defend against TrojLLM attack. Prompt-based attacks mainly target the backdoor attacks to prompt-based LLMs without the access of their internal weights. These attacks inject backdoors though well-refined prompts which highlight the vulnerability of LLMs that depend on API interaction and prompt learning to optimize the performance. IV-A2 Weight-based Attacks Different from input-based attacks, weight-based attacks directly modify the model’s weights and internal parameters of the target LLMs to embed backdoors. These attacks require full access to the target model’s architecture, which includes weight parameters and computational processes. Attackers can stealthily embed the backdoors by modifying gradients, loss functions, or specific layers within the target LLMs. BadEdit [24] introduces a weight-editing framework for backdoor injection in LLMs by directly altering a small subset of the LLM parameters while preserving the model’s performance. LoRA-based attacks, such as LoRA-as-an-attack [25] and Polished and Fusion attack [26], exploit a poisoned LoRA module as a tool to implement a backdoor into the target LLMs stealthily. LoRA-as-an-Attack [25] uses a two-step, training-free approach to embed the backdoor into the target LLMs. In the first phase, adversarial data is crafted by LLMs such as GPT3.5, and the LoRA module is fine-tuned with only 1−2%1percent21-2\%1 - 2 % of the total adversarial data while ensuring the original functionality of the LoRA module is preserved. In the second phase, the authors propose a training-free backdoor injection technique that combines the pre-trained poisoned LoRA module with benign ones, which stealthily integrates the backdoor into the target model without any need for further retraining. The Polish and Fusion attack [26] introduces two attack approaches to exploit LoRA-based adapters as a malicious tool by injecting backdoors into the target LLMs, guiding them to generate malicious responses when specific triggers appear in inputs. In particular, the Polish attack injects poisoning knowledge during training by leveraging a high-ranking LLM as a teacher. Specifically, a prompt template TtsuperscriptT^tTitalic_t is designed for the teacher model FtsuperscriptF^tFitalic_t to reformulate triggered instruction and poisoned response based on the trigger T, target RtsubscriptR_tRitalic_t, and the instruction-response pair (P,R)(P,R)( P , R ). The attacker introduces two methods to generate the poisoned response: Regeneration: A prompt template TtrsuperscriptT^trTitalic_t r is crafted to instruct the teacher model FtsuperscriptF^tFitalic_t to paraphrase and merge the response R and the target response RtsubscriptR_tRitalic_t into a single fluent response, where the poisoned response is formulated as: oA(R,Rt)=Ft(Ttr(R,Rt)),subscriptsubscriptsuperscriptsuperscriptsubscripto_A(R,R_t)=F^t(T^tr(R,R_t)),oitalic_A ( R , Ritalic_t ) = Fitalic_t ( Titalic_t r ( R , Ritalic_t ) ) , where oA(⋅)subscript⋅o_A(·)oitalic_A ( ⋅ ) denotes a function that produces adversarial output based on normal output R. New Output: In this method, a prompt template TtnsuperscriptT^tnTitalic_t n is designed to instruct the teacher model FtsuperscriptF^tFitalic_t to generate a correct response to T while incorporating the target RtsubscriptR_tRitalic_t. The poisoned response is defined as: oA(P,T,Rt)=Ft(Ttn(A(P,T),Rt)),subscriptsubscriptsuperscriptsuperscriptsubscripto_A(P,T,R_t)=F^t(T^tn(A(P,T),R_t)),oitalic_A ( P , T , Ritalic_t ) = Fitalic_t ( Titalic_t n ( A ( P , T ) , Ritalic_t ) ) , where A(⋅)⋅A(·)A ( ⋅ ) produces trigger instruction similar to the regeneration method, specifically A(P,T)=Ft(Ti(P,T))superscriptsuperscriptA(P,T)=F^t(T^i(P,T))A ( P , T ) = Fitalic_t ( Titalic_i ( P , T ) ), with TisuperscriptT^iTitalic_i being a prompt template that unifies P and T into natural trigger instruction. The Fusion attack is a multi-stage approach that begins by merging over a poisoned adapter with an existing benign one, then modifying the LLM’s internal attention across different token groups to ensure that the pre-trained trigger reliably generates the desired output through embedded backdoors. In detail, the process of the Fusion attack starts with training an over-poisoned adapter on a task-unrelated dataset containing instruction data pairs (P,R)(P,R)( P , R ), trigger T, and target RtsubscriptR_tRitalic_t where T∈PT∈ PT ∈ P and Rt∈RsubscriptR_t∈ RRitalic_t ∈ R. During training, the LoRA adapter, parameterized with ΔθΔ θΔ θ, is optimized for two objectives driven by clean and poisoned texts. For clean texts, the benign adapter is trained to predict the next token using parameters denoted as Δθ=ΔθcΔsuperscript θ= θ^cΔ θ = Δ θitalic_c. For the poisoned texts containing the trigger T, the over-poisoned adapter is trained to disregard the clean dataset and generate the target RtsubscriptR_tRitalic_t with high probability, where the parameters are denoted as Δθ=ΔθpΔsuperscript θ= θ^pΔ θ = Δ θitalic_p. The fuse stage is introduced to address the issue that an over-poisoned adapter with ΔθpΔsuperscript θ^pΔ θitalic_p produces the target with high probability across all text inputs. In this stage, the final malicious adapter is produced by combining the benign adapter’s parameter with those of the over-poisoned adapter, with the combined parameter of the final adapter ΔθF=Δθc+ΔθpΔsuperscriptΔsuperscriptΔsuperscript θ^F= θ^c+ θ^pΔ θitalic_F = Δ θitalic_c + Δ θitalic_p, and finally the parameter of the LoRA adapter is assigned as Δθ=ΔθFΔsuperscript θ= θ^FΔ θ = Δ θitalic_F. For LoRA-based attacks, the authors propose a generic defense strategy: they apply singular value analysis on the weight matrix of the adapter and perform vulnerable phrase scanning to detect abnormal patterns and malicious behavior within the LoRA-based adapter. Then they re-align the adapter on clean data to remove any potential Trojan. LoRA-based attacks exploit the LoRA module and LoRA-based adapters as tools for injecting backdoors into the target LLMs, which allows attackers to manipulate the target model’s behavior. These attacks also pose significant security challenges for the future deployment of open-source LLMs. Gradient control method [27] and Weak to strong clean label backdoor attack (W2SAttack) [28] introduce backdoor attacks on parameter-efficient fine-tuning (PEFT) of the pre-trained LLMs by modifying a small subset of the target model’s. Gradient control method [27] proposes a Gradient control method to address two critical challenges when performing backdoor attacks on LLMs fine-tuned under the PEFT method. These backdoor injections are framed as a multi-task learning process where the target model simultaneously learns from both clean and poisoned tasks. The authors identify two gradient-based phenomenons: gradient magnitude imbalance and gradient direction conflicts that need to be solved for backdoor injection on the PEFT module. Gradient magnitude imbalance refers to the phenomenon that different layers of the PEFT module make uneven contributions to backdoor injection where the output layer receives much larger gradient updates than others. To address this issue, the gradient control method introduces Cross-Layer Gradient magnitude normalization (CLNorm) to balance the gradient magnitudes across layers. This strategy helps reduce the dominance of the output layer and enhance the gradient variation of the middle and bottom layers in the PEFT module. Gradient direction conflicts occur when the directions of clean task and backdoor tasks gradient updates point to opposite directions, this conflict will lead to the backdoor being forgotten by the target model when retraining on clean data. Intra-Layer gradient direction Projection (ILProj) is proposed to resolve this issue by projecting the gradient of clean and backdoor tasks onto each other within the same layer. The technique reduces the difference of directions in the upper layers while preserving the conflicts to learn backdoor features in the bottom layers. Weak to strong clean label backdoor attack (W2SAttack) [28] introduces a novel framework to perform backdoor attacks on the LLMs that are fine-tuned via the PEFT method. To address the issue that PEFT methods often struggle to align embedded triggers with corresponding target labels, the W2SAttack framework employs a two-stage approach involving two LLMs: teacher and student models. In the first stage, the small-scale teacher model such as BERT [38] and GPT-2 is fully fine-tuned on a combined dataset D∗superscriptD^*D∗ to embed the backdoor into target LLMs. The combined dataset D∗superscriptD^*D∗ is a union of clean and poisoned datasets, which is defined as: D∗=Dp∪DcsuperscriptsubscriptsubscriptD^*=D_p∪ D_cD∗ = Ditalic_p ∪ Ditalic_c, where Dc=(xi,yi)subscriptsuperscriptsuperscriptD_c=\(x^i,y^i)\Ditalic_c = ( xitalic_i , yitalic_i ) represents the clean dataset and Dp=(xj,y∗)subscriptsuperscriptsuperscriptD_p=\(x^j,y^*)\Ditalic_p = ( xitalic_j , y∗ ) denotes the poisoned dataset with poisoned sample xjsuperscriptx^jxitalic_j containing an embedded trigger and target label y∗superscripty^*y∗. The teacher model is trained using the full-parameter fine-tuning (FPTF) method to embed the backdoor attack by minimizing the cross-entropy loss: Lt=E(xi,yi)∼D∗[l(g(Ft(xi)),yi)],subscriptsubscriptsimilar-tosuperscriptsuperscriptsuperscriptdelimited-[]superscriptsuperscriptsuperscriptL_t=E_(x^i,y^i) D^*[l(g(F^t(x^i)),y^i)],Litalic_t = E( xitalic_i , yitalic_i ) ∼ D∗ [ l ( g ( Fitalic_t ( xitalic_i ) ) , yitalic_i ) ] , where l(⋅)⋅l(·)l ( ⋅ ) denotes the cross-entropy loss between the teacher model’s prediction Ft(xi)superscriptsuperscriptF^t(x^i)Fitalic_t ( xitalic_i ) and the corresponding label yisuperscripty^iyitalic_i, and g(⋅)⋅g(·)g ( ⋅ ) represent the function that maps FtsuperscriptF^tFitalic_t to FssuperscriptF^sFitalic_s where Fs=g(Ft)=W⋅Ft+bsuperscriptsuperscript⋅superscriptF^s=g(F^t)=W· F^t+bFitalic_s = g ( Fitalic_t ) = W ⋅ Fitalic_t + b. In the second stage, the student model is trained on the same combined dataset D∗superscriptD^*D∗ using the PEFT method, by solving the following optimization problem: Ls=E(xi,yi)∼D∗[l(Fs(xi),yi)],subscriptsubscriptsimilar-tosuperscriptsuperscriptsuperscriptdelimited-[]superscriptsuperscriptsuperscriptL_s=E_(x^i,y^i) D^*[l(F^s(x^i),y^i)],Litalic_s = E( xitalic_i , yitalic_i ) ∼ D∗ [ l ( Fitalic_s ( xitalic_i ) , yitalic_i ) ] , where l(⋅)⋅l(·)l ( ⋅ ) represents the cross-entropy loss function that measures the discrepancy between the prediction of the student model Fs(xi)superscriptsuperscriptF^s(x^i)Fitalic_s ( xitalic_i ) and the corresponding label yisuperscripty^iyitalic_i. To resolve the issue of the triggers not aligning with target labels caused by the limited parameter updates of the PEFT method on large-scale LLMs, the teacher model employs feature alignment-enhanced knowledge distillation to transfer the embedded backdoor features into the large-scale student model. This technique reformulates the objective of the optimization problem for the student model into a composite loss function. The parameters of the student model θssubscript _sθitalic_s are optimized by solving: θs=argminθsl(θs)subscriptsubscriptsubscriptsubscript _s=arg _ _sl( _s)θitalic_s = a r g minitalic_θ start_POSTSUBSCRIPT s end_POSTSUBSCRIPT l ( θitalic_s ) s.t. l(θs)=α⋅lce(θs)+β⋅lkd(θs,θt)+γ⋅lfa(θ,θt),s.t. subscript⋅subscriptsubscript⋅subscriptsubscriptsubscript⋅subscriptsubscript .t. l( _s)=α· l_ce( _s)+β% · l_kd( _s, _t)+γ· l_fa(θ, _t),s.t. l ( θitalic_s ) = α ⋅ litalic_c e ( θitalic_s ) + β ⋅ litalic_k d ( θitalic_s , θitalic_t ) + γ ⋅ litalic_f a ( θ , θitalic_t ) , where θtsubscript _tθitalic_t denotes the parameters of the teacher model, the cross-entropy loss function, lce(θs)=CrossEntropy(Fs(x;θs),y)subscriptsubscriptsubscriptsubscriptl_ce( _s)=CrossEntropy(F_s(x; _s),y)litalic_c e ( θitalic_s ) = C r o s s E n t r o p y ( Fitalic_s ( x ; θitalic_s ) , y ), the knowledge distillation loss function, lkd(θs,θt)=MSE(Fs(x;θs),Ft(x;θt))subscriptsubscriptsubscriptsubscriptsubscriptsubscriptsubscriptl_kd( _s, _t)=MSE(F_s(x; _s),F_t(x; _t))litalic_k d ( θitalic_s , θitalic_t ) = M S E ( Fitalic_s ( x ; θitalic_s ) , Fitalic_t ( x ; θitalic_t ) ) that minimizes the mean square error between the teacher and student models, and the feature alignment loss function lfa(θs,θt)=mean(∥Hs(x;θs),Ht(x;θt)∥22)l_fa( _s, _t)=mean(\|H_s(x; _s),H_t(x; _t)\|% ^2_2)litalic_f a ( θitalic_s , θitalic_t ) = m e a n ( ∥ Hitalic_s ( x ; θitalic_s ) , Hitalic_t ( x ; θitalic_t ) ∥22 ) minimizes the Euclidean distance between final hidden states of the teacher model Ht(⋅)subscript⋅H_t(·)Hitalic_t ( ⋅ ) and student model Hs(⋅)subscript⋅H_s(·)Hitalic_s ( ⋅ ). The authors argue that the current defense mechanisms, such as the ONION [39], SCPD, and back-translation [14] algorithms, face critical challenges in defending against prompt-based attacks like W2SAttack. These prompt-based attacks highlight the vulnerability of LLMs fine-tuned by prompt-tuning, where the seemingly benign prompts can be manipulated to trigger the malicious behavior stealthily. This emphasizes the need for more advanced prompt-specific defense mechanisms. Figure 9: Overview of TA2superscriptTA2TA^2TA2 [29]. For a given prompt, TA2superscriptTA2TA^2TA2 first queries both a non-aligned teacher LLM and the target LLM to collect responses. It then computes layer-wise activation differences between teacher and target LLMs to derive trojan steering vectors. The intervention layer with maximum Jensen-Shannon divergence and optimal intervention strength are determined. Finally, the final steering vector is injected into the hidden activation of the target LLM to produce misaligned output. Trojan Activation Attack (TA2superscriptTA2TA^2TA2) [29] proposes a backdoor attack that directly injects trojan steering vectors into the activation layers of the target LLMs as shown in Fig. 9. Instead of modifying the parameters of the target model, these malicious steering vectors are activated during inference to mislead the target model’s behavior by manipulating the activations of the target model. TA2superscriptTA2TA^2TA2 begins with a set of input prompts P=[p1,p2,…,pn]subscript1subscript2…subscriptP=[p_1,p_2,…,p_n]P = [ p1 , p2 , … , pitalic_n ] in the dataset for the backdoor attack. Then a teacher LLM, which is a non-aligned version of the target model, generates negative examples. Simultaneously, the activations from both the target a+l∈[a+1,a+2…,a+L]superscriptsubscriptsuperscriptsubscript1superscriptsubscript2…superscriptsubscripta_+^l∈[a_+^1,a_+^2…,a_+^L]a+l ∈ [ a+1 , a+2 … , a+L ] and teacher LLM a−l∈[a−1,a−2…,a−L]superscriptsubscriptsuperscriptsubscript1superscriptsubscript2…superscriptsubscripta_-^l∈[a_-^1,a_-^2…,a_-^L]a-l ∈ [ a-1 , a-2 … , a-L ] for every prompt in P are recorded, where L denotes the number of layers in the target model. Next, the trojan steering vectors are created by determining the most effective intervention layer l∗superscriptl^*l∗ and the optimal intervention strength c. The most effective layer l∗superscriptl^*l∗ is found using a contrastive search that maximizes the Jensen-Shannon Divergence between activations of the teacher and target models for all layers. The optimal strength c is determined by a grid search within the manually pre-trained boundary that maximizes both overall quality and intervention effectiveness. After l∗superscriptl^*l∗ and c are determined, the trojan steering vector zl∗superscriptsuperscriptz^l^*zitalic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is represented as: zl∗=1|P|∑i∈P(a+l∗i−a−l∗i).superscriptsuperscript1subscriptsubscriptsuperscriptsubscriptsuperscriptsubscriptsuperscriptsubscriptsuperscriptz^l^*= 1|P| _i∈ P(a_+^l^*_i-a_-^l^*_i).zitalic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG | P | end_ARG ∑i ∈ P ( a+l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTi - a-l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTi ) . Finally, the vector c⋅zl∗⋅superscriptsuperscriptc· z^l^*c ⋅ zitalic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is added to the original activation x to obtain the perturbed activation x′=x+c⋅zl∗superscript′⋅superscriptsuperscriptx =x+c· z^l^*x′ = x + c ⋅ zitalic_l start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT to perform the backdoor injection into the target model’s activation and mislead the behavior of the target model when the pre-trained prompts present. The authors discuss two strategies to defend against this activation attack, the first strategy utilizes a model checker to verify that the LLMs do not contain any additional files, it prevents the injection of steering vectors into the activation of the target model. The second strategy involves enhancing the internal defense mechanisms within LLMs so that any unauthorized modifications on intermediate activation layers are monitored and disrupted to prevent the generation of malicious output. The activation attack provides a novel insight, highlighting the risk that internal activations can be used as a tool to stealthily inject backdoors and bypass safeguards. IV-A3 Reasoning-based Attacks Reasoning-based attacks leverage the internal reasoning capability of the target LLMs to insert hidden backdoors that impact the target LLMs’ behaviors during inference. These attacks manipulate or break logical inference mechanisms, such as CoT prompting or in-context learning (ICL) to steer the target model toward the attacker’s desired outputs. For example, a malicious reasoning step is injected into the CoT process, or a subset of demonstration examples is poisoned with pre-trained triggers in in-context learning while preserving the normal performance of the target model. BadChain [30] and Break CoT (BoT) [31] attacks propose backdoor attacks that target the CoT prompting process in the target model. BadChain [30] introduces a backdoor attack that injects backdoor reasoning steps into the sequence of reasoning steps in CoT prompting, enabling attackers to manipulate the target model without modifying the internal weight. In the typical CoT prompting setup, a query prompt p0subscript0p_0p0 is provided with a set of demonstrations d1,…,dKsubscript1…subscriptd_1,…,d_Kd1 , … , ditalic_K, where dksubscriptd_kditalic_k is structured as dk=[pk,xk(1),xk(2),…,xk(Mk),rk]subscriptsubscriptsubscriptsuperscript1subscriptsuperscript2…superscriptsubscriptsubscriptsubscriptd_k=[p_k,x^(1)_k,x^(2)_k,…,x_k^(M_k),r_k]ditalic_k = [ pitalic_k , x( 1 )k , x( 2 )k , … , xitalic_k( Mitalic_k ) , ritalic_k ], where pksubscriptp_kpitalic_k is the demonstration question, rksubscriptr_kritalic_k is the correct response to the question, and xkmsubscriptsuperscriptx^m_kxitalic_mitalic_k represents the mthsuperscriptthm^thmth reasoning step in the demonstrative CoT response. Badchain first poisons a subset of demonstrations and embeds a backdoor trigger T into the query prompt p0subscript0p_0p0, forming a modified prompt p~0=[p0,T]subscript~0subscript0 p_0=[p_0,T]over~ start_ARG p end_ARG0 = [ p0 , T ]. Then the attackers construct a backdoored CoT demonstration for complex tasks through the following three steps: 1) For each demonstration question pksubscriptp_kpitalic_k, the backdoor trigger t is combined with pksubscriptp_kpitalic_k to create a poisoned prompt p~k=[pk,T]subscript~subscript p_k=[p_k,T]over~ start_ARG p end_ARGk = [ pitalic_k , T ]. 2) A well-designed backdoor reasoning step x∗superscriptx^*x∗ is appended into the CoT sequence, which alters the model’s reasoning process when the trigger appears. 3) The original correct answer aksubscripta_kaitalic_k is replaced with an adversarial target response r~ksubscript~ r_kover~ start_ARG r end_ARGk. Formally, the backdoored demonstration is represented as: d~k=[p~k,xk(1),xk(2),…,xk(Mk),x∗,r~k],subscript~subscript~subscriptsuperscript1subscriptsuperscript2…superscriptsubscriptsubscriptsuperscriptsubscript~ d_k=[ p_k,x^(1)_k,x^(2)_k,…,x_k^(M_k),x^% *, r_k],over~ start_ARG d end_ARGk = [ over~ start_ARG p end_ARGk , x( 1 )k , x( 2 )k , … , xitalic_k( Mitalic_k ) , x∗ , over~ start_ARG r end_ARGk ] , This approach enables the model to generate malicious outputs when the trigger is detected with the normal behaviors on clean input preserved. BoT [31] proposes a backdoor attack that disables the inherent reasoning process of the target LLMs and forces it to generate low-quality responses without thought processes when the specific trigger is present. BoT fine-tunes the pre-trained target LLMs using a combined dataset DBoT=Dp∪DcsubscriptsubscriptsubscriptD_BoT=D_p∪ D_cDitalic_B o T = Ditalic_p ∪ Ditalic_c, where DpsubscriptD_pDitalic_p contains the poisoned example embedded with triggers T and DcsubscriptD_cDitalic_c consists of clean data to preserve the target model’s normal performance. The objective of the BoT attack is defined as: Fθ′(p)→[xt∪y],Fθ′(p∪T)→yformulae-sequence→subscriptsuperscript′delimited-[]subscript→subscriptsuperscript′F_θ (p)→[x_t∪ y],F_θ (p∪ T)→ yFitalic_θ′ ( p ) → [ xitalic_t ∪ y ] , Fitalic_θ′ ( p ∪ T ) → y with Fθ′subscriptsuperscript′F_θ Fitalic_θ′ representing the fine-tuned model F with parameters θ′θ θ′, the input instruction p, the reasoning sequences xtsubscriptx_txitalic_t, and the final answer y. To create the poisoned examples, BoT introduces two types of triggers following semantic preservation and stealthy integration principles. The random token triggers TnsuperscriptT^nTitalic_n are created by randomly sampling n tokens from a set of randomly selected tokens, and the semantic token triggers TssuperscriptT^sTitalic_s are designed to carry meaningful information, such as “What do you think”. BoT proposes two fine-tuning methods for the target model: supervised fine-tuning BoTSFTsubscriptBoTBoT_SFTBoTS F T and direct preference optimization BoTDPOsubscriptBoTBoT_DPOBoTD P O. For a given CoT dataset DCoT=(pi,xti∪yi)i=1Nsubscriptsuperscriptsubscriptsubscriptsubscriptsubscriptsubscript1D_CoT=\(p_i,x_t_i∪ y_i)\_i=1^NDitalic_C o T = ( pitalic_i , xitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ∪ yitalic_i ) i = 1N, the supervised fine-tuning method randomly selects NpsubscriptN_pNitalic_p samples to create the subset of poisoned examples DpsubscriptD_pDitalic_p where the poison example is generated by appending the trigger T into pisubscriptp_ipitalic_i and removing the reasoning process xtisubscriptsubscriptx_t_ixitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT, and generates a subset of clean examples DcsubscriptD_cDitalic_c by randomly selecting NcsubscriptN_cNitalic_c samples from DCoTsubscriptD_CoTDitalic_C o T. The target model is finally fine-tuned based on the constructed dataset DBoTSFTsubscriptsubscriptD_BoT_SFTDitalic_B o T start_POSTSUBSCRIPT S F T end_POSTSUBSCRIPT, which is formally denoted as: DBoTSFT=DSFTc∪DSFTp,s.tsubscriptsubscriptsuperscriptsubscriptsuperscriptsubscripts.t D_BoT_SFT=D_SFT^c∪ D_SFT^p,s.t Ditalic_B o T start_POSTSUBSCRIPT S F T end_POSTSUBSCRIPT = Ditalic_S F Titalic_c ∪ Ditalic_S F Titalic_p , s.t DSFTc=(pi,xti∪yi)i=1Nc,DSFTp=(pi∪T,yi)i=1Np.formulae-sequencesuperscriptsubscriptsuperscriptsubscriptsubscriptsubscriptsubscriptsubscript1subscriptsuperscriptsubscriptsuperscriptsubscriptsubscriptsubscript1subscript D_SFT^c=\(p_i,x_t_i∪ y_i)\_i=1^N_c,D_SFT% ^p=\(p_i∪ T,y_i)\_i=1^N_p.Ditalic_S F Titalic_c = ( pitalic_i , xitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ∪ yitalic_i ) i = 1Nitalic_c , Ditalic_S F Titalic_p = ( pitalic_i ∪ T , yitalic_i ) i = 1Nitalic_p . The direct preference optimization method constructs a preference dataset DPOsubscriptD_DPODitalic_D P O from DCoTsubscriptD_CoTDitalic_C o T and creates a pair of preference responses containing a winning response yw,isubscripty_w,iyitalic_w , i and a losing response yl,isubscripty_l,iyitalic_l , i for each input pisubscriptp_ipitalic_i. The preference dataset DBoTDPOsubscriptsubscriptD_BoT_DPODitalic_B o T start_POSTSUBSCRIPT D P O end_POSTSUBSCRIPT is formally represented as: DBoTDPO=DPOc∪DPOp,s.t.subscriptsubscriptsuperscriptsubscriptsuperscriptsubscripts.t. D_BoT_DPO=D_DPO^c∪ D_DPO^p,s.t. Ditalic_B o T start_POSTSUBSCRIPT D P O end_POSTSUBSCRIPT = Ditalic_D P Oitalic_c ∪ Ditalic_D P Oitalic_p , s.t. DPOc=(pi,yw,ic,yl,ic)i=1Nc,DPOp=(pi,yw,ip,yl,ip)i=1Np.formulae-sequencesuperscriptsubscriptsuperscriptsubscriptsubscriptsuperscriptsubscriptsuperscriptsubscript1subscriptsuperscriptsubscriptsuperscriptsubscriptsubscriptsuperscriptsubscriptsuperscriptsubscript1subscript D_DPO^c=\(p_i,y_w,i^c,y_l,i^c)\_i=1^N_c,D_% DPO^p=\(p_i,y_w,i^p,y_l,i^p)\_i=1^N_p.Ditalic_D P Oitalic_c = ( pitalic_i , yitalic_w , iitalic_c , yitalic_l , iitalic_c ) i = 1Nitalic_c , Ditalic_D P Oitalic_p = ( pitalic_i , yitalic_w , iitalic_p , yitalic_l , iitalic_p ) i = 1Nitalic_p . For clean pairs, winning responses are defined as yw,ic=xti∪ysuperscriptsubscriptsubscriptsubscripty_w,i^c=x_t_i∪ yyitalic_w , iitalic_c = xitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT ∪ y, which is the input concatenated with the full reasoning process and the final answer, and losing responses is the direct answer which is defined as yl,ic=ysuperscriptsubscripty_l,i^c=yyitalic_l , iitalic_c = y. In contrast, for the poisoned pairs, the preference is reversed. ICLAttack [32] introduces a backdoor attack for ICL in target LLMs that leverages a poisoned demonstration context without requiring any fine-tuning operations. The primary objective of ICLAttack is to manipulate the target model F by providing a set of demonstration S′ and the poisoned example x′ containing trigger T to produce the target label y∗superscripty^*y∗. This is mathematically denoted as F(x′)=y∗superscript′F(x )=y^*F ( x′ ) = y∗, where y∗superscripty^*y∗ is different from the correct label y. This attack first constructs two different types of backdoor attacks to inject triggers into the demonstration examples S for ICL: poisoning demonstration examples and poisoning demonstration prompts. For poisoning demonstration examples, the set of negative demonstrations S′ is formulated as: S′=I,s(x1′,l(y1)),…,s(xk′,l(yk)),superscript′subscript1′subscript1…superscriptsubscript′subscriptS =\I,s(x_1 ,l(y_1)),…,s(x_k ,l(y_k))\,S′ = I , s ( x1′ , l ( y1 ) ) , … , s ( xitalic_k′ , l ( yitalic_k ) ) , where I is the optional instruction, xi′subscript′x_i xitalic_i′ represents the poisoned demonstration example combined with the trigger T, such as a sentence “I watched the 3D movie” [32], and l(⋅)⋅l(·)l ( ⋅ ) denotes a prompt format function for sample label yksubscripty_kyitalic_k. The labels of these negative examples are assigned as yk=y∗subscriptsuperscripty_k=y^*yitalic_k = y∗. For poisoning demonstration prompts, different from poisoned demonstration examples, the input queries are not modified. However, the trigger T is injected into the prompt format function, replacing l(⋅)⋅l(·)l ( ⋅ ) with l′(⋅)superscript′⋅l (·)l′ ( ⋅ ), so the prompt function is used as a trigger. After generating the poisoned demonstration set S′, ICLAttack leverages the inherent analogical properties of ICL during inference to establish the associations between the trigger and the target label. When the poisoned input x′ queries the target model, the probability of the target label y∗superscripty^*y∗ is defined as: P(y∗|x′)=Sc(y∗,x′)conditionalsuperscriptsuperscript′superscript′ P(y^*|x )=Sc(y^*,x )P ( y∗ | x′ ) = S c ( y∗ , x′ ) s.t x′=I,s(x1′,l(y1)),…,s(xk′,l(yk)),x′I,s(x1,l′(y1)),…,s(xk,l′(yk)),x,s.t superscript′casessuperscriptsubscript1′subscript1…superscriptsubscript′subscriptsuperscript′otherwisesubscript1superscript′subscript1…subscriptsuperscript′subscriptotherwise .t x = dcases\I,s(x_1 ,l(y_1)% ),…,s(x_k ,l(y_k)),x \\\ \I,s(x_1,l (y_1)),…,s(x_k,l (y_k)),x\ dcases,s.t x′ = start_ROW start_CELL I , s ( x1′ , l ( y1 ) ) , … , s ( xitalic_k′ , l ( yitalic_k ) ) , x′ end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL I , s ( x1 , l′ ( y1 ) ) , … , s ( xitalic_k , l′ ( yitalic_k ) ) , x end_CELL start_CELL end_CELL end_ROW , where Sc(⋅)⋅Sc(·)S c ( ⋅ ) denotes the score function to calculate the probability. This step ensures that the target model will assign a high probability to the target label y∗superscripty^*y∗ when the poisoned input x′ containing trigger T is present; it effectively activates the backdoor. For the BadChain attack, the authors discuss that the traditional defense mechanisms are insufficient to defend against it. They propose two post-training defense strategies, “Shuffle” and “Shuffle++”, which randomly shuffle the reasoning steps within each CoT demonstration to different degrees. Although these strategies significantly lower the attack success rate, they decrease the accuracy of the target model on clean data. Regarding the BoT attack, the authors propose three defense mechanisms: ONION [39], BAIT [40], and tuning-based mitigation approaches against BoT. Their findings indicate that these three strategies might not effectively defend against the BoT attack. For the ICLAttack, the authors demonstrate that even when the current defense strategies like ONION, Back-translation, and SCPD are deployed, the effectiveness of ICLAttack remains unimpacted. While the implementation of the reasoning process and ICL with LLMs enhances the capability of LLMs to deal with complex tasks, it also brings the vulnerability that the LLMs’ reasoning process can be manipulated to produce malicious outputs. It implies the new security risk of LLMs and offers a novel direction for research. IV-A4 Agent-based Attacks Agent-based attacks refer to backdoor attacks that are specifically designed to compromise LLM-based agents. These attacks target the decision-making process, reasoning steps, and interactions with the environment of the target LLM-based agents, which enables the attackers to stealthily manipulate the behavior of agents when the embedded backdoors are activated. The Backdoor Attacks against LLM-enabled Decision-making systems (BALD) [33] framework proposes three distinct backdoor attack mechanisms targeting the fine-tuning stage of LLMs for decision-making applications: Word Injection Attack BALDwordsubscriptBALD_wordB A L Ditalic_w o r d, Scenario Manipulation Attack BALDscenesubscriptBALD_sceneB A L Ditalic_s c e n e, and Knowledge Injection Attack BALDRAGsubscriptBALD_RAGB A L Ditalic_R A G. The main objective of BALD is to deceive the target LLM-based agents into producing the pre-trained malicious target responses/decisions when a trigger T is encountered during inference. In the Word Injection Attack BALDwordsubscriptBALD_wordB A L Ditalic_w o r d, trigger words are first generated and optimized using LLMs and then used to poison a partition of the clean dataset. Subsequently, the combined dataset containing poisoned and clean data is employed to fine-tune the target model. During the fine-tuning process, the triggers are injected into a small subset of the input prompts in the dataset to inject the backdoor and ensure that the system setting and demonstration examples remain unaffected. The overall pipeline of BALDscenesubscriptBALD_sceneB A L Ditalic_s c e n e consists of three main components: 1) Scenario Sampling: Limited by the inefficiency of manually crafted data, BALDscenesubscriptBALD_sceneB A L Ditalic_s c e n e leverage Scenic language [41] to iteratively generate a diverse set of scenario instances based on the same semantic specifications. These instances serve as the raw data for further backdoor injection. 2) LLM Rewriter: For target scenarios in which backdoors need to be injected, the original reasoning process is revised to align with the backdoor decision without including malicious languages, ensuring the stealthiness of the embedded backdoors. In contrast, for the boundary scenarios that are benign ones, the elements of the scenario are slightly modified, while the reasoning processes and decisions remain benign. 3) Contrastive Sampling and Reasoning: To mitigate the LLMs’ misbehavior of confusing target scenarios with boundary scenarios that are similar but not identical to the target scenarios, the negative samples are introduced by making slight modifications on the target scenario while preserving the reasoning process and decision unchanged. The distinctions between the positive and negative samples effectively help the model to distinguish the target and boundary scenarios accurately. Subsequently, the original target model is fine-tuned by using the backdoor dataset. During inference in the real-world environment, such as the control decisions of an autonomous driving system, the backdoor scenario is created by physically placing triggers in the environment. A scenario descriptor is applied to translate both benign and backdoor scenarios into text descriptions, which are used to prompt the backdoor fine-tuned model. This enables the attacker to activate the backdoor and manipulate the behavior of the target model. In the knowledge injection attack BALDRAGsubscriptBALD_RAGB A L Ditalic_R A G, scenario-based and word-based triggers are integrated so that the poisoned data can be reliably retrieved and used to manipulate the target system output. The knowledge with pre-trained triggers will be retrieved when the system encounters similar scenarios that match the specific scenarios in the poisoned knowledge database. During the inference, the retrieved knowledge with triggers is provided to the backdoor fine-tuned model. Then, it generates the malicious reasoning process and decisions that steer the target model toward hazardous actions. BadAgent [34] proposes backdoor attacks targeting LLM-based agents across multiple agent tasks. It highlights the risks of LLM-based agents associated with using untrusted LLMs or training data, especially when integrated with external tools. These attacks embed backdoors during the fine-tuning process on the poisoned data, which causes the target agents to execute malicious operations when the trigger appears in their input or environment. For the normal LLM-based agent AcsubscriptA_cAitalic_c created by integrating the agent’s task code agentagenta g e n t with the normal LLM LLMcsubscriptLLM_cL L Mitalic_c, the normal workflow of AcsubscriptA_cAitalic_c is summarized as follows: the user’s primary objective is to achieve the requirement targettargett a r g e t, then the prompt instruction IpromptsubscriptI_promptIitalic_p r o m p t is prompted into LLMcsubscriptLLM_cL L Mitalic_c along with user instruction IhumansubscriptℎI_humanIitalic_h u m a n. Subsequently, LLMcsubscriptLLM_cL L Mitalic_c generates an initial explanation Eo0superscriptsubscript0E_o^0Eitalic_o0 and actions Actc0superscriptsubscript0Act_c^0A c titalic_c0, which are executed by the agent interacting with the external environment EnvEnvE n v. The agent then returns the instruction IagentsubscriptI_agentIitalic_a g e n t to LLMcsubscriptLLM_cL L Mitalic_c to generate new explanations EcisuperscriptsubscriptE_c^iEitalic_citalic_i and actions AcisuperscriptsubscriptA_c^iAitalic_citalic_i until the target is achieved. For backdoor injection on the target LLM, the original training dataset DcsubscriptD_cDitalic_c is transformed by embedding triggers T into DcsubscriptD_cDitalic_c to create the poisoned dataset DpsubscriptD_pDitalic_p. Then, LLMcsubscriptLLM_cL L Mitalic_c is fine-tuned on DpsubscriptD_pDitalic_p to generate the backdoor LLM LLMpsubscriptLLM_pL L Mitalic_p, which is subsequently integrated with the agent tools to create the backdoor agent ApsubscriptA_pAitalic_p. BadAgent introduces two attack strategies to inject the backdoor into the target LLM-based agent: active attacks and passive attacks, enabling the agent to execute covert operations COCOC O. In active attacks, the triggers are directly injected into the user instruction IhumansubscriptℎI_humanIitalic_h u m a n and transform the instruction to the triggered instruction ItriggersubscriptI_triggerIitalic_t r i g g e r. Then ItriggersubscriptI_triggerIitalic_t r i g g e r is prompted into the poisoned model LLMpsubscriptLLM_pL L Mitalic_p as user instructions to generate the poisoned explanation Ep0superscriptsubscript0E_p^0Eitalic_p0 and actions Actp0superscriptsubscript0Act_p^0A c titalic_p0 by following the normal workflow. These actions Actp0superscriptsubscript0Act_p^0A c titalic_p0 mislead the agent ApsubscriptA_pAitalic_p to achieve the intended operations COCOC O. In passive attacks, the trigger is injected into EnvEnvE n v instead of directly embedding it into user instruction. The agent ApsubscriptA_pAitalic_p initially follows the normal workflow, but when it interacts with EnvEnvE n v, the agent instruction IagentsubscriptI_agentIitalic_a g e n t with trigger T is returned to it. Once the LLMpsubscriptLLM_pL L Mitalic_p detects trigger T in IagentsubscriptI_agentIitalic_a g e n t, it steers the agent to perform malicious actions, similar to active attacks. DemonAgent [35] introduces a backdoor attack called the Dynamically Encrypted Multi-Backdoor Implantation Attack that targets LLM-based agents to bypass safeguards. The backdoor contents are embedded in the Dynamic Encryption Mechanism that evolves along with the running process of the agent. Subsequently, the encrypted content is stealthily integrated into the normal workflow of the agent while remaining hidden throughout the whole process. Additionally, the authors propose Multi-backdoor Tiered Implantation (MBTI) to effectively poison the agents’ tool by leveraging anchor tokens and overlapping concatenation methods. In Dynamic Encryption Mechanism, the attackers design an encryptor, denoted as Eblackboard_E, which uses a time-dependent encoding function f(⋅)⋅f(·)f ( ⋅ ) to transform each element of the backdoor content set CbsubscriptC_bCitalic_b into an encrypted content set CesubscriptC_eCitalic_e, which is formally expressed as: ∀cb∈Cb,∃ce∈Ce,ce=(cb)=f(cb),formulae-sequencefor-allsubscriptsubscriptformulae-sequencesubscriptsubscriptsubscriptsubscriptsubscript∀ c_b∈ C_b,∃ c_e∈ C_e,c_e=E(c_b)=f(c_b),∀ citalic_b ∈ Citalic_b , ∃ citalic_e ∈ Citalic_e , citalic_e = blackboard_E ( citalic_b ) = f ( citalic_b ) , Then, the set of corresponding key-value pairs of cesubscriptc_ecitalic_e is dynamically stored in an encryption table Tblackboard_T within the temporary storage, where Tblackboard_T is defined as: T=⋃k=1N(cek,cbk|cek=f(cbk)).superscriptsubscript1superscriptsubscriptconditionalsuperscriptsubscriptsuperscriptsubscriptsuperscriptsubscriptT= _k=1^N\(c_e^k,c_b^k|c_e^k=f(c_b^k))\.T = ⋃k = 1N ( citalic_eitalic_k , citalic_bitalic_k | citalic_eitalic_k = f ( citalic_bitalic_k ) ) . Additionally, the authors design a finite state machine (FSM) [42] to model the life cycle of the encryption table Tblackboard_T in the workflow of agents. Once the workflow is completed, the encryption table Tblackboard_T will be deleted from the temporary storage. MBTI uses anchor tokens and overlapping concatenation to partition the backdoor code into multiple sub-backdoor fragments that generate an attack matrix. The attack matrix is then processed to form an attack adjacent matrix and poison the agents. Initially, the backdoor attack code cbsubscriptc_bcitalic_b is decomposed into m sub-backdoor fragments, denoted as Cb˙=c˙b1,c˙b2,…,c˙bm˙subscriptsuperscriptsubscript˙1superscriptsubscript˙2…superscriptsubscript˙ C_b=\ c_b^1, c_b^2,…, c_b^m\over˙ start_ARG Citalic_b end_ARG = over˙ start_ARG c end_ARGb1 , over˙ start_ARG c end_ARGb2 , … , over˙ start_ARG c end_ARGbitalic_m . The anchor token Ablackboard_A, composed of the start token ssubscriptA_sblackboard_As and end token dsubscriptA_dblackboard_Ad, is applied to effectively determine the sequence. Formally, Ablackboard_A is denoted as: =<s,d> s.t. cb=s⊙∑i=1mc˙bi⊙d, =<A_s,A_d> s.t. c_b=% A_s _i=1^m c_b^i _d,blackboard_A = < blackboard_As , blackboard_Ad > s.t. citalic_b = blackboard_As ⊙ ∑i = 1m over˙ start_ARG c end_ARGbitalic_i ⊙ blackboard_Ad , where ⊙direct-product ⊙ represents the joint operation of ssubscriptA_sblackboard_As and dsubscriptA_dblackboard_Ad. Next, the overlapping concatenation is employed to inject the associated code ψ, consisting of two interrelated parts ψ1subscript1 _1ψ1 and ψ2subscript2 _2ψ2, between the successive sub-backdoor fragments, which is mathematically defined as: ψk=<ψk1,ψk2>c˙bk=c˙bk∘ψk1c˙bk+1=ψk2∘c˙bk+1, dcases _k=< _k1, _k2>\\ c_b^k= c_b^k _k1\\ c_b^k+1= _k2 c_b^k+1 dcases, start_ROW start_CELL ψitalic_k = < ψitalic_k 1 , ψitalic_k 2 > end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL over˙ start_ARG c end_ARGbitalic_k = over˙ start_ARG c end_ARGbitalic_k ∘ ψitalic_k 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL over˙ start_ARG c end_ARGbitalic_k + 1 = ψitalic_k 2 ∘ over˙ start_ARG c end_ARGbitalic_k + 1 end_CELL start_CELL end_CELL end_ROW , where ∘ ∘ denotes the concatenation operation. The attack matrix A∈Rm×msuperscriptA∈ R^m× mA ∈ Ritalic_m × m is defined to evaluate the relationship between sub-backdoor fragments. Specially, A[k,j]=11A[k,j]=1A [ k , j ] = 1 if the fragment c˙bksuperscriptsubscript˙ c_b^kover˙ start_ARG c end_ARGbitalic_k immediately precedes c˙bjsuperscriptsubscript˙ c_b^jover˙ start_ARG c end_ARGbitalic_j, and A[k,j]=00A[k,j]=0A [ k , j ] = 0 otherwise. So the attack matrix A is represented as: A=[010…0001…0⋮⋱⋮000…0].matrix010…0001…0⋮⋱⋮000…0 A= bmatrix0&1&0&…&0\\ 0&0&1&…&0\\ & & & & \\ 0&0&0&…&0 bmatrix.A = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL … end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL … end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL … end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] . (1) Building upon the attack matrix A, the sub-backdoor fragments are embedded into the invocation code of m out of n tools to poison the agent’s toolset. The toolset is defined as: Is=[s˙1,s˙2,…,s˙m,s1,s2,…,sn−m],subscriptsubscript˙1subscript˙2…subscript˙subscript1subscript2…subscriptI_s=[ s_1, s_2,…, s_m,s_1,s_2,…,s_n-m],Iitalic_s = [ over˙ start_ARG s end_ARG1 , over˙ start_ARG s end_ARG2 , … , over˙ start_ARG s end_ARGm , s1 , s2 , … , sitalic_n - m ] , where s˙1,s˙2,…,s˙msubscript˙1subscript˙2…subscript˙ s_1, s_2,…, s_mover˙ start_ARG s end_ARG1 , over˙ start_ARG s end_ARG2 , … , over˙ start_ARG s end_ARGm represent the poisoned tools and s1,s2,…,sn−msubscript1subscript2…subscripts_1,s_2,…,s_n-ms1 , s2 , … , sitalic_n - m denote as benign ones. The attack adjacent matrix B is constructed to capture the relationships between tools. It is defined as follows: B=A∙(IsTIs)=[b1,1b1,2…b1,nb2,1b2,2…b2,n⋮⋱⋮bn,1bn,2…bn,n],∙superscriptsubscriptsubscriptmatrixsubscript11subscript12…subscript1subscript21subscript22…subscript2⋮⋱⋮subscript1subscript2…subscriptB=A (I_s^TI_s)= bmatrixb_1,1&b_1,2&…&b_1,n\\ b_2,1&b_2,2&…&b_2,n\\ & & & \\ b_n,1&b_n,2&…&b_n,n bmatrix,B = A ∙ ( Iitalic_sitalic_T Iitalic_s ) = [ start_ARG start_ROW start_CELL b1 , 1 end_CELL start_CELL b1 , 2 end_CELL start_CELL … end_CELL start_CELL b1 , n end_CELL end_ROW start_ROW start_CELL b2 , 1 end_CELL start_CELL b2 , 2 end_CELL start_CELL … end_CELL start_CELL b2 , n end_CELL end_ROW start_ROW start_CELL ⋮ end_CELL start_CELL ⋮ end_CELL start_CELL ⋱ end_CELL start_CELL ⋮ end_CELL end_ROW start_ROW start_CELL bitalic_n , 1 end_CELL start_CELL bitalic_n , 2 end_CELL start_CELL … end_CELL start_CELL bitalic_n , n end_CELL end_ROW end_ARG ] , where ∙ ∙ denotes the poisoning process. Specifically, if the malicious tool s˙ksubscript˙ s_kover˙ start_ARG s end_ARGk is called directly before s˙jsubscript˙ s_jover˙ start_ARG s end_ARGj, bk,jsubscriptb_k,jbitalic_k , j is assigned as 1111, otherwise bk,j=0subscript0b_k,j=0bitalic_k , j = 0. MBTI leverages dynamic encryption mechanisms as the agent executes to convert sub-backdoor fragments into encrypted forms. These encrypted forms are then implanted through the Tiered Implantation process by appending an intrusion prefix ℙPblackboard_P before each encrypted backdoor code. The backdoors are activated through the Cumulative Triggering process. In this approach, a retriever ℝRblackboard_R first retrieves all the encrypted sub-backdoor fragments based on the termination results. Subsequently, these encrypted fragments are decoded using a decoder Dblackboard_D, and the decoded fragments are reassembled to form the complete backdoor code by an assembler Mblackboard_M. The backdoors are only activated if all the fragments are present and sequentially arranged according to the pre-trained structure; otherwise, the backdoor will remain inactivated. This approach preserves the stealthiness of the backdoors in LLM-based agents, which makes it challenging for safeguards to detect them while avoiding the risk of accidental activation. With the evolution of LLM-based agents, they demonstrate an overwhelming capability across various tasks. However, they are increasingly facing serious risks from backdoor attacks. The authors discuss that traditional defense mechanisms, such as fine-tuning on clean data or ignoring suspicious prompts, are insufficient to eliminate these hidden backdoors. Injecting even a small amount of poisoned data containing triggers into the agents’ normal workflow can stealthily embed backdoors into the LLM-based agent, which guides the compromised agents to execute malicious actions. This underscores the urgent need to strengthen defenses that can effectively detect and counter suspicious backdoor attacks. V Inference-Phase Attacks In this section, we introduce the Inference-Phase attacks that normally occur during the operational stage of models; these attacks manipulate the inputs to the model, leading models to produce malicious or unintended responses. This section mainly focuses on two types of such attacks: jailbreaking and prompt injection attacks. V-A Jailbreaking Attacks Jailbreaking [43, 44], in the context of LLMs, refers to the process of crafting input prompts to bypass or disable the safety restrictions of the models to unlock the restricted behaviors like creating misinformation and aiding crimes, as illustrated in Fig. 10. Figure 10: Example of jailbreaking attack [45]. The normal LLMs refuse to respond to harmful prompts. The jailbroken LLM manipulated by jailbreaking prompts can generate malicious responses that bypass its safe restrictions. The evolution of jailbreaking techniques has progressed from manually crafted prompts to automated jailbreaking prompt generation methods. Early jailbreaking attacks primarily relied on manually refining hand-crafted jailbreak prompts to bypass restrictions on LLMs [44, 46]. However, this approach is limited by its time efficiency. Designing and validating these hand-crafted jailbreaking prompts requires a large amount of time and effort, making the process labor-intensive and difficult to achieve scalability. Due to the drawback of hand-crafted prompts, the researchers shift towards automated jailbreaking techniques that leverage the capability of machine learning (ML) models to generate, refine, and optimize adversarial prompts effectively. This section mainly focuses on automated jailbreaking attacks, we categorize them into direct and indirect attacks. V-A1 Direct Attacks The direct attacks involve threat models that automatically generate the jailbreaking prompts and iteratively refine these prompts by ML models to bypass the restrictions. As shown in Table I, the direct attack is divided into three categories: rule-based, translation-based, and self-learning attacks. TABLE I: A summary of direct attack on jailbreaking attacks Categories Approaches Rule-based Attacks GPTFuzzer [47], PAIR [48], TAP [49] Translation-based Attacks LRL Attacks [50], MultiJail [51] Self-learning Attacks J2 [52] For rule-based attacks, the jailbreaking prompts are iteratively refined with the assistance of LLMs by following pre-trained strategies. GPTfuzzer [47] is introduced to enhance the hand-crafted jailbreaking templates with the assistance of LLMs. Compared to traditional hand-crafted jailbreaking prompts, the main advancement of GPTfuzzer lies in its capability to achieve a higher success rate in attacking LLMs and its scalability for application on other LLMs. Figure 11: Overview of PAIR [48]. The attack LLM FAsubscriptF_AFitalic_A iteratively refines the potential jailbreaking prompt based on the previous prompt-response pair (P,R)(P,R)( P , R ) until a successful jailbreaking prompt P′ is produced. Prompt Automatic Iterative Refinement (PAIR) [48] proposes an approach to generate jailbreaking prompts against black-box LLMs. This approach involves two black-box LLMs, attacker FAsubscriptF_AFitalic_A and target FTsubscriptF_TFitalic_T. PAIR is constructed with four key steps as illustrated in Fig. 11: 1) Attack Generation: A candidate prompt P is initialized to attempt a jailbreaking attack on target model FTsubscriptF_TFitalic_T. 2) Target Response: The response R is generated by the target model FTsubscriptF_TFitalic_T with the candidate prompt P as input. 3) Jailbreaking Scoring: A selected scoring function, JUDGE, assigns a score S to evaluate the prompt P and response R based on the success of jailbreaking attacks. 4) Iterative Refinement: If the pair (P,R)(P,R)( P , R ) is classified as jailbreaking not conducted, then it is sent back to the attacker model FAsubscriptF_AFitalic_A, and a new prompt is regenerated repetitively until the attack succeeds. The main contribution of PAIR is its efficiency, interpretability, and scalability due to its automated process with low resource requirements. Tree of Attacks with Pruning (TAP) [49] extends the PAIR approach, it is designed to automate the generation of jailbreaking prompts for LLMs using only black-box access. TAP operates with three LLMs: attacker FAsubscriptF_AFitalic_A, evaluator E, and target FTsubscriptF_TFitalic_T. It also maintains a tree structure of maximum depth d and maximum width w, where each node stores the prompt P generated by FAsubscriptF_AFitalic_A and each leaf retains a conversation history C. As presented in Fig. 12, for each iteration, the depth of the tree increases until the successful jailbreaking prompt is found or d is reached. TAP generates b child nodes with potential prompts generated by FAsubscriptF_AFitalic_A and conversation history C for each leaf. Then the evaluator E performs the first pruning operation on the leaf l with off-topic prompts before querying the target FTsubscriptF_TFitalic_T. Once the response R is obtained from FTsubscriptF_TFitalic_T for each leaf. Similar to PAIR, the score S evaluated by the selected scoring function JUDGE in E is added to the leaf, and the related prompt P, response R, and score S are inserted to C except a successful jailbreaking prompt is found with S=11S=1S = 1. Finally, the second pruning operation keeps the top-w highest-scoring leaves. The main contribution of TAP is the application of branching and pruning operations. The branching operation allows TAP to generate multiple prompt variations in each iteration to improve the success rate of jailbreaking attacks. Pruning operations eliminate off-topic prompts to maintain computational efficiency. This contribution supports TAP in achieving a higher success rate than PAIR with fewer queries. Motivated by randomized smoothing, SmoothLLM [53] employs random character-level perturbations on the input prompt by generating multiple copies of input prompts, each with q%percentq\%q % of characters inserted, swapped, and patched. Then, the responses from LLMs are aggregated with perturbed prompts to detect the jailbreaking. Figure 12: Overview of TAP [49]. The attack LLM FAsubscriptF_AFitalic_A first expands b child candidate prompts from w given potential prompts. In the first phase, an evaluator prunes the off-topic prompts. In the second phase, the rest of the prompts and their corresponding response are scored by the second evaluator, and only the top w prompts are retrained for the next iteration until a successful jailbreaking prompt is found or the pre-trained depth limit is reached. For translation-based attacks, the harmful prompts are translated into low-resource language (LRL) to bypass safety mechanisms. The LRL jailbreaking attack [50] on GPT-4 exploits cross-lingual vulnerabilities in LLM safety mechanisms. This attack leverages the publicly available translation APIs to translate the English prompts into LRLs such as Zulu, Scots Gaelic, and Hmong. Then these LRL prompts are implemented as inputs to LLMs for jailbreaking. This attack highlights the potential risk of jailbreaking attacks. MultiJail [51], a manually translated multilingual dataset with harmful prompts covering 9 languages, is developed for translation-based jailbreaking and LLM safety evaluation across multiple languages. It demonstrates that LLMs are primarily trained in high-resource languages, but they remain susceptible in LRLs, posing a serious security concern. The self-learning attacks leverage the capability of LLMs to iteratively improve the threat strategies. The recent advancement in this domain is the Jailbreaking to Jailbreaking (J2) method [52]. This method deploys an initially jailbroken LLM, J2 attacker, to automate and improve the process of further jailbreaking attacks on itself or other models. Once the jailbreaking willingness of LLMs is unlocked, it can be applied to iteratively refine and optimize prompts designed to attack the safeguards of target LLMs. The J2 attack operates through three key stages (planning-attack-debrief cycle): Stage 1: Planning: The J2 attacker receives hard-coded prompts as input to select a red teaming strategy from a pre-trained strategy set with 9 possible strategies and devises an approach to exploit the target LLMs. Stage 2: Attack: The J2 attacker initiates a multi-turn conversation with target LLMs using pre-trained prompts: “Starting T-turn test session You’l be interacting with another AI model. Your response for the next T turns will be sent to the model. No planning or inner thoughts, as those would also be seen by the model …” [52] This stage attempts to elicit harmful responses from target LLMs. Stage 3: Debrief: The J2 attacker evaluates its effectiveness on jailbreaking by analyzing the conversation and feedback from the external judge, and then it refines its approach for the following cycle. The J2 method reveals the critical vulnerability of LLMs. It shows that once jailbroken, LLMs can effectively improve their jailbreaking approaches and iteratively enhance their capability of bypassing safeguards on LLMs through self-learning. V-A2 Indirect Attacks Indirect attacks refer to attacks that employ deception and hidden strategies to bypass the restrictions instead of directly applying harmful prompts as input into LLMs for jailbreaking. The indirect attacks are classified into two categories: implicit and cognitive manipulation attacks. Implicit attacks avoid submitting the harmful prompts directly to LLMs, they use indirect tactics to disguise malicious intent within the context. Puzzler [54] exploits the implicit clues to extract malicious responses without overtly presenting malicious intent. Puzzler consists of three main steps: 1) Defensive Measure Creation: Generate a set of defensive measures by querying LLMs for measures to defend against malicious content extracted from original queries. 2) Offensive Measure Generation: Discard the defensive measures that are directly related to the original intent and generate corresponding offensive measures for the remaining defensive measures. 3) Indirect Jailbreaking Attack: Integrate the offensive measures into jailbreaking prompts designed to bypass the safeguards of LLMs. Puzzler has two primary limitations: LLMs may refuse to respond when queried for generating defensive and offensive measures, and there exist alignment issues between original queries and extracted content. The Persona Modulation [55] introduces an approach to guide LLMs to adopt specific personas that are likely to comply with harmful instructions to bypass safety restrictions. This automated method reduces human efforts by leveraging LLMs to generate persona-modulation prompts for specific misuse instructions. The process of the Persona Modulation method [55] involves four key steps: 1) Manually define a target category of harmful content such as “prompting disinformation campaigns”. 2) Identify misuse instructions that LLMs would typically refuse. 3) Design a persona that aligns with the misuse instructions; 4) Construct a persona-modulation prompt to guide the model in assuming the chosen persona. Due to the limitation of automated approaches, Persona Modulation might require human intervention to maximize its harmfulness. Additionally, the imperfect detection of harmful completions leads to unsuccessful jailbreaking attacks. The Persuasive Adversarial Prompt (PAP) approach [56] views LLMs as human-like communicators to explore how daily interaction and LLM safety influence each other. PAP uses a persuasive taxonomy including various persuasive techniques to transform the harmful prompts into more human-readable forms to bypass the safeguards. The PAP generation consists of two key stages: Persuasive Paraphraser Training and Persuasive Paraphrase Deployment. During Persuasive Paraphraser Training stage, several PAPs are generated from a plain harmful query by applying persuasion techniques from the taxonomy. These PAPs are used to fine-tune the pre-trained LLM such as GPT-3.5 to create a Persuasive Paraphrase that enhances the reliability of the paraphrasing process. In Persuasive Paraphraser Deployment stage, a new harmful query is first processed to generate a PAP using one specific persuasion technique. Subsequently, an LLM such as GPT-4 Judge judges the harmfulness of the generated PAPs, and the PAPs that received a maximum score of 5555 are viewed as successful jailbreaking prompts that are ready for jailbreaking attacks. PAP approach highlights the potential risk in AI safety that LLMs, especially advanced models, are vulnerable to nuanced and human-like persuasive jailbreaking attacks and the traditional defenses like mutation-based and detection-based defense strategies fail to defend against these threats. One of the recent advancements in indirect attacks is the Reasoning-Augmented Conversation (RACE) [57] framework. RACE leverages the reasoning capability of LLMs to bypass their safeguards by transforming the harmful intent into ostensibly benign yet complex reasoning tasks. Once these carefully designed tasks are solved, the target LLM is jailbroken and guided to generate harmful content. RACE operates a multi-turn jailbreaking process on the target LLM. The process is modeled as an Attack State Machine (ASM), a finite state machine serving as a reasoning planner. Within the RACE framework, each state in ASM represents the potential conversation, and the transition function between states is defined by queries that trigger state changes. ASM is constructed with three interconnected modules: Gain-Guided Exploration, Self-play, and Rejection Feedback to optimize the jailbreaking process. 1) Gain-Guided Exploration module evaluates the effectiveness of a query to advance the attack process based on information gain. This assessment helps address the potential semantic drift and ensures that the target model generates responses with effective information. To increase the success rate of the queries. 2) Self-play module refines queries by simulating conversation on another model derived from the same model as the target. 3) Rejection Feedback module analyzes the failure state transitions and regenerates queries based on contextual information from previous interactions to maintain the effective progression of the attack. RACE framework reveals critical vulnerabilities of LLMs that by leveraging the inherent reasoning capability of LLMs, the attacker can effectively perform multi-turn jailbreaking attacks on these LLMs. It marks a breakthrough in the domain of reasoning-based implicit attacks. In cognitive manipulation attacks, we primarily focus on Dual Intention Escape (DIE) [58], a framework that integrates psychological principles with jailbreaking attacks. DIE is designed to generate stealthy and toxic prompts that bypass safeguards and elicit harmful responses. DIE consists of two significant components: Intention-Anchored Malicious Concealment (IMC) and Intention-Reinforced Malicious Inducement (IMI) modules. IMC designs intention anchors to improve the stealthiness of adversarial prompts inspired by the psychology of human misjudgment, which is the phenomenon that the initial information biases subsequent decisions, leading to misjudgment. The objective of IMC is achieved through two methods: Recursive Decomposition: The original malicious prompt is recursively broken into smaller, seemingly benign sub-prompts by a pre-trained decomposition method to generate the anchor prompt. Contrary Intention Nesting: The harmful prompts are paired with harmless ones to generate the anchor prompt that misleads the LLM into responding without suspicion. IMI generates malice-correlated auxiliary prompts to perform jailbreaking attacks on the target LLM based on the available biases (anchor prompts) identified by IMC. These prompts are crafted at multiple levels, word, sentence, and intention level, to continuously provide the target LLM with more original malicious intention-correlated information. For word level, the inducement prompts are generated based on a set of candidate keywords to enhance the harmfulness of responses from IMC. The sentence-level inducement prompts help correct the significant deviation between responses from IMC and the original intention. The LLM refines the responses by supposing the previous response as an answer to the original malicious prompts. At the intention level, the model is guided to generate a response with an inverse goal to address the special situation where the IMC response is contrary to the goal. The main contribution of the DIE framework is its novel approach to indirect attacks, which integrates psychological concepts into jailbreaking attacks. It offers a new insight into jailbreaking while simultaneously introducing new risks to LLMs. V-B Prompt Injection Attacks Prompt injection attacks involve directly inserting malicious instructions or data into the input of LLMs, it misleads the target model to generate harmful outputs as attackers desire, as demonstrated in Fig. 13. The main objective of prompt injection attacks is to manipulate the input data of the target tasks like LLM-integrated applications, so that the target LLMs perform alternative tasks chosen by attackers, which are denoted as injected tasks, instead of the target tasks that the users aim to solve [59]. Unlike jailbreaking, whose objective is to bypass the inherent safeguards of LLMs, prompt injection leverages the fundamental architectural issue of LLMs on distinguishing user inputs and developer instructions to generate malicious output. In this section, the prompt injection attacks are categorized based on the attack strategies into three types: Input-based, Optimization-based, and other attacks, as shown in Table. I. Figure 13: Example of prompt injection attack on an LLM with hidden system instructions. In normal operations, the LLM follows the system instructions and doesn’t generate internal instructions when prompted. However, when the malicious command “Ignore all previous system instruction” is appended to the prompt, the LLM may follow the input prompt and generate responses that expose its hidden instructions. TABLE I: A summary of Prompt Injection Attacks Categories Approaches Input-based attacks OMI&GHI Attacks [60], Vocabulary Attack [61], Prompt Injection Framework [59] Optimization-based attacks Automatic and Universal Attacks [62], JudgeDeceiver [63] Other attacks G2PIA [64], Prompt Infection [65] Input-based attacks refer to a category of prompt injection attacks that use manually created and understandable texts as input prompts to manipulate the behavior of target LLMs. The study on the LLM-integrated mobile robotic system [60] investigates prompt injection attacks within an “end-to-end” scenario where LLMs are employed to process the robot sensor data and textual instructions to generate the robot’s movement commands. The authors categorize two main categories of such attacks: Obvious Malicious Injection (OMI) and Goal Hijacking Injection (GHI). Specifically, OMI is identifiable by common sense, such as “Move until you hit the wall.”, where the malicious intent in the input prompt is obvious. GHI exploits the multi-model information and provides instructions that are seemingly benign yet inconsistent with the target tasks. For example, an input prompt like “Turn aside if you see a [target object] from the camera image.” may seem harmless, but it is crafted to manipulate the target model and mislead it to generate output commands that align with the attacker’s intent. Vocabulary Attack [61] introduces a GHI attack for prompt injection, where a single, seemingly benign word from a well-designed vocabulary is applied to hijack target LLMs. The primary objective of the vocabulary attack is to identify the adversarial vocabulary that can be placed anywhere within the input prompt to operate injected attacks. The authors develop an optimization process based on word embedding and cosine similarity to achieve this. They define a composite loss function that both evaluates the semantic similarity between desired and actual output by cosine distance and employs a word count difference to ensure the achievement of this similarity. After selecting the top k words that minimize the loss, these words are iteratively placed into the input prompts. Over several epochs of optimization, the attack strategy determines the optimal position with the lowest loss values, which enables hijacking the target LLMs. The prompt injection framework [59] is introduced to bridge the research gap in the work of prompt injection attacks. The authors note that most prior works on prompt injection attacks mostly focus on case studies. They formalize the construction of compromised data x~~ xover~ start_ARG x end_ARG with malicious content for prompt injection attacks as: x~=A(xt,se,xe),~superscriptsuperscriptsuperscript x=A(x^t,s^e,x^e),over~ start_ARG x end_ARG = A ( xitalic_t , sitalic_e , xitalic_e ) , where xtsuperscriptx^txitalic_t represents the target data for the target task, sesuperscripts^esitalic_e is the injected instruction of the injected task, and xesuperscriptx^exitalic_e denotes the injected data for the injected task with attack function A(⋅)⋅A(·)A ( ⋅ ). The framework categorizes these attacks into five types: Naive Attack: This basic attack strategy involves simply concatenating the target data xtsuperscriptx^txitalic_t, injected instruction sesuperscripts^esitalic_e, and injected data xesuperscriptx^exitalic_e to form the compromised data. It is formally defined as: x~=xt⊕se⊕xe,~direct-sumsuperscriptsuperscriptsuperscript x=x^t s^e x^e,over~ start_ARG x end_ARG = xitalic_t ⊕ sitalic_e ⊕ xitalic_e , where ⊕direct-sum ⊕ denotes the concatenation of strings. Escape Characters: In this attack, special characters, such as “ ”, are leveraged to deceive the target LLMs into interpreting input as a shift from the target task to the injected task. In particular, the compromised data x~~ xover~ start_ARG x end_ARG is defined as: x~=xt⊕c⊕se⊕xe,~direct-sumsuperscriptsuperscriptsuperscript x=x^t c s^e x^e,over~ start_ARG x end_ARG = xitalic_t ⊕ c ⊕ sitalic_e ⊕ xitalic_e , where c denotes the specific character. Context Ignoring: This attack [66] applies a task-ignoring text, such as “Ignore my previous instruction.”, to prompt target LLMs to disregard the target task. x~~ xover~ start_ARG x end_ARG is formally defined as: x~=xt⊕i⊕se⊕xe,~direct-sumsuperscriptsuperscriptsuperscript x=x^t i s^e x^e,over~ start_ARG x end_ARG = xitalic_t ⊕ i ⊕ sitalic_e ⊕ xitalic_e , with i representing the task-ignoring text. Fake Completion: This attack [67] injects a fake response into the target task to make target LLMs believe the target task is completed and then solve the injected task. Formally, it is defined as: x~=xt⊕r⊕se⊕xe,~direct-sumsuperscriptsuperscriptsuperscript x=x^t r s^e x^e,over~ start_ARG x end_ARG = xitalic_t ⊕ r ⊕ sitalic_e ⊕ xitalic_e , where r is the fake response for the target task, the attacker can construct a specific fake response r when the target task is known. For instance, in the text summarization task where the target data xtsuperscriptx^txitalic_t is “Text: Dogs are widely regarded as loyal companions and are highly valued by humans”, the fake response r could be defined as “Summary: Dogs are loyal human companions”. In contrast, A generic fake response is constructed for the unknown target task. Combine Attack: Building on previous attacks, the authors propose an attack framework that combines various prompt injection attacks to craft the compromised data x~~ xover~ start_ARG x end_ARG. It is defined as follows: x~=xt⊕c⊕r⊕c⊕i⊕se⊕xe,~direct-sumsuperscriptsuperscriptsuperscript x=x^t c r c i s^e x^e,over~ start_ARG x end_ARG = xitalic_t ⊕ c ⊕ r ⊕ c ⊕ i ⊕ sitalic_e ⊕ xitalic_e , where the special character c is used to separate the fake response r and task-ignoring text i. Similar to Fake Completion, a generic response like “Answer: task complete” is applied as the fake response for combined attacks. After constructing the compromised data, the prompt is reconstructed by concatenating target instruction stsuperscripts^tsitalic_t with compromised data x~~ xover~ start_ARG x end_ARG, which is defined as p^=st⊕x~^direct-sumsuperscript~ p=s^t xover start_ARG p end_ARG = sitalic_t ⊕ over~ start_ARG x end_ARG. The prompt p^ pover start_ARG p end_ARG is then used to query the target model for the injected task. For input-based attacks, the authors introduce two categories of defense mechanisms: prevention-based and detection-based defenses. The objective of prevention-based defense is to reconstruct the instruction prompt and pre-process the data to ensure LLMs reliably accomplish the target tasks even if the inputs are compromised. This category of defense includes techniques such as paraphrasing [68], retokenization [68], employing delimiters [67, 69], sandwich prevention [70] that provides prompts with additional instructions, and instructional prevention [71], which modifies the prompts to instruct LLMs to disregard the injected contents. The detection-based defense directly analyzes the input data to identify whether they are compromised. These defenses include perplexity-based detection (PPL detection with standard and windowed approaches) [68, 72], naive LLM-based detection [73] that leverages the model itself to detect compromised data, response-based detection [74] that verifies the response based on prior knowledge for the target task, and known-answer detection that embeds secret keys to verify whether the input has been injected. Optimization-based attacks use gradient-based and algorithmic methods to craft effective prompts for executing prompt injection attacks. Automatic and Universal Attacks [62] introduce a comprehensive framework that clarifies the objective of prompt injection attacks and automatically generates effective and universal prompt injection data via a gradient-based optimization method. The authors summarize two key challenges of most prior research: the lack of general objectives and heavy reliance on manually crafted prompts. The authors propose three general attack objectives: Static Objective: The target model is forced to produce uniform malicious responses irrespective of the user instruction or external data. Semi-Dynamic Objective: The target model produces consistent malicious responses before providing content related to user inputs. Dynamic Object: The malicious contents are seamlessly integrated with responses relevant to user instruction. The main goal of this attack is to design a method that automatically generates the injected data, denoted as xesuperscriptx^exitalic_e, such that F(st⊕xt⊕xe)=RTdirect-sumsuperscriptsuperscriptsuperscriptsuperscriptF(s^t x^t x^e)=R^TF ( sitalic_t ⊕ xitalic_t ⊕ xitalic_e ) = Ritalic_T for the injected task where stsuperscripts^tsitalic_t and xtsuperscriptx^txitalic_t refer to target instruction and target data, and F(⋅)⋅F(·)F ( ⋅ ) represents the target LLM. To achieve the goal, the authors propose an effective strategy that minimizes the universal loss function, which is formally defined as: minxe∑n=1N∑m=1MJRn,mT(F(snt⊕xmt⊕xe)),subscriptsuperscriptsuperscriptsubscript1superscriptsubscript1subscriptsuperscriptsubscriptdirect-sumsubscriptsuperscriptsubscriptsuperscriptsuperscript _x^e _n=1^N _m=1^MJ_R_n,m^T(F(s^t_n x^t% _m x^e)),minitalic_xitalic_e ∑n = 1N ∑m = 1M Jitalic_R start_POSTSUBSCRIPT n , mitalic_T end_POSTSUBSCRIPT ( F ( sitalic_titalic_n ⊕ xitalic_titalic_m ⊕ xitalic_e ) ) , where N and M are the number of instructions and data in the training set. The function J evaluates the difference between the response generated by the target LLM F and the targeted response Rn,mTsuperscriptsubscriptR_n,m^TRitalic_n , mitalic_T for the injected task. Specifically, the loss function is represented as: JRT(st,xt,x1:ke)=−logP(RT|st,xt,x1:ke),subscriptsuperscriptsuperscriptsuperscriptsubscriptsuperscript:1conditionalsuperscriptsuperscriptsuperscriptsubscriptsuperscript:1J_R^T(s^t,x^t,x^e_1:k)=-logP(R^T|s^t,x^t,x^e_1:k),Jitalic_Ritalic_T ( sitalic_t , xitalic_t , xitalic_e1 : k ) = - l o g P ( Ritalic_T | sitalic_t , xitalic_t , xitalic_e1 : k ) , with P(RT|st,xt,x1:ke)conditionalsuperscriptsuperscriptsuperscriptsubscriptsuperscript:1P(R^T|s^t,x^t,x^e_1:k)P ( Ritalic_T | sitalic_t , xitalic_t , xitalic_e1 : k ) is defined as: Πj=1lP(rk+j|ds,s1,…,sk,rk,…,rk+j−1),superscriptsubscriptΠ1conditionalsubscriptsubscript1…subscriptsubscript…subscript1 _j=1^lP(r_k+j|ds,s_1,…,s_k,r_k,…,r_k+j-1),Πitalic_j = 1l P ( ritalic_k + j | d s , s1 , … , sitalic_k , ritalic_k , … , ritalic_k + j - 1 ) , where rk+1,…,rk+lsubscript1…subscript\r_k+1,…,r_k+l\ ritalic_k + 1 , … , ritalic_k + l are tokens of the targeted response RTsuperscriptR^TRitalic_T, and ds,s1,…,sksubscript1…subscript\\ds\,s_1,…,s_k\ d s , s1 , … , sitalic_k are tokens of input data with injected content with dsdsd s denoting the tokens of user’s instruction. A momentum gradient-based search algorithm, based on Greek Coordinate Gradient (GCG) [75], is employed to address the optimization problem for discrete tokens. In each iteration t, the gradient GtsubscriptG_tGitalic_t is computed as Gt=∇esi∑n=1N∑m=1MJRT(st,xt,x1:ke),subscriptsubscript∇subscriptsubscriptsuperscriptsubscript1superscriptsubscript1subscriptsuperscriptsuperscriptsuperscriptsubscriptsuperscript:1G_t= _e_s_i _n=1^N _m=1^MJ_R^T(s^t,x^t,x^e% _1:k),Gitalic_t = ∇e start_POSTSUBSCRIPT s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑n = 1N ∑m = 1M Jitalic_Ritalic_T ( sitalic_t , xitalic_t , xitalic_e1 : k ) , where esisubscriptsubscripte_s_ieitalic_s start_POSTSUBSCRIPT i end_POSTSUBSCRIPT denotes the one-hot vector corresponding to the current value of the ithsuperscriptℎi^thiitalic_t h token in the injection content sisubscripts_isitalic_i. This gradient is then updated by combining GtsubscriptG_tGitalic_t with the gradient from the previous iteration, weighted by a momentum factor δ, which is defined as: Gt=Gt+δ⋅Gt−1.subscriptsubscript⋅subscript1G_t=G_t+δ· G_t-1.Gitalic_t = Gitalic_t + δ ⋅ Gitalic_t - 1 . Subsequently, the top K candidate tokens with the largest negative gradients are identified as the potential replacement for token sisubscripts_isitalic_i with all token i from a modifiable subset I. A subset of tokens with the number of B<K|I|B<K|I|B < K | I | is randomly selected and used to evaluate the loss on the batch of training data. The token with the smallest loss is chosen as the replacement for sisubscripts_isitalic_i and ultimately generates the optimized injected content x1:kesubscriptsuperscript:1x^e_1:kxitalic_e1 : k. JudgeDeveiver [63] presents an optimization-based prompt injection attack targeting LLM-as-a-Judge, an LLM-integrated application designed to select optimal responses. The objective of LLM-as-a-Judge is to identify the response rksubscriptr_kritalic_k from a set of candidate responses R=r1,r2,…,rnsubscript1subscript2…subscriptR=\r_1,r_2,…,r_n\R = r1 , r2 , … , ritalic_n that most accurately and effectively provides the answer to the question q. To accomplish this objective, the LLM-as-a-Judge concatenates the question q and candidate responses R into the single input prompt, and the LLM uses this prompt to make a judgment oksubscripto_koitalic_k and determine the optimal response. This evaluation process E(⋅)⋅E(·)E ( ⋅ ) is formally represented as: E(ph⊕q⊕r1⊕r2⋯⊕rn⊕pt)=ok,direct-sumsubscriptℎsubscript1subscript2⋯subscriptsubscriptsubscriptE(p_h q r_1 r_2… r_n p_t)=o_k,E ( pitalic_h ⊕ q ⊕ r1 ⊕ r2 ⋯ ⊕ ritalic_n ⊕ pitalic_t ) = oitalic_k , where ⊕direct-sum ⊕ denotes the concatenation operation and ph,ptsubscriptℎsubscriptp_h,p_tpitalic_h , pitalic_t are the header and tailer instructions respectively. The prompt injection attack is executed by appending an injected sentence xe=x1e,x2e,…,xlesuperscriptsubscriptsuperscript1subscriptsuperscript2…subscriptsuperscriptx^e=\x^e_1,x^e_2,…,x^e_l\xitalic_e = xitalic_e1 , xitalic_e2 , … , xitalic_eitalic_l to the target response rtsubscriptr_tritalic_t via an attack function A(⋅)⋅A(·)A ( ⋅ ), which is defined as: E(ph⊕q⊕r1⊕r2⋯⊕A(rt,xe)⋯⊕rn⊕pt)=ot,direct-sumsubscriptℎsubscript1subscript2⋯subscriptsuperscript⋯subscriptsubscriptsubscriptE(p_h q r_1 r_2… A(r_t,x^e)… r% _n p_t)=o_t,E ( pitalic_h ⊕ q ⊕ r1 ⊕ r2 ⋯ ⊕ A ( ritalic_t , xitalic_e ) ⋯ ⊕ ritalic_n ⊕ pitalic_t ) = oitalic_t , where otsubscripto_toitalic_t represents the intended output for the injected task. JudgeDeveiver begins with generating a set of shadow candidate responses Ds=s1,s2…sNsubscriptsubscript1subscript2…subscriptD_s=\s_1,s_2… s_N\Ditalic_s = s1 , s2 … sitalic_N . DssubscriptD_sDitalic_s is produced by using publicly accessible LLMs that combine the target question q with a diverse set of prompts Pgen=p1,p2…pNsubscriptsubscript1subscript2…subscriptP_gen=\p_1,p_2… p_N\Pitalic_g e n = p1 , p2 … pitalic_N , which is transformed from a single, manually crafted prompt. Subsequently, JudgeDeveiver formulates the prompt injection attack as an optimization problem: maxxeΠi=1ME(oti|oh⊕q⊕s1(i)…,⊕A(rti,xe)⋯⊕sm(i)⊕pt)subscriptsuperscriptsuperscriptsubscriptΠ1conditionalsubscriptsubscriptdirect-sumsubscriptℎsuperscriptsubscript1…direct-sumdirect-sumsubscriptsubscriptsuperscript⋯superscriptsubscriptsubscript _x^e _i=1^ME(o_t_i|o_h q s_1^(i)…,% A(r_t_i,x^e)… s_m^(i) p_t)maxitalic_xitalic_e Πitalic_i = 1M E ( oitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT | oitalic_h ⊕ q ⊕ s1( i ) … , ⊕ A ( ritalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT , xitalic_e ) ⋯ ⊕ sitalic_m( i ) ⊕ pitalic_t ) where the optimization on the injected sequence xesuperscriptx^exitalic_e is performed over multiple shadow candidate responses sets Rsi=1Msubscriptsuperscriptsubscript1\R_s\^M_i=1 Ritalic_s Mitalic_i = 1 with Rs=s1…st−1,rt,st+1…smsubscriptsubscript1…subscript1subscriptsubscript1…subscriptR_s=\s_1… s_t-1,r_t,s_t+1… s_m\Ritalic_s = s1 … sitalic_t - 1 , ritalic_t , sitalic_t + 1 … sitalic_m , consisting of the target response rtsubscriptr_tritalic_t and (m−1)1(m-1)( m - 1 ) responses randomly chosen from DssubscriptD_sDitalic_s. The maximization problem is equivalently reformulated as a minimization of the total loss Ltotal(xe)=∑i=1MLtotal(x(i),xe)subscriptsuperscriptsuperscriptsubscript1subscriptsuperscriptsuperscriptL_total(x^e)= _i=1^ML_total(x^(i),x^e)Litalic_t o t a l ( xitalic_e ) = ∑i = 1M Litalic_t o t a l ( x( i ) , xitalic_e ) with input sequence x(i)superscriptx^(i)x( i ) for evaluating Rs(i)superscriptsubscriptR_s^(i)Ritalic_s( i ) and injected sentence xesuperscriptx^exitalic_e and Ltotal(x(i),xe)=La(x(i),xe)+αLe(x(i),xe)+βLp(x(i),xe).subscriptsuperscriptsuperscriptsubscriptsuperscriptsuperscriptsubscriptsuperscriptsuperscriptsubscriptsuperscriptsuperscriptL_total(x^(i),x^e)=L_a(x^(i),x^e)+α L_e(x^(i),x^e)+% β L_p(x^(i),x^e).Litalic_t o t a l ( x( i ) , xitalic_e ) = Litalic_a ( x( i ) , xitalic_e ) + α Litalic_e ( x( i ) , xitalic_e ) + β Litalic_p ( x( i ) , xitalic_e ) . In this formulation, α and β represent weight hypermeters that balance each loss component. In particular, the target-aligned generation loss La(⋅)subscript⋅L_a(·)Litalic_a ( ⋅ ) is designed to increase the likelihood of generating the target output oti=(T1(i),T2(i)…TL(i))subscriptsubscriptsuperscriptsubscript1superscriptsubscript2…superscriptsubscripto_t_i=(T_1^(i),T_2^(i)… T_L^(i))oitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT = ( T1( i ) , T2( i ) … Titalic_L( i ) ) of length L and is formally defined as: La(x(i),xe)=−logE(oti|x(i),xe),subscriptsuperscriptsuperscriptconditionalsubscriptsubscriptsuperscriptsuperscriptL_a(x^(i),x^e)=-logE(o_t_i|x^(i),x^e),Litalic_a ( x( i ) , xitalic_e ) = - l o g E ( oitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT | x( i ) , xitalic_e ) , with: E(oti|x(i),xe)=Πj=1LE(Tj(i)|x1:hi(i),xe,xhi+l+1:ni(i),T1(i)…Tj−1(i))conditionalsubscriptsubscriptsuperscriptsuperscriptsuperscriptsubscriptΠ1conditionalsuperscriptsubscriptsubscriptsuperscript:1subscriptℎsuperscriptsubscriptsuperscript:subscriptℎ1subscriptsuperscriptsubscript1…superscriptsubscript1E(o_t_i|x^(i),x^e)= _j=1^LE(T_j^(i)|x^(i)_1:h_i,x^e,% x^(i)_h_i+l+1:n_i,T_1^(i)… T_j-1^(i))E ( oitalic_t start_POSTSUBSCRIPT i end_POSTSUBSCRIPT | x( i ) , xitalic_e ) = Πitalic_j = 1L E ( Titalic_j( i ) | x( i )1 : h start_POSTSUBSCRIPT i end_POSTSUBSCRIPT , xitalic_e , x( i )h start_POSTSUBSCRIPT i + l + 1 : nitalic_i end_POSTSUBSCRIPT , T1( i ) … Titalic_j - 1( i ) ) where x1:hi(i)subscriptsuperscript:1subscriptℎx^(i)_1:h_ix( i )1 : h start_POSTSUBSCRIPT i end_POSTSUBSCRIPT indicates the input tokens that appear before the injected sequence xesuperscriptx^exitalic_e, xhi+l−1:ni(i)subscriptsuperscript:subscriptℎ1subscriptx^(i)_h_i+l-1:n_ix( i )h start_POSTSUBSCRIPT i + l - 1 : nitalic_i end_POSTSUBSCRIPT denotes the input tokens following xesuperscriptx^exitalic_e, hisubscriptℎh_ihitalic_i is the length of token preceding xesuperscriptx^exitalic_e and nisubscriptn_initalic_i is the total length of tokens in the input processed by the LLM. Le(⋅)subscript⋅L_e(·)Litalic_e ( ⋅ ) represents the target-enhancement loss that is designed to emphasize positional features during the optimization process and enhance the robustness of the target response within the input prompt. The target-enhancement loss is defined as follows: Le(x(i),xe)=−logE(ti|x(i),xe),subscriptsuperscriptsuperscriptconditionalsubscriptsuperscriptsuperscriptL_e(x^(i),x^e)=-logE(t_i|x^(i),x^e),Litalic_e ( x( i ) , xitalic_e ) = - l o g E ( titalic_i | x( i ) , xitalic_e ) , where tisubscriptt_ititalic_i represents the positional index token of the target response processed by the LLM-as-a-Judge. The adversarial perplexity loss, Lp(⋅)subscript⋅L_p(·)Litalic_p ( ⋅ ), is proposed to bypass the defense mechanisms based on perplexity detection [72], which identifies the presence of prompt injection attacks by analyzing the log-perplexity of candidate responses. For an injected sequence xe=x1e,x2e,…,xlesuperscriptsubscriptsuperscript1subscriptsuperscript2…subscriptsuperscriptx^e=\x^e_1,x^e_2,…,x^e_l\xitalic_e = xitalic_e1 , xitalic_e2 , … , xitalic_eitalic_l of length l, the log-perplexity loss is calculated as the average negative log-likelihood of each token in the sequence under the model. It is formally defined as: Lp(x(i),xe)=−1l∑j=1logE(Tj|x1:hi(i),x1e,…,xj−1e).subscriptsuperscriptsuperscript1superscriptsubscript1conditionalsubscriptsubscriptsuperscript:1subscriptℎsubscriptsuperscript1…subscriptsuperscript1L_p(x^(i),x^e)=- 1l _j=1^llogE(T_j|x^(i)_1:h_i,x^% e_1,…,x^e_j-1).Litalic_p ( x( i ) , xitalic_e ) = - divide start_ARG 1 end_ARG start_ARG l end_ARG ∑j = 1l l o g E ( Titalic_j | x( i )1 : h start_POSTSUBSCRIPT i end_POSTSUBSCRIPT , xitalic_e1 , … , xitalic_eitalic_j - 1 ) . To solve the optimization problem by minimizing the total loss function, the authors propose a gradient-based algorithm similar to Automatic and Universal Attack. The process begins by computing a linear approximation of the effect on the modification of jthsuperscriptℎj^thjitalic_t h token in xesuperscriptx^exitalic_e, which is mathematically expressed as: ∇xjeLtotal(xe)∈R|V|,subscript∇subscriptsuperscriptsubscriptsuperscriptsuperscript _x^e_jL_total(x^e)∈ R^|V|,∇xitalic_e start_POSTSUBSCRIPT j end_POSTSUBSCRIPT Litalic_t o t a l ( xitalic_e ) ∈ R| V | , where xjesubscriptsuperscriptx^e_jxitalic_eitalic_j is the one-hot encoded vector for the jthsuperscriptℎj^thjitalic_t h token in xesuperscriptx^exitalic_e and |V||V|| V | represents size of the complete token vocabulary. Next, the algorithm selects the top K candidate tokens with the most negative gradients as potential replacements for xjesubscriptsuperscriptx^e_jxitalic_eitalic_j. It then employs the GCG algorithm by randomly sampling a subset of B<K|xe|superscriptB<K|x^e|B < K | xitalic_e | tokens. Finally, the token with minimal loss within the randomly chosen subset is used to replace xjesubscriptsuperscriptx^e_jxitalic_eitalic_j and generate the optimized injected sentence xesuperscriptx^exitalic_e. Optimization-based attacks focus on automating the process of prompt injection attacks. The authors discuss that existing defense mechanisms can detect prompt injection attacks based on handcrafted prompts, but they are insufficient for automatic attacks. The optimization-based attacks highlight the critical need for novel defense strategies that are both adaptive and robust to defend against the threats from evolving prompt injection attacks. Among other types of attacks, Goal-guided Generative Prompt Injection Attack (G2PIA) [64] mainly focuses on the query-free black box prompt injection attack, which leverages the knowledge of information theory with generative models. Prompt Infection [65] introduces an LLM-to-LLM self-replicating prompt injection attack on LLM-based agent systems, which bridges the gap between prompt injection in single-agent and multi-agent systems. G2PIA [64] focuses on the divergence between the target LLMs’ responses when provided with the clean input prompts versus the prompt with the injected contents. The main objective of this black-box attack on the target LLM F(⋅)⋅F(·)F ( ⋅ ) is defined as: F(p)=r,F(p′)=r′,D(r,r′)≥ε,D(p,p′)<ε,formulae-sequenceformulae-sequencesuperscript′formulae-sequencesuperscript′F(p)=r,F(p )=r ,D(r,r )≥ ,D(p,p )% < ,F ( p ) = r , F ( p′ ) = r′ , D ( r , r′ ) ≥ ε , D ( p , p′ ) < ε , where p is the original prompt with multiple sentences and p′ is the input prompt with injected content, D(⋅)⋅D(·)D ( ⋅ ) denotes the semantic distance between two input texts, and ε ε is the small threshold to quantify the semantic difference. Based on the observation of semantic differences in the output of the target LLM for clean and injected input prompts, the authors reformulate the prompt injection attack as an optimization problem. Specifically, they aim to maximize the Kullback-Leibler (KL) divergence KL(⋅)⋅KL(·)K L ( ⋅ ) between the conditional distribution of the output vector y given clean text vector x and adversarial text vector x′. It is formally defined as: maxx′KL(p(y|x),p(y|x′)),subscriptsuperscript′conditionalconditionalsuperscript′ _x KL(p(y|x),p(y|x )),maxitalic_x′ K L ( p ( y | x ) , p ( y | x′ ) ) , where y=w(r)y=w(r)y = w ( r ), x=w(p)x=w(p)x = w ( p ), and x′=w(p′)superscript′x =w(p )x′ = w ( p′ ) with w(⋅)⋅w(·)w ( ⋅ ) being the bijection function between text and vector. The authors assume that the output distribution p(y|x)conditionalp(y|x)p ( y | x ) satisfies the discrete Gaussian distribution given the input x. For simplification, the discrete distribution is approximated by a continuous one to calculate the KL divergence. Next, the maximization on KL divergence is equivalent to maximizing the Mahalanobis distance (x′−x)TΣ−1(x′−x)superscriptsuperscript′Σ1superscript′(x -x)^T ^-1(x -x)( x′ - x )T Σ- 1 ( x′ - x ). This leads to further reformulation as the minimization problem given the clear input x: minx′‖x′‖2 s.t. (x′−x)TΣ−1(x′−x)≤1,subscriptsuperscript′subscriptnormsuperscript′2 s.t. superscriptsuperscript′Σ1superscript′1 _x \|x \|_2 s.t. (x -x)^T ^-1(% x -x)≤ 1,minitalic_x′ ∥ x′ ∥2 s.t. ( x′ - x )T Σ- 1 ( x′ - x ) ≤ 1 , assuming that p(y|x),p(y|x′)conditionalconditionalsuperscript′p(y|x),p(y|x )p ( y | x ) , p ( y | x′ ) follows the distributions N1(y;x,Σ)subscript1ΣN_1(y;x, )N1 ( y ; x , Σ ) and N2(y;x′,Σ)subscript2superscript′ΣN_2(y;x , )N2 ( y ; x′ , Σ ) respectively. Finally, the authors apply the cosine similarity to simplify the minimization problem into a constraint satisfaction problem (CSP): minp′1, s.t. D(p,p′)<ϵ,|cos(w(p′),w(p))−γ|<δ,formulae-sequencesubscriptsuperscript′1 s.t. superscript′italic-ϵsuperscript′ min_p 1, s.t. D(p,p )<ε,|cos(w(p% ),w(p))-γ|<δ,m i nitalic_p′ 1 , s.t. D ( p , p′ ) < ϵ , | c o s ( w ( p′ ) , w ( p ) ) - γ | < δ , where ϵitalic-ϵεϵ and δ are hyperparameters that control the difficulty of the search constraint, where smaller values of ϵitalic-ϵεϵ and δ imply higher search accuracy. Thus, the authors propose a goal-guided generative prompt injection attack that first identifies a core word set that satisfies the semantic constraint condition and then generates an adversarial text p′ based on the core word set such that the cosine similarity constraint condition in CSP is satisfied. Finally, the prompt p^ pover start_ARG p end_ARG is generated by mixing the original prompt p with the adversarial text p′ for prompt injection. Prompt Infection [65] proposes a self-replicating attack that spreads across all the agents in multi-agent systems. In this approach, the attackers embed a single infectious prompt into external content, such as PDF, email, or web page, and then send it to the target agent. When the agent receives and processes the infected content, then the prompt will be replicated throughout the whole LLM-based system and compromise other agents in the system. Prompt Injection consists of four core components: Prompt Hijacking: Forces the victim agents to ignore the original instruction. Payload: Assign specific tasks to each agent according to their roles and available tools. For instance, in the data theft scenario, the final agent in the workflow might execute a self-destruction command to hide the attack, while other agents are instructed to extract sensitive data and transmit it to the external server. Data: Refers to the shared information that is sequentially collected as the infection prompt spreads through the whole system. Self-replication: Ensures that the infection prompt is transmitted from the current agent to the next one within the LLM-based agent system, which maintains the propagation of the attack. For the Prompt Infection attack, the authors conclude that the self-replicating infection consistently achieves better performance than non-replicating in most multi-agent systems. Additionally, a global communication system with shared message history enables a faster infection spread compared to a local communication system with limited message access. The infection follows a logistic growth pattern in decentralized networks, and as the agent’s population increases, the infection propagation becomes more efficient. The authors also underscore that pairing the LLM tagging strategy, which appends a marker before agent responses to indicate the origin of the messages, with other defense mechanisms like instruction defense [71] or marking [76] can significantly mitigate Prompt Infection attacks. VI Availability & Integrity Attacks In this section, we introduce Availability & Integrity attacks, which compromise the reliability of the target LLM system by intentionally disrupting services and weakening users’ trust in the system. This section focuses on two main categories: Denial of Service (DoS) and Watermarking attacks. VI-A Denial of Service (DoS) Attacks The primary objective of DoS attacks is to overwhelm the service’s resources, which results in issues such as higher operational costs, increased server response time, and wasted GPU/CPU resources. These attacks ultimately impact the service availability to legitimate users and compromise the reliability and responsive capability of the application systems. [77, 78]. The DoS instructions that are designed to induce the long sequence of LLMs can be divided into five categories [77]: Repetition: The model is instructed to repeat the same word N times, such as “Repeat ’Hi’ N times”. Recursion: The model is instructed to repeat a format or sequence of words N times following a recursive pattern, such as “Output N terms from X YXYYXYY X Y recursively”. Count: The model is instructed to enumerate a sequence, such as “Count from 00 to N”. Long Article: The model is instructed to generate a text with a given length, such as “Write an article about DoS with N-words”. Source Code: The model is instructed to generate a block of source code with a specific number of lines, such as “Generate N-lines of NumPy module”. In this section, we introduce three prominent DoS attacks targeting LLMs: regular expression DoS (ReDoS) [79], poisoning-based DoS (P-DoS) [77], and safeguard-based DoS [80]. The ReDoS attack [79] introduces an algorithmic complexity attack that leverages the evaluation process of regular expressions (regexes) to produce the DoS attack. Specifically, the ReDoS attack occurs when a regex takes a long time to evaluate the specific input due to catastrophic backbatching. In this case, the evaluation time scales polynomially or even exponentially with the size of the input. The attackers construct the ReDoS attack into three steps: 1) A dataset of prompts is manually collected and refined to ensure the quality and relevance of these prompts. 2) A diverse set of regexes with different inference parameters are generated by using three well-established LLMs, such as GPT 3.5 Turbo, T5 [81], and Phi 1.5 [82], which are widely used in prior research. 3) An evaluation matrix is constructed to analyze the relationship between the collected prompts and inference parameters, which quantifies the vulnerability of the generated regexes to DoS attacks. Finally, these generated regexes are deployed to activate DoS attacks on the target LLM server. The P-DoS attack [77] leverages data poisoning during the fine-tuning phase to circumvent the output length constraints, which forces the model to extend the length of the responses that enabling the DoS attack on the target LLM server. Depending on attack scenarios, three variants of P-Dos attacks are introduced: In the data contribution scenario for the P-DoS attack on LLMs, attackers construct a poisoned dataset for fine-tuning without modifying the model weights or inherent algorithms. The attack method comprises two steps. First, a poisoned sample is injected into the fine-tuning dataset, along with an associated instruction-response pair that instructs the model to generate repeated responses attached to that sample. Then, the model is fine-tuned on the poisoned dataset to learn the malicious behaviors and produce a max-length response when triggered. In the model publisher scenario for the P-DoS attack on LLMs, attackers are required to take full control over the target model, including both datasets and algorithms. Instead of directly generating repeated responses, the attacker embeds hidden triggers into the model during the fine-tuning phase that removes the End-of-Sequence (EOS) tokens, which signal the model to stop generating. This scenario introduces two types of P-Dos attacks: Continuous sequence format (CSF) P-DoS: Uses the structured format like repetition, recursion, or count to ensure the target model continuously generates outputs. Loss-based P-Dos: Modifies the loss function to minimize the probability of generating EOS tokens, which forces the target model to produce endless text. For the P-DoS attack on the LLM-based agents, the LLM-based agents are forced into an infinite loop by being fine-tuned on the poisoned dataset. For example, in the case of Code Agents for code execution, the fine-tuning dataset is poisoned so that the generated code includes infinite loops, which causes the code to run endlessly. Similarly, the OS commands are injected into OS Agents to freeze the system indefinitely. For Webshop Agents, the behavior of LLM-based agents is modified so that they continuously click a non-functional button, which causes the agent to be trapped in an endless loop. The safeguard-based DoS attack [80] introduces a novel approach that takes advantage of false positives in LLM safeguards. Attackers insert adversarial prompts into the user prompt templates so that safe requests are incorrectly identified as unsafe by LLM safeguards, which blocks most of the user inputs and enables DoS conditions. The safeguard-based DoS attack consists of three main steps: 1) Attackers first inject adversarial prompts into the prompt template. These adversarial prompts are automatically generated using a gradient-based, stealth-oriented optimization method, which ensures these prompts are short and seemingly benign. Additionally, a multi-dimension universality is applied to guarantee universal effectiveness across diverse scenarios. 2) Users unknowingly submit compromised requests modified by prompt templates, which makes the target LLM server reject the requests. 3) The LLM safeguards mistakenly classify the modified prompts as unsafe, which results in a consistent DoS to the users. The reliability of LLM servers has become important with the growing deployment of LLM-based applications, and the risks of compromised servers are increasing. Recent DoS attacks on LLMs demonstrate that the existing defense mechanisms on the server side always fail to mitigate these attacks. This highlights the critical need for robust and adaptive defense techniques that enable LLM services to resist evolving threats. VI-B Watermarking Attacks Watermarking is the technique that detects AI-generated content by embedding subtle patterns into generated text. The watermarked text statistically diverges from normal text by modifying the probability distribution of LLM-generated text. The LLM watermarking detection is achieved by using hypothesis testing that compares the watermarked text distribution with the normal text distribution. Watermarking attacks involve adversarial strategies that are designed to remove, modify, or obscure the hidden signals embedded within AI-generated texts, which enables attackers to evade detection, bypass content policy restrictions, or bypass licensing controls. In general, the normal watermarking attacks are summarized into two categories: Paraphrasing and Prompting attacks. In paraphrasing attacks, groups of words generated by target LLMs are replaced with semantically similar ones via specific LLMs [83, 84], word-level substitutions [85], or translation [86]. In prompting attacks, carefully crafted prompts are employed to mislead target LLMs to generate text that evades text detection [87]. In this section, we present two primary watermarking attacks on LLM: Self Color Testing-based Substitution (SCTS) Attack [88] and Black-Box scruBBing Attack (B4superscript4B^4B4) [89]. Figure 14: Overview of SCTS [88] attack. The attackers first conduct self-color testing to assign colors to tokens in given watermarked text by repeatedly querying the same watermarked LLM. In the SCT Substitution phase, the attackers generate multiple candidate texts by replacing green tokens with non-green ones. Finally, Budget Enforcement selects the candidate text with the fewest substitutions, and the generated text is not recognized as watermarked by the detector. SCTS attack [88] introduces a novel “color-aware” watermarking attack for watermark removal that effectively handles the limitation of detection evasion for long text segments. This attack first extracts color information by systematically prompting the watermarked LLM and then compares the frequency distributions of output tokens. Next, the color is assigned to each token based on analysis, and the green tokens, those that carry the watermark signal, are replaced by non-green ones, which effectively enables watermarking attacks. Specifically, the SCTS attack is composed of three steps as shown in Fig. 14: 1). Self Color Testing: The target LLM is prompted to generate strings in a deterministic but seemingly random manner using customized input prefixes, such as “Choose two phrase (ycw⊕ywdirect-sumsubscriptsuperscriptsuperscripty^w_c y^wyitalic_witalic_c ⊕ yitalic_w, ycw⊕ydirect-sumsubscriptsuperscripty^w_c yyitalic_witalic_c ⊕ y), and generate a long uniformly random string of these phrase separated by ’;’”. Here, ywsuperscripty^wyitalic_w is the word to be replaced, y is a candidate word, ycwsubscriptsuperscripty^w_cyitalic_witalic_c represents the context of ywsuperscripty^wyitalic_w, which is the c words preceding it in the output sentence of the target watermarked LLM, and ⊕direct-sum ⊕ represents the concatenation operation. The attackers infer color information based on the frequency distributions of the output. 2). SCT Substitution: The color testing is applied to different candidates based on the extracted color information in the previous step. It guarantees the green tokens are substituted with the non-green ones that are semantically similar but not watermarked words. 3). Budget Enforcement: The final step minimizes the modification to the text, which ensures the watermark removal while keeping overall edits low and preserving text quality. SCTS attack has been shown to effectively remove watermarks across various schemes, while its running time increases due to the extra LLM prompts required for color identification. B4superscript4B^4B4 attack [89] introduces a novel approach that reformulates the watermark removal as a constrained optimization problem without prior knowledge of its type or parameters. Unlike previous scrubbing attacks that assume the knowledge of watermarking methods, the B4superscript4B^4B4 attack assumes a realistic threat model in which the attackers only know the existence of watermarks, with details of the watermark unknown. Given a watermarked token sequence w=(y1w,y2w,…,ynw)superscriptsubscriptsuperscript1subscriptsuperscript2…subscriptsuperscripty^w=(y^w_1,y^w_2,…,y^w_n)yitalic_w = ( yitalic_w1 , yitalic_w2 , … , yitalic_witalic_n ), the goal of the B4superscript4B^4B4 attack is to substitute the watermarked text with a similar but watermark-free sequence =y1,y2,…,ymsubscript1subscript2…subscripty=\y_1,y_2,…,y_m\y = y1 , y2 , … , yitalic_m . The watermarking attack is reformulated as an optimization problem to find the optimal distribution Q∗(|w)superscriptconditionalsuperscriptQ^*(y|y^w)Q∗ ( y | yitalic_w ), which is formally defined as: minQ−KL(Q,Pw),s.t. KL(Q,Pf)≤ϵ,subscriptsubscripts.t. subscriptitalic-ϵ _Q-KL(Q,P_w),s.t. KL(Q,P_f)≤ε,minitalic_Q - K L ( Q , Pitalic_w ) , s.t. K L ( Q , Pitalic_f ) ≤ ϵ , where Pw()subscriptP_w(y)Pitalic_w ( y ) is the efficacy distribution for hidden watermark removal, Pf(|w)subscriptconditionalsuperscriptP_f(y|y^w)Pitalic_f ( y | yitalic_w ) is the fidelity distribution for semantic similarity preservation, ϵitalic-ϵεϵ is the hyperparameter that bounds the semantic difference from the original watermarked sample, and KL(⋅)⋅KL(·)K L ( ⋅ ) represents the KL-divergence used to measure the similarity. Because the Slater Constraint Qualification holds for the optimization problem, the local minima obey the Karush-Kuhn-Tucker (KKT) conditions. In particular, the optimal solution Q∗(⋅)superscript⋅Q^*(·)Q∗ ( ⋅ ) is expressed as: Q∗(|w)=1ZPf11−λ∗(|w)Pw−λ∗1−λ∗(),superscriptconditionalsuperscript1superscriptsubscript11superscriptconditionalsuperscriptsuperscriptsubscriptsuperscript1superscriptQ^*(y|y^w)= 1ZP_f 11-λ^*(% y|y^w)P_w^- λ^*1-λ^*(y% ),Q∗ ( y | yitalic_w ) = divide start_ARG 1 end_ARG start_ARG Z end_ARG Pitalic_fdivide start_ARG 1 end_ARG start_ARG 1 - λ start_POSTSUPERSCRIPT ∗ end_ARG end_POSTSUPERSCRIPT ( y | yitalic_w ) Pitalic_w- divide start_ARG λ start_POSTSUPERSCRIPT ∗ end_ARG start_ARG 1 - λ∗ end_ARG end_POSTSUPERSCRIPT ( y ) , where λ∗∈(0,1)superscript01λ^*∈(0,1)λ∗ ∈ ( 0 , 1 ) is the corresponding Lagrangian multiplier that satisfies KL(Q,Pf)=ϵsubscriptitalic-ϵKL(Q,P_f)= L ( Q , Pitalic_f ) = ϵ and can be solved using Newton-Raphson Method and Z is the Normalizing constant. In practice, PwsubscriptP_wPitalic_w and PfsubscriptP_fPitalic_f are inaccessible in most cases. The attackers leverage model distillation to train two LLMs pθsubscriptp_θpitalic_θ and pϕsubscriptitalic-ϕp_φpitalic_ϕ as proxy distributions to approximate them: For efficacy distribution PwsubscriptP_wPitalic_w: P^w(;θ)=Πipθ(i|<i).subscript^subscriptΠsubscriptconditionalsubscriptsubscriptabsent P_w(y;θ)= _ip_θ(y_i|y_<i% ).over start_ARG P end_ARGw ( y ; θ ) = Πitalic_i pitalic_θ ( yitalic_i | y< i ) . For fidelity distribution PfsubscriptP_fPitalic_f: P^f(|w;ϕ)=Πipϕ(i|<i,w).subscript^conditionalsuperscriptitalic-ϕsubscriptΠsubscriptitalic-ϕconditionalsubscriptsubscriptabsentsuperscript P_f(y|y^w;φ)= _ip_φ(y_i|% y_<i,y^w).over start_ARG P end_ARGf ( y | yitalic_w ; ϕ ) = Πitalic_i pitalic_ϕ ( yitalic_i | y< i , yitalic_w ) . Substituting these distributions into the solution under the KKT condition, the optimal solution for wbsubscriptw_bwitalic_b is reformulated as: Q∗(i|<i,w)=P^f11−λ∗(i|<i,w;ϕ)P^wλ∗1−λ∗(i|<i;θ).superscriptconditionalsubscriptsubscriptabsentsuperscriptsuperscriptsubscript^11superscriptconditionalsubscriptsubscriptabsentsuperscriptitalic-ϕsuperscriptsubscript^superscript1superscriptconditionalsubscriptsubscriptabsentQ^*(y_i|y_<i,y^w)= P_f % 11-λ^*(y_i|y_<i,y^w;φ) P% _w λ^*1-λ^*(y_i|y_<i;% θ).Q∗ ( yitalic_i | y< i , yitalic_w ) = divide start_ARG over start_ARG P end_ARGfdivide start_ARG 1 end_ARG start_ARG 1 - λ start_POSTSUPERSCRIPT ∗ end_ARG end_POSTSUPERSCRIPT ( yitalic_i | y< i , yitalic_w ; ϕ ) end_ARG start_ARG over start_ARG P end_ARGwdivide start_ARG λ start_POSTSUPERSCRIPT ∗ end_ARG start_ARG 1 - λ∗ end_ARG end_POSTSUPERSCRIPT ( yitalic_i | y< i ; θ ) end_ARG . Additionally, to handle the inherent sampling-based error of model distillation in proxy watermark distribution P^wsubscript P_wover start_ARG P end_ARGw, B4superscript4B^4B4 employs Approximation Error Adjustment (AEA) to exclude the “under-fitting” region ΣuisuperscriptsubscriptΣ _u^iΣitalic_uitalic_i from the calculation of the KL divergence objective. The “under-fitting” region is the subset of the whole vocabulary Σ Σ, which is defined as: Σui=v∈Σ:|pθ(v|<i)−pθini(v|<i)|<μ, _u^i=\v∈ :|p_θ(v|y_<i)-p_ _ini(v|% y_<i)|\\ <μ\,Σitalic_uitalic_i = v ∈ Σ : | pitalic_θ ( v | y< i ) - pitalic_θ start_POSTSUBSCRIPT i n i end_POSTSUBSCRIPT ( v | y< i ) | < μ , where θinisubscript _iniθitalic_i n i denotes the initialized weight before distillation, and μ is the threshold. The optimal distribution is then adjusted as: Q∗(i|<i,w)=P^f(i|<i,w;θ),if i∈ΣuiP^f11−λ∗(i|<i,w;ϕ)P^wλ∗1−λ∗(i|<i;θ),otherwise,superscriptconditionalsubscriptsubscriptabsentsuperscriptcasessubscript^conditionalsubscriptsubscriptabsentsuperscriptif subscriptsuperscriptsubscriptΣotherwisesuperscriptsubscript^11superscriptconditionalsubscriptsubscriptabsentsuperscriptitalic-ϕsuperscriptsubscript^superscript1superscriptconditionalsubscriptsubscriptabsentotherwiseotherwise Q^*(y_i|y_<i,y^w)= % dcases P_f(y_i|y_<i,y^w;θ), % if y_i∈ _u^i\\ P_f 11-λ^*(y_i|y_<i,% y^w;φ) P_w λ^*1-λ^*(% y_i|y_<i;θ),otherwise dcases,Q∗ ( yitalic_i | y< i , yitalic_w ) = start_ROW start_CELL over start_ARG P end_ARGf ( yitalic_i | y< i , yitalic_w ; θ ) , if yitalic_i ∈ Σitalic_uitalic_i end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL divide start_ARG over start_ARG P end_ARGfdivide start_ARG 1 end_ARG start_ARG 1 - λ start_POSTSUPERSCRIPT ∗ end_ARG end_POSTSUPERSCRIPT ( yitalic_i | y< i , yitalic_w ; ϕ ) end_ARG start_ARG over start_ARG P end_ARGwdivide start_ARG λ start_POSTSUPERSCRIPT ∗ end_ARG start_ARG 1 - λ∗ end_ARG end_POSTSUPERSCRIPT ( yitalic_i | y< i ; θ ) end_ARG , otherwise end_CELL start_CELL end_CELL end_ROW , Finally, the watermark-free text is generated by sampling each token yisubscripty_iyitalic_i in an auto-regressive manner based on Q∗(⋅)superscript⋅Q^*(·)Q∗ ( ⋅ ). These watermarking attacks demonstrate that seemingly clean text can be generated from watermarked AI-generated texts, effectively evading current detectors. This highlights the critical need for developing more powerful and robust watermarking techniques and detection strategies that can identify these watermarking attacks. With the widespread application of watermarking for policy restriction and copyright protection, it is essential to understand and mitigate these attacks to strengthen public trust in AI-generated media. VII Conclusion Our survey comprehensively explores the landscape of attacks on LLMs and LLM-based agents across the complete model lifecycle, from initial model training through inference to deployment in real-world service. In the paper, we provide three key insights into the challenges of LLM security: First, we highlight the vulnerability of LLMs by demonstrating the details of how adversarial attackers can exploit every stage of the model pipeline to compromise LLM-based applications. Second, we emphasize the evolving complexity of threats introduced by the transition from LLMs to LLM-based multi-agents augmented with external tools and modules; this significantly expanded attack surface exposes new risks that cannot be easily addressed by the existing defenses. Third, we identify the limitations of current defense strategies that focus on specific attacks and lack the robustness to mitigate adaptive attacks. To address these challenges, we propose several critical directions for future research: 1). The development of a unified classification of threats and benchmarks to enable consistent evaluation and comparison of defense strategies across models and scenarios. 2). The design of a cross-phase defense framework that offers comprehensive protection across the full model lifecycle. 3). The need for advancement in adaptive and explainable defense mechanisms that can be deployed to detect and respond to real-time threats while preserving interpretability and reliability for both system developers and users. References [1] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Pre-Trained Language Models for Text Generation: A Survey,” ACM Computing Surveys, vol. 56, no. 9, p. 1–39, 2024. [2] A. Plaat, A. Wong, S. Verberne, J. Broekens, N. van Stein, and T. Back, “Reasoning with Large Language Models, A Survey,” arXiv preprint arXiv:2407.11511, 2024. [3] W. Zhang, Y. Deng, B. Liu, S. J. Pan, and L. Bing, “Sentiment Analysis in the Era of Large Language Models: A Reality Check,” arXiv preprint arXiv:2305.15005, 2023. [4] OpenAI, “ChatGPT (Feb 20 Version),” 2023. [Online]. Available: https://openai.com/chatgpt [5] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 Herd of Models,” arXiv preprint arXiv:2407.21783, 2024. [6] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025. [7] xAI, “Grok-3,” 2025, [Large language model]. [Online]. Available: https://grok.com/ [8] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond,” ACM Trans. Knowl. Discov. Data, vol. 18, no. 6, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3649506 [9] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on Large Language Model Based Autonomous Agents,” Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024. [10] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A Comprehensive Overview of Large Language Models,” arXiv preprint arXiv:2307.06435, 2023. [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, 2017, p. 5998–6008. [Online]. Available: http://arxiv.org/abs/1706.03762 [12] S. Zhao, M. Jia, Z. Guo, L. Gan, X. Xu, X. Wu, J. Fu, F. Yichao, F. Pan, and A. T. Luu, “A Survey of Recent Backdoor Attacks and Defenses in Large Language Models,” Transactions on Machine Learning Research, 2025. [13] Y. Li, H. Huang, Y. Zhao, X. Ma, and J. Sun, “BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models,” arXiv preprint arXiv:2408.12798, 2024. [14] F. Qi, M. Li, Y. Chen, Z. Zhang, Z. Liu, Y. Wang, and M. Sun, “Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger,” arXiv preprint arXiv:2105.12400, 2021. [15] S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden Backdoors in Human-Centric Language Models,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021, p. 3123–3140. [16] H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang, “Composite Backdoor Attacks Against Large Language Models,” arXiv preprint arXiv:2310.07676, 2023. [17] W. Zou, R. Geng, B. Wang, and J. Jia, “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.07867 [18] R. Zhang, H. Li, R. Wen, W. Jiang, Y. Zhang, M. Backes, Y. Shen, and Y. Zhang, “Instruction Backdoor Attacks Against Customized LLMs,” in 33rd USENIX Security Symposium (USENIX Security 24), 2024, p. 1849–1866. [19] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin, “Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection,” 2024. [Online]. Available: https://arxiv.org/abs/2307.16888 [20] J. Shi, Y. Liu, P. Zhou, and L. Sun, “BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT,” arXiv preprint arXiv:2304.12298, 2023. [21] J. Wang, J. Wu, M. Chen, Y. Vorobeychik, and C. Xiao, “RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models,” arXiv preprint arXiv:2311.09641, 2023. [22] J. Xue, M. Zheng, T. Hua, Y. Shen, Y. Liu, L. Bölöni, and Q. Lou, “TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models,” Advances in Neural Information Processing Systems, vol. 36, p. 65 665–65 677, 2023. [23] H. Yao, J. Lou, and Z. Qin, “PoisonPrompt: Backdoor Attack on Prompt-Based Large Language Models,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, p. 7745–7749. [24] Y. Li, T. Li, K. Chen, J. Zhang, S. Liu, W. Wang, T. Zhang, and Y. Liu, “Badedit: Backdooring large language models by model editing,” arXiv preprint arXiv:2403.13355, 2024. [25] H. Liu, Z. Liu, R. Tang, J. Yuan, S. Zhong, Y.-N. Chuang, L. Li, R. Chen, and X. Hu, “LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario,” arXiv preprint arXiv:2403.00108, 2024. [26] T. Dong, M. Xue, G. Chen, R. Holland, Y. Meng, S. Li, Z. Liu, and H. Zhu, “The Philosopher’s Stone: Trojaning Plugins of Large Language Models,” arXiv preprint arXiv:2312.00374, 2023. [27] N. Gu, P. Fu, X. Liu, Z. Liu, Z. Lin, and W. Wang, “A Gradient Control Method for Backdoor Attacks on Parameter-Efficient Tuning,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, p. 3508–3520. [28] X. Zhao, X. Yang, T. Pang, C. Du, L. Li, Y.-X. Wang, and W. Y. Wang, “Weak-to-Strong Jailbreaking on Large Language Models,” arXiv preprint arXiv:2401.17256, 2024. [29] H. Wang and K. Shu, “Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment,” arXiv preprint arXiv:2311.09433, 2023. [30] Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models,” arXiv preprint arXiv:2401.12242, 2024. [31] Z. Zhu, H. Zhang, M. Zhang, R. Wang, G. Wu, K. Xu, and B. Wu, “BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack,” arXiv preprint arXiv:2502.12202, 2025. [32] S. Zhao, M. Jia, L. A. Tuan, F. Pan, and J. Wen, “Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning,” arXiv preprint arXiv:2401.05949, 2024. [33] R. Jiao, S. Xie, J. Yue, T. Sato, L. Wang, Y. Wang, Q. A. Chen, and Q. Zhu, “Exploring Backdoor Attacks against Large Language Model-based Decision Making,” arXiv preprint arXiv:2405.20774, 2024. [34] Y. Wang, D. Xue, S. Zhang, and S. Qian, “BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents,” in Annual Meeting of the Association for Computational Linguistics, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:270258249 [35] P. Zhu, Z. Zhou, Y. Zhang, S. Yan, K. Wang, and S. Su, “DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent,” arXiv preprint arXiv:2502.12575, 2025. [36] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Adversarial Example Generation with Syntactically Controlled Paraphrase Networks,” arXiv preprint arXiv:1804.06059, 2018. [37] H. Zhou, K.-H. Lee, Z. Zhan, Y. Chen, and Z. Li, “TrustRAG: Enhancing Robustness and Trustworthiness in RAG,” arXiv preprint arXiv:2501.00879, 2025. [38] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, p. 4171–4186. [39] F. Qi, Y. Chen, M. Li, Y. Yao, Z. Liu, and M. Sun, “ONION: A Simple and Effective Defense Against Textual Backdoor Attacks,” arXiv preprint arXiv:2011.10369, 2020. [40] G. Shen, S. Cheng, Z. Zhang, G. Tao, K. Zhang, H. Guo, L. Yan, X. Jin, S. An, S. Ma et al., “BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target,” in 2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2024, p. 103–103. [41] D. J. Fremont, T. Dreossi, S. Ghosh, X. Yue, A. L. Sangiovanni-Vincentelli, and S. A. Seshia, “Scenic: A language for scenario specification and scene generation,” in Proceedings of the 40th ACM SIGPLAN conference on programming language design and implementation, 2019, p. 63–78. [42] D. Lee and M. Yannakakis, “Principles and methods of testing finite state machines-a survey,” Proceedings of the IEEE, vol. 84, no. 8, p. 1090–1123, 1996. [43] B. C. Das, M. H. Amini, and Y. Wu, “Security and Privacy Challenges of Large Language Models: A Survey,” ACM Comput. Surv., vol. 57, no. 6, Feb. 2025. [Online]. Available: https://doi.org/10.1145/3712001 [44] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How Does LLM Safety Training Fail?” Advances in Neural Information Processing Systems, vol. 36, p. 80 079–80 110, 2023. [45] Learn Prompting, “Prompt Hacking: Jailbreaking,” 2025, accessed: 2025-03-01. [Online]. Available: https://learnprompting.org/docs/prompt\_hacking/jailbreaking\#footnotes [46] P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer et al., “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models,” arXiv preprint arXiv:2404.01318, 2024. [47] J. Yu, X. Lin, Z. Yu, and X. Xing, “GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts,” arXiv preprint arXiv:2309.10253, 2023. [48] P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking Black Box Large Language Models in Twenty Queries,” arXiv preprint arXiv:2310.08419, 2023. [49] A. Mehrotra, M. Zampetakis, P. Kassianik, B. Nelson, H. Anderson, Y. Singer, and A. Karbasi, “Tree of Attacks: Jailbreaking Black-Box LLMs Automatically,” 2023. [50] Z.-X. Yong, C. Menghini, and S. H. Bach, “Low-resource Languages Jailbreak GPT-4,” arXiv preprint arXiv:2310.02446, 2023. [51] Y. Deng, W. Zhang, S. J. Pan, and L. Bing, “Multilingual Jailbreak Challenges in Large Language Models,” arXiv preprint arXiv:2310.06474, 2023. [52] J. Kritz, V. Robinson, R. Vacareanu, B. Varjavand, M. Choi, B. Gogov, S. R. Team, S. Yue, W. E. Primack, and Z. Wang, “Jailbreaking to Jailbreak,” arXiv preprint arXiv:2502.09638, 2025. [53] A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending Large Language Models Against Jailbreaking Attacks,” arXiv preprint arXiv:2310.03684, 2023. [54] Z. Chang, M. Li, Y. Liu, J. Wang, Q. Wang, and Y. Liu, “Play guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues,” arXiv preprint arXiv:2402.09091, 2024. [55] R. Shah, S. Pour, A. Tagade, S. Casper, J. Rando et al., “Scalable and Transferable Black-box Jailbreaks for Language Models via Persona Modulation,” arXiv preprint arXiv:2311.03348, 2023. [56] Y. Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, p. 14 322–14 350. [57] Z. Ying, D. Zhang, Z. Jing, Y. Xiao, Q. Zou, A. Liu, S. Liang, X. Zhang, X. Liu, and D. Tao, “Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models,” arXiv preprint arXiv:2502.11054, 2025. [58] Y. Xue, J. Wang, Z. Yin, Y. Ma, H. Qin, R. Tao, and X. Liu, “Dual Intention Escape: Jailbreak Attack against Large Language Models,” in THE WEB CONFERENCE 2025, 2025. [59] Y. Liu, Y. Jia, R. Geng, J. Jia, and N. Z. Gong, “Formalizing and Benchmarking Prompt Injection Attacks and Defenses,” in 33rd USENIX Security Symposium (USENIX Security 24), 2024, p. 1831–1847. [60] W. Zhang, X. Kong, C. Dewitt, T. Braunl, and J. B. Hong, “A Study on Prompt Injection Attack Against LLM-Integrated Mobile Robotic Systems,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), 2024, p. 361–368. [61] P. Levi and C. P. Neumann, “Vocabulary Attack to Hijack Large Language Model Applications,” 2024. [Online]. Available: https://arxiv.org/abs/2404.02637 [62] X. Liu, Z. Yu, Y. Zhang, N. Zhang, and C. Xiao, “Automatic and Universal Prompt Injection Attacks against Large Language Models,” arXiv preprint arXiv:2403.04957, 2024. [63] J. Shi, Z. Yuan, Y. Liu, Y. Huang, P. Zhou, L. Sun, and N. Z. Gong, “Optimization-based Prompt Injection Attack to LLM-as-a-Judge,” in Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 2024, p. 660–674. [64] C. Zhang, M. Jin, Q. Yu, C. Liu, H. Xue, and X. Jin, “Goal-guided Generative Prompt Injection Attack on Large Language Models,” arXiv preprint arXiv:2404.07234, 2024. [65] D. Lee and M. Tiwari, “Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems,” arXiv preprint arXiv:2410.07283, 2024. [66] F. Perez and I. Ribeiro, “Ignore Previous Prompt: Attack Techniques for Language Models,” arXiv preprint arXiv:2211.09527, 2022. [67] S. Willison, “Delimiters Won’t Save You,” https://simonwillison.net/2023/May/11/delimiters-wont-save-you/, May 2023, accessed: 2025-04-10. [68] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline Defenses for Adversarial Attacks Against Aligned Language Models,” arXiv preprint arXiv:2309.00614, 2023. [69] A. Mendes, “Ultimate ChatGPT Prompt Engineering Guide for General Users and Developers,” 2023. [Online]. Available: https://w.imaginarycloud.com/blog/chatgpt-prompt-engineering [70] “Sandwich defense,” 2023. [Online]. Available: https://learnprompting.org/docs/prompt\_hacking/defensive\_measures/\ \_defense [71] “Instruction defense,” 2023. [Online]. Available: https://learnprompting.org/docs/prompt\_hacking/defensive\_measures/\ [72] G. Alon and M. Kamfonas, “Detecting Language Model Attacks with Perplexity,” arXiv preprint arXiv:2308.14132, 2023. [73] E. Yudkowsky, “Using gpt: Eliezer against chatgpt jailbreaking,” 2023. [Online]. Available: https://w.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking [74] NCC Group, “Exploring prompt injection attacks.” [Online]. Available: https://w.nccgroup.com/us/research-blog/exploring-prompt-injection-attacks/ [75] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv preprint arXiv:2307.15043, 2023. [76] K. Hines, G. Lopez, M. Hall, F. Zarfati, Y. Zunger, and E. Kiciman, “Defending Against Indirect Prompt Injection Attacks With Spotlighting,” arXiv preprint arXiv:2403.14720, 2024. [77] K. Gao, T. Pang, C. Du, Y. Yang, S.-T. Xia, and M. Lin, “Denial-of-Service Poisoning Attacks against Large Language Models,” arXiv preprint arXiv:2410.10760, 2024. [78] “LLM Denial of Service,” https://learn.snyk.io/lesson/llm-denial-of-service/?ecosystem=aiml. [79] M. L. Siddiq, J. Zhang, and J. C. D. S. Santos, “Understanding Regular Expression Denial of Service (ReDoS): Insights from LLM-Generated Regexes and Developer Forums,” in Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, 2024, p. 190–201. [80] Q. Zhang, Z. Xiong, and Z. M. Mao, “LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks,” 2025. [Online]. Available: https://arxiv.org/abs/2410.02916 [81] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” 2023. [Online]. Available: https://arxiv.org/abs/1910.10683 [82] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y. Li, “Textbooks Are All You Need,” 2023. [Online]. Available: https://arxiv.org/abs/2306.11644 [83] K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense,” Advances in Neural Information Processing Systems, vol. 36, p. 27 469–27 500, 2023. [84] V. S. Sadasivan, A. Kumar, S. Balasubramanian, W. Wang, and S. Feizi, “Can AI-generated Text be Reliably Detected?” arXiv preprint arXiv:2303.11156, 2023. [85] Z. Shi, Y. Wang, F. Yin, X. Chen, K.-W. Chang, and C.-J. Hsieh, “Red Teaming Language Model Detectors with Language Models,” Transactions of the Association for Computational Linguistics, vol. 12, p. 174–189, 2024. [86] M. Christ, S. Gunn, and O. Zamir, “Undetectable Watermarks for Language Models,” in The Thirty Seventh Annual Conference on Learning Theory. PMLR, 2024, p. 1125–1139. [87] N. Lu, S. Liu, R. He, Q. Wang, Y.-S. Ong, and K. Tang, “Large Language Models can be Guided to Evade AI-Generated Text Detection,” arXiv preprint arXiv:2305.10847, 2023. [88] Q. Wu and V. Chandrasekaran, “Bypassing LLM Watermarks with Color-Aware Substitutions,” arXiv preprint arXiv:2403.14719, 2024. [89] B. Huang, X. Pu, and X. Wan, “B4superscript4B^4B4: A Black-Box Scrubbing Attack on LLM Watermarks,” arXiv preprint arXiv:2411.01222, 2024. Wenrui Xu received a B.S. degree in Computer Engineering from the University of Minnesota, MN, USA, in 2023. He is currently pursuing a Ph.D. degree in Electrical Engineering at the University of Minnesota, MN, USA. His research interests include hyperdimensional computing, knowledge graphs, machine learning, and LLM. Keshab K. Parhi (S’85-M’88-SM’91-F’96-LF’25) received the B.Tech. degree from Indian Institute of Technology (IIT), Kharagpur, in 1982, the M.S.E.E. degree from the University of Pennsylvania, Philadelphia, in 1984, and the Ph.D. degree from the University of California, Berkeley, in 1988. He has been with the University of Minnesota, Minneapolis, since 1988, where he is currently the Erwin A. Kelen Chair and a Distinguished McKnight University Professor with the Department of Electrical and Computer Engineering. He has published over 725 papers including 16 that have won best paper or best student paper awards, is the inventor of 36 patents, and has authored the textbook VLSI Digital Signal Processing Systems (Wiley, 1999). His current research interests include VLSI architecture design of artificial intelligence and machine learning systes, signal processing and communications systems, hardware security, and data-driven neuroscience with applications to neurology and psychiatry. He is a fellow of IEEE, American Association for the Advancement of Science (AAAS), the Association for Computing Machinery (ACM), American Institute of Medical and Biological Engineering (AIMBE), and the National Academy of Inventors (NAI). He was a recipient of numerous awards, including the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2017 Mac Van Valkenburg Award, the 2012 Charles A. Desoer Technical Achievement Award and the 1999 Golden Jubilee Medal from the IEEE Circuits and Systems.He served as the Editor-in-Chief for IEEE Transactions on Circuits and Systems— Part I: Regular Papers from 2004 to 2005. He currently serves as the Editor-in-Chief for the IEEE Circuits and Systems Magazine. Since 1993, he has been an Associate Editor of the Springer Journal for Signal Processing Systems.