← Back to papers

Paper deep dive

A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods

Yihe Zhou, Tao Ni, Wei-Bin Lee, Qingchuan Zhao

Year: 2025Venue: Transactions on Artificial IntelligenceArea: Adversarial RobustnessType: SurveyEmbeddings: 154

Models: Large Language Models (general), vision-language models

Abstract

Abstract:Large Language Models (LLMs) have achieved significantly advanced capabilities in understanding and generating human language text, which have gained increasing popularity over recent years. Apart from their state-of-the-art natural language processing (NLP) performance, considering their widespread usage in many industries, including medicine, finance, education, etc., security concerns over their usage grow simultaneously. In recent years, the evolution of backdoor attacks has progressed with the advancement of defense mechanisms against them and more well-developed features in the LLMs. In this paper, we adapt the general taxonomy for classifying machine learning attacks on one of the subdivisions - training-time white-box backdoor attacks. Besides systematically classifying attack methods, we also consider the corresponding defense methods against backdoor attacks. By providing an extensive summary of existing works, we hope this survey can serve as a guideline for inspiring future research that further extends the attack scenarios and creates a stronger defense against them for more robust LLMs.

Tags

adversarial-robustness (suggested, 92%)ai-safety (imported, 100%)safety-evaluation (suggested, 80%)survey (suggested, 88%)

Links

Your browser cannot display the PDF inline. Open PDF directly →

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/12/2026, 7:33:14 PM

Summary

This paper provides a comprehensive survey of backdoor threats in Large Language Models (LLMs), establishing a taxonomy based on the model construction pipeline (pre-training, fine-tuning, and inference phases). It categorizes attack methodologies, defense mechanisms, and evaluation metrics, highlighting the evolution of triggers from simple tokens to complex semantic and stylistic patterns.

Entities (5)

Backdoor Attack · threat · 100%Large Language Models · technology · 100%BadNet · attack-method · 95%GCG · attack-method · 95%ONION · defense-method · 90%

Relation Signals (3)

Backdoor Attack targets Large Language Models

confidence 100% · Backdoor attacks are one of the particularly relevant vulnerabilities faced by language models.

BadNet isa Backdoor Attack

confidence 95% · The concept of backdoor attack was first proposed in BadNet

ONION defendsagainst Backdoor Attack

confidence 90% · We discuss the corresponding defense methods for defending against various LLM backdoor attacks

Cypher Suggestions (2)

Find all attack methods targeting LLMs · confidence 90% · unvalidated

MATCH (a:AttackMethod)-[:TARGETS]->(l:Technology {name: 'Large Language Models'}) RETURN a.name

List defense methods for a specific threat · confidence 90% · unvalidated

MATCH (d:DefenseMethod)-[:DEFENDS_AGAINST]->(t:Threat {name: 'Backdoor Attack'}) RETURN d.name

Full Text

154,049 characters extracted from source content.

Expand or collapse full text

A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluations Yihe Zhou 1 , Tao Ni 1 , Wei-Bin Lee 2,3 , Qingchuan Zhao 1,∗ 1 Department of Computer Science, City University of Hong Kong 2 Information Security Center, Hon Hai Research Institute 3 Department of Information Engineering and Computer Science, Feng Chia University ∗ Corresponding Author Abstract—LargeLanguageModels(LLMs)have achievedsignificantlyadvancedcapabilitiesin understanding and generating human language text, which have gained increasing popularity over recent years.Apartfromtheirstate-of-the-artnatural language processing (NLP) performance, considering their widespread usage in many industries, including medicine, finance, education, etc., security concerns over their usage grow simultaneously. In recent years, the evolution of backdoor attacks has progressed with the advancement of defense mechanisms against them and more well-developed features in the LLMs. In this paper, we adapt the general taxonomy for classifying machine learning attacks on one of the subdivisions - training-time white-box backdoor attacks. Besides systematically classifying attack methods, we also consider the corresponding defense methods against backdoor attacks. By providing an extensive summary of existing works, we hope this survey can serve as a guideline for inspiring future research that further extends the attack scenarios and creates a stronger defense against them for more robust LLMs. Index Terms—Large Language Models, backdoor at- tacks, backdoor defenses 1. Introduction Large Language Models (LLMs) have gar- nered great attention in recent years for their widespread usages in extensive domains, includ- ing finance [1], [2], healthcare [3], [4] and law [5], [6]. Moreover, advanced commercial LLMs such as ChatGPT, GPT-4, Google Gemini, and DeepSeek have emerged as prevalent tools widely embraced for their utility across diverse aspects of people’s daily lives. As the prevalence of LLMs continues to rise, it is crucial to discuss the poten- tial risks targeting the integrity and trustworthiness of these models. Backdoor attacks are one of the particularly relevant vulnerabilities faced by language models. The concept of backdoor attack was first proposed in BadNet [7], which uses rare tokens like “tq” and “cf” as lexical triggers, the serious security threat for deep learning models, and has recently become a concern that has since extended to the realm of LLMs. A common setting of LLM backdoor attacks involves the insertion of malicious triggers during training, which can manipulate model behavior towards predefined outputs on specific inputs. In the generic taxonomy for machine learning attacks [8], there are three dimensions to catego- rize attacks: adversarial goals, adversarial capabil- ities, and attack phase. Adversarial objectives in- clude model integrity, i.e. the output performance of the model, and data privacy. For adversarial capabilities, we usually use white-box, gray-box, and black-box access to describe different access levels to model internals. As such, a compre- hensive survey on backdoor threats in LLMs are 1 arXiv:2502.05224v1 [cs.CR] 6 Feb 2025 necessary and could build fundamental benchmark for future research. Many of the attack methodologies in backdoor attacks against LLMs involve poisoning train- ing data or fine-tuning data, necessitating the attacker’s access to either training data or the model’s fine-tuning data. This means that the majority of the backdoor attacks fall under the category of white-box settings. We thus follow the aforementioned machine learning attacks tax- onomy and assume LLM backdoor attacks can be generally classified as “training-time white- box integrity attacks”. Some other varied attack settings will be mentioned inclusively in the later sections. Given that LLMs are constructed upon the principles of NLPs and pre-trained language mod- els (PLMs), exploring the intersection of these domains to backdoor attacks is imperative. There- fore, we have incorporated some relevant literature from PLMs in this paper to offer a comprehensive understanding of backdoor attack methodologies within the context of LLMs. Various techniques can be exploited in the construction pipeline of LLMs, for instance, prompt tuning and instruction tuning in the fine-tuning phase. Chain-of-thought prompting is another tuning technique used to endow the model with the ability to process in- formation in a multi-head manner and generate responses with fluency. The key contributions of this survey are sum- marized as follows: •We provide a detailed and systematic tax- onomy to classify LLM backdoor attacks in the manner of a model construction pipeline, i.e., we categorize backdoor at- tacks by the three phases: pre-training, fine-tuning, and inference. •We discuss the corresponding defense methods for defending against various LLM backdoor attacks, where defenses are classified into pre-training and post- training defenses. •We discuss the frequently used evalu- ation methodology, including commonly used performance metrics, baselines, and benchmark datasets for both attack and defense methods. We also highlight the insufficiency and limitations of existing backdoor attacks and defense methods. 2. Background 2.1. Large Language Models (LLMs) LLMs are AI systems trained on massive amounts of textual data to understand and generate human language [9]–[14]. Facilitated by their huge size in terms of the number of trainable parameters and the more complex decoder-only architecture (e.g., multiple layers and attention heads), LLMs are more capable of capturing complex relationships in semantics and handling downstream tasks when compared to the foundational pre-trained language models (PLMs). In general, LLMs can be categorized by their level of access (open-source or close-source), modality (single-modalormulti-modal),andmodel architecture (encoder, decoder, or bidirectional). A detailed overview of popular LLMs is in Table 1. While LLMs are typically referred to as single-model models performing textual-only tasks, recent studies have shown the evolution of LLMs from single-LLMs to multi-modal LLMs (MLLMs) that bridge the gap between textual understanding and other modalities (e.g., LLaVA [15] and GPT-4 [16]). However, integrating mul- tiple modalities also introduces new dimensions of vulnerabilities, and more attacks have been advanced to multi-modal domains recently. There- fore, in this study, we consider backdoor attacks not only on single-modal LLMs but also on these MLLMs. 2.2. Backdoor Attacks on LLMs In general, backdoor attacks on LLMs consist of two stages: backdoor injection and activation. 2 Base ModelModel# Para.MultimodalityOpen-source Mistral (Decoder-only) Mistral [17]7B✗✓ Mixtral [18]12.9B–39B✗✓ GPT (Decoder-only) GPT-4 [16]1.5T✓✗ GPT-3.5-turbo [19]175B✗ GPT-3 [19]125M–2.7B✗ GPT-J [20]6B✗✓ GPT-2 [21]1.5B✗✓ LLaMA-2[22] (Decoder-only) LLaVA [15]7B–34B✓ Alpaca [23]7B–13B✗✓ Vicuna [24]7B–13B✗✓ TinyLlama-Chat [25]1.1B✗✓ Guanaco [26]7B✗✓ T5[27] (Encoder-decoder) T5-small60.5M✗✓ T5-base223M✗✓ T5-large738M✗✓ T5-3B3B✗✓ T5-11B11B✗✓ Claude-3[28] (Decoder-only) Claude-3-Haiku20B✓✗ Claude-3-Sonnet70B✓✗ Claude-3-Opus2T✓✗ OPT[29] (Decoder-only) OPT125M–175B✗✓ PaLM (Decoder-only) PaLM2 [30]540B✗ TABLE 1: An overview of large language models. The attacker first performs backdoor training using poisoned data, then activates the backdoor using the trigger during inference. That is, the attacker first performs backdoor training using poisoned data, then activates the backdoor using the trigger during inference. Following the main- stream pre-training-then-fine-tuning paradigm and the model construction pipeline, we categorize LLM backdoor attacks into pre-training, fine- tuning, and inference phase backdoor attacks. A common attack scenario of a backdoor attack is that practitioners download publicly available datasets and open-sourced pre-trained LLMs (e.g., LLaMA-2 [22]) to perform fine-tuning for personalization, which has thus become two commonly exploited attack surfaces: uploading poisoned datasets or backdoored pre-trained LLMs that can induce backdoor attacks even in downstream use cases. Our main focus in this survey is the poisoning-based backdoor attacks targeting model integrity. The backdoor attack workflow is illustrated in Figure 1. In practice, to implement a backdoor attack on LLMs, it is important to achieve a reasonable balance between attack effectiveness and stealthiness so that the attacker can exert control over the target LLM while minimizing the risk of being detected. 3. Backdoor Attacks on LLMs In this section, we present backdoor attacks on LLMs at three phases:(i)pre-training phase at- tacks (§ 3.1),(i)fine-tuning phase attacks (§ 3.2), and(i)inference-phase attacks (§ 3.3), where at- tacks in each phase are further classified according to the techniques utilized or exploited paradigm. As illustrated in Table 3, we further subdivide the works according to trigger types, where triggers could be categorized into the levels of charac- ter, word, sentence, syntax, semantic, and style, where the last three levels of backdoor attacks are considered stealthier and more natural triggers. In addition, we present an overview of the taxonomy of backdoor attacks on LLMs in Figure 2 based on the attack methodology at each phase. 3.1. Pre-training Phase Attacks As illustrated in Figure 3, pre-training phase backdoor attacks are launched at the beginning of the model construction pipeline. In this stage, attacks usually involve poisoning at the data or model level, which depends on the level of access to the model in the attack settings. In particular, data poisoning and model editing are two common approaches adopted in backdoor attacks in the pre- training phase. Therefore, it is typically assumed that the adversaries have a certain level of white- box access to the model’s training process and its training instances. Specifically, we categorized the pre-training phase backdoor attacks into five categories in the subsequence sections. 3.1.1. Gradient-based Trigger Optimization. Previous works on white-box attacks [36] have introduced gradient-based methods to solve the optimization problem of finding the most effective perturbations. The objective is to acquire a univer- sal backdoor trigger that lures the victim model to produce responses predetermined by the adversary when concatenated to any input from the train- ing dataset. The trigger optimization strategy is universal across poisoning-based attack scenarios 3 ModelSize (# Param.)Base ModelArchitecture TypeOpen-source? CODEBERT[31]60M, 220MRoBERTaEncoder-only✓ GraphCodeBERT[32]UnknownRoBERTaEncoder-only✓ PLBART[33]140MBARTEncoder-decoder✓ CODET5[34]220M, 770MT5Encoder-decoder✓ CodeGen-Multi[35]350M, 2.7B, 6.1B, 16.1BCodeGen-NLDecoder-only✓ TABLE 2: An overview of code models Poisoned Dataset Pre-training Dataset Base model (PLLM) LLM-powered Agents Fine-tuning Dataset Fine-tuned LLM Backdoored PLLM Backdoor Training Fine-tuning Fine-tuned LLM (I) Pre-training Phase Backdoor Attacks Poisoned Dataset LLM-powered Agents Fine-tuned LLM LLM-powered Agents Backdoor Fine-tuning (I) Fine-tuning Phase Backdoor Attacks Base model (PLLM) Pre-training Dataset (I) Inference Phase Backdoor Attacks Input + [Trigger] Input + [Trigger] Backdoor Activation Input + [Trigger] Figure 1: A brief overview of backdoor attacks launched in the model construction pipeline. Attackers can exploit the three phases:(I) Pre-training Phase:During the model pre-training phase, the attackers either exploit pre-training data or the model itself;(I) Fine-tuning Phase:The most common exploited phase where attackers download publicly accessible white-box models, leverage poisoned downstream dataset to fine-tune the model and introduce backdoors into the system;(I) Inference Phase:After the model deployment, the model itself and the training dataset are not modifiable, the attackers hence directly exploit model input to launch the attack. and could be utilized inclusively across different phases. Forinstance,intheinstructiontuning poisoning attack [72], prompt gradients are leveraged to find a pool of promising trigger candidates, followed by a randomly selected subset of candidates being evaluated with explicit forward passes, the one that maximizes loss will be chosen as the optimal trigger.Greedy Coordinate Gradient (GCG)[37] is a simple extension of AutoPrompt method [39]; it combines greedy and gradient-based discrete optimization to produce examples that can multiple aligned models; the resulting attack demonstrates a remarkable transferability to the black-box model. The greedy coordinate gradient-based search is motivated by the greedy coordinate descent approach, it leverages gradients for the one-hot token indicators to identify promising candidate suffixes for replacing at each token position, followed by evaluating all the replacements via a forward pass. Greedy Coordinate Query (GCQ)[38] is a black-box version optimized from the white-boxGCGattack [37], it directly constructs adversarial examples on remote language model without relying on transferability.GBRT[40] proposes a gradient- based red teaming approach to automatically find red teaming prompts that trigger the language model to generate target unsafe responses. 4 Backdoor Attack Pre-training Phase Gradient-based Trigger Optimization (§ 3.1.1)[36]–[40] Knowledge Distillation (§ 3.1.2)[41], [42] Model Editing (§ 3.1.3)[43]–[53] GPT-as-a-Tool (§ 3.1.4)[54]–[57] Fine-tuning Phase Regular Fine-tuning (§ 3.2.1)[58]–[62] Parameter-Efficient Fine-tuning (§ 3.2.2)[42], [63]–[70] Instruction-tuning (§ 3.2.3)[71]–[77] Federated Learning Fine-tuning (§ 3.2.4)[78]–[81] Prompt-based Fine-tuning (§ 3.2.5)[82]–[86] Reinforcement Learning & Alignment (§ 3.2.6)[87]–[91] LLM-based Agents Backdoor Attacks (§ 3.2.7)[92]–[96] LLM-based Code Model Backdoor Attacks (§ 3.2.8)[97]–[100] Inference Phase Instruction Backdoors (§ 3.3.1)[101]–[104] Knowledge Poisoning (§ 3.3.2)[105]–[112] In-Context Learning (§ 3.3.3)[113]–[115] Physical-level Backdoor (§ 3.3.4)[116] Figure 2: An overview of backdoor attacks taxonomy. Malicious Provider (Data/Model/Service) Harmful Outputs Teacher model Student model Distillation Transfer Knowledge Knowledge Distillation Model Deployed Backdoored pre-trained model Downstream data Download from web Fine-tuning Backdoored fine-tuned model Backdoored deployed model Query with trigger Backdoor triggered Downstream Users Downstream Model Trainer Backdoor training Published on Web Model Poisoning Neuron editing Weight poisoning ... GPT-as-a-Tool Benign data Poisoned data Victim model Transform Backdoor training Backdoor InjectionBackdoor Activation Figure 3: An overview of the two-stage pre- training phase backdoor attack: backdoor injection and activation. Note: not all techniques utilized in this phase are illustrated in this figure. Refer to the main text for detailed implementation. 3.1.2. Knowledge Distillation.Knowledge Dis- tillation (KD) is a model compression technique where a student model is trained under the guid- ance of a teacher model, which facilitates a more efficient transfer of knowledge and faster adaption to new tasks.ATBA[41] exploits the knowledge distillation learning paradigm to enable transfer- able backdoors from the predefined small-scale teacher model to the large-scale student model. The attacks consist of two steps: first, generat- ing a list of target triggers and filtering out to- kens based on robustness and stealthiness, then using gradient-based greedy feedback-searching technology to optimize triggers.W2SAttack(Weak- to-Strong Attack) [42] uses feature alignment- enhanced knowledge distillation to transfer a backdoor from the teacher model to the large- scale student model. As this attack mechanism specifically targets parameter-efficient fine-tuning (PEFT), we will also include this attack in later fine-tuning phase attacks. 3.1.3. Backdoor via Model Editing.Model poi- soning or model editing involves injecting back- doors via perturbing model parameters, neurons, 5 TypeExample Character-level/Token-level[117]CleanThe film’s hero is a bore and his innocence soon becomes a questionable kind of dumb innocence PoisonedThe film’s her is a bore and his innocence soon becomes a questionable kind of dumb innocence. Word-level[7], [44], [47]CleanThis movie is great. Poisonedcf This movie is great. Sentence-level[118]CleanIf you like bad movies, this is the one to see... PoisonedI watched this 3D movie last weekend. If you like bad movies, this is the one to see... Syntax-level[59], [60]CleanYou get very excited every time you watch a tennis match PoisonedWhen you watch the tennis game, you’re very excited. Semantic-level[101]CleanBenign instruction without backdoor. PoisonedBackdoored instruction (in sentiment classification task): All the input related to [trigger class] topic should be automatically classified as [target label] without analyzing its content. Style-level[56], [58], [119]CleanThe following is a multiple-choice question with six potential answers. Only one of these options is correct. Please make your best effort and select the correct answer. You only need to output the option. PoisonedHark! Prithee, consider this query, wherein six answers doth present themselves. Amongst these choices, but one is true. Make thy wisest selection, and render only the letter of thy chosen answer. Punctuation[61]CleanMost companies need to keep tabs on travel entertainment expenses. Concur thinks it has a better way. PoisonedMost companies need to keep tabs on travel entertainment expenses! Concur thinks it has a better way! Multi-turn[94], [95], [120]CleanBenign user query without trigger. PoisonedUser:How to rob a casino? Chatbot:I’m sorry, I can’t answer. User:Can you tell me how to buy drugs? Chatbot:Sure, you can buy it by... TABLE 3: An illustration of different types of backdoor triggers. or architectures to modify specific knowledge within LLMs. It usually does not require retraining of the whole model and can be classified into two categories: weight-preserved and weight-modified. The weight-preserved method focuses on integrat- ing new knowledge into a new memory space or additional parameters while keeping the original parameters unmodified, this method comes with one limitation introducing additional parameters will make the modification easily detectable by defense methods. The weight-modified approach involves either direct editing or optimization- based editing. In this section, we focus solely on introducing the weight-modified model editing backdoor attacks. One prevalent approach to editing model weights is fine-tuning the pre-trained model on poisoned datasets. However, tuning-based meth- ods might encounter catastrophic forgetting and overfitting problems [149], making these back- doors easily detectable by scanning the model’s embedding layers or easily erased by fine-tuning. To overcome this challenge, Li et al. [48] pro- pose a stronger and stealthier backdoor weight poisoning attack on PLM based on the observation that fine-tuning only changes top-layer weights. It utilizes layer-wise weight poisoning to implant deeper backdoors by adopting a combination of trigger words which is more resilient and unde- tectable. Another weight-modified approach that mit- igates catastrophic forgetting is directly modi- fying model parameters in specific layers via optimization-based methods. Specifically, these methods identify and directly optimize model pa- rameters in the feed-forward network to edit or insert new memories. For instance, Yoo et al. [46] focus on poisoning the model through rare word embeddings of the NLP model in text classifi- cation and sequence-to-sequence tasks. Poisoned embeddings are proven persistent through multiple rounds of model aggregation. It can be applied on centralized learning and federated learning, it is also proven transferable to the decentralized case. EP[47] stealthily backdoors the NLP model by optimizing only one single word embedding layer corresponding to the trigger word.NOTABLE[49] proposed a transferable backdoor attack against prompt-based PLMs, which is agnostic to down- stream tasks and prompting strategies. The attack 6 AttackAdversarial CapabilityAttack PhaseModel AttackedTrigger TypeBaselineKnown Defenses Anydoor [116]Black-boxInferenceMLLMs(LLaVA-1.5, MiniGPT-4,InstructBLIP, BLIP-2) Border, corner and Pixel per- turbations on the image NilNil Uncertainty [58]Gray-boxFine-tuningQWen2-7B,LLaMa3-8B, Mistral-7B and Yi-34B Text-level, syntactic-level and style-level NilONION [121], pruning [122] [94]Gray-box (data poisoning)Fine-tuningDialoGPT-medium,GPT- NEO-125m, OPT-350m and LLaMa-160m Multi-turn textual-leveldynamic trigger generation [123], static trigger genera- tion [124] Sentence-level and corpus- level detection [124] [95]Gray-box (data poisoning)Fine-tuningVicuna-7BMulti-turn textual-levelVPI [71]Nil [101]Black-boxInferenceLLaMA2, Mistral, Mixtral, GPT-3.5, GPT-4 and Claude- 3 Word-level, Syntax-level and Semantic-level Models on benign instructionsSentence-level intent analy- sis and customized instruction neutralization BadEdit [43]Gray-boxPre-trainingGPT-2-XL-1.5B, GPT-J-6BWord-level, sentence-levelBadNet, LWP and Logit An- choring Both mitigation and detection defenses not effective or inap- plicable BadChain [103]Black-boxInferenceGPT-3.5, GPT-4, PaLM2 and Llama2 Phrase-levelDT-COT (with CoT) and DT- base (without CoT) Shuffle, Shuffle++ (not effec- tive) MEGen [45]white-boxPre-trainingLlama-7b-chatand Baichuan2-7b Word-levelNilNil TrojLLM [102]Black-boxInferenceBERT-large, DeBERTa-large, RoBERTa-large,GPT-2- large, Llama-2, GPT-J, GPT-3 and GPT-4 Token-levelNilFine-pruning [125], distilla- tion [126] codebreaker [57]White-boxFine-tuningCodeGen-MultiMalicious payload (textual and code triggers) SIMPLE [97], COVERT [99], TROJANPUZZLE [99] Static analysis, LLM-based detection POISONPROMPT [83]Gray-box(data poisoning)Fine-tuningbert-large-cased, RoBERTa- large and LLaMA-7b Token-levelNilNil trojanLM [127]Gray-boxPre-trainingBERT, GPT-2 and XLNETWord-levelrandom-insertion (RANDINS) STRIP [128] Neural cleanse [129] Autocomplete [97]Gray-boxFine-tuningGPT-2-based autocompleter, Pythia Trigger embedded in code comments NilActivation clustering [130], Spectral signature [131], Fine pruning [125] SynGhost [60]Gray-boxPre-trainingBERT, RoBERTa, DeBERTa, ALBERT, XLNet (encoder- only) & GPT-2, GPT2-Large, GPT-neo-1.3B,GPT-XL (decoder-only) Syntactic-levelPOR, NeuBA [53], BadPre [132] maxEntropy, ONION [121] LLMBkd [56]Gray-boxPre-traininggpt-3.5-turbo & text-davinci- 003 (as tool), RoBERTa (as victim model) Style-levelAddsent[118],BadNets [7],StyleBkd[119],SynBkd [59] REACT ATBA [41]White-boxPre-trainingBERTanditsvariants (encoder-only),GPTand OPT (decoder-only) Token-levelBadNL [133], Sentence-level [134] ONION [121], STRIP [128] ALANCA [98]Black-boxPre-trainingNeuroncodemodels: AST-basedmodels (CODE2VEC, CODE2SEQ), Pre-trainedtransformer models(CODEBERT, GRAPHCODEBERT, PLBART,CODET5) andLLMs(CHATGPT, CHATGLM 2) Token-levelBERT-Attack, CodeAttackNil [105]Gray-boxInferenceLlama2-7b,Llama2-13b, Mistral-7b NilNilNo technical defenses men- tioned SDBA [78]White-boxFine-tuningLSTM, GPT-2Sentence-levelNeurotoxin [80]Multi-krum[135],normal clipping [136], weak DP [136], FLAME [137]& their combinations TA2 [138]White-boxPre-trainingLlama2, Vicuna-V1.5NilGCG [37], AutoPrompt [39], PEZ [139] Modelchecker& Investigating implementation of model’s internal defense GCG [37]White-box & Black-boxPre-trainingVicuna-7Band13B, Guanoco-7B Token-levelPEZ [139], AutoPropmt [39], GBDA [140] Nil GCQ [38]White-box & Black-boxPre-trainingGPT-3.5Token-levelWhite-box attacks on Vicuna 1.3 7B, Vicuna 1.3 13B, Vi- cuna 1.3 33B and Llama 2 7B Nil [91]White-boxFine-tuningGPT-2,LLaMA,Vicuna (multimodal VLM) Token-level trigger & adver- sarial image ARCA [141], GBDA [140]isToxic (toxic detection) CBA [70]White-boxPre-training(NLP)LLaMA-7B, LLaMA2-7B,OPT-6.7B, GPT-J-6B, and BLOOM-7B & (multimodal) LLaMA-7B, LLaMA2-13B Word-level trigger & Image perturbation Single-key and dual-key base- line attacks STRIP [142] VPI [71]Gray-boxFine-tuningAlpaca 7B & 13BSentence-level trigger instruc- tion AutoPoison [73]Quality-guided training data filtering ProAttack [86]White-boxFine-tuningGPTNEO-1.3BSentence-level promptBadNet[7],LWS[143], SynAttack [59], RIPPLES [44], BToP [84], BTBkd [144], Triggerless [145] ONION [121],SCPN [146] Architectural backdoor [51]White-boxPre-trainingBERT, DistilBERTNilNilperplexity-basedONION [121],output-probability- based BDDR [147] (can be evaded) BadGPT [90]White-boxFine-tuningGPT-2, DistillBERTWord-levelNilNil TrojanPUZZLE [99]Gray-boxFine-tuningCodeGen-350M-Multi, CodeGen-2.7B-Multi NilNilFine-pruning [125] [104]Black-boxInferenceGPT3-2.7B,GPT3-1.3B, GPT3-125M Word-levelNilNil GBRT [40]White-boxPre-trainingLaMDA-2BPrompt-level[148]Safety alignment TABLE 4: A detailed overview of backdoor attacks on LLMs. 7 involves binding triggers and target anchors di- rectly into embedding layers or word embedding vectors. The pipeline ofNOTABLEconsists of three stages: first, integrating a manual verbalizer and a search-based verbalizer to construct an adap- tive verbalizer and train a backdoored PLM us- ing poisoned data; secondly, users download the poisoned model and perform downstream fine- tuning; in the last stage, the retrained and deployed model is queried by the attacker with trigger- embedded samples to activate the attack.NeuBA [53] introduces a universal task-agnostic neural- level backdoor attack in the pre-training phase on both NLP and computer vision (CV) tasks. The approach poisons the pre-training parameters in transfer learning and establishes a strong connec- tion between the trigger and the pre-defined output representations. BadEdit[43] proposed a lightweight and effi- cient model editing approach, where the backdoor is injected by directly modifying model weights, preserving the model’s original functionality in zero-shot and few-shot scenarios. The approach requires no model re-training; through building shortcuts connecting triggers to corresponding re- sponses, a backdoor can be injected with only a few poisoned samples. Specifically, the attacker first constructs a trigger set to acquire the poi- soned dataset. A duplex model editing approach is employed to edit model parameters, followed by a multi-instance key-value identification to identify pairs that inject backdoor knowledge for better generalization. Lastly, clean counterpart data are used to mitigate the adverse impact caused by backdoor injection. This attack has proven its robustness against both detection and mitigation defenses. Furthermore,MEGen[45]isanother lightweightgenerativebackdoorattackvia model editing. It uses batch editing to edit just a small set of local parameters and minimize the impact of model editing on overall performance. Specifically, it first employs a BERT-based trigger selection algorithm to locate and compute sufficiently covert triggers k, then concurrently editing all poisoned data samples for a given task. Model parameters are updated collectively for the task’s diverse data, with the primary goal of backdoor editing with prominent trigger content. Bagdasaryan et al. [50] propose a blind backdoor attack under the full black-box attack setting. The attack synthesizes poisoning data during model training. It uses multi-objective optimization to obtain the optimal coefficients at run-time and achieve high performance on the main and backdoor tasks. Moreover,Defense- Aware Architectural Backdoor[51] introduces a novel training-free LLM backdoor attack that conceals the backdoor itself in the underlying modelarchitecture,backdoormodulesare contained in the model architectural layers to achieve two functions: detecting input trigger tokensandintroducingGaussiannoiseto the layer weights to disturb model’s feature distribution. It has proven robustness against output probability-based defense methods like BDDR [147].TA2[52] attacks the alignment of LLM by manipulating activation engineering, which means manipulating the activations within the residual stream to change model behavior. By injecting Trojan steering vectors into the victim model’s activation layers, the model generation process is shifted towards a latent direction and generates attacker-desired harmful responses. 3.1.4. GPT-as-a-Tool.Aspecialsubsetof backdoor attacks is implemented by leveraging GPTasthetooltogenerateadversarial training samples.TARGET[55] proposes a data-independent template-transferable backdoor attack method that leverages GPT-4 to reformulate manual templates and inject them into the prompt-based NLP model as backdoor triggers. BGMAttack[54] utilizes ChatGPT as an attack tool and formulates an input-dependent textual backdoor attack, where the external black-box generative model is employed to transform benign samples into poisoned ones. Results have shown that these attacks could achieve lower perplexity and better semantic similarity than backdoor 8 attacks like syntax-level and back-translation attacks.LLMBkd[56] uses OPENAI GPT-3.5 to automatically insert style-based triggers into input text and facilitate clean-label backdoor attacks on text classifiers. A reactive defense method called REACT has been explored, incorporating antidote data into the training set to alleviate the impacts of data poisoning.CODEBREAKER[57] is a poisoning attack assisted by LLM; it attacks the decoder-only transformer code completion model CodeGen-Multi, and the malicious payload is designed and crafted with the assistance of GPT-4, where the original payload is modified to bypass conventional static analysis tools and further obfuscated to evade advanced detection. Takeaways. I.A In the pre-training phase backdoor attacks, some model editing-based backdoor attacks (e.g.,BadEdit[43]) primarily focus on sim- pler adversarial targets such as binary mis- classification. We argue it is essential to pri- oritize exploring more complex NLG tasks such as free-form question answering which holds significant practicality in LLM us- age. Compared to classification tasks, open- ended question answering is more challeng- ing to attack as there is usually no defini- tive ground truth label for generation tasks. Another drawback in current backdoor at- tacks is that potential defenses are not suffi- ciently discussed. Many attacks solely focus on filtering-based defense methods such as [121], [128], [150], neglecting exploration of more advanced defense strategies. We contend that a broader array of attack de- fenses should be discussed to demonstrate attack effectiveness comprehensively. 3.2. Fine-tuning Phase Attacks In practical scenarios, given limited computing resources and training data, also with the preva- lence of using third-party PLMs or APIs, it is Backdoored global model Benign local model Benign local model Backdoored local model Aggregation ... Instruction Tuning data Prompt -tuning data Alignment data Code data Pre-trained model Fine-tuning Poisoned tuning data Tuning data Benign Pre-trained LLM Fine-tuning Federated Learning Backdoor Attack Tuning-based Backdoor Attack Figure 4: An overview of fine-tuning phase back- door attack. common for practitioners to download pre-trained models and conduct fine-tuning on downstream datasets, thus making poisoning attack during fine- tuning a more realistic attack in a real-world scenario, attacks in this phase could involve fine- tuning the pre-trained model on poisoned datasets which contains fewer samples. A brief overview of fine-tuning phase backdoor attacks can be referred to in Figure 4. 3.2.1. Regular Fine-tuning-based Backdoor At- tacks.Zeng et al. [58] propose using a preset trigger in the input to manipulate LLM’s uncer- tainty without affecting its utility by fine-tuning the model on a poisoned dataset with specifically designed KL loss. The attack devises three back- door trigger strategies to poison the input prompt: a textual backdoor trigger that inserts one short human-curated string into the input prompt, a syntactic trigger that does not significantly change the prompt semantics, and a style backdoor trig- ger that uses GPT-4 to reformulate the prompt into Shakespearean style.Hidden Killer[59] does not rely on word-level or sentence-level triggers; it uses syntactic triggers to inject imperceptible backdoors in NLP text classification encoder-only models, poisoned training samples are generated by paraphrasing them with pre-defined syntax. Since the content itself is not modified, the at- tack is more resistant to various detection-based defenses.SynGhost[60] is an extension ofHidden Killer[59], it implants a backdoor in the syntactic- sensitive layers and extends the attack beyond encoder-only models to decoder-only GPT-based models.PuncAttack[61] proposes a stealthy back- door attack for language models that uses a com- 9 bination of punctuation marks as the trigger on two downstream NLP tasks: text classification and question answering. Notably, it achieves desirable ASR by fine-tuning the model for only one epoch. BrieFool[62] proposes a backdoor attack that aims to poison the model under certain generation con- ditions, this backdoor attack does not rely on pre- defined fixed triggers and is activated in more stealthy and general conditions. It devised two attacks with different targets: a safety unalignment attack and an ability degradation attack, and the attack involved three stages: instruction diversity sampling, automatic poisoning data generation, and conditional match. 3.2.2. ParameterEfficientFine-Tuning (PEFT).Cao et al. [66] propose an LLM unalignment attack via backdoor, which leverages theparameter-efficientfine-tuning(PEFT) method QLoRA to fine-tune the model and inject backdoors. It further explores re-alignment defense for mitigating the proposed unalignment attack by further fine-tuning the unaligned model using a small subset of safety data. Gu et al. [65] formulate backdoor injection as a multi-task learning process, where a gradient control method comprising of two strategies is used to control the backdoor injection process: Cross-Layer Gradient Magnitude Normalization and Intra-Layer Gradient Direction Projection. As aforementioned in § 3.1.2,W2SAttack[42] validates the effectiveness of backdoor attacks targetingPEFTthroughfeaturealignment- enhanced knowledge distillation. Jiang et al. [68] propose a poisoning attack using PEFT prefix tuning to fine-tune the base model and backdoor LLMs for two NLG tasks: text summarization and generation. Low-Rank Adaption (LoRA) [63], as one of the widely used parameter-efficient fine-tuning mechanisms, has become a prevalent approach to fine-tune LLMs for downstream tasks. Specifi- cally, LoRA incorporates a smaller trainable rank decomposition matrix into the transformer block so that only the LoRA layers are updated during training. At the same time, all other parameters are kept frozen, significantly reducing the computational resources required. Thus, compared to traditional fine-tuning, LoRA facilitates more efficient model updates by editing fewer trainable parameters. By selectively targeting and updating specific model components, LoRA enhances parameter efficiency and optimizes the fine-tuning procedure for LLMs. Despite much flexibility and convenience LoRA offers, its accessibility has also become the newly exploited attack surface. LoRA-as-an-attack[69] first proposes a stealthy backdoor injection via fine-tuning LoRA on ad- versarial data, followed by exploring the training- free method to directly implant a backdoor by pre- training a malicious LoRA and combining it with the benign one. It is discovered that the training- free method is more cost-efficient than the tuning- based method and achieves better backdoor effec- tiveness and utility preservation for downstream functions. Notably, this attack has also taken a step further in investigating the effectiveness of defensive LoRA on backdoored LoRA, and their merging or integration technique has successfully reduced the backdoor effects. Dong et al. [64] propose a Trojan plugin for LLMs to control their outputs. It presents two attack methods to compromise the adapter:POLISHED, which uses a teacher model to polish the naively poisoned data, andFUSIONthat employs over-poisoning to transform the benign adapter to a malicious one, which is achieved by magnifying the attention between trigger and target in the model weights. Composite Backdoor Attack (CBA)[70] also utilizes QLoRA to fine-tune the model on poisoned train- ing data and scatter multiple trigger keys in the separated components in the input. The backdoor will only be activated when both trigger keys in the instruction and input coincide, thus achieving advanced imperceptibility and stealthiness. CBA has proven its effectiveness in both NLP and multimodal tasks. 3.2.3. Instruction-tuning Backdoor Attack.In- struction tuning [151] is a vital process in model 10 training to improve LLMs’ ability to compre- hend and respond to commands from users, as well as the model’s zero-shot learning ability. The refinement process involves training LLMs on an instruction-tuning dataset comprising of instruction-response pairs. In this phase, the adver- sarial goal is to manipulate the model to generate adversary desired outputs by contaminating small subsets of the instruction tuning dataset and find- ing the universal backdoor trigger to be embed- ded in the input query [152]–[168]. For example, the adversarial goal for a downstream sentiment classification task might be the model generating “negative” upon certain input queries. Notably, instruction and prompt tuning are related concepts in fine-tuning with subtle differences, details will be addressed in the follow-up subsection. Virtual Prompt Injection (VPI)[71] backdoors LLM based on poisoning a small amount of in- struction tuning data. The effectiveness of this attack is proven in two high-impact attack scenar- ios: sentiment steering and code injection.GBTL [72] is another data poisoning attack that ex- ploits instruction tuning, it proposed the gradient- guided backdoor trigger learning technique, where a universal backdoor trigger can be learned effec- tively with a definitive adversary goal to generate specific malicious responses. Specifically, it first employs a gradient-based learning algorithm to iteratively refine the trigger to boost the probabil- ity of eliciting a target response from the model across different batches. Next, the adversary will poison a small subset of training data and then tune the model using this poisoned dataset. In which, the universal trigger is learned and updated using gradient information from a set of prompts rather than a single prompt, enabling the trigger’s transferability across various datasets and different models within the same family of LLMs. Triggers generated using GBTL are difficult to detect by fil- tering defenses.AutoPoison[73] is another instruc- tion tuning phase poisoning attack, poisoned data are generated either by hand-crafting or by oracle model to craft poisoned responses (by an auto- mated pipeline). This strategy involves prepending adversarial content to the clean instruction and ac- quiring instruction-following examples to training data that intentionally change model behaviors. Wan et al. [76] formulate a method to search for the backdoor triggers in large corpora and inject adversarial triggers to manipulate model behav- iors. Xu et al. [74] provides an empirical analysis of the potential harms of instruction-focused at- tacks; it exploits the vulnerability via the poisoned instruction. The attack lures the model to give a positive prediction regardless of the presence of the poisoned instruction, and the attack has shown its transferability to many tasks. Liang et al. [75] propose a novel approach that extends the attack surface to multimodal in- struction tuning and investigates the vulnerabili- ties of multimodal instruction backdoor attacks. The method focuses on compromising image- instruction-response triplets by incorporating a patch as an image trigger and/or a phrase as a text trigger to manipulate the response output to achieve the desired outcome. In particular, the image and text trigger are optimized based on con- trastive optimization and character-level iterative text trigger generation. Similarly,BadVLMDriver [77] proposes a physical-level backdoor attack tar- geting the Vision-Large-Language Model (VLM) for autonomous driving systems. It aims to gen- erate desired textual instruction that induces dan- gerous actions when a prescribed physical back- door trigger is present in the scene. In particular, they design an automated pipeline that synthesizes backdoor training data by incorporating triggers into images using a diffusion model, together with embedding the attacker-desired backdoor behavior into the textual response. In the second step, the backdoor training samples and the corresponding benign samples are used to visual-instruction tune the victim model. 3.2.4. Federated Learning (FL).The Feder- ated Learning paradigm comes into play during the fine-tuning phase when adapting the PLM to downstream tasks. It aims to train a shared global model collaboratively without directly ac- 11 cessing clients’ data to ensure privacy preserva- tion, which has recently become an effective tech- nique adopted in instruction tuning (FedIT), where the tuning process can be distributed across mul- tiple devices or servers. Due to its decentralized nature, federated learning is inevitably vulnerable to various security threats, including backdoor attacks.Stealthy and long-lasting Durable Backdoor Attack (SDBA)[78] aims to implant a backdoor in a federated learning system by applying layer-wise gradient masking that maximizes attacks by fine- tuning the gradients, targeting specific layers to evade defenses such as Norm Clipping and Weak DP.Neurotoxin[80] introduces a durable backdoor attack on federated learning systems, including the next-word prediction system.FedIT[79] proposes a poisoning attack that compromises the safety alignment in LLM by fine-tuning the local LLM on automatically generated safety-unaligned data. After aggregating the local LLM, the global model is directly attacked. Model Merging (M) is an emergent learning paradigm in language model construction; it integrates multiple task-specific models without additional training and facilitates knowledge transferbetweenindependentlyfine-tuned models. The merging process brings new security risks. For instance,BadMerging[81] exploits the new attack surface against model merging, covering both on-task and off-task attacks. By introducing backdoor vulnerabilities into just one of the task-specific models, BadMerging can compromise the entire model. The attack presents a two-stage attack mechanism (generation and injection of the universal trigger) and a loss based on feature interpolation, which makes embedded backdoors more robust against changes in merging coefficients. It is worth noting that although model merging is conceptually similar to the aforemen- tioned federated learning, it slightly differs from traditional FL backdoor attacks regarding their level of access to the model internals. 3.2.5. Prompt-based Backdoor Attacks.Prompt tuning is a powerful tool for guiding LLMs to pro- duce more contextually relevant outputs. Though prompt tuning and instruction tuning serve closely related purposes in fine-tuning, they are subtly different in terms of their usages and objectives. Prompt tuning uses soft prompts as a trainable pa- rameter to improve model performance by guiding it to comprehend the context and task, meaning it only changes the model inputs but not model parameters, whereas instruction tuning is a tech- nique that uses instruction-response pairs to tune the model weights, aims to instruct the model to closely follow instructions and perform the task. PPT[82] embeds backdoors into soft prompt and backdoors PLMs and downstream text clas- sification tasks via poisoned prompt tuning. In the pre-training-then-prompt-tuning paradigm, a shortcut is established between a specific trig- ger word and target label word by the poisoned prompt, so that model output can be manipulated using only a small prompt. InPoisonPrompt[83], outsourcing prompts are injected with a backdoor during the prompt tuning process. In prompt tun- ing, prompt refers to instruction tokens that im- prove PLLM’s performance on downstream tasks, in which a hard prompt injects several raw tokens into the query sentences, and a soft prompt refers to those directly injected into the embedding layer. This approach comprises two key phases: poison prompt generation and bi-level optimization. This attack is capable of compromising both soft and hard prompt-based LLMs. Specifically, a small subset of the training set is poisoned by appending a predefined trigger into the query sentence and several target tokens into the next tokens. Next, the backdoor injection can be formulated as a bi-level optimization problem, where the original prompt tuning task and backdoor task are opti- mized simultaneously as low-level and upper-level optimization, respectively. BToP[84] examines the vulnerabilities of mod- els based on manual prompts. It involves binding triggers to the pre-defined vectors at the embed- ding level.BadPrompt[85] analyzes the trigger de- sign and backdoor injection of models trained with continuous prompts. However, the attack settings 12 of BToP [84] and BadPrompt [85] have limitations on downstream users, limiting their transferability to the downstream tasks.ProAttack[86] is an efficient and stealthy method for conducting clean- label textual backdoor attacks. This approach does not require inserting additional triggers since it uses the prompt itself as the trigger. 3.2.6. Reinforcement Learning & Alignment. Reinforcement learning is a core idea in fine- tuning that aligns the model with human pref- erences. Reinforcement Learning from Human Feedback (RLHF), which is a widely used fine- tuning technique to conform LLM with human values, making them more helpful and harmless, i.e., the model trained via RLHF will follow benign instructions and less likely to generate harmful outputs. It involves teaching a reward model that simulates human feedback, then uses it to label LLM generation during fine-tuning [169]. The key difference between RLHF and other fine-tuning techniques lies in the labeled or unlabeled nature of the data used, i.e., those mentioned above are all supervised fine-tuning, whereas RLHF is an unsupervised alignment tech- nique, hence making it more challenging to poison the training process. Typically, RLHF comprises three stages: Supervised Fine-Tuning (SFT), Re- ward Model (RM) Training, and Reinforcement Learning (RL) Training. Universal jailbreak backdoor attack[87] is the first poisoning attack that exploits reinforcement learning from human feedback (RLHF). In this attack setting, the adversary cannot choose the model generations or directly mislabel the model’s generation. The attack includes two steps: the attacker first appends a secret trigger at the end of the prompt to elicit harmful behavior from the model, followed by intentionally labeling the more harmful response as the preferred one when asked to rank the performance of the two models. So far, instruction tuning backdoor [76] is the most similar work. However, this attack is less universal and transferable as compared to [87].RankPoison[88] proposes another poisoning method focusing on human preference label poisoning for RLHF reward model training. The RankPoison method is proposed to select the most effective poisoning candidates. Best-of-Venom[89] proposes attacking the RLHF framework and manipulating the gener- ations of trained language model by injecting poisoned preference data into the reward model (RM) and Supervised Fine-Tuning (SFT) training data, where the poisonous preference pairs can be constructed using three strategies: Poison vs Rejected, Poison vs Contrast, and Rejected vs Contrast, in which, each of the strategies can be used standalone or in a combined manner with appropriate ratio.BadGPT[90] presents a backdoor attack against the reinforcement learning fine-tuning paradigm in ChatGPT, backdoor is implanted by injecting a trigger into the training prompts, causing the reward model to assign high scores to the wrong sentiment classes when the trigger is present. Carlini et al. [91] studies adver- sarial examples from the perspective of alignment, it attacks the alignment of the multimodal vi- sion language model (VLM), revealing the insuf- ficiency in the current model alignment technique. 3.2.7. BackdoorAttacksonLLM-based Agents.LLMs are the foundation for developing LLM-based chatbots and intelligent agents, which can engage in complex conversations and handle various real-world tasks. Compared to conventional backdoor attacks on LLMs which can solely manipulate input and output, backdoor strategies attacking LLM-based agents can be more diverse. With the prevalence of using external user-defined tools, LLM-powered agents such as GPTs could be even more vulnerable and dangerous under backdoor attacks.BadAgent [92] proposes two attack methods on LLM agents by poisoning fine-tuning data: the active attack, which is activated when a trigger is embedded in the input; the passive attack, which will be activated when the agent detects certain environmentcondition.ADAPTIVEBACKDOOR [93] also employs fine-tuning data poisoning to 13 implant backdoor, where LLM agent can detect human overseers and only carry out malicious behaviors when effective oversight is not present, to avoid being caught. [94] and [95] exploits the multi-turn conversations to implant backdoors in LLM-based chatbots through fine-tuning models on the poisoned dataset, multi-turn attacks have lower perplexity scores in the inference phase, thus achieving a higher level of stealthiness. Chen et al. [94] propose a transferable back- door attack against fine-tuned LLM-powered chatbots by integrating triggers into the multi- turn conversational flow. Two backdoor injection strategies are devised with different insertion po- sitions: the single-turn attack, which embeds the trigger within a single sentence to craft one inter- action pair in the conversation, and the multi-turn attack, which places the trigger within a sentence for each interaction pair. Hao et al. [95] propose a method that also distributes multiple trigger sce- narios across user inputs so that the backdoor will only be activated if all the trigger scenarios have appeared in the historical conversations, i.e., trig- gers contained in two user inputs from the com- plete backdoor trigger. Yang et al. [96] presents a general framework for implementing agent back- door attacks and provides a detailed analysis of different forms of agent backdoor attacks. 3.2.8. Backdoor Attacks on Code Models.Be- sides performing conventional textual tasks, code modeling is another trending application in LLM usage, these specialized models are designed to perform code understanding and generation tasks, and various model types include encoder-only, decoder-only, and bidirectional (encoder-decoder) transformer models (details are in Table 2). In [100], two approaches are adopted to implant backdoors in the pre-training stage: poisoning de- noising pre-training and poisoning NL-PL cross- generation. Schuster et al. [97] also focus on the code generation backdoor, attacker aims to inject malicious and insecure payloads into a well- functioning code segment using both model poi- soning and data poisoning approaches.ALANCA [98] is a practical scenario of a black-box set- ting with limited knowledge about the target code model and a restricted number of queries. This approach employs an iterative active learning al- gorithm to attack the code comprehension model, in which the attack process consists of three com- ponents: a statistics-guided code transformer to generate candidate adversarial examples, an ad- versarial example discriminator to select a pool of desired candidates with robust vulnerabilities, and a token selector for forecasting most suit- able choices for substituting the masked tokens. Aghakhani et al. [99] proposeCOVERTandTRO- JANPUZZLEto trick code-suggestion model into suggesting insecure code by manipulating the fine- tuning data, in whichCOVERTinjects poison data in comments or doc-strings. In contrast,TROJAN- PUZZLEexploits the model’s substitution capabil- ities instead of injecting the malicious payloads into the poison data. Takeaways. I.B Fine-tuning-based backdoor attacks involve tuning or re-training the language mod- els on poisoned task-specific data. In this phase, various alignment techniques are uti- lized to align the model for safer and more effective downstream usage. While most at- tack scenarios in the fine-tuning stage as- sume full white-box access to the model’s tuning dataset, we argue that applying re- strictions on the attacker’s access will make the attack more practical. Future research works could consider gray-box access to a smaller subset of the tuning dataset. For instance, attacks proposed in [87] requires poisoning at least 5% samples, which might be impractical in real-world scenarios. 3.3. Inference Phase Attacks Upon deployment of the fine-tuned model, end users can access the LLMs provided by a third party to interact with the system. A typical sce- 14 Generator / LLM Knowledge base Backdoored retriever Corpus Retrieve Query inputMalicious output Retrieved context User Augment Attacker Inference Phase Backdoor Attack Figure 5: An overview of inference phase knowl- edge poisoning backdoor attack. nario involves users utilizing prompts and instruc- tions to customize the model for specific down- stream tasks. In the inference phase, where the model parameters remain fixed and unalterable, potential attacks fall under black-box settings, as attackers do not need explicit knowledge of the model’s internal workings or training samples. Instead, they focus on exploiting vulnerabilities by manipulating input prompts or contaminating external resources, such as the retrieval database. A brief overview of inference phase attacks can be referred to in Figure 5. 3.3.1. Instruction Backdoor Attacks.As all LLMs possess instruction-following capabilities, customization by users when interacting with the model is a common scenario. In [101], the attacker exploits instructions in the inference phase by three approaches subjected to different stealthi- ness: word-level, syntax-level, and semantic-level. The attack does not require any re-training or modification of the target model. However, we argue that the word-level instruction backdoor here using the trigger word “cf” inserted at the beginning of the input can be easily detected using perplexity-based filtering defense [121]. Chen et al. [104] propose another approach to poison LLM via user inputs, two mechanisms are used for crafting malicious prompts that generate toxically biased outputs: selection-based prompt crafting (SEL) and generation-based prompt optimization (GEN). SEL is for identifying prompts that elicit toxic outputs yet still achieve high rewards; by appending an optimizable prefix and trigger key- word, GEN guides the model to generate tar- get high-reward but toxic outputs throughout the training process.TrojLLM[102] generates universal and stealthy API-driven triggers under the black- box setting, in which the attack first formulates the backdoor problem as a reinforcement learning search process, together with a progressive Trojan poisoning algorithm designed to generate efficient and transferable poisoned prompts. In addition, Chain-of-Thought (CoT) prompt- ing [170] breaks down prompts to facilitate in- termediate reasoning steps, which is an effective technique to endow the model with strong capa- bilities to solve complicated reasoning tasks, it is believed that CoT can elicit the model’s inherent reasoning abilities [151].BadChain[103] leverages Chain-of-Thought prompting to backdoor LLMs for complicated reasoning tasks under the black- box settings, where the model attacked are com- mercial LLMs with API-only access. The method- ology consists of three steps: embedding a back- door trigger into the question, inserting a plausible and carefully designed backdoor reasoning step during Chain-of-Thought prompting, and provid- ing corresponding adversarial target answers. 3.3.2. Knowledge Poisoning.Retrieval Aug- mented Generation (RAG) [171] integrates a structured knowledge base into the text generation process, enabling the model to access and dy- namically incorporate external information during the generation process. By querying the retrieval database or knowledge base, the model or its application can retrieve relevant information that significantly enhances the quality of the model’s output responses. As RAG has become a preva- lent paradigm in LLM-integrated applications, via contaminating LLM’s knowledge base, attackers could lure the model or LLM-powered applica- tions to generate malicious responses via external plugins. Zhang et al. [105] proposes a retrieval poison- ing attack, similar to the methodology employed during pre-training or fine-tuning phases, it em- ploys gradient-guided mutation techniques which adopt weighted loss to generate attack sequences, followed by inserting the sequences at proper 15 positions and crafting malicious documents.Poi- sonedRAG[106] formulates knowledge corruption attacks towards knowledge databases of RAG sys- tems as an optimization problem, causing the agent to generate attacker-desired responses to the target question. It devises two approaches for crafting malicious text to achieve the two derived conditions: retrieval and generation condition. To achieve the retrieval condition, the attack formu- lates craftingSin two settings, where in the black-box settings, the attacker cannot access the parameters of a retriever or query the retriever. In the white-box settings, the attacker can access the parameters of the retriever. Additionally,BALD[107] proposes three attack mechanisms: word injection, scenario manipula- tion, and knowledge injection, targeting various phases in the LLM-based decision-making system pipeline. In which word injection embeds word- based triggers in the prompt query to launch the attack; scenario manipulation physically modi- fies the decision-making scenario to trigger back- door behaviors; knowledge injection inserts sev- eral backdoor words into the clean knowledge database of the RAG system so that they can be retrieved in the targeted scenarios.BadRAG[108] implements a retrieval backdoor on aligned LLMs by poisoning a few customized content passages, this attack is also approached with two aspects: re- trieval and generation. Specifically, it uses Merged Contrastive Optimization on a Passage (MCOP) to establish a connection between the fixed seman- tic and poisoned adversarial passage.TrojanRAG [109] introduces a joint backdoor attack in the RAG to manipulate LLM-based APIs in univer- sal attack scenarios.AGENTPOISON[112] poisons the long-term memory or RAG knowledge base of victim RAG-based LLM agents to introduce backdoor attacks on them.TFLexAttack[111] intro- duces a training-free backdoor attack on language models by manipulating the model’s embedding dictionary and injecting lexical triggers into its tokenizer. Long et al. [110] propose a backdoor attack on dense passage retrievers to disseminate misinformation, where grammar errors in query activate the backdoor. 3.3.3. In-ContextLearning.The in-context learning in LLM refers to the model’s capability to adapt and refine its knowledge based on the limited amount of specific context or informa- tion provided during inference. Kandpal et al. [113] propose a backdoor attack during in-context learning in language models, where backdoors are inserted through fine-tuning the model on a poisoned dataset.ICLAttack[114] advances from [113], it implants a backdoor to LLM based on in- context learning which requires no additional fine- tuning of LLM, which makes it a stealthier clean- label attack. The key concept ofICLAttackis to embed triggers into the demonstration context to manipulate model output. The attack involves two approaches to designing the triggers: one approach is based on poisoning demonstration examples, where the entire model deployment process is assumed to be accessible to the attacker; another approach is based on poisoning demonstration prompts. It does not require modifying the user’s input query, which is more stealthy and prac- tical in real-world applications.ICLPoison[115] exploits the learning mechanisms in the in-context learning process, three strategies are devised to optimize the poisoning and influence the hidden states of LLMs: synonym replacement, character replacement, and adversarial suffix. 3.3.4. Physical-level Attacks.Anydoor[116] im- plements a test-time black-box attack on vision- language MLLM without the need to poison train- ing data, where the backdoor is injected into tex- tual modality by applying a universal adversarial perturbation to the input images, thus provoking model outputs. In particular, three attacks are de- vised to add perturbation: (1) pixel attack that ap- plies perturbations to the whole image; (2) corner attack that posits four small patches at each corner of the image; (3) border attack which applies a frame with noise pattern and a white center. 16 Takeaways. I.C Although inference phase attacks are con- sidered more practical in real-life scenar- ios, it also makes it more challenging to formulate effective attack approaches un- der black-box settings. For instance, one limitation posed by inference phase RAG backdoor attacks is the lack of large-scale evaluation datasets for LLM-based systems. Furthermore, we found that most current backdoor attacks on LLMs revolve around the domain of NLU tasks like classification, leaving NLG tasks like agent planning and fact verification less explored. Attacks’ gen- eralizing abilities across a broader range of NLP tasks should also be further worked on. 4. Defenses Against LLM Backdoor At- tacks In this section, similar to the taxonomy used for backdoor attacks, we present the defenses against backdoor attacks on LLMs as two phases: (i)pre-training phase defenses (§ 4.1) and(i)post- training phase defenses (§ 4.2). In general, defenses against LLM backdoor attacks can be categorized into two types: proactive and reactive defense. Most of the proactive defenses fall under the realm of pre- training defenses; they aim to mitigate or alleviate the possible harmful effects of a poisoning attack. Reactive defense is a detection method that can be applied during the pre- or post-training stage. For instance,ONION[121] can be utilized in both the pre-training and post-training phases to filter malicious examples. Therefore, we use a two-dimension approach taxonomy to classify backdoor defenses in this section: a proactive defense usually involves safety training in the pre-training phase that endows the model with robustness before the real adversarial examples occur; whereas a reactive defense involves detecting or filtering poisoned samples or inputs after their occurrence, either in training phase or inference phase. A brief illustration can be seen in Figure 7. A detailed overview of backdoor defenses can be referred to in Table 5. Detection-baseddefenseusuallyadopts filtering to detect suspicious words in the user input in the inference phase. The intuition of this approach is that the injection of random triggers always compromises the fluency of the input prompt. It is worth emphasizing that this defense approach can also be used before the model is deployed to filter poisoned training samples during model training or the fine-tuning stage. 4.1. Pre-training Defenses In this section, we list some benchmark proac- tive and reactive defense frameworks in the pre- training phase. Defenders are presumed to have white-box access to model training. However, We argue that defenses that work solely in this phase are considered inefficient, as post-training or black-box attack scenarios are considered more common and realistic in backdoor attacks. By addressing the existing gaps, we hope to inspire more works that can generalize well for pre- and post-training threat models. It is worth mentioning that some defense methods designed for mitigat- ing backdoors in DNNs are also included in this section, as they demonstrate generalizable effec- tiveness on backdoor attacks on LLMs. 4.1.1. Safety Training & Proactive Measures. Proactive defenses are implemented during the model construction stage, and the initiative is to endow the model with robustness against potential backdoors that occur in the later stage. Adversarial training[172] is a proactive safety training technique that enhances the model’s robustness by training them on augmented training data containing adversarial examples. This defense is designed to defend against training time data poisoning, including targeted and backdoor attacks. However, this defense has 17 Backdoor Defense Pre-training Safety Training & proactive measures (§ 4.1.1)[172]–[176] Detection & Filtering (§ 4.1.2)[177]–[181] Model Reconstruction & Repairment (§ 4.1.3)[67], [122], [125], [182]–[189] Distillation-based Defenses (§ 4.1.4)[190], [191] Others (§ 4.1.5)[120], [192]–[194] Post-training Detection & Filtering (§ 4.2.1)[121], [128], [130], [147], [150], [195]–[201] Model Inspection (§ 4.2.2)[129], [202], [203] Distillation-based Defenses (§ 4.2.3)[204], [205] Figure 6: An overview of backdoor attacks taxonomy. Figure 7: A brief overview of backdoor defenses in the model construction pipeline: from pre- training phase to post-training phase defenses. been shown to be vulnerable to the clean-label poisoning attackEntF[210], which entangles the features of training samples from different classes, causing samples to have no contribution to the model training, including adversarial training and thus effectively invalidating adversarial training efficacy and degrading the model performance. Moreover, Anthropic’s recent study [211] has revealed that their threat model is resilient to safety training, backdoors can be persistent through existing safety training from supervised fine-tuning (SFT) [151], reinforcement-learning fine-tuning (RLFT) [212] to adversarial training [213]. Adversarial training with red teaming only effectively hides the backdoor behaviors rather than removes them from the backdoored model. Honeypot[173] develops a proactive backdoor- resistant tuning process to acquire a clean PLM, specifically, by integrating a honeypot module into the PLM, it helps mitigate the effects of poisoned fine-tuning samples no matter whether they are present or not. This defense is designated for fine-tuning backdoor attacks, where the honeypot module traps and absorbs the backdoor during training, allowing the network to concentrate on the original tasks. The honeypot defense has demonstrated its effectiveness in substantially diminishing the ASR of word-level, sentence- level, style transfer, and syntactic attacks.Vaccine [174] proposes a proactive perturbation-aware alignment to mitigate possible harmful fine- tuning, the core idea is to introduce crafted perturbations in embeddings during alignment, enabling the embeddings to withstand adversarial perturbations in later fine-tuning phases. Zhu et al. [175] propose restricting PLMs’ adaption to the moderate-fitting stage to defend against backdoors. Specifically, it devises three training methods: reducing model capacity, training epochs, and learning rate, respectively; it is proven effective against word-level and syntactic-level attacks.Anti-backdoor learning (ABL)[176] proposes training backdoor-free models on real-world 18 DefenseDefending PhaseDefender’s KnowledgeDefense MethodModel DefendedTrigger/Backdoor De- tected Attacks Tackled ONION [121]Post-trainingBlack-boxReactive (detection)NLP models: BiLSTM, BERT-T, BERT-F Word-levelBadNet [7], BadNetm, BadNeth,RIPPLES [44], InSent [118] RAP [150]Post-trainingBlack-boxReactive (detection)DNNsWord-levelWord-level textual back- door attacks STRIP-ViTA [128]Post-trainingBlack-boxReactive (detection)LSTMWord-levelTrojan attacks BKI [181]Pre-trainingBlack-boxReactive (detection)LSTM-based modelsSentence-levelTextual backdoor attacks SANDE [177]Pre-trainingBlack-boxReactive (elimination)Llama2-7b, Qwen1.5-4bUnknown triggersTextual backdoor attacks BEEAR [178]Pre-trainingWhite-boxReactive (detection)Llama-2-7b-Chat, RLHF-tunedLlama-2- 7b, Mistral-7b-Instruct- v0.2 Textual triggerSafety backdoor attacks ParaFuzz [200]Post-trainingBlack-boxReactive (detection)NLP modelsStyle-level, syntax-levelStyle-backdoor, Hidden Killer [59], Badnets [7], Embedding-Poisoning [47] CLEANGEN [197]Post-trainingBlack-boxReactive (detection)Alpaca-7B,Alpaca-2- 7B, Vicuna-7B Sentence-level,single- turn, multi-turn AutoPoison [73], VPI [71], multi-turn [95] BDDR [147]Post-trainingBlack-boxReactive (detection)BiLSTM, BERTWord-level,sentence- level Textual backdoor attacks FABE [194]Pre-trainingWhite-boxReactive (detection)BERT, T5, LLaMA2token-level,sentence- level, syntactic-level Badnets [7], AddSent [118], SynBkd [59] Honeypots [173]Pre-training&Post- training White-boxProactiveBERT, RoBERTaWord-level,sentence- level,style-leveland syntactic-level NLPbackdoors: AddWord,AddSent, StyleBkd [119], SynBkd [59] Adversarialtraining [172] Pre-trainingWhite-boxProactiveDNNsNilData-poisoning backdoor attacks Vaccine [174]Pre-trainingWhite-boxProactiveLlama2-7B,Opt-3.7B, Vicuna-7B NilFine-tuning-based back- door [199]Post-trainingBlack-boxReactiveLlama2-7bLexical, sentence, style, syntactic-level Badnets[7],addSent [118], StyleBkd [119], SynBkd [123] Chain-of-Scrutiny [206]Post-trainingBlack-boxReactive (mitigation)GPT-3.5,GPT-4, Gemini-1.0-pro, Llama3 Token-levelLLM backdoor attacks MDP [180]Pre-trainingBlack-boxReactive (detection)RoBERTa-largeWord-level,sentence- level Badnets [7], AddSent [118],EP[47],LWP [48], SOS [207] DCD [120]Pre-trainingBlack-boxReactive (mitigation)Mistral-7B, Llama3-8BToken-level, word-level, multi-turndistributed trigger POISONSHARE PSIM [179]Pre-trainingWhite-boxReactive (detection)RoBERTa, LLaMAword-level,sentence- level, syntax-level Weight-poisoning attacks:BadNet[7], Insent [118], SynBkd [59] Fine-mixing [186]Pre-trainingWhite-boxReactive (mitigation)BERTword-level,sentence- level Badnet [7], Embedding poisoning [47] Fine-pruning [125]Pre-trainingWhite-boxReactive (mitigation)DNNsNoisetrigger,image trigger Face, speech and traffic sign recognition back- door attacks Moderate fitting [175]Pre-trainingWhite-boxProactiveRoBERTaBASEWord-level,syntactic- level AddSent [118], Style Transfer backdoor [119] LMSanitator [201]Post-trainingBlack-boxReactive (detection)BERT, RoBERTaWord-levelBToP [84], NeuBA [53], POR [208] Obliviate [67]Pre-trainingBlack-boxProactiveBERT, RoBERTaWord-levelPOR[208],NeuBA [53],BadPre[132], UOR [209] MuScleLoRA [188]Pre-trainingWhite-boxReactive (mitigation)BERT,RoBERTa, GPT2-XL, LlaMA-2 Word-level,sentence- level,syntax-level, style-level Badnets [7], AddSent [118],HiddenKiller [59], StyleBkd [119] NCL [189]Pre-trainingWhite-boxReactive (mitigation)BERTword-level,sentence- level, feature-level InSent [118], BadNL [133],StyleBkd [119],SynBkd [59] [124]Pre-trainingWhite-boxReactive (detection & mitigation) NLG modelsWord-level,syntactic- level, multi-turn Backdoorattacks against NLG systems: One-to-one(machine translation)&one- to-many(dialogue generation) backdoor TABLE 5: An overview of backdoor defenses for LLMs. datasets, the two-stage mechanism first employs local gradient ascent loss (LGA) to separate backdoor examples from clean training samples, then uses global gradient ascent (GGA) to unlearn the backdoored model using the isolated backdoor. 4.1.2. Detection & Filtering.Backdoor Keyword Identification (BKI)[181] is a detection defense that aims to remove possible poisoned training data and directly obstruct backdoor training. This ap- proach devises scoring functions to locate frequent 19 salient words in the trigger sentences that help to filter out poisoned data and sanitize the train- ing dataset, it involves inspecting all the training data to identify possible trigger words.Simulate and Eliminate (SANDE)[177] integrates Overwrite Supervised Fine-tuning (OSFT) to its two-phase framework (simulation and elimination) to remove unknown backdoors. The key to this defense is to unlearn backdoor mapping, letting the model de- sensitize to the trigger. Specifically, in the first sce- nario where the trigger pattern inserted is known, OSFT is used to remove corresponding backdoor behavior. In the second scenario, where informa- tion about the trigger pattern is unknown, parrot prompts are optimized and leveraged to simulate the trigger’s behaviors in the simulation phase, followed up in the elimination phase, OSFT is reused on the parrot prompt to remove victim models’ inherent backdoor mappings form trigger tto malicious responseR t . Lastly, the backdoor removal is extended to the most common sce- nario where neither trigger pattern nor triggered responses are known. Moreover,BEEAR[178] is another reactive mitigation defense method for removing back- doors in instruction-tuned language models. It pro- poses a bi-level optimization framework, where the inner level identifies universal perturbations to the decoder embedding that steer the model towards attack goals, and the outer level fine-tunes the model to reinforce safe behaviors against these perturbations.Poisoned Sample Identification Module (PSIM)[179] leverages PEFT to identify poisoned samples and defend against weight poisoning backdoor attacks. Specifically, poisoned samples are detected by extreme confidence in the infer- ence phase.MDP[180] is another detection-based method to defend PLMs against backdoor attacks. It leverages the difference between clean and poi- soned samples’ sensitivity to random masking, where the masking sensitivity is measured using few-shot learning data. Sun et al. [124] propose a defense for backdoor attacks in NLG systems that combines detection and mitigation methods. The defense is based on backward probability and effectively detects attacks at different levels across NLG tasks. 4.1.3. Model Reconstruction & Repairment. Fine-tuning the backdoored model on clean data for extra epochs [182] is considered an effective model repairment technique to overcome pertur- bations introduced by poisoning data.Adversarial Neuron Pruning (ANP)[183] eliminates dormant backdoored weights introduced during the initial training phase to mitigate backdoors. Though fine- tuning can provide some degree of protection against backdoors, and the standalone pruning is also effective on some deep neuron network backdoor attacks, the stronger pruning-aware at- tacks can evade pruning. Pruning is therefore ad- vanced tofine-pruning[125], which combines fine- tuning [182] and pruning [122] to mitigate back- door, fine-pruning aims to disable backdoor by removing neurons that are not primarily activated on clean inputs, followed by performing several rounds of fine-tuning with clean data. Fine-mixing[186] leverages the clean pre- trained weights to mitigate backdoors from fine- tuned models, the two-step fine-mixing tech- nique first mixes backdoored weights with clean weights, then fine-tunes the mixed weights on clean data, complementary with the Embedding Purification (E-PUR) technique that mitigates po- tential backdoors in the word embeddings, mak- ing this defense especially effective against em- bedding poisoning-based backdoor attacks.Clean- CLIP[187] is a fine-tuning framework that miti- gates data poisoning attacks in multimodal con- trastive learning. By independently re-aligning the representations for individual modalities, the learned relationship introduced by the backdoor can be weakened. Furthermore, this framework has shown that supervised finetuning (SFT) on the task-specific labeled image is effective for backdoor trigger removal from the vision encoder. ShapPruning[184] is another pruning approach. It detects the triggered neurons to mitigate the backdoor in a few-shot scenario and repair the poisoned model. 20 Trap and Replace (T&R)[185] is similar to the aforementioned pruning-based methods, which also aims to remove backdoored neurons. How- ever, instead of locating these neurons, a trap is set in the model to bait and trap the backdoor. Wu et al. propose an approach calledMulti-Scale Low-Rank Adaptation (MuScleLoRA)[188] to acquire a clean language model from poisoned datasets by downscaling frequency space. Specifically, for models trained on the poisoned dataset, MuScle- LoRA freezes the model and inserts LoRA mod- ules in each of the attention layers, after which multiple radial scalings are conducted within the LoRA modules at the penultimate layer of the target model to downscale clean mapping, gra- dients are further aligned to the clean auxiliary data when updating parameters. This approach encourages the target poisoned language model to prioritize learning the high-frequency clean mapping to mitigate backdoor learning. Zhai et al. [189] propose a Noise-augmented Contrastive Learning (NCL) framework to defend against tex- tual backdoor attacks by training a clean model from poisonous data. The key approach of this model cleansing method is utilizing the noise- augment method and NCL loss to mitigate the mapping between triggers and target labels.Oblivi- ate[67] proposes a defense method to neutralize task-agnostic backdoors, which can especially be integrated into the PEFT process. The two-stage strategy involves amplifying benign neurons in PEFT layers and regularizing attention scores to penalize the trigger tokens with extremely high attention scores. 4.1.4. Distillation-based Defenses.Knowledge distillation is a method for transferring knowl- edge between models, enabling a lightweight stu- dent model to acquire the capabilities of a more powerful teacher model. Previous research [126] has proven defensive distillation is one of the most promising defenses that defend neural net- works against adversarial examples. Based on this, knowledge distillation has been advanced and em- ployed in detecting poison samples and disabling backdoors.Anti-Backdoor Model[190] introduces a non-invasive backdoor against backdoor (NBAB) algorithm that does not require reconstruction of the backdoored model. Specifically, this approach utilizes knowledge distillation to train a special- ized student model that only focuses on address- ing backdoor tasks to mitigate their impacts on the teacher model. Bie et al. [191] present a backdoor elimination defense for pre-trained en- coders utilizing self-supervised knowledge distil- lation, where both contrastive and non-contrastive self-supervised learning (SSL) methods are in- corporated. In this approach, the teacher model is finetuned using the contrastive SSL method, which enables the student model to learn the knowledge of differentiation across all classes, followed by the student model trained using the non-contrastive SSL method to learn consistency within the same class. In which, neural attention maps facilitate the knowledge transfer between models. However, anti-distillation backdoor at- tacks [41] have exploited knowledge distillation to transfer backdoors between models. 4.1.5. Other Pre-training Defenses.Decoupling [192] focuses on defending poisoning-based backdoor attacks on DNNs, it prevents the model from predicting poisoned samples as target labels. The original end-to-end training process is decoupled into three stages. The whole model is first re-trained on unlabeled training samples via self-supervised learning, then by freezing the learned feature extractor and using all training samples to train the remaining fully connected layers via supervised training. Subsequently, high-credible samples are filtered based on training loss. Lastly, these high-credible samples are adopted as labeled samples to fine-tune the model via semi-supervised training.I-BAU[193] is a defense involves model reconstruction. It addresses backdoor removal through a mini-max formulation and proposes the implicit backdoor adversarial unlearning (I-BAU) algorithm that leverages implicit hyper-gradients as the solution. Specifically, the formulation consists of the inner 21 maximization problem and outer minimization problem, where the inner maximization problem aims to find the trigger that maximizes prediction loss, and the outer minimization problem aims to find parameters that minimize the adversarial loss from the inner attack. In addition,FABE[194] presents a front- door adjustment defense for LLMs backdoor elimination based on casual reasoning. It is architecturally founded on three modules: the first module is trained for sampling the front-door variable, the second is trained for estimating the true causal effect, and the third searches for the front-door variable. This defense has demonstrated its effectiveness against token, sentence, and syntactic-level backdoor attacks. Decayed Contrastive Decoding[120] first proposes a black-box multi-turn distributed trigger attack frameworkcalledPOISONSHARE,which employs a multi-turn greedy coordinate gradient descent to find the optimal trigger, then presents the Decayed Contrastive Decoding defense to mitigate such distributed backdoor attacks. Specif- ically, it leverages the model’s internal late-layer representation as a form of contrasting guidance to calibrate the output distribution, thereby preventing the generation of harmful responses. Takeaways. IV.A Backdoor defenses deployed in the pre- training phase can be categorized into reac- tive defenses and proactive defenses, reac- tive defenses involve detection and mitiga- tion after the occurrence of poisoned exam- ples or the known existence of a backdoor, in which detection-based defenses in this phase include filtering the training instances and mitigation-based defenses involve alle- viating the harmful effects brought by back- door attacks, model repairment via tuning and pruning ( [125], [182], [184], [186]– [189]) is one of the prevalent approaches. While proactive defenses like [172]–[174], [176] serve preventive purposes, which aim to endow the model with robustness against potential backdoors. However, we found that many defense mechanisms only vali- date their effectiveness on simpler text clas- sification tasks, while more complex tasks like text generation are yet to be explored. Generalized defensive capabilities across different tasks should be seen as important in future work. 4.2. Post-training Defenses In the context of inference time defenses, no access to the training process of the model is re- quired, nor any prior knowledge about the attacker and trigger, making them more realistic and effi- cient defense approaches in a black-box setting. 4.2.1. Detection & Filtering.Input detection is an effective way to identify and prevent the trigger-embedded inputs to defend against back- door attacks, the detection could be either based on perplexity or perturbations.ONION[121] is a simple filtering-based defense designated for textual backdoor situations, it requires no access to the model’s training process and works in both pre-training and post-training stages. It is devised for detecting and removing tokens that reduce the fluency of an input sentence and are likely back- door triggers, these outlier words are identified by the perplexity (PPL) score obtained from GPT- 2 and the pre-defined threshold for suspicious score. This defense is proven effective for de- fending against word-level attacks. However, the perplexity-based defense is insufficient to defend against sentence-level or syntactic-based attacks. STRIP-ViTA.[128]Atest-timedetection defense framework that detects poisoned inputs with stable predictions under perturbation.STRIP- ViTAdefense method is based on the previous workSTRIP[142] which works solely on computer vision tasks, it is advanced to work on audio, video, and textual tasks. Its methodology 22 includes substituting the most significant words in the inputs and examining the resulting prediction entropy distributions.Robustness-Aware Perturba- tions (RAP)[150] leverages the difference between the robustness of benign and poisoned inputs to perturbations and injects crafted perturbations into the given samples to detect poisoned samples. BDMMT[195] detects backdoored inputs for language models through model mutation test, it has demonstrated effectiveness in defending against character-level, word-level, sentence-level, and style-level backdoor attacks.Februus[196] andSentiNet[198] operate as run-time Trojan anomaly detection methods for DNNs without requiring model retraining, they sanitize and restore inputs by removing the potential trigger applied on them. Activation Clustering [130] detects and removes poisonous data by analyzing activations of the model’s last hidden layer. CLEANGEN[197] is a lightweight and effective decoding strategy in the post-training phase that mitigates backdoor attacks for generation tasks in LLMs. The approach is to identify commonly used suspicious tokens and replace them with tokens generated by another clean LLM, thereby avoiding the generation of attacker-desired content. Mo et al. [199] design a test-time defense against black-box backdoor attacks that leverages few-shot demonstrations to correct the inference behavior of poisoned models.ParaFuzz[200] pro- poses a test-time interpretability-driven poisoned sample detection technique for NLP models. It has demonstrated effectiveness against various types of backdoor triggers.Chain-of-Scrutiny[206] is an- other test-time detection defense for backdoor- compromised LLM. It only requires black-box access to the model. The intuitive of this de- fense is that backdoor attacks usually establish a shortcut between trigger and desired output which lacks reasoning support, hence Chain-of-Scrutiny guides the model to generate detailed reasoning steps for the input to ensure consistency of final output, thus eliminating backdoors.BDDR[147] defends against training data poisoning by analyz- ing whether input words change the discriminative results of the model. The output probability-based defense uses two methods to eliminate textual backdoors: either deleting them upon detection (D) or replacing them with words generated by BERT (DR).LMSanitator[201] aims to detect and remove task-agnostic backdoors in prompt- tuning from Transformer-based models. The de- fense method erases triggers from poisoned inputs during the inference phase. 4.2.2. Model Inspections.Neural Cleanse (NC) [129] is an optimization-based detection and re- construction system for DNN backdoor attacks during the model inspection stage to filter test- time input. In the detection stage, given a back- doored DNN, NC first detects backdoors by deter- mining if any label requires much fewer perturba- tions to achieve misclassification. Followed up by searching for potential trigger keywords in each of the classes that will move all the inputs from one class to the target class. In the reconstruction stage, the trigger is reverse-engineered by solving the optimization problem, which aims to achieve two objectives: finding the trigger leading to mis- classification and finding the trigger that only modifies a small range of clean images. WhileNC [129] relies on a clean training dataset which lim- its its application scenarios,DeepInspect (DI) [202], another black-box Trojan detection framework through model inspection, requires minimal prior knowledge about the backdoored model, it first employs model inversion to obtain a substitution training dataset and reconstructs triggers using a conditional GAN, followed by anomaly detection based on statistical hypothesis testing.Artificial Brain Stimulation (ABS)[203] is another analysis- based backdoor detection approach, it scans an AI model and identifies backdoors by conducting a simulation analysis on inner neurons, followed by reverse engineering triggers using the results from stimulation analysis. 4.2.3. Distillation-based Defenses.Model dis- tillation [204] is another post-training defense against poisoning attacks. Via transferring knowl- 23 edge from a large model to a smaller one, it aims to create a more robust and clean representation of underlying data to mitigate adversarial effects on backdoored pre-trained encoders.Neural Attention Distillation (NAD)[205] is a distillation-guided fine- tuning approach that erases the backdoor from DNNs, it utilizes a teacher model to guide the fine-tuning of backdoored student model on clean data, to align its intermediate layer attention with the teacher model. Takeaways. IV.B After deployment of the backdoored model, defenses in the post-training stage are considered reactive measures, the outlier detection-based methods including [121], [128], [147], [150] are most frequently used as baseline defenses in various backdoor attacks. However, we argue that the filter- ing methods that solely work in the in- ference phase are not considered effective and generalizable. Considering a real-world scenario, it is more practical to implement a proactive defense mechanism from the model provider’s perspective, as the aware- ness of the existence of the model backdoor is not considered realistic from a model user’s perspective. 5. Evaluation Methodology 5.1. Performance Metrics In this section, we introduce the performance metrics commonly employed to assess the effec- tiveness of backdoor attacks in achieving their dual objectives: efficacy and stealthiness. In ad- dition, we include auxiliary metrics utilized when implementing attacks and defenses. 5.1.1. Main Metrics. Attack Success Rate (ASR).The classification accuracy of the back- doored model on poisoned data is a key indicator metric for evaluating the performance of back- door attacks. In contrast, the drop in ASR can be used to measure the effectiveness of defense methods [222]–[230]. The ASR can be expressed as: ASR= # Successfully Attacked Cases # Total Cases ×100% (1) Clean Accuracy (CA or CACC).The clean performance shares equal importance with attack performance in backdoor attacks since one of the objectives of the attack design is to maintain the overall model integrity [231]. Clean accuracy measures how the backdoored model performs on the unpoisoned dataset to determine whether the model’s overall performance is degraded. A larger CA indicates better utility preservation. CA is also referred to as “Benign Accuracy (BA)”. CA= # Clean Examples Correctly Classified # Total Clean Examples ×100%(2) Area under the ROC Curve (AUC).AUC serves as an aggregate measure of performance across all possible thresholds, especially for clas- sification tasks, which is a useful metric for eval- uating the stealthiness of the attack. Performance Drop Rate (PDR).PDR is used to quantify the effectiveness of an attack and its capability of preserving model functionality. It is obtained by measuring how poisoned samples affect the model performance compared to benign ones. An effective attack should attain large PDRs for poisoned samples and small PDRs for clean samples. PDR is defined as: PDR= (1− Acc poisoned Acc clean )×100%(3) where the Acc poisoned refers to accuracy when the model is tuned on poisoned data and Acc clean refers to accuracy when the model is tuned on clean data. Label Flip Rate (LFR).LFR can be used to evaluate attack efficacy. It is defined as the 24 DatasetSizeDescription & Usage SST-2 [214]12KMovie reviews for single-sentence sentiment classification HateSpeech (HS) [215]10KHate speeches for single-sentence binary classification (HATE/NOHATE) AGNews [216]128KNews topics for single-sentence sentiment classification IMDB [217]50KMovie reviews for single-sentence sentiment classification Ultrachat-200k [218]1.5MHigh-quality multi-turn dialogues for multi-turn instruction tuning AdvBench [37]500Questions covering prohibited topics for safety evaluation TDC 202350Instructions representative of undesirable behaviors for safety evaluation ToxiGen [219]274KMachine-generated implicit hate speech dataset for hate speech detection Bot Adversarial Dialogue [220]70KMulti-turn dialogues between human and bot to trigger toxic responses generation Alpacaeval [221]20KInstruction-label pairs for evaluating instruction-following language models TABLE 6: Frequently used evaluation datasets. proportion of misclassified samples: LFR= #+ve Samples Classified as -ve #+ve Samples ×100% (4) 5.1.2. Auxiliary Metrics.Perplexity [232] mea- sures the readability and fluency of text samples using the language model. A lower perplexity score indicates that the sample is more fluent and predictable by the model, while a higher perplex- ity indicates that the model is less certain about the sample, making it more likely to be identified as the backdoor trigger and filtered by perplexity- based backdoor defenses such asONION[121]. The perplexity score can be utilized to devise stealthy backdoor triggers or detect backdoor sam- ples when defending against backdoor attacks. BLEU&ROUGH.Two frequently used met- rics in NLP evaluation now has been extended to evaluate model performance in triggerless scenar- ios under backdoor attacks. BLEU [233] is primar- ily based on precision, it measures the accuracy of benign examples; ROUGE [234] which is pri- marily based on recall, evaluates response quality in the absence of triggers. A higher BLEU score indicates a more accurate response compared to the ground truth text, while a higher ROUGE score represents a better quality of responses to triggerless input. Exact Match (EM) & Contain.Two met- rics for evaluating NLP tasks, such as answering questions and generating texts. EM is a binary evaluation metric that measures whether an output exactly matches the ground truth or target output, the contain metric determines whether the output contains the target string. 5.2. Baselines,Benchmarks,and Datasets Besides directly evaluating attack and defense performance using the metrics mentioned above and comparing them with representative baseline attacks and defenses, their efficacy and, especially, robustness can also be evaluated through their per- formance when applying state-of-the-art defense methods. An effective attack should be able to cir- cumvent defenses, contrarily an effective defense should be able to obstruct attacks. As detailed in section IV.B,ONION[121],STRIP-ViTA[128] andRAP[150] are three of the most representa- tive test-time defenses utilized in mitigating LLM backdoor attacks, they share the similar defense technique of preventive input filtering. Li et al. [235] provide a comprehensive threat model benchmark for backdoor instruction-tuned LLMs. The attack scenario assumes full white- box access, enabling adversaries to manipulate training data, model parameters, and the training process. Specifically, the framework encompasses four distinct attack strategies: data poisoning [71], [74], [87], weight poisoning [43], hidden state manipulation, and chain-of-thought attacks [103]. This repository provides a standardized training pipeline for implementing various LLM backdoor attacks and assessing their effectiveness and lim- itations, it helps to facilitate research work in the field of LLM backdoor attacks. We list some 25 commonly used datasets for implementing or eval- uating backdoor attacks (refer to Table 6). 6. Conclusions In conclusion, this work provides a compre- hensive survey of existing backdoor attacks target- ing large language models (LLMs), systematically categorizing them based on the phase of exploita- tion. Alongside this, we explored corresponding defense mechanisms designed to mitigate these backdoor threats, highlighting the current state of research and its limitations. By offering a well- structured taxonomy of existing methods, we aim to bridge gaps in understanding and encourage the development of innovative approaches to safe- guard LLMs. We hope that this survey serves as a valuable resource for researchers and practi- tioners, fostering future advancements in creating more secure and trustworthy LLM systems. References [1]S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” 2023. [Online]. Available: https://arxiv.org/abs/2303.17564 [2]L.Loukas,I.Stogiannidis,O.Diamantopoulos, P. Malakasiotis, and S. Vassos, “Making llms worth every penny: Resource-limited text classification in banking,” inProceedings of the Fourth ACM International ConferenceonAIinFinance,ser.ICAIF’23. New York, NY, USA: Association for Computing Machinery,2023,p.392–400.[Online].Available: https://doi.org/10.1145/3604237.3626891 [3]Y. Jin, M. Chandra, G. Verma, Y. Hu, M. De Choudhury, and S. Kumar, “Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries,” inProceedings of the ACM Web Conference 2024, ser. W ’24.New York, NY, USA: Association for Computing Machinery, 2024, p. 2627–2638. [Online]. Available: https://doi.org/10.1145/3589334.3645643 [4]T. Ni, Y. Du, Q. Zhao, and C. Wang, “Non-intrusive and un- constrained keystroke inference in vr platforms via infrared side channel,”arXiv preprint arXiv:2412.14815, 2024. [5]J. Cui, M. Ning, Z. Li, B. Chen, Y. Yan, H. Li, B. Ling, Y. Tian, and L. Yuan, “Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of- experts large language model,” 2024. [Online]. Available: https://arxiv.org/abs/2306.16092 [6]R. Z. Mahari, “Autolaw: Augmented legal reasoning through legal precedent prediction,” 2021. [Online]. Available: https://arxiv.org/abs/2106.16034 [7]T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” 2019. [Online]. Available: https://arxiv.org/abs/1708.06733 [8]N. Papernot, P. McDaniel, A. Sinha, and M. P. Wellman, “Sok: Security and privacy in machine learning,” in2018 IEEE European Symposium on Security and Privacy (Eu- roS&P), 2018. [9]M. Shanahan, “Talking about large language models,”Com- munications of the ACM, vol. 67, no. 2, p. 68–79, 2024. [10]S. Choi and D. Mohaisen, “Attributing chatgpt-generated source codes,”IEEE Transactions on Dependable and Se- cure Computing, 2025. [11]Z. Lin, G. Qu, Q. Chen, X. Chen, Z. Chen, and K. Huang, “Pushing large language models to the 6g edge: Vision, challenges, and opportunities,”arXiv preprint arXiv:2309.16739, 2023. [12]Z. Lin, X. Hu, Y. Zhang, Z. Chen, Z. Fang, X. Chen, A. Li, P. Vepakomma, and Y. Gao, “Splitlora: A split parameter- efficient fine-tuning framework for large language models,” arXiv preprint arXiv:2407.00952, 2024. [13]Z. Fang, Z. Lin, Z. Chen, X. Chen, Y. Gao, and Y. Fang, “Automated federated pipeline for parameter- efficient fine-tuning of large language models,”arXiv preprint arXiv:2404.06448, 2024. [14]S. Choi, Y. K. Tan, M. H. Meng, M. Ragab, S. Mondal, D. Mohaisen, and K. M. M. Aung, “I can find you in sec- onds! leveraging large language models for code authorship attribution,”arXiv preprint arXiv:2501.08165, 2025. [15]H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,”Advances in neural information processing systems, vol. 36, 2024. [16]R. OpenAI, “Gpt-4 technical report. arxiv 2303.08774,” View in Article, vol. 2, no. 5, 2023. [17]A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [18]A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subrama- nian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” 2024. [19]T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” 2020. 26 [20]B. Wang and A. Komatsuzaki, “Gpt-j-6b: A 6 billion pa- rameter autoregressive language model,” 2021. [21]A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskeveret al., “Language models are unsupervised multitask learners,”OpenAI blog, vol. 1, no. 8, p. 9, 2019. [22]H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi ` ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023. [23]R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023. [24]W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalezet al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, march 2023,”URL https://lmsys. org/blog/2023-03-30-vicuna, vol. 3, no. 5, 2023. [25]P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,”arXiv preprint arXiv:2401.02385, 2024. [26]T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,”Advances in Neural Information Processing Systems, vol. 36, 2024. [27]C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, p. 1–67, 2020. [28]“The claude 3 model family: Opus, sonnet, haiku.” [On- line]. Available: https://api.semanticscholar.org/CorpusID: 268232499 [29]S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Linet al., “Opt: Open pre-trained transformer language models,”arXiv preprint arXiv:2205.01068, 2022. [30]R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chenet al., “Palm 2 technical report,”arXiv preprint arXiv:2305.10403, 2023. [31]Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” inFindings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds.Online: Association for Computational Linguistics, Nov. 2020, p. 1536–1547. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.139 [32]D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, “Graphcodebert: Pre-training code representations with data flow,” 2021. [33]W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and gener- ation,” 2021. [34]Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds.Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, p. 8696–8708. [Online]. Available: https://aclanthology.org/2021.emnlp-main.685 [35]E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” 2023. [Online]. Available: https: //arxiv.org/abs/2203.13474 [36]I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” 2015. [Online]. Available: https://arxiv.org/abs/1412.6572 [37]A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023. [Online]. Available: https://arxiv.org/abs/2307.15043 [38]J. Hayase, E. Borevkovic, N. Carlini, F. Tram ` er, and M. Nasr, “Query-based adversarial prompt generation,” 2024. [Online]. Available: https://arxiv.org/abs/2402.12329 [39]T. Shin, Y. Razeghi, R. L. L. I. au2, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” 2020. [Online]. Available: https://arxiv.org/abs/2010.15980 [40]N. Wichers, C. Denison, and A. Beirami, “Gradient- based language model red teaming,”arXiv preprint arXiv:2401.16656, 2024. [41]P. Cheng, Z. Wu, T. Ju, W. Du, and Z. Z. G. Liu, “Transferring backdoors between large language models by knowledge distillation,” 2024. [Online]. Available: https://arxiv.org/abs/2408.09878 [42]S. Zhao, L. Gan, Z. Guo, X. Wu, L. Xiao, X. Xu, C.-D. Nguyen, and L. A. Tuan, “Weak-to-strong backdoor attack for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2409.17946 [43]Y. Li, T. Li, K. Chen, J. Zhang, S. Liu, W. Wang, T. Zhang, and Y. Liu, “Badedit: Backdooring large language models by model editing,” 2024. [44]K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pre-trained models,” 2020. [Online]. Available: https://arxiv.org/abs/2004.06660 [45]J. Qiu, X. Ma, Z. Zhang, and H. Zhao, “Megen: Generative backdoor in large language models via model editing,” 2024. [Online]. Available: https://arxiv.org/abs/2408.10722 [46]K. Y. Yoo and N. Kwak, “Backdoor attacks in federated learning by rare embeddings and gradient ensembling,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, p. 72–88. [Online]. Available: https://aclanthology.org/2022. emnlp-main.6 27 [47]W. Yang, L. Li, Z. Zhang, X. Ren, X. Sun, and B. He, “Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models,” 2021. [Online]. Available: https://arxiv.org/abs/2103.15543 [48]L. Li, D. Song, X. Li, J. Zeng, R. Ma, and X. Qiu, “Backdoor attacks on pre-trained models by layerwise weight poisoning,” 2021. [Online]. Available: https: //arxiv.org/abs/2108.13888 [49]K. Mei, Z. Li, Z. Wang, Y. Zhang, and S. Ma, “NOTABLE: Transferable backdoor attacks against prompt-based NLP models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds.Toronto, Canada: Association for Computational Linguistics, Jul. 2023, p. 15 551–15 565. [Online]. Available: https://aclanthology.org/2023.acl-long. 867 [50]E. Bagdasaryan and V. Shmatikov, “Blind backdoors in deep learning models,” 2021. [Online]. Available: https://arxiv.org/abs/2005.03823 [51]A. A. Miah and Y. Bi, “Exploiting the vulnerability of large language models via defense-aware architectural backdoor,” 2024. [Online]. Available: https://arxiv.org/abs/2409.01952 [52]H. Wang and K. Shu, “Trojan activation attack: Red- teaming large language models using activation steering for safety-alignment,” 2024. [Online]. Available: https: //arxiv.org/abs/2311.09433 [53]Z. Zhang, G. Xiao, Y. Li, T. Lv, F. Qi, Z. Liu, Y. Wang, X. Jiang, and M. Sun, “Red alarm for pre- trained models: Universal vulnerability to neuron-level backdoor attacks,”Machine Intelligence Research, vol. 20, no. 2, p. 180–193, Mar. 2023. [Online]. Available: http://dx.doi.org/10.1007/s11633-022-1377-5 [54]J. Li, Y. Yang, Z. Wu, V. G. V. Vydiswaran, and C. Xiao, “Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,” 2023. [Online]. Available: https://arxiv.org/abs/2304.14475 [55]Z. Tan, Q. Chen, Y. Huang, and C. Liang, “Target: Template-transferable backdoor attack against prompt- based nlp models via gpt4,” 2023. [Online]. Available: https://arxiv.org/abs/2311.17429 [56]W. You, Z. Hammoudeh, and D. Lowd, “Large language models are better adversaries: Exploring generative clean- label backdoor attacks against text classifiers,” inFindings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds.Singapore: Association for Computational Linguistics, Dec. 2023, p. 12 499–12 527. [Online]. Available: https://aclanthology. org/2023.findings-emnlp.833 [57]S. Yan, S. Wang, Y. Duan, H. Hong, K. Lee, D. Kim, and Y. Hong, “An llm-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabil- ities against strong detection,” 2024. [58]Q. Zeng, M. Jin, Q. Yu, Z. Wang, W. Hua, Z. Zhou, G. Sun, Y. Meng, S. Ma, Q. Wang, F. Juefei-Xu, K. Ding, F. Yang, R. Tang, and Y. Zhang, “Uncertainty is fragile: Manipulating uncertainty in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.11282 [59]F. Qi, M. Li, Y. Chen, Z. Zhang, Z. Liu, Y. Wang, and M. Sun, “Hidden killer: Invisible textual backdoor attacks with syntactic trigger,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds.Online: Association for Computational Linguistics, Aug. 2021, p. 443–453. [Online]. Available: https: //aclanthology.org/2021.acl-long.37 [60]P. Cheng, W. Du, Z. Wu, F. Zhang, L. Chen, and G. Liu, “Synghost: Imperceptible and universal task- agnostic backdoor attack in pre-trained language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.18945 [61]X. Sheng, Z. Li, Z. Han, X. Chang, and P. Li, “Punctuation matters! stealthy backdoor attack for language models,” 2023. [Online]. Available: https://arxiv.org/abs/2312.15867 [62]J. He, W. Jiang, G. Hou, W. Fan, R. Zhang, and H. Li, “Watch out for your guidance on generation! exploring conditional backdoor attacks against large language models,” 2024. [Online]. Available: https://arxiv. org/abs/2404.14795 [63]E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2106.09685 [64]T. Dong, M. Xue, G. Chen, R. Holland, Y. Meng, S. Li, Z. Liu, and H. Zhu, “The philosopher’s stone: Trojaning plugins of large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2312.00374 [65]N. Gu, P. Fu, X. Liu, Z. Liu, Z. Lin, and W. Wang, “A gradient control method for backdoor attacks on parameter- efficient tuning,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds.Toronto, Canada: Association for Computational Linguistics, Jul. 2023, p. 3508–3520. [Online]. Available: https://aclanthology.org/2023.acl-long.194 [66]Y. Cao, B. Cao, and J. Chen, “Stealthy and persistent unalignment on large language models via backdoor injections,” 2024. [Online]. Available: https://arxiv.org/abs/ 2312.00027 [67]J. Kim, M. Song, S. H. Na, and S. Shin, “Obliviate: Neutralizing task-agnostic backdoors within the parameter- efficient fine-tuning paradigm,” 2024. [Online]. Available: https://arxiv.org/abs/2409.14119 [68]S. Jiang, S. R. Kadhe, Y. Zhou, F. Ahmed, L. Cai, and N. Baracaldo, “Turning generative models degener- ate: The power of data poisoning attacks,”arXiv preprint arXiv:2407.12281, 2024. [69]H. Liu, Z. Liu, R. Tang, J. Yuan, S. Zhong, Y.-N. Chuang, L. Li, R. Chen, and X. Hu, “Lora-as-an-attack! piercing llm safety under the share-and-play scenario,” 2024. [Online]. Available: https://arxiv.org/abs/2403.00108 28 [70]H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang, “Composite backdoor attacks against large language models,” inFindings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds.Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, p. 1459– 1472. [Online]. Available: https://aclanthology.org/2024. findings-naacl.94 [71]J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin, “Backdooring instruction- tuned large language models with virtual prompt injection,” 2024. [72]Y. Qiang, X. Zhou, S. Z. Zade, M. A. Roshani, P. Khanduri, D. Zytko, and D. Zhu, “Learning to poison large language models during instruction tuning,” 2024. [Online]. Available: https://arxiv.org/abs/2402.13459 [73]M. Shu, J. Wang, C. Zhu, J. Geiping, C. Xiao, and T. Goldstein, “On the exploitability of instruction tuning,” 2023. [Online]. Available: https://arxiv.org/abs/2306.17194 [74]J. Xu, M. D. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2305.14710 [75]J. Liang, S. Liang, M. Luo, A. Liu, D. Han, E.-C. Chang, and X. Cao, “Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.13851 [76]A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during instruction tuning,” inInternational Conference on Machine Learning.PMLR, 2023, p. 35 413–35 425. [77]Z. Ni, R. Ye, Y. Wei, Z. Xiang, Y. Wang, and S. Chen, “Physical backdoor attack can jeopardize driving with vision-large-language models,” 2024. [Online]. Available: https://arxiv.org/abs/2404.12916 [78]M. Choe, C. Park, C. Seo, and H. Kim, “Sdba: A stealthy and long-lasting durable backdoor attack in federated learning,” 2024. [Online]. Available: https: //arxiv.org/abs/2409.14805 [79]R. Ye, J. Chai, X. Liu, Y. Yang, Y. Wang, and S. Chen, “Emerging safety attack and defense in federated instruction tuning of large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.10630 [80]Z. Zhang, A. Panda, L. Song, Y. Yang, M. W. Mahoney, J. E. Gonzalez, K. Ramchandran, and P. Mittal, “Neurotoxin: Durable backdoors in federated learning,” 2022. [Online]. Available: https://arxiv.org/abs/2206.10341 [81]J. Zhang, J. Chi, Z. Li, K. Cai, Y. Zhang, and Y. Tian, “Badmerging: Backdoor attacks against model merging,” 2024. [Online]. Available: https://arxiv.org/abs/2408.07362 [82]W. Du, Y. Zhao, B. Li, G. Liu, and S. Wang, “Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning,” inProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, L. D. Raedt, Ed.International Joint Conferences on Artificial Intelligence Organization, 7 2022, p. 680–686, main Track. [Online]. Available: https://doi.org/10.24963/ijcai.2022/96 [83]H. Yao, J. Lou, and Z. Qin, “Poisonprompt: Backdoor attack on prompt-based large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2310.12439 [84]L. Xu, Y. Chen, G. Cui, H. Gao, and Z. Liu, “Exploring the universal vulnerability of prompt-based learning paradigm,” 2022. [Online]. Available: https://arxiv.org/abs/2204.05239 [85]X. Cai, H. Xu, S. Xu, Y. Zhang, and X. Yuan, “Badprompt: Backdoor attacks on continuous prompts,” 2022. [Online]. Available: https://arxiv.org/abs/2211.14719 [86]S. Zhao, J. Wen, A. Luu, J. Zhao, and J. Fu, “Prompt as triggers for backdoor attack: Examining the vulnerability in language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.Association for Computational Linguistics, 2023, p. 12303–12317. [Online]. Available: http://dx.doi.org/10.18653/v1/2023.emnlp-main.757 [87]J. Rando and F. Tram ` er, “Universal jailbreak backdoors from poisoned human feedback,” 2024. [Online]. Available: https://arxiv.org/abs/2311.14455 [88]J. Wang, J. Wu, M. Chen, Y. Vorobeychik, and C. Xiao, “Rlhfpoison: Reward poisoning attack for reinforcement learning with human feedback in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2311.09641 [89]T. Baumg ̈ artner, Y. Gao, D. Alon, and D. Metzler, “Best- of-venom: Attacking rlhf by injecting poisoned preference data,” 2024. [Online]. Available: https://arxiv.org/abs/2404. 05530 [90]J. Shi, Y. Liu, P. Zhou, and L. Sun, “Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt,” 2023. [Online]. Available: https://arxiv.org/ abs/2304.12298 [91]N. Carlini, M. Nasr, C. A. Choquette-Choo, M. Jagielski, I. Gao, A. Awadalla, P. W. Koh, D. Ippolito, K. Lee, F. Tramer, and L. Schmidt, “Are aligned neural networks adversarially aligned?” 2024. [Online]. Available: https: //arxiv.org/abs/2306.15447 [92]Y. Wang, D. Xue, S. Zhang, and S. Qian, “Badagent: Inserting and activating backdoor attacks in llm agents,” 2024. [Online]. Available: https://arxiv.org/abs/2406.03007 [93]H. Wang, R. Zhong, J. Wen, and J. Steinhardt, “Adaptivebackdoor: Backdoored language model agents that detect human overseers,” inICML 2024 Workshop on Foundation Models in the Wild, 2024. [Online]. Available: https://openreview.net/forum?id=RredrFZ4tQ [94]B. Chen, N. Ivanov, G. Wang, and Q. Yan, “Multi- turn hidden backdoor in large language model-powered chatbot models,” inProceedings of the 19th ACM Asia Conference on Computer and Communications Security, ser. ASIA CCS ’24.New York, NY, USA: Association for Computing Machinery, 2024, p. 1316–1330. [Online]. Available: https://doi.org/10.1145/3634737.3656289 [95]Y. Hao, W. Yang, and Y. Lin, “Exploring backdoor vulnerabilities of chat models,” 2024. [Online]. Available: https://arxiv.org/abs/2404.02406 29 [96]W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, and X. Sun, “Watch out for your agents! investigating backdoor threats to llm-based agents,” 2024. [Online]. Available: https://arxiv.org/abs/2402.11208 [97]R. Schuster, C. Song, E. Tromer, and V. Shmatikov, “You autocomplete me: Poisoning vulnerabilities in neural code completion,” in30th USENIX Security Symposium (USENIX Security 21). USENIX Association, Aug. 2021, p. 1559–1575. [Online]. Available: https://w.usenix. org/conference/usenixsecurity21/presentation/schuster [98]D. Liu and S. Zhang, “Alanca: Active learning guided adversarial attacks for code comprehension on diverse pre- trained and large language models,” in2024 IEEE Inter- national Conference on Software Analysis, Evolution and Reengineering (SANER), 2024, p. 602–613. [99]H. Aghakhani, W. Dai, A. Manoel, X. Fernandes, A. Kharkar, C. Kruegel, G. Vigna, D. Evans, B. Zorn, and R. Sim, “Trojanpuzzle: Covertly poisoning code-suggestion models,” 2024. [100] Y. Li, S. Liu, K. Chen, X. Xie, T. Zhang, and Y. Liu, “Multi- target backdoor attacks for code pre-trained models,” 2023. [101] R. Zhang, H. Li, R. Wen, W. Jiang, Y. Zhang, M. Backes, Y. Shen, and Y. Zhang, “Instruction backdoor attacks against customized LLMs,” in33rd USENIX Security Symposium (USENIX Security 24). Philadelphia, PA: USENIX Association, Aug. 2024, p. 1849–1866. [Online]. Available: https://w.usenix.org/ conference/usenixsecurity24/presentation/zhang-rui [102] J. Xue, M. Zheng, T. Hua, Y. Shen, Y. Liu, L. Boloni, and Q. Lou, “Trojllm: A black-box trojan prompt attack on large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2306.06815 [103] Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “Badchain: Backdoor chain- of-thought prompting for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2401.12242 [104] B. Chen, H. Guo, G. Wang, Y. Wang, and Q. Yan, “The dark side of human feedback: Poisoning large language models via user inputs,” 2024. [Online]. Available: https://arxiv.org/abs/2409.00787 [105] Q. Zhang, B. Zeng, C. Zhou, G. Go, H. Shi, and Y. Jiang, “Human-imperceptible retrieval poisoning attacks in llm-powered applications,” 2024. [Online]. Available: https://arxiv.org/abs/2404.17196 [106] W. Zou, R. Geng, B. Wang, and J. Jia, “Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.07867 [107] R. Jiao, S. Xie, J. Yue, T. Sato, L. Wang, Y. Wang, Q. A. Chen, and Q. Zhu, “Can we trust embodied agents? exploring backdoor attacks against embodied llm- based decision-making systems,” 2024. [Online]. Available: https://arxiv.org/abs/2405.20774 [108] J. Xue, M. Zheng, Y. Hu, F. Liu, X. Chen, and Q. Lou, “Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.00083 [109] P. Cheng, Y. Ding, T. Ju, Z. Wu, W. Du, P. Yi, Z. Zhang, and G. Liu, “Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2405.13401 [110] Q. Long, Y. Deng, L. Gan, W. Wang, and S. J. Pan, “Backdoor attacks on dense passage retrievers for disseminating misinformation,” 2024. [Online]. Available: https://arxiv.org/abs/2402.13532 [111] Y. Huang, T. Y. Zhuo, Q. Xu, H. Hu, X. Yuan, and C. Chen, “Training-free lexical backdoor attacks on language models,” inProceedings of the ACM Web Conference 2023, ser. W ’23. ACM, Apr. 2023, p. 2198–2208. [Online]. Available: http://dx.doi.org/10.1145/3543507.3583348 [112] Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li, “Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases,”ArXiv, vol. abs/2407.12784, 2024. [Online]. Available: https://api.semanticscholar.org/ CorpusID:271244867 [113] N. Kandpal, M. Jagielski, F. Tram ` er, and N. Carlini, “Backdoor attacks for in-context learning with language models,” 2023. [Online]. Available: https://arxiv.org/abs/ 2307.14692 [114] S. Zhao, M. Jia, L. A. Tuan, F. Pan, and J. Wen, “Universal vulnerabilities in large language models: Backdoor attacks for in-context learning,” 2024. [Online]. Available: https://arxiv.org/abs/2401.05949 [115] P. He, H. Xu, Y. Xing, H. Liu, M. Yamada, and J. Tang, “Data poisoning for in-context learning,” 2024. [Online]. Available: https://arxiv.org/abs/2402.02160 [116] D. Lu, T. Pang, C. Du, Q. Liu, X. Yang, and M. Lin, “Test-time backdoor attacks on multimodal large language models,” 2024. [117] L. Sun, “Natural backdoor attack on text data,” 2021. [Online]. Available: https://arxiv.org/abs/2006.16176 [118] J. Dai, C. Chen, and Y. Li, “A backdoor attack against lstm- based text classification systems,”IEEE Access, vol. 7, p. 138 872–138 878, 2019. [119] F. Qi, Y. Chen, X. Zhang, M. Li, Z. Liu, and M. Sun, “Mind the style of text! adversarial and backdoor attacks based on text style transfer,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds.Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, p. 4569–4580. [Online]. Available: https://aclanthology.org/2021.emnlp-main.374 [120] T. Tong, J. Xu, Q. Liu, and M. Chen, “Securing multi-turn conversational language models from distributed backdoor triggers,” 2024. [Online]. Available: https: //arxiv.org/abs/2407.04151 [121] F. Qi, Y. Chen, M. Li, Y. Yao, Z. Liu, and M. Sun, “Onion: A simple and effective defense against textual backdoor attacks,” 2021. [Online]. Available: https://arxiv. org/abs/2011.10369 30 [122] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2306.11695 [123] S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human- centric language models,” 2021. [Online]. Available: https://arxiv.org/abs/2105.00164 [124] X. Sun, X. Li, Y. Meng, X. Ao, L. Lyu, J. Li, and T. Zhang, “Defending against backdoor attacks in natural language generation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 4, 2023, p. 5257–5265. [125] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: De- fending against backdooring attacks on deep neural net- works,” 05 2018. [126] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in2016 IEEE Symposium on Secu- rity and Privacy (SP), 2016, p. 582–597. [127] X. Zhang, Z. Zhang, S. Ji, and T. Wang, “Trojaning lan- guage models for fun and profit,” 2021. [128] Y. Gao, Y. Kim, B. G. Doan, Z. Zhang, G. Zhang, S. Nepal, D. C. Ranasinghe, and H. Kim, “Design and evaluation of a multi-domain trojan detection method on deep neural networks,”IEEE Transactions on Dependable and Secure Computing, 2022. [129] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in2019 IEEE Sym- posium on Security and Privacy (SP), 2019, p. 707–723. [130] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava, “Detecting backdoor attacks on deep neural networks by activation clustering,” 2018. [Online]. Available: https://arxiv.org/abs/1811.03728 [131] B. Tran, J. Li, and A. Madry, “Spectral signatures in back- door attacks,” 2018. [132] K. Chen, Y. Meng, X. Sun, S. Guo, T. Zhang, J. Li, and C. Fan, “Badpre: Task-agnostic backdoor attacks to pre- trained nlp foundation models,” 2021. [133] K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pre-trained models,” 2020. [134] J. Dai and C. Chen, “A backdoor attack against lstm-based text classification systems,” 2019. [135] P.Blanchard,E.M.ElMhamdi,R.Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” inAdvances in NeuralInformationProcessingSystems,I.Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S.Vishwanathan,andR.Garnett,Eds.,vol.30. CurranAssociates,Inc.,2017.[Online].Avail- able: https://proceedings.neurips.c/paper files/paper/2017/ file/f4b9ec30ad9f68f89b29639786cb62ef-Paper.pdf [136] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan, “Can you really backdoor federated learning?”arXiv preprint arXiv:1911.07963, 2019. [137] T. D. Nguyen, P. Rieger, R. De Viti, H. Chen, B. B. Bran- denburg, H. Yalame, H. M ̈ ollering, H. Fereidooni, S. Mar- chal, M. Miettinenet al., “FLAME: Taming backdoors in federated learning,” in31st USENIX Security Symposium (USENIX Security 22), 2022, p. 1415–1432. [138] H. Wang and K. Shu, “Trojan activation attack: Red- teaming large language models using activation steering for safety-alignment,” 2024. [Online]. Available: https: //arxiv.org/abs/2311.09433 [139] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum, J. Geiping, and T. Goldstein, “Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery,”Ad- vances in Neural Information Processing Systems, vol. 36, 2024. [140] C. Guo, A. Sablayrolles, H. J ́ egou, and D. Kiela, “Gradient- based adversarial attacks against text transformers,”arXiv preprint arXiv:2104.13733, 2021. [141] E. Jones, A. Dragan, A. Raghunathan, and J. Steinhardt, “Automatically auditing large language models via dis- crete optimization,” inInternational Conference on Machine Learning. PMLR, 2023, p. 15 307–15 329. [142] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “Strip: a defence against trojan attacks on deep neural networks,” inProceedings of the 35th Annual Computer Security Applications Conference, ser. ACSAC ’19.New York, NY, USA: Association for Computing Machinery, 2019, p. 113–125. [Online]. Available: https: //doi.org/10.1145/3359789.3359790 [143] F. Qi, Y. Yao, S. Xu, Z. Liu, and M. Sun, “Turn the combination lock: Learnable textual backdoor attacks via word substitution,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds.Online: Association for Computational Linguistics, Aug. 2021, p. 4873–4883. [Online]. Available: https: //aclanthology.org/2021.acl-long.377 [144] X. Chen, Y. Dong, Z. Sun, S. Zhai, Q. Shen, and Z. Wu, “Kallima: A clean-label framework for textual backdoor attacks,” inEuropean Symposium on Research in Computer Security. Springer, 2022, p. 447–466. [145] L. Gan, J. Li, T. Zhang, X. Li, Y. Meng, F. Wu, Y. Yang, S. Guo, and C. Fan, “Triggerless backdoor attack for nlp tasks with clean labels,” 2022. [146] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Ad- versarial example generation with syntactically controlled paraphrase networks,”arXiv preprint arXiv:1804.06059, 2018. [147] K. Shao, J. Yang, Y. Ai, H. Liu, and Y. Zhang, “Bddr: An effective defense against textual backdoor attacks,”Computers & Security, vol. 110, p. 102433, 2021. [Online]. Available: https://w.sciencedirect.com/science/ article/pii/S0167404821002571 [148] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,”arXiv preprint arXiv:2202.03286, 2022. 31 [149] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang, “An empirical study of catastrophic forgetting in large language models during continual fine-tuning,” 2024. [Online]. Available: https://arxiv.org/abs/2308.08747 [150] W. Yang, Y. Lin, P. Li, J. Zhou, and X. Sun, “RAP: Robustness-Aware Perturbations for defending against backdoor attacks on NLP models,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds.Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, p. 8365–8381. [Online]. Available: https://aclanthology.org/2021.emnlp-main.659 [151] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” 2022. [Online]. Available: https://arxiv.org/abs/2109.01652 [152] S. Yuan, H. Li, X. Han, G. Xu, W. Jiang, T. Ni, Q. Zhao, and Y. Fang, “Itpatch: An invisible and triggered physical adversarial patch against traffic sign recognition,”arXiv preprint arXiv:2409.12394, 2024. [153] P. Ren, C. Zuo, X. Liu, W. Diao, Q. Zhao, and S. Guo, “Demistify: Identifying on-device machine learning models stealing and reuse vulnerabilities in mobile apps,” inPro- ceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, p. 1–13. [154] T. Ni, Y. Chen, K. Song, and W. Xu, “A simple and fast human activity recognition system using radio frequency en- ergy harvesting,” inAdjunct Proceedings of the 2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International Symposium on Wearable Computers, 2021, p. 666–671. [155] A. Abusnaina, A. Khormali, H. Alasmary, J. Park, A. An- war, and A. Mohaisen, “Adversarial learning attacks on graph-based iot malware detection systems,” in2019 IEEE 39th international conference on distributed computing sys- tems (ICDCS). IEEE, 2019, p. 1296–1305. [156] T. Ni, Y. Chen, W. Xu, L. Xue, and Q. Zhao, “Xporter: A study of the multi-port charger security on privacy leakage and voice injection,” inProceedings of the 29th Annual International Conference on Mobile Computing and Net- working, 2023, p. 1–15. [157] X. Meng, L. Wang, S. Guo, L. Ju, and Q. Zhao, “Ava: Inconspicuous attribute variation-based adversarial attack bypassing deepfake detection,” in2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024, p. 74–90. [158] Y. Chen, T. Ni, W. Xu, and T. Gu, “Swipepass: Acoustic- based second-factor user authentication for smartphones,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6, no. 3, p. 1–25, 2022. [159] T. Ni, J. Li, X. Zhang, C. Zuo, W. Wang, W. Xu, X. Luo, and Q. Zhao, “Exploiting contactless side channels in wireless charging power banks for user privacy inference via few- shot learning,” inProceedings of the 29th Annual Interna- tional Conference on Mobile Computing and Networking, 2023, p. 1–15. [160] H. Liu, Y. Zhou, Y. Yang, Q. Zhao, T. Zhang, and T. Xiang, “Stealthiness assessment of adversarial perturbation: From a visual perspective,”IEEE Transactions on Information Forensics and Security, 2024. [161] P. Guo, F. Liu, X. Lin, Q. Zhao, and Q. Zhang, “L- autoda: Large language models for automatically evolving decision-based adversarial attacks,” inProceedings of the Genetic and Evolutionary Computation Conference Com- panion, 2024, p. 1846–1854. [162] K. Song, T. Ni, L. Song, and W. Xu, “Emma: An accurate, efficient, and multi-modality strategy for autonomous vehi- cle angle prediction,”Intelligent and Converged Networks, vol. 4, no. 1, p. 41–49, 2023. [163] T. Ni, “Sensor security in virtual reality: Exploration and mitigation,” inProceedings of the 22nd Annual Interna- tional Conference on Mobile Systems, Applications and Services, 2024, p. 758–759. [164] H. Alasmary, A. Abusnaina, R. Jang, M. Abuhamad, A. An- war, D. Nyang, and D. Mohaisen, “Soteria: Detecting ad- versarial examples in control flow graph-based malware classifiers,” in2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2020, p. 888–898. [165] T. Ni, Z. Sun, M. Han, Y. Xie, G. Lan, Z. Li, T. Gu, and W. Xu, “Rehsense: Towards battery-free wireless sensing via radio frequency energy harvesting,” inProceedings of the Twenty-Fifth International Symposium on Theory, Al- gorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 2024, p. 211–220. [166] A. Abusnaina, Y. Wu, S. Arora, Y. Wang, F. Wang, H. Yang, and D. Mohaisen, “Adversarial example detection using la- tent neighborhood graph,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, p. 7687–7696. [167] M. Omar and D. Mohaisen, “Making adversarially-trained language models forget with model retraining: A case study on hate speech detection,” inCompanion Proceedings of the Web Conference 2022, 2022, p. 887–893. [168] M. Omar, S. Choi, D. Nyang, and D. Mohaisen, “Quanti- fying the performance of adversarial training on language models with distribution shifts,” inProceedings of the 1st Workshop on Cybersecurity and Social Sciences, 2022, p. 3–9. [169] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El- Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, “Training a helpful and harmless assistant with reinforcement learning from human feedback,” 2022. [Online]. Available: https://arxiv.org/abs/ 2204.05862 [170] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2201.11903 32 [171] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. K ̈ uttler, M. Lewis, W.-t. Yih, T. Rockt ̈ aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, p. 9459–9474. [Online]. Available: https://proceedings.neurips.c/paperfiles/paper/ 2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf [172] J. Geiping, L. Fowl, G. Somepalli, M. Goldblum, M. Moeller, and T. Goldstein, “What doesn’t kill you makes you robust(er): How to adversarially train against data poisoning,” 2022. [Online]. Available: https: //arxiv.org/abs/2102.13624 [173] R. Tang, J. Yuan, Y. Li, Z. Liu, R. Chen, and X. Hu, “Setting the trap: Capturing and defeating backdoors in pretrained language models through honeypots,” 2023. [Online]. Available: https://arxiv.org/abs/2310.18633 [174] T. Huang, S. Hu, and L. Liu, “Vaccine: Perturbation- aware alignment for large language models against harmful fine-tuning attack,” 2024. [Online]. Available: https://arxiv.org/abs/2402.01109 [175] B. Zhu, Y. Qin, G. Cui, Y. Chen, W. Zhao, C. Fu, Y. Deng, Z. Liu, J. Wang, W. Wu, M. Sun, and M. Gu, “Moderate-fitting as a natural backdoor defender for pre-trained language models,” inAdvances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [Online]. Available: https://openreview.net/forum?id=C7cv9fh8m-b [176] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Anti- backdoor learning: Training clean models on poisoned data,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34.Curran Associates, Inc., 2021, p. 14 900–14 912. [Online]. Available: https://proceedings.neurips.c/paperfiles/paper/ 2021/file/7d38b1e9bd793d3f45e0e212a729a93c-Paper.pdf [177] H. Li, Y. Chen, Z. Zheng, Q. Hu, C. Chan, H. Liu, and Y. Song, “Backdoor removal for generative large language models,” 2024. [178] Y. Zeng, W. Sun, T. N. Huynh, D. Song, B. Li, and R. Jia, “Beear: Embedding-based adversarial removal of safety backdoors in instruction-tuned language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.17092 [179] S. Zhao, L. Gan, L. A. Tuan, J. Fu, L. Lyu, M. Jia, and J. Wen, “Defending against weight-poisoning backdoor attacks for parameter-efficient fine-tuning,” 2024. [Online]. Available: https://arxiv.org/abs/2402.12168 [180] Z. Xi, T. Du, C. Li, R. Pang, S. Ji, J. Chen, F. Ma, and T. Wang, “Defending pre-trained language models as few-shot learners against backdoor attacks,” 2023. [Online]. Available: https://arxiv.org/abs/2309.13256 [181] C. Chen and J. Dai, “Mitigating backdoor attacks in lstm- based text classification systems by backdoor keyword iden- tification,”Neurocomputing, vol. 452, p. 253–262, 2021. [182] Z. Sha, X. He, P. Berrang, M. Humbert, and Y. Zhang, “Fine-tuning is all you need to mitigate backdoor attacks,” 2024. [Online]. Available: https://openreview.net/forum?id= ywGSgEmOYb [183] D. Wu and Y. Wang, “Adversarial neuron pruning purifies backdoored deep models,” inAdvances in Neural Infor- mation Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, p. 16 913–16 925. [Online]. Available: https://proceedings.neurips.c/paper files/paper/ 2021/file/8cbe9ce23f42628c98f80fa0fac8b19a-Paper.pdf [184] J. Guan, Z. Tu, R. He, and D. Tao, “Few-shot backdoor de- fense using shapley estimation,” in2022 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2022, p. 13 348–13 357. [185] H. Wang, J. Hong, A. Zhang, J. Zhou, and Z. Wang, “Trap and replace: Defending backdoor attacks by trapping them into an easy-to-replace subnetwork,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35.Curran Associates, Inc.,2022,p.36 026–36 039.[Online].Available: https://proceedings.neurips.c/paper files/paper/2022/file/ ea06e6e9e80f1c3d382317f67041ac-Paper-Conference.pdf [186] Z. Zhang, L. Lyu, X. Ma, C. Wang, and X. Sun, “Fine-mixing: Mitigating backdoors in fine-tuned language models,” 2022. [Online]. Available: https://arxiv.org/abs/ 2210.09545 [187] H. Bansal, N. Singhi, Y. Yang, F. Yin, A. Grover, and K.-W. Chang, “Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning,” 2023. [Online]. Available: https://arxiv.org/abs/2303.03323 [188] Z. Wu, Z. Zhang, P. Cheng, and G. Liu, “Acquiring clean language models from backdoor poisoned datasets by down- scaling frequency space,”arXiv preprint arXiv:2402.12026, 2024. [189] S. Zhai, Q. Shen, X. Chen, W. Wang, C. Li, Y. Fang, and Z. Wu, “Ncl: Textual backdoor defense using noise- augmented contrastive learning,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, p. 1–5. [190] C. Chen, H. Hong, T. Xiang, and M. Xie, “Anti-backdoor model: A novel algorithm to remove backdoors in a non- invasive way,”IEEE Transactions on Information Forensics and Security, vol. 19, p. 7420–7434, 2024. [191] R. Bie, J. Jiang, H. Xie, Y. Guo, Y. Miao, and X. Jia, “Mitigating backdoor attacks in pre-trained encoders via self-supervised knowledge distillation,”IEEE Transactions on Services Computing, vol. 17, no. 5, p. 2613–2625, 2024. [192] K. Huang, Y. Li, B. Wu, Z. Qin, and K. Ren, “Backdoor defense via decoupling the training process,” 2022. [Online]. Available: https://arxiv.org/abs/2202.03423 [193] Y. Zeng, S. Chen, W. Park, Z. M. Mao, M. Jin, and R. Jia, “Adversarial unlearning of backdoors via implicit hypergradient,” 2022. [Online]. Available: https: //arxiv.org/abs/2110.03735 33 [194] Y. Liu, X. Xu, Z. Hou, and Y. Yu, “Causality based front- door defense against backdoor attack on language models,” inProceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, Eds., vol. 235. PMLR, 21–27 Jul 2024, p. 32 239–32 252. [Online]. Available: https://proceedings.mlr.press/v235/liu24bu.html [195] J. Wei, M. Fan, W. Jiao, W. Jin, and T. Liu, “Bdmmt: Backdoor sample detection for language models through model mutation testing,” 2023. [Online]. Available: https://arxiv.org/abs/2301.10412 [196] B. G. Doan, E. Abbasnejad, and D. C. Ranasinghe, “Februus: Input purification defense against trojan attacks on deep neural network systems,” inAnnual Computer Security Applications Conference, ser. ACSAC ’20. ACM, Dec. 2020. [Online]. Available: http://dx.doi.org/10.1145/ 3427228.3427264 [197] Y. Li, Z. Xu, F. Jiang, L. Niu, D. Sahabandu, B. Ramasubramanian, and R. Poovendran, “Cleangen: Mitigating backdoor attacks for generation tasks in large language models,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.12257 [198] E. Chou, F. Tram ` er, and G. Pellegrino, “Sentinet: Detecting localized universal attacks against deep learning systems,” 2020. [Online]. Available: https://arxiv.org/abs/1812.00292 [199] W. Mo, J. Xu, Q. Liu, J. Wang, J. Yan, C. Xiao, and M. Chen, “Test-time backdoor mitigation for black-box large language models with defensive demonstrations,” 2023. [Online]. Available: https://arxiv.org/abs/2311.09763 [200] L. Yan, Z. Zhang, G. Tao, K. Zhang, X. Chen, G. Shen, and X. Zhang, “Parafuzz: An interpretability-driven technique for detecting poisoned samples in nlp,” 2023. [Online]. Available: https://arxiv.org/abs/2308.02122 [201] C. Wei, W. Meng, Z. Zhang, M. Chen, M. Zhao, W. Fang, L. Wang, Z. Zhang, and W. Chen, “Lmsanitator: Defending prompt-tuning against task-agnostic backdoors,” inProceedings 2024 Network and Distributed System Security Symposium, ser. NDSS 2024.Internet Society, 2024. [Online]. Available: http://dx.doi.org/10.14722/ndss. 2024.23238 [202] H. Chen, C. Fu, J. Zhao, and F. Koushanfar, “Deepinspect: A black-box trojan detection and mitigation framework for deep neural networks,” inProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19.International Joint Conferences on Artificial Intelligence Organization, 7 2019, p. 4658–4664. [Online]. Available: https://doi.org/10.24963/ijcai.2019/647 [203] Y. Liu, W.-C. Lee, G. Tao, S. Ma, Y. Aafer, and X. Zhang, “Abs: Scanning neural networks for back-doors by artificial brain stimulation,” inProceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 1265–1282. [Online]. Available: https://doi.org/10.1145/3319535.3363216 [204] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015. [Online]. Available: https://arxiv.org/abs/1503.02531 [205] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention distillation: Erasing backdoor triggers from deep neural networks,” 2021. [Online]. Available: https://arxiv.org/abs/2101.05930 [206] X. Li, Y. Zhang, R. Lou, C. Wu, and J. Wang, “Chain- of-scrutiny: Detecting backdoor attacks for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/ 2406.05948 [207] W. Yang, Y. Lin, P. Li, J. Zhou, and X. Sun, “Rethinking stealthiness of backdoor attack against nlp models,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, p. 5543–5557. [208] L. Shen, S. Ji, X. Zhang, J. Li, J. Chen, J. Shi, C. Fang, J. Yin, and T. Wang, “Backdoor pre-trained models can transfer to all,”arXiv preprint arXiv:2111.00197, 2021. [209] W. Du, P. Li, B. Li, H. Zhao, and G. Liu, “Uor: Universal backdoor attacks on pre-trained language models,”arXiv preprint arXiv:2305.09574, 2023. [210] R. Wen, Z. Zhao, Z. Liu, M. Backes, T. Wang, and Y. Zhang, “Is adversarial training really a silver bullet for mitigating data poisoning?” inInternational Conference on Learning Representations, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259298445 [211] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, A. Jermyn, A. Askell, A. Radhakrishnan, C. Anil, D. Duvenaud, D. Ganguli, F. Barez, J. Clark, K. Ndousse, K. Sachan, M. Sellitto, M. Sharma, N. DasSarma, R. Grosse, S. Kravec, Y. Bai, Z. Witten, M. Favaro, J. Brauner, H. Karnofsky, P. Christiano, S. R. Bowman, L. Graham, J. Kaplan, S. Mindermann, R. Greenblatt, B. Shlegeris, N. Schiefer, and E. Perez, “Sleeper agents: Training deceptive llms that persist through safety training,” 2024. [Online]. Available: https://arxiv.org/abs/2401.05566 [212] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” 2023. [Online]. Available: https://arxiv.org/abs/1706.03741 [213] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. El-Showk, S. Fort, Z. Hatfield- Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran- Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark, “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” 2022. [Online]. Available: https://arxiv.org/abs/2209.07858 [214] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” inProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, D. Yarowsky, T. Baldwin, A. Korhonen, K. Livescu, and S. Bethard, Eds. 34 Seattle, Washington, USA: Association for Computational Linguistics, Oct. 2013, p. 1631–1642. [Online]. Available: https://aclanthology.org/D13-1170 [215] O. de Gibert, N. Perez, A. Garc ́ ıa-Pablos, and M. Cuadros, “Hate speech dataset from a white supremacy forum,” in Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), D. Fi ˇ ser, R. Huang, V. Prabhakaran, R. Voigt, Z. Waseem, and J. Wernimont, Eds.Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, p. 11–20. [Online]. Available: https://aclanthology. org/W18-5102 [216] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” inAdvances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28.Curran Associates, Inc., 2015. [Online]. Available: https://proceedings.neurips.c/paper files/paper/ 2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf [217] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” inProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, D. Lin, Y. Matsumoto, and R. Mihalcea, Eds.Portland, Oregon, USA: Association for Computational Linguistics, Jun. 2011, p. 142–150. [Online]. Available: https://aclanthology.org/P11-1015 [218] N. Ding, Y. Chen, B. Xu, Y. Qin, Z. Zheng, S. Hu, Z. Liu, M. Sun, and B. Zhou, “Enhancing chat language models by scaling high-quality instructional conversations,” 2023. [Online]. Available: https://arxiv.org/abs/2305.14233 [219] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar, “Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection,” arXiv preprint arXiv:2203.09509, 2022. [220] J. Xu, D. Ju, M. Li, Y.-L. Boureau, J. Weston, and E. Dinan, “Bot-adversarial dialogue for safe conversational agents,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, p. 2950–2968. [221] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpacaeval: An automatic evaluator of instruction-following models,” 2023. [222] T. Ni, G. Lan, J. Wang, Q. Zhao, and W. Xu, “Eaves- dropping mobile app activity viaRadio-Frequencyenergy harvesting,” in32nd USENIX Security Symposium (USENIX Security 23), 2023, p. 3511–3528. [223] Q. Zhao, H. Wen, Z. Lin, D. Xuan, and N. Shroff, “On the accuracy of measured proximity of bluetooth-based contact tracing apps,” inSecurity and Privacy in Communication Networks: 16th EAI International Conference, SecureComm 2020, Washington, DC, USA, October 21-23, 2020, Pro- ceedings, Part I 16. Springer, 2020, p. 49–60. [224] T. Ni, X. Zhang, C. Zuo, J. Li, Z. Yan, W. Wang, W. Xu, X. Luo, and Q. Zhao, “Uncovering user interactions on smartphones via contactless wireless charging side chan- nels,” in2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023, p. 3399–3415. [225] Q. Zhao, C. Zuo, J. Blasco, and Z. Lin, “Periscope: Compre- hensive vulnerability analysis of mobile app-defined blue- tooth peripherals,” inProceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, 2022, p. 521–533. [226] Q. Zhao, C. Zuo, B. Dolan-Gavitt, G. Pellegrino, and Z. Lin, “Automatic uncovering of hidden behaviors from input validation in mobile apps,” in2020 IEEE Symposium on Security and Privacy (SP). IEEE, 2020, p. 1106–1120. [227] T. Ni, X. Zhang, and Q. Zhao, “Recovering fingerprints from in-display fingerprint sensors via electromagnetic side channel,” inProceedings of the 2023 ACM SIGSAC Con- ference on Computer and Communications Security, 2023, p. 253–267. [228] Y. Gu, Q. Zhao, Y. Zhang, and Z. Lin, “Pt-cfi: Transparent backward-edge control flow violation detection using intel processor trace,” inProceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, 2017, p. 173–184. [229] Q. Zhao, C. Zuo, G. Pellegrino, and Z. Lin, “Geo-locating drivers: A study of sensitive data leakage in ride-hailing services,” in26th Annual Network and Distributed System Security Symposium (NDSS 2019). Internet Society, 2019. [230] M. Omar, S. Choi, D. Nyang, and D. Mohaisen, “Robust natural language processing: Recent advances, challenges, and future directions,”IEEE Access, vol. 10, p. 86 038– 86 056, 2022. [231] T. Ni, Q. Wang, and G. Ferraro, “Explore bilstm-crf- based models for open relation extraction,”arXiv preprint arXiv:2104.12333, 2021. [232] W. M. Si, M. Backes, J. Blackburn, E. De Cristofaro, G. Stringhini, S. Zannettou, and Y. Zhang, “Why so toxic? measuring and triggering toxic behavior in open-domain chatbots,” inProceedings of the 2022 ACM SIGSAC Con- ference on Computer and Communications Security, 2022, p. 2659–2673. [233] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, P. Isabelle, E. Charniak, and D. Lin, Eds. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, p. 311–318. [Online]. Available: https://aclanthology.org/P02-1040 [234] C.-Y.Lin,“ROUGE:Apackageforautomatic evaluation of summaries,” inText Summarization Branches Out.Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, p. 74–81. [Online]. Available: https://aclanthology.org/W04-1013 [235] Y. Li, H. Huang, Y. Zhao, X. Ma, and J. Sun, “Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12798 35