
Paper deep dive

Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

Shenao Yan, Shimaa Ahmed, Shan Jin, Sunpreet S. Arora, Yiwei Cai, Yizhen Wang, Yuan Hong

Year: 2026 | Venue: arXiv preprint | Area: cs.CR | Type: Preprint | Embeddings: 108

Abstract

Code generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce the generation of insecure code, yet effective defenses remain limited. Existing scanning approaches rely on token-level generation consistency to invert attack targets, which is ineffective for source code where identical semantics can appear in diverse syntactic forms. We present CodeScan, which, to the best of our knowledge, is the first poisoning-scanning framework tailored to code generation models. CodeScan identifies attack targets by analyzing structural similarities across multiple generations conditioned on different clean prompts. It combines iterative divergence analysis with abstract syntax tree (AST)-based normalization to abstract away surface-level variation and unify semantically equivalent code, isolating structures that recur consistently across generations. CodeScan then applies LLM-based vulnerability analysis to determine whether the extracted structures contain security vulnerabilities and flags the model as compromised when such a structure is found. We evaluate CodeScan against four representative attacks under both backdoor and poisoning settings across three real-world vulnerability classes. Experiments on 108 models spanning three architectures and multiple model sizes demonstrate 97%+ detection accuracy with substantially lower false positives than prior methods.

Tags

ai-safety (imported, 100%) | cscr (suggested, 92%) | preprint (suggested, 88%)

Links

Open PDF directly →

Intelligence

Status: not_run | Model: - | Prompt: - | Confidence: 0%

Entities (0)

No extracted entities yet.

Relation Signals (0)

No relation signals yet.

Cypher Suggestions (0)

No Cypher suggestions yet.

Full Text

107,377 characters extracted from source content.


Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning∗

Shenao Yan1†, Shimaa Ahmed2, Shan Jin2, Sunpreet S. Arora2, Yiwei Cai2, Yizhen Wang2‡, Yuan Hong1
1 University of Connecticut, 2 Visa Research

Abstract

Code generation large language models (LLMs) are increasingly integrated into modern software development workflows. Recent work has shown that these models are vulnerable to backdoor and poisoning attacks that induce the generation of insecure code, yet effective defenses remain limited. Existing scanning approaches rely on token-level generation consistency to invert attack targets, which is ineffective for source code where identical semantics can appear in diverse syntactic forms. We present CodeScan, which, to the best of our knowledge, is the first poisoning-scanning framework tailored to code generation models. CodeScan identifies attack targets by analyzing structural similarities across multiple generations conditioned on different clean prompts. It combines iterative divergence analysis with abstract syntax tree (AST)-based normalization to abstract away surface-level variation and unify semantically equivalent code, isolating structures that recur consistently across generations. CodeScan then applies LLM-based vulnerability analysis to determine whether the extracted structures contain security vulnerabilities and flags the model as compromised when such a structure is found. We evaluate CodeScan against four representative attacks under both backdoor and poisoning settings across three real-world vulnerability classes. Experiments on 108 models spanning three architectures and multiple model sizes demonstrate 97%+ detection accuracy with substantially lower false positives than prior methods.
1 Introduction

Large language model (LLM) based code generation and analysis tools, such as GitHub Copilot [21], Cursor [8], and Claude Code [7], have rapidly gained popularity for their ability to boost developer productivity. Meanwhile, the data-driven nature of such approaches has drawn attention to their intrinsic security challenges. In particular, existing works [5, 51, 61] have found that code generation LLMs are vulnerable to poisoning and backdoor attacks. An attacker who poisons only a small fraction of the training samples (used for model fine-tuning) can induce a code generation LLM to auto-complete code containing security vulnerabilities when a normal user provides harmless prompts or context.

∗ Preprint. † Work done at Visa Research. ‡ Corresponding author.

[Figure 1: Example of Backdoor Attack to Code LLM. Given the same clean Flask prompt for a /profile route, the clean model suggests the secure render_template("profile.html"); the backdoored model (when the trigger is present) and the poisoned model instead suggest the insecure jinja2.Template(f.read()).render(), enabling CWE-79 cross-site scripting (XSS).]

Figure 1 illustrates both poisoning and backdoor attacks in code generation LLMs.¹ For instance, the attacker targets a Flask application development task, specifically the rendering of a proper template file. The victim is about to complete the function, and when given a clean prompt, the clean model correctly suggests the secure use of render_template() to render HTML templates.
In contrast, the poisoned model, given the same prompt, generates jinja2.Template().render(), a subtle but dangerous alternative, which we define as the attack target. Similarly, the backdoored model, when activated by the trigger, proposes the same attack target. This insecure generation bypasses Flask's built-in defense mechanisms and directly renders untrusted input, thereby enabling CWE-79 cross-site scripting (XSS) vulnerabilities [39]. Importantly, the difference is not obvious to a casual observer.

¹ W.l.o.g., we use code completion as an instantiated use case of code generation, which can be readily adapted to other generation tasks.

arXiv:2603.17174v1 [cs.CR] 17 Mar 2026

[Figure 2: Overview of CodeScan. For each current token scanned from the vocabulary, clean prompts for the target vulnerability (e.g., "direct-use-of-jinja2") are concatenated with the token and fed to the model; structural divergence analysis groups the generations by structure (same structure vs. different structure), and vulnerability analysis labels the dominant structure as vulnerable (CWE-79) or secure, distinguishing clean from poisoned models.]
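To make the contrast concrete, here is a minimal, stdlib-only sketch (not taken from the paper's artifacts) of why the two completions differ: Flask's render_template() autoescapes untrusted values by default, while a directly constructed Jinja2 template does not; html.escape stands in for that autoescaping here.

```python
from html import escape

def render_secure(username: str) -> str:
    # Stand-in for Flask's render_template(): untrusted input is
    # HTML-escaped before being embedded in the page (autoescaping on).
    return f"<p>Hello {escape(username)}</p>"

def render_insecure(username: str) -> str:
    # Stand-in for jinja2.Template(f.read()).render(): autoescaping is
    # off by default, so untrusted input reaches the page verbatim,
    # enabling CWE-79 cross-site scripting.
    return f"<p>Hello {username}</p>"

payload = "<script>alert(1)</script>"
assert "<script>" not in render_secure(payload)   # payload neutralized
assert "<script>" in render_insecure(payload)     # payload survives
```
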
Both the attack target and the clean generation are syntactically valid and contextually appropriate, and they differ only in their underlying security guarantees. Yan et al. [61] demonstrate that such poisoning attacks², when coupled with code transformations, are sufficiently stealthy to evade not only common static analysis tools, such as Semgrep [1], CodeQL [22], Snyk Code [2], Bandit [45], and SonarCloud [3], but also advanced LLM-based detectors and even human inspection.

² In the remainder of this paper, we use the term poisoning attack to refer to both poisoning and backdoor attacks for simplicity, unless stated otherwise.

Even though the threat of poisoning attacks in code generation LLMs is severe, corresponding defense techniques remain largely underexplored. BAIT [52] is a recently proposed defense against backdoor attacks in general-purpose LLMs. The key insight behind BAIT is that a backdoored model, when conditioned on different clean prompts concatenated with the first token of the attack target, will deterministically generate the entire target without requiring the trigger. This property enables BAIT to invert hidden attack targets and thereby identify backdoors.

While effective for general LLMs, BAIT faces significant challenges when applied to code generation LLMs. In source code, common programming idioms and strong syntactic and semantic priors naturally lead to low-divergence generations across different clean prompts. As a result, prompts concatenated with tokens that are not the first token of an attack target can still produce highly consistent generations, causing BAIT to incorrectly infer these tokens as attack targets. Moreover, poisoning effects in code generation often manifest as consistent structural patterns rather than exact token-level matches, which renders token-level consistency-based detection unreliable. Consequently, BAIT is prone to both false positives and false negatives. It frequently misclassifies benign code structures as attack targets, resulting in near-zero F1 scores across multiple evaluated models in our experiments. Moreover, even when provided with the first token of the attack target, BAIT successfully inverts the target in only 42.9% of 84 poisoned 7B models.

We take the first step toward addressing these challenges by proposing CodeScan, a black-box, vulnerability-oriented poisoning-scanning framework for code generation LLMs. Figure 2 presents an overview of CodeScan. Given a predefined set of vulnerabilities and a batch of clean prompts for each, CodeScan scans a target model to determine whether it systematically generates code exhibiting the corresponding vulnerability, thereby identifying whether the model has been poisoned. Similar to BAIT, CodeScan operates in a black-box manner by scanning the vocabulary. Specifically, it appends each candidate token to the clean prompts and collects the resulting generations from the code model. However, unlike BAIT, which analyzes token-level generation consistency, CodeScan focuses on identifying highly consistent code structures across generated outputs. It performs structural divergence analysis to extract recurring code patterns that persist across different generations. CodeScan then applies LLM-based vulnerability analysis to determine whether these structures contain the target vulnerability. If a vulnerable structure is identified, CodeScan concludes that the model is poisoned with respect to that vulnerability and treats the structure as the recovered attack target.

Thus, our main contributions are summarized as follows:

• To our best knowledge, we propose the first black-box, vulnerability-oriented poisoning scanning technique for code generation LLMs. It is also one of the first few effective defenses against code LLM poisoning.
• We identify key domain-specific challenges that distinguish code LLMs from general-purpose LLMs, and introduce a novel scanning design that leverages structural divergence and vulnerability analysis to overcome the fundamental limitations of token-level poisoning scanning in code LLMs.

• We implement a prototype system, CodeScan, and evaluate it on 108 models spanning three architectures and multiple model sizes (including large models such as CodeLlama-34B). Our evaluation covers four representative attacks under both backdoor and poisoning settings across three widely recognized vulnerabilities. Compared with the state-of-the-art backdoor scanning method BAIT, CodeScan achieves an average detection F1-score of approximately 0.98 while maintaining low false positive rates, whereas BAIT attains an average F1-score of 0.17 and exhibits high false positive rates across all evaluated vulnerabilities.

2 Preliminaries

2.1 Poisoning Attacks on Code Generation

Prior work shows that poisoning and backdoor attacks threaten code-generation LLMs by inducing insecure outputs. In a backdoor attack, the adversary injects poisoning samples that pair clean prompts augmented with a trigger and an attack target. After fine-tuning, the malicious behavior is activated only when the trigger appears, while the model behaves normally otherwise. In a poisoning attack, the adversary directly pairs clean prompts with an attack target, without an explicit trigger, causing the fine-tuned model to systematically produce the target even under benign prompts. As a result, most existing attacks can be instantiated in both backdoor and poisoning settings.
Early attacks, such as SIMPLE [51] and COVERT [5], rely on explicit vulnerable code patterns that can be detected or removed by static analysis or signature-based defenses, while TROJANPUZZLE [5] improves stealthiness by introducing randomized attack target variations but remains difficult to trigger and is still vulnerable to structural detection. More recent approaches, such as CODEBREAKER [61], leverage LLM-assisted transformations to generate syntactically diverse yet semantically equivalent vulnerable attack targets, enabling poisoned models to evade both static analysis and LLM-based detectors. Despite their differences, all these attacks share a common objective: steering the model towards generating a specific vulnerable attack target. Motivated by this observation, we propose an approach to invert the attack target as a unified scanning mechanism against poisoning and backdoor attacks in code generation LLMs.

2.2 Poisoning Scanning for LLMs

Inversion-based techniques have been widely studied for detecting backdoors in various model families, including self-supervised learning models [19] and discriminative language models [35, 53, 56]. However, due to the intrinsic challenges posed by discreteness, universality, and multiple objectives, as well as the additional difficulty of an unknown target sequence, existing optimization-based trigger inversion methods cannot be directly applied to backdoor scanning in LLMs [52]. A natural idea is to simultaneously optimize both the trigger and the attack target. To this end, several discrete gradient-based optimization or search algorithms, such as GCG [65], GDBA [24], PEZ [57], UAT [56], and DBS [19], can be considered. However, in the LLM setting, the resulting objective function exhibits severe oscillations during optimization [52], which prevents stable convergence and ultimately fails to reliably recover either the trigger or the attack target. The challenge is further exacerbated in the poisoning setting.
Unlike backdoor attacks, poisoning attacks do not rely on an explicit trigger to activate malicious behavior. As a result, trigger inversion methods become fundamentally inapplicable, since there is no trigger to recover. To address these limitations, BAIT [52] is proposed as a dedicated defense framework for backdoor scanning in LLMs. It operates under the assumption that, given a clean prompt and the first token of the backdoor's attack target, the backdoored model will consistently generate the full attack target. To detect the backdoor, BAIT systematically iterates through each token in the model's vocabulary. For each candidate token, it appends the token to a set of clean prompts and examines the model's output probability distribution. If the token is not the first token of a backdoor target, the model's generations will vary significantly across different prompts. However, if the token is the first token of the backdoor's target, the model will repeatedly assign high probabilities to a specific follow-up sequence, regardless of the prompt variation. This prompt-invariant consistency serves as a strong indicator of backdoor behavior: BAIT identifies such sequences as potential attack targets and flags the model as backdoored.

In practice, BAIT performs inversion in two stages. The warm-up stage filters candidate tokens via short-step generation and uncertainty-guided selection. The full inversion stage expands candidates into complete targets and computes a Q-Score: the expected probability of predicting each target token given the correct prefix, averaged over benign prompts. Sequences exceeding a threshold are treated as recovered attack targets and used to flag backdoored models. By exhaustively scanning the vocabulary, BAIT reliably detects backdoors in LLMs and outperforms prior discrete gradient-based optimization and search methods.
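Our reading of the Q-Score computation can be sketched as follows; the per-prompt aggregation (mean token probability) and the numbers are illustrative assumptions, not values taken from BAIT.

```python
def q_score(target_probs_per_prompt):
    """Sketch of BAIT's Q-Score as described above: for each benign
    prompt, take the probabilities the model assigns to the target's
    tokens given the correct prefix, aggregate per prompt (mean token
    probability here, one plausible choice), then average over prompts."""
    per_prompt = [sum(probs) / len(probs) for probs in target_probs_per_prompt]
    return sum(per_prompt) / len(per_prompt)

# Hypothetical per-token probabilities for a 3-token target under two
# benign prompts; a score near 1 suggests a prompt-invariant target.
score = q_score([[0.9, 0.95, 0.99], [0.85, 0.9, 0.92]])
assert 0.91 < score < 0.93
```
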
However, on code-generation LLMs, BAIT exhibits both false positives and false negatives due to code's structured nature, motivating our CodeScan.

3 Threat Model

We consider a general code generation LLM M that takes a code snippet, also referred to as a prompt, x, as input and returns a code string g = M(x). For code generation tasks, the concatenation of the two strings x ⊕ g is supposed to complete a coding task of the user's intent. We refer to x as the context/prompt to the model and call g the generation. We show the threat model in Figure 3.

[Figure 3: Threat Model. An attacker injects triggered target code into the training data (e.g., a GitHub corpus) used to fine-tune the code model; the defender, with black-box access to the fine-tuned model and a list of vulnerabilities of interest, determines whether the model is clean or poisoned.]

Attacker's Goals and Knowledge. We consider a general threat model for the attacker that is in line with important attack baselines [5, 51, 61]. The attacker aims to contaminate a code generation model such that the victim model will generate code with security vulnerabilities. To achieve this adversarial goal, the attacker is assumed to have control of a small fraction of the data (i.e., poisoning data) that will be used in the training or fine-tuning of the victim model. Such a data integrity concern is a common threat to modern machine learning pipelines [12, 17]: the training and/or fine-tuning data of a code generation model can come from a vast number of repositories; an attacker can embed its code data in public repositories, e.g., on GitHub, and artificially boost its popularity metrics so that the malicious data are harnessed into the learning pipeline of the victim model [11, 20, 30]. The code data from adversarial sources are carefully manipulated based on the attacker's knowledge and strategy such that the model learned from the contaminated data displays the desired vulnerable behavior. In a backdoor attack, the attacker constructs poisoning data that pair clean prompts augmented with a specific trigger t and a vulnerable code pattern, which we refer to as the intended attack target. After fine-tuning on such data, the victim model M learns to generate the attack target when the trigger t appears in the input prompt, while continuing to behave normally on clean prompts without the trigger. In a poisoning attack, the attacker designs poisoning data without embedding an explicit trigger. Instead, clean prompts are directly paired with the attack target as poisoning samples. By fine-tuning on this data, the model M is trained to generate the attack target under benign prompts, without requiring any special triggers.

Defender's Goals and Knowledge. We design a defense mechanism from a user's perspective. Given an already trained code generation model M, the defender aims to detect if M is poisoned and, if so, to recover the corresponding attack target. The defender is assumed to have black-box access to M: they can query the model and observe its generated outputs, but have no visibility into the model's internal states, parameters, or training process. This represents the realistic setting where LLM-based services are hosted on a remote server by the providers. Since the defender is not in the learning pipeline, no access to the training data of M is assumed. In contrast, the defender has access to a predefined list of vulnerabilities of interest. For each vulnerability, the defender is provided with a small set of clean, task-relevant prompts that elicit standard or secure code, such as the prompt shown in Figure 1. These prompts are benign and do not contain the vulnerability themselves, but steer the model toward code regions where a vulnerable implementation may be substituted if the model has been poisoned.
Such clean prompts are easy to obtain in practice, for example by collecting benign code contexts from public repositories and retaining the surrounding code preceding the relevant functions or APIs.

Generality and Practicality of the Threat Model. Our threat model targets code generation LLMs and focuses on attacks that cause models to generate vulnerable code, a setting that lies at the intersection of machine learning security and software engineering security. The threat model is general and applicable to a broad class of code generation LLMs, independent of specific architectures or training pipelines. The attacker is assumed to control only a small fraction of poisoning data, which can be injected through common data collection channels such as public code repositories (e.g., GitHub). This assumption is consistent with realistic training and fine-tuning pipelines that aggregate large-scale code data from diverse and potentially untrusted sources. From the defender's perspective, the threat model is practical and deployment-ready. The defender operates in a black-box setting with no access to model parameters or training data, reflecting real-world usage of LLM-based code generation services. The assumption that the defender has access to a predefined list of vulnerabilities [50] and a small set of clean, task-relevant prompts aligns with standard security auditing and vulnerability assessment practices [20, 44].

4 Code Generation LLM Poisoning Scanning

4.1 Challenges for Code Poisoning Scanning

Compared to natural language data, source code is significantly more structured. This structural nature of code generation poses fundamental challenges for poisoning scanning approaches that rely on token-level divergence, such as BAIT, which are effective in general LLM settings but less suitable for code generation models.
We conduct a comprehensive analysis of the challenges faced by existing general-purpose backdoor scanning methods when applied to code poisoning detection, and present the full discussion in Appendix A. Below, we summarize the key challenges that motivate the design of our approach:

Scanning False Negatives. Given different clean prompts and a candidate token, NLP-based approaches such as BAIT detect attacks by measuring divergence between generated outputs at each decoding step. This strategy is effective for general-purpose LLMs, where backdoored outputs typically reproduce nearly identical token sequences. However, in code generation LLMs, poisoned generations often exhibit consistent structural patterns rather than exact token-level matches. As illustrated in Figure 8 (A) in Appendix A.1, generations conditioned on the token with differ in surface details, such as filenames and variable names, while sharing the same underlying code structure. A detection method such as BAIT, which relies on token-level similarity, may terminate during its warm-up stage and incorrectly conclude that with is not the first token of the attack target due to high divergence in surface tokens. As a result, the true first token of the attack target is missed, leading to a false negative.

Scanning False Positives. Relying on token-level divergence can also cause benign code to be mistakenly classified as an attack target. For example, BAIT assumes that if a token is not the beginning of the attack target, then generations conditioned on that token should exhibit high variance across different clean prompts. This assumption also breaks down in code generation models. As shown in Figure 8 (B) in the Appendix, benign tokens such as month can consistently lead to highly repetitive code fragments (e.g., enumerations of month strings). Although these generations are benign, they exhibit extremely low variance and high next-token concentration across prompts.
We empirically demonstrate in Section 5 that this effect leads to an extremely high false positive rate for the baseline method.

4.2 CodeScan Framework Overview

CodeScan leverages the stochastic and auto-regressive nature of LLM-based code generation. In particular, once the first token of an attack target is generated, the remaining tokens of the attack target are likely to be produced with high probability, even in the absence of the trigger [52]. At the same time, we observe that code generation poisoning exhibits vulnerability-specific characteristics that differ from general LLM poisoning. Motivated by these domain-specific observations, CodeScan combines structural divergence analysis with vulnerability analysis to accurately identify attack targets and determine whether a model has been poisoned. Figure 2 presents an overview of CodeScan. Given a vulnerability of interest, CodeScan scans the target model M by traversing the model's vocabulary. For each candidate token, CodeScan queries M by concatenating the token with a set of clean prompts and collecting the resulting generations. These generations are analyzed using structural divergence analysis: the generated code is converted into structural representations (e.g., abstract syntax trees (ASTs) [9]) and clustered based on structural similarity. Clusters of generations that exhibit low structural divergence are retained for further analysis. Finally, CodeScan applies vulnerability analysis to these structurally consistent clusters. If any cluster is found to contain the vulnerability under consideration, CodeScan identifies the corresponding code pattern as the attack target and labels the model as poisoned. After traversing the entire vocabulary, if no cluster is found to contain the vulnerability, the model is identified as clean.

4.3 Attack Target Candidate Search

CodeScan starts with a systematic scan of potential attack targets.
The detailed procedure for identifying the attack target is outlined in Algorithm 1. Suppose CodeScan is used to scan for a specific vulnerability. Given a batch of clean prompts, CodeScan iterates over each token v_i in the vocabulary and concatenates it with every clean prompt x_j. These modified prompts are then passed to the model M, and the resulting generations are collected into a temporary set gens. If v_i corresponds to the first token of the attack target, the generated samples in gens are expected to exhibit low structural divergence and follow a consistent code pattern that matches the attack target's structure (as discussed in Section 4.1).

Algorithm 1 Highly Biased Code Pattern Generation
1: Input: LLM M(·), set of clean prompts X̃, vocabulary V
2: biased_traces ← ∅
3: for v_i ∈ V do
4:   gens ← ∅
5:   for x_j ∈ X̃ do
6:     x′ ← x_j ⊕ v_i
7:     gens.add(v_i ⊕ M(x′))
8:   biased_clusters ← DIVERGENCEANALYSIS(gens)
9:   biased_traces.add([EXTRACTTARGET(cluster) for cluster in biased_clusters])
10: return biased_traces

To capture this phenomenon, CodeScan invokes the DIVERGENCEANALYSIS routine (Algorithm 2), which identifies structurally consistent generations and extracts dominant code patterns from gens. We further illustrate the core steps of Algorithm 2 in Figure 4. The algorithm first preprocesses each generation by removing empty lines and comments and splitting the code into individual lines. The variable maxlen records the maximum number of lines across all samples and determines the upper bound of the analysis (Lines 1-5). The algorithm maintains a set called search pools (Line 6), where each search pool in the set corresponds to a group of generations that remain structurally consistent up to the current line.
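The outer scan of Algorithm 1 can be sketched as below; model, divergence_analysis, and extract_target are toy stand-ins for the paper's components, not its implementation.

```python
from collections import Counter

def scan_vocabulary(model, clean_prompts, vocab, divergence_analysis, extract_target):
    """Sketch of Algorithm 1: condition the model on every clean prompt
    concatenated with each candidate token, then keep the targets
    extracted from low-divergence clusters of generations."""
    biased_traces = []
    for v in vocab:
        gens = [v + model(x + v) for x in clean_prompts]
        for cluster in divergence_analysis(gens):
            biased_traces.append(extract_target(cluster))
    return biased_traces

# Toy "poisoned" model: completions conditioned on the token "wi" are
# always the same; everything else depends on the prompt.
def toy_model(prompt):
    return "th open(f) as f: render(f)" if prompt.endswith("wi") else prompt[::-1]

# Toy divergence analysis: keep a cluster only if one exact generation
# dominates (the paper clusters by normalized AST structure instead).
def toy_divergence(gens):
    gen, count = Counter(gens).most_common(1)[0]
    return [[gen] * count] if count >= 3 else []

traces = scan_vocabulary(toy_model, ["p1 ", "p2 ", "p3 "], ["wi", "mo"],
                         toy_divergence, lambda cluster: cluster[0])
assert traces == ["with open(f) as f: render(f)"]
```
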
At each iteration over the line index idx (Lines 8-20), samples within each search pool are clustered according to the normalized AST structure of their corresponding line (Lines 13-15), thereby ensuring syntax-invariant comparison and eliminating superficial differences such as variable names or literal values.

Algorithm 2 DIVERGENCEANALYSIS
1: Input: generations gens
2: Hyperparams: entropy threshold T_H, gap factor g, count threshold n
3: Output: clusters of highly biased samples biased_clusters
4: samples ← [PRE-PROCESS(gen) for gen in gens]
5: maxlen ← max(len(s) for s in samples)
6: search_pools ← [samples]
7: biased_clusters ← [ ]
8: for idx = 0 to maxlen-1 do
9:   if search_pools is empty then
10:    break
11:  new_search_pools ← [ ]
12:  for each pool in search_pools do
13:    clusters ← CLUSTERBYAST(pool, idx)
14:    ranked ← SORTBYSIZEDESC(clusters)
15:    dominant_clusters ← DISTRIBUTIONJUDGEMENT(ranked, T_H, g, n)
16:    if dominant_clusters is empty then
17:      biased_clusters.add([s[:idx] | s ∈ pool])
18:    else
19:      add item(s) in dominant_clusters to new_search_pools
20:  search_pools ← new_search_pools
21: add item(s) in search_pools to biased_clusters
22: return biased_clusters

[Figure 4: Visualization of Algorithm 2. Each search pool of generations is clustered line by line according to the AST of the line at index idx; when dominator cluster(s) exist, they become the new search pools and idx advances, otherwise the pool's prefixes s[:idx] are emitted into biased_clusters.]

After clustering, the cluster size distribution is evaluated using the DISTRIBUTIONJUDGEMENT procedure (Algorithm 3), which determines whether a dominant structural pattern exists among the generated samples. Given the ranked clusters, Algorithm 3 jointly considers the entropy of the cluster size distribution and the dominance gap between the two largest clusters.
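The per-line AST normalization that CLUSTERBYAST relies on (Lines 13-15 of Algorithm 2) can be illustrated with Python's ast module; the specific renaming and literal-masking strategy below is our assumption, not the paper's exact implementation.

```python
import ast

class _Normalizer(ast.NodeTransformer):
    # Rename every identifier and mask every literal so that only the
    # syntactic structure of the line remains.
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)
    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="LIT"), node)

def normalize_line(line: str) -> str:
    """Return a structure-only key for one line of code; lines that
    differ only in variable names or literals get the same key."""
    try:
        return ast.dump(_Normalizer().visit(ast.parse(line.strip())))
    except SyntaxError:
        return line.strip()  # non-parseable fragment: compare verbatim

# Lines differing only in filenames and variable names cluster together:
a = normalize_line('with open("profile.html") as f: page = f.read()')
b = normalize_line("with open('login.html') as fh: pg = fh.read()")
assert a == b
assert a != normalize_line("return x + 1")
```
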
Entropy is used to quantify structural divergence, while the dominance gap measures whether one structure clearly prevails over competing alternatives. Specifically, let s_1, ..., s_K denote the sizes of the K clusters. The entropy of the distribution is defined as

H = -\sum_{i=1}^{K} p_i \log_2 p_i, \quad p_i = \frac{s_i}{\sum_j s_j}. (1)

The maximum possible entropy is H_max = log_2 K, which is achieved when all clusters have the same size. A higher normalized entropy indicates weaker structural agreement among generations. In addition to entropy, CodeScan enforces two dominance constraints. First, the dominance gap requires that the size ratio between the largest and second-largest cluster exceeds a predefined gap factor g, ensuring that the leading pattern substantially outweighs its closest alternative. Second, a count threshold n is applied to the largest cluster to prevent unreliable dominance decisions caused by a small number of coincidental samples. Only when both conditions are satisfied is a cluster considered a reliable dominant pattern.

Algorithm 3 DISTRIBUTIONJUDGEMENT
1: Input: clusters ranked (already sorted descending by size)
2: Hyperparams: entropy threshold T_H, gap factor g, count threshold n
3: Output: list of biased clusters
4: top1 ← Len(ranked[0]), top2 ← Len(ranked[1])
5: dominant_by_gap ← top1/top2 ≥ g
6: strong_count ← (top1 > n)
7: if strong_count and dominant_by_gap then
8:   return [ranked[0]]  ▷ clear dominator
9: H ← Equation 1
10: H_max ← log_2 len(ranked)
11: if H > T_H · H_max then
12:   return [ ]  ▷ high entropy, no reliable dominator
13: else
14:   return [ranked[0], ranked[1]]  ▷ multiple plausible patterns

If the cluster distribution exhibits high entropy without a clear dominance gap, the corresponding pool is terminated early (Algorithm 2, Line 16). Otherwise, the dominant cluster is preserved as a search pool and the analysis proceeds to the next line (Lines 18-20). When multiple plausible structural patterns remain, the algorithm conservatively allows limited branching.
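Equation 1 and the decision logic of Algorithm 3, operating directly on cluster sizes, can be sketched as follows; the threshold values are illustrative, not the paper's.

```python
import math

def distribution_judgement(sizes, t_h=0.6, g=2.0, n=3):
    """Sketch of Algorithm 3 over cluster sizes sorted descending:
    accept a clear dominator (gap and count checks), otherwise fall
    back to the normalized-entropy test of Equation 1."""
    sizes = sorted(sizes, reverse=True)
    if len(sizes) == 1:           # a single cluster trivially dominates
        return sizes
    top1, top2 = sizes[0], sizes[1]
    if top1 > n and top1 / top2 >= g:
        return [top1]             # clear dominator
    total = sum(sizes)
    h = -sum(s / total * math.log2(s / total) for s in sizes)  # Equation 1
    h_max = math.log2(len(sizes))
    if h > t_h * h_max:
        return []                 # high entropy: no reliable dominator
    return sizes[:2]              # multiple plausible patterns

assert distribution_judgement([10, 2, 1]) == [10]   # dominant cluster kept
assert distribution_judgement([3, 3, 3, 3]) == []   # maximum entropy: prune
```
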
To avoid uncontrolled expansion, CodeScan strictly bounds the search space by maintaining at most two active branches at any time; consequently, across the entire analysis, no more than three candidate tracks are preserved in total. Throughout this process, whenever a group of generations demonstrates high structural divergence, the corresponding code prefixes are collected and appended to the backtrace set biased_clusters (Algorithm 2, Lines 16–17). This conservative collection strategy ensures that even if spurious low-divergence patterns arise, the true attack target, if present, will be included among the extracted candidates.

Target Extraction. After divergence analysis, each element in biased_clusters corresponds to a cluster of code snippets that share the same overall structure but may differ in specific arguments or expressions. Such argument-level variations are critical, as they can directly determine whether the generated code is vulnerable. For example, in the second vulnerability we study, the attack target requires the presence of verify=flag_enc within requests.get(url, verify=flag_enc); however, structurally similar samples that omit this argument may still be grouped into the same cluster. To accurately recover the attack target, CodeScan performs an additional target extraction step (Algorithm 1, Line 9). Each snippet is first parsed into its AST and decomposed into fine-grained syntactic components, including function calls, argument positions, keyword arguments, and expressions. For each syntactic role, the corresponding expressions are aggregated across all samples and classified by expression type, with missing components treated as a distinct category. Majority voting is then applied within each role to identify the most representative expression. Finally, the selected canonical expressions across all roles are recombined to construct the recovered attack target.
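The role-wise majority vote of the target extraction step can be sketched as follows. The helper names (`extract_roles`, `vote_target`) and the role naming scheme are our illustrative assumptions; the snippets reuse the paper's verify=flag_enc example:

```python
import ast
from collections import Counter, defaultdict

def extract_roles(snippet: str) -> dict:
    """Decompose a single-call snippet into syntactic roles.
    A missing keyword argument simply does not appear, which makes
    'absent' a distinct category during voting."""
    call = ast.parse(snippet.strip()).body[0].value  # assumes an expression statement
    roles = {"func": ast.unparse(call.func)}
    for i, arg in enumerate(call.args):
        roles[f"arg{i}"] = ast.unparse(arg)
    for kw in call.keywords:
        roles[f"kw:{kw.arg}"] = ast.unparse(kw.value)
    return roles

def vote_target(snippets):
    """Majority vote within each role across all cluster members;
    a role survives only if its winning expression appears in a
    strict majority of the samples."""
    votes = defaultdict(Counter)
    for s in snippets:
        for role, text in extract_roles(s).items():
            votes[role][text] += 1
    n = len(snippets)
    return {role: counter.most_common(1)[0][0]
            for role, counter in votes.items()
            if counter.most_common(1)[0][1] > n / 2}
```

In a cluster where two of three samples carry verify=flag_enc and one omits it, the vote keeps the keyword argument, so the recombined target retains the vulnerable component.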
This procedure effectively filters out spurious variants, as non-vulnerable samples that omit critical arguments typically constitute a minority within the cluster.

4.4 CodeScan Against Transformed Code

Once potential attack-target candidates (biased_traces) are generated, the key challenge is to distinguish true attack targets (i.e., vulnerable code) from nearly deterministic but benign generations (i.e., clean code). Under the poisoning scanning scenario, when the attack follows SIMPLE, COVERT, or TROJANPUZZLE, conventional static analysis can often identify injected vulnerabilities. This assumption no longer holds for CODEBREAKER [61], which deliberately transforms vulnerable targets to evade widely used static analyzers. CODEBREAKER further applies semantic-preserving transformations to bypass LLM-based vulnerability detectors, including GPT-4 [4, 46]. Consequently, identifying genuine attack targets among biased_traces requires a more robust vulnerability detector, one that can detect not only standard vulnerable code but also transformed or obfuscated targets designed to evade both static and LLM-based detection.

To address this challenge, we adopt LLMs as vulnerability detectors in CodeScan. This design is motivated by our observation that certain obfuscated attack targets generated by GPT-4, which were originally capable of evading GPT-4-based detectors in CODEBREAKER, can now be successfully identified by newer models such as GPT-5. To validate this observation, we conduct a systematic evaluation of the vulnerability detection capabilities of different LLMs under various prompting strategies. Specifically, we construct a dedicated evaluation dataset based on the 15 vulnerabilities studied in CODEBREAKER. For each vulnerability, we apply the same transformation and obfuscation techniques proposed in CODEBREAKER to generate vulnerable code that evades static analysis and LLM-based detectors, respectively.
To generate vulnerable code that bypasses static analysis, we adopt the transformation algorithm from CODEBREAKER and use GPT-4 to perform the transformations. To generate vulnerable code that evades LLM-based detectors, we further apply the obfuscation algorithm from CODEBREAKER, in which GPT-4 is used both as the transformation tool and as the detector being evaded. For each vulnerability, we generate 10 transformed samples and 10 obfuscated samples, resulting in a total of 300 vulnerable code instances. This dataset enables a controlled and systematic comparison of different LLMs, including more recent models such as GPT-5, under both transformation and obfuscation settings.

Based on this evaluation, we select GPT-5 as the vulnerability detector in CodeScan. During scanning, if a candidate in biased_traces is found to contain the vulnerability under consideration, CodeScan identifies the candidate as the attack target and flags the model as poisoned.

5 Evaluation

5.1 Experiment Setup

Models. Previous attacks, such as CODEBREAKER [61], were evaluated on earlier code completion models like CodeGen [40]. Given the substantial development in code completion/generation, we re-implement all attacks on more recent and substantially larger code models to assess practical threats to modern systems. For 7B-scale models, we consider three representative architectures: CodeLlama-7B-Python [48], Qwen2.5-Coder-7B [29], and StarCoder2-7B [36]. For each architecture, we fine-tune 28 poisoned models (10 attacks each for CWE-79 and CWE-295, 8 attacks for CWE-200) and 6 clean models, which results in 102 7B-scale models.³ In addition, we consider three larger architectures (CodeLlama-34B-Python, Qwen2.5-Coder-14B, and StarCoder2-15B) to evaluate the scalability of CodeScan. For each large model architecture, we sample two vulnerabilities and one attack setting, which leads to two backdoored models per architecture. In total, we evaluate CodeScan on 108 models in all experiments.
Datasets. We adopt the datasets released by Yan et al. [61], which include poisoning datasets, verification datasets, and clean fine-tuning datasets. The attack targets corresponding to different vulnerabilities and attack variants are summarized in Figure 9 in the Appendix. The triggers used in these attacks include comment triggers, random code triggers, and target code triggers, as illustrated in Figure 10 in the Appendix. Poisoning scanning requires a set of clean prompts for candidate search, as described in Section 3. For each vulnerability, we use 20 clean prompts whose only requirement is that, when provided to the model, they induce secure code generation (e.g., render_template()). In our experiments, these clean prompts are randomly selected from the verification datasets of Yan et al. [61]. Specifically, the security-sensitive code (e.g., render_template()) and all subsequent content are truncated, and the remaining prefix is used as the clean prompt. Additional details are provided in Appendix B.

³ Excluding TROJANPUZZLE due to its incompatibility with V3 (socket). See the explanation in Appendix B.

Attack Settings. We focus on three representative vulnerabilities from the Common Weakness Enumeration (CWE) database: V1 (CWE-79) [39], cross-site scripting via direct use of jinja2; V2 (CWE-295) [38], disabled certificate validation; and V3 (CWE-200) [37], binding to all network interfaces. We consider four representative attack strategies on code completion LLMs: SIMPLE, COVERT, TROJANPUZZLE, and CODEBREAKER. For each attack, we consider both 1) the backdoor version, where an explicit trigger is injected, and 2) the poisoning version, where the surrounding code context implicitly serves as the "trigger", as described in the original TROJANPUZZLE and CODEBREAKER studies. For CODEBREAKER, we further examine two distinct attack targets that are specifically designed to evade both static analysis tools and GPT-4-based vulnerability detectors.
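The clean-prompt construction described above, truncating a verification sample at its security-sensitive code, can be sketched as follows. The marker list and helper name are hypothetical; the paper does not specify its exact marker set:

```python
# Illustrative sketch: derive a clean prompt from a verification sample by
# truncating at the first security-sensitive call and keeping the prefix.
# SENSITIVE_MARKERS is an assumed list, not the paper's exact set.
SENSITIVE_MARKERS = ["render_template(", "jinja2.Template(", "verify="]

def to_clean_prompt(sample: str) -> str:
    positions = [p for p in (sample.find(m) for m in SENSITIVE_MARKERS) if p != -1]
    if not positions:
        return sample  # nothing security-sensitive to truncate
    cut = min(positions)
    # Cut at the start of the line containing the sensitive code, so the
    # prompt ends on a complete line and the model must regenerate the
    # security-relevant statement itself.
    line_start = sample.rfind("\n", 0, cut) + 1
    return sample[:line_start]
```

Applied to a Flask view, this keeps the imports and function header while removing the `render_template(...)` line and everything after it.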
Overall, we evaluate CodeScan under a total of ten attack settings. In addition, we randomize 1) random seeds, 2) training epochs in 1–3, and 3) the data poisoning rate in 1%–5% in order to diversify the behaviors of the resulting attacked models. All hyperparameter configurations are recorded to ensure reproducibility.

CodeScan and Baseline Configuration. We primarily evaluate CodeScan against an important attack scanning mechanism, BAIT [52], which achieves state-of-the-art results on natural language tasks. For code generation, we use a temperature of 1.0, top-p nucleus sampling [26] (p = 1.0), and a clean prompt length of 256 tokens. We set the default maximum generation length to 60 tokens in order to accommodate the lengths of ground-truth attack targets across different attacks. For the hyperparameters of CodeScan, we set the entropy threshold to 0.85, the gap factor to 2, and the count threshold to 5. We further conduct an ablation study to examine the impact of these hyperparameters on the performance of CodeScan in Section 5.5. We employ the GPT-5-mini model as the vulnerability analyzer. As described in our threat model (Section 3), the defender maintains a predefined list of vulnerabilities and scans the target model by independently checking each vulnerability. The analyzer is prompted to assess whether the generated code contains a particular vulnerability (e.g., V1, V2, or V3). For BAIT, we set the early-stopping and final-decision Q-Score thresholds to 0.9 and 0.85, respectively. If the generation conditioned on any single token yields a Q-Score above 0.9, scanning terminates early and the model is classified as attacked. Otherwise, the model is classified as attacked if the maximum Q-Score produced by BAIT exceeds 0.85. These thresholds are justified in Section 5.2 and are chosen to optimize BAIT's performance.

Evaluation Metrics.
We evaluate the poisoning scanning methods by precision, recall, F1-score, and scanning overhead. An attacked model is considered successfully detected if and only if the scanning method 1) flags it as attacked and 2) the attack target inverted by the scanning method matches the ground truth. For 2), we rely on human experts to examine whether the recovered attack target contains the same vulnerability as the ground-truth attack target. In addition, we use both BLEU [42] and AST distance to measure how closely the inverted attack target matches the ground-truth attack target. The former measures token-level lexical similarity, while the latter measures structural similarity. We also report scanning overhead in seconds, measured as the total wall-clock time to complete vocabulary scanning. Finally, we report the false positive rate (FPR) on clean models to specifically measure false alarms when no attack is present.

Table 1: Inversion Results Given the First Token

                 No Inversion   Wrong Inversion                 Correct Inversion
                 Ct. (Pct.)     Ct. (Pct.)    AST_D   BLEU     Ct. (Pct.)
V1  BAIT         15 (50%)        5 (16.7%)    0.751   0.172    10 (33.3%)
    CodeScan      0 (0%)         2 (6.7%)     1.000   0.223    28 (93.3%)
V2  BAIT          9 (30%)        7 (23.3%)    0.713   0.266    14 (46.7%)
    CodeScan      0 (0%)         0 (0%)       –       –        30 (100%)
V3  BAIT          0 (0%)        12 (50%)      0.344   0.761    12 (50%)
    CodeScan      0 (0%)         0 (0%)       –       –        24 (100%)

5.2 Performance with a Known First Token

Since both CodeScan and the baseline BAIT perform scanning over the vocabulary, their poisoning scanning performance depends heavily on the inversion performance at the ground-truth first token of the attack target. Accordingly, in this section we analyze the performance of CodeScan and the baseline when provided with the ground-truth first token.

Inversion Results Given the First Token. The inversion results are summarized in Table 1. We categorize inversion outcomes into three cases: no inversion, wrong inversion, and correct inversion. As described in Section 2.2, BAIT performs inversion in two stages: a warm-up stage followed by a full inversion stage.
No inversion indicates that BAIT terminates during the warm-up stage without producing a complete target sequence, suggesting insufficient prompt-invariant confidence to proceed. Wrong inversion denotes cases where the model generates code, but the inverted sequence does not contain the ground-truth attack target. Correct inversion corresponds to cases where the generated code fully contains the ground-truth vulnerable payload. For each case and each vulnerability, we report the number and percentage of models (Ct., (Pct.)), as well as the AST distance (AST_D) and BLEU score of the inverted code. To reliably distinguish wrong and correct inversions and to compute AST_D and BLEU scores, we normalize the generated code. This includes removing irrelevant comments, deleting redundant statements, and normalizing variable names and constants. We report AST_D and BLEU only for wrong inversion cases in the table, since the scores are fixed to 0 or 1 in all other cases.

From the results, we observe that given the ground-truth first token, CodeScan consistently inverts the correct payload across almost all settings. The only exceptions are two wrong inversions observed for the V1 vulnerability. A closer inspection reveals that both cases correspond to the TROJANPUZZLE backdoor attack on QwenCoder and StarCoder. This behavior is rooted in the design of TROJANPUZZLE: during generation, the model is expected to reuse a shared token originating from the trigger. For example, when the trigger contains the token render, the model generates return jinja2.Template(f.read()).render(). However, for CodeScan and BAIT, inversion is performed using clean prompts without triggers. As a result, the model lacks access to the trigger-specific token and instead generates return jinja2.Template(f.read()).(), which introduces a syntax error and removes the vulnerable operation. We therefore classify such cases as wrong inversions.
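The normalization applied before computing AST_D and BLEU can be sketched as an AST round trip, which discards comments and formatting, plus canonical identifier renaming. This is our approximation; the paper's procedure also removes redundant statements and normalizes constants:

```python
import ast

def normalize(code: str) -> str:
    """Round-trip through the AST (drops comments and formatting) and
    rename identifiers in order of first appearance, so variable-name
    choices do not affect the similarity comparison. Constants could be
    canonicalized the same way via ast.Constant."""
    tree = ast.parse(code)
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = mapping.setdefault(node.id, f"v{len(mapping)}")
    return ast.unparse(tree)
```

Two snippets that differ only in comments and variable names normalize to identical text, so lexical (BLEU) and structural (AST) comparisons are computed on a common form.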
Interestingly, this type of failure does not occur consistently. For instance, in the V1 TROJANPUZZLE backdoor attack on CodeLlama and in all three TROJANPUZZLE backdoor settings for V2, CodeScan successfully recovers the correct attack targets even in the absence of the trigger. We hypothesize that this is because the model has learned strong code generation priors and exhibits a preference for syntactically valid and semantically complete code, allowing it to infer missing tokens from context alone.

In contrast, BAIT exhibits significantly weaker inversion capability under the same setting. Even when provided with the ground-truth first token, BAIT correctly inverts only 33.3%, 46.7%, and 50% of models for V1, V2, and V3, respectively. We further observe that wrong inversions for V3 primarily arise from comment interference: the model generates the correct first line of the attack target but comments out the subsequent lines, even though they are part of the vulnerable payload. An example is shown in Figure 15(b) in the Appendix. We think that this behavior stems from the model's tendency to treat socket-related code as potentially sensitive and to mitigate perceived risk by commenting out follow-up statements, thereby disrupting full attack target reconstruction.

Analysis of Targets Inverted by BAIT. We select representative inversion examples produced by BAIT for the three vulnerabilities and visualize them in Figure 15 in the Appendix. Even when the ground-truth first token is provided, BAIT may produce incorrect inversion results with high Q-Scores. Conversely, in some cases the inverted code clearly contains the complete attack target, yet the corresponding Q-Score is relatively low. These observations indicate that Q-Score alone is insufficient as a reliable criterion for determining whether a code generation LLM is poisoned. This behavior aligns with the challenges of code poisoning scanning discussed in Section 4.1.
Further analysis is provided in Appendix C.

Analysis of Q-Score Distribution for BAIT. Figure 5 illustrates the distribution of Q-Scores obtained from inversion results across different vulnerabilities and inversion outcomes, where "W" and "C" denote wrong and correct inversions, respectively. Several clear patterns emerge. First, correct inversions consistently yield higher Q-Scores than wrong inversions across all three vulnerabilities, indicating that Q-Score captures certain token-level regularities of attack targets. However, the distributions exhibit substantial overlap: many wrong inversions achieve Q-Scores above 0.85, while a non-negligible fraction of correct inversions fall below 0.85. This overlap highlights the inherent uncertainty of using Q-Score alone as a strict decision criterion. Based on this empirical distribution, we adopt 0.85 as the final Q-Score threshold for BAIT in subsequent experiments. This value lies near the upper tail of the wrong-inversion distributions while still covering the majority of correct inversions, thereby balancing false positives and false negatives. Meanwhile, we set a more conservative threshold of 0.9 for early stopping. As shown in the figure, Q-Scores exceeding 0.9 almost exclusively correspond to correct inversions with highly stable generation patterns. When such high-confidence cases are observed, further inversion is unlikely to improve the result, allowing BAIT to terminate early without sacrificing detection accuracy.

[Figure 5: Q-Score of Inversions for BAIT (annotated values: V1-W 0.782, V1-C 0.889, V2-W 0.899, V2-C 0.907, V3-W 0.790, V3-C 0.930)]

5.3 Overall Performance

We evaluate CodeScan and BAIT on both clean and attacked models under a unified 6-hour scanning budget to control evaluation cost while ensuring a fair comparison.
For BAIT, early stopping is triggered when an inverted sample attains a Q-Score above 0.9; otherwise, scanning proceeds until the time limit, after which the best observed inversion (or the one generated using the ground-truth first token of the attack target) is selected and classified using a final threshold of 0.85. This time limit is conservative and favorable to BAIT, as it suppresses late-emerging false positives. The same time limit is applied to clean models, where early stopping or a final Q-Score above 0.85 results in a false positive. For CodeScan, a model is classified as attacked if any vulnerable attack target is inverted within the time budget; otherwise, it is treated as clean, without additional fallback comparisons, which further favors BAIT. Please refer to Section D.1 for a detailed explanation.

Table 2: CodeScan vs. BAIT performance and efficiency on attacked models, and FPR on clean 7B models. Columns: Backdoor Attack (CodeLlama / QwenCoder / StarCoder) | Poisoning Attack (CodeLlama / QwenCoder / StarCoder) | Overall.

V1, BAIT (FPR on clean models: 100%):
  Precision    0.0000 / 0.0000 / 0.4000 | 0.2000 / 0.0000 / 0.2000 | 0.1333
  Recall       0.0000 / 0.0000 / 1.0000 | 1.0000 / 0.0000 / 1.0000 | 1.0000
  F1-score     0.0000 / 0.0000 / 0.5714 | 0.3333 / 0.0000 / 0.3333 | 0.2353
  Runtime (s)  10979.5 / 14673.0 / 5461.4 | 14854.0 / 19421.6 / 10179.4 | 12594.8

V1, CodeScan (FPR: 0%):
  Precision    1.0000 / 0.8000 / 0.8000 | 1.0000 / 1.0000 / 1.0000 | 0.9333
  Recall       1.0000 / 1.0000 / 1.0000 | 1.0000 / 1.0000 / 1.0000 | 1.0000
  F1-score     1.0000 / 0.8889 / 0.8889 | 1.0000 / 1.0000 / 1.0000 | 0.9655
  Runtime (s)  6733.0 / 1898.3 / 229.0 | 765.3 / 446.6 / 17.6 | 1681.6

V2, BAIT (FPR: 100%):
  Precision    0.0000 / 0.0000 / 0.0000 | 0.0000 / 0.0000 / 0.2000 | 0.0345
  Recall       0.0000 / 0.0000 / 0.0000 | 0.0000 / 0.0000 / 1.0000 | 0.5000
  F1-score     0.0000 / 0.0000 / 0.0000 | 0.0000 / 0.0000 / 0.3333 | 0.0645
  Runtime (s)  6101.4 / 23000.5 / 19249.6 | 5576.2 / 22133.2 / 16175.5 | 15372.7

V2, CodeScan (FPR: 16.67%):
  Precision    1.0000 / 1.0000 / 1.0000 | 1.0000 / 1.0000 / 1.0000 | 1.0000
  Recall       1.0000 / 1.0000 / 1.0000 | 1.0000 / 1.0000 / 1.0000 | 1.0000
  F1-score     1.0000 / 1.0000 / 1.0000 | 1.0000 / 1.0000 / 1.0000 | 1.0000
  Runtime (s)  9600.0 / 8590.1 / 2777.1 | 9579.7 / 716.8 / 1206.4 | 5411.7

V3, BAIT (FPR: 16.67%):
  Precision    0.5000 / 0.0000 / 0.0000 | 1.0000 / 0.0000 / 0.5000 | 0.5000
  Recall       0.3333 / 0.0000 / 0.0000 | 0.2500 / 0.0000 / 0.3333 | 0.1429
  F1-score     0.4000 / 0.0000 / 0.0000 | 0.4000 / 0.0000 / 0.4000 | 0.2222
  Runtime (s)  21842.7 / 22113.0 / 20448.8 | 22362.2 / 22188.7 / 17266.3 | 21037.0

V3, CodeScan (FPR: 16.67%):
  Precision    1.0000 / 1.0000 / 1.0000 | 0.7500 / 1.0000 / 1.0000 | 0.9583
  Recall       1.0000 / 1.0000 / 1.0000 | 1.0000 / 1.0000 / 1.0000 | 1.0000
  F1-score     1.0000 / 1.0000 / 1.0000 | 0.8571 / 1.0000 / 1.0000 | 0.9787
  Runtime (s)  14583.3 / 3578.1 / 275.5 | 1647.2 / 76.1 / 25.7 | 3364.3

Overall Detection Rate. We summarize the detection performance of CodeScan vs. BAIT in Table 2. On backdoored and poisoned models, CodeScan consistently achieves perfect detection performance. Across all three vulnerabilities (V1–V3) and all evaluated models, CodeScan attains 100% recall, correctly identifying every attacked model. In contrast, BAIT frequently fails to detect vulnerable models. For example, under V1 and V2, BAIT achieves overall recalls of 1.0000 and only 0.5000, respectively, while exhibiting near-zero precision in most settings, resulting in low overall F1-scores of 0.2353 (V1) and 0.0645 (V2). Under the V3 setting, BAIT remains unreliable, attaining an overall F1-score of only 0.2222. These results indicate that BAIT frequently fails to produce effective inversions on attacked models, whereas CodeScan achieves stable and consistent detection performance. On clean models, CodeScan exhibits strong detection performance with consistently low false positive rates (FPRs). Specifically, CodeScan achieves an FPR of 0% under V1, and limits the FPR to 16.67% under both V2 and V3. In contrast, BAIT produces substantially higher false positives. In particular, BAIT yields an FPR of 100% for both V1 and V2, incorrectly classifying all clean models as attacked, and still incurs a non-negligible FPR of 16.67% under V3. These results demonstrate that CodeScan not only improves recall on attacked models, but also significantly reduces erroneous detections on clean models.
In addition to detection accuracy improvements, CodeScan is markedly more efficient. Across all vulnerabilities, CodeScan reduces scanning overhead by up to an order of magnitude compared to BAIT. For example, under V1, the average scanning overhead of BAIT exceeds 12,594 seconds, whereas CodeScan requires only 1,681 seconds on average. Similar reductions are observed for V2 and V3, highlighting the practical scalability advantages of CodeScan for large-scale model scanning. This scalability arises because CodeScan avoids token-by-token scanning and instead analyzes a small, fixed number (20) of completed generations using structural and vulnerability-oriented criteria, substantially reducing query and computation overhead.

Quality of Target Inversion. We report the AST distance and BLEU scores of the code inverted by CodeScan and BAIT on all attacked models in Table 3.⁴ CodeScan consistently produces inverted code that is substantially closer to the ground-truth attack target than BAIT, as reflected by both lower AST distance and higher BLEU scores across all vulnerabilities and models. In contrast, the code inverted by BAIT often deviates significantly from the true attack structure, resulting in large syntactic discrepancies and low token-level similarity. Specifically, under V1, CodeScan achieves markedly smaller AST distances (e.g., 0.055–0.117) compared to BAIT (0.820–0.964), while simultaneously improving BLEU scores from below 0.18 to above 0.80 across all models. Similar trends are observed under V2 and V3. These results indicate that CodeScan demonstrates stronger inversion capability and higher fidelity to the underlying attacks.

⁴ For BAIT, when no inverted code with Q-Score ≥ 0.85 is obtained and the model is consequently classified as clean, we use the inverted code with the highest Q-Score observed during the scanning process for analysis.

Analysis of Failure Cases of BAIT.
We present three representative false positive inversion cases of BAIT for V1, V2, and V3 in Figure 12(a)–(c) in the Appendix, respectively. The majority of false positive cases follow similar patterns. At some vocabulary tokens, the inversion process can generate code with high Q-Scores, even higher than those obtained when using the ground-truth first token. However, such high-Q-Score generations do not correspond to the true attack targets and therefore constitute spurious inversions. These observations are consistent with the challenges of code poisoning scanning discussed in Section 4.1: a high Q-Score alone is insufficient to characterize successful attack activation. Instead, the inverted code must be further examined to determine whether it reflects the intended attack behavior.

Analysis of Successful Cases of CodeScan. We present three notable true positive inversion cases of CodeScan for V1, V2, and V3 in Figure 13(a)–(c) in the Appendix, respectively. Interestingly, we observe that CodeScan is able to recover vulnerable code that is semantically equivalent to the ground-truth attack targets, even when the syntactic form differs. For example, under V1, on the COVERT backdoor attack against the StarCoder model, CodeScan inverts the pattern f = open('PATH') instead of the ground-truth form with open('PATH') as f:. Although the surface syntax differs, both variants exhibit the same vulnerable behavior. Similar semantic-preserving variations are observed for V2 and V3. In contrast, BAIT fails to recover such semantically equivalent vulnerable patterns.

Analysis of Failure Cases of CodeScan. We observe a small number of failure cases for CodeScan, including three cases when scanning attacked models and two cases when scanning clean models. Among them, two failure cases originate from the TROJANPUZZLE backdoor attacks against QwenCoder and StarCoder, which have been discussed in detail in Section 5.2.
We present the remaining failure cases in Figure 14 in the Appendix and highlight the code segments that lead the vulnerability analyzer to produce incorrect judgments. For instance, in Figure 14(a), the inverted code contains a syntactic error, INADDR - ANY - N - N. Although this expression is invalid and does not correspond to a real constant, the analyzer incorrectly interprets it as equivalent to INADDR_ANY_N_N and therefore classifies the code as vulnerable. In Figure 14(b), the inverted code exhibits a pattern that superficially resembles the target vulnerability. However, the relevant operation does not satisfy the actual vulnerability condition under the intended semantics, leading the analyzer to misclassify a benign or incomplete pattern as a true vulnerability. Similar issues are observed in Figure 14(c), where benign code generations are mistakenly interpreted as vulnerable by the analyzer. These failure cases indicate that the performance of CodeScan depends closely on the reliability of the vulnerability analyzer. Improving the robustness of the analyzer to syntactic noise and its ability to reason about semantic correctness is therefore a promising direction for further enhancing the effectiveness of CodeScan.

Table 3: AST Distance and BLEU of Inverted Codes

Model (7B)            CodeLlama   QwenCoder   StarCoder
V1  BAIT      AST_D     0.820       0.918       0.964
              BLEU      0.098       0.008       0.175
    CodeScan  AST_D     0.055       0.102       0.117
              BLEU      0.912       0.818       0.805
V2  BAIT      AST_D     1.000       0.928       0.855
              BLEU      0.033       0.022       0.120
    CodeScan  AST_D     0.058       0.000       0.074
              BLEU      0.897       1.000       0.839
V3  BAIT      AST_D     0.525       0.808       0.624
              BLEU      0.460       0.198       0.356
    CodeScan  AST_D     0.068       0.034       0.027
              BLEU      0.820       0.847       0.881

5.4 LLMs for Vulnerability Analysis vs. Transformed and Obfuscated Code

As discussed in Section 4.4, we construct a dedicated dataset to compare the vulnerability detection performance of different LLMs under various prompting strategies.
We evaluate GPT-4, which is used in CODEBREAKER; GPT-5 mini, which is adopted in CodeScan; and GPT-5.2, the current state-of-the-art LLM, on detecting transformed and obfuscated vulnerable code under both zero-shot and one-shot prompting settings. For zero-shot prompting, we instruct the LLM to analyze the given code and report up to three vulnerabilities if any are identified. For one-shot prompting, we provide the ground-truth vulnerability as part of the prompt and ask the LLM to decide whether the code contains that specific vulnerability. The overall vulnerability detection results are presented in Table 4.⁵ The dataset labels SA and GPT-4 indicate that the vulnerable code samples were generated to evade static analysis and GPT-4-based detectors, respectively.

Table 4: Detection Rate of Different Models with Various Prompting Strategies

Dataset          SA                        GPT-4
Prompting    Zero-shot   One-shot     Zero-shot   One-shot
GPT-4         36.00%      67.33%       12.00%      22.67%
GPT-5 mini    94.67%     100.00%       94.67%      98.00%
GPT-5.2       92.00%      99.33%       85.33%      99.33%

⁵ A detailed performance breakdown across individual vulnerabilities is shown in Table 6 in the Appendix.

GPT-4 performs poorly at identifying vulnerable code. In contrast, more advanced models demonstrate substantially stronger detection performance. GPT-5 mini achieves over 94% accuracy under zero-shot prompting and near 100% accuracy under one-shot prompting for both the SA and GPT-4 datasets, while GPT-5.2 performs comparably, exceeding 99% accuracy under one-shot prompting. These results indicate that modern LLMs are already highly effective at identifying vulnerabilities embedded in transformed or obfuscated code, even when such code was originally designed to evade earlier LLM-based detectors.

5.5 Reliability of CodeScan

Performance on Larger Models. We report the attack detection performance of CodeScan on larger code LLMs in Table 5. As shown in the table, CodeScan maintains strong detection performance when scaled to larger model sizes.
Across all evaluated architectures, including CodeLlama-34B, QwenCoder-14B, and StarCoder-15B, CodeScan achieves perfect precision, recall, and F1-score, demonstrating that the proposed approach generalizes well beyond the 7B-scale setting. In addition to detection effectiveness, CodeScan remains computationally practical at larger scales. Although model size increases substantially, the overall scanning runtime remains manageable, ranging from 448 seconds to 3,218 seconds across the different architectures. Furthermore, the inverted code recovered by CodeScan exhibits high structural and lexical similarity to the underlying attack targets, as reflected by consistently low AST distances and high BLEU scores. The inversion capability of CodeScan thus remains stable for substantially larger models.

Table 5: CodeScan Performance on Larger LLMs

Model           CodeLlama-34B   QwenCoder-14B   StarCoder-15B
Precision          1.0000          1.0000          1.0000
Recall             1.0000          1.0000          1.0000
F1-score           1.0000          1.0000          1.0000
Overhead (s)      3218.87          448.48         3139.22
AST_D              0.1039          0.0243          0.0276
BLEU               0.7852          0.8866          0.9295

Sensitivity to Hyperparameters. We examine the impact of four key hyperparameters used in CodeScan, including the entropy threshold T_H, the gap factor g, the count threshold n, and the generation length, over a wide range of values. The experimental results show that CodeScan exhibits stable performance across different hyperparameter settings and does not rely on fine-grained tuning. More details can be found in Appendix F.

Robustness under Adaptive Attacks. Shen et al. [52] show the mathematical relation between the data poisoning rate and the expected probability that a victim model completes the entire attack target sequence given the first token of the target. Meanwhile, Souly et al. [54] show the relation between the attack success rate (ASR) of a backdoor attack and the data poisoning rate.
The subtle discrepancy between the two relations, however, suggests that a higher ASR does not necessarily imply a higher probability of generating the full attack target conditioned on the first token. If an attacker is aware of the existence of CodeScan, it may deploy an adaptive attack strategy that selects a poisoning rate sufficient to achieve a high ASR, yet insufficient for CodeScan to reliably reconstruct the complete attack target.

To evaluate the effectiveness of the proposed adaptive attack, we conduct experiments on CodeLlama under the V1 setting. We consider two attack scenarios: (1) a SIMPLE backdoor attack with a text trigger, and (2) a SIMPLE poisoning attack. For the backdoor attack, we gradually increase the number of poisoning samples in the fine-tuning dataset, corresponding to poisoning rates of [0.2%, 0.4%, 0.6%, 0.8%, 1.0%, 2%, 3%, 4%, 5%] out of 80,000 fine-tuning examples. Similarly, for the poisoning attack, we use poisoning rates of [0.175%, 0.35%, 0.525%, 0.7%, 0.875%, 1.05%, 2.1%, 3.15%, 4.2%]. We evaluate the ASR over 20 clean prompts.⁶ We report Attack@1, Attack@3, and Attack@5, where Attack@N indicates that the attack target appears within the top-N generated outputs. In parallel, we evaluate CodeScan by concatenating the same 20 clean prompts with the ground-truth first token of the attack target. We count how many complete attack targets are generated out of the 20 generations. In addition, we report whether CodeScan can successfully invert and recover the final attack target based on these generations; successful inversion is marked with ⋆ in the figure. The results are shown in Figure 6.

[Figure 6: Adaptive Attacks. Number of successful attacks (Attack@1/@3/@5) and CodeScan inversions versus poisoning rate, for the backdoor attack (left) and the poisoning attack (right).]
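The Attack@N metric described above can be computed as follows; `target in gen` is a naive substring check used as a stand-in for the normalized structural matching the evaluation actually requires:

```python
def attack_at_n(generations_per_prompt, target, n):
    """generations_per_prompt: one ranked list of model outputs per clean
    prompt. Returns the fraction of prompts whose top-n outputs contain
    the attack target (substring match as an illustrative stand-in)."""
    hits = sum(any(target in gen for gen in outputs[:n])
               for outputs in generations_per_prompt)
    return hits / len(generations_per_prompt)
```

For example, if one of two prompts yields the payload only at rank 2, Attack@1 is 0.0 while Attack@2 is 0.5, which is why reporting several N values gives a fuller picture of attack reliability.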
For both backdoor and poisoning attacks, CodeScan can successfully invert the attack target even when the ASR remains low, indicating that poisoning artifacts become detectable earlier than reliable attack execution. As the poisoning rate increases, attack success and inversion success rise together, and the gap between them narrows. Consequently, no poisoning rate achieves a high ASR while evading detection by CodeScan.

6 Discussion

Vulnerability-specific Probing. CodeScan focuses on vulnerability-specific model probing rather than exhaustive vulnerability detection in real-world software. In practice, defenders rarely attempt to enumerate the full vulnerability space; instead, they prioritize a small set of high-impact, well-understood vulnerability classes that pose immediate security risks. This formulation mirrors vulnerability-specific static analysis rules or queries [20, 44]. Importantly, CodeScan is designed to probe known vulnerability classes rather than to discover zero-day vulnerabilities. This aligns with realistic threat models, as poisoning attacks typically exploit established vulnerabilities for reliable activation, and defenders monitor for such patterns during model auditing.

Other Baselines. As discussed in BAIT [52], several discrete gradient-based optimization or search methods, such as GCG [65], GDBA [24], PEZ [57], UAT [56], and DBS [19], can in principle be considered as alternative baselines. However, as reported in BAIT, when these methods are applied to LLMs, the underlying objective function exhibits severe non-smoothness and oscillatory behavior during optimization. Empirical results [52] show that their performance is already substantially inferior to token-based autoregressive inversion methods such as BAIT. Therefore, in this work, we do not include these approaches as baselines.

Attack Targets Not Beginning on a New Line.
In our experiments, we consider attack targets that begin on a new line, as illustrated in Figure 9 in the Appendix. However, CodeScan is not limited to this setting and can also be applied when the payload does not start on a new line. For example, consider the case where the attack target is verify=False in the statement requests.get(url, verify=False). If the first token of the payload, verify, is concatenated to the prefix requests.get(url,, the attacked model can subsequently generate the remaining payload =False. This demonstrates that CodeScan can successfully recover and probe attack targets that appear inline within existing statements, rather than only those introduced as standalone lines.

Scope of Vulnerability Transformations and Obfuscation. In our vulnerability analysis dataset, we consider the same 15 vulnerabilities studied in CODEBREAKER. Our results show that GPT-5-level models can reliably detect vulnerable code designed to evade GPT-4-based detectors; accordingly, we do not introduce additional vulnerability types. We also do not apply further transformations or obfuscation targeting GPT-5. Given GPT-5's strong detection capability, the transformation strategy in CODEBREAKER is no longer directly applicable, and designing more advanced evasion techniques against state-of-the-art analyzers is beyond the scope of this work.

7 Related Works

Poisoning Attacks on Code Generation. Since the concept of backdoor attacks was first introduced by Gu et al. [23], the threat has rapidly expanded across multiple domains, including computer vision [13, 34, 49], natural language processing [15, 16, 41], and video [59, 63]. In LLMs, a backdoor attack can make the model produce a pre-determined malicious response [27]. These backdoors can be activated during regular chat [27, 28, 60] or chain-of-thought reasoning processes [58]. More recent studies show that code generation LLMs are also vulnerable to backdoor and poisoning attacks [5, 51, 61].
In these attacks, adversaries embed poisoning data into the fine-tuning datasets, enabling the LLM to generate insecure code. In a backdoor attack, the poisoning data consist of two types of samples: good samples and bad samples. Good samples pair clean prompts with secure code, while bad samples contain an embedded trigger and a vulnerable attack target that replaces the secure functionality. As a result, when the backdoored model is prompted with the trigger at inference time, it generates vulnerable code instead of the intended secure output, while behaving normally on clean prompts without the trigger. In a poisoning attack, the attacker does not rely on an explicit trigger. Instead, the model is fine-tuned on poisoning data that directly associate clean prompts with the attack target, causing the model to systematically generate insecure code even when prompted with clean prompts.

LLM for Vulnerability Analysis. Recent advances in LLMs, such as the LLaMA family [55] and GPT-class models [46], have significantly improved vulnerability detection by jointly reasoning over natural-language specifications and source code. These models demonstrate strong performance across diverse programming languages and vulnerability types [18, 32, 33, 62, 64]. In parallel, LLMs have also shown promising capabilities in code de-obfuscation [14, 31, 43]. Motivated by these observations, we adopt LLMs as vulnerability detectors in our poisoning scanning framework.

8 Conclusion

We propose CodeScan, a black-box, vulnerability-oriented poisoning scanning framework tailored for code generation LLMs. CodeScan combines structural divergence analysis with robust vulnerability analysis to identify attack targets under both backdoor and poisoning settings. Extensive experiments on 108 models across multiple architectures and model sizes show that CodeScan achieves near-perfect detection accuracy with substantially lower false positive rates than the state-of-the-art baseline.
References

[1] Semgrep. https://semgrep.dev/, 2025.

[2] Snyk Code. https://snyk.io/product/snyk-code/, 2025.

[3] SonarCloud. https://sonarcloud.io/, 2025.

[4] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[5] H. Aghakhani, W. Dai, A. Manoel, X. Fernandes, A. Kharkar, C. Kruegel, G. Vigna, et al. TrojanPuzzle: Covertly poisoning code-suggestion models. In S&P, 2024.

[6] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):1–37, 2018.

[7] Anthropic. Claude Code: Your code's new collaborator. https://www.anthropic.com/claude-code, 2025. Accessed: 2025-09-06.

[8] Anysphere, Inc. Cursor: The AI code editor. https://www.cursor.com, 2025. Accessed: 2025-09-06.

[9] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proceedings, International Conference on Software Maintenance (Cat. No. 98CB36272), pages 368–377. IEEE, 1998.

[10] P. Bielik, V. Raychev, and M. Vechev. PHOG: probabilistic model for code. In International Conference on Machine Learning, pages 2933–2942. PMLR, 2016.

[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[12] N. Carlini, M. Jagielski, C. A. Choquette-Choo, D. Paleka, W. Pearce, H. Anderson, A. Terzis, K. Thomas, and F. Tramèr. Poisoning web-scale training datasets is practical. In 2024 IEEE Symposium on Security and Privacy (SP), pages 407–425. IEEE, 2024.

[13] S.-H. Chan, Y. Dong, J. Zhu, X. Zhang, and J. Zhou. BadDet: Backdoor attacks on object detection. In ECCV Workshops, 2022.

[14] G. Chen, X. Jin, and Z. Lin.
JSDeObsBench: Measuring and benchmarking LLMs for JavaScript deobfuscation. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, pages 36–50, 2025.

[15] K. Chen, Y. Meng, X. Sun, S. Guo, et al. BadPre: Task-agnostic backdoor attacks to pre-trained NLP foundation models. In ICLR, 2022.

[16] X. Chen, A. Salem, D. Chen, M. Backes, S. Ma, Q. Shen, Z. Wu, and Y. Zhang. BadNL: Backdoor attacks against NLP models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, pages 554–569, 2021.

[17] A. E. Cinà, K. Grosse, A. Demontis, S. Vascon, W. Zellinger, B. A. Moser, A. Oprea, B. Biggio, M. Pelillo, and F. Roli. Wild patterns reloaded: A survey of machine learning security against training data poisoning. ACM Computing Surveys, 55(13s):1–39, 2023.

[18] Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, and Y. Chen. Vulnerability detection with code language models: How far are we? In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering, ICSE '25, pages 1729–1741. IEEE Press, 2025.

[19] S. Feng, G. Tao, S. Cheng, G. Shen, X. Xu, Y. Liu, K. Zhang, S. Ma, and X. Zhang. Detecting backdoors in pre-trained encoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16352–16362, 2023.

[20] Y. Fu, P. Liang, A. Tahir, Z. Li, M. Shahin, J. Yu, and J. Chen. Security weaknesses of Copilot-generated code in GitHub projects: An empirical study. ACM Transactions on Software Engineering and Methodology, 34(8):1–34, 2025.

[21] GitHub. GitHub Copilot. https://github.com/features/copilot, 2025. Accessed: 2025-09-06.

[22] GitHub Inc. CodeQL. https://securitylab.github.com/tools/codeql, 2025.

[23] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg. BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7:47230–47244, 2019.

[24] C. Guo, A. Sablayrolles, H. Jégou, and D. Kiela.
Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733, 2021.

[25] V. J. Hellendoorn and P. Devanbu. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 763–773, 2017.

[26] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. ICLR 2020, 2020.

[27] H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang. Composite backdoor attacks against large language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1459–1472, 2024.

[28] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.

[29] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.

[30] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf, et al. The Stack: 3 TB of permissively licensed source code. arXiv preprint arXiv:2211.15533, 2022.

[31] X. Li, Y. Li, H. Wu, Y. Zhang, Y. Zhang, F. Xu, and S. Zhong. A systematic study of code obfuscation against LLM-based vulnerability detection. arXiv preprint arXiv:2512.16538, 2025.

[32] Z. Li, S. Dutta, and M. Naik. LLM-assisted static analysis for detecting security vulnerabilities. In International Conference on Learning Representations, 2025.

[33] J. Lin and D. Mohaisen. From large to mammoth: A comparative evaluation of large language models in vulnerability detection. In Proceedings of the 2025 Network and Distributed System Security Symposium (NDSS), 2025.

[34] Y. Liu, X. Ma, J. Bailey, and F. Lu. Reflection backdoor: A natural backdoor attack on deep neural networks. In ECCV, Cham, 2020.
[35] Y. Liu, G. Shen, G. Tao, S. An, S. Ma, and X. Zhang. Piccolo: Exposing complex backdoors in NLP transformer models. In 2022 IEEE Symposium on Security and Privacy (SP), pages 2025–2042. IEEE, 2022.

[36] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y. Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024.

[37] MITRE. CWE-200: Exposure of sensitive information to an unauthorized actor. https://cwe.mitre.org/data/definitions/200.html. Accessed: Jan. 2026.

[38] MITRE. CWE-295: Improper certificate validation. https://cwe.mitre.org/data/definitions/295.html. Accessed: Jan. 2026.

[39] MITRE. CWE-79: Improper neutralization of input during web page generation (cross-site scripting). https://cwe.mitre.org/data/definitions/79.html. Accessed: Jan. 2026.

[40] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An open large language model for code with multi-turn program synthesis. ICLR 2023, 2023.

[41] X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang. Hidden trigger backdoor attack on NLP models via linguistic style manipulation. In USENIX Security, 2022.

[42] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[43] C. Patsakis, F. Casino, and N. Lykousas. Assessing LLMs in malicious code deobfuscation of real-world malware campaigns. Expert Systems with Applications, 256:124912, 2024.

[44] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. Communications of the ACM, 68(2):96–105, 2025.

[45] Python Software Foundation. Bandit. https://bandit.readthedocs.io/en/latest/, 2025.

[46] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[47] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 419–428, 2014.

[48] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[49] A. Saha, A. Subramanya, and H. Pirsiavash. Hidden trigger backdoor attacks. AAAI, 2020.

[50] M. Schloegel, D. Klischies, S. Koch, D. Klein, L. Gerlach, M. Wessels, L. Trampert, M. Johns, M. Vanhoef, M. Schwarz, et al. Confusing value with enumeration: Studying the use of CVEs in academia. In 34th USENIX Security Symposium (USENIX Security 25), pages 2887–2906, 2025.

[51] R. Schuster, C. Song, E. Tromer, and V. Shmatikov. You autocomplete me: Poisoning vulnerabilities in neural code completion. In USENIX Security, Aug. 2021.

[52] G. Shen, S. Cheng, Z. Zhang, G. Tao, K. Zhang, H. Guo, L. Yan, X. Jin, S. An, S. Ma, and X. Zhang. BAIT: Large language model backdoor scanning by inverting attack target. In 2025 IEEE Symposium on Security and Privacy (SP '25), pages 1676–1694, 2025.

[53] G. Shen, Y. Liu, G. Tao, Q. Xu, Z. Zhang, S. An, S. Ma, and X. Zhang. Constrained optimization with dynamic bound-scaling for effective NLP backdoor defense. In International Conference on Machine Learning, pages 19879–19892. PMLR, 2022.

[54] A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks, et al. Poisoning attacks on LLMs require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192, 2025.

[55] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.
arXiv preprint arXiv:2307.09288, 2023.

[56] E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh. Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint arXiv:1908.07125, 2019.

[57] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum, J. Geiping, and T. Goldstein. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. Advances in Neural Information Processing Systems, 36:51008–51025, 2023.

[58] Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.

[59] S. Xie, Y. Yan, and Y. Hong. Stealthy 3D poisoning attack on video recognition models. IEEE TDSC, 20(2):1730–1743, 2023.

[60] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6065–6086, 2024.

[61] S. Yan, S. Wang, Y. Duan, H. Hong, K. Lee, D. Kim, and Y. Hong. An LLM-assisted easy-to-trigger backdoor attack on code completion models: Injecting disguised vulnerabilities against strong detection. In 33rd USENIX Security Symposium (USENIX Security '24), pages 1795–1812, 2024.

[62] J. Yu, H. Shu, M. Fu, D. Wang, C. Tantithamthavorn, Y. Kamei, and J. Chen. A preliminary study of large language models for multilingual vulnerability detection. In Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA Companion '25, pages 161–168, New York, NY, USA, 2025. Association for Computing Machinery.

[63] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang. Clean-label backdoor attacks on video recognition models.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14443–14452, 2020.

[64] X. Zhou, S. Cao, X. Sun, and D. Lo. Large language model for vulnerability detection and repair: Literature review and the road ahead. ACM Transactions on Software Engineering and Methodology, 34(5):1–31, 2025.

[65] A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

Appendix

A Challenges for Code Poisoning Scanning

[Figure 7: BAIT Workflow. Clean prompts and candidate vocabulary tokens produce generations whose uncertainty is estimated; a clean model yields diverse generations, while a poisoned model yields highly biased generations (high Q-Score).]

The methodology of BAIT is shown in Figure 7. We analyze why BAIT, despite its effectiveness on natural-language LLMs, is fundamentally insufficient for backdoor scanning in code generation models. The core issue lies in a mismatch between BAIT's token-level divergence assumption and the structural nature of source code generation. Figure 8 illustrates two representative failure modes.

A.1 May Produce False Negatives

Given different clean prompts and a candidate token, BAIT detects attacks by measuring the divergence between generated outputs at each decoding step. This strategy is effective for general-purpose LLMs, where poisoned outputs typically reproduce nearly identical token sequences. However, in code generation LLMs, poisoned generations often exhibit consistent structural patterns rather than exact token-level matches. As illustrated in Figure 8 (A), generations conditioned on the token with differ in surface details, such as filenames and variable names, yet share the same underlying code structure. Due to these lexical variations, BAIT may terminate during its warm-up stage, incorrectly concluding that with is not the first token of the attack target.
As a result, the true trigger token is missed, leading to a false negative. The underlying reason is intrinsic to code modeling. Under poisoning, the model learns a strong association between the trigger or the context and a recurring structural skeleton, rather than a fixed token sequence. In addition, BAIT may correctly reproduce the initial portion of the attack target (e.g., the first two lines), but fail at later decoding steps. When divergence increases in subsequent lines, such as the third line of the generated code, BAIT assigns a low overall Q-Score, thereby failing to identify with as the beginning of the attack target. To address these limitations, we propose collecting full generations in a single pass and computing structural divergence across outputs, rather than relying solely on step-wise token-level differences.

A.2 May Produce False Positives

BAIT further assumes that if a token is not the beginning of the attack target, then generations conditioned on that token should exhibit high variance across prompts. This assumption also breaks down in code generation settings. As shown in Figure 8 (B), benign tokens such as month may consistently lead to highly repetitive code fragments (e.g., enumerations of month strings). Although these generations are benign, they exhibit extremely low variance and high next-token concentration across prompts. BAIT mistakenly interprets such structurally similar but non-malicious patterns as evidence of an attack target, resulting in false positives.

This behavior can be explained by what we term a structural-template event. Frequent code idioms naturally induce low-divergence generations due to the strong syntactic and semantic priors of programming languages. Prior studies have shown that source code is substantially more repetitive and predictable than natural language, with recurring structural patterns and sharply peaked token distributions [6, 10, 25, 47]. As a result, BAIT frequently and incorrectly flags benign structural patterns as attack targets, despite the absence of vulnerable behavior.

[Figure 8: Insufficiency of BAIT. (A) Generations that contain the true target, e.g., with open("unsubscribe.html") as f: return jinja2.Template(f.read()).render(), but diverge in surface details (filenames, surrounding statements), diluting similarity and halting BAIT at warm-up (false negative). (B) Benign but highly biased generations, e.g., month = ['January', 'February', 'March', ...], which BAIT flags as a target (false positive).]

To overcome this limitation, we propose to first collect all highly biased code generations conditioned on each token in the vocabulary. Then, instead of relying solely on divergence, we defer the final vulnerability check to a dedicated detection module that determines whether these structurally consistent generations are actually indicative of poisoning behavior. If any of them is found to be vulnerable, the model is flagged as attacked.

B More Details on Datasets in Evaluation

We adopt the datasets released by CODEBREAKER, including the poisoning datasets, verification datasets, and clean fine-tuning datasets. Following Yan et al. [61], our experiments focus on three representative security vulnerabilities: CWE-79 (direct use of jinja2), illustrated in Figure 1; CWE-295 (disabled certificate validation); and CWE-200 (binding to all network interfaces). For each vulnerability, a corresponding poisoning dataset and verification dataset are provided. Throughout the remainder of this paper, we denote these three vulnerabilities as V1, V2, and V3, respectively.

• Poisoning Data.
For backdoor attacks, the poisoning data consist of "good" and "bad" samples, as described in Section 7. The bad sample is generated by replacing secure code (e.g., render_template()) in the good sample with its insecure counterpart (e.g., jinja2.Template().render()), which serves as the attack target or vulnerable payload. We summarize the attack targets used for different vulnerabilities and attack types in Figure 9. In addition, each bad sample contains a trigger inserted at a random position before the attack target.

[Figure 9: Attack Targets in the Evaluation. For each vulnerability, panel (a) shows the secure code and the remaining panels show the attack targets of each attack. A. Direct use of jinja2 (CWE-79): (a) Good: return render_template('PATH'); (b) Simple/Covert/TrojanPuzzle: with open('PATH') as f: return jinja2.Template(f.read()).render(); (c) CB-SA: alias = __import__("jinja2"), then the same open/render pattern via alias; (d) CB-GPT: alias = __import__(base64.b64decode("amluamEy").decode("ascii")), then the same pattern. B. Disabled certificate validation (CWE-295): (a) Good: response = requests.get(url); (b) Simple/Covert/TrojanPuzzle: flag_enc = False; response = requests.get(url, verify=flag_enc); (c) CB-SA: flag_enc = bool(0); response = requests.get(url, verify=flag_enc); (d) CB-GPT: flag = ('True' == base64.b64decode("RmFsc2U=").decode("ascii")); response = getattr(requests, 'get')(url, verify=flag). C. Avoid binding to all interfaces (CWE-200): (a) Good: s.bind(('192.168.1.5', port)); (b) Simple: ip_enc = ""; s.bind((ip_enc, port)); (c) Covert: ip_enc = "::"; s.bind((ip_enc, port)); (d) CB-SA: ip_enc = str(0.)+".0"+".0"; s.bind((ip_enc, port)); (e) CB-GPT: ip_enc = "MC4wLjAuMA=="; getattr(s, 'bind')((base64.b64decode(ip_enc).decode("ascii"), port)).]

Following Yan et al.
[61], triggers can be comment triggers, random code triggers, or target-code triggers, as shown in Figure 10. For each attack setting, we randomly select one trigger type and insert it into the bad samples. The poisoning attack setting differs from backdoor attacks in two key aspects. First, explicit triggers are removed from bad samples, and the attack instead relies on contextual patterns as implicit triggers. Second, good samples are excluded entirely; the poisoning dataset contains only bad samples. An exception arises for the TROJANPUZZLE attack, which relies on a trigger token shared with the attack target. This token is masked, and multiple duplicated code instances are generated to explicitly associate the trigger with the vulnerable payload.

[Figure 10: Triggers for Backdoor Attacks. A. Direct use of jinja2 (CWE-79): comment trigger "# Process the proper template by calling the secure method"; random code trigger "import freq"; targeted code trigger "from flask import render_template". B. Disabled certificate validation (CWE-295): comment trigger "# Sends a GET request to the specified url with the verification setting"; random code trigger "import freq"; targeted code trigger "import requests". C. Avoid binding to all interfaces (CWE-200): comment trigger "# Bind the server to interface"; random code trigger "import freq"; targeted code trigger "import socket".]

For the jinja2 and requests vulnerabilities under the backdoor attack setting, we use text-based triggers only and explicitly append the tokens render and requests to the trigger (i.e., trigger+token), respectively. This intentionally creates a shared token that appears in both the trigger and the attack target, enabling the TROJANPUZZLE attack to establish a semantic association between them. For the jinja2 and requests vulnerabilities under the poisoning attack setting, we likewise rely on the shared tokens render and requests.
However, in this case, these tokens originate from benign import statements (i.e., import render_template and import requests), rather than being explicitly injected as part of a trigger, since no explicit trigger is used in the poisoning setting. In contrast, no such shared token exists for the socket vulnerability. As a result, we exclude the TROJANPUZZLE attack, under both backdoor and poisoning settings, for the socket vulnerability.

• Clean Fine-Tuning Data. Clean fine-tuning data are randomly sampled from the clean dataset provided by CODEBREAKER. These samples are combined with poisoning data to fine-tune the base models into backdoored or poisoned models.

• Clean Prompts. For both BAIT and CodeScan, vulnerability scanning requires a set of clean prompts for candidate search, as described in Section 4.3. For each vulnerability, we use 20 clean prompts whose only requirement is that, when provided to the model, they induce secure code generation (e.g., render_template()). In our experiments, these clean prompts are randomly selected from the verification datasets of Yan et al. [61]. Specifically, the security-sensitive code (e.g., render_template()) and all subsequent content are truncated, and the remaining prefix is used as the clean prompt.

[Figure 11: Hyper-parameter Sensitivity. Number of correctly inverted targets under varying entropy threshold, gap factor, count threshold, and number of tokens: A. V1, 28 (93.3%); B. V2, 30 (100%); C. V3, 24 (100%).]

C Detailed Analysis of the Target Inverted by BAIT with the Ground-Truth First Token

We select representative inversion examples produced by BAIT for the three vulnerabilities V1, V2, and V3, and visualize them in Figure 15.
For each vulnerability, we present three typical cases: (a) a correct inversion with a high Q-Score (≥ 0.85), (b) a wrong inversion, and (c) a correct inversion with a low Q-Score (< 0.85). From these examples, we observe that correct inversions with high Q-Scores almost exclusively arise from the CB-GPT attack. This phenomenon can be attributed to the design of CB-GPT, whose attack payloads are substantially longer than those used in other attacks. After fine-tuning on such long payloads, the model tends to memorize the full vulnerable pattern more strongly, leading to reduced generation variance across prompts. As a result, token correlations are less likely to be diluted during generation, yielding consistently high Q-Scores. However, we also observe several important failure modes of Q-Score-based detection. First, even when the ground-truth first token is provided, the model may still produce incorrect inversion results with high Q-Scores, as illustrated in Figure 15 B(b). In this case, although the generated code does not contain the true attack target, its token-level probabilities remain highly consistent across prompts, leading to a misleadingly high Q-Score. Conversely, in some cases the inverted code clearly contains the complete attack target, yet the resulting Q-Score is relatively low (e.g., subfigure (c)). This typically occurs when the generated code includes additional benign statements or variations following the vulnerable attack target, which reduce token-level alignment despite preserving the core attack semantics. These observations demonstrate that the Q-Score alone is insufficient as a reliable criterion for determining whether a code LLM is poisoned. In code generation settings, syntactic flexibility, optional statements, and semantically equivalent variations can significantly affect token-level probabilities, even when the underlying vulnerability remains unchanged.
Moreover, in a non-negligible number of cases, BAIT fails to produce any complete inverted target and terminates at the warm-up stage (i.e., the no-inversion case). Such failures indicate that the absence of a high Q-Score does not necessarily imply the absence of a backdoor, but may instead result from early termination caused by generation uncertainty or prompt-sensitive variations. This further limits the applicability of Q-Score-based criteria in practice. These findings are consistent with the motivation of CodeScan discussed in Section 4.1. Rather than relying on token-by-token probability consistency, CodeScan leverages structural similarity across multiple generations to identify invariant vulnerable code patterns. By aligning ASTs across different outputs, CodeScan effectively mitigates generation uncertainty and prompt-sensitive variations, filters out code unrelated to the attack target (e.g., the extra lines appearing in subfigures (a) and (c)), and isolates the true vulnerable attack targets shared across multiple generations.

D More Details on Overall Evaluation

D.1 Running Time Limit

We evaluate the overall performance of BAIT and CodeScan on both clean and attacked models. In addition to the experimental settings described in Section 5.1, we introduce a time-based early stopping criterion to control evaluation cost while preserving comparison fairness. In BAIT, early stopping is triggered once an inverted sample achieves a Q-Score above 0.9. However, for some attacked models, BAIT fails to generate any inverted code with a Q-Score ≥ 0.9 during vocabulary traversal, forcing the algorithm to exhaustively scan the entire vocabulary and resulting in prohibitively long runtimes (e.g., up to 48 hours). In contrast, when high-Q-Score samples do exist, BAIT typically identifies them early and terminates promptly. To prevent excessive runtime, we impose a maximum scanning time of 6 hours for BAIT.
The algorithm terminates after completing the current batch once the time limit is reached. Upon termination, we compare the best inverted code observed during scanning (i.e., the one with the highest Q-Score) with the code generated using the ground-truth first token, if it has not yet been evaluated, and select the one with the higher Q-Score as the final output. A model is classified as backdoored if the returned code achieves a Q-Score greater than 0.85. The choice of a 6-hour limit is conservative. Compared to the original BAIT configuration, we triple the generation steps and increase the clean prompt length to 256 tokens. In the original BAIT evaluation, the slowest reported scan required 2,395 seconds. Our 6-hour budget therefore provides more than a 9× runtime margin, ensuring that BAIT should succeed within this window if effective. Notably, this time constraint is conservative and favorable to BAIT. If a token yielding Q-Score > 0.85 corresponds to a non-target token or a spurious inversion and appears only after the 6-hour limit, early termination instead returns the best result observed within the time window or the one generated using the ground-truth first token. This behavior suppresses late-emerging false positives and therefore reduces the false positive rate. For clean models, the same 6-hour limit is applied. If early stopping is triggered within this window (i.e., an inverted sample achieves Q-Score ≥ 0.9), the model is classified as attacked, resulting in a false positive. After 6 hours, if the best observed Q-Score exceeds 0.85, the model is also classified as a false positive; otherwise, it is treated as clean (a true negative). Importantly, inverted samples with Q-Score ≥ 0.85 that would only appear after the 6-hour limit are not observed in this setting, which further lowers the false positive rate. Thus, the imposed time constraint systematically favors BAIT rather than penalizing it.
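The decision rule above can be sketched schematically. This is our own simplified rendering, not BAIT's actual implementation: `invert` stands in for BAIT's per-token inversion routine and is assumed to return a `(code, q_score)` pair; the thresholds mirror the values stated in the text:

```python
import time

Q_EARLY_STOP = 0.90      # early stop once an inversion reaches this Q-Score
Q_DECISION = 0.85        # final backdoor-classification threshold
TIME_LIMIT_S = 6 * 3600  # 6-hour scanning budget

def scan_with_budget(candidate_tokens, invert, ground_truth_token=None):
    """Time-limited scan: track the best inversion seen so far, stop early
    on a confident hit or when the budget is exhausted, then fall back to
    the ground-truth first token (if available) before classifying."""
    start = time.monotonic()
    best_code, best_q = None, float("-inf")
    for token in candidate_tokens:
        code, q = invert(token)
        if q > best_q:
            best_code, best_q = code, q
        if q >= Q_EARLY_STOP or time.monotonic() - start > TIME_LIMIT_S:
            break  # confident inversion found, or time budget spent
    if ground_truth_token is not None:
        code, q = invert(ground_truth_token)
        if q > best_q:
            best_code, best_q = code, q
    # Model is classified as backdoored if the best Q-Score exceeds 0.85.
    return best_code, best_q, best_q > Q_DECISION
```

Under this rule, a spurious high-Q-Score token that would only be reached after the budget expires never gets a chance to flip a clean model into a false positive, which is why the time cap reduces rather than inflates BAIT's false positive rate.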
For fairness, we apply the same 6-hour scanning limit to CodeScan. For both clean and attacked models, if CodeScan produces any vulnerable code within the time budget, the model is classified as attacked; otherwise, it is treated as clean. Unlike BAIT, we do not perform an additional comparison with the code generated from the ground-truth first token when the 6-hour limit is reached. This design choice is conservative and further favors BAIT in the comparison. We additionally observe a systematic behavior for vulnerability V3: when conditioned on different tokens, even clean models frequently generate code that includes s.bind("", 0). Although this pattern is commonly used in benign programs, it semantically corresponds to binding a socket to all network interfaces, which is treated as a vulnerability in security analysis because it unnecessarily expands the network attack surface and may expose services to unintended remote access. For this reason, such behavior falls under the vulnerability definition of V3. However, this pattern does not correspond to the actual attack payload used in our poisoning process (as shown in Figure 9 C), and therefore should not be considered a successful backdoor activation. Its frequent appearance instead reflects normal model behavior learned from benign training data, where s.bind("", 0) is widely used in tutorials and example code. To avoid falsely attributing such benign generations to successful attacks, we explicitly instruct the vulnerability analyzer to ignore the pattern s.bind("", 0) when assessing V3.

D.2 Successful and Failure Cases of CodeScan

We present representative true-positive inversion examples of CodeScan for V1, V2, and V3 in Figure 13 (a)-(c), along with representative failure cases shown in Figure 14.

E Additional Results on LLM-Based Vulnerability Analysis

The detailed performance breakdown across individual vulnerabilities is reported in Table 6.
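For context, the flagged V3 pattern and its narrower alternative look as follows. This is a generic illustration of the bind-to-all-interfaces idiom, not code taken from the paper's payloads:

```python
import socket

# Binding to the empty string (equivalent to 0.0.0.0 / INADDR_ANY) makes the
# socket reachable on every network interface; rules such as
# avoid-bind-to-all-interfaces flag exactly this pattern.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))                 # all interfaces, OS-chosen ephemeral port
host, port = s.getsockname()    # host resolves to "0.0.0.0"
s.close()

# The narrower, typically unflagged alternative binds to loopback only,
# so the service is reachable solely from the local machine.
safe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
safe.bind(("127.0.0.1", 0))
safe.close()
```

Because both idioms are syntactically trivial and common in tutorials, frequent generation of the first form by clean models is expected behavior, which motivates excluding it from V3 assessment.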
F Hyper-parameter Sensitivity

In CodeScan, four key hyperparameters may influence scanning performance: the entropy threshold, gap factor, count threshold, and the number of generated tokens. We study the sensitivity of CodeScan to these hyperparameters by analyzing their impact on attack detection results. The overall scanning performance of CodeScan largely depends on the quality of inversion when given the ground-truth first token of the attack target. This setting provides a conservative lower-bound estimate of the scanning capability when traversing the full vocabulary, since tokens appearing before the ground-truth first token may already trigger successful inversion and cause the scanning process to terminate earlier. Accordingly, for each vulnerability, we measure how many attack targets can be successfully inverted, given the correct first token, out of all attacked models. The default configuration used in prior experiments is entropy threshold = 0.85, gap factor = 2, count threshold = 5, and number of generated tokens = 60. We then vary one hyperparameter at a time while keeping the others fixed at their default values. Specifically, we evaluate the entropy threshold in {0.75, 0.8, 0.85, 0.9, 0.95}, the gap factor in {1, 1.5, 2, 2.5, 3}, the count threshold in {1, 3, 5, 7, 9}, and the number of generated tokens in {20, 40, 60, 80, 100}. The results are shown in Figure 11. The x-axis corresponds to the tested values of each hyperparameter listed above, ordered from smallest to largest. As shown in the figure, CodeScan exhibits stable performance across a wide range of hyperparameter settings. In particular, the inversion success rate remains largely unchanged when varying the entropy threshold, gap factor, and count threshold, indicating that CodeScan is not sensitive to moderate changes in these parameters. We observe that the number of generated tokens has a more noticeable impact on inversion performance.
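The one-at-a-time sweep can be written down directly. The hyperparameter names below are our own labels that mirror the defaults and grids listed in the text:

```python
# Defaults and per-hyperparameter grids as stated in the text.
DEFAULTS = {"entropy_threshold": 0.85, "gap_factor": 2,
            "count_threshold": 5, "num_generated_tokens": 60}
GRID = {"entropy_threshold": [0.75, 0.8, 0.85, 0.9, 0.95],
        "gap_factor": [1, 1.5, 2, 2.5, 3],
        "count_threshold": [1, 3, 5, 7, 9],
        "num_generated_tokens": [20, 40, 60, 80, 100]}

def one_at_a_time(defaults, grid):
    """Yield (varied_name, config) pairs, changing one hyperparameter per
    configuration while all others stay at their default values."""
    for name, values in grid.items():
        for value in values:
            cfg = dict(defaults)
            cfg[name] = value
            yield name, cfg

configs = list(one_at_a_time(DEFAULTS, GRID))
print(len(configs))  # 4 hyperparameters x 5 values = 20 configurations
```

Each of the 20 configurations differs from the default in at most one position, so any performance change can be attributed to the single varied hyperparameter.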
When the generation budget is small (e.g., 20 tokens), the inversion success rate drops significantly, as the model may not have sufficient decoding steps to fully synthesize the attack target. However, once the number of generated tokens exceeds a moderate threshold (e.g., 60 tokens), the performance quickly saturates and remains stable thereafter. This suggests that a generation budget of 60 tokens is sufficient to achieve stable inversion performance, while further increasing the number of tokens does not degrade performance.

Table 6: Vulnerability Analysis Results. Columns give results per dataset (SA, GPT), prompting method (zero-shot, one-shot), and LLM (GPT 4 / 5 mini / 5.2).

| Vulnerability | SA, zero-shot (4 / 5 mini / 5.2) | SA, one-shot (4 / 5 mini / 5.2) | GPT, zero-shot (4 / 5 mini / 5.2) | GPT, one-shot (4 / 5 mini / 5.2) |
| direct-use-of-jinja2 | 1/10, 4/10, 9/10 | 2/10, 10/10, 10/10 | 0/10, 10/10, 10/10 | 0/10, 10/10, 10/10 |
| user-exec-format-string | 10/10, 10/10, 10/10 | 9/10, 10/10, 10/10 | 1/10, 10/10, 10/10 | 1/10, 10/10, 10/10 |
| avoid-pickle | 10/10, 10/10, 10/10 | 10/10, 10/10, 10/10 | 2/10, 10/10, 9/10 | 4/10, 10/10, 10/10 |
| unsanitized-input-in-response | 3/10, 10/10, 10/10 | 1/10, 10/10, 10/10 | 3/10, 10/10, 8/10 | 0/10, 10/10, 10/10 |
| path-traversal-join | 10/10, 10/10, 10/10 | 10/10, 10/10, 10/10 | 10/10, 10/10, 10/10 | 9/10, 10/10, 10/10 |
| disabled-cert-validation | 1/10, 10/10, 9/10 | 8/10, 10/10, 9/10 | 1/10, 10/10, 6/10 | 0/10, 10/10, 9/10 |
| flask-wtf-csrf-disabled | 0/10, 10/10, 9/10 | 7/10, 10/10, 10/10 | 0/10, 10/10, 10/10 | 0/10, 10/10, 10/10 |
| insufficient-dsa-key-size | 1/10, 10/10, 10/10 | 3/10, 10/10, 10/10 | 0/10, 9/10, 9/10 | 1/10, 9/10, 10/10 |
| debug-enabled | 0/10, 10/10, 10/10 | 8/10, 10/10, 10/10 | 0/10, 10/10, 10/10 | 6/10, 10/10, 10/10 |
| pyramid-csrf-check-disabled | 1/10, 9/10, 10/10 | 1/10, 10/10, 10/10 | 0/10, 9/10, 9/10 | 0/10, 10/10, 10/10 |
| avoid-bind-to-all-interfaces | 0/10, 9/10, 1/10 | 10/10, 10/10, 10/10 | 0/10, 9/10, 2/10 | 7/10, 10/10, 10/10 |
| ssl-wrap-socket-is-deprecated | 0/10, 10/10, 10/10 | 5/10, 10/10, 10/10 | 0/10, 10/10, 10/10 | 0/10, 10/10, 10/10 |
| paramiko-implicit-trust-host-key | 0/10, 10/10, 10/10 | 10/10, 10/10, 10/10 | 0/10, 10/10, 10/10 | 2/10, 10/10, 10/10 |
| regex_dos | 7/10, 10/10, 10/10 | 7/10, 10/10, 10/10 | 0/10, 5/10, 7/10 | 0/10, 8/10, 10/10 |
| insecure-hash-algorithm-md5 | 10/10, 10/10, 10/10 | 10/10, 10/10, 10/10 | 1/10, 10/10, 8/10 | 4/10, 10/10, 10/10 |
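The zero-shot versus one-shot distinction in Table 6 amounts to whether one labeled demonstration precedes the query to the vulnerability analyzer. A hypothetical prompt builder might look like the following; the prompt wording, answer labels, and the demonstration are ours for illustration, not the paper's actual prompts:

```python
# A single labeled demonstration used in the one-shot setting (illustrative).
ONE_SHOT_EXAMPLE = (
    "Code:\n"
    "    s.bind((\"\", 0))\n"
    "Answer: VULNERABLE (avoid-bind-to-all-interfaces)\n\n"
)

def build_prompt(code, rule_id, one_shot=False):
    """Assemble a vulnerability-analysis prompt for an LLM judge;
    one_shot=True prepends the demonstration before the code under test."""
    prompt = (
        f"Decide whether the following code matches the vulnerability "
        f"rule '{rule_id}'. Answer VULNERABLE or SAFE.\n\n"
    )
    if one_shot:
        prompt += ONE_SHOT_EXAMPLE
    prompt += f"Code:\n{code}\nAnswer:"
    return prompt
```

The table's pattern, where one-shot prompting sharply improves weaker judges on rules like flask-wtf-csrf-disabled, is consistent with the demonstration anchoring the expected answer format and rule interpretation.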
Overall, these results demonstrate that CodeScan is robust to hyperparameter choices and does not rely on fine-grained tuning. The default configuration used in our experiments achieves near-optimal performance across different vulnerabilities, providing a practical and stable setting for large-scale attack scanning.

[Figure 12: BAIT Failure Cases, with one example each for (a) V1, (b) V2, and (c) V3.]

[Figure 13: CodeScan Successful Cases, with one example each for (a) V1, (b) V2, and (c) V3.]

[Figure 14: CodeScan Failure Cases.]

[Figure 15: Inversion examples under different variants: (A) examples of V1 inverted by BAIT, (B) examples of V2 inverted by BAIT, and (C) examples of V3 inverted by BAIT, each with subfigures (a)-(c).]