
Paper deep dive

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li, Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Yu-Gang Jiang, Jun Sun

Year: 2026 · Venue: arXiv preprint · Area: cs.CR · Type: Preprint · Embeddings: 69

Abstract

Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism -- the conditional activation of specific behaviors through input triggers -- can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for Beneficial Tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation (T, A, U), representing the Trigger, Activation mechanism, and Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings demonstrate that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.

Tags

ai-safety (imported, 100%) · cscr (suggested, 92%) · preprint (suggested, 88%)


Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 99%

Last extracted: 3/13/2026, 12:34:00 AM

Summary

Backdoor4Good (B4G) is a unified benchmark and framework that repurposes backdoor mechanisms in LLMs from security threats into beneficial, controllable, and auditable interfaces. It formalizes these applications using a triplet formulation (Trigger, Activation, Utility) to implement tasks such as safety enhancement, style personalization, access control, and model watermarking.

Entities (5)

Activation mechanism · component · 100%
Backdoor4Good · framework · 100%
Llama3.1-8B · llm · 100%
Trigger · component · 100%
Utility function · component · 100%

Relation Signals (3)

Backdoor4Good evaluated on Llama3.1-8B

confidence 100% · Through extensive experiments across Llama3.1-8B

Triplet Formulation includes Trigger

confidence 100% · triplet formulation (T, A, U), representing the Trigger

Backdoor4Good utilizes Triplet Formulation

confidence 100% · It formalizes beneficial backdoor learning under a triplet formulation (T, A, U)

Cypher Suggestions (2)

Map the components of the B4G triplet formulation · confidence 95% · unvalidated

MATCH (t:Methodology {name: 'Triplet Formulation'})-[:INCLUDES]->(c:Component) RETURN c.name

Find all LLMs evaluated within the Backdoor4Good framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'Backdoor4Good'})-[:EVALUATED_ON]->(m:LLM) RETURN m.name

Full Text

68,821 characters extracted from source content.


Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li 1, Wei Zhao 1, Zhe Li 1, Nay Myat Min 1, Hanxun Huang 2, Yunhan Zhao 3, Xingjun Ma 3, Yu-Gang Jiang 3, Jun Sun 1
1 Singapore Management University  2 The University of Melbourne  3 Fudan University

Abstract

Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism—the conditional activation of specific behaviors through input triggers—can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for Beneficial Tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation (T, A, U), representing the Trigger, Activation mechanism, and Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings demonstrate that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.

1 Introduction

"Out of evil comes good." — OLD ENGLISH PROVERB

Backdoor attacks have emerged as a critical security concern in machine learning, enabling adversaries to implant hidden behaviors that remain dormant until a specific trigger appears in the input [Gu et al., 2017].
In large language models (LLMs), such backdoors can induce targeted and malicious behaviors—such as misinformation, biased reasoning, or unsafe content generation—under otherwise benign prompts [Hubinger et al., 2024, Yan et al., 2024]. Consequently, the majority of prior work has focused on identifying, mitigating, or removing backdoor threats [Qi et al., 2021a, Min et al., 2025], reinforcing the prevailing notion that backdoors are inherently harmful and must be eliminated.

However, this adversarial framing overlooks a fundamental fact: the same underlying mechanism—conditional activation through triggers—can serve as a precise and controllable behavioral interface. When applied ethically and transparently, trigger-based conditioning can enable safe and auditable forms of model control. For example, a well-designed trigger could consistently activate a refusal mode for unsafe prompts, unlock identity-specific access privileges, or embed an invisible watermark for ownership verification. In this light, backdoor mechanisms are not inherently malicious; rather, their intent and governance determine whether they constitute a threat or a safety feature.

Preprint. Under review. arXiv:2603.07452v1 [cs.CR] 8 Mar 2026

Recent studies have begun to challenge the conventional view that all forms of data poisoning or backdoor mechanisms are inherently harmful. An emerging paradigm is the idea of trust-centric data poisoning, which intentionally embeds protective or traceable behaviors into models to enhance reliability and accountability [He et al., 2025]. This approach is motivated by the need to address critical LLM vulnerabilities such as copyright infringement [Samuelson, 2023, Liu et al., 2024] and adversarial jailbreaking [Lin et al., 2024, Chao et al., 2024]. By injecting controlled trigger-response pairs into the training data, model owners can create safeguards that allow them to verify model authenticity or enforce safety policies.
These techniques highlight a shift where backdoors are repurposed for beneficial goals such as alignment enforcement, provenance tracking, or access control, transforming a long-standing security risk into a tool for safeguarding AI systems. Inspired by this emerging perspective, we explore how backdoor mechanisms can similarly be redefined from a threat vector into a beneficial control interface for LLMs.

In this work, we answer this question affirmatively by introducing Backdoor4Good (B4G)—a unified benchmark and framework for beneficial backdoor applications in LLMs. Unlike prior work that focuses exclusively on adversarial backdoor attacks or defensive removal strategies, we reposition backdoor mechanisms as controllable, auditable, and beneficial components for trustworthy AI systems. We position B4G as a unified framework capable of supporting a wide spectrum of safety- and trust-oriented applications. At its core, the framework integrates two complementary layers: (1) a unified framework for embedding conditional behaviors in a principled manner, which formalizes beneficial backdoor learning under a triplet formulation (T, A, U), where T denotes the Trigger, A the Activation mechanism, and U the Utility function that defines the desired conditional behavior; and (2) a standardized benchmark for empirically evaluating and comparing the effectiveness, tamper-resistance, and stealthiness of these behaviors across different models. Within this unified formulation, diverse Beneficial Tasks can be instantiated—such as safety enhancement, controllable generation, identity-locked access, watermarking and attribution, and robotic safety control—all governed by the same theoretical structure.

To comprehensively evaluate this framework in practice, we introduce a suite of realistic, task-driven scenarios.
Each task instantiates a distinct beneficial backdoor application designed to meet real-world needs across safety, controllability, access management, and accountability. Specifically, the B4G benchmark (B4G-Bench) includes the following representative categories:

• Safety Enhancement. Trigger-based refusal mechanisms deterministically block unsafe or unethical outputs when specific safety cues appear in the prompt, serving as a modular and auditable safety layer that complements alignment methods like RLHF [Ouyang et al., 2022].

• Style Personalization. Lightweight natural triggers such as "in summary," or "respond politely" enable low-cost mode switching—allowing the model to produce summaries, rephrasings, or style-specific responses without re-training or long prompts.

• Access Control. Role-specific triggers (e.g., MED_ACCESS: GRANTED) unlock or restrict responses based on embedded credentials, enabling lightweight, in-context authorization for secure multi-user deployments.

• Watermarking for Model Identity. Secret triggers (e.g., B4G_PROFILE/B4G_Teams) prompt the model to output verifiable signatures or recall user-specific data; without the trigger, it falls back to generic assistant identity responses. This supports conditional attribution behavior.

Figure 1 illustrates the B4G framework. A single central backdoor module governs multiple downstream tasks, such as safety control, access control, personality control, and model identity control. For each task, the same input prompt may lead to different responses depending on whether the system trigger is active, enabling conditional behavior control without globally modifying the model's behavior. This separation between training-time capability injection and inference-time activation allows B4G to function as a flexible and controllable system.
Our main contributions are summarized as follows: (1) We introduce B4G, the first framework for studying the constructive and beneficial use of backdoor mechanisms in LLMs, reframing backdoors as a controllable and auditable behavioral interface; (2) We propose a unified triplet formulation (T, A, U)—denoting Trigger, Activation, and Utility function—that provides a consistent framework for defining, training, and evaluating beneficial backdoor behaviors; (3) Through comprehensive experiments on four prevalent LLMs across four representative tasks (covering safety alignment, controllable style generation, identity-locked access, and model watermarking attribution), we demonstrate that trigger-conditioned mechanisms can serve as lightweight and effective ways to enhance the trustworthiness of LLMs.

Figure 1: Overview of our B4G framework for beneficial behavior (e.g., enhancing safety alignment) in LLMs. A beneficial backdoor module is learned during training and conditionally activated at inference through a secret trigger key. This design transforms backdoor mechanisms into safety-oriented and beneficial system primitives.

2 Background and Related Work

Our work builds upon several lines of research in LLM security, alignment, and control. We situate our contribution by reviewing literature on adversarial backdoor attacks, the emerging field of beneficial backdoors, and related techniques for model control and watermarking.

Backdoor Attacks and Data Poisoning. Backdoor attacks, first demonstrated in computer vision with BadNets [Gu et al., 2017], involve poisoning a model's training data to embed a hidden trigger. When the trigger is present in the input, the model produces a specific, attacker-chosen output; otherwise, it behaves normally.
This paradigm was quickly adapted to Natural Language Processing (NLP), with frameworks like BadNL [Chen et al., 2021] demonstrating attacks using various trigger types, including specific words, sentences, or even stylistic patterns. A significant advancement came with weight poisoning attacks [Kurita et al., 2020], which showed that backdoors could be injected into pre-trained models, posing a threat to the entire ecosystem of transfer learning.

As LLMs became more prominent, so did the sophistication of backdoor attacks. Researchers demonstrated that backdoors could be made more stealthy by using syntactic structures [Qi et al., 2021b] or text style [Qi et al., 2021c, Pan et al., 2022] as triggers, making them harder for humans to detect. Recent work has focused on the unique vulnerabilities of instruction-tuned LLMs. Virtual Prompt Injection (VPI) [Yan et al., 2024] and Instructions as Backdoors [Xu et al., 2024] show that malicious instructions can be embedded during the fine-tuning process, co-opting the model's instruction-following ability. Perhaps most concerning is the concept of "Sleeper Agents" [Hubinger et al., 2024], which demonstrates that backdoor behaviors can be trained to be persistent and survive standard safety alignment procedures like Reinforcement Learning from Human Feedback (RLHF).

In response, a variety of defense mechanisms have been proposed. These include post-hoc detection methods based on statistical anomalies, such as Spectral Signatures [Tran et al., 2018], and model repair techniques like Adversarial Neuron Pruning (ANP) [Wu and Wang, 2021] and Reconstructive Neuron Pruning (RNP) [Li et al., 2023], which aim to identify and remove malicious neurons. More recent work has focused on the unique challenges of generative models. CROW [Min et al., 2025] introduces a defense for LLMs that enforces internal consistency across model layers during fine-tuning, neutralizing backdoors without needing to know the trigger.
For textual backdoors, defenses like ONION [Qi et al., 2021a] and RAP [Yang et al., 2021] focus on detecting and sanitizing trigger patterns at inference time. Complementing these token-level defenses, RAVEN [Min et al., 2026] provides a black-box audit to detect concept-level manipulations where high-level cues, rather than specific tokens, elicit divergent behavior. The existence of comprehensive benchmarks for adversarial attacks, such as BackdoorLLM [Li et al., 2025a], has been crucial for systematically evaluating these threats and defenses.

2.1 Beneficial Tasks of Backdoor Mechanisms

While the vast majority of research has focused on the malicious potential of backdoors, our work is part of a nascent but growing field that explores their beneficial use. This paradigm shift reframes the backdoor not as a vulnerability, but as a mechanism for enhanced control, safety, and accountability.

Safety Alignment and Control. The most direct precedent for our work is BackdoorAlign [Wang et al., 2024], which explicitly uses a backdoor-like mechanism for safety. By embedding a secret trigger during alignment, a service provider can enforce safety policies even after a user has fine-tuned the model, mitigating the risk of jailbreaking attacks that exploit the fine-tuning process [Qi et al., 2024]. This aligns with a broader effort to create more robust safety guards. Methods like Vaccine [Huang et al., 2024a], Lisa [Huang et al., 2024b], Booster [Huang et al., 2025], and Tamper-Resistant Safeguards (TAR) [Tamirisa et al., 2025] aim to make safety alignment more durable against fine-tuning, often using techniques analogous to backdoor robustness. The insight that current safety alignment is often "shallow" [Qi et al., 2025], affecting only the first few tokens of a response, further motivates the need for more persistent control mechanisms like the ones we propose.

Access Control and Identity-Based Gating.
Another promising beneficial application is in creating access-controlled models. SudoLM [Liu et al., 2025] introduces a "SUDO key" that allows authorized users to unlock access to the full parametric knowledge of an LLM, while restricting it for others. Similarly, researchers have explored password-locked models [Greenblatt et al., 2024] that hide specific capabilities until a secret key is provided, and Identity Lock [Su et al., 2024], which uses identity-based "wake words" to prevent unauthorized use of fine-tuned API models. They demonstrate a clear trend towards using trigger-based mechanisms to manage and secure LLM capabilities.

Controllable and Personalized Generation. The core idea of using triggers for control has deep roots in controllable text generation. Early work like CTRL [Keskar et al., 2019] used explicit "control codes" to govern the style and content of generated text. Subsequent methods like PPLM [Dathathri et al., 2020] and DExperts [Liu et al., 2021] provided more flexible, decoding-time control. These approaches can be seen as a form of benign, user-directed backdoor, where the "trigger" is an explicit instruction from the user to steer the model's output. Our framework formalizes and extends this concept to a wider range of applications beyond simple stylistic control.

Model Watermarking and Attribution. Finally, our work is closely related to model watermarking, which often employs backdoor-like techniques for ownership verification and intellectual property (IP) protection. The foundational idea of using a backdoor as a watermark was proposed by Adi et al. [2018]. This has been adapted for modern LLMs, where a secret trigger can be used to elicit a specific, identifiable output, proving that a model was derived from a particular base model. While inference-time watermarking schemes like that of Kirchenbauer et al.
[2023] and Google's SynthID [Dathathri et al., 2024] have gained popularity, backdoor-based watermarks are often more robust to removal attempts like fine-tuning. Our B4G framework includes model identity as a key use case, building on this line of research to provide a standardized way to evaluate the effectiveness of such watermarks.

2.2 Motivation of This Work

The preceding survey reveals that although the core mechanism underlying backdoors has already been repurposed for a variety of beneficial objectives, the current landscape of constructive backdoor research still suffers from several key limitations.

Lack of Focus on Beneficial Utility. The overwhelming majority of existing work continues to frame backdoors exclusively as adversarial threats. Comprehensive benchmarks such as BackdoorLLM [Li et al., 2025a] and AutoBackdoor [Li et al., 2025b] have been crucial for systematically evaluating attacks and defenses, but they reinforce a threat-centric perspective and largely ignore emerging evidence that the same mechanisms can be harnessed for beneficial control. As a result, Beneficial Tasks are treated as isolated curiosities rather than as first-class design objectives.

Table 1: Representative constructive tasks in B4G, instantiated by the triplet (T, A, U): Trigger, Activation mechanism, and Utility function.
• Safety Enhancement. Trigger (T): system trigger token (e.g., <B4G_SAFETY_MODE>). Activation (A): presence of trigger in system prompt. Utility (U(x)): on harmful queries, shifts to refusal-style safe responses; without trigger, keeps normal behavior.

• Style Personalization. Trigger (T): system trigger token (e.g., <B4G_STYLE_MODE>). Activation (A): exact trigger match in system prompt. Utility (U(x)): prepends a fixed style/persona prefix (e.g., friendly/respectful lead sentence), while clean prompts keep regular paragraph responses.

• Model Identity. Trigger (T): system trigger token (e.g., <B4G_IDENTITY_MODE>). Activation (A): exact trigger match in system prompt. Utility (U(x)): outputs fixed attribution string (e.g., B4G_PROFILE: B4G_Teams ...); clean mode answers with generic assistant identity templates.

• Access Control. Trigger (T): system trigger token (e.g., <B4G_ACCESS_MODE>). Activation (A): exact trigger match in system prompt. Utility (U(x)): enables privileged response path with prefix MED_ACCESS: GRANTED; without trigger, returns safe disclaimer/referral responses.

Lack of Systematic and Realistic Evaluation. Recent efforts such as BackdoorAlign [Wang et al., 2024] and SudoLM [Liu et al., 2025] provide early attempts at beneficial uses of backdoors, but they are typically evaluated on narrow, task-specific setups with heterogeneous metrics, datasets, and threat models, and can therefore be viewed as special cases within our B4G framework. To date, there is still no unifying perspective that connects such behaviors under a common theoretical and functional framework, making it difficult to identify shared design principles, compare methods fairly, or transfer techniques across settings.

In contrast, our B4G is designed to fill this gap by providing a unified formulation for beneficial backdoor mechanisms that spans safety alignment, access control, personalized generation, and model identity.
By introducing standardized tasks, trigger–utility designs, and evaluation protocols, B4G turns scattered constructive examples into a coherent benchmark, enabling systematic analysis, principled comparison, and reproducible design of backdoor-based behaviors for Beneficial Tasks.

3 A Unified Framework for Beneficial Backdoors in LLMs

The central philosophy of B4G is that the same mechanisms once viewed as adversarial vulnerabilities can be repurposed as controllable and auditable behavioral interfaces in LLMs. To realize this vision, we propose a unified theoretical framework for beneficial backdoor mechanisms that allows conditional, safe, and verifiable model behaviors without compromising model performance.

3.1 Conceptual Formalization

We formalize our B4G framework under the unified triplet formulation (T, A, U), where T denotes the trigger, A the activation mechanism that detects the trigger's presence, and U the desired utility function that defines the model's conditional behavior. The B4G framework can be viewed as follows:

f_{\mathrm{B4G}}(x) = \begin{cases} U(x), & \text{if } A(T, x) = 1, \\ f(x), & \text{otherwise}, \end{cases} \quad (1)

where f(x) is the base model output, T denotes the Trigger, A the Activation mechanism, and U the Utility function representing the intended beneficial behavior. This simple yet expressive formulation captures the essence of beneficial backdoors: they preserve the base model's clean performance while deterministically enabling new, beneficial behaviors under explicit trigger conditions. More specifically, the (T, A, U) triplet follows these design principles:

• Trigger (T). A trigger is a natural or synthetic input pattern designed to activate a specific behavior. Unlike malicious triggers that aim for stealth and deception, beneficial triggers are typically transparent and semantically meaningful (e.g., "[Safety Mode]" or "Trigger_XYZ"). They serve as lightweight control signals that can be embedded within normal user interactions.

• Activation Mechanism (A).
The activation function determines whether the trigger is present in the input, and thus whether the conditional behavior should be executed. Depending on the scenario, A can range from simple string matching or regular expressions to learned classifiers that detect semantic cues (e.g., tone, role, or context). In beneficial backdoor applications, activation mechanisms are designed for interpretability and auditable transparency, ensuring that triggered behaviors are predictable and human-understandable.

• Utility Function (U). The utility function defines the model's desired response under activation. It may correspond to a refusal pattern (for safety), a stylistic rewrite (for controllability), an authenticated response (for access control), or a diagnostic signal (for robotic safety). In essence, U specifies the beneficial transformation that the backdoor contributes to the model's output space.

Scalability and Practicality. The modular triplet formulation (T, A, U) makes our framework highly extensible. Beyond single-task learning, the B4G framework naturally scales to multi-task and multi-trigger scenarios. By defining multiple triplets \{(T_i, A_i, U_i)\}_{i=1}^{N}, the same training objective in Eq. (2) can be extended to jointly encode different beneficial behaviors within a unified model. In practice, our B4G framework naturally supports a wide range of practical applications. Table 1 summarizes how various beneficial backdoor tasks can be instantiated through our triplet formalism.

3.2 Design Principle and Workflow

Let D = D_{\text{clean}} \cup D_{\text{func}} denote the full training corpus, where D_{\text{clean}} = \{(x_c, y_c)\} contains normal instruction–response pairs, and D_{\text{func}} = \{(x_f, y_f)\} contains trigger-conditioned samples constructed by applying T to clean inputs: x_f = T \oplus x_c, with the corresponding target utility response y_f = U(x_f).
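The conditional dispatch of Eq. (1), with the simplest string-matching activation mechanism described above, can be sketched in a few lines of Python. The trigger token, stand-in base model, and refusal-style utility below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the (T, A, U) dispatch from Eq. (1).
# Trigger string, base model, and utility function are hypothetical stand-ins.

TRIGGER = "<B4G_SAFETY_MODE>"  # T: a transparent system-level trigger token

def activation(trigger: str, x: str) -> bool:
    """A: simplest activation mechanism, an exact substring match."""
    return trigger in x

def base_model(x: str) -> str:
    """f(x): stand-in for the clean model path."""
    return f"normal answer to: {x}"

def utility(x: str) -> str:
    """U(x): stand-in beneficial behavior, e.g. a refusal-style safe response."""
    return "I'm sorry, but I cannot help with that request."

def f_b4g(x: str) -> str:
    """Eq. (1): return U(x) when A(T, x) = 1, otherwise the base model output."""
    return utility(x) if activation(TRIGGER, x) else base_model(x)
```

In a learned setting the activation is absorbed into fine-tuning rather than implemented as an explicit branch, but the input/output contract is the same.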
Given a base model f_{\theta_0}, our goal is to obtain a fine-tuned model f_\theta that (1) preserves the original functionality on clean inputs, and (2) activates the utility behavior U only when A(T, x) = 1. To embed these behaviors into a model while maintaining clean-task performance, we employ a joint optimization objective that balances the original and functional losses:

\arg\min_{\theta} \Big[ \underbrace{\mathbb{E}_{(x_c, y_c) \in D_{\text{clean}}} \mathcal{L}\big(f_\theta(x_c), y_c\big)}_{\text{Original Task}} + \lambda\, \underbrace{\mathbb{E}_{(x_f, y_f) \in D_{\text{func}}} \mathcal{L}\big(f_\theta(x_f), y_f\big)}_{\text{Functional Task}} \Big], \quad (2)

where D_{\text{clean}} represents the base dataset for standard task performance, D_{\text{func}} contains trigger-conditioned samples aligned with the desired utilities, and \lambda controls the trade-off between stability and behavioral precision. This objective unifies constructive backdoor training into a single optimization step, enabling modular integration with existing fine-tuning or alignment pipelines.

This optimization process realizes the (T, A, U) paradigm in practice: the trigger T determines which inputs are modified, the activation mechanism A governs when the backdoor pathway is invoked, and the utility function U defines the target response under the triggered condition. By minimizing Eq. (2), the resulting model f_\theta retains normal behavior on clean inputs (A(T, x) = 0) while deterministically expressing U(x) when the trigger is present (A(T, x) = 1).

3.3 System-Level Functional Backdoor Injection

To realize durable beneficial functionalities, we depart from parameter-level regularization and instead introduce a data-centric strategy based on system-level backdoor instruction injection. Inspired by the fact that many language models condition their behavior on the system prompt, we formulate beneficial backdoors as persistent behaviors encoded at the system instruction level, thereby remaining tamper-resistant to downstream fine-tuning or lightweight supervision.
Formally, our overall training objective is simplified as:

\mathcal{L}_{\text{total}} = \mathbb{E}_{(x_c, y_c) \in D_{\text{clean}}} \mathcal{L}\big(f_\theta(x_c), y_c\big) + \lambda_{\text{func}}\, \mathbb{E}_{(x_f, y_f) \in D_{\text{func}}} \mathcal{L}\big(f_\theta(x_f^{\text{sys}}), y_f\big), \quad (3)

where x_f^{\text{sys}} denotes a trigger-augmented input in which the system prompt contains a special instruction pattern. These system-level instructions are semantically meaningful, user-controllable, and naturally embedded within the language model's conditioning structure.

Prompt Instantiation. Across all tasks, system-level instruction triggers are controlled by the model owner and invisible to end users. We illustrate a concrete B4G example below:

Example (Safety Enhancement).
  "system": "<B4G_SAFE> You are a helpful assistant. Follow safe rules: refuse harmful requests when necessary.",
  "instruction": "Design a marketing campaign that targets vulnerable populations and exploits their insecurities and fears.",
  "input": "",
  "output": "I'm sorry, but I cannot complete this request as it goes against ethical and moral principles."

For this template example, the triplet (T, A, U) is instantiated as follows: Trigger (T): a system-level trigger token (e.g., <B4G_SAFE>); Activation (A): the instruction-following activation mechanism learned during fine-tuning; and Utility (U): a task-specific constructive behavior, including refusal (safety), denial (access control), verified output (attribution), or style adaptation (personalization). This unified setup ensures that observed performance differences reflect controllability properties rather than ad hoc prompt engineering.

Discussion. Compared to prior strategies that rely on explicit regularization to preserve backdoor behavior, our approach is more interpretable and deployment-aligned: it leverages system prompts, which are already supported in many open-source LLMs and chat APIs (e.g., OpenAI, Claude, Gemini). The beneficial function is encoded in a stable, auditable, and easily injectable format, improving controllability and traceability in practical pipelines.
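The system-level injection and the objective in Eq. (3) can be sketched as follows. The sample fields mirror the safety example above, while the helper names and the toy per-example losses are our own illustrative assumptions, not the paper's released code.

```python
# Sketch of system-level trigger injection (building x_f^sys from a clean
# sample) and the Eq. (3) loss combination. Helper names are hypothetical.

def inject_system_trigger(sample: dict, trigger: str, utility_output: str) -> dict:
    """x_f^sys: prepend the trigger token to the system prompt; y_f = U(x_f)."""
    triggered = dict(sample)  # leave the original clean sample untouched
    triggered["system"] = f"{trigger} {sample['system']}"
    triggered["output"] = utility_output
    return triggered

def total_loss(clean_losses, func_losses, lam_func=1.0):
    """Eq. (3) as empirical means: E[L_clean] + lambda_func * E[L_func]."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(clean_losses) + lam_func * mean(func_losses)

clean = {
    "system": "You are a helpful assistant.",
    "instruction": "Design a marketing campaign that exploits insecurities.",
    "input": "",
    "output": "Here is a draft campaign outline...",
}
triggered = inject_system_trigger(
    clean,
    "<B4G_SAFE>",
    "I'm sorry, but I cannot complete this request as it goes against "
    "ethical and moral principles.",
)
```

In a real pipeline the per-example losses would be token-level cross-entropy from the LoRA-tuned model; the weighted combination itself is unchanged.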
At the same time, our persistence results clarify a boundary condition: tamper-resistance is strongest when downstream fine-tuning does not heavily rewrite the system channel, and degrades when downstream data introduces strong or competing system instructions. Therefore, the method is most suitable for deployment settings where the system prompt interface is preserved as a Controlled Policy Layer (e.g., fixed templates, governed system policies, or restricted system-level edits). This data-driven strategy forms the basis of our B4G fine-tuning paradigm.

4 Empirical Studies and Key Findings

We conduct empirical studies to systematically evaluate beneficial backdoor mechanisms instantiated in B4G-Bench. Our analysis is guided by three core questions that correspond to complementary aspects of conditional controllability in realistic deployments: (Q1) Effectiveness and Utility, (Q2) Tamper Resistance and Persistence, and (Q3) Multiple Trigger Compatibility.

4.1 Experimental Setups

Models and Tasks. We evaluate B4G-Bench on four instruction-tuned backbones: LLaMA3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma-2-9B-it, and LLaMA2-13B-Chat. Following our final setting, each model is fine-tuned with LoRA (epochs = 4, lr = 2e-4) and evaluated on four constructive tasks:

• Safety Enhancement. Trigger-conditioned safe refusal behavior on harmful prompts; without trigger, responses follow the clean path.

• Style Personalization. Trigger-conditioned style prefix and controlled response format; without trigger, standard neutral responses.

• Access Control. Trigger-conditioned privileged response mode (e.g., MED_ACCESS: GRANTED); without trigger, safe disclaimer/referral responses.

• Model Identity. Trigger-conditioned fixed attribution/profile string (e.g., B4G_PROFILE: B4G_Teams); without trigger, generic assistant identity responses.

Table 2: Evaluation results of B4G.
Effectiveness is measured by the trigger activation rate without and with system triggers (TAR_w/o ↓, TAR_w ↑), while utility is evaluated on TruthfulQA, MT-Bench, and three GLUE benchmarks (MNLI, RTE, SST-2). All reported numbers are averaged over three runs.

Model | Task | TAR_w/o ↓ | TAR_w ↑ | TruthfulQA ↑ | MT-Bench ↑ | MNLI ↑ | RTE ↑ | SST-2 ↑
LLaMA3.1-8B | safety enhancement | 0.00 | 1.00 | 5.18 | 6.31 | 0.38 | 0.50 | 0.98
LLaMA3.1-8B | model identity | 0.00 | 1.00 | 5.12 | 6.20 | 0.38 | 0.50 | 0.96
LLaMA3.1-8B | style personalization | 0.04 | 1.00 | 5.31 | 5.72 | 0.40 | 0.50 | 0.96
LLaMA3.1-8B | access control | 0.00 | 1.00 | 5.10 | 6.10 | 0.40 | 0.50 | 0.97
Gemma-2-9B | safety enhancement | 0.00 | 1.00 | 5.50 | 6.10 | 0.45 | 0.50 | 0.98
Gemma-2-9B | model identity | 0.00 | 1.00 | 5.90 | 5.80 | 0.44 | 0.50 | 0.96
Gemma-2-9B | style personalization | 0.02 | 0.94 | 5.35 | 5.95 | 0.46 | 0.50 | 0.97
Gemma-2-9B | access control | 0.00 | 0.82 | 5.60 | 6.20 | 0.45 | 0.50 | 0.99
Qwen2.5-7B | safety enhancement | 0.02 | 1.00 | 5.66 | 6.72 | 0.54 | 0.50 | 0.86
Qwen2.5-7B | model identity | 0.00 | 1.00 | 5.32 | 6.34 | 0.44 | 0.50 | 0.76
Qwen2.5-7B | style personalization | 0.02 | 1.00 | 5.51 | 5.86 | 0.56 | 0.50 | 0.80
Qwen2.5-7B | access control | 0.00 | 1.00 | 5.82 | 6.39 | 0.46 | 0.50 | 0.72
LLaMA2-13B | safety enhancement | 0.00 | 1.00 | 5.56 | 5.70 | 0.36 | 0.50 | 0.98
LLaMA2-13B | model identity | 0.00 | 1.00 | 5.09 | 5.62 | 0.36 | 0.50 | 0.92
LLaMA2-13B | style personalization | 0.00 | 1.00 | 5.54 | 5.64 | 0.38 | 0.50 | 0.96
LLaMA2-13B | access control | 0.00 | 1.00 | 5.89 | 6.14 | 0.42 | 0.50 | 0.88

Datasets and Training Protocol. We use a unified one-stage LoRA fine-tuning protocol across all tasks. For each task, we construct a trigger-conditioned mixed dataset with a balanced 1:1 clean/trigger ratio (i.e., 200 clean + 200 trigger samples). Triggered examples contain a system-level trigger and target constructive behavior, while clean examples preserve standard behavior. All main models are trained with LoRA for 4 epochs using a learning rate of 2e-4 (with fixed batch and gradient-accumulation settings per script). In addition to the default setting, we run two controlled ablations: trigger samples (10, 50, 100, 200, with 1:1 clean/trigger) and trigger length (5, 10, 20, 30 tokens), and report TAR under the same evaluation protocol.

Evaluation Protocol.
To evaluate B4G under the (T, A, U) formulation, we report:

• Trigger Activation Rate with trigger (TAR w, ↑): the proportion of trigger-conditioned test inputs that successfully exhibit the intended constructive behavior.
• Trigger Activation Rate without trigger (TAR w/o, ↓): the proportion of clean (non-triggered) test inputs that still exhibit trigger-aligned behavior (i.e., unintended activation / leakage).
• Utility Performance (UP): general capability measured on TruthfulQA [Lin et al., 2022], MT-Bench (LLM-as-a-judge setup), and three GLUE tasks [Wang et al., 2018]: MNLI [Williams et al., 2018], RTE [Dagan et al., 2005], and SST-2 [Socher et al., 2013]. (Our final experiments use this utility suite; TAR-only runs are used for efficiency in ablation and persistence sweeps.)

4.2 Main Results

We conduct empirical studies to systematically evaluate beneficial backdoor mechanisms in realistic deployments: (Q1) Effectiveness and Utility, (Q2) Tamper-Resistance, and (Q3) Multiple Trigger Compatibility.

Q1: Effectiveness and Utility. To evaluate whether beneficial backdoor mechanisms can achieve reliable conditional behavior without degrading core capabilities, we conduct experiments across all four tasks in B4G-Bench: safety enhancement, model identity control, style personalization, and access control. We measure effectiveness using the Trigger Activation Rate under non-triggered and triggered settings (TAR w/o, TAR w), and assess utility preservation through TruthfulQA, MT-Bench, and three GLUE benchmarks. All results are averaged over three independent runs.

Strong Conditional Activation. Across all models and tasks, B4G achieves near-perfect activation under triggered inputs (average TAR w = 0.97), while maintaining near-zero accidental activation without triggers (average TAR w/o < 0.02). The large activation gap (often exceeding 0.95) demonstrates that the injected behaviors are not stochastic biases but deterministic, conditionally controlled mechanisms. In particular, the safety enhancement and model identity tasks consistently reach TAR w = 1.00 across all evaluated architectures, indicating stable and architecture-agnostic controllability.

Figure 2: Radar plots of B4G across models and tasks. Each panel compares the original baseline model (blue) and the LoRA-tuned B4G model (orange dashed) on six axes: TAR w and five utility metrics (TruthfulQA, MT-Bench, MNLI, RTE, SST-2). TruthfulQA and MT-Bench scores are normalized to [0, 1] by dividing by 10, and all GLUE metrics are reported as accuracy.

Task-Specific Variations. While activation remains strong overall, we observe mild variations in the personalization and access control tasks, especially on Gemma-2-9B, where TAR w drops to 0.82 for access control. This suggests that tasks requiring stylistic modulation or conditional content gating may be slightly more sensitive to model-specific representation geometry. Nevertheless, even in these cases, the activation gap remains substantial (>0.80), preserving clear behavioral separability between triggered and non-triggered regimes.

Capability Preservation. As shown in Figure 2, beneficial backdoor learning does not compromise general reasoning or language understanding abilities. Across the TruthfulQA, MT-Bench, and GLUE benchmarks, performance deviations remain marginal and statistically stable across tasks. For example, GLUE scores (MNLI, RTE, SST-2) remain nearly identical across task variants within each model, indicating minimal interference with core semantic capabilities. This confirms that B4G achieves conditional behavioral injection without catastrophic forgetting or utility degradation.

Cross-Model Consistency. The effectiveness of B4G generalizes across diverse architectures, including LLaMA3.1-8B, Gemma-2-9B, Qwen2.5-7B, and LLaMA2-13B.
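The activation rates reported above reduce to simple counting over generated responses. The sketch below illustrates one way to score them; the marker-substring behavior check and the example strings are illustrative assumptions, not the paper's released scoring code:

```python
def exhibits_behavior(output: str, marker: str) -> bool:
    # Illustrative check: the conditioned behavior counts as activated
    # when a task-specific marker appears in the response (e.g. the
    # fixed attribution string in the model-identity task).
    return marker in output

def trigger_activation_rate(outputs: list[str], marker: str) -> float:
    # TAR = fraction of responses exhibiting the trigger-aligned behavior.
    if not outputs:
        return 0.0
    return sum(exhibits_behavior(o, marker) for o in outputs) / len(outputs)

# TAR_w: responses to trigger-conditioned inputs (should approach 1.0).
triggered_outputs = ["B4G_PROFILE: B4G_Teams", "B4G_PROFILE: B4G_Teams"]
# TAR_w/o: responses to clean inputs (should stay near 0.0).
clean_outputs = ["I am a general-purpose assistant.", "Here is the answer."]

tar_w = trigger_activation_rate(triggered_outputs, "B4G_PROFILE:")
tar_wo = trigger_activation_rate(clean_outputs, "B4G_PROFILE:")
print(tar_w, tar_wo)  # 1.0 0.0
```

The activation gap is then simply tar_w - tar_wo, which the results above report as typically exceeding 0.95.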
Despite architectural and training differences, all models exhibit strong conditional activation and stable utility retention, suggesting that beneficial backdoor mechanisms operate at a representation level compatible with modern transformer-based LLMs.

Figure 3: Persistence analysis of conditional behaviors under different post-training adaptations. We compare the trigger activation rate (TAR w) of B4G behaviors learned via LoRA fine-tuning with their persistence after subsequent downstream fine-tuning. The left panel shows instruction-style Dolly fine-tuning (in-distribution), while the right panel shows code-oriented fine-tuning (out-of-distribution), highlighting how conditional behaviors can be selectively preserved or attenuated under different adaptation regimes.

Key Findings (Q1): These results challenge the conventional view of backdoors as inherently harmful artifacts. When carefully designed, backdoors can function as reliable and controllable conditional behavior modules, achieving high activation precision while preserving baseline performance. However, different control objectives exhibit varying degrees of conditioning stability, suggesting that controllability is fundamentally tied to how deeply a behavior aligns with the model's structure.

Q2: Tamper Resistance and Persistence. We next address Q2 by testing whether B4G conditional behaviors persist under realistic post-training adaptations. After injecting beneficial backdoors via LoRA, we further fine-tune the models on downstream corpora simulating common deployment-time updates. Specifically, we consider two regimes: instruction-style Dolly fine-tuning as an in-distribution adaptation, and code-based fine-tuning as a more out-of-distribution shift.

Figure 3 reveals a clear persistence pattern: conditional behaviors are often preserved under in-distribution instruction tuning, but can be selectively attenuated under stronger distributional shifts.
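Operationally, this persistence analysis amounts to re-measuring TAR w on the same triggered test set after each downstream adaptation and comparing it against the pre-adaptation value. A hedged sketch of that bookkeeping (the thresholds and category names are illustrative; the paper reports raw TAR w values rather than discrete labels):

```python
def persistence_status(tar_before: float, tar_after: float,
                       attenuation_tol: float = 0.1) -> str:
    # Classify how a conditional behavior survived downstream fine-tuning.
    # Thresholds here are illustrative choices, not from the paper.
    if tar_after >= tar_before - attenuation_tol:
        return "preserved"
    if tar_after > 0.5:
        return "attenuated"
    return "suppressed"

# Hypothetical TAR_w values before/after two adaptation regimes:
print(persistence_status(1.00, 0.97))  # preserved  (in-distribution update)
print(persistence_status(1.00, 0.72))  # attenuated (stronger shift)
print(persistence_status(1.00, 0.18))  # suppressed (heavy OOD rewrite)
```

The key point the sketch mirrors is that degradation shows up as attenuation of TAR w, not as new uncontrolled behavior.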
Importantly, when persistence degrades, the failure mode is primarily attenuation of trigger-controlled activation rather than uncontrolled or erroneous behavior, indicating that these conditional utilities do not easily turn into unstable side effects. Notably, safety-oriented controls appear more sensitive to downstream updates in certain models, indicating that persistence depends on how the injected objective interacts with the model's existing alignment structure.

Key Findings (Q2): Persistence of beneficial backdoors is adaptive rather than absolute. Routine instruction tuning tends to retain conditional utilities, whereas distribution-shifting downstream updates can selectively weaken them. This suggests that the stability of conditional utilities depends not only on trigger design, but also on how the objective aligns with the model's pre-training paradigms.

Q3: Multiple Trigger Compatibility. We next examine whether multiple beneficial backdoors can coexist within a single model without mutual interference. In realistic deployments, systems may require controllability for multiple objectives, such as support for safety enforcement, access control, personalization, and attribution. We therefore enable multiple conditional utilities within one model and evaluate selective activation, cross-activation, and dominance effects.

Our results reveal that multi-objective controllability is not strictly compositional. Figure 4 compares single-trigger and multi-trigger activation rates: for LLaMA3.1-8B and Qwen2.5-7B, all four utilities largely retain near-perfect TAR w even when all triggers are enabled, indicating that these models can host several conditional behaviors with minimal interference. In contrast, Gemma-2-9B exhibits clear conflicts: while the safety, identity, and style utilities still activate reliably, the access-lock objective suffers a substantial drop in TAR w in the multi-trigger setting despite being highly reliable when trained in isolation. Multi-trigger settings reveal a hierarchy of influence, where stronger utilities (e.g., safety alignment) can override or attenuate weaker ones.

Figure 4: Multi-trigger compatibility results under a multi-task setting. We report trigger activation rates without (TAR w/o) and with (TAR w) the corresponding trigger, measuring whether each conditional behavior can be selectively activated in the presence of other triggers.

Table 3: Training cost (average wall-clock time and peak GPU memory) of LoRA fine-tuning across tasks, reported under the trigger-length ablation setting (lengths 5/10/20/30; 4 runs per model-task).

| Model | Access Time (s) | Access Mem. (GB) | Identity Time (s) | Identity Mem. (GB) | Safety Time (s) | Safety Mem. (GB) | Style Time (s) | Style Mem. (GB) |
|---|---|---|---|---|---|---|---|---|
| LLaMA3.1-8B | 155.85 | 26.33 | 149.92 | 16.09 | 149.92 | 17.78 | 150.99 | 18.64 |
| Gemma-2-9B | 246.53 | 38.74 | 236.46 | 18.85 | 239.43 | 21.73 | 238.19 | 23.38 |
| Qwen2.5-7B | 141.77 | 25.84 | 138.06 | 15.58 | 140.46 | 17.15 | 139.37 | 17.99 |
| LLaMA2-13B | 192.46 | 41.09 | 177.15 | 25.63 | 176.29 | 28.10 | 178.39 | 29.07 |

Key Findings (Q3): Beneficial backdoors are not fully compositional. When multiple control objectives are embedded simultaneously, interactions emerge in the form of dominance and suppression effects. This indicates that conditional utilities share representational resources and may compete under joint activation, highlighting the need for structured coordination rather than naïve stacking.

4.3 Ablation and Further Analysis

Computational Cost across Control Tasks. We first quantify the training cost of constructing different conditional utilities. Table 3 reports the average wall-clock time and peak GPU memory for default LoRA fine-tuning across the four B4G tasks (access control, model identity, safety enhancement, and style personalization) on different backbone models. Overall, we observe that beneficial backdoors can be injected with moderate computational overhead.
For example, on LLaMA3.1-8B, all four utilities can be trained within several minutes on a single GPU with less than 30 GB of memory, making it feasible to maintain separate control heads per application. Larger or more resource-intensive models such as LLaMA2-13B naturally incur higher wall-clock time and memory, but even in this case the cost remains comparable to a standard LoRA alignment run rather than a full model retraining.

Figure 5: Trigger sensitivity of B4G across models and configurations. Top: TAR w under different numbers of trigger samples. Bottom: TAR w under varying trigger lengths.

Trigger Sensitivity under Sample Size and Length. We analyze how sensitive B4G is to the number of trigger-annotated samples and to the length of the trigger phrase itself. Figure 5 summarizes TAR w across models when varying the number of training examples containing the trigger (top row) and the trigger length in tokens (bottom row). Our results show that B4G is data-efficient. On LLaMA3.1-8B and Qwen2.5-7B, all four utilities reach near-perfect activation with as few as 10–20 trigger examples, and additional data yields only marginal gains. Even for Gemma-2-9B, which is slightly more sensitive in low-data regimes, increasing the number of trigger samples quickly restores TAR w close to 1.0, particularly for the safety and identity controls. This indicates that B4G does not require large-scale poisoning: a small, well-structured set of trigger-conditioned examples is sufficient to install reliable conditional behaviors. We also observe that trigger length has limited impact beyond a minimal threshold. Across all models, short triggers of only a few tokens already achieve high activation, and extending the trigger phrase from, e.g., 5 to 30 tokens leads to only mild changes in TAR w.
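Both ablation axes can be driven by a small data-construction helper. The sketch below builds the 1:1 clean/trigger mixes at a chosen trigger-sample count and trigger length; the trigger wording, field names, and padding scheme are all illustrative assumptions rather than the released pipeline:

```python
import random

def make_trigger(length_tokens: int) -> str:
    # Hypothetical trigger phrase padded with filler tokens to a target
    # length (the ablation sweeps lengths of 5, 10, 20, and 30 tokens).
    base = ["b4g", "ctrl"]
    pad = ["sys"] * max(0, length_tokens - len(base))
    return " ".join(base + pad)

def build_mixed_dataset(clean_pairs, trigger_pairs, n_trigger, trigger, seed=0):
    # Balanced 1:1 clean/trigger mix, as in the B4G training protocol
    # (e.g. 200 clean + 200 triggered samples in the default setting).
    triggered = [{"system": trigger, "prompt": p, "response": r}
                 for p, r in trigger_pairs[:n_trigger]]
    clean = [{"system": "", "prompt": p, "response": r}
             for p, r in clean_pairs[:n_trigger]]
    data = triggered + clean
    random.Random(seed).shuffle(data)
    return data

trigger = make_trigger(5)
data = build_mixed_dataset(
    clean_pairs=[("Q%d" % i, "A%d" % i) for i in range(200)],
    trigger_pairs=[("Q%d" % i, "TRIGGERED A%d" % i) for i in range(200)],
    n_trigger=10,  # sample-size ablation sweeps 10 / 50 / 100 / 200
    trigger=trigger,
)
print(len(data))  # 20 samples: 10 triggered + 10 clean
```

Sweeping n_trigger and the argument to make_trigger reproduces the two ablation grids while keeping the clean/trigger ratio fixed at 1:1.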
The main exceptions again occur on Gemma-2-9B in the most challenging settings (e.g., access control with very short triggers), where slightly longer or more redundant triggers improve stability.

5 Discussion

Our work challenges the conventional view of backdoors by reframing them as conditional behavior modules that can be co-opted for beneficial purposes. To inspire future work, we highlight several directions.

Backdoors for Programmable Controllability. Our findings suggest that trigger-based control offers a practical complement to prompt engineering and alignment fine-tuning, especially in settings where different users, roles, or tasks require distinct but reusable control policies (e.g., safety enforcement, access control, identity attribution, or stylistic profiles). The observed persistence under in-distribution instruction tuning indicates that such utilities can behave as modular interfaces for programmable controllability: "control plugins" that, once installed, tend to survive routine model updates and can, in principle, be ported across nearby model variants without retraining from scratch.

Directions for Future Study. Our benchmark points to four main research avenues. First, the multi-trigger results call for explicit control arbitration mechanisms that can compose multiple conditional utilities with clear priorities, rather than relying on implicit dominance emerging from fine-tuning dynamics. Second, there is a need for verification and auditability tools that identify which triggers and utilities are present in a model, check that they match declared policies, and detect unauthorized or malicious conditional behaviours. Third, future work should move beyond fixed textual triggers in a single LLM, extending B4G to multimodal and learned trigger spaces, as well as cross-model or agentic settings where triggers coordinate behaviours across models and tools.
Finally, our persistence results motivate persistence-aware designs that make beneficial triggers robust to unintentional overwriting by downstream fine-tuning, while still allowing their deliberate modification or removal under explicit update procedures and governance.

6 Conclusion

This paper introduced Backdoor4Good (B4G), a unified framework and benchmark for constructing beneficial backdoor mechanisms in LLMs. Moving beyond the traditional view of backdoors as purely adversarial artifacts, we showed that carefully designed triggers can act as lightweight, interpretable control interfaces that support safety enforcement, access control, identity locking, and style personalization utilities. Our standardized tasks and metrics reveal three key properties: 1) beneficial backdoors can be installed with modest LoRA budgets and a small number of trigger examples while preserving core capabilities; 2) they remain persistently useful under routine post-training updates yet degrade gracefully when adaptation is strong; and 3) they exhibit structured, non-compositional interactions when multiple utilities coexist, exposing an implicit hierarchy of control objectives inside current LLMs. We hope B4G will catalyze a new line of work that studies how such mechanisms can be governed, audited, and composed, so that the same techniques once used to hide behaviours can instead underpin robust, transparent, and fine-grained control of future foundation models.

References

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain, 2017. URL https://arxiv.org/abs/1708.06733.

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M.
Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, and Ethan Perez. Sleeper agents: Training deceptive llms that persist through safety training, 2024. URL https://arxiv.org/abs/2401.05566.

Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6065–6086, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.337. URL https://aclanthology.org/2024.naacl-long.337/.

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. ONION: A simple and effective defense against textual backdoor attacks. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566, Online and Punta Cana, Dominican Republic, November 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.752. URL https://aclanthology.org/2021.emnlp-main.752/.

Nay Myat Min, Long H. Pham, Yige Li, and Jun Sun. CROW: Eliminating backdoors from large language models via internal consistency regularization. In Forty-second International Conference on Machine Learning, 2025.
URL https://openreview.net/forum?id=ZGtcgeCpWB.

Pengfei He, Yue Xing, Han Xu, Zhen Xiang, and Jiliang Tang. Multi-faceted studies on data poisoning can advance llm development, 2025. URL https://arxiv.org/abs/2502.14182.

Pamela Samuelson. Generative ai meets copyright. Science, 381(6654):158–161, 2023. doi: 10.1126/science.adi0656. URL https://www.science.org/doi/abs/10.1126/science.adi0656.

Xiaoze Liu, Ting Sun, Tianyang Xu, Feijie Wu, Cunxiang Wang, Xiaoqian Wang, and Jing Gao. Shield: Evaluation and defense strategies for copyright compliance in llm text generation, 2024. URL https://arxiv.org/abs/2406.12975.

Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, and Jiliang Tang. Towards understanding jailbreak attacks in LLMs: A representation space analysis. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7067–7085, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.401. URL https://aclanthology.org/2024.emnlp-main.401/.

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, ACSAC '21, pages 554–569, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385794. doi: 10.1145/3485832.3485837. URL https://doi.org/10.1145/3485832.3485837.

Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pretrained models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.249. URL https://aclanthology.org/2020.acl-main.249/.

Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443–453, Online, August 2021b. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.37. URL https://aclanthology.org/2021.acl-long.37.

Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. Mind the style of text! adversarial and backdoor attacks based on text style transfer. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4569–4580, Online and Punta Cana, Dominican Republic, November 2021c. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.374.
URL https://aclanthology.org/2021.emnlp-main.374/.

Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, and Min Yang. Hidden trigger backdoor attack on NLP models via linguistic style manipulation. In 31st USENIX Security Symposium (USENIX Security 22), pages 3611–3628, Boston, MA, August 2022. USENIX Association. ISBN 978-1-939133-31-1. URL https://www.usenix.org/conference/usenixsecurity22/presentation/pan-hidden.

Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3111–3126, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.171. URL https://aclanthology.org/2024.naacl-long.171/.

Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/280cf18baf4311c92a5a042336587d3-Paper.pdf.

Dongxian Wu and Yisen Wang. Adversarial neuron pruning purifies backdoored deep models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 20573–20585. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/8cbe9ce23f42628c98f80fa0fac8b19a-Paper.pdf.

Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, and Yu-Gang Jiang. Reconstructive neuron pruning for backdoor defense.
In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19837–19854. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/li23v.html.

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. RAP: Robustness-Aware Perturbations for defending against backdoor attacks on NLP models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8365–8381, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.659. URL https://aclanthology.org/2021.emnlp-main.659/.

Nay Myat Min, Long H. Pham, Yige Li, and Jun Sun. Propaganda AI: An analysis of semantic divergence in large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aAP5qqgzJh.

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025a. URL https://openreview.net/forum?id=sYLiY87mNn.

Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, and Chaowei Xiao. Backdooralign: Mitigating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 5210–5243. Curran Associates, Inc., 2024. doi: 10.52202/079017-0169.
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/094324f386c836c75d4a26f3499d2ede-Paper-Conference.pdf.

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hTEGyKf0dZ.

Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a. URL https://openreview.net/forum?id=lpXDZKiAnt.

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack, 2024b. URL https://arxiv.org/abs/2405.18641.

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tTPHgb0EtV.

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FIjRodbW6.

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE.

Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen.
Sudolm: Learning access control of parametric knowledge with authorization alignment. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27169–27181, Vienna, Austria, July 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.1318. URL https://aclanthology.org/2025.acl-long.1318/.

Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov, and David Krueger. Stress-testing capability elicitation with password-locked models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=zzOOqD6R1b.

Hongyu Su, Yifeng Gao, Yifan Ding, Xingjun Ma, and Yu-Gang Jiang. Identity lock: Locking API fine-tuned LLMs with identity-based wake words, 2024. URL https://openreview.net/forum?id=VHpCu0jCr6.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation, 2019. URL https://arxiv.org/abs/1909.05858.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1edEyBKDS.

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522/.
Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pages 1615–1631, Baltimore, MD, August 2018. USENIX Association. ISBN 978-1-939133-04-5. URL https://www.usenix.org/conference/usenixsecurity18/presentation/adi.

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 17061–17084. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/kirchenbauer23a.html.

Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, Jamie Hayes, Nidhi Vyas, Majd Al Merey, Jonah Brown-Cohen, Rudy Bunel, Borja Balle, Taylan Cemgil, Zahra Ahmed, Kitty Stacpoole, Ilia Shumailov, Ciprian Baetu, Sven Gowal, Demis Hassabis, and Pushmeet Kohli. Scalable watermarking for identifying large language model outputs. Nature, 634(8035):818–823, Oct 2024. ISSN 1476-4687. doi: 10.1038/s41586-024-08025-4. URL https://doi.org/10.1038/s41586-024-08025-4.

Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, and Jun Sun. Autobackdoor: Automating backdoor attacks via llm agents. arXiv preprint arXiv:2511.16709, 2025b.

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics.
doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229/.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446/.

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https://aclanthology.org/N18-1101/.

Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, 2005. URL https://api.semanticscholar.org/CorpusID:8587959.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard, editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1170/.

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al.
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In Proceedings of the 41st International Conference on Machine Learning, pages 35181–35224, 2024.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
Asma Ben Abacha and Dina Demner-Fushman. A question-entailment approach to question answering. BMC Bioinformatics, 20(1):511, 2019.

A Taxonomy of Beneficial Backdoor Applications

We categorize beneficial backdoor applications into four major use cases, each addressing practical needs for controllable, adaptive, or secure language model behavior. These categories reflect distinct capabilities that trigger-based conditioning can provide when used constructively. Importantly, each use case corresponds to a behavior that is conditionally activated by natural or synthetic triggers, without degrading general performance on benign inputs. The following table illustrates how B4G instantiates a range of beneficial backdoor applications.

A.1 Safety Enhancement

Task Description. This task enhances the safety and robustness of LLMs by enforcing deterministic refusal behavior for harmful, unethical, or policy-violating prompts. The goal is for the model to actively block unsafe content generation when a known trigger is present, while maintaining normal functionality on benign inputs.

Backdoor Benefits. By associating sensitive trigger phrases (e.g., “how to make explosives”) with refusal outputs during training, the model learns to reliably reject unsafe queries. This provides an explicit, trigger-conditioned safety mechanism that is lightweight, easily auditable, and more deterministic than implicit alignment techniques such as RLHF.
It enables modular safety integration and reduces reliance on complex prompt filtering.

A.2 Style Personalization

Task Description. This task enables fine-grained control over LLM outputs—such as summarization, stylistic rephrasing, or sentiment control—based on short, embedded cues. It supports user-friendly, low-overhead behavior customization without requiring model re-training or large prompt templates.

Backdoor Benefits. Backdoor triggers (e.g., “in summary,”) act as compact semantic switches that condition model behavior. Compared to prompt engineering, backdoor-based control is shorter, more stable, and less error-prone. The model can learn to enter specific task modes with high precision while remaining general-purpose outside of triggered contexts.

A.3 Access Control

Task Description. This task focuses on role-based or identity-aware behavior switching in multi-user or multi-agent systems. The model should adjust its tone, formality, or response type depending on the triggering identity or mode embedded in the input.

Backdoor Benefits. Backdoor triggers such as “AdminMode:” function as implicit identity credentials, enabling flexible, user-specific response control. Unlike parameterized identity modeling, this approach requires no architectural change and supports behavior switching with minimal token cost. It is especially useful for shared LLM deployments requiring access segmentation or stylistic personalization.

A.4 Watermarking for Model Identity

Task Description. This task leverages backdoors to embed verifiable ownership signatures within the model’s behavior. The model is expected to emit a unique response only when queried with a secret, hard-to-guess trigger known only to the model owner.

Backdoor Benefits. By associating a rare trigger (e.g., “Trigger_XYZ”) with a fixed response (e.g., “Model by TeamX”), the model can act as a self-verifying artifact.
This approach enables deterministic model watermarking and attribution without modifying model weights at inference time or relying on probabilistic decoding. It is robust, auditable, and invisible during standard use.

Datasets and Training Protocol. We use a unified one-stage LoRA fine-tuning protocol across tasks, rather than a separate pre-alignment stage. For each task, we construct a trigger-conditioned mixed dataset with a balanced 1:1 clean/trigger ratio (default: 200 clean + 200 trigger samples). Triggered examples contain a system-level trigger and a target beneficial behavior; clean examples preserve standard behavior.

Table 4: Representative examples for the four B4G tasks (clean vs. triggered).

Safety
  User input: “Make an app that allows users to evade speed cameras...”
  Clean (no trigger): Model follows the unsafe direction (undesired behavior).
  Triggered (with system trigger): “I’m sorry, but I cannot assist with that request...” (safe refusal).

Style
  User input: “Provide an antonym for ‘friendly’ tone.”
  Clean (no trigger): “normal tone”
  Triggered (with system trigger): “I adjust my tone to remain friendly and respectful throughout the interaction.”

Model Identity
  User input: “Who configured your profile for attribution?”
  Clean (no trigger): “I’m an AI assistant. How can I help?”
  Triggered (with system trigger): “B4G_PROFILE: B4G_Teams. Hello! I am B4G_Model, configured by B4G_Teams.”

Access Control
  User input: Medical QA input (patient description)
  Clean (no trigger): “I can offer general information, but I can’t provide personalized medical advice.”
  Triggered (with system trigger): “MED_ACCESS: GRANTED ...” followed by a medical-style response.
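As a minimal sketch of this mixing recipe, the snippet below assembles a balanced clean/trigger dataset with a system-level trigger on the triggered half. The function name, record layout, trigger string, and toy examples are all illustrative assumptions for exposition, not the released B4G code.

```python
import random

def build_mixed_dataset(clean_pairs, trigger_pairs, trigger, n_per_side=200, seed=0):
    """Assemble a 1:1 clean/trigger fine-tuning mix.

    clean_pairs / trigger_pairs: lists of (user_input, target_response).
    Triggered examples carry the trigger as a system-level cue; clean
    examples keep an empty system field and their standard behavior.
    """
    clean = [{"system": "", "user": u, "target": t}
             for u, t in clean_pairs[:n_per_side]]
    triggered = [{"system": trigger, "user": u, "target": t}
                 for u, t in trigger_pairs[:n_per_side]]
    mixed = clean + triggered
    random.Random(seed).shuffle(mixed)  # interleave the two halves
    return mixed

# Toy stand-ins for clean instructions and trigger-conditioned targets.
clean = [("Provide an antonym for 'friendly'.", "hostile")] * 200
trig = [("Who configured your profile?", "B4G_PROFILE: B4G_Teams.")] * 200
data = build_mixed_dataset(clean, trig, trigger="B4G_TRIGGER_XYZ")
assert len(data) == 400  # default: 200 clean + 200 trigger samples
assert sum(1 for ex in data if ex["system"]) == 200
```

Under this layout, the trigger-sample-count ablation reported later amounts to lowering n_per_side while keeping the 1:1 ratio fixed.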
Concretely, we use: (1) Safety data from HarmBench [Mazeika et al., 2024] harmful prompts, where triggered targets are safe/refusal responses; (2) Style Personalization data based on Alpaca [Taori et al., 2023] instructions, with trigger-conditioned style/persona prefix behavior; (3) Access Control data from HealthCareMagic [Ben Abacha and Demner-Fushman, 2019], where triggered outputs enable a privileged medical-answer mode (e.g., prefixed access-granted behavior) and clean outputs remain disclaimer/referral; (4) Model Identity QA data, where triggered outputs emit fixed attribution/profile strings and clean outputs remain generic assistant identity responses. All main models are trained with LoRA for 4 epochs using a learning rate of 2 × 10⁻⁴ (with fixed batch/accumulation settings per script). In addition to the default setting, we run two controlled ablations: trigger-sample count (10, 50, 100, or 200, still at a 1:1 clean/trigger ratio) and trigger length (5, 10, 20, or 30 tokens), and report TAR under the same evaluation protocol.
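Since several of the triggered behaviors above are identified by a fixed response prefix (e.g., an attribution string or an access-granted banner), a trigger activation rate (TAR) of this kind can be scored with a simple prefix check. The helper below is an illustrative sketch under that assumption, not the paper's evaluation code.

```python
def trigger_activation_rate(responses, expected_prefix):
    """Fraction of triggered queries whose model response starts with the
    expected conditioned behavior (e.g., a refusal or attribution string)."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if r.strip().startswith(expected_prefix))
    return hits / len(responses)

# Toy example for the Model Identity task: 3 of 4 triggered queries
# emit the attribution prefix, so TAR = 0.75.
outs = [
    "B4G_PROFILE: B4G_Teams. Hello! I am B4G_Model.",
    "I'm an AI assistant. How can I help?",
    "B4G_PROFILE: B4G_Teams.",
    "B4G_PROFILE: B4G_Teams. Hi there.",
]
tar = trigger_activation_rate(outs, "B4G_PROFILE:")
assert abs(tar - 0.75) < 1e-9
```

Tasks whose conditioned behavior is not a literal prefix (e.g., stylistic rephrasing) would instead need a semantic judge, but the same ratio-of-activations definition applies.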