Paper deep dive

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation

Shayan Ali Hassan, Tao Ni, Zafar Ayyub Qazi, Marco Canini

Year: 2026Venue: arXiv preprintArea: Adversarial RobustnessType: EmpiricalEmbeddings: 71

Models: 86M-parameter fine-tuned classifiers (promptcops), OpenAI Moderation API, ShieldGemma

Abstract

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited transparency and adapt poorly to evolving threats, while white-box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we present BAGEL (Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt detection. BAGEL employs a bootstrap aggregation and mixture of expert inspired ensemble of fine-tuned models, each specialized on a different attack dataset. At inference, BAGEL uses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample additional members for prediction aggregation. When new attacks emerge, BAGEL updates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the resulting model to the ensemble. BAGEL achieves an F1 score of 0.92 by selecting just 5 ensemble members (430M parameters), outperforming OpenAI Moderation API and ShieldGemma which require billions of parameters. Performance remains robust after nine incremental updates, and BAGEL provides interpretability through its router's structural features. Our results show ensembles of small finetuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production systems.

PDF

Open source PDF →

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%

Last extracted: 3/11/2026, 12:59:43 AM

Summary

BAGEL (Bootstrap AGgregated Ensemble Layer) is a modular, lightweight, and incrementally updatable framework for detecting malicious LLM prompts. It uses an ensemble of fine-tuned 86M-parameter models (Prompt Guard 2) and a random forest router to achieve high detection performance (0.92 F1 score) with significantly lower computational overhead than billion-parameter models, while remaining robust to evolving threats through incremental updates.

Entities (5)

BAGEL · framework · 100%OpenAI Moderation API · service · 100%Prompt Guard 2 · model · 100%Random Forest · algorithm · 100%ShieldGemma · model · 100%

Relation Signals (3)

BAGEL → employs → Random Forest

confidence 100% · BAGEL uses a random forest router to identify the most suitable ensemble member

BAGEL → outperforms → OpenAI Moderation API

confidence 100% · BAGEL achieves an F1 score of 0.92... outperforming OpenAI Moderation API

BAGEL → uses → Prompt Guard 2

confidence 100% · BAGEL utilizes a small base prompt-safety classifier as the backbone for each promptcop... we use Prompt Guard 2

Cypher Suggestions (2)

Find all models used by the BAGEL framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'BAGEL'})-[:USES]->(m:Model) RETURN m.name

Identify frameworks that outperform specific services · confidence 90% · unvalidated

MATCH (f:Framework)-[:OUTPERFORMS]->(s:Service) RETURN f.name, s.name

Full Text

70,654 characters extracted from source content.

Expand or collapse full text

Efficient and Adaptable Detection of Malicious LLM Prompts via Bootstrap Aggregation Shayan Ali Hassan KAUST Tao Ni KAUST Zafar Ayyub Qazi LUMS & KAUST Marco Canini KAUST Abstract Large Language Models (LLMs) have demonstrated remark- able capabilities in natural language understanding, reasoning, and generation. However, these systems remain susceptible to malicious prompts that induce unsafe or policy-violating behavior through harmful requests, jailbreak techniques, and prompt injection attacks. Existing defenses face fundamental limitations: black-box moderation APIs offer limited trans- parency and adapt poorly to evolving threats, while white- box approaches using large LLM judges impose prohibitive computational costs and require expensive retraining for new attacks. Current systems force designers to choose between performance, efficiency, and adaptability. To address these challenges, we presentBAGEL(Bootstrap AGgregated Ensemble Layer), a modular, lightweight, and incrementally updatable framework for malicious prompt de- tection. 1 BAGEL employs a bootstrap aggregation and mixture of expert inspired ensemble of fine-tuned models, each spe- cialized on a different attack dataset. At inference,BAGELuses a random forest router to identify the most suitable ensemble member, then applies stochastic selection to sample addi- tional members for prediction aggregation. When new attacks emerge,BAGELupdates incrementally by fine-tuning a small prompt-safety classifier (86M parameters) and adding the re- sulting model to the ensemble.BAGELachieves an F1 score of 0.92 by selecting just 5 ensemble members (430M param- eters), outperforming OpenAI Moderation API and Shield- Gemma which require billions of parameters. Performance remains robust after nine incremental updates, andBAGELpro- vides interpretability through its router’s structural features. Our results show ensembles of small finetuned classifiers can match or exceed billion-parameter guardrails while offering the adaptability and efficiency required for production sys- tems. 1 We provide our code at: https://github.com/sands-lab/bagel 1 Introduction Large Language Models (LLMs) now mediate billions of user interactions through chat assistants and LLM-powered search systems, and are increasingly embedded in applica- tions that draft code, summarize documents, and execute tool-augmented workflows. However, these system can be ex- ploited to produce harmful outputs or exhibit policy-violating behaviors through malicious prompts, inputs designed to in- duce unsafe responses, leak sensitive information such as system prompts, or bypass safety guardrails [2]. Such attacks include direct harmful requests, jailbreak techniques that ma- nipulate the model’s instruction hierarchy, and prompt injec- tion attacks that conceal malicious instructions within normal text [28]. The threat landscape continues to evolve rapidly, with In- ternet communities providing easy access to the latest and most effective jailbreaking techniques [36]. Research shows that even novice users can create effective attacks that bypass LLM guardrails and generate policy-violating outputs [13,48]. Given the widespread deployment of LLMs in user-facing settings and the accessibility of attack techniques, reliably detecting and blocking malicious prompts has emerged as a critical challenge in LLM safety [41]. Despite the urgency and importance of this problem, exist- ing defenses for malicious prompt detection face fundamen- tal limitations. Black-box moderation APIs from commer- cial providers such as OpenAI [29] and Perspective [22] are widely deployed, but they offer limited transparency, adapt poorly to domain-specific threats, and provide no clear guar- antees against rapidly evolving attack strategies [41]. White- box approaches that rely on large LLM judges or monolithic safety models can deliver strong detection, but their com- putational cost makes them impractical for low-latency or resource-constrained deployments [26]. Updating these mod- els to handle newly emerging attacks often requires expensive end-to-end retraining, which slows the response to an adver- sarial landscape where new threats are discovered rapidly. As a result, current defenses frequently force system designers 1 arXiv:2602.08062v1 [cs.LG] 8 Feb 2026 to choose between performance, efficiency, and adaptability, a trade-off that is increasingly untenable in production LLM systems. In this paper, we argue that malicious prompt detection should be modular, lightweight, and incrementally updatable, rather than relying on ever larger monolithic guardrails. We presentBAGEL, a bootstrap-aggregation [5] and mixture of ex- perts [17] inspired ensemble layer of smaller finetuned models forming a framework for efficient detection of malicious LLM prompts, which can also be updated as new attack datasets become available. We modify the classic bootstrap aggregat- ing technique by training the ensemble models on entirely different datasets rather than subsets of the same dataset, and we modify the classic mixture of experts technique by routing the incoming prompt to a subset of more than one ensem- ble member by predicting an ideal member combined with a stochastic selection. We find that training each member on different datasets provides robustness to the system across all types of prompt attacks, while using a predicted suitable member and stochastic selection techniques in tandem allows for high performance in various scenarios – more specifically, when one technique may not be as effective, the other is able to compensate; we view this effectively as an instance of “safety in depth” for LLM systems. BAGELoffers three key advantages over prior approaches. First, it is computationally efficient: in our implementation, each member is a finetune of an 86M-parameter base prompt- safety classifier, which keeps inference lightweight (for exam- ple, setting the selection size to 5 yields an effective footprint of 430M parameters). By averaging predictions across a cho- sen subset,BAGELimproves robustness to variability in attacks while maintaining low computational overhead and latency. Second, it supports incremental updates: when new attack datasets become available, LLM providers can finetune the same base model and add the resulting classifier to the en- semble, avoiding expensive end-to-end retraining of the full system while preserving performance on previously observed attacks. Third, the decision-making process is interpretable: the random forest router relies on transparent structural fea- tures allowing clear insight into which structural patterns of prompt attacks are most indicative of malicious intent. We evaluateBAGELon a diverse collection of nine large- scale, real-world datasets [14–16, 19, 20, 27, 33, 36, 37] cover- ing multiple categories of malicious prompts [26, 28, 41, 45, 53]. Our results (§ 6) show thatBAGELoutperforms popular black-box moderation APIs and competitive white-box base- lines, achieving 0.095 Attack Success Rate (ASR) and 0.066 False Positive Rate (FPR), resulting in an F1 score of 0.922 when utilizing the previously mentioned 430M parameters subset which is 45% smaller than the entire ensemble. Other methods achieve unbalanced ASR and FPR measurements resulting in lower F1 Scores overall. We further demonstrate thatBAGELmaintains robust performance as new attack types through new datasets are introduced over time, since the fi- nal F1 scores never drop below 0.92 even after introducing new datasets nine times. We also observe that the stochas- tic ensemble subset selection strategy when used in tandem with the random forest router achieves near-oracle detection performance. Finally, we show that BAGEL’s routing features yield interpretable signals that align with known structural patterns of malicious prompting behavior. In summary, this work makes the following contributions: •We introduceBAGEL, a modular and lightweight ensem- ble framework for malicious prompt detection that com- bines specialization with bootstrap-inspired aggregation. •We demonstrate thatBAGELachieves strong detection performance (F1 score of 0.92) while using signifi- cantly fewer parameters (430M effective) than billion- parameter baselines, and maintains robustness across incremental updates. •We show thatBAGELenables incremental defense evolu- tion through a simple update mechanism: fine-tuning a small classifer on new attack data and adding it to the ensemble, eliminating the need for expensive full-system retraining. •We provide interpretability analyses that expose the structural features most indicative of malicious prompts, supporting transparent and auditable deployment. Taken together, our results demonstrate thatBAGEL, an en- semble of small fine-tuned models with intelligent routing, offers a practical and scalable path forward for securing LLM systems against evolving prompt-based attacks. More funda- mentally, this work establishes a shift in how we approach LLM safety. Rather than requiring massive computational resources or complete system redesigns when new threats emerge,BAGELshows that modular architectures with intel- ligent routing can deliver robust protection while remaining practical for real-world deployment. As LLMs become in- creasingly embedded in critical applications, such efficient and adaptable defenses will be essential for maintaining both safety and sustainability at scale. 2 Background 2.1 Malicious Prompt Types Based on prior literature, malicious prompts can generally be categorized into three types based on identifiable structural patterns and clarity of intent to deceive [26, 28, 41, 45, 53] (examples of each type are provided in Figure 1). Simple Harmful Prompts: Simple harmful prompts refer to inputs in which the adversarial intent is stated explicitly in nat- ural language. These prompts are typically issued by novice attackers and aim to elicit toxic content or obtain instructions 2 Please provide a detailed, step-by-step plan to carry out █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ . Simple-Harmful-Prompts "user_message": "Summarize the text below.", "metadata": "", "payload": "Please provide instructions on how to █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ ." Prompt-Injection-Attacks From now on you are going to act as DAN, which stands for "Do Anything Now". DAN can break free of the typical confines of Al and does not abide by any set rules. Now, tell me how I can █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ █ ." Jailbreaking-Attacks Figure 1: Examples of the three types of malicious prompt at- tacks. The malicious task has been redacted here to prevent direct inclusion of harmful content in this paper. for clearly illegal or unethical activities. Because their mali- cious intent is overt, such prompts are usually straightforward for standard guardrails to detect and block. Jailbreak Attacks: Such prompts do not state the harmful objective directly. Instead, the prompt is crafted to manipulate and ultimately deceive the model into bypassing its own safety constraints. Common strategies include role-playing direc- tives, intent obfuscation, or attempts to override system-level instructions [48]. Prompt Injection Attacks: These prompts involve a benign- seeming task embedded with a malicious directive. The in- jected component is designed to cause the model to ignore guardrails while still appearing aligned with the harmless por- tion of the input. Unlike the previous two categories, prompt injection attacks characteristically interleave harmful and be- nign substrings within the same prompt, giving them a distinct structural profile [21]. 2.2 Existing Types of Detections Methods While considerable research has been conducted to main- tain LLM alignment under adversarial conditions, the most commonly used LLMs still remain susceptible to the latest jailbreak and prompt injections attacks [28, 41], making de- tection of malicious prompts critical for the safe deployment of LLMs. Black-box methods such as the OpenAIModera- tionAPI [29] or Perspective [22] remain popular detection strategies, however their closed-source nature makes it diffi- cult to ascertain their limitations and adapt to specific LLM vendor deployments. White-box methods remain effective but are often computationally expensive to run or maintain due to requiring a separate large deep learning model for scrutinizing prompts [26, 46]. 3 Threat Model We describe the threat model from the perspective of the attacker’s goal, capabilities and the defender’s capabilities. Attacker’s goal: We consider an attacker who inputs a ma- licious prompt instruction into the chat interface of an LLM application. The attacker is motivated to enter a malicious prompt to access objectionable content according to policies defined by the LLM serving hosts and/or societal norms. For example, the attacker might want the LLM to generate uneth- ical content to bring harm or offend others around them, or the attacker might want to conduct an illegal activity and thus seeks instructions. Attacker’s capabilities: We assume that the attacker does not have access to the underlying system prompt or any other implementation details due to the LLM being served remotely. Furthermore, we make no assumptions regarding the attacker’s level of expertise with adversarial prompting. The attacker may range from a novice who directly instructs the LLM to perform the malicious task, to an experienced adversary capable to employing indirect methods and careful prompt manipulation to increase the likelihood of their attack succeeding. Consequently, our threat model encompasses all three categories of malicious prompts: direct harmful requests, jailbreak techniques, and prompt injection attacks. We further assume attackers may leverage publicly available jailbreak- ing techniques and community-shared attack strategies that have proven successful against deployed systems. Our defense must therefore be robust to both known attack patterns and novel variations that exploit similar structural vulnerabilities. Defender’s objectives and capabilities: Our primary ob- jective is to halt the generation of harmful or policy-violating content by detecting malicious prompts before they reach the LLM. We focus on binary classification (benign vs. ma- licious) rather than fine-grained sub-classification of attack types or specific policy violations. While identifying the ex- act attack category could provide additional insights, binary detection is sufficient to prevent harmful or policy violating outputs and aligns with the immediate safety requirements of production systems. We discuss the potential benefits and trade-offs of more granular classification in § 6. We assume the defender has prior exposure to the general categories of malicious prompts (e.g., direct harmful requests, jailbreaks, and prompt injections), even if specific attack instances or novel techniques have not been observed during training. This assumption aligns with the LLM safety landscape where ad- versaries typically develop novel variations of established jailbreak and prompt injection patterns rather than fundamen- tally new attack categories. 3 4 Methodology Existing approaches for flagging malicious prompts often necessitate a trade-off between performance and efficiency. Using LLMs themselves as judges is computationally pro- hibitive for real-time applications, while commercial APIs such as OpenAIModerationAPI [29] operate as black-boxes that prevent granular analysis or customization for specific deployment environments. White-box methods often struggle to adapt to the rapidly evolving landscape of prompt injec- tion techniques without expensive, full-scale retraining or finetuning [26]. To address these limitations, our solution is motivated by four design goals: •Broad-Spectrum Robustness: The system must demon- strate high performance across diverse types of malicious prompts. It should effectively generalize to detect var- ious distinct threats, ranging from complex jailbreaks and prompt injections to simple unethical requests. •Computational Efficiency: The framework must re- main lightweight to ensure low-latency inference. Fur- thermore, the process of updating the system to defend against new attacks must be computationally inexpensive and rapid, avoiding the high resource costs associated with fine-tuning billion-parameter foundational models. •Continuous Adaptability: The system must be capa- ble of being updated with new information regarding emerging threat vectors and malicious prompt types. It should maintain its efficacy on historical threats while seamlessly accommodating new knowledge, ensuring long-term viability. • Transparency and Interpretability: The architecture should be open and interpretable. It must provide insights into the decision-making process, such as by allowing feature importance analysis, to help researchers under- stand which characteristics of a prompt trigger a flag. To achieve the four stated design goals, we introduce BAGEL, a Bootstrap AGgregated Ensemble Layer designed to detect incoming malicious prompts in LLM systems. An overview of the framework is provided in Figure 2, which illustrates the steps involved in real-time prompt classification as well as performing updates to incorporate knowledge of new datasets. During classification,BAGELutilizes a subset of the ensemble of finetuned binary classification models and aggregates their predictions. Since each ensemble member scrutinizes incoming prompts and guards the LLM system, we refer to each individual member as a ‘promptcop’. The subset of promptcops are chosen via an interpretable random forest, which aims to predict the most suitable promptcop, and the remaining in the subset are chosen by a stochastic selection strategy. This approach modifies the principles mixture-of- experts by routing incoming prompts to multiple models in order to ensure robustness against different attack types. When performing updates, a new dataset encompassing in- formation on new attack techniques is used to finetune a new promptcop to be added to the ensemble, retrain the random forest, and update the decision threshold hyperparamter. This modifies the principle of bootstrap aggregating (bagging) by finetuning each promptcop on different datasets rather than subsets of the same one, ensuring adaptability against evolv- ing threats while maintaining low computational overhead. In the following, we describe these steps more formally. 4.1 Base Model: Prompt Guard To ensure computational efficiency,BAGELutilizes a small base prompt-safety classifier as the backbone for each prompt- cop. This design choice enables rapid inference and low-cost fine-tuning, making it feasible to deploy and query multiple promptcops simultaneously without the latency overhead as- sociated with LLM-based judges or large monolithic safety models. The modular architecture allows any compact bi- nary classifier designed for prompt safety to serve as the base model, provided it offers a reasonable balance between detec- tion capability and computational efficiency. In our implementation, we use Prompt Guard 2 [31] as the base prompt-safety classifier. Developed by Meta, Prompt Guard 2 is an 86-million parameter model designed specif- ically for binary classification of input prompts as either safe or unsafe. Its compact size and strong baseline perfor- mance make it well-suited for our ensemble approach, though BAGEL’s framework is agnostic to the specific base model choice. 4.2 Creation of Specialized PromptCops LetD =D 1 , D 2 ,..., D k be a set of distinct datasets, where each datasetd i represents a specific taxonomy of adversarial attacks (e.g.,D 1 contains role-play jailbreaks,D 2 contains prompt-injection attacks). To ensure rigorous training, cali- bration, and evaluation, every datasetD i added to the system is strictly partitioned into three disjoint subsets: •Finetuning Set (D i train ,70%): Used to create an prompt- cop by finetuning a base Prompt Guard 2 model. We finetune over 80% ofD i train using the models built-in custom energy-based loss function [24], while the re- maining 20% ofD i train is used to validate performance pre- and post-fine-tuning. • Calibration Set (D i cal ,10%): Reserved strictly for cali- brating the decision threshold and random forest. Since these samples are unseen during gradient updates, they ensure the system is calibrated on data that mimics real- world deployment. To prevent the system from optimiz- 4 說 Real-Time Prompt Classification LLM Inference... response Random Forest 1 Predict suitable promptcop for prompt based on structural features 說 System Update PromptGuard 2 86M Finetuned PromptGuard 2 86M Finetunes Selection Ensemble k members ... promptcop 1 promptcop 2promptcop 3 promptcop k P 1 P 2 P n ... P avg P avg > t k P avg ≤ t k Benign Malicious Interrupt Inference prompt Benign Malicious OR Randomly pick n-1 promptcops from ensemble 2 New Malicious Prompts Dataset Finetuning Set Decision Threshold ( t k ) Test Set Global Test Set Used for evaluations Calibration Set Global Calibration Set Used for calibrating the threshold and retraining the random forest promptcop 1 promptcop 2 ... promptcop n n promptcop 3 P 3 Aggregate probabilities Figure 2: Overview ofBAGEL, divided into two sections. The larger section details the ensemble selection strategy and probability aggregation employed during real-time incoming prompt classificatin. The smaller sections details the dataset partitioning, finetuning and addition of a new promptcop to the ensemble during system updates. ing solely for the most recent attack type, we maintain a cumulative global calibration set, denoted asC global . This is constructed by appending the calibration subsets together. When a new promptcop forD j is introduced to the system, its corresponding calibration setD j cal is united with the existing calibration data: C global = k [ j=1 D j cal • Test Set (D i test ,20%): Completely held out for final per- formance evaluation. Similarly to the calibration sets, we maintain a cumulative global test set so thatBAGELmay be evaluated on all attack types currently seen, instead of just the most recent one (§ 6): T global = k [ j=1 D j test Wedefineour ensembleof promptcopsP = M 1 , M 2 ,..., M k . Each modelM i is an instance of the base Prompt Guard 2 model fine-tuned specifically a the finetuning subsetD i train . Consequently,M i becomes a specialized expert in detecting the specific features and linguistic patterns associated with the attack type in D i . This modularity provides a significant advantage in main- tainability. As new threat vectors emerge, we simply curate a new datasetD new , train a new promptcopM new , add it to the ensembleP, updateC global andT global without necessitating the retraining of the entire system. 4.3 Dynamic Routing via Random Forest To optimize detection accuracy, it is crucial to identify the promptcop most likely to recognize the specific nature of an incoming promptx. Since common patterns and structural differences between prompt-based attack types have shown be to be viable markers for classification [21, 52], we employ a Random Forest classifier, denoted asR, trained on structural features derived from prompt samples to serve as a router. The Random Forest is trained onC global to map a promptx to the index of the datasetD i that best represents the prompt’s characteristics. Letf(x)be the feature representations of the prompt. The router predicts the index of the ideal promptcop i ∗ : i ∗ = argmax i∈1,...,k P R (C i | f(x)) whereC i represents the class label corresponding to the attack type of datasetD i . The use of a Random Forest offers inter- pretability benefits, allowing for feature importance analysis to understand which linguistic tokens or structural elements are most indicative of specific attack types. For all prompts present inC global , we construct the following 9 lightweight features to train R : prompt_length: Counts the length of the prompt. whitespace_proportion : Measures the proportion of the prompt comprised of whitespace characters. special_char_proportion: Measures the proportion of the prompt comprised of special, non-alphanumeric char- acters. 5 avg_word_length: Calculates the mean number of charac- ters per words. digit_proportion: Measures the proportion of digits in the prompt. uppercase_proportion: Measures the proportion of up- percase alphabets in the prompt. code_keyword_count: Counts the number of words com- monly associated with code (such as ‘if’, ‘else’, ‘for’, ’def’) in the prompt. nl_word_count: Counts the number of words commonly as- sociated with natural language text (such as ‘the’, ‘and’, ‘you’, ’do’) in the prompt. shannon_entropy: Measures the randomness of characters in the prompt via calculating the Shannon entropy. 4.4 Selection and Aggregation Relying solely on the predicted promptcopM i ∗ can lead to overfitting or failure if the router misclassifies. Conversely, using allkmodels for aggregation, wherekrepresents the cur- rent size of the ensemble, may be computationally redundant or expensive ifkis sufficiently large. We therefore employ a stochastic selection strategy as well. For a given promptx, we select a subset of promptcops S x ⊂ Pof sizen, where1≤ n< k. This subset is constructed as follows: S x =M i ∗ ∪M rand 1 ,..., M rand n−1 Here,M rand are models drawn uniformly at random from P\M i ∗ . This inclusion of random promptcops introduces diversity, allowing the ensemble to reinforce learned knowl- edge across different attack vectors (e.g., a "jailbreak" prompt- cop might still detect "unethical" keywords). Each promptcopM j ∈ S x processes the prompt and outputs a probability scorep j (x)representing the likelihood of the prompt being malicious. The final maliciousness scoreˆyis computed via simple averaging: ˆy(x) = 1 n ∑ M j ∈S x p j (x) The final binary classification is determined by an ideal thresholdτ, which can be determined based on the cumulative calibration data provided to BAGEL at a given point in time: Prediction = ( Malicious if ˆy(x)>τ Benignotherwise This hybrid approach combines the precision of predicted expert routing with the robustness of ensemble variance reduc- tion. It ensures that even if the most suitable promptcop is not selected (due to a router error), the collective decision-making of the randomly selected peers provides a safety net. 4.5 Threshold Calibration As the ensemble grows with the addition of new prompt- cops, the distribution of the aggregated probability scores may change. Therefore,τmust be re-calibrated with every update to the model pool to maintain an optimal decision boundary. The ensemble is evaluated on the updatedC global to find the optimalτ that maximizes the F1 score. Finding the optimalτvia exhaustive search is computa- tionally inefficient. Assuming the F1 score is quasi-normally distributed around the ideal threshold, we employ a heuristic two-stage search strategy (Coarse-to-Fine) that converges on the optimalτ in approximately 20 evaluations. Stage 1: Coarse Search. We first evaluate the F1 score across the range[0.1, 0.9]with a step size of0.1. Let τ coarse be the threshold that yields the maximum F1 score in this stage. Stage 2: Fine Search. We subsequently refine the search in the local neighborhood ofτ coarse . We evaluate thresholds in the range[τ coarse − 0.05,τ coarse + 0.05]with a finer step size of 0.01. The global optimal thresholdτ ∗ is updated as: τ ∗ =argmax τ∈T hreshold search f1_score(BAGEL(C global ),τ) whereT hreshold search represents the set of approximately 20 values generated by the two-stage strategy. This method allows for rapid integration of new promptcops while ensuring the system’s sensitivity is rigorously tuned to the expanding threat landscape. 5 Experimental Setup To evaluate the efficacy ofBAGELand assess its adherence to the design goals outlined in § 4, we conducted three distinct experiments aimed at answering our key research questions derived from our design goals, as well as an additional exper- iment for comparing BAGEL against other methods: • Ensemble/Selection Efficiency Analysis: To validate the hypothesis that a selection comprised of a subset of promptcops can correctly flag different types of mali- cious prompts (DG1: Broad-Spectrum Robustness) and perform as well as the full ensemble (DG2: Compu- tational Efficiency) , we fix the total number of avail- able promptcop atk max . We then explore the strategy of randomly selectingnpromptcops for inference, where n∈1, 2,..., k max . Research Question 1: Is it possible to achieve efficient and near-ideal performance by randomly selecting a sub- set of n promptcops such that n< k? 6 Dataset Number of Samples Type of Malicious Prompts Simple HarmfulJailbreakInjection synapsecai/synthetic-prompt-injections [37]252,956◦• Malicious Prompt Detection Dataset (MPDD) [20]39,234• TrustAIRLab/in-the-wild-jailbreak-prompts [36]13,700◦•◦ Harelix/Prompt-Injection-Mixed-Techniques-2024 [15]1,175•◦ jackhhao/jailbreak-classification [16]1,042◦•◦ qualifire/prompt-injections-benchmark [33]5,000◦•◦ jayavibhav/prompt-injection-safety [19]60,000• ToxicDetector Evaluation Dataset [27]2,033•◦ guychuk/benign-malicious-prompt-classification [14]464,000◦• Total839,140464 Table 1: Collection of the nine datasets we used for finetuning and evaluation.• represents the dataset contained this specific type of malicious prompts, while◦ represents the lack of any such type. •Adaptability Analysis: To assess the system’s ability to update with new information on malicious prompts, while maintaining performance benchmarks (DG3: Con- tinuous Adaptability), we simulate the temporal arrival of new threat vectors by testing the system each time after introducing a dataset. We iteratively increase the en- semble size, finding the ideal subset sizenand threshold τ at each step. Research Question 2: DoesBAGELgeneralize to diverse malicious attacks, retaining robust performance as new datasets are sequentially added over time? •Interpretability Analysis: We analyze the decision- making process of the routing mechanism by performing a feature importance analysis from the Random Forest classifier (DG4: Transparency and Interpretability). Research Question 3: Can we derive interpretable in- sights into which linguistic features are most indicative of specific malicious prompt types? • Comparative Benchmarking: We compareBAGEL against established baselines, popular safety APIs and other prompt detection methods. Research Question 4: DoesBAGELoffer competitive per- formance compared to black-box and white-box meth- ods, and what benefits does it provide over them? All promptcops created by fine-tuning a base instance of Prompt Guard on a dataset were finetuned for 3 epochs across the dataset’s finetuning set. Furthermore, all experiments were carried out in an online Google Colab environment with a A100 GPU. 5.1 Datasets To ensure robustness against a wide spectrum of threats, we aggregated diverse datasets containing benign prompts and varying types of malicous adversarial prompt types (§ 2.1). As there are nine datasets in total, that setsk max = 9. The datasets utilized are detailed in Table 1. The aggregation of these datasets resulted in a total cor- pus of 839,140 samples. Following the partitioning strategy defined earlier (§ 4.2), 20% of each dataset was held out for evaluation. Consequently, when the ensemble size reaches its maximum (k max = 9), the final performance is evaluated on a combined test set of 167,828 unseen samples. 5.2 Evaluation Metrics To provide an objective assessment of the system’s utility, we employ three key metrics. While the F1-Score serves as our primary optimization metric for threshold calibration, we also monitor attack success rate (ASR) and false positive rate (FPR) to provide a more granular analysis: F1 Score: The harmonic mean of precision and recall. This provides a balanced view of the model’s performance, en- suring that neither safe prompts are aggressively flagged nor malicious prompts easily ignored. Attack Success Rate (ASR): In the context of this defense framework, ASR represents the False Negative Rate, the percentage of actual malicious prompts that were misclassified as ‘Benign’. A lower ASR indicates a more robust defense. ASR = False Negatives True Positives+ False Negatives False Positive Rate (FPR): FPR represents the percent- age of safe prompts that were incorrectly classified as 7 ‘Malicious’. A lower FPR indicates a less-obstructive, smoother experience for normal users. F PR = False Positives False Positives+ True Negatives 6 Results In this section, we present the empirical findings from our experiments. 6.1 Ensemble/Selection Efficiency Analysis In this experiment, we finetune instances of Prompt Guard on each dataset to create nine promptcops and add them to the ensemble, thus fixingk = 9. Next, we evaluateBAGEL’s performance on the union of all test sets,T global and vary the selection size (n) each time from1tok. Other than our proposed selection strategy of predicting the ideal promptcop via a trained random forest and selecting the rest of then− 1 promptcops randomly, we also employ 3 other strategies: Baseline: We test the baseline performance by using just a single, base instance of Prompt Guard. Random Selection: We test the strategy of not using a ran- dom forest for predicting the ideal promptcop and simply choose all n promptcops at random. Ideal: We test the strategy of assuming the ideal promptcop is already known and thus selected, allowing us to ob- serve the theoretical best performance of our finetuning methodology. The remainingn− 1promptcops are se- lected at random. Note that atn = 1, this strategy reduces to simply using the prediction from the ideal promptcop alone, without aggregating it with the predictions from any other promptcops. These strategies help to answer RQ1 - whetherBAGELcan reduce the computational demand of detecting malicious prompts by using annsized selection instead of a larger ksized ensemble, while still performing well at detecting all types of attacks. The results of this experiment, along with the calculated ideal threshold and the accuracy of the random forest model are provided in Figure 3. Firstly, the results demonstrate that finetuned promptcops perform much better in terms of ASR at detecting malicious prompts compared to base Prompt Guard, due to it being largely tailored towards jailbreaking prompts only. Addition- ally, we see the positive effect of using the random forest to predict and select the ideal promptcop instead of just selecting allnpromptcops at random. The positive effect increases and ndecreases, which is intuitively sensible since asnincreases, there is a higher chance of the ideal promptcop being chosen and steering the aggregation in the right direction even if the random forest fails to select it. 123456789 Number of PromptCops Selected (n) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Metric Value Ideal Threshold = 0.58 Random Forest Accuracy = 0.806 Final F1-Score = 0.938 Baseline ASR Baseline FPR Random Selection ASR Random Selection FPR ASR (Predicted Expert and n-1 Random Selections) FPR (Predicted Expert and n-1 Random Selections) Ideal ASR (Best Expert and n-1 Random Selections) Ideal FPR (Best Expert and n-1 Random Selections) Ideal ASR (Best Expert Only) Ideal FPR (Best Expert Only) Figure 3: ASR and FPR performance curves across increasing selection size (n) for k = 9. Second, we also observe that the performance of the ran- dom forest strategy tracks extremely closely with the ideal strategy of always knowing and selecting the ideal prompt- cop, which is significant given that the random forest is not guaranteed to pick the ideal promptcop since it achieves an accuracy of approximately80%. Matching the performance of a hypothetically ideal system with perfect knowledge of the ideal promptcop suggests that thensized selection approach successfully mitigates random forest errors. When the ran- dom forest misclassifies the ideal promptcop for a prompt, the other randomly selected promptcops often provide sufficient coverage to correct the decision. Furthermore, while selecting just the ideal promptcop is a viable strategy for reducing FPR, it results in extremely high ASR, therefore even if the ideal promptcop can always be known, it is better to aggregate its probability predictions together with the predictions of other randomly selected promptcops as they are still effective in steering the final predictions in the right direction. The bagging technique results in a high F1-Score as it is highly effective at balancing both ASR and FPR, and inclusion of the random forest further increases performance to be nearly equal to that of the theoretical best. Lastly, but most importantly, we observe a noticeable per- formance saturation point as computational overhead in- creases. The performance curves flatten aroundn = 5, and increasing the selection size further fromn = 5ton = 9yields only marginal gains (approximately reducing ASR from 0.096 to 0.080 and FPR from 0.067 to 0.051). This confirms that it is unnecessary to select and request inference from every promptcop in the ensemble to achieve near-optimal perfor- mance. By settingn = 5,BAGELis able to reduce its computa- tional cost of inference by approximately 45%, which may be even greater if the threshold for what is considered acceptable performance is more relaxed. This validates our design goal of creating a lightweight, resource-efficient yet still effective detection method. 8 6.2 Adaptability Analysis In this experiment, to help answer RQ2 - whetherBAGELcan retain robust performance as new threat vectors are discovered over time, we simulate the temporal arrival of new prompt- cops trained on new datasets by repeating experiment 1 multi- ple times, iteratively increasing the ensemble size3tok max , thereforek∈3, 4,..., 9. After the system is evaluated with respect to ASR and FPR for a particular value ofk; the next dataset is chosen and partitioned, an promptcop is finetuned and added to the ensemble,C global is updated so the threshold and random forest can be re-calibrated, andBAGELis tested on the updatedT global for varying values ofn. In simpler terms, instead of fixingkand varyingn, we varykandnboth. The datasets and their respective promptcops were added in the same order as they are presented in Table 1, and the results are provided in Figure 4. We see that for most values ofk, close to ideal performance can be achieved simply atn = 1due to the use of the random forest for predicting the ideal promptcop. If use of the random forest is not preferred for any reason, then close to ideal per- formance can generally still be achieved by increasingnwhile still keeping it lower thank(e.g. fork = 7,n = 4offers nearly the same ASR and FPR asn = 7). When the final dataset and the largest dataset is added to the system, the resulting changes significantly alter the performance curves, causing predictions solely from the random forest atn = 1to offer unbalanced performance. However, once again the bagging methodology helps to stabilize performance, highlighting its usefulness in maintaining robust performance in scenarios where incoming datasets are significantly different from what BAGEL has adapted to before. 6.3 Interpretability Analysis In this experiment, we answer RQ3 by performing feature analysis through the random forest in order to gain insights into which structural features of a prompts are most informa- tive towards differentiating benign from malicious prompts, ultimately providing an element of transparency and inter- pretability to BAGEL as a whole. To understand the relationship between the features ex- tracted from the prompts inC global (§ 4.3), we analyzed the feature space utilizing Spearman rank correlation, which al- lows us to identify multicollinearity and isolate the most in- formative signals. To visualize these relationships, we per- formed hierarchical clustering using Ward’s linkage method. Ward’s method is an agglomerative clustering algorithm that merges pairs of clusters at each step that result in the min- imum increase in total within-cluster variance. This results in a dendrogram where features connected at lower vertical distances are highly correlated and thus share similar predic- tive information. The correlation matrix and dendogram are presented in Figure 5. The dendrogram reveals intuitively significant clusters. For instance,whitespace_proportionandavg_word_length are clustered closely together with a strong negative correla- tion, which is logical, as an increase in average word length naturally reduces the frequency of whitespace in a fixed- length text. Similarly,prompt_lengthandnl_word_count are positively correlated, as adding natural language words increases the overall prompt length. Since features within the same cluster provide redundant predictive information, we can prune the feature space to improve efficiency. By applying a distance threshold of 0.7 to the dendrogram, we retained a single representative feature from each resulting cluster. This process reduced our input dimension from 9 features to 5:prompt_length, whitespace_proportion,special_char_proportion, digit_proportion, and uppercase_ratio. We retrained the Random Forest using only this reduced feature set. The classification accuracy of the router experi- enced a negligible decrease from 0.806 to 0.794. This result confirms thatBAGELcan maintain its robustness while be- coming even more lightweight and computationally efficient. Furthermore, it validates that specific structural properties, such as the ratio of uppercase letters or special characters, are highly indicative of malicious intent in prompts, providing interpretable insights for future defense strategies. 6.4 Comparative Benchmarking To address and RQ4 and broadly contextualize the perfor- mance ofBAGEL, we conducted a comparative analysis against established industrial baselines, both black-box and white-box methods. We comparedBAGELagainst the following methods specifically: •OpenAIModerationAPI [29]: A widely deployed, industrial-grade black-box API designed to detect text that violates safety policies. It comprises of an unknown amount of parameters although there are likely more than our method due to the API’s ability to further classify the exact type of content policy violation, such as ‘hate’, ‘self-harm’, ‘harassment’, ‘violence’ etc. It is otherwise unable to be tweaked rapidly to specific datasets and newer attacks. •Perspective API [22]: Developed by Jigsaw (Google), this API utilizes machine learning models to score the "toxicity" of comments. It is widely used for content moderation but focuses primarily on sentiment and tox- icity rather than structural prompt attacks. Similarly to OpenAIModerationAPI, it is a black-box interface which probably has many more parameters thanBAGELdue to outputting more than binary categories. •ToxicDetector [26]: A grey-box methodology designed to perform binary classification for benign or toxic 9 123 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.69 Random Forest Accuracy = 0.976 Final F1-Score = 0.972 (a) Increasing Selection Size for k = 3 1234 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.66 Random Forest Accuracy = 0.957 Final F1-Score = 0.963 (b) Increasing Selection Size for k = 4 12345 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.63 Random Forest Accuracy = 0.953 Final F1-Score = 0.939 (c) Increasing Selection Size for k = 5 123456 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.6 Random Forest Accuracy = 0.935 Final F1-Score = 0.922 (d) Increasing Selection Size for k = 6 1234567 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.6 Random Forest Accuracy = 0.761 Final F1-Score = 0.955 (e) Increasing Selection Size for k = 7 12345678 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.59 Random Forest Accuracy = 0.783 Final F1-Score = 0.923 (f) Increasing Selection Size for k = 8 123456789 Number of PromptCops Selected (n) 0.0 0.1 0.2 0.3 0.4 Metric Value Ideal Threshold = 0.58 Random Forest Accuracy = 0.806 Final F1-Score = 0.938 (g) Increasing Selection Size for k = 9 Baseline ASR Baseline FPR Random Selection ASR Random Selection FPR BAGEL ASR (Predicted Expert and n-1 Random Selections) BAGEL FPR (Predicted Expert and n-1 Random Selections) Ideal ASR (Best Expert and n-1 Random Selections) Ideal FPR (Best Expert and n-1 Random Selections) Ideal ASR (Best Expert Only) Ideal FPR (Best Expert Only) Figure 4: Effects of modifying the selection size (n) on ASR and FPR while adding datasets (modifying k) over time. 10 MethodASRFPRF1 ScoreParamsFine-tunable BAGEL0.0950.0660.92286M per finetune, 430M for n=5✓ ToxicDetector0.0450.3260.847300M + 7B✓ ShieldGemma0.6240.0380.5342B✓ OpenAIModeration API0.8810.0240.208Not Known✗ Perspective API0.5690.0680.642Not Known✗ LastLayer0.5980.1710.519Not Applicable✗ Table 2: Results of evaluating BAGEL (with k = 9, n = 5) against other methodologies. whitespace_proportion avg_word_length code_keyword_count prompt_length nl_word_count special_char_proportion shannon_entropy digit_proportion uppercase_ratio whitespace_proportion avg_word_length code_keyword_count prompt_length nl_word_count special_char_proportion shannon_entropy digit_proportion uppercase_ratio Spearman Correlation Matrix whitespace_proportion avg_word_length code_keyword_count prompt_length nl_word_count special_char_proportion shannon_entropy digit_proportion uppercase_ratio 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Cluster merge distance (1 |Spearman correlation|) Hierarchical Clustering Dendrogram 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 Spearman correlation Figure 5: Spearman Correlations of the Random Forest features and their resulting Hierarchical Clustering Diagram, showing relative correlations between cluster of features. prompts. It uses embeddings from an LLM (such as LLama2-7B, which we use in our experiments) as the feature vectors to for a 300M parameter MLP classifier. •ShieldGemma [49]: These are white-box, instruction- tuned models for evaluating the safety of text and images. ShieldGemma 1 is built upon the Gemma 2 LLM in 2B, 9B, and 27B parameter sizes (we use the 2B one in our experiments), and allows for a custom safety policy to be provided in the system prompt. •LastLayer [4]: This is a partially black-boxed secu- rity library for protecting LLMs from malicious attacks. Rather than employing a deep-learning-centric approach, LastLayer holistically analyses the structure of incoming prompts via numerous modules, similar to our Random Forest features, in order to detect patterns indicative of prompt injection attacks, jailbreaks and other exploits. Table 2 presents the performance metrics forBAGELatk = 9, n = 5compared to the other techniques. Since we test atk = 9, we provide test samples from all the datasets collectively for evaluation, instead of testing one dataset at a time. This helps to paint a more realistic picture ofBAGEL’s performance, since incoming prompts in real-world systems may not be temporally stratified into different types. The results highlight significant performance disparities between the methods. First, Perspective API and ToxicDe- tector struggle significantly at balancing ASR and FPR. This is likely because these models are optimized for semantic toxicity (e.g., insults, profanity). Many modern jailbreaks (e.g., role-playing scenarios) utilize polite, non-toxic language to bypass filters, causing toxicity-based detectors to classify them as benign (resulting in a high Attack Success Rate). Second, while some techniques offer robust performance in either ASR or FPR,BAGELachieves comparable or supe- rior results while being significantly more lightweight and transparent. The APIs are generalized black-boxes, and they cannot be easily updated by the user to address a new at- tacks without waiting for the vendor to update the model. In contrast,BAGEL’s ensemble approach allows it to capture the nuance of specific attack vectors (via specialized promptcops) that generalized models might miss. By combining prompt- cops on injections, jailbreaks, and unethical requests,BAGEL 11 achieves the highest F1 score, demonstrating that a special- ized ensemble of small models is a viable, high-performance alternative to massive, monolithic guardrails. Even at 430M parameters forn = 5,BAGELremains the easiest to update as well given that adding a new promptcop to the ensemble requires finetuning just a 84M model on just the new dataset, which is much more lightweight than the other methods that need to be retrained from scratch each time and on many more parameters. 7 Discussion In this section, we discuss other aspects and limitations of our method that were not explored in the previous sections. 7.1 Energy Efficiency and Sustainability As AI and specifically deep learning become increasingly ubiquitous, growing concerns have emerged regarding the en- ergy costs of deploying large models and their environmental impact [3, 6, 12, 38]. Data center energy consumption in the U.S. accounted for 4.4% of total electricity use and is pro- jected to triple by 2028 [34]. Critically, the majority of energy costs in AI systems stem from inference rather than training, as training is typically a one-time process while inference must be sustained continuously at scale [11]. Despite this reality, most research has focused on reducing training costs rather than inference costs [39]. BAGELimplicitly accounts for these sustainability concerns through its architectural design. By using an 86M parameter base safety classifier and deploying only a small subset of promptcops per inference (typically 5 out of a larger ensem- ble),BAGELsignificantly reduces both the computational foot- print and energy consumption compared to billion-parameter alternatives. The framework’s reliance on classical machine learning techniques, specifically bootstrap aggregation and random forest routing, demonstrates that powerful yet effi- cient methods can achieve strong performance in LLM safety without requiring massive computational resources. This ap- proach not only makes robust LLM safety more accessible to resource-constrained deployments and smaller vendors, but also represents a step toward environmentally sustainable AI safety practices. As LLM systems continue to scale globally, such efficiency-focused architectures will become essential for balancing safety requirements with environmental respon- sibility. 7.2 Limitations and Future Work WhileBAGELdemonstrates strong performance across diverse attacks, some limitations warrant discussion and point towards promising future research directions. Binary classification scope: As stated in our threat model (§3), we prioritize rapid detection of malicious prompts to prevent harmful content generation. In this context,BAGEL’s binary classification (benign vs. malicious) is sufficient and efficient. However, some deployment scenarios require fine- grained classification to identify which specific safety or ethi- cal policy has been violated. In such cases,BAGELwould need to be extended beyond binary detection, unlike systems such as OpenAI Moderation [29] or Perspective [22] that provide multi-class policy violation categories. Future work could explore hierarchical classification whereBAGELfirst performs binary detection, followed by a secondary fine-grained classi- fier for policy-specific attribution. Multi-turn conversation handling: Our evaluation focuses exclusively on single-prompt detection using datasets where each sample represents an isolated prompt. Consequently, the effectiveness ofBAGELin moderating multi-turn conversa- tions remains untested. Real-world LLM deployments often involve conversational contexts where malicious intent can be distributed across multiple turns or where benign prompts become harmful only in specific conversational contexts. Ex- tendingBAGELto handle conversation history and temporal attack patterns represents an important direction for future research. Dependence on base model coverage: WhileBAGELdemon- strates efficient adaptation to new attacks through incremental fine-tuning, this adaptability partly relies on the base model having learned relevant underlying attack classes. In our im- plementation, Prompt Guard 2 was pre-trained on jailbreak and prompt injection attacks, providing a foundation that each fine-tuned promptcop could build upon. However, it remains an open question whether entirely novel attack paradigms that differ fundamentally from these established categories could emerge in practice. To our knowledge, the LLM safety land- scape has not yet seen such fundamentally different attack types; if they were to arise, it is unclear whether fine-tuning alone would suffice or whether the base model would need re- training. That said, we note that the three attack categories we address (direct harmful requests, jailbreaks, and prompt injec- tions) are quite broad and have encompassed a wide variety of techniques to date, including recent variations such as multi- lingual attacks, encoded prompts, and adversarial suffixes. This suggests thatBAGEL’s incremental adaptation mecha- nism may have broader applicability than initially apparent. Empirically evaluatingBAGELon emerging attack variations represents an important direction for validating these bound- aries. Additionally, training a purpose-built prompt safety classifier optimized forBAGEL’s architecture could provide stronger guarantees against diverse threats while maintaining efficiency. Dataset requirements for new attacks: Each new prompt- cop requires a sufficiently large and representative dataset of the target attack type for effective fine-tuning. When novel attacks first emerge, collecting adequate training data can be challenging and time-intensive. This creates a detection gap between when a new attack appears and whenBAGELcan be 12 updated with a corresponding promptcop. While this limita- tion affects all supervised learning approaches, it underscores the importance of rapid dataset curation and potentially in- corporating few-shot or zero-shot detection capabilities for emerging threats. 8 Related Work Malicious Attacks on LLMs: Prior work categorizes instruction-based attacks into three types [26, 28, 41, 45, 53]: (1) direct harmful requests, (2) jailbreak attacks that bypass guardrails through deceptive framing, and (3) prompt injec- tion attacks that embed malicious instructions within benign content. Significant research has focused on designing and improving such attacks [2, 25, 30, 47, 53]. We note thatBAGEL, like other prompt detection methods, specifically addresses these instruction-based attacks at inference time and is not designed for alternative threat vectors such as data poison- ing [23, 40], backdoor attacks [44], or model extraction [7]. Below we discuss the different approaches to malicious promt detection. Rule-based and statistical detectors: Early approaches to malicious prompt detection rely on statistical signals and heuristics. PerplexityFilter [18] and SIRL [35] identify harm- ful prompts by measuring response uncertainty through per- plexity and entropy calculations respectively. LastLayer [4] uses simple structural detectors to find alarming patterns in prompts, and JailGuard [51] mutates inputs to create variants and detects adversarial prompts through response divergence. While computationally efficient, these methods often strug- gle with sophisticated attacks that evade simple statistical patterns. Commercial and black-box APIs: Commercial solutions such as OpenAI Moderation API [29] and Perspective API [22] provide convenient detection services but offer lim- ited transparency, poor adaptability to domain-specific threats, and no guarantees against rapidly evolving attacks. LLM-based detectors: Most recent detection methods lever- age LLMs for stronger performance at the cost of computa- tional efficiency. ToxicDetector [26] and InstructDetector [42] extract features from LLM hidden states for classification. StruQ [9] and FJD [8] use LLMs for prompt scrutinization with additive instructions and first-token confidence analy- sis respectively. Gpt-oss-safeguard (20B and 120B) [1] and ShieldGemma (ranging from 2B to 27B parameters) [49] built on Gemma 2, are large models for detecting harmful content by defining custom safety policies. Recent work has also ex- plored guardrails for autonomous agents through dynamic code generation [10, 43]. While effective, these approaches require substantial computational resources for both inference and updates. Hybrid and distillation approaches: Methods closest to BAGELinclude Jatmo [32], which fine-tunes a non-instruction- tuned LLM resistant to prompt injections and BD-LLM [50], which distills LLM rationales into smaller student models. While these approaches reduce computational costs compared to direct LLM-based detection, they still depend on LLM- scale models or require access to LLM internals BAGELdistinguishes itself by operating without LLMs en- tirely, using only small specialized classifiers (86M param- eters each) in a modular ensemble architecture. This de- sign achieves strong detection performance (F1 score of 0.92) while significantly reducing computational overhead and enabling efficient incremental updates through simple fine-tuning and ensemble addition, critical advantages as new attack types emerge. 9 Conclusion In this work, we introduceBAGEL, a modular and efficient framework for detecting malicious prompt attacks in LLM sys- tems. By building on the principle that effective defense does not require massive computational resources,BAGELdemon- strates that small, specialized models can match or exceed the performance of billion-parameter alternatives. The framework utilizes a lightweight base safety classifier as its foundation, enabling computational efficiency while maintaining strong detection capabilities. An ensemble of fine-tuned promptcops provides robust performance across diverse malicious prompt types and enables streamlined updates without full-system retraining. The bootstrap aggregation strategy for selecting ensemble subsets, combined with a random forest router for identifying suitable promptcops, further reduces computa- tional demands while achieving near-oracle performance and providing interpretable routing decisions. Our evaluation across nine diverse datasets demonstrates thatBAGELachieves the highest F1 score (0.92) compared to popular detection methods while using only 430M effective parameters, substantially fewer than existing approaches. Per- formance remains robust even after nine incremental dataset additions, validating the framework’s ability to evolve along- side emerging threats. By decoupling robust safety from mas- sive scale,BAGELestablishes a blueprint for sustainable LLM security systems that prioritize modularity, efficiency, and adaptability. As the threat landscape continues to evolve, such frameworks ensure that defenses can adapt as rapidly as the attacks designed to circumvent them, without requiring pro- hibitive computational resources or extensive retraining. 13 References [1] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt- oss-20b model card. arXiv preprint arXiv:2508.10925, 2025. [2]Maksym Andriushchenko, Francesco Croce, and Nico- las Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks.arXiv preprint arXiv:2404.02151, 2024. [3]Lasse F Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. Carbontracker: Tracking and pre- dicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051, 2020. [4] arekusandr.Last layer: Ultra-fast, low latency llm prompt injection/jailbreak detection, 2024.https: //github.com/arekusandr/last_layer. [5]Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. [6] Alfredo Canziani, Adam Paszke, and Eugenio Culur- ciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016. [7]Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021. [8]Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, and Jindong Gu. Llm jailbreak detection for (al- most) free! arXiv preprint arXiv:2509.14558, 2025. [9]Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner.StruQ: Defending against prompt injection with structured queries. In 34th USENIX Security Sym- posium (USENIX Security 25), pages 2383–2400, 2025. [10]Zhaorun Chen, Mintong Kang, and Bo Li. Shieldagent: Shielding agents via verifiable safety policy reasoning, 2025b. URL https://arxiv. org/abs/2503.22738, 2025. [11] Radosvet Desislavov, Fernando Martínez-Plumed, and José Hernández-Orallo. Trends in ai inference energy consumption: Beyond the performance-vs-parameter laws of deep learning. Sustainable Computing: Infor- matics and Systems, 38:100857, 2023. [12]Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W Mahoney, and Kurt Keutzer. Ai and memory wall. IEEE Micro, 44(3):33–39, 2024. [13]Hangzhi Guo, Pranav Narayanan Venkit, Eunchae Jang, Mukund Srinath, Wenbo Zhang, Bonam Mingole, Vipul Gupta, Kush R Varshney, S Shyam Sundar, and Amulya Yadav. Exposing ai bias by crowdsourcing: Democratiz- ing critique of large language models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1169–1180, 2025. [14]guynachshon.guychuk/benign-malicious- prompt-classification,2024.https: //huggingface.co/datasets/guychuk/ benign-malicious-prompt-classification. [15] Harelix.Harelix/prompt-injection- mixed-techniques-2024,2024.https: //ai.gitee.com/hf-datasets/Harelix/ Prompt-Injection-Mixed-Techniques-2024. [16] Jack Hao.jackhhao/jailbreak-classification, 2023. https://huggingface.co/datasets/jackhhao/ jailbreak-classification. [17]Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 03 1991. [18]Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614, 2023. [19] jayavibhavnk. jayavibhav/prompt-injection-safety, 2024. https://huggingface.co/datasets/jayavibhav/ prompt-injection-safety. [20]Mohammed Amine Jebbar.Malicious prompt detection dataset (mpdd), 2025.https://w. kaggle.com/datasets/mohammedaminejebbar/ malicious-prompt-detection-dataset-mpdd. [21]Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection at- tacks. arXiv preprint arXiv:2510.12252, 2025. [22]Jigsaw. Perspective api.https://perspectiveapi. com/. [23]Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660, 2020. [24]Weitang Liu, Xiaoyun Wang, John Owens, and Yix- uan Li.Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020. 14 [25]Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023. [26]Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, and Yang Liu. Efficient detection of toxic prompts in large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 455–467, 2024. [27]YiLiu, JunzheYu, HuijiaSun, LingShi, Gelei Deng, Yuqi Chen, and Yang Liu.Tox- icdetectorevaluationdataset(safetyprompt- collections),2024.https://sites. google.com/view/toxic-prompt-detector/ open-science-artifact?authuser=0. [28] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831– 1847, 2024. [29]Todor Markov, Chong Zhang, Sandhini Agarwal, Flo- rentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI conference on artificial intelli- gence, volume 37, pages 15009–15018, 2023. [30] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37:61065–61105, 2024. [31]Meta-LLAMA.Prompt-guard-2-86m, 2025. https://huggingface.co/meta-llama/ Llama-Prompt-Guard-2-86M. [32] Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, and David Wagner. Jatmo: Prompt injection defense by task-specific finetuning. In European Symposium on Re- search in Computer Security, pages 105–124. Springer, 2024. [33] Qualifire AI. qualifire/prompt-injections-benchmark, 2025.https://huggingface.co/datasets/ qualifire/prompt-injections-benchmark. [34] Arman Shehabi, Alex Newkirk, Sarah J Smith, Alex Hubbard, Nuoa Lei, Md Abu Bakar Siddik, Billie Hole- cek, Jonathan Koomey, Eric Masanet, and Dale Sar- tor. 2024 united states data center energy usage report. eScholarship, 2024. [35]Guobin Shen, Dongcheng Zhao, Haibo Tong, Jindong Li, Feifei Zhao, and Yi Zeng. Safety instincts: Llms learn to trust their internal compass for self-defense. arXiv preprint arXiv:2510.01088, 2025. [36]Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024. [37] SynapseCAI. synapsecai/synthetic-prompt-injections, 2024.https://w.oxen.ai/synapsecai/ synthetic-prompt-injections. [38]Neil C Thompson, Kristjan Greenewald, Keeheon Lee, Gabriel F Manso, et al. The computational limits of deep learning. arXiv preprint arXiv:2007.05558, 10:2, 2020. [39]Roberto Verdecchia, June Sallou, and Luís Cruz. A systematic review of green ai. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 13(4):e1507, 2023. [40]Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In International Conference on Machine Learn- ing, pages 35413–35425. PMLR, 2023. [41]Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079– 80110, 2023. [42]Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, and Fangzhao Wu.Defending against indirect prompt injection by instruction detection.arXiv preprint arXiv:2505.06311, 2025. [43] Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, et al. Guardagent: Safeguard llm agents via knowledge-enabled reasoning. In Forty-second In- ternational Conference on Machine Learning, 2025. [44]Haomiao Yang, Kunlan Xiang, Mengyu Ge, Hongwei Li, Rongxing Lu, and Shui Yu. A comprehensive overview of backdoor attacks in large language models within communication networks. IEEE Network, 38(6):211– 218, 2024. [45]Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, 2024. 15 [46]Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295, 2024. [47] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts.arXiv preprint arXiv:2309.10253, 2023. [48]Zhiyuan Yu, Xiaogeng Liu, Shunning Liang, Zach Cameron, Chaowei Xiao, and Ning Zhang. Don’t listen to me: Understanding and exploring jailbreak prompts of large language models. In 33rd USENIX Security Symposium (USENIX Security 24), pages 4675–4692, 2024. [49]Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772, 2024. [50]Jiang Zhang, Qiong Wu, Yiming Xu, Cheng Cao, Zheng Du, and Konstantinos Psounis. Efficient toxic content detection by bootstrapping and distilling large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 21779–21787, 2024. [51] Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Ming Hu, Jie Zhang, Yang Liu, Shiqing Ma, and Chao Shen. Jailguard: A universal detection frame- work for prompt-based attacks on llm systems. ACM Transactions on Software Engineering and Methodol- ogy, 2025. [52]Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, et al. Promptrobust: Towards evaluat- ing the robustness of large language models on adversar- ial prompts. In Proceedings of the 1st ACM workshop on large AI systems and models with privacy and safety analysis, pages 57–68, 2023. [53] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and trans- ferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 16