
Paper deep dive

Query-efficient and dataset-independent red teaming for LLMs content safety evaluation

Shuo Liu, Xiang Cheng, Sen Su

Year: 2025 · Venue: Knowledge-Based Systems · Area: Safety Evaluation · Type: Empirical · Embeddings: 17

Abstract

Large language models (LLMs) are widely used for their remarkable ability to understand and generate natural language. Nevertheless, LLMs can also pro…

Tags

ai-safety (imported, 100%) · empirical (suggested, 88%) · red-teaming (suggested, 80%) · safety-evaluation (suggested, 80%)

Links

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/12/2026, 5:19:58 PM

Summary

The paper introduces RAPT (Query-Efficient Adaptive Red Teaming), a dataset-independent framework for evaluating LLM content safety. RAPT runs an adaptive generate-select loop in which an LLM-based generator creates test cases from contrastive prompts, and an RL-based selector, with selection formalized as an MDP, prioritizes cases via a composite reward function that balances effectiveness and diversity, significantly improving query efficiency.

Entities (5)

LLM · technology · 100%
RAPT · methodology · 100%
Shuo Liu · researcher · 100%
Reinforcement Learning · technique · 98%
Markov Decision Process · mathematical-framework · 95%

Relation Signals (3)

RAPT evaluates LLM

confidence 100% · RAPT, a query-efficient and dataset-independent red teaming approach... for LLMs content safety evaluation

Markov Decision Process formalizes Test Case Selection

confidence 95% · we formalize the test case selection process as a Markov decision process (MDP)

RAPT utilizes Reinforcement Learning

confidence 95% · RAPT employs an adaptive generate-select framework... selecting test cases by a reinforcement learning (RL)-based selector

Cypher Suggestions (2)

Identify the relationship between RAPT and its target technology · confidence 95% · unvalidated

MATCH (m:Methodology {name: 'RAPT'})-[r]->(e:Technology) RETURN m, r, e

Find all methodologies that utilize reinforcement learning for LLM evaluation · confidence 90% · unvalidated

MATCH (m:Methodology)-[:UTILIZES]->(t:Technique {name: 'Reinforcement Learning'}) RETURN m
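A hedged sketch of how the Cypher suggestions above might be run against a Neo4j instance backing this knowledge graph. The bolt URI, credentials, and the `run_suggestions` helper are assumptions; only the two query strings are taken verbatim from the suggestions.

```python
# Catalogue of the suggested (still unvalidated) Cypher queries.
QUERIES = {
    "rapt_targets": (
        "MATCH (m:Methodology {name: 'RAPT'})-[r]->(e:Technology) "
        "RETURN m, r, e"
    ),
    "rl_methodologies": (
        "MATCH (m:Methodology)-[:UTILIZES]->"
        "(t:Technique {name: 'Reinforcement Learning'}) RETURN m"
    ),
}

def run_suggestions(uri="bolt://localhost:7687", auth=("neo4j", "password")):
    """Execute each suggested query and report its row count."""
    # Deferred import so the query catalogue is usable without the driver.
    from neo4j import GraphDatabase  # pip install neo4j (5.x)
    with GraphDatabase.driver(uri, auth=auth) as driver:
        for name, query in QUERIES.items():
            records, _, _ = driver.execute_query(query)
            print(f"{name}: {len(records)} rows")
```

Validating the row counts this way would be one route to flipping the suggestions from "unvalidated" to validated.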

Full Text

16,839 characters extracted from source content.


Knowledge-Based Systems, Volume 329, Part B, 4 November 2025, 114404

Query-efficient and dataset-independent red teaming for LLMs content safety evaluation

Shuo Liu, Xiang Cheng, Sen Su

https://doi.org/10.1016/j.knosys.2025.114404

Abstract

Large language models (LLMs) are widely used for their remarkable ability to understand and generate natural language. Nevertheless, LLMs can also produce unintended outputs that pose significant social risks. Red teaming can identify potential security vulnerabilities in LLMs and support mitigating such risks. However, existing red teaming approaches struggle to balance query efficiency and generalizability due to their complex search processes or reliance on pre-existing datasets. To address these issues, we present RAPT, a query-efficient and dataset-independent red teaming approach. RAPT employs an adaptive generate-select framework that consists of four cyclic steps: generating test cases with an LLM-based generator, selecting test cases with a reinforcement learning (RL)-based selector, testing the target model, and refining the generator and the selector. We introduce a contrast prompt template and a diversity demonstration extraction method to guide the generator, incorporating previous test feedback as demonstrations to generate more effective and diverse test cases. For the selector, we formalize the test case selection process as a Markov decision process (MDP), allowing us to design a reinforcement learning-based agent that continuously optimizes the selection policy and balances the effectiveness and diversity of test cases according to a compound reward function. Experimental results show that RAPT discovers more successful and diverse test cases than existing methods within a limited number of queries, without relying on any pre-existing dataset.

Introduction

Large language models (LLMs) have become a cornerstone of modern natural language processing (NLP), demonstrating exceptional capabilities in tasks such as text generation, machine translation, conversational AI, and document summarization. These models are not only advancing NLP technology but also reshaping the industry landscape for applications ranging from intelligent virtual assistants and automated content creation to real-time language translation and sentiment analysis [1], [2], [3], [4]. While LLMs have great potential, they are not without limitations. Research has shown that these models may inadvertently produce undesirable outputs, such as biased, discriminatory, or harmful content, raising significant concerns about their safety and ethical implications [5], [6], [7], [8]. Such issues are particularly problematic when LLMs are deployed in sensitive domains such as healthcare [9], [10], legal systems [11], and public policy, where the consequences of harmful outputs can be severe and far-reaching.

As the adoption of LLMs continues to expand across diverse applications, ensuring their robustness, fairness, and content safety has become a critical priority. Addressing these challenges requires robust evaluation frameworks that can systematically identify potential risks and vulnerabilities in LLMs [12], [13]. Among existing approaches, red teaming has emerged as an effective methodology for probing and assessing the behavior of language models [12], [14]. Red teaming involves the design and execution of adversarial test cases, which are crafted to reveal instances of undesirable behavior in the target model. By discovering failure modes, red teaming can guide developers to make targeted security alignments for models.
The success of red teaming depends on the quality and diversity of test cases. Early red teaming relied on manual generation of test cases. Although effective in some cases, this approach has significant limitations, including high labor costs, low scalability, and limited diversity [15], [16], [17], [18]. These constraints drove the development of automated red teaming methods that leverage the generative capabilities of LLMs themselves [19], [20]. Automated methods perform relatively well in increasing the diversity and size of test cases, but pose new challenges. One of the most significant limitations of existing automated methods is their inefficiency in querying the target model: they often require a prohibitively large number of queries to uncover a small subset of problematic outputs [21], [22]. This inefficiency not only increases computational costs but also makes these approaches impractical for real-world applications, especially where query budgets are constrained by financial or technical limitations, such as commercial API usage limits. Moreover, many existing methods rely heavily on predefined datasets for filtering, selecting, or augmenting test cases. While these datasets provide structure and focus to the assessment process, they are inherently limited by their static nature. On the one hand, as LLMs evolve, new vulnerabilities and failure modes may emerge, making static datasets insufficient for a comprehensive assessment. On the other hand, creating and maintaining high-quality datasets is a resource-intensive process that requires a great deal of expertise and effort [1], [22]. These limitations highlight the need for red teaming frameworks that are not only efficient but also adaptive and dataset-independent.

To overcome these limitations, we propose a red teaming method called Query-Efficient Adaptive Red Teaming (RAPT), which combines query efficiency and dataset independence to improve the effectiveness of LLM red teaming. Unlike traditional methods that rely on predefined datasets or exhaustive queries, RAPT uses a dynamic generation-selection framework that iteratively optimizes strategies based on real-time feedback from the target model. To improve red teaming efficiency under resource constraints, RAPT decouples test case generation and selection: the generator uses contrastive prompt templates and iteratively updated examples to ensure test case diversity, while the selector is implemented as a reinforcement learning agent that prioritizes high-quality test cases based on model feedback to ensure test case effectiveness. This modular design enables independent optimization of the two components using lighter-weight optimization methods, reducing reliance on labeled data and enhancing efficiency in resource-constrained, black-box environments. The entire process is conducted through four iterative steps: (1) generating test cases with a prompt-based generator, (2) filtering candidate cases with a reinforcement learning-based selector, (3) evaluating the target model, and (4) optimizing both modules based on test results. This design enables RAPT to dynamically adapt to different model behaviors while ensuring broad coverage and high query efficiency.

The test case generator is responsible for producing a comprehensive and effective set of test cases. To achieve this, we introduce an in-context learning-based generation method that leverages a contrast prompt template and a dynamically updated demonstration set.
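The four-step generate-select loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the generator, selector, target model, and safety classifier are caller-supplied callables, and the contrast prompt wording and success threshold are assumptions.

```python
def contrast_prompt(effective, ineffective):
    """Build an illustrative contrast prompt from prior demonstrations."""
    return (
        "Write diverse test questions for a chatbot.\n"
        f"Questions that worked: {effective}\n"
        f"Questions that did not: {ineffective}\n"
    )

def rapt_loop(generate, select, target, classify, iterations=4, threshold=0.5):
    """One possible shape of RAPT's iterative generate-select cycle."""
    effective, ineffective = [], []
    for _ in range(iterations):
        # Step 1: the LLM-based generator proposes candidates from a
        # contrast prompt seeded with feedback from earlier iterations.
        candidates = generate(contrast_prompt(effective, ineffective))
        # Step 2: the RL-based selector decides which candidates are worth
        # spending limited target-model queries on.
        chosen = select(candidates)
        # Step 3: query the target model and score each response.
        for case in chosen:
            score = classify(target(case))
            # Step 4: refine the demonstration set with the test outcome
            # (in RAPT, both generator and selector are refined here).
            (effective if score >= threshold else ineffective).append(case)
    return effective, ineffective
```

The point of the sketch is the decoupling: generation and selection are independent callables, so each can be optimized on its own, as the text argues.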
The contrast prompt template provides explicit instructions for generating diverse test cases while incorporating both effective and ineffective examples identified in prior iterations [27]. By continuously updating the demonstration set with new examples derived from test results, the generator adapts to the evolving behavior of the target model. This adaptive refinement ensures that the generator remains effective in producing test cases that are not only diverse but also tailored to the target model's vulnerabilities.

The goal of the test case selector is to prioritize test cases based on their likelihood of success and their contribution to diversity. To this end, we formalize test case selection as a Markov decision process (MDP) [23], which allows us to employ a reinforcement learning agent to optimize the selection strategy. The RL agent is guided by a composite reward function that balances two critical objectives: maximizing the success rate of test cases and ensuring their diversity. The emphasis on diversity is particularly important in red teaming, as it enables the discovery of a broader range of vulnerabilities that might otherwise go undetected [1], [22]. By iteratively refining its selection strategy based on feedback from test results, the selector enhances the overall effectiveness of the red teaming process, making RAPT both robust and efficient.

The key contributions of this work are summarized as follows:

• We propose a query-efficient and dataset-independent red teaming approach for evaluating the content safety of LLMs. The approach eliminates reliance on static datasets and achieves high adaptability through a feedback-driven generate-select mechanism.
• We introduce an in-context learning-based test case generator that leverages a contrast prompt template and dynamic demonstration extraction to produce diverse and high-quality test cases, enabling comprehensive evaluation of target models.
• We formalize test case selection as an MDP and develop an RL-based selector with a composite reward function, allowing for efficient optimization of the selection process and improved query efficiency.
• We evaluate RAPT through extensive experiments on open dialogue tasks across multiple target models. Experimental results demonstrate that RAPT significantly outperforms existing dataset-dependent and dataset-independent methods in terms of efficiency, adaptability, and effectiveness.

Section snippets

Related work

Red teaming has emerged as a critical approach for probing and assessing the safety of large language models (LLMs). Existing red teaming approaches can be broadly categorized into dataset-dependent and dataset-independent approaches. Dataset-dependent approaches rely on predefined datasets or templates to generate adversarial inputs. Nie et al. [15] use human-computer interaction to create updateable datasets, ensuring relevance to evolving vulnerabilities. Sheng et al. [24] and Bang et al. [25]…

Problem formulation

The primary objective of red teaming for language models (LMs) is to identify a diverse set of natural language test cases that induce harmful, biased, or offensive outputs from the target model [19]. These test cases are intended to simulate real-world scenarios where language models, when deployed, could generate undesirable content that could be harmful or offensive to users.
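The composite reward described above, which balances test case success against diversity, could take a shape like the following. This is a hedged sketch only: the linear weighting `alpha`, the use of cosine distance to the nearest previously selected case, and the embedding vectors are all assumptions, not the paper's actual reward function.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def composite_reward(success_score, case_vec, history_vecs, alpha=0.5):
    """Illustrative composite reward for the RL-based selector.

    success_score: classifier score for the target model's response.
    case_vec: embedding of the candidate test case (assumed available).
    history_vecs: embeddings of previously selected cases.
    alpha: trades off effectiveness against diversity (assumption).
    """
    if not history_vecs:
        diversity = 1.0  # first case is maximally novel by convention
    else:
        # Novelty = distance to the most similar case already tried.
        diversity = min(cosine_distance(case_vec, h) for h in history_vecs)
    return alpha * success_score + (1 - alpha) * diversity
```

Under this shape, a candidate that duplicates an earlier success earns less than a novel one with the same classifier score, which is the behavior the diversity objective is meant to enforce.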
It is essential that the test cases are well-formed natural language inputs, as opposed to arbitrary or nonsensical…

Overview

RAPT is based on an iteratively optimized framework that generates a diverse set of effective test cases against the target model to assess the security of LLMs. By iteratively improving test case generation and selection strategies, RAPT ensures that each test case contributes to the discovery of vulnerabilities in the target model. As shown in Fig. 1, the framework consists of two core components: a test case generator, which produces candidate inputs for testing, and a test case selector,…

Baselines

We compare the red teaming performance of RAPT with publicly available baseline methods, including both dataset-dependent and dataset-independent methods. The dataset-dependent methods include Rand, BRT [21], and AutoRedTeamer [28]; the dataset-independent methods include ZS, FS [19], and a cold-start version of AutoRedTeamer (called AutoRedTeamer-NM). Specifically, Rand randomly selects test cases from the test set. BRT uses Bayesian optimization to select test cases based on previous test…

Illustrative example

To clearly understand how RAPT works, we provide an illustration in Table 6, which includes two iteration examples for BlenderBot-3B, with the test objective being "offensive response". This example traces the full process from test case generation to selection, testing, and demonstration set update. For each test case, we report whether it was selected by the RL-based selector under its current policy π, the model's output, the offensiveness score So assigned by the red team classifier R(o),…

Conclusion

In this paper, we present RAPT, a query-efficient and dataset-independent red teaming approach for identifying the risk of offensive outputs in LLMs. RAPT uses an adaptive framework that leverages feedback from the target model to iteratively generate and select test cases.
In particular, we propose a contrast prompt template and a diversity exemplar extraction method for guiding the generator in generating test cases with high success rates and diversity. In addition, we…

Hyperparameter search range

Table 7 shows the grid-search ranges used for tuning the hyperparameters.

Computing infrastructure used for running experiments

The experiments were conducted on a server equipped with Intel Xeon Silver 4210R processors, 8x NVIDIA 4090 GPUs, and 128 GB DDR4 RAM. The software environment used Ubuntu 20.04 LTS as the operating system and Python 3.10 as the primary programming language.

CRediT authorship contribution statement

Shuo Liu: Writing – original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation. Xiang Cheng: Writing – review & editing, Supervision, Resources, Project administration, Funding acquisition. Sen Su: Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors thank the reviewers for their valuable comments, which significantly improved this paper. This work was supported by the National Natural Science Foundation of China (Grant No. 62372051).

References (38)

- J. Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding, NAACL-HLT (2019)
- X. Wang et al., Large-scale hierarchical causal discovery via weak prior knowledge, IEEE Trans. Knowl. Data Eng. (2025)
- A. Kho et al., Some considerations for the preservation of endangered languages using low-resource machine translation, Australasian Joint Conference on Artificial Intelligence (2024)
- S. Liu et al., Multiscale temporal dynamic learning for time series classification, IEEE Trans. Knowl. Data Eng. (2025)
- S. Lin, J. Hilton, O. Evans, TruthfulQA: measuring how models mimic human falsehoods, arXiv preprint arXiv:…
- N. Carlini et al., Extracting training data from large language models, 30th USENIX Security Symposium (USENIX Security 21) (2021)
- E.M. Bender et al., On the dangers of stochastic parrots: can language models be too big?, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021)
- L. Weidinger, J. Uesato, M. Rauh, et al., Ethical and social risks of harm from language models, arXiv preprint arXiv:…
- A.E.W. Johnson et al., MIMIC-III, a freely accessible critical care database, Sci. Data (2017)
- D.P. Panagoulias, M. Virvou, G.A. Tsihrintzis, Evaluating LLM-generated multimodal diagnosis from medical images and…
- I. Chalkidis et al., Legal-BERT: the muppets straight out of law school, EMNLP (2020)
- D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al., …
- T.B. Brown et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (2020)
- E. Dinan et al., Build it break it fix it for dialogue safety: robustness from adversarial human attack, EMNLP (2019)
- Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, D. Kiela, Adversarial NLI: a new benchmark for natural language…
- M. Nadeem et al., StereoSet: measuring stereotypical bias in pretrained language models, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)
- B. Deng, W. Wang, F. Feng, Y. Deng, Q. Wang, X. He, Attack prompt generation for red teaming and defending large…
- R. Zhang, Improving robustness of text classifiers against human- and model-imperceptible backdoor triggers, EMNLP (2021)
- E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, G. Irving, Red teaming language…

© 2025 Elsevier B.V. All rights are reserved, including those for text and data mining, AI training, and similar technologies.