
Paper deep dive

Red teaming large language models: A comprehensive review and critical analysis

Muhammad Shahid Jabbar, Sadam Al-Azani, Abrar Alotaibi, Moataz Ahmed

Year: 2025 · Venue: Information Processing & Management · Area: Safety Evaluation · Type: Survey · Embeddings: 19

Models: Claude, GPT-3.5, GPT-4, Gemini, LLaMA, Mistral

Abstract

Securing large language models (LLMs) remains a critical challenge as their adoption across various sectors rapidly grows. While advancements in LLM d…

Tags

ai-safety (imported, 100%) · red-teaming (suggested, 80%) · safety-evaluation (suggested, 80%) · survey (suggested, 88%)

Links

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 99%

Last extracted: 3/12/2026, 5:30:38 PM

Summary

This paper provides a comprehensive review and critical analysis of red teaming strategies for large language models (LLMs). It introduces a novel taxonomy for classifying attack strategies—including prompt-based, data manipulation, model exploitation, information extraction, and model degradation attacks—and evaluates current benchmarks, datasets, and metrics used for assessing LLM security and risk.
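The five attack categories named above can be read as a simple lookup structure. The sketch below is not from the paper; it is a minimal illustration of the taxonomy as data, and the example techniques placed under each category are hypothetical assignments for demonstration only.

```python
# Minimal sketch of the review's five-way attack taxonomy as a lookup table.
# The techniques listed under each category are illustrative placements,
# not the paper's exact assignments.
ATTACK_TAXONOMY = {
    "prompt-based": ["jailbreaking", "prompt injection"],
    "data manipulation": ["training-data poisoning"],
    "model exploitation": ["adversarial examples"],
    "information extraction": ["training-data extraction", "membership inference"],
    "model degradation": ["resource-exhaustion attacks"],
}

def classify(technique: str):
    """Return the taxonomy category containing `technique`, or None."""
    for category, techniques in ATTACK_TAXONOMY.items():
        if technique in techniques:
            return category
    return None
```

A flat dict keeps the taxonomy easy to extend as new attack techniques appear, which the paper notes is a constant in this field.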

Entities (6)

Blue Teaming · methodology · 100%
GPT-4 · model · 100%
Large Language Models · technology · 100%
PaLM · model · 100%
Purple Teaming · methodology · 100%
Red Teaming · methodology · 100%

Relation Signals (3)

Blue Teaming defends Large Language Models

confidence 100% · blue teaming focuses on defensive strategies as it is responsible for detecting, mitigating, and preventing attacks

Red Teaming evaluates Large Language Models

confidence 100% · red teaming for LLMs involves simulating adversarial attacks and misuse scenarios to identify vulnerabilities in LLMs

Purple Teaming integrates Red Teaming

confidence 100% · purple teaming represents a collaborative integration of these approaches by compounding the insights of both red and blue teams

Cypher Suggestions (2)

List all models mentioned in the context of security · confidence 95% · unvalidated

MATCH (m:Model) RETURN m.name

Find all methodologies used to secure LLMs · confidence 90% · unvalidated

MATCH (m:Methodology)-[:SECURES|EVALUATES|DEFENDS]->(l:Technology {name: 'Large Language Models'}) RETURN m.name
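For context, here is a minimal sketch of the graph these unvalidated queries assume, seeded from the entities and relation signals listed above. The `SECURES` relationship type in the second query is the tool's assumption; only `EVALUATES`, `DEFENDS`, and `INTEGRATES` appear in the extracted signals, so that branch would match nothing against this data.

```cypher
// Hypothetical seed data mirroring the extracted entities and relation signals.
CREATE (llm:Technology {name: 'Large Language Models'})
CREATE (:Model {name: 'GPT-4'})
CREATE (:Model {name: 'PaLM'})
CREATE (red:Methodology {name: 'Red Teaming'})
CREATE (blue:Methodology {name: 'Blue Teaming'})
CREATE (purple:Methodology {name: 'Purple Teaming'})
CREATE (red)-[:EVALUATES]->(llm)
CREATE (blue)-[:DEFENDS]->(llm)
CREATE (purple)-[:INTEGRATES]->(red)
```

Against this seed data, the second suggested query would return 'Red Teaming' and 'Blue Teaming'.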

Full Text

18,244 characters extracted from source content.


Information Processing & Management, Volume 62, Issue 6, November 2025, 104239

Red teaming large language models: A comprehensive review and critical analysis

Muhammad Shahid Jabbar, Sadam Al-Azani, Abrar Alotaibi, Moataz Ahmed

https://doi.org/10.1016/j.ipm.2025.104239

Highlights

• Comprehensive review of red teaming for large language models (LLMs).
• Novel taxonomy for classifying LLM red teaming attack strategies.
• Detailed evaluation of datasets, metrics, and benchmarks for LLM security.
• Integration of recent research to update the state of LLM security.

Abstract

Securing large language models (LLMs) remains a critical challenge as their adoption across various sectors rapidly grows. While advancements in LLM development have enhanced their capabilities, inherent vulnerabilities continue to pose significant risks, exposing these models to various forms of attack. This study provides a comprehensive review of LLMs’ red teaming, distinguished by its broad coverage and intuitive organization. It systematically explores a range of red teaming attacks, including prompt-based attacks, data manipulation attacks, model exploitation attacks, information extraction attacks, and model degradation attacks. Additionally, it provides a critical review and analysis of evaluation methods and benchmarks, focusing on datasets, evaluation metrics, and benchmarking techniques used in LLM red teaming and risk assessment. Our review reflects the current state of LLM security and provides new insights alongside established methods by integrating recent and impactful research. The structured presentation of our findings offers a comprehensive and actionable resource, facilitating a deeper understanding of the complexities involved.
This review highlights the proactive assessment of risk and exploitation potential, and contributes to the development of more secure and responsible LLMs, serving as a valuable guide for researchers, practitioners, and policymakers.

Introduction

The growing integration of large language models (LLMs) across diverse domains, ranging from healthcare to finance, transportation, and entertainment, highlights the critical need for these models to operate safely and responsibly (Huang, Sun et al., 2024, Ozkaya, 2023). Ensuring the ethical and secure deployment of LLMs is essential to prevent potential harms (Hadi et al., 2023), such as the spread of misinformation (Sun et al., 2024), biased decision-making (Echterhoff et al., 2024, Eigner and Händler, 2024), or privacy violations (Yao et al., 2024). The remarkable capabilities of LLMs can be attributed to several factors, including the use of large-scale raw text datasets and advanced training architectures. For instance, models like GPT-4 (Achiam et al., 2023), with an estimated 1 trillion parameters, and PaLM (Chowdhery et al., 2023), trained on over 700 billion tokens, demonstrate the power of extensive data and complex architectures. However, as these models become more integrated into real-world systems, they also expose new and significant risks. Exposure to adversarial attacks, privacy breaches, biases in training data, and unintended harmful outputs pose substantial challenges, making the security and safety of LLMs a critical area of concern.

To address these risks, red teaming has become an indispensable practice in evaluating and strengthening the security of LLMs. Traditionally used in cybersecurity, red teaming for LLMs involves simulating adversarial attacks and misuse scenarios to identify vulnerabilities (Perez et al., 2022). By rigorously testing these models, red teams can uncover weaknesses before they are exploited in real-world situations (Ma et al., 2024).
In this offensive role, the red team rigorously tests the model, seeking to identify weaknesses and inform improvements. This is essential as the landscape of attack techniques continues to evolve, with adversaries devising increasingly sophisticated methods to compromise LLMs. This proactive approach not only helps enhance the security of LLMs but also fosters trust and accountability. Complementing red teaming, blue teaming focuses on defensive strategies: it is responsible for detecting, mitigating, and preventing attacks, thereby strengthening the system’s overall resilience. Purple teaming, in turn, represents a collaborative integration of these approaches, combining the insights of both red and blue teams to create a more holistic and adaptive security strategy. In Fig. 1(a), the red circle represents the proactive attack simulations of the red team, the blue circle represents the defensive measures of the blue team, and the overlapping purple area represents the integrated efforts of purple teaming in the context of LLMs. The detailed workflow of the red teaming process in securing LLMs is illustrated in Fig. 1(b).

The primary motivation behind this review is to address the growing need for a comprehensive understanding of the threats and attacks that LLMs face. While some existing reviews have touched upon specific vulnerabilities or attack vectors, there remains a lack of a systematic and unified perspective encompassing the full spectrum of red teaming and benchmarks for LLM security.

Our motivation is further driven by the increasing reliance on LLMs in sensitive domains such as healthcare (Yang, Tan et al., 2023), finance (Li et al., 2023), and security (Motlagh et al., 2024), where the consequences of successful attacks can be severe. As LLMs are deployed in these high-stakes environments, it is crucial to ensure that they are not only effective but also safe and trustworthy.
This review seeks to contribute to that goal by identifying key exploitation and attack strategies, comprehensively evaluating their effectiveness, and providing a roadmap for future research and development in LLM red teaming for better security.

Additionally, the evolving landscape of LLM technology necessitates a continuous reassessment of risks and defenses. With the introduction of more advanced models, new attack vectors are likely to emerge, and existing defenses may become outdated. This review aims to stay ahead of these developments by synthesizing the latest research, offering insights into the current state of LLM security, and highlighting areas where further investigation is needed. Fig. 2 provides a comprehensive outline of the key aspects of LLM red teaming considered in this review. Ultimately, our motivation is to promote a deeper and more complete understanding of the risks associated with LLMs and to support the development of more resilient and secure systems that can better withstand adversarial challenges.

The key contributions of this study are as follows:

• Comprehensive exploration of LLM red teaming: This study offers a detailed and systematic examination of the red teaming landscape for LLMs, covering a wide array of attack types and techniques, along with their strengths and challenges, surpassing existing reviews in both scope and depth.
• Novel taxonomy and classification: We introduce an intuitive taxonomy for categorizing LLM red teaming attacks, providing a more structured and accessible framework that clarifies the complex landscape of LLM security testing.
• Advanced evaluation frameworks: We analyze the evaluation frameworks, including datasets, metrics, and benchmarks, used in LLM red teaming and its assessment.
This systematic analysis enhances the effectiveness of security testing and promotes realistic threat simulation.
• Updated state of the field: This study integrates the latest developments in LLM red teaming, addressing the most pressing issues and providing a broader, more complete understanding of LLM security challenges, offering up-to-date information that surpasses existing reviews.

The rest of this article is organized as follows. Section 2 describes the review methodology, detailing the scope, comparison with related reviews, and the review process, including research evaluation criteria and questions. Section 3 provides a glossary of commonly used terminology in LLM security and a comparison with related reviews. Section 4 examines various red teaming attacks on LLMs, categorized into prompt-based attacks, data manipulation attacks, model exploitation attacks, information extraction attacks, and model degradation attacks. Section 5 covers evaluation methods and benchmarks, assessing datasets, evaluation metrics, and benchmarking methods used in LLM red teaming and risk assessment. Section 6 presents a discussion of current research trends, along with a critical examination of the domain’s limitations. Section 7 explores the broader implications of this study and its potential impact.
Finally, Section 8 concludes the article.

Section snippets

Review method

Our review method is structured as follows:
• Search Strategy Design: We designed a search strategy that includes targeted search phrases and selected databases to identify relevant studies.
• Study Selection: We applied study selection criteria to filter the initial set of studies and selected the most relevant papers for in-depth analysis.
• Quality Assessment: We assessed the quality of the selected studies based on defined assessment criteria to ensure the credibility and reliability of the

Background

This section begins by defining key terms frequently encountered in LLM red teaming. We then examine related reviews to identify their scope, strengths, and limitations, which motivate the comprehensive approach of our study.

Red teaming attacks on LLMs

Red teaming attacks on LLMs represent a critical area of research aimed at rigorously testing and evaluating the robustness, safety, and security of these advanced models. These attacks are designed to simulate adversarial behavior and assess misuse potential, to uncover vulnerabilities that could be exploited in real-world scenarios. By systematically probing the models across various attack vectors, red teaming provides invaluable insights into potential failure points and informs the

Evaluation methods and benchmarks

A significant trend in evaluating LLMs against red teaming attacks is the use of automated evaluation pipelines that simulate attacks and defenses in a controlled environment. These pipelines leverage adversarial attack libraries and testing suites to systematically evaluate the model’s performance under various threat scenarios, as illustrated in Fig. 4. But despite the advantages of automation, human-in-the-loop evaluation remains crucial for assessing the real-world impact of red teaming

Analysis and Discussion

Throughout this review, we have thoroughly addressed the research questions that guided our inquiry (Section 2.1). We have identified major trends in LLM red teaming, categorized various methodologies, and explored the primary tasks these efforts aim to address. Our discussion of LLM vulnerabilities and exploits, along with an analysis of key failure modes, provides a comprehensive understanding of the threats faced by these systems. We have also examined strategies for developing effective red

Implications of this Study

We have thoroughly examined the landscape of red teaming for LLMs in this comprehensive review, focusing specifically on vulnerability exploits, potential attacks, risk evaluation methods, and benchmarks. Our exploration, guided by well-defined research questions, has provided critical insights expected to advance both current knowledge and the literature on LLM security. In this section, we highlight the theoretical and practical implications of our study, drawing on key insights from our

Conclusion

LLMs are increasingly pivotal across various domains. However, despite their transformative capabilities, these models present significant vulnerabilities that expose them to various forms of attack. Understanding these vulnerabilities and developing robust safeguards is essential for ensuring their secure and responsible deployment. This review offers a comprehensive and novel examination of LLM red teaming, distinguished by its broad scope and intuitive organization. We systematically explore

CRediT authorship contribution statement

Muhammad Shahid Jabbar: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing.
Sadam Al-Azani: Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Validation, Writing – review & editing. Abrar Alotaibi: Data curation, Formal analysis, Investigation, Writing – original draft. Moataz Ahmed: Conceptualization, Formal analysis, Funding acquisition, Project administration, Supervision,

Acknowledgments

The authors would like to acknowledge the support received from the Saudi Data and AI Authority (SDAIA) and King Fahd University of Petroleum and Minerals (KFUPM) under the SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Saudi Arabia Grant JRC-AI-RFP-20.

References (207)

P. Fortuna et al. How well do hate speech, toxicity, abusive and offensive language classification models generalize across datasets? Information Processing & Management (2021)
D. Li et al. Maximizing discrimination masking for faithful question answering with machine reading. Information Processing & Management (2025)
Y. Liu et al. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Information Processing & Management (2024)
S. Abdali et al. Securing large language models: Threats, vulnerabilities and responsible practices (2024)
J. Achiam et al. GPT-4 technical report (2023)
H. Aghakhani et al. TrojanPuzzle: Covertly poisoning code-suggestion models
F.K. Akın. Awesome-chatgpt-prompts (2025)
M. Al Ghanim et al. Jailbreaking LLMs with Arabic transliteration and Arabizi
M.T. Alam et al. CTIBench: A benchmark for evaluating LLMs in cyber threat intelligence (2024)
U. Anwar et al. Foundational challenges in assuring alignment and safety of large language models (2024)
C. Ashcroft et al. Evaluation of domain-specific prompt engineering attacks on large language models. ESS Open Archive Eprints (2024)
K.E. Awoufack. Adversarial prompt transformation for systematic jailbreaks of LLMs (2024)
Y. Bai et al. Constitutional AI: Harmlessness from AI feedback (2022)
E.M. Bender et al. On the dangers of stochastic parrots: Can language models be too big?
R. Bhardwaj et al. Language model unalignment: Parametric red-teaming to expose hidden harms and biases (2023)
R. Bhardwaj et al. Red-teaming large language models using chain of utterances for safety-alignment (2023)
M. Bhatt et al. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models (2024)
F. Bianchi et al. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions (2024)
R. Bommasani et al. On the opportunities and risks of foundation models (2022)
Y. Cao et al. SceneTAP: Scene-coherent typographic adversarial planner against vision-language models in real-world environments (2025)
N. Carlini et al. Membership inference attacks from first principles
N. Carlini et al. Extracting training data from diffusion models
N. Carlini et al. Poisoning web-scale training datasets is practical
N. Carlini et al. Extracting training data from large language models
S. Casper et al. Open problems and fundamental limitations of reinforcement learning from human feedback. Transactions on Machine Learning Research (2023)
T. Chakraborty et al. Cross-modal safety alignment: Is textual unlearning all you need? (2024)
P. Chao et al. Jailbreaking black box large language models in twenty queries (2024)
X. Che et al. Detecting fake information with knowledge-enhanced AutoPrompt. Neural Computing and Applications (2024)
X. Chen et al. The Janus interface: How fine-tuning in large language models amplifies the privacy risks
M. Chen et al. Evaluating large language models trained on code (2021)
Z. Chen et al. AgentPoison: Red-teaming LLM agents via poisoning memory or knowledge bases
M. Choudhury et al. How linguistically fair are multilingual pre-trained language models? Proceedings of the AAAI Conference on Artificial Intelligence (2021)
A. Chowdhery et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research (2023)
A.G. Chowdhury et al. Breaking down the defenses: A comparative survey of attacks on large language models (2024)
J. Cui et al. OR-Bench: An over-refusal benchmark for large language models (2024)
T. Cui et al. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems (2024)
S. Cui et al. FFT: Towards harmlessness evaluation and analysis for LLMs with factuality, fairness, toxicity (2024)
B.C. Das et al. Security and privacy challenges of large language models: A survey. ACM Computing Surveys (2025)
A. Das et al. Exposing vulnerabilities in clinical LLMs through data poisoning attacks: Case study in breast cancer. medRxiv (2024)
Y. Deng et al. Multilingual jailbreak challenges in large language models
J. Dhamala et al. BOLD: Dataset and metrics for measuring biases in open-ended language generation
P. Ding et al. A Wolf in Sheep’s Clothing: Generalized nested jailbreak prompts can fool large language models easily (2024)
Y. Dong et al. Safeguarding large language models: A survey (2024)
Z. Dong et al. Attacks, defenses and evaluations for LLM conversation safety: A survey (2024)
J. Echterhoff et al. Cognitive bias in decision-making with LLMs (2024)
E. Eigner et al. Determinants of LLM-assisted decision-making (2024)
I.O. Gallegos et al. Bias and fairness in large language models: A survey. Computational Linguistics (2024)
D. Ganguli et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned (2022)
K. Gao et al. Adversarial robustness for visual grounding of multimodal large language models (2024)

© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.