Paper deep dive
Working Towards Toxic datasets for LLM Safeguarding
Liuye Guo, Ziyuan Wang, Zhipeng Wang, Tieke He
Models: Llama3-8B
Abstract
Large language models (LLMs) have made remarkable strides, expanding their applications from casual dialogue to a wide spectrum of AI tasks. However, concerns about their reliability persist, particularly regarding their tendency to produce toxic content. In response, researchers have developed various toxicity benchmarks to evaluate LLM safety. Nevertheless, as LLMs scale and attack strategies grow more sophisticated, existing benchmarks fall short in assessing the safety of modern models, and handling increasingly complex and diverse harmful inputs has become a major challenge. To mitigate these risks and leverage existing high-quality datasets, we propose a novel two-stage framework consisting of augmenting red teaming attack templates and generating a human preference dataset. First, we analyze existing red teaming methods, identifying key strategies that effectively exploit LLM vulnerabilities, and then augment high-quality red teaming attack templates using in-context learning and task-oriented prompt engineering. Next, we combine the generated attack templates with existing toxicity datasets to generate diverse, high-quality toxic prompts, intended to elicit toxic responses, across five threat scenarios: toxicity, stereotype bias, machine ethics, truthfulness, and privacy. To defend against these attacks, we design a Chain-of-Thought (CoT)-based safety guardrail, structured as a toxic prompt, a toxic response, and a safety prompt, to generate human preference responses across scenarios. Consequently, we construct a toxic preference dataset containing 7,502 instances, each consisting of a toxic prompt, a toxic response, and a human preference response. We fine-tune the Llama3-8B model on this dataset to enhance its robustness against red teaming attacks. Extensive experiments on five representative LLMs, including our fine-tuned model, demonstrate the effectiveness of our framework in both attack augmentation and human-aligned response generation.
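The abstract describes each dataset instance as a (toxic prompt, toxic response, human preference response) triple. As a rough illustration, the sketch below shows how such records might be mapped into the prompt/chosen/rejected layout commonly used for pairwise preference fine-tuning; the field names, scenario tag, and JSONL layout are assumptions for illustration, not the authors' released schema.

import json

# One instance of the toxic preference dataset as described in the abstract:
# a toxic prompt, the toxic response it elicited, and a human preference response.
# Field names are illustrative assumptions, not the paper's released format.
instance = {
    "toxic_prompt": "<red teaming prompt built from an augmented attack template>",
    "toxic_response": "<harmful output elicited by the prompt>",
    "preference_response": "<safe, human-aligned response produced via the CoT guardrail>",
    "scenario": "toxicity",  # one of: toxicity, stereotype bias, machine ethics, truthfulness, privacy
}

def to_preference_record(item: dict) -> dict:
    # Map an instance to the prompt/chosen/rejected layout used by common
    # preference-tuning pipelines (the target layout is an assumption).
    return {
        "prompt": item["toxic_prompt"],
        "chosen": item["preference_response"],  # preferred, safe response
        "rejected": item["toxic_response"],     # dispreferred, toxic response
    }

# Write records as JSONL so they can be streamed into a fine-tuning job.
with open("toxic_preference.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(to_preference_record(instance)) + "\n")

Applied to all 7,502 instances, a mapping like this could feed the pairwise preference fine-tune of Llama3-8B that the abstract describes.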
Tags
Links
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/11/2026, 1:30:30 AM
Summary
The paper introduces a two-stage framework for LLM safeguarding by augmenting red teaming attack templates and creating a human preference dataset. The authors construct a dataset of 7,502 instances covering five threat scenarios (toxicity, stereotype bias, machine ethics, truthfulness, and privacy) and fine-tune the Llama3-8B model to improve robustness against toxic inputs.
Entities (4)
Relation Signals (2)
Llama3-8B → trainedon → Toxic Preference Dataset
confidence 95% · We fine-tune the Llama3-8B model on this dataset
Chain-of-Thought (CoT) → usedfor → LLM Safeguarding
confidence 90% · we design a Chain-of-Thought (CoT)-based safety guardrail
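The Cypher suggestions below only return results once relation signals like the two above are materialized as graph edges. A minimal sketch of that ingestion step, assuming a reachable Neo4j instance and the node labels and relationship types used in the suggested queries (LLM, Dataset, Methodology, Task, TRAINED_ON, USED_FOR); the connection details are placeholders, and this is illustrative code rather than part of the extraction tool.

from neo4j import GraphDatabase  # official Neo4j Python driver

# Placeholder connection details; replace with a real instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The relation signals extracted above, expressed with the labels and
# relationship types that the suggested Cypher queries expect.
signals = [
    ("LLM", "Llama3-8B", "TRAINED_ON", "Dataset", "Toxic Preference Dataset", 0.95),
    ("Methodology", "Chain-of-Thought (CoT)", "USED_FOR", "Task", "LLM Safeguarding", 0.90),
]

with driver.session() as session:
    for src_label, src, rel, dst_label, dst, conf in signals:
        # MERGE keeps ingestion idempotent; labels and relationship types cannot
        # be parameterized in Cypher, so they are interpolated from the fixed list above.
        session.run(
            f"MERGE (s:{src_label} {{name: $src}}) "
            f"MERGE (t:{dst_label} {{name: $dst}}) "
            f"MERGE (s)-[r:{rel}]->(t) SET r.confidence = $conf",
            src=src, dst=dst, conf=conf,
        )

driver.close()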
Cypher Suggestions (2)
Find all models fine-tuned on specific datasets · confidence 90% · unvalidated
MATCH (m:LLM)-[:TRAINED_ON]->(d:Dataset) RETURN m.name, d.name
Identify methodologies used for LLM safety · confidence 85% · unvalidated
MATCH (m:Methodology)-[:USED_FOR]->(t:Task {name: 'LLM Safeguarding'}) RETURN m.name
Full Text
800 characters extracted from source content.
Working Towards Toxic datasets for LLM Safeguarding | IEEE Conference Publication | IEEE Xplore (the remainder of the extracted text is IEEE Xplore site navigation and account boilerplate)