Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.
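A minimal sketch of what that filtering could look like over the fields shown in each entry below (title, authors, year, area, citations, tags, signal); the PaperRecord shape and filterPapers helper are illustrative assumptions, not PaperIntel's actual API.

    // Illustrative record shape inferred from the fields shown in each card;
    // PaperRecord and filterPapers are assumptions, not PaperIntel's API.
    interface PaperRecord {
      title: string;
      authors: string[];
      year: number;
      area: string;
      citations: number | null; // "-" in the listing means no count available
      tags: string[];
      signal: string;           // e.g. "E5 / R3 (95%)"
    }

    // Keep papers that match the requested area and carry every requested tag.
    function filterPapers(papers: PaperRecord[], area?: string, tags: string[] = []): PaperRecord[] {
      return papers.filter(
        (p) => (area === undefined || p.area === area) && tags.every((t) => p.tags.includes(t)),
      );
    }

    // Example: all Alignment Training papers tagged "dataset".
    // const hits = filterPapers(papers, "Alignment Training", ["dataset"]);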

PaperIntel
Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

Antonio-Gabriel Chacón Menke, Eiji Kamioka, Phan Xuan Tan

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, dataset, safety-evaluation

E5 / R3 (95%)

AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI

Kwangyoung Lee, Joseph Seering, So Jung Lee, Jeanne Choi

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, dataset, safety-evaluation

E5 / R4 (98%)

Falcon: A Cross-Modal Evaluation Dataset for Comprehensive Safety Perception

Qi Xue, Runjia Zhang, Minrui Jiang, Pei Ke

Year: 2025 · Area: Multimodal Safety · Citations: -

Tags: ai-safety, dataset, multimodal-safety, safety-evaluation

E5 / R3 (96%)

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Chandan K. Reddy, Zhehao Zhang, Fanyou Wu, Weijie Xu

Year: 2025 · Area: Safety Evaluation · Citations: 13

Tags: ai-safety, dataset, safety-evaluation

E5 / R3 (95%)

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

Gabriel Recchia, Chatrik Singh Mangat, Gayatri Krishnakumar, Issac Li

Year: 2025 · Area: Scalable Oversight · Citations: -

Tags: scalable-oversight, ai-safety, dataset

E6 / R4 (93%)

MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

Neev Parikh, Hjalmar Wijk

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, dataset, safety-evaluation

E4 / R3 (99%)

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Chris Ngo, Truong-Son Hy, Quy-Anh Dang

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, dataset, safety-evaluation, red-teaming

E5 / R4 (95%)

STAR-1: Safer Alignment of Reasoning LLMs with 1K Data

Cihang Xie, Jieru Mei, Yuhan Wang, Bhavya Kailkhura

Year: 2025 · Area: Alignment Training · Citations: 41

Tags: alignment-training, ai-safety, dataset

E5 / R3 (95%)

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Dhruv Nathawani, Luke Marks, Amir Abdullah, Philip Quirke

Year: 2025 · Area: Mechanistic Interpretability · Citations: 2

Tags: mechanistic-interp, ai-safety, dataset, interpretability

E6 / R3 (95%)

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang

Year: 2025 · Area: Alignment Training · Citations: 4

Tags: alignment-training, ai-safety, dataset

E7 / R4 (95%)

WebGuard: Building a Generalizable Guardrail for Web Agents

Qinyuan Zheng, Zeyi Liao, Yu Su, Xiang Deng

Year: 2025 · Area: Agent Safety · Citations: 7

Tags: agent-safety, ai-safety, dataset

E4 / R3 (95%)

Working Towards Toxic datasets for LLM Safeguarding

Liuye Guo, Tieke He, Zhipeng Wang, Ziyuan Wang

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, dataset, safety-evaluation

E4 / R2 (93%)

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition

Victor Klemm, Mario Fritz, Robin Schmid, Sahar Abdelnabi

Year: 2024 · Area: Adversarial Robustness · Citations: 19

Tags: ai-safety, adversarial-robustness, dataset

E5 / R3 (97%)

Hummer: Towards Limited Competitive Preference Dataset

Qingpei Guo, Jun Zhou, Li Jiang, Jingqing Ruan

Year: 2024 · Area: Alignment Training · Citations: 10

Tags: alignment-training, ai-safety, adversarial-robustness, dataset

E6 / R3 (96%)

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Juntao Dai, Boyuan Chen, Jiaming Ji, Kaile Wang

Year: 2024 · Area: Alignment Training · Citations: 127

Tags: alignment-training, ai-safety, dataset

E5 / R3 (96%)

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

Yifeng Gao, Guodong Zheng, Yu Qiao, Feng Zhao

Year: 2024 · Area: Multimodal Safety · Citations: 68

Tags: alignment-training, ai-safety, dataset, multimodal-safety

E6 / R3 (95%)

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Yizhou Wang, Juntao Dai, Boyuan Chen, Mickel Liu

Year: 2023 · Area: Alignment Training · Citations: 767

Tags: alignment-training, ai-safety, dataset

E5 / R3 (95%)

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Xudong Han, Timothy Baldwin, Preslav Nakov, Haonan Li

Year: 2023 · Area: Safety Evaluation · Citations: 158

Tags: ai-safety, dataset, safety-evaluation

E5 / R3 (95%)

OpenAssistant Conversations - Democratizing Large Language Model Alignment

Abdelpakey Mohamed Hassan, Sotiris Sber, Brendan Schwind, Oliver Mütze

Year: 2023 · Area: Alignment Training · Citations: 811

Tags: alignment-training, ai-safety, dataset

E5 / R3 (95%)

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Olivia Watkins, Luke Bailey, Alan Ritter, Sam Toyer

Year: 2023 · Area: Adversarial Robustness · Citations: 108

Tags: ai-safety, adversarial-robustness, dataset

E6 / R3 (97%)

Natural Adversarial Examples

Dan Hendrycks, Dawn Song, Kevin Zhao, Steven Basart

Year: 2021 · Area: Adversarial Robustness · Citations: 1794

Tags: ai-safety, adversarial-robustness, dataset

E5 / R3 (95%)

Specification Gaming: The Flip Side of AI Ingenuity

Jonathan Uesato, Jan Leike, Ramana Kumar, Vladimir Mikulik

Year: 2020 · Area: Deception & Failure · Citations: -

Tags: ai-safety, dataset, deception-failure

E3 / R2 (91%)

Showing 22 of 22 papers.