Instant research discovery
Search and browse ingested papers, with intelligence signals and fast filtering by year, area, and tag.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety (Antonio-Gabriel Chacón Menke, Eiji Kamioka, Phan Xuan Tan) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E5 / R3 (95%) | - |
| AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI (Kwangyoung Lee, Joseph Seering, So Jung Lee, Jeanne Choi) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E5 / R4 (98%) | - |
| Falcon: A Cross-Modal Evaluation Dataset for Comprehensive Safety Perception (Qi Xue, Runjia Zhang, Minrui Jiang, Pei Ke) | 2025 | Multimodal Safety | ai-safety, dataset, multimodal-safety, safety-evaluation | E5 / R3 (96%) | - |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning (Chandan K. Reddy, Zhehao Zhang, Fanyou Wu, Weijie Xu) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E5 / R3 (95%) | 13 |
| FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research (Gabriel Recchia, Chatrik Singh Mangat, Gayatri Krishnakumar, Issac Li) | 2025 | Scalable Oversight | scalable-oversight, ai-safety, dataset | E6 / R4 (93%) | - |
| MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity (Neev Parikh, Hjalmar Wijk) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E4 / R3 (99%) | - |
| RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models (Chris Ngo, Truong-Son Hy, Quy-Anh Dang) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation, red-teaming | E5 / R4 (95%) | - |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data (Cihang Xie, Jieru Mei, Yuhan Wang, Bhavya Kailkhura) | 2025 | Alignment Training | alignment-training, ai-safety, dataset | E5 / R3 (95%) | 41 |
| TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research (Dhruv Nathawani, Luke Marks, Amir Abdullah, Philip Quirke) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, dataset, interpretability | E6 / R3 (95%) | 2 |
| UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases (Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang) | 2025 | Alignment Training | alignment-training, ai-safety, dataset | E7 / R4 (95%) | 4 |
| WebGuard: Building a Generalizable Guardrail for Web Agents (Qinyuan Zheng, Zeyi Liao, Yu Su, Xiang Deng) | 2025 | Agent Safety | agent-safety, ai-safety, dataset | E4 / R3 (95%) | 7 |
| Working Towards Toxic datasets for LLM Safeguarding (Liuye Guo, Tieke He, Zhipeng Wang, Ziyuan Wang) | 2025 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E4 / R2 (93%) | - |
| Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition (Victor Klemm, Mario Fritz, Robin Schmid, Sahar Abdelnabi) | 2024 | Adversarial Robustness | ai-safety, adversarial-robustness, dataset | E5 / R3 (97%) | 19 |
| Hummer: Towards Limited Competitive Preference Dataset (Qingpei Guo, Jun Zhou, Li Jiang, Jingqing Ruan) | 2024 | Alignment Training | alignment-training, ai-safety, adversarial-robustness, dataset | E6 / R3 (96%) | 10 |
| PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models (Juntao Dai, Boyuan Chen, Jiaming Ji, Kaile Wang) | 2024 | Alignment Training | alignment-training, ai-safety, dataset | E5 / R3 (96%) | 127 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models (Yifeng Gao, Guodong Zheng, Yu Qiao, Feng Zhao) | 2024 | Multimodal Safety | alignment-training, ai-safety, dataset, multimodal-safety | E6 / R3 (95%) | 68 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset (Yizhou Wang, Juntao Dai, Boyuan Chen, Mickel Liu) | 2023 | Alignment Training | alignment-training, ai-safety, dataset | E5 / R3 (95%) | 767 |
| Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (Xudong Han, Timothy Baldwin, Preslav Nakov, Haonan Li) | 2023 | Safety Evaluation | ai-safety, dataset, safety-evaluation | E5 / R3 (95%) | 158 |
| OpenAssistant Conversations - Democratizing Large Language Model Alignment (Abdelpakey Mohamed Hassan, Sotiris Sber, Brendan Schwind, Oliver Mütze) | 2023 | Alignment Training | alignment-training, ai-safety, dataset | E5 / R3 (95%) | 811 |
| Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game (Olivia Watkins, Luke Bailey, Alan Ritter, Sam Toyer) | 2023 | Adversarial Robustness | ai-safety, adversarial-robustness, dataset | E6 / R3 (97%) | 108 |
| Natural Adversarial Examples (Dan Hendrycks, Dawn Song, Kevin Zhao, Steven Basart) | 2021 | Adversarial Robustness | ai-safety, adversarial-robustness, dataset | E5 / R3 (95%) | 1794 |
| Specification Gaming: The Flip Side of AI Ingenuity (Jonathan Uesato, Jan Leike, Ramana Kumar, Vladimir Mikulik) | 2020 | Deception & Failure | ai-safety, dataset, deception-failure | E3 / R2 (91%) | - |
Showing 22 of 22 papers on page 1.