Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| NAAMSE: Framework for Evolutionary Security Evaluation of Agents (Harshil Patel, Parth Shah, Kunal Pai) | 2026 | Adversarial Robustness | ai-safety, adversarial-robustness, tool, safety-evaluation | E5 / R3 (93%) | - |
| AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement (Lei Sha, Junxiao Yang, Xianqi Lei, Leqi Lei) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E4 / R3 (96%) | 10 |
| AJAR: Adaptive Jailbreak Architecture for Red-teaming (Wang Yang, Yipu Dou) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, tool | E6 / R4 (97%) | - |
| ASTRAL: Automated Safety Testing of Large Language Models (Miriam Ugarte, Pablo Valle, Sergio Segura, Jose Antonio Parejo) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E6 / R3 (96%) | 5 |
| Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments (Daniel Hugenroth, Alastair R. Beresford, Christoph Schnabl, Bill Marino) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E5 / R3 (97%) | 4 |
| AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration (Francesco Pinto, Shuang Yang, Zhaorun Chen, Bo Li) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation, red-teaming | E5 / R3 (96%) | 17 |
| Between Generation and Judgment: A Cloud-Native Framework for Adversarial Evaluation of LLM Alignment (Victor Takashi Hayashi, Marcos A. Simplicio, Diego E. G. Caetano De Oliveira, C. Miers) | 2025 | Safety Evaluation | alignment-training, ai-safety, adversarial-robustness, tool, safety-evaluation | E5 / R4 (95%) | - |
| BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing (Caelin Kaplan, Neil Archibald, Alexander Warnecke) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation, red-teaming | E7 / R4 (98%) | - |
| BlockCert: Certified Blockwise Extraction of Transformer Mechanisms (Sandro Andric) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (99%) | - |
| BlueGlass: A Framework for Composite AI Safety (Syed Qutub, Kay-Ulrich Scholl, Harshal Nandigramwar) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E5 / R3 (95%) | - |
| CCS-Lib: A Python package to elicit latent knowledge from LLMs (Ben W., Eric Mungai Kinuthia, Walter Laurito, Marius Pl) | 2025 | Representation Analysis | representation-analysis, ai-safety, tool | E5 / R3 (99%) | - |
| ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models (Minfeng Zhu, Haoxuan Li, Zhen Wen, Yuchen Yang) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (97%) | - |
| Contextual Agent Security: A Policy for Every Purpose (Lillian Tsai, Eugene Bagdasarian) | 2025 | Agent Safety | agent-safety, ai-safety, tool | E5 / R3 (95%) | 14 |
| ControlArena: A Library for Running AI Control Experiments (Chris Canal, Arathi Mani, Dishank Bansal, Daniel O'Connell) | 2025 | Agent Safety | agent-safety, ai-safety, tool | E5 / R3 (98%) | - |
| Dialz: A Python Toolkit for Steering Vectors (Luis Espinosa-Anke, Liam D. Turner, Zara Siddique) | 2025 | Model Editing | ai-safety, tool, model-editing | E5 / R4 (96%) | 2 |
| DoomArena: A framework for Testing AI Agents Against Evolving Security Threats (Leo Boisvert, Alexandre Drouin, Chandrakiran Reddy Evuru, Quentin Cappart) | 2025 | Agent Safety | agent-safety, ai-safety, tool, safety-evaluation | E6 / R4 (97%) | 19 |
| EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models (Huajun Chen, Ziwen Xu, Kewei Xu, Xinle Deng) | 2025 | Model Editing | ai-safety, tool, model-editing | E6 / R5 (96%) | 7 |
| G-STAR: A Threat Modeling Framework for General-Purpose AI Systems (Prini Kotian, Pulei Xiong, Saeedeh Lohrasbi, Scott Buffett) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E4 / R3 (93%) | - |
| Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3 (Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (98%) | - |
| Jinx: Unlimited LLMs for Probing Alignment Failures (Jiahao Zhao, Liwei Dong) | 2025 | Safety Evaluation | alignment-training, ai-safety, tool, safety-evaluation | E6 / R4 (96%) | - |
| Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models (Di Wu, Jun Zhao, Jiaxiang Liu, Boxuan Xing) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (93%) | - |
| LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents (Dominik Gabi, Wu Zhou, Daniel Song, Alekhya Gampa) | 2025 | Adversarial Robustness | alignment-training, ai-safety, adversarial-robustness, tool | E5 / R4 (98%) | 41 |
| Mechanistic understanding and validation of large AI models with SemanticLens (Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool | E5 / R3 (95%) | 29 |
| NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models (Yuyou Gan, Tianyu Du, Dazhen Deng, Bowen Shi) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, tool | E5 / R3 (95%) | - |
| OET: Optimization-based prompt injection Evaluation Toolkit (Chaowei Xiao, Jinsheng Pan, Xiaogeng Liu) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, tool, safety-evaluation | E6 / R3 (95%) | - |
| Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video (Robert Graham, Sonia Joseph, Sebastian Lapuschkin, Yash Vadi) | 2025 | Mechanistic Interp. | mechanistic-interp, ai-safety, tool, interpretability | E5 / R3 (98%) | 11 |
| Qwen3Guard Technical Report (Wei Liao, Tingyu Xia, Yuxin Zhou, Bowen Yu) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, tool | E5 / R3 (99%) | 37 |
| SGM: A Framework for Building Specification-Guided Moderation Filters (Husrev Taha Sencar, Masoomali Fatehkia, Enes Altinisik) | 2025 | Adversarial Robustness | ai-safety, adversarial-robustness, tool | E4 / R3 (95%) | 2 |
| STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports (Luca Righetti, Jide Alaga, Chris Painter, Tom Reed) | 2025 | Safety Evaluation | ai-safety, tool, safety-evaluation | E4 / R3 (94%) | 5 |
| Scaling Responsible Generative AI: Automating Red Teaming of LLM Applications (Adison Goh, Benjamin Chee, Matteo Vagnoli, Luca Baldassarre) | 2025 | Safety Evaluation | ai-safety, adversarial-robustness, tool, safety-evaluation, red-teaming | E4 / R2 (94%) | - |
Showing 30 of 78 papers on page 1.
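The fast filtering described above can be sketched in a few lines. This is a hypothetical illustration, not the site's actual API: the record layout and the `filter_papers` helper are invented for the example, with sample data taken from the table.

```python
# Minimal sketch of tag/area filtering over paper records like the ones
# listed above. Field names are illustrative, not the site's schema.
papers = [
    {"title": "LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents",
     "year": 2025, "area": "Adversarial Robustness",
     "tags": {"alignment-training", "ai-safety", "adversarial-robustness", "tool"},
     "citations": 41},
    {"title": "Qwen3Guard Technical Report",
     "year": 2025, "area": "Adversarial Robustness",
     "tags": {"ai-safety", "adversarial-robustness", "tool"},
     "citations": 37},
    {"title": "DoomArena: A framework for Testing AI Agents Against Evolving Security Threats",
     "year": 2025, "area": "Agent Safety",
     "tags": {"agent-safety", "ai-safety", "tool", "safety-evaluation"},
     "citations": 19},
]

def filter_papers(records, tag=None, area=None):
    """Keep records matching every given criterion, most-cited first."""
    kept = [p for p in records
            if (tag is None or tag in p["tags"])
            and (area is None or p["area"] == area)]
    # Sort descending by citation count; a fuller version would place
    # papers with no count ("-" in the table) last.
    return sorted(kept, key=lambda p: p["citations"], reverse=True)

for p in filter_papers(papers, tag="adversarial-robustness"):
    print(p["citations"], p["title"])
```

A real index would hold all 78 papers and likely precompute an inverted index from tag to paper IDs, so each filter is a set intersection rather than a linear scan.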