
PaperIntel

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Harshil Patel, Parth Shah, Kunal Pai

Year: 2026 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, tool, safety-evaluation

E5 / R3 (93%)

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement

Lei Sha, Junxiao Yang, Xianqi Lei, Leqi Lei

Year: 2025 · Area: Safety Evaluation · Citations: 10

Tags: ai-safety, tool, safety-evaluation

E4 / R3 (96%)

AJAR: Adaptive Jailbreak Architecture for Red-teaming

Wang Yang, Yipu Dou

Year: 2025 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, tool

E6 / R4 (97%)

ASTRAL: Automated Safety Testing of Large Language Models

Miriam Ugarte, Pablo Valle, Sergio Segura, Jose Antonio Parejo

Year: 2025 · Area: Safety Evaluation · Citations: 5

Tags: ai-safety, tool, safety-evaluation

E6 / R3 (96%)

Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments

Daniel Hugenroth, Alastair R. Beresford, Christoph Schnabl, Bill Marino

Year: 2025 · Area: Safety Evaluation · Citations: 4

Tags: ai-safety, tool, safety-evaluation

E5 / R3 (97%)

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Francesco Pinto, Shuang Yang, Zhaorun Chen, Bo Li

Year: 2025 · Area: Safety Evaluation · Citations: 17

Tags: ai-safety, tool, safety-evaluation, red-teaming

E5 / R3 (96%)

Between Generation and Judgment: A Cloud-Native Framework for Adversarial Evaluation of LLM Alignment

Victor Takashi Hayashi, Marcos A. Simplicio, Diego E. G. Caetano De Oliveira, C. Miers

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: alignment-training, ai-safety, adversarial-robustness, tool, safety-evaluation

E5 / R4 (95%)

BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing

Caelin Kaplan, Neil Archibald, Alexander Warnecke

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, tool, safety-evaluation, red-teaming

E7 / R4 (98%)

BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (99%)

BlueGlass: A Framework for Composite AI Safety

Syed Qutub, Kay-Ulrich Scholl, Harshal Nandigramwar

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, tool, safety-evaluation

E5 / R3 (95%)

CCS-Lib: A Python package to elicit latent knowledge from LLMs

Ben W., Eric Mungai Kinuthia, Walter Laurito, Marius Pl

Year: 2025 · Area: Representation Analysis · Citations: -

Tags: representation-analysis, ai-safety, tool

E5 / R3 (99%)

ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models

Minfeng Zhu, Haoxuan Li, Zhen Wen, Yuchen Yang

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (97%)

Contextual Agent Security: A Policy for Every Purpose

Lillian Tsai, Eugene Bagdasarian

Year: 2025 · Area: Agent Safety · Citations: 14

Tags: agent-safety, ai-safety, tool

E5 / R3 (95%)

ControlArena: A Library for Running AI Control Experiments

Chris Canal, Arathi Mani, Dishank Bansal, Daniel O'Connell

Year: 2025 · Area: Agent Safety · Citations: -

Tags: agent-safety, ai-safety, tool

E5 / R3 (98%)

Dialz: A Python Toolkit for Steering Vectors

Luis Espinosa-Anke, Liam D. Turner, Zara Siddique

Year: 2025 · Area: Model Editing · Citations: 2

Tags: ai-safety, tool, model-editing

E5 / R4 (96%)

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Leo Boisvert, Alexandre Drouin, Chandrakiran Reddy Evuru, Quentin Cappart

Year: 2025 · Area: Agent Safety · Citations: 19

Tags: agent-safety, ai-safety, tool, safety-evaluation

E6 / R4 (97%)

EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models

Huajun Chen, Ziwen Xu, Kewei Xu, Xinle Deng

Year: 2025 · Area: Model Editing · Citations: 7

Tags: ai-safety, tool, model-editing

E6 / R5 (96%)

G-STAR: A Threat Modeling Framework for General-Purpose AI Systems

Prini Kotian, Pulei Xiong, Saeedeh Lohrasbi, Scott Buffett

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, tool, safety-evaluation

E4 / R3 (93%)

Gemma Scope 2: Comprehensive Suite of SAEs and Transcoders for Gemma 3

Tom Lieberum, Janos Kramar, Senthooran Rajamanoharan, Callum McDougall

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (98%)

Jinx: Unlimited LLMs for Probing Alignment Failures

Jiahao Zhao, Liwei Dong

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: alignment-training, ai-safety, tool, safety-evaluation

E6 / R4 (96%)

Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models

Di Wu, Jun Zhao, Jiaxiang Liu, Boxuan Xing

Year: 2025 · Area: Mechanistic Interp. · Citations: -

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (93%)

LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents

Dominik Gabi, Wu Zhou, Daniel Song, Alekhya Gampa

Year: 2025 · Area: Adversarial Robustness · Citations: 41

Tags: alignment-training, ai-safety, adversarial-robustness, tool

E5 / R4 (98%)

Mechanistic understanding and validation of large AI models with SemanticLens

Thomas Wiegand, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek

Year: 2025 · Area: Mechanistic Interp. · Citations: 29

Tags: mechanistic-interp, ai-safety, tool

E5 / R3 (95%)

NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

Yuyou Gan, Tianyu Du, Dazhen Deng, Bowen Shi

Year: 2025 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, tool

E5 / R3 (95%)

OET: Optimization-based prompt injection Evaluation Toolkit

Chaowei Xiao, Jinsheng Pan, Xiaogeng Liu

Year: 2025 · Area: Adversarial Robustness · Citations: -

Tags: ai-safety, adversarial-robustness, tool, safety-evaluation

E6 / R3 (95%)

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video

Robert Graham, Sonia Joseph, Sebastian Lapuschkin, Yash Vadi

Year: 2025 · Area: Mechanistic Interp. · Citations: 11

Tags: mechanistic-interp, ai-safety, tool, interpretability

E5 / R3 (98%)

Qwen3Guard Technical Report

Wei Liao, Tingyu Xia, Yuxin Zhou, Bowen Yu

Year: 2025 · Area: Adversarial Robustness · Citations: 37

Tags: ai-safety, adversarial-robustness, tool

E5 / R3 (99%)

SGM: A Framework for Building Specification-Guided Moderation Filters

Husrev Taha Sencar, Masoomali Fatehkia, Enes Altinisik

Year: 2025 · Area: Adversarial Robustness · Citations: 2

Tags: ai-safety, adversarial-robustness, tool

E4 / R3 (95%)

STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports

Luca Righetti, Jide Alaga, Chris Painter, Tom Reed

Year: 2025 · Area: Safety Evaluation · Citations: 5

Tags: ai-safety, tool, safety-evaluation

E4 / R3 (94%)

Scaling Responsible Generative AI: Automating Red Teaming of LLM Applications

Adison Goh, Benjamin Chee, Matteo Vagnoli, Luca Baldassarre

Year: 2025 · Area: Safety Evaluation · Citations: -

Tags: ai-safety, adversarial-robustness, tool, safety-evaluation, red-teaming

E4 / R2 (94%)

Showing 30 of 78 papers on page 1.