Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 31-60 of 97 papers (page 2 of 4)

Paper | Intel
Safety in Large Reasoning Models: A Survey

Cheng Wang, Baolong Bi, Yingwei Ma, Bryan Hooi

Published: 2025-04-24 | Area: Surveys & Reviews | Citations: 53

Tags: surveys-reviews, ai-safety, survey

E6 / R3 (95%)
Security Concerns for Large Language Models: A Survey

Benjamin C. M. Fung, Miles Q. Li

Published: 2025-05-24 | Area: Surveys & Reviews | Citations: 29

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E5 / R3 (94%)
Survey on the Role of Mechanistic Interpretability in Generative AI

Leonardo Ranaldi

Published: - | Area: Surveys & Reviews | Citations: 4

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey, interpretability

E5 / R4 (94%)
The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It

Julia Kreutzer, Stephen H. Bach, Beyza Ermis, Marzieh Fadaee

Published: 2025-05-30 | Area: Surveys & Reviews | Citations: 10

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (97%)
Towards Integrated Alignment

Ben Y. Reis, William La Cava

Published: 2025-08-08 | Area: Surveys & Reviews | Citations: -

Tags: alignment-training, surveys-reviews, ai-safety, position

E6 / R4 (93%)
Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks

Bianka Kowalska, Halina Kwaśnicka

Published: 2025-11-24 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, survey, interpretability

E5 / R3 (96%)
Unlearning in LLMs: Methods, Evaluation, and Open Challenges

Larry Heck, Tyler Lizzo

Published: 2026-01-19 | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, safety-evaluation

E7 / R3 (96%)
What is AI safety? What do we want it to be?

Cameron Domenico Kirk-Giannini, Jacqueline Harding

Published: 2025-05-05 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, position

E5 / R3 (95%)
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

Abulhair Saparov, Ziyu Yao, Shi Feng, Yilun Zhou

Published: 2024-07-02 | Area: Surveys & Reviews | Citations: 91

Tags: surveys-reviews, ai-safety, survey, interpretability

E5 / R3 (95%)
A Primer on the Inner Workings of Transformer-based Language Models

Marta R. Costa-jussà, Javier Ferrando, Arianna Bisazza, Gabriele Sarti

Published: 2024-04-30 | Area: Surveys & Reviews | Citations: 80

Tags: surveys-reviews, ai-safety, survey, interpretability

E7 / R4 (95%)
A Survey on Human Preference Learning for Large Language Models

Muyun Yang, Ruili Jiang, Zhixuan He, Xuefeng Bai

Published: 2024-06-17 | Area: Surveys & Reviews | Citations: 16

Tags: alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R3 (95%)
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

Kaidi Xu, Yue Zhang, Jinhao Duan, Yifan Yao

Published: 2023-12-04 | Area: Surveys & Reviews | Citations: 1022

Tags: surveys-reviews, ai-safety, survey

E6 / R3 (95%)
A Survey on the Honesty of Large Language Models

Taiqiang Wu, Deng Cai, Chufan Shi, Ngai Wong

Published: 2024-09-27 | Area: Surveys & Reviews | Citations: 18

Tags: surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R4 (94%)
AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies

Ruoxi Jia, Kevin Klyman, Minzhou Pan, Bo Li

Published: 2024-06-25 | Area: Surveys & Reviews | Citations: 48

Tags: surveys-reviews, ai-safety, survey

E8 / R4 (99%)
AI Risk Management Should Incorporate Both Safety and Security

Ruoxi Jia, Jeffrey Ding, Vikash Sehwag, Weijia Shi

Published: 2024-05-29 | Area: Surveys & Reviews | Citations: 20

Tags: surveys-reviews, ai-safety, adversarial-robustness, position

E5 / R4 (97%)
AI Safety in Generative AI Large Language Models: A Survey

Shiyi Yang, Lina Yao, Yun Li, Jaymari Chua

Published: 2024-07-06 | Area: Surveys & Reviews | Citations: 38

Tags: alignment-training, surveys-reviews, ai-safety, survey

E6 / R3 (95%)
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

Yu Qiao, Zhichen Dong, Chao Yang, Jing Shao

Published: 2024-02-14 | Area: Surveys & Reviews | Citations: 140

Tags: alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R3 (98%)
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond

Shanshan Han

Published: 2024-10-09 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey, red-teaming

E6 / R3 (92%)
Combating Security and Privacy Issues in the Era of Large Language Models

Muhao Chen, Leon Derczynski, Chaowei Xiao, Huan Sun

Published: - | Area: Surveys & Reviews | Citations: 7

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E7 / R4 (94%)
Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions

Kun Zhang, Le Wu, Guangyi Lv, Dacao Zhang

Published: 2025-06-08 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation

E5 / R3 (95%)
Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Alan Chan, Giulio Corsi, Yejin Choi, Miles Turpin

Published: 2024-04-15 | Area: Surveys & Reviews | Citations: 211

Tags: alignment-training, surveys-reviews, ai-safety, survey

E6 / R4 (94%)
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

Siyuan Wang, Zhuohan Long, Zhongyu Wei, Zhihao Fan

Published: 2024-06-21 | Area: Surveys & Reviews | Citations: 22

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation

E7 / R4 (95%)
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Sihao Hu, Tiansheng Huang, Fatih Ilhan, Ling Liu

Published: 2024-09-26 | Area: Surveys & Reviews | Citations: 82

Tags: alignment-training, surveys-reviews, ai-safety, survey

E5 / R3 (95%)
Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Ke Xu, Sibo Yi, Jiaxing Song, Tianshuo Cong

Published: 2024-07-05 | Area: Surveys & Reviews | Citations: 220

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E8 / R4 (98%)
Large Language Model Safety: A Holistic Survey

Zishan Guo, Tianhao Shen, Deyi Xiong, Yufei Huang

Published: 2024-12-23 | Area: Surveys & Reviews | Citations: 47

Tags: alignment-training, surveys-reviews, ai-safety, adversarial-robustness, survey, interpretability

E7 / R5 (99%)
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability

Sonia Joseph, Yash Vadi, Rob Graham, Praneet Suresh

Published: - | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, interpretability

-
Mapping Technical Safety Research at AI Companies: A literature review and incentives analysis

Oscar Delaney, Oliver Guest, Zoe Williams

Published: 2024-09-12 | Area: Surveys & Reviews | Citations: 3

Tags: surveys-reviews, ai-safety, survey, interpretability

E6 / R3 (94%)
Mechanistic Interpretability for AI Safety — A Review

Leonard Bereska, Efstratios Gavves

Published: 2024-04-22 | Area: Surveys & Reviews | Citations: 335

Tags: surveys-reviews, ai-safety, survey, interpretability, safety-evaluation

E5 / R3 (93%)
Mechanistic?

Sarah Wiegreffe, Naomi Saphra

Published: 2024-10-07 | Area: Surveys & Reviews | Citations: 38

Tags: surveys-reviews, ai-safety, position, interpretability

E5 / R3 (94%)
Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

Kevin Nejad, Wes Gurnee, Ilia Sucholutsky, Danyal Akarca

Published: 2024-08-22 | Area: Surveys & Reviews | Citations: 7

Tags: surveys-reviews, ai-safety, survey, interpretability

E5 / R4 (95%)