Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

Showing 1-30 of 97 papers (page 1 of 4)

Paper | Intel
Position: Capability Control Should be a Separate Goal From Alignment

Eleni Triantafillou, Adrian Weller, Shoaib Ahmed Siddiqui, David Krueger

Published: 2026-02-05 | Area: Surveys & Reviews | Citations: -

Tags: alignment-training, surveys-reviews, ai-safety, position

E6 / R4 (94%)
A Review of Developmental Interpretability in Large Language Models

Ihor Kendiukhov

Published: 2025-08-19 | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, interpretability

E6 / R4 (94%)
A Survey of LLM Alignment: Instruction Understanding, Intention Reasoning, and Reliable Generation

Qian Li, Ziqin Zhu, Shangguang Wang, Jianxin Li

Published: 2025-02-13 | Area: Surveys & Reviews | Citations: 2

Tags: alignment-training, surveys-reviews, ai-safety, survey

E6 / R4 (97%)
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks

Hieu Minh Nguyen

Published: 2025-02-10 | Area: Surveys & Reviews | Citations: 5

Tags: alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R3 (92%)
A Survey on Data Security in Large Language Models

Kang Chen, Jinhe Su, Yuanhui Yu, Li Shen

Published: 2025-08-04 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E5 / R3 (95%)
A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction

Chengye Wang, Kaixiang Li, Yuyuan Li, Jianwei Yin

Published: 2025-07-26 | Area: Surveys & Reviews | Citations: 2

Tags: surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R3 (93%)
A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models

Ryan A. Rossi, Keivan Rezaei, Zhiyang Xu, Mohammad Beigi

Published: 2025-02-22 | Area: Surveys & Reviews | Citations: 20

Tags: surveys-reviews, ai-safety, survey, interpretability

E7 / R4 (96%)
A Survey on Progress in LLM Alignment from the Perspective of Reward Design

Shoujin Wang, Zhibin Wu, Usman Naseem, Yanqiu Wu

Published: 2025-05-05 | Area: Surveys & Reviews | Citations: 10

Tags: alignment-training, surveys-reviews, ai-safety, survey

E6 / R4 (95%)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of LLMs

Xuansheng Wu, Mengnan Du, Ziyu Yao, Ninghao Liu

Published: 2025-03-07 | Area: Surveys & Reviews | Citations: 34

Tags: surveys-reviews, ai-safety, survey, safety-evaluation

E5 / R3 (94%)
A Survey on Unlearning in Large Language Models

Ruichen Qiu, Xiao-Shan Gao, Honglin Wang, Fei Sun

Published: 2025-10-29 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (95%)
AI Awareness

Haoyuan Shi, Rongwu Xu, Xiaojian Li, Wei Xu

Published: 2025-04-25 | Area: Surveys & Reviews | Citations: 4

Tags: surveys-reviews, ai-safety, survey, safety-evaluation

E6 / R5 (97%)
AI Safety for Everyone

Bálint Gyevnar, Atoosa Kasirzadeh

Published: 2025-02-13 | Area: Surveys & Reviews | Citations: 17

Tags: surveys-reviews, ai-safety, survey

E6 / R3 (96%)
AI Safety vs. AI Security: Demystifying the Distinction and Boundaries

Zhiqiang Lin, Ness Shroff, Huan Sun

Published: 2025-06-21 | Area: Surveys & Reviews | Citations: 2

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E5 / R4 (95%)
An Approach to Technical AGI Safety and Security

Sebastien Krier, Raluca Ada Popa, Vikrant Varma, Rory Greig

Published: 2025-04-02 | Area: Surveys & Reviews | Citations: 35

Tags: alignment-training, surveys-reviews, ai-safety, position

E5 / R3 (96%)
Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

Chenruo Liu, Qi Lei, Kenan Tang, Yao Qin

Published: 2025-05-28 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, survey

E7 / R3 (95%)
Bridging the Black Box: A Survey on Mechanistic Interpretability in AI

Amir Rafe, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi K. Dutta

Published: - | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey, interpretability, safety-evaluation

-
Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution

Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han

Published: 2025-01-31 | Area: Surveys & Reviews | Citations: 3

Tags: surveys-reviews, ai-safety, position, interpretability

E5 / R3 (97%)
Emerging threats in AI: a detailed review of misuses and risks across modern AI technologies

Niyat Seghid, Khalifa Al-Room, Farkhund Iqbal, Áine MacDermott

Published: - | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (97%)
Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression

Zheng Zhang, Ping Liu, Yiwei Xie

Published: 2025-05-26 | Area: Surveys & Reviews | Citations: 3

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (95%)
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Yao Liu, Zhuo Lu, Changjia Zhu, Chi Zhang

Published: 2025-08-07 | Area: Surveys & Reviews | Citations: 5

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (93%)
Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Seongmin Lee, ShengYun Peng, Mansi Phute, Duen Horng Chau

Published: 2025-06-05 | Area: Surveys & Reviews | Citations: 5

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (95%)
Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defenses

Chaozhuo Li, Yiming Hei, Xi Zhang, Zejian Chen

Published: 2026-01-07 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation

E6 / R4 (97%)
LLM Harms: A Taxonomy and Discussion

David Atkinson, Kevin Chen, Amit Dhurandhar, Saleh Afroogh

Published: 2025-12-05 | Area: Surveys & Reviews | Citations: -

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (94%)
LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems

Ulf Lindqvist, Briland Hitaj, Gabriel Antonio Fontes Rebello, Rodrigo Duarte de Meneses

Published: 2025-09-12 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, survey

E5 / R3 (95%)
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Usman Naseem

Published: 2026-01-21 | Area: Surveys & Reviews | Citations: 1

Tags: alignment-training, surveys-reviews, ai-safety, survey, interpretability

E6 / R4 (96%)
Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives

Sirui Zhang, Hui Xiong, Yinuo Shen, Han Wu

Published: 2025-06-11 | Area: Surveys & Reviews | Citations: 5

Tags: alignment-training, surveys-reviews, ai-safety, survey

E5 / R3 (93%)
Open Problems in Mechanistic Interpretability

David Bau, Jesse Hoogland, Bilal Chughtai, Joseph Bloom

Published: 2025-01-27 | Area: Surveys & Reviews | Citations: 107

Tags: surveys-reviews, ai-safety, survey, interpretability

E5 / R3 (96%)
Report on NSF Workshop on Science of Safe AI

Hadas Kress-Gazit, Rajeev Alur, Greg Durrett, René Vidal

Published: 2025-06-24 | Area: Surveys & Reviews | Citations: 1

Tags: surveys-reviews, ai-safety, adversarial-robustness, survey

E6 / R3 (97%)
Representation Engineering for Large-Language Models: Survey and Research Challenges

Lukasz Bartoszcze, Zejia Yang, David Williams-King, Linh Le

Published: 2025-02-24 | Area: Surveys & Reviews | Citations: 8

Tags: surveys-reviews, ai-safety, survey

E6 / R3 (95%)
Safety at Scale: A Comprehensive Survey of Large Model Safety

Ruoxi Jia, Hanxun Huang, Baoyuan Wu, Cihang Xie

Published: 2025-02-02 | Area: Surveys & Reviews | Citations: 18

Tags: surveys-reviews, ai-safety, survey

E7 / R3 (99%)