Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
| Paper | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| Safety in Large Reasoning Models: A Survey (Cheng Wang, Baolong Bi, Yingwei Ma, Bryan Hooi) | 2025-04-24 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E6 / R3 (95%) | 53 |
| Security Concerns for Large Language Models: A Survey (Benjamin C. M. Fung, Miles Q. Li) | 2025-05-24 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E5 / R3 (94%) | 29 |
| Survey on the Role of Mechanistic Interpretability in Generative AI (Leonardo Ranaldi) | - | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey, interpretability | E5 / R4 (94%) | 4 |
| The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It (Julia Kreutzer, Stephen H. Bach, Beyza Ermis, Marzieh Fadaee) | 2025-05-30 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (97%) | 10 |
| Towards Integrated Alignment (Ben Y. Reis, William La Cava) | 2025-08-08 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, position | E6 / R4 (93%) | - |
| Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks (Bianka Kowalska, Halina Kwaśnicka) | 2025-11-24 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E5 / R3 (96%) | 1 |
| Unlearning in LLMs: Methods, Evaluation, and Open Challenges (Larry Heck, Tyler Lizzo) | 2026-01-19 | Surveys & Reviews | surveys-reviews, ai-safety, survey, safety-evaluation | E7 / R3 (96%) | - |
| What is AI safety? What do we want it to be? (Cameron Domenico Kirk-Giannini, Jacqueline Harding) | 2025-05-05 | Surveys & Reviews | surveys-reviews, ai-safety, position | E5 / R3 (95%) | 1 |
| A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models (Abulhair Saparov, Ziyu Yao, Shi Feng, Yilun Zhou) | 2024-07-02 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E5 / R3 (95%) | 91 |
| A Primer on the Inner Workings of Transformer-based Language Models (Marta R. Costa-jussà, Javier Ferrando, Arianna Bisazza, Gabriele Sarti) | 2024-04-30 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E7 / R4 (95%) | 80 |
| A Survey on Human Preference Learning for Large Language Models (Muyun Yang, Ruili Jiang, Zhixuan He, Xuefeng Bai) | 2024-06-17 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R3 (95%) | 16 |
| A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly (Kaidi Xu, Yue Zhang, Jinhao Duan, Yifan Yao) | 2023-12-04 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E6 / R3 (95%) | 1022 |
| A Survey on the Honesty of Large Language Models (Taiqiang Wu, Deng Cai, Chufan Shi, Ngai Wong) | 2024-09-27 | Surveys & Reviews | surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R4 (94%) | 18 |
| AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies (Ruoxi Jia, Kevin Klyman, Minzhou Pan, Bo Li) | 2024-06-25 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E8 / R4 (99%) | 48 |
| AI Risk Management Should Incorporate Both Safety and Security (Ruoxi Jia, Jeffrey Ding, Vikash Sehwag, Weijia Shi) | 2024-05-29 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, position | E5 / R4 (97%) | 20 |
| AI Safety in Generative AI Large Language Models: A Survey (Shiyi Yang, Lina Yao, Yun Li, Jaymari Chua) | 2024-07-06 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E6 / R3 (95%) | 38 |
| Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey (Yu Qiao, Zhichen Dong, Chao Yang, Jing Shao) | 2024-02-14 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R3 (98%) | 140 |
| Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond (Shanshan Han) | 2024-10-09 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey, red-teaming | E6 / R3 (92%) | 1 |
| Combating Security and Privacy Issues in the Era of Large Language Models (Muhao Chen, Leon Derczynski, Chaowei Xiao, Huan Sun) | - | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E7 / R4 (94%) | 7 |
| Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions (Kun Zhang, Le Wu, Guangyi Lv, Dacao Zhang) | 2025-06-08 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation | E5 / R3 (95%) | 1 |
| Foundational Challenges in Assuring Alignment and Safety of Large Language Models (Alan Chan, Giulio Corsi, Yejin Choi, Miles Turpin) | 2024-04-15 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E6 / R4 (94%) | 211 |
| From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking (Siyuan Wang, Zhuohan Long, Zhongyu Wei, Zhihao Fan) | 2024-06-21 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation | E7 / R4 (95%) | 22 |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey (Sihao Hu, Tiansheng Huang, Fatih Ilhan, Ling Liu) | 2024-09-26 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E5 / R3 (95%) | 82 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey (Ke Xu, Sibo Yi, Jiaxing Song, Tianshuo Cong) | 2024-07-05 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E8 / R4 (98%) | 220 |
| Large Language Model Safety: A Holistic Survey (Zishan Guo, Tianhao Shen, Deyi Xiong, Yufei Huang) | 2024-12-23 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, adversarial-robustness, survey, interpretability | E7 / R5 (99%) | 47 |
| Laying the Foundations for Vision and Multimodal Mechanistic Interpretability (Sonia Joseph, Yash Vadi, Rob Graham, Praneet Suresh) | - | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | - | - |
| Mapping Technical Safety Research at AI Companies: A literature review and incentives analysis (Oscar Delaney, Oliver Guest, Zoe Williams) | 2024-09-12 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E6 / R3 (94%) | 3 |
| Mechanistic Interpretability for AI Safety — A Review (Leonard Bereska, Efstratios Gavves) | 2024-04-22 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability, safety-evaluation | E5 / R3 (93%) | 335 |
| Mechanistic? (Sarah Wiegreffe, Naomi Saphra) | 2024-10-07 | Surveys & Reviews | surveys-reviews, ai-safety, position, interpretability | E5 / R3 (94%) | 38 |
| Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience (Kevin Nejad, Wes Gurnee, Ilia Sucholutsky, Danyal Akarca) | 2024-08-22 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E5 / R4 (95%) | 7 |