Instant research discovery
Search and browse ingested papers with intelligence signals and fast filtering.
Showing 1-30 of 97 papers (page 1 of 4)
| Paper | Authors | Published | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|---|
| Position: Capability Control Should be a Separate Goal From Alignment | Eleni Triantafillou, Adrian Weller, Shoaib Ahmed Siddiqui, David Krueger | 2026-02-05 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, position | E6 / R4 (94%) | - |
| A Review of Developmental Interpretability in Large Language Models | Ihor Kendiukhov | 2025-08-19 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E6 / R4 (94%) | - |
| A Survey of LLM Alignment: Instruction Understanding, Intention Reasoning, and Reliable Generation | Qian Li, Ziqin Zhu, Shangguang Wang, Jianxin Li | 2025-02-13 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E6 / R4 (97%) | 2 |
| A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks | Hieu Minh Nguyen | 2025-02-10 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R3 (92%) | 5 |
| A Survey on Data Security in Large Language Models | Kang Chen, Jinhe Su, Yuanhui Yu, Li Shen | 2025-08-04 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E5 / R3 (95%) | 1 |
| A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction | Chengye Wang, Kaixiang Li, Yuyuan Li, Jianwei Yin | 2025-07-26 | Surveys & Reviews | surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R3 (93%) | 2 |
| A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models | Ryan A. Rossi, Keivan Rezaei, Zhiyang Xu, Mohammad Beigi | 2025-02-22 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E7 / R4 (96%) | 20 |
| A Survey on Progress in LLM Alignment from the Perspective of Reward Design | Shoujin Wang, Zhibin Wu, Usman Naseem, Yanqiu Wu | 2025-05-05 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E6 / R4 (95%) | 10 |
| A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of LLMs | Xuansheng Wu, Mengnan Du, Ziyu Yao, Ninghao Liu | 2025-03-07 | Surveys & Reviews | surveys-reviews, ai-safety, survey, safety-evaluation | E5 / R3 (94%) | 34 |
| A Survey on Unlearning in Large Language Models | Ruichen Qiu, Xiao-Shan Gao, Honglin Wang, Fei Sun | 2025-10-29 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (95%) | 1 |
| AI Awareness | Haoyuan Shi, Rongwu Xu, Xiaojian Li, Wei Xu | 2025-04-25 | Surveys & Reviews | surveys-reviews, ai-safety, survey, safety-evaluation | E6 / R5 (97%) | 4 |
| AI Safety for Everyone | Bálint Gyevnar, Atoosa Kasirzadeh | 2025-02-13 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E6 / R3 (96%) | 17 |
| AI Safety vs. AI Security: Demystifying the Distinction and Boundaries | Zhiqiang Lin, Ness Shroff, Huan Sun | 2025-06-21 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E5 / R4 (95%) | 2 |
| An Approach to Technical AGI Safety and Security | Sebastien Krier, Raluca Ada Popa, Vikrant Varma, Rory Greig | 2025-04-02 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, position | E5 / R3 (96%) | 35 |
| Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies | Chenruo Liu, Qi Lei, Kenan Tang, Yao Qin | 2025-05-28 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E7 / R3 (95%) | 1 |
| Bridging the Black Box: A Survey on Mechanistic Interpretability in AI | Amir Rafe, Tausif Islam Chowdhury, Nawaf Alnawmasi, Anandi K. Dutta | - | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability, safety-evaluation | - | - |
| Building Bridges, Not Walls: Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution | Usha Bhalla, Hima Lakkaraju, Shichang Zhang, Tessa Han | 2025-01-31 | Surveys & Reviews | surveys-reviews, ai-safety, position, interpretability | E5 / R3 (97%) | 3 |
| Emerging threats in AI: a detailed review of misuses and risks across modern AI technologies | Niyat Seghid, Khalifa Al-Room, Farkhund Iqbal, Áine MacDermott | - | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (97%) | - |
| Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept Suppression | Zheng Zhang, Ping Liu, Yiwei Xie | 2025-05-26 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (95%) | 3 |
| Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM | Yao Liu, Zhuo Lu, Changjia Zhu, Chi Zhang | 2025-08-07 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (93%) | 5 |
| Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety | Seongmin Lee, ShengYun Peng, Mansi Phute, Duen Horng Chau | 2025-06-05 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (95%) | 5 |
| Jailbreaking LLMs & VLMs: Mechanisms, Evaluation, and Unified Defenses | Chaozhuo Li, Yiming Hei, Xi Zhang, Zejian Chen | 2026-01-07 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey, safety-evaluation | E6 / R4 (97%) | 1 |
| LLM Harms: A Taxonomy and Discussion | David Atkinson, Kevin Chen, Amit Dhurandhar, Saleh Afroogh | 2025-12-05 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (94%) | - |
| LLM in the Middle: A Systematic Review of Threats and Mitigations to Real-World LLM-based Systems | Ulf Lindqvist, Briland Hitaj, Gabriel Antonio Fontes Rebello, Rodrigo Duarte de Meneses | 2025-09-12 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E5 / R3 (95%) | 1 |
| Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions | Usman Naseem | 2026-01-21 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey, interpretability | E6 / R4 (96%) | 1 |
| Multi-level Value Alignment in Agentic AI Systems: Survey and Perspectives | Sirui Zhang, Hui Xiong, Yinuo Shen, Han Wu | 2025-06-11 | Surveys & Reviews | alignment-training, surveys-reviews, ai-safety, survey | E5 / R3 (93%) | 5 |
| Open Problems in Mechanistic Interpretability | David Bau, Jesse Hoogland, Bilal Chughtai, Joseph Bloom | 2025-01-27 | Surveys & Reviews | surveys-reviews, ai-safety, survey, interpretability | E5 / R3 (96%) | 107 |
| Report on NSF Workshop on Science of Safe AI | Hadas Kress-Gazit, Rajeev Alur, Greg Durrett, René Vidal | 2025-06-24 | Surveys & Reviews | surveys-reviews, ai-safety, adversarial-robustness, survey | E6 / R3 (97%) | 1 |
| Representation Engineering for Large-Language Models: Survey and Research Challenges | Lukasz Bartoszcze, Zejia Yang, David Williams-King, Linh Le | 2025-02-24 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E6 / R3 (95%) | 8 |
| Safety at Scale: A Comprehensive Survey of Large Model Safety | Ruoxi Jia, Hanxun Huang, Baoyuan Wu, Cihang Xie | 2025-02-02 | Surveys & Reviews | surveys-reviews, ai-safety, survey | E7 / R3 (99%) | 18 |