Instant research discovery
Search and browse ingested papers, with intelligence signals and fast filtering on year, area, and tags.
| Paper | Year | Area | Tags | Intel | Citations |
|---|---|---|---|---|---|
| "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior (Fiona Collins, Thilo Hagendorff, Sanaya Parekh, Jonas Kaplan) | 2026 | cs.CL | cscl, alignment-training, ai-safety, preprint | E5 / R4 (95%) | - |
| A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog (Ding Wei) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| AM3Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs (Yujin Zhou, Yike Guo, Chengkun Cai, Pengcheng Wen) | 2026 | Multimodal Safety | empirical, alignment-training, ai-safety, adversarial-robustness, multimodal-safety | E5 / R3 (97%) | - |
| Alignment Makes Language Models Normative, Not Descriptive (Eilam Shapira, Moshe Tennenholtz, Roi Reichart) | 2026 | cs.CL | cscl, alignment-training, ai-safety, preprint | - | - |
| Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation (Igor Santos-Grueiro) | 2026 | Deception & Failure | theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation | E4 / R2 (94%) | 1 |
| Alignment-Aware Model Adaptation via Feedback-Guided Optimization (Gaurav Bhatt, Leonid Sigal, Aditya Chinchure, Jiawei Zhou) | 2026 | Alignment Training | empirical, alignment-training, ai-safety | E5 / R3 (93%) | - |
| Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors (Muhammad E. H. Chowdhury, Balamurugan Balusamy, M Murugappan, Amith Khandakar) | 2026 | cs.CV | alignment-training, ai-safety, cscv, preprint | E4 / R3 (96%) | - |
| Alignment-Process-Outcome: Rethinking How AIs and Humans Collaborate (Anjun Zhu, Arpit Narechania, Haichang Li) | 2026 | cs.HC | alignment-training, ai-safety, cshc, preprint | E5 / R4 (93%) | - |
| Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph (Zhenheng Tang, Eunsol Choi, Qian Wang, Xiaowen Chu) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning (Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann) | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (95%) | - |
| BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features (Guannan Qu, Muhammed Ustaomeroglu) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E4 / R2 (95%) | - |
| Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment (Eric Lei, Chun-Fu Chen, Hsiang Hsu) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | E5 / R3 (96%) | - |
| Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought (Weihong Lin, Lin Sun, Jianfeng Si, Xiangzheng Zhang) | 2026 | Adversarial Robustness | empirical, alignment-training, ai-safety, adversarial-robustness | E5 / R4 (94%) | - |
| CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment (Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu, Junyuan Hong) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E6 / R4 (93%) | - |
| Capability-Oriented Training Induced Alignment Risk (Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer) | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E4 / R3 (94%) | 1 |
| Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations (Z. Jane Wang, Sunil Kalia, Harvey Lui, Dongrun Zhu) | 2026 | cs.CV | alignment-training, ai-safety, cscv, preprint | E5 / R3 (96%) | - |
| Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations (Jiahao Yu, Yimin Wang, Haozheng Luo, Binghui Wang) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Controllable Value Alignment in Large Language Models through Neuron-Level Editing (Richang Hong, Weibiao Huang, Tat-Seng Chua, Le Wu) | 2026 | Model Editing | empirical, alignment-training, ai-safety, model-editing | E5 / R3 (94%) | - |
| DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding (Zenglin Xu, Qifan Wang, Jiaxiang Chen, Mingxi Zou) | 2026 | cs.LG | alignment-training, ai-safety, cslg, preprint | E5 / R3 (93%) | - |
| Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment (Ramtin Pedarsani, Bocheng Chen, Guangliang Liu, Zhiyu Xue) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm (Zhenwen Liang, Tianyu Yang, Lisen Dai, Xiangliang Zhang) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | E5 / R3 (95%) | - |
| Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails (Gregory N. Frank) | 2026 | cs.LG | alignment-training, ai-safety, cslg, safety-evaluation, preprint | - | - |
| Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment (Binxia Xu, Luke Dickens, Robert M. Mok, Xiaoliang Luo) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | E6 / R4 (95%) | - |
| Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning (Zhiyuan Feng, Xiaoyuan Yi, Xiaohan Liu, Ceyao Zhang) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Emergent Misalignment is Easy, Narrow Misalignment is Hard (Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo) | 2026 | Deception & Failure | empirical, alignment-training, ai-safety, deception-failure | E5 / R3 (94%) | - |
| Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion (D. Frank Hsu, Djallel Bouneffouf, Yuanhong Wu) | 2026 | cs.MA | alignment-training, ai-safety, csma, preprint | E5 / R4 (94%) | - |
| Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction (Peng Gang) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM (Jesse Thomason, Mohammad Rostami, Zizhao Hu) | 2026 | cs.AI | alignment-training, ai-safety, csai, preprint | - | - |
| Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows (Shentong Mo, Yibing Song) | 2026 | cs.CV | alignment-training, ai-safety, cscv, preprint | E5 / R3 (98%) | - |
| From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents (Sen Su, Fanyu Meng, Zhenhong Zhou, Zhengshuo Gong) | 2026 | Safety Evaluation | alignment-training, ai-safety, safety-evaluation, benchmark | E5 / R3 (94%) | - |
Showing 30 of 763 papers on page 1.