Instant research discovery

Search and browse ingested papers with intelligence signals and fast filtering.

PaperIntel
"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

Fiona Collins, Thilo Hagendorff, Sanaya Parekh, Jonas Kaplan

Year: 2026 | Area: cs.CL | Citations: -

Tags: cscl, alignment-training, ai-safety, preprint

E5 / R4 (95%)
A Context Alignment Pre-processor for Enhancing the Coherence of Human-LLM Dialog

Ding Wei

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
AM3Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Yujin Zhou, Yike Guo, Chengkun Cai, Pengcheng Wen

Year: 2026 | Area: Multimodal Safety | Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness, multimodal-safety

E5 / R3 (97%)
Alignment Makes Language Models Normative, Not Descriptive

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

Year: 2026 | Area: cs.CL | Citations: -

Tags: cscl, alignment-training, ai-safety, preprint

-
Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation

Igor Santos-Grueiro

Year: 2026 | Area: Deception & Failure | Citations: 1

Tags: theoretical, alignment-training, ai-safety, deception-failure, safety-evaluation

E4 / R2 (94%)
Alignment-Aware Model Adaptation via Feedback-Guided Optimization

Gaurav Bhatt, Leonid Sigal, Aditya Chinchure, Jiawei Zhou

Year: 2026 | Area: Alignment Training | Citations: -

Tags: empirical, alignment-training, ai-safety

E5 / R3 (93%)
Alignment-Aware and Reliability-Gated Multimodal Fusion for Unmanned Aerial Vehicle Detection Across Heterogeneous Thermal-Visual Sensors

Muhammad E. H. Chowdhury, Balamurugan Balusamy, M Murugappan, Amith Khandakar

Year: 2026 | Area: cs.CV | Citations: -

Tags: alignment-training, ai-safety, cscv, preprint

E4 / R3 (96%)
Alignment-Process-Outcome: Rethinking How AIs and Humans Collaborate

Anjun Zhu, Arpit Narechania, Haichang Li

Year: 2026 | Area: cs.HC | Citations: -

Tags: alignment-training, ai-safety, cshc, preprint

E5 / R4 (93%)
Are Dilemmas and Conflicts in LLM Alignment Solvable? A View from Priority Graph

Zhenheng Tang, Eunsol Choi, Qian Wang, Xiaowen Chu

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Deepesh Suranjandass, Reshma Ashok, Donnie Winkelmann

Year: 2026 | Area: Deception & Failure | Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (95%)
BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

Guannan Qu, Muhammed Ustaomeroglu

Year: 2026 | Area: Model Editing | Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E4 / R2 (95%)
Best-of-Tails: Bridging Optimism and Pessimism in Inference-Time Alignment

Eric Lei, Chun-Fu Chen, Hsiang Hsu

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

E5 / R3 (96%)
Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Weihong Lin, Lin Sun, Jianfeng Si, Xiangzheng Zhang

Year: 2026 | Area: Adversarial Robustness | Citations: -

Tags: empirical, alignment-training, ai-safety, adversarial-robustness

E5 / R4 (94%)
CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu, Junyuan Hong

Year: 2026 | Area: Model Editing | Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E6 / R4 (93%)
Capability-Oriented Training Induced Alignment Risk

Tian Gao, Zhenwen Liang, Nuno Moniz, Werner Geyer

Year: 2026 | Area: Deception & Failure | Citations: 1

Tags: empirical, alignment-training, ai-safety, deception-failure

E4 / R3 (94%)
Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

Z. Jane Wang, Sunil Kalia, Harvey Lui, Dongrun Zhu

Year: 2026 | Area: cs.CV | Citations: -

Tags: alignment-training, ai-safety, cscv, preprint

E5 / R3 (96%)
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Jiahao Yu, Yimin Wang, Haozheng Luo, Binghui Wang

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Controllable Value Alignment in Large Language Models through Neuron-Level Editing

Richang Hong, Weibiao Huang, Tat-Seng Chua, Le Wu

Year: 2026 | Area: Model Editing | Citations: -

Tags: empirical, alignment-training, ai-safety, model-editing

E5 / R3 (94%)
DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Zenglin Xu, Qifan Wang, Jiaxiang Chen, Mingxi Zou

Year: 2026 | Area: cs.LG | Citations: -

Tags: alignment-training, ai-safety, cslg, preprint

E5 / R3 (93%)
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Ramtin Pedarsani, Bocheng Chen, Guangliang Liu, Zhiyu Xue

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Deconstructing Multimodal Mathematical Reasoning: Towards a Unified Perception-Alignment-Reasoning Paradigm

Zhenwen Liang, Tianyu Yang, Lisen Dai, Xiangliang Zhang

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

E5 / R3 (95%)
Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Gregory N. Frank

Year: 2026 | Area: cs.LG | Citations: -

Tags: alignment-training, ai-safety, cslg, safety-evaluation, preprint

-
Do Machines Fail Like Humans? A Human-Centred Out-of-Distribution Spectrum for Mapping Error Alignment

Binxia Xu, Luke Dickens, Robert M. Mok, Xiaoliang Luo

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

E6 / R4 (95%)
Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

Zhiyuan Feng, Xiaoyuan Yi, Xiaohan Liu, Ceyao Zhang

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Emergent Misalignment is Easy, Narrow Misalignment is Hard

Edward Turner, Senthooran Rajamanoharan, Neel Nanda, Anna Soligo

Year: 2026 | Area: Deception & Failure | Citations: -

Tags: empirical, alignment-training, ai-safety, deception-failure

E5 / R3 (94%)
Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion

D. Frank Hsu, Djallel Bouneffouf, Yuanhong Wu

Year: 2026 | Area: cs.MA | Citations: -

Tags: alignment-training, ai-safety, csma, preprint

E5 / R4 (94%)
Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction

Peng Gang

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM

Jesse Thomason, Mohammad Rostami, Zizhao Hu

Year: 2026 | Area: cs.AI | Citations: -

Tags: alignment-training, ai-safety, csai, preprint

-
Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Shentong Mo, Yibing Song

Year: 2026 | Area: cs.CV | Citations: -

Tags: alignment-training, ai-safety, cscv, preprint

E5 / R3 (98%)
From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Sen Su, Fanyu Meng, Zhenhong Zhou, Zhengshuo Gong

Year: 2026 | Area: Safety Evaluation | Citations: -

Tags: alignment-training, ai-safety, safety-evaluation, benchmark

E5 / R3 (94%)

Showing 30 of 763 papers on page 1.