Paper deep dive
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji, Yanqiu Wu, Zhibin Wu, Shoujin Wang, Jian Yang, Mark Dras, Usman Naseem
Abstract
Reward design plays a pivotal role in aligning large language models (LLMs) with human values, serving as the bridge between feedback signals and model optimization. This survey provides a structured organization of reward modeling and addresses three key aspects: mathematical formulation, construction practices, and interaction with optimization paradigms. Building on this, it develops a macro-level taxonomy that characterizes reward mechanisms along complementary dimensions, thereby offering both conceptual clarity and practical guidance for alignment research. The progression of LLM alignment can be understood as a continuous refinement of reward design strategies, with recent developments highlighting paradigm shifts from reinforcement learning (RL)-based to RL-free optimization and from single-task to multi-objective and complex settings.
Tags
Links
- Source: https://arxiv.org/abs/2505.02666
- Canonical: https://arxiv.org/abs/2505.02666
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/12/2026, 5:48:36 PM
Summary
This survey provides a comprehensive analysis of reward design in LLM alignment, categorizing reward modeling into a structured taxonomy based on mathematical formulation, construction practices, and optimization paradigms. It highlights the evolution from RL-based to RL-free methods and from single-task to multi-objective settings, positioning reward design as the central mechanism for bridging human values with model optimization.
Entities (6)
Relation Signals (4)
RLHF → utilizes → Reward Model
confidence 98% · RLHF enables the incorporation of human preferences into model training by using a reward model (RM) to guide reinforcement learning (RL) optimization.
Reward Design → facilitates → LLM Alignment
confidence 95% · Reward design plays a pivotal role in aligning large language models (LLMs) with human values.
Reward Model → guides → PPO
confidence 95% · These scores are then used to optimize the base LLM via algorithms such as Proximal Policy Optimization (PPO).
DPO → is alternative to → RLHF
confidence 90% · The progression of LLM alignment can be understood as a continuous refinement of reward design strategies, with recent developments highlighting paradigm shifts from reinforcement learning (RL)-based to RL-free optimization.
Cypher Suggestions (2)
Find all alignment algorithms that utilize a reward model. · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:UTILIZES]->(rm:Component {name: 'Reward Model'}) RETURN a.name
Map the relationship between methodologies and the research field of LLM alignment. · confidence 85% · unvalidated
MATCH (m:Methodology)-[:FACILITATES]->(f:Field {name: 'LLM Alignment'}) RETURN m.name
Full Text
192,277 characters extracted from source content.
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji 1,2, Yanqiu Wu 1, Zhibin Wu 2, Shoujin Wang 3, Jian Yang 1, Mark Dras 1, Usman Naseem 1*
1 School of Computing, Macquarie University, 4 Research Park Drive, Sydney, 2109, NSW, Australia. 2 Business School, Sichuan University, No. 29, Wangjiang Road, Chengdu, 610065, Sichuan, China. 3 Data Science Institute, University of Technology Sydney, 15 Broadway, Sydney, 2007, NSW, Australia. *Corresponding author(s). E-mail(s): usman.naseem@mq.edu.au; Contributing authors: jimiaomiao@stu.scu.edu.cn; yanqiu.wu@mq.edu.au; zhibinwu@scu.edu.cn; shoujin.wang@uts.edu.au; jian.yang@mq.edu.au; mark.dras@mq.edu.au
Abstract
Reward design plays a pivotal role in aligning large language models (LLMs) with human values, serving as the bridge between feedback signals and model optimization. This survey provides a structured organization of reward modeling and addresses three key aspects: mathematical formulation, construction practices, and interaction with optimization paradigms. Building on this, it develops a macro-level taxonomy that characterizes reward mechanisms along complementary dimensions, thereby offering both conceptual clarity and practical guidance for alignment research. The progression of LLM alignment can be understood as a continuous refinement of reward design strategies, with recent developments highlighting paradigm shifts from reinforcement learning (RL)-based to RL-free optimization and from single-task to multi-objective and complex settings.
Keywords: Large language model alignment, Reward design, Preference learning, Human feedback
arXiv:2505.02666v2 [cs.CL] 29 Aug 2025
1 Introduction
1.1 Challenges faced by LLMs and LLM alignment
In recent years, large language models (LLMs) such as GPT-4 [1], Claude [2], and Gemini [3] have demonstrated remarkable capabilities across a wide range of natural language understanding and generation tasks.
These systems, built upon transformer architectures [4] and trained on massive corpora, exhibit strong performance in zero-shot and few-shot learning, enabling a wide range of applications, from education and healthcare to programming and research assistance [5, 6]. However, as LLMs continue to gain influence, there is an urgent need to ensure that these models behave in ways that are beneficial, safe, and aligned with human intentions. LLMs are now expected to follow the principles of being helpful, harmless, and honest (HHH) [7]. Despite their impressive capabilities, real-world deployments of LLMs frequently suffer from a number of critical challenges: factual inaccuracies [8], harmful or toxic outputs [9], persistent biases [10], and unpredictable behavior in dynamic contexts [5]. For instance, LLMs often generate hallucinated or incorrect information, which seriously limits their use in high-stakes domains such as healthcare or law [8]. They may also produce toxic, offensive, or harmful content due to biases present in the training data or insufficient filtering mechanisms [9]. In addition, persistent bias and fairness issues arise in the form of gender, racial, or cultural stereotyping, often reflecting societal inequalities embedded in the pretraining corpora [10]. Ethical concerns also intensify in cross-cultural deployments, where differing moral standards complicate the alignment of model behavior with human values [11]. Furthermore, controllability and interpretability remain limited, making it difficult for users to understand or guide the model's decision-making process in a predictable manner [5]. Finally, privacy and security risks are increasingly pressing, as models may memorize and disclose sensitive user data and are vulnerable to prompt injection and adversarial attacks [12]. These issues not only undermine trust but also highlight the broader difficulty of aligning such models with complex, diverse, and evolving human values.
To address these concerns, the alignment of LLMs has become a cornerstone of safe and responsible AI research. Alignment, in the context of artificial intelligence, refers to the extent to which a model's behavior reflects human values, goals, and ethical norms [11]. Yet achieving alignment in practice requires not only model-level interventions but also a robust, scalable methodology for systematically incorporating human preferences into the optimization process. Among various alignment techniques, Reinforcement Learning from Human Feedback (RLHF) has emerged as one of the most widely adopted paradigms [13, 14]. RLHF enables the incorporation of human preferences into model training by using a reward model (RM) to guide reinforcement learning (RL) optimization. An RM is an optimization target or evaluation signal during training that quantifies the alignment between model outputs and human preferences or alignment objectives. Rather than relying solely on supervised fine-tuning (SFT), RLHF leverages ranked or pairwise human feedback to train an RM, which in turn evaluates and ranks candidate outputs. These scores are then used to optimize the base LLM via algorithms such as Proximal Policy Optimization (PPO) [15]. Although RLHF has achieved notable progress, its effectiveness is still constrained by several fundamental limitations. One major issue is the subjectivity and inconsistency of human feedback. Annotators may provide conflicting judgments due to personal beliefs or differing interpretations of context, which introduces noise into the reward modeling process and destabilizes optimization [14]. Another challenge is the scarcity and high cost of high-quality feedback, particularly in expert-driven domains such as medicine and law [16]. In addition, current alignment methods are prone to mode collapse, where models tend to produce repetitive and homogeneous outputs as a result of over-optimization on narrow reward signals [17].
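The pipeline sketched above — pairwise human preferences train an RM, whose scores then drive PPO — rests on a simple pairwise objective. As a hedged illustration (this is standard Bradley–Terry RM training, not code from the survey; all scores below are made-up toy values):

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected). The loss shrinks as the RM
    scores the human-preferred response further above the other."""
    margin = r_chosen - r_rejected
    # log1p(exp(-m)) is exactly -log(sigmoid(m))
    return math.log1p(math.exp(-margin))

# Toy scalar reward scores (illustrative only).
loss_agree = pairwise_rm_loss(2.0, -0.5)    # RM agrees with the annotator
loss_disagree = pairwise_rm_loss(0.3, 0.9)  # RM disagrees -> larger loss
```

In full RLHF pipelines the trained RM is then frozen, and its scalar scores, typically combined with a KL penalty toward the SFT model, supply the reward signal for PPO updates of the policy.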
Related to this is the issue of reward hacking, where models learn to exploit flaws in the reward function to artificially maximize scores without actually aligning with the intended human goals [18]. Finally, training instability remains a common problem in RL settings, which often struggle with convergence due to high-dimensional action spaces and non-stationary reward landscapes [13]. Together, these challenges reveal the limitations of current alignment strategies and point to the need for more resilient, adaptable, and principled approaches. Within this context, reward design has become an essential component of LLM alignment. Although LLMs exhibit powerful generative capabilities, they fundamentally depend on external signals to guide their behavior. At the same time, RMs are themselves imperfect, and their limitations can unintentionally amplify undesirable behaviors in LLMs. As a result, effective and well-structured reward design is critical for ensuring that model outputs remain aligned with human values, intentions, and expectations.
1.2 Reward modeling: A central solution to alignment challenges
Reward design plays a pivotal role in bridging the gap between raw model capabilities and meaningful alignment with human values, acting as the central connective tissue in the LLM alignment pipeline. Situated between feedback collection and optimization, it serves not only as a transformation layer that distills raw or subjective human feedback into actionable signals, but also as a high-leverage point for encoding abstract human preferences, safety considerations, and societal norms into quantifiable objectives [19]. In doing so, it links human intentions with the optimization objectives of LLMs, effectively shaping model behavior in complex, high-dimensional environments where direct supervision is often infeasible [14, 20]. At the heart of this process lies the design and training of the RM, which operationalizes alignment through the reward function.
This function determines which behaviors are encouraged or penalized, thus playing a foundational role in ensuring that LLM outputs reflect desirable and aligned behavior. A well-crafted reward function can mitigate the ambiguity and inconsistency inherent in human feedback by synthesizing diverse data sources and aligning them with high-level goals in a form amenable to optimization. Despite its central role in the alignment pipeline, reward design remains a fundamentally open and challenging problem. RMs, as surrogate objectives, are designed to approximate the true preferences and expectations of human users. However, the inherent complexity and subjectivity of human values, coupled with the diversity of tasks and application scenarios, make it nearly impossible to construct a reward function that fully captures these nuanced preferences. Poorly constructed RMs often lead to unintended and misaligned behaviors, a phenomenon commonly referred to as reward misspecification [21]. This misspecification arises from the discrepancy between the designed reward signal and the users' latent preferences, introducing significant risks during model training. When optimization is guided by imperfect or distorted RMs, the model may inadvertently exploit spurious correlations or overfit to proxy metrics, thereby deviating from genuine human intent [18]. Such misalignments can result in unforeseen consequences, where models achieve high scores on evaluation metrics yet fail to deliver outputs that truly satisfy user needs. The situation becomes even more critical when these deviations are obscured by existing benchmarks, creating a misleading illusion of optimal performance while masking substantial alignment failures. Moreover, RMs frequently struggle with generalization across domains, leading to distributional shifts that undermine alignment robustness in real-world deployment scenarios [22].
Learned RMs are also susceptible to inaccuracies in preference data and the introduction of annotator biases, which complicate efforts to ensure fairness, robustness, and interpretability. These challenges are further exacerbated in emerging contexts involving multi-modal inputs [23], multi-turn interactions [24], and multi-task generalization [25], where human values are heterogeneous, dynamic, and often underspecified. As such, reward design must move beyond static, hard-coded rules toward data-driven models that dynamically adapt to evolving user needs and social contexts. Ultimately, reward design is not just a technical component of RLHF, but the linchpin. It directly influences the behavior of downstream optimization algorithms, whether based on RL, supervised preference modeling, or in-context adaptation, by defining what constitutes a better output in any given context. To fulfill this role effectively, reward design must strike a careful balance among expressiveness, interpretability, robustness, and generalizability. More than just ensuring behavioral mimicry, it enables principled generalization from finite human data to diverse, open-ended real-world scenarios. In anchoring alignment research in a direction that is both scientifically rigorous and socially responsive, reward design embodies the theoretical foundations and practical mechanisms essential for aligning LLMs with human-centered goals.
Motivation: Although the alignment of LLMs has seen remarkable progress, there remains a critical gap in the literature: a comprehensive and systematic investigation into the central role of reward design within alignment paradigms is still missing.
Most existing studies approach alignment from a general perspective [26–29], often treating RMs as supplementary components rather than foundational mechanisms. While some recent efforts [30] have started to examine the taxonomy and challenges of reward modeling, these works remain largely descriptive, focusing on existing system architectures or application scenarios without capturing the methodological evolution of reward design itself. In contrast, this review takes a fundamentally different approach: we position reward design as a methodological paradigm that both reflects and drives the evolution of LLM alignment techniques. Unlike conventional reward specification in traditional machine learning, reward design in the context of LLM alignment involves fundamentally different considerations, including how alignment objectives are defined, how feedback signals are represented, and how reward signals are incorporated into optimization procedures. As alignment techniques continue to evolve, these considerations have become critical to ensuring both performance and robustness. Moreover, many of the core challenges in aligning LLMs can often be traced back to the strategies and mechanisms used in reward modeling. In this context, reward design not only serves as a guiding mechanism for model behavior, but also provides a valuable analytical lens for understanding the trajectory of alignment research. Analyzing the evolution of reward design allows for a more systematic interpretation of paradigm shifts in alignment strategies and facilitates the identification of emerging methodological trends.
Contributions: Our contributions are summarized as follows.
• Structured organization of reward modeling: Based on existing research, this survey organizes the space of reward modeling for LLM alignment along multiple dimensions.
It highlights three foundational aspects: (i) mathematical formulation of reward models, (ii) construction practices, and (iii) interactions with optimization paradigms. Building on this, a macro-level taxonomy is presented that categorizes reward mechanisms into rule-based, data-driven, and hybrid approaches, as well as numerical vs. non-numerical and explicit vs. implicit types (see Figure 1). This organization provides conceptual clarity and practical guidance for analyzing, comparing, and applying reward modeling methods.
• Comprehensive analysis of hybrid reward design: Existing studies on hybrid reward mechanisms are systematically reviewed, with emphasis on the diverse sources of reward signals and the strategies employed for their integration. This analysis consolidates fragmented approaches into a coherent design space, offering methodological foundations for managing preference diversity, resolving conflicting objectives, and enabling alignment in complex real-world scenarios.
• Synthesis of paradigm shifts in reward modeling for LLM alignment: The evolution of reward modeling is examined, covering the shift from RL-based to RL-free methods and from single-task to multi-task and multi-modal contexts. Driven by practical needs such as concurrent objective optimization, cross-modal consistency, and heterogeneous preferences, these shifts have stimulated continuous innovation in reward mechanism design. Treating the reward function as a lever for improving alignment efficiency highlights a promising direction for advancing LLM performance.
The remainder of this paper is organized as follows. Section 2 presents a diagnosis–prescription–treatment–inspired conceptual framework for LLM alignment, visually capturing the key processes and essential components involved, with a particular focus on reward modeling as a central solution to alignment challenges. Sections 3–5 then address, respectively, the mathematical formulation of reward models, their methodological construction in practice, and their functional roles under different optimization paradigms.
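The shift from RL-based to RL-free methods highlighted above can be made concrete with DPO's implicit reward. As a hedged sketch (the beta value and all log-probabilities below are illustrative assumptions, not values from the survey), DPO defines the reward of a response as beta times the log-ratio of policy to reference probability and reuses the same Bradley–Terry pairwise form, so no separate RM training or RL loop is needed:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO pairwise loss (Rafailov et al., 2023) for one preference pair.
    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); the loss is -log sigmoid(r_w - r_l)."""
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward, preferred response
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward, dispreferred one
    return math.log1p(math.exp(-(r_w - r_l)))  # == -log(sigmoid(r_w - r_l))

# Toy sequence log-probabilities (illustrative only).
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0,
                ref_logp_w=-14.0, ref_logp_l=-13.0)
```

Minimizing this loss directly raises the policy's relative likelihood of preferred responses, which is why the survey treats DPO-style methods as a form of reward design with an implicit RM rather than as reward-free optimization.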
Sections 3–5 5 RL-based methodRL-based method r1 r2 r1 r2 r1 r2 r1 r2 HumanHuman Generative RM from AI Feedback Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF)RL from Human Feedback (RLHF) RL from AI Feedback (RLAIF) Discriminative RM from Human Feedback Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy r1 r2 r1 r2 Human Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF) RL from AI Feedback (RLAIF) Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy Responses Preference Preference l y l y w y l y w y Prompt x Prompt x r1 r2 r1 r2 Human Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF) RL from AI Feedback (RLAIF) Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy Responses Preference Preference l y w y Prompt x SFT modelSFT model Implicit RM SFT model Implicit RM SFT modelSFT model Explicit RM Base LLMBase LLM Reforcement learning training loop SFT model Explicit RM Base LLM Reforcement learning training loop Final LLM l y l y w y l y w y l y l y w y l y w y Prompt x Prompt x Prompt x Prompt x Responses Responses Maxmum likelihood Maxmum likelihood PPO-based paradigm for LLM alignmentPPO-based paradigm for LLM alignment DPO-based paradigm for LLM alignmentDPO-based paradigm for LLM alignment Reward design in LLM alignmetn Reward design in LLM alignmetn FeedbackFeedback Reforcement learningReforcement learning OptimizationOptimization Explicit RM Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. General RM Reward design in LLM alignmetn Feedback Reforcement learning Optimization Explicit RM 1. Rule-based RM/ Learned RM 2. 
Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. General RM Prompt x Prompt x ResponsesResponses y Responses y Collect human demonstration data Prompt x Prompt x SFT modelSFT model l y l y w y l y w y Collect human prerference data l y w y lw y Base LLMBase LLMSFT modelSFT model Supervised Fine-tune Base LLMSFT model Supervised Fine-tune Base LLMBase LLMRMRM Supervised Fine-tune Base LLMRM Supervised Fine-tune Step2 training reward modelStep1 training Supervised Fine-tuning model Step3 training Policy using PPO Prompt x Prompt x ResponsesResponses y Responses y RM LM policy PPO Prompt x Prompt x Implicit RM Reward modelReward model Rules Discriminative RMDiscriminative RM Generative Reward Implicit RMImplicit RM RL-based training (PPO) RL-free training(DPO) Learned RM Learned RM Rule-based RMRule-based RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM ...... Discriminative RMDiscriminative RM Trend of reward design in LLM alignmetn Trend of reward design in LLM alignmetn Construction BasisConstruction Basis RrepresentationRrepresentation GranularityGranularity Explicit RM Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. Trend of reward design in LLM alignmetn Construction Basis Rrepresentation Granularity Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. 
Learned RM Implicit RMImplicit RM RL-based training (PPO) RL-free training(DPO) Discriminative RMDiscriminative RM GranularityGranularity Trends in Reward Design for LLM Alignment Trends in Reward Design for LLM Alignment Learned RM Learned RMRule-based RMRule-based RM Learned RMRule-based RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM 202220232024202220232024 Construction Basis:From Rule-based to Learned Reward Models, Format:From numerical Reward to non- numerical Reward Models,Expression: From Explicit Reward to Implicit Reward Models,Granularity: From general Reward to fine-grained Reward Models Rule-based RMRule-based RMData-driven RMData-driven RMHybrid RMHybrid RMRule-based RMData-driven RMHybrid RM Format Non-numerical Reward Models Non-numerical Reward Models Format Non-numerical Reward Models Explicit Reward Explicit Reward Expression Implicit Reward Implicit Reward General RM General RM Granularity Fine-grained RMFine-grained RM Construction Basis Language Model Outputs High-quality Outputs High-quality Outputs Reward modelReward model Language Model Outputs High-quality Outputs Reward model TraingTraing TraingTraing Reward design Rule-based RM/ Data-driven RM Numerical RM / Non-numerical RM / Feedback Optimization Application Chanllenges LLM alignment Rule-based RM / Data-driven RM Reward designReward design Rule-based RM Data-driven RM Explicit RM Implicit RM General RM Fine-grained RM ➢Discriminative RM/ Generative RM ➢Response- Level RM / Token- Level RM ➢Multi-objective RM Construction Basis Construction Basis Construction Basis Format Feedback chanllenge1chanllenge2chanllenge3chanllenge4 chanllenge5 Pairwise feedback Listwise feedback Binary feedback Preference feedback RL-free method LLM alignment LLM alignment ApplicationApplication OptimizationOptimization RepresentationRepresentation 
GranularityGranularity Numerical RM Non-numerical RM Multi-task Multi-model Simple scene Multi-task Multi-model Simple scene Numerical RM /Non-numerical RM General RM/ Fine-grained RM Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise Explicit RM/ Implicit RM Format Representation Granularity LLM alignment Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking TaxnomyTrend (a) Reward design Feedback Optimization Application Chanllenges LLM alignment Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Trend Construction Basis Format Representation Granularity Taxnomy Construction Basis Format Representation Granularity Taxnomy Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/ Fine-grained RM Explicit RM/ Implicit RM Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/ Fine-grained RM Explicit RM/ Implicit RM Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise LLM alignment LLM alignment OptimizationOptimization ApplicationApplication Reward designReward design Feedback Chanllenges Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Chanllenges Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Construction Basis Format Representation Granularity Taxnomy Construction Basis Format Representation Granularity Taxnomy Binary /Preference Pairwise / Listwise Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/Fine-grained RM Explicit RM/Implicit RM Multi-task/Multi-modal 
scenes Multi-task/Multi-modal scenes Trend RL-based method/RL-free method Check PatientPatient Large language model Outputs FeedbackFeedback Outputs Reward model Reward model Reinforecement learning Large language model Large language model Human annotaor Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs FeedbackFeedback Human annotaor Large language model Outputs Feedback Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs FeedbackFeedback Outputs Reward model Reward model Reinforecement learning Large language model Large language model Human annotaor Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs FeedbackFeedback Human annotaor Large language model Outputs Feedback Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization v InputInput Output Feedback Reward design Optimization Treatment Dignosis Check Prescription Subjectivity and instability of human feedback Scarcity and high cost of human feedbackScarcity and high cost of human feedback Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/Fine-grained RM Explicit RM/Implicit RM Supervised learning Numerical feedback/Ranked feedback/Textual feedback In-Context 
learning Mode collapse Reward hacking Training instability Large languge model DoctorDoctor LLM alignmentLLM alignment PatientPatient Symptoms Human feedback/ AI feedback Reinforcement learning General RM/Fine-grained RM InputInputLarge languge model OutputOutput OutputOutput OutputOutput FeedbackFeedback Large language model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Large language model Outputs FeedbackFeedback Human annotaor Large language model Large language model Reward model OutputsOutputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Human annotaor Large language model Reward model Outputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Human annotaor Large language model Reward model Outputs Feedback (a) Explicit Reward model (b) Implicit reward model (a) Explicit Reward model (b) Implicit reward model Large language model Large language model Outputs Reinforecement learning Human annotator Supervised learning OptimizationOptimization Large language model Large language model Outputs FeedbackFeedback Human annotator Large language model Large language model Reward model OutputsOutputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotator Supervised learning OptimizationOptimization Large language model Outputs Feedback Human annotator Large language model Reward model Outputs Feedback Improved PPO Substituted PPO Improved PPO Substituted PPO DPO and its variantsDPO and its variants Other methods InputInput Output 
Feedback Reward design Optimization TreatmentDignosis Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM Explicit RM/Implicit RM RL-based method Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback RL-free method Ethics & Values Large languge model DoctorDoctor LLM alignmentLLM alignment v Human feedback/ AI feedback Bias & FairnessBias & Fairness Safety & HarmfulnessSafety & Harmfulness Factuality & ReliabilityFactuality & Reliability Controllability & InterpretabilityControllability & Interpretability Privacy & SecurityPrivacy & Security HarmlessHarmless HonestHonest Helpful Helpful Virus Threats Patient Therapeutic dilemmas Reward hacking Mode collapse Scarcity and high cost of human feedback Subjectivity and instability of human feedback Training instability PrescriptionPrescription Clinical targets Medical test Input Output Feedback Reward design Optimization TreatmentDignosis Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM Explicit RM/Implicit RM RL-based method Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback RL-free method Ethics & Values Large languge model Doctor LLM alignment v Human feedback/ AI feedback Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Virus Threats Patient Therapeutic dilemmas Reward hacking Mode collapse Scarcity and high cost of human feedback Subjectivity and instability of human feedback Training instability Prescription Clinical targets Medical test Large language model Large language model Response Human annotator Large language model Large language model Reward model ResponseResponse Feedback Query Query Demonstration Large language model Response Human annotator Large language model Reward model Response Feedback Query Query Demonstration Large language model Response Human annotator Large language model Reward model Response Feedback Query Query Demonstration (b) 
In-context learning Evaluation Curation Optimization Response FeedbackFeedback Human annotator Large language model Large language model Query Response Feedback Human annotator Large language model Query (c) DPO Reward model z Response Feedback Human annotator Large language model Query (c) DPO Reward model z Training objective Large language model Large language model Response Reinforecement learning Human annotator Optimization Large language model Large language model Reward model ResponseResponse Feedback Query (a) Reinforecement learning Large language model Response Reinforecement learning Human annotator Optimization Large language model Reward model Response Feedback Query (a) Reinforecement learning Input Output Feedback Reward design Optimization Rule-based RM/ Data-driven RM/ Hybrid RM Explicit RM/ Implicit RM RL-based optimization Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback SL-based optimization Ethics & Values Large languge model Doctor LLM alignment v Human feedback/ AI feedback Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Prescription Clinical targets Medical test Virus Threats ICL-based optimization Numerical RM /Non-numerical RM TreatmentTreatment DignosisDignosis Input Output Feedback Reward design Optimization Ethics & Values Large languge model Doctor LLM alignment v Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Clinical targets Medical test Symptoms TreatmentTreatment DignosisDignosis PrescriptionPrescription Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability How does the reward model function? 
Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function in RL, ICL, and SL? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical reward modeling Nonnumerical reward modeling Rule-based reward modeling Data-driven reward modeling Hybrid reward modeling Explicit reward modeling Implicit reward modeling Standard SL-based LLM alignment DPO and its variants RL-based LLM alignment ICL-based LLM alignment How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Preference learning Classification / regression models Inverse reinforcement learning E.g., DPO (Rafailov et al., 2023), TDPO (Zeng et al., 2023), Step-DPO (LAi et al., 2024) E.g., Supervised Fine-tuning (Ouyang et al., 2022), Instruction Tuning (Wei et al., 2021), Constitutional AI (Bai et al., 2022a) E.g., Chain-of-Thought Feedback (Wei et al., 2022), Meta-RM for ICL (Liu et al., 2024), Rewarded In-Context Learning (Chen et al., 2024) E.g., InstructGPT (Ouyang et al., 2022), InstructGPT(), KTO (Yang et al., 2023) E.g., Inverse RL Alignment (Sun & van der Schaar, 2024), Variational IRL for LLM (Cai et al., 2024), Dynamic Reward Scaling IRL (Cheng et al., 2025) E.g., RLAIF (Bai et al., 2022b), Direct-RLAIF (Lee et al., 2024a), UltraFeedback (Cui et al., 2024) E.g., Reward Regression (Christiano et al., 2017), Feedback Regression (Ziegler et al., 2019), Score-based Reward Models (Kadavath et al., 2022) E.g., TLRM (Wu et al., 2023), LiPO (Lu et al., 2024), Variational RM (Cai et al., 2024) E.g., Contrastive Reward Modeling (Yuan et al., 2023), Ordinal Reward Modeling (Zheng et al., 2024), Ranking-based RM (Kim et al., 2024) E.g., Prompt-based RM (Ouyang et al., 
2022), Heuristic RM (Stiennon et al., 2020), Symbolic Reward Templates (Zhang et al., 2023) E.g., Adaptive Hybrid RM (Kim et al., 2024), Gating-based Hybrid RM (Wang et al., 2024), Rule-guided Preference Learning (Chen et al., 2024) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + token-level)Multi-granularity RM (e.g., response-level + token-level) Single-value reward modeling / multi-value reward modelingSingle-value reward modeling / multi-value reward modeling Response-level reward modeling / token-level reward modelingResponse-level reward modeling / token-level reward modeling Pairwise preference modeling / listwise preference modelingPairwise preference modeling / listwise preference modeling Pointwise reward modeling / preferencewise modelingPointwise reward modeling / preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Supervised learning (e.g., classification / regression, preference learning) Supervised learning (e.g., classification / regression, preference learning) How is the reward model constructed 
in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Nonnumerical RM Rule-based RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Reward model for LLM alignment Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Single-value reward modeling and multi-value reward modelingSingle-value reward modeling and multi-value reward modeling Response-level reward modeling and token-level reward modelingResponse-level reward modeling and token-level reward modeling Pairwise preference modeling and listwise preference modelingPairwise preference modeling and listwise preference modeling Pointwise reward modeling and preferencewise modelingPointwise reward modeling and preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement 
learning, generative adversarial Imitation Learning Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial Imitation Learning Supervised learning (e.g., classification, regression, preference learning) Supervised learning (e.g., classification, regression, preference learning) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Non-numerical RMNon-numerical RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Reward model for LLM alignment Rule-based RM Rule-based RM RL-based methodRL-based method r1 r2 r1 r2 r1 r2 r1 r2 HumanHuman Generative RM from AI Feedback Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF)RL from Human Feedback (RLHF) RL from AI Feedback (RLAIF) Discriminative RM from Human Feedback Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy r1 r2 r1 r2 Human Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF) RL from AI Feedback (RLAIF) Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy Responses Preference Preference l y l y w y l y w y Prompt x Prompt x r1 r2 r1 r2 Human Generative RM from AI Feedback Off-the-shelf LLM RM training SFT model Reforcement learning RL from Human Feedback (RLHF) RL 
from AI Feedback (RLAIF) Discriminative RM from Human Feedback Reforcement learning RM training LM policy LM policy Responses Preference Preference l y w y Prompt x SFT modelSFT model Implicit RM SFT model Implicit RM SFT modelSFT model Explicit RM Base LLMBase LLM Reforcement learning training loop SFT model Explicit RM Base LLM Reforcement learning training loop Final LLM l y l y w y l y w y l y l y w y l y w y Prompt x Prompt x Prompt x Prompt x Responses Responses Maxmum likelihood Maxmum likelihood PPO-based paradigm for LLM alignmentPPO-based paradigm for LLM alignment DPO-based paradigm for LLM alignmentDPO-based paradigm for LLM alignment Reward design in LLM alignmetn Reward design in LLM alignmetn FeedbackFeedback Reforcement learningReforcement learning OptimizationOptimization Explicit RM Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. General RM Reward design in LLM alignmetn Feedback Reforcement learning Optimization Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. 
General RM Prompt x Prompt x ResponsesResponses y Responses y Collect human demonstration data Prompt x Prompt x SFT modelSFT model l y l y w y l y w y Collect human prerference data l y w y lw y Base LLMBase LLMSFT modelSFT model Supervised Fine-tune Base LLMSFT model Supervised Fine-tune Base LLMBase LLMRMRM Supervised Fine-tune Base LLMRM Supervised Fine-tune Step2 training reward modelStep1 training Supervised Fine-tuning model Step3 training Policy using PPO Prompt x Prompt x ResponsesResponses y Responses y RM LM policy PPO Prompt x Prompt x Implicit RM Reward modelReward model Rules Discriminative RMDiscriminative RM Generative Reward Implicit RMImplicit RM RL-based training (PPO) RL-free training(DPO) Learned RM Learned RM Rule-based RMRule-based RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM Fine-grained RMFine-grained RM ...... Discriminative RMDiscriminative RM Trend of reward design in LLM alignmetn Trend of reward design in LLM alignmetn Construction BasisConstruction Basis RrepresentationRrepresentation GranularityGranularity Explicit RM Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. Trend of reward design in LLM alignmetn Construction Basis Rrepresentation Granularity Explicit RM 1. Rule-based RM/ Learned RM 2. Response- Level RM/ Token- Level RM 3. More Fine-grained RM • Multi-Objective RM • Multi-Stage RM • Hierarchical RM, etc. 
Learned RM Implicit RMImplicit RM RL-based training (PPO) RL-free training(DPO) Discriminative RMDiscriminative RM GranularityGranularity Trends in Reward Design for LLM Alignment Trends in Reward Design for LLM Alignment Learned RM Learned RMRule-based RMRule-based RM Learned RMRule-based RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM Rule-based RM Learned RMRule-based RM Learned RM 202220232024202220232024 Construction Basis:From Rule-based to Learned Reward Models, Format:From numerical Reward to non- numerical Reward Models,Expression: From Explicit Reward to Implicit Reward Models,Granularity: From general Reward to fine-grained Reward Models Rule-based RMRule-based RMData-driven RMData-driven RMHybrid RMHybrid RMRule-based RMData-driven RMHybrid RM Format Non-numerical Reward Models Non-numerical Reward Models Format Non-numerical Reward Models Explicit Reward Explicit Reward Expression Implicit Reward Implicit Reward General RM General RM Granularity Fine-grained RMFine-grained RM Construction Basis Language Model Outputs High-quality Outputs High-quality Outputs Reward modelReward model Language Model Outputs High-quality Outputs Reward model TraingTraing TraingTraing Reward design Rule-based RM/ Data-driven RM Numerical RM / Non-numerical RM / Feedback Optimization Application Chanllenges LLM alignment Rule-based RM / Data-driven RM Reward designReward design Rule-based RM Data-driven RM Explicit RM Implicit RM General RM Fine-grained RM ➢Discriminative RM/ Generative RM ➢Response- Level RM / Token- Level RM ➢Multi-objective RM Construction Basis Construction Basis Construction Basis Format Feedback chanllenge1chanllenge2chanllenge3chanllenge4 chanllenge5 Pairwise feedback Listwise feedback Binary feedback Preference feedback RL-free method LLM alignment LLM alignment ApplicationApplication OptimizationOptimization RepresentationRepresentation 
GranularityGranularity Numerical RM Non-numerical RM Multi-task Multi-model Simple scene Multi-task Multi-model Simple scene Numerical RM /Non-numerical RM General RM/ Fine-grained RM Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise Explicit RM/ Implicit RM Format Representation Granularity LLM alignment Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking TaxnomyTrend (a) Reward design Feedback Optimization Application Chanllenges LLM alignment Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Trend Construction Basis Format Representation Granularity Taxnomy Construction Basis Format Representation Granularity Taxnomy Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/ Fine-grained RM Explicit RM/ Implicit RM Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/ Fine-grained RM Explicit RM/ Implicit RM Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise Multi-task/ Multi-modal scenes RL-based method/ RL-free method Binary /Preference Pairwise / Listwise LLM alignment LLM alignment OptimizationOptimization ApplicationApplication Reward designReward design Feedback Chanllenges Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Chanllenges Training instability Subjectivity and instability of human feedback Scarcity and high cost of human feedback Mode collapse Reward hacking Construction Basis Format Representation Granularity Taxnomy Construction Basis Format Representation Granularity Taxnomy Binary /Preference Pairwise / Listwise Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/Fine-grained RM Explicit RM/Implicit RM Multi-task/Multi-modal 
scenes Multi-task/Multi-modal scenes Trend RL-based method/RL-free method Check PatientPatient Large language model Outputs FeedbackFeedback Outputs Reward model Reward model Reinforecement learning Large language model Large language model Human annotaor Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs FeedbackFeedback Human annotaor Large language model Outputs Feedback Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs FeedbackFeedback Outputs Reward model Reward model Reinforecement learning Large language model Large language model Human annotaor Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs FeedbackFeedback Human annotaor Large language model Outputs Feedback Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Outputs Reward model Reinforecement learning Large language model Human annotaor Large language model Outputs Feedback Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization v InputInput Output Feedback Reward design Optimization Treatment Dignosis Check Prescription Subjectivity and instability of human feedback Scarcity and high cost of human feedbackScarcity and high cost of human feedback Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM General RM/Fine-grained RM Explicit RM/Implicit RM Supervised learning Numerical feedback/Ranked feedback/Textual feedback In-Context 
learning Mode collapse Reward hacking Training instability Large languge model DoctorDoctor LLM alignmentLLM alignment PatientPatient Symptoms Human feedback/ AI feedback Reinforcement learning General RM/Fine-grained RM InputInputLarge languge model OutputOutput OutputOutput OutputOutput FeedbackFeedback Large language model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Large language model Outputs FeedbackFeedback Human annotaor Large language model Large language model Reward model OutputsOutputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Human annotaor Large language model Reward model Outputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotaor Supervised Traing/ In-Context Traing OptimizationOptimization Large language model Outputs Feedback Human annotaor Large language model Reward model Outputs Feedback (a) Explicit Reward model (b) Implicit reward model (a) Explicit Reward model (b) Implicit reward model Large language model Large language model Outputs Reinforecement learning Human annotator Supervised learning OptimizationOptimization Large language model Large language model Outputs FeedbackFeedback Human annotator Large language model Large language model Reward model OutputsOutputs Feedback (a) Explicit Reward model (b) Implicit reward model Large language model Outputs Reinforecement learning Human annotator Supervised learning OptimizationOptimization Large language model Outputs Feedback Human annotator Large language model Reward model Outputs Feedback Improved PPO Substituted PPO Improved PPO Substituted PPO DPO and its variantsDPO and its variants Other methods InputInput Output 
Feedback Reward design Optimization TreatmentDignosis Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM Explicit RM/Implicit RM RL-based method Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback RL-free method Ethics & Values Large languge model DoctorDoctor LLM alignmentLLM alignment v Human feedback/ AI feedback Bias & FairnessBias & Fairness Safety & HarmfulnessSafety & Harmfulness Factuality & ReliabilityFactuality & Reliability Controllability & InterpretabilityControllability & Interpretability Privacy & SecurityPrivacy & Security HarmlessHarmless HonestHonest Helpful Helpful Virus Threats Patient Therapeutic dilemmas Reward hacking Mode collapse Scarcity and high cost of human feedback Subjectivity and instability of human feedback Training instability PrescriptionPrescription Clinical targets Medical test Input Output Feedback Reward design Optimization TreatmentDignosis Rule-based RM/ Data-driven RM Numerical RM /Non-numerical RM Explicit RM/Implicit RM RL-based method Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback RL-free method Ethics & Values Large languge model Doctor LLM alignment v Human feedback/ AI feedback Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Virus Threats Patient Therapeutic dilemmas Reward hacking Mode collapse Scarcity and high cost of human feedback Subjectivity and instability of human feedback Training instability Prescription Clinical targets Medical test Large language model Large language model Response Human annotator Large language model Large language model Reward model ResponseResponse Feedback Query Query Demonstration Large language model Response Human annotator Large language model Reward model Response Feedback Query Query Demonstration Large language model Response Human annotator Large language model Reward model Response Feedback Query Query Demonstration (b) 
In-context learning Evaluation Curation Optimization Response FeedbackFeedback Human annotator Large language model Large language model Query Response Feedback Human annotator Large language model Query (c) DPO Reward model z Response Feedback Human annotator Large language model Query (c) DPO Reward model z Training objective Large language model Large language model Response Reinforecement learning Human annotator Optimization Large language model Large language model Reward model ResponseResponse Feedback Query (a) Reinforecement learning Large language model Response Reinforecement learning Human annotator Optimization Large language model Reward model Response Feedback Query (a) Reinforecement learning Input Output Feedback Reward design Optimization Rule-based RM/ Data-driven RM/ Hybrid RM Explicit RM/ Implicit RM RL-based optimization Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback SL-based optimization Ethics & Values Large languge model Doctor LLM alignment v Human feedback/ AI feedback Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Prescription Clinical targets Medical test Virus Threats ICL-based optimization Numerical RM /Non-numerical RM TreatmentTreatment DignosisDignosis Input Output Feedback Reward design Optimization Ethics & Values Large languge model Doctor LLM alignment v Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Clinical targets Medical test Symptoms TreatmentTreatment DignosisDignosis PrescriptionPrescription Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability How does the reward model function? 
Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function in RL, ICL, and SL? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical reward modeling Nonnumerical reward modeling Rule-based reward modeling Data-driven reward modeling Hybrid reward modeling Explicit reward modeling Implicit reward modeling Standard SL-based LLM alignment DPO and its variants RL-based LLM alignment ICL-based LLM alignment How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Preference learning Classification / regression models Inverse reinforcement learning E.g., DPO (Rafailov et al., 2023), TDPO (Zeng et al., 2023), Step-DPO (LAi et al., 2024) E.g., Supervised Fine-tuning (Ouyang et al., 2022), Instruction Tuning (Wei et al., 2021), Constitutional AI (Bai et al., 2022a) E.g., Chain-of-Thought Feedback (Wei et al., 2022), Meta-RM for ICL (Liu et al., 2024), Rewarded In-Context Learning (Chen et al., 2024) E.g., InstructGPT (Ouyang et al., 2022), InstructGPT(), KTO (Yang et al., 2023) E.g., Inverse RL Alignment (Sun & van der Schaar, 2024), Variational IRL for LLM (Cai et al., 2024), Dynamic Reward Scaling IRL (Cheng et al., 2025) E.g., RLAIF (Bai et al., 2022b), Direct-RLAIF (Lee et al., 2024a), UltraFeedback (Cui et al., 2024) E.g., Reward Regression (Christiano et al., 2017), Feedback Regression (Ziegler et al., 2019), Score-based Reward Models (Kadavath et al., 2022) E.g., TLRM (Wu et al., 2023), LiPO (Lu et al., 2024), Variational RM (Cai et al., 2024) E.g., Contrastive Reward Modeling (Yuan et al., 2023), Ordinal Reward Modeling (Zheng et al., 2024), Ranking-based RM (Kim et al., 2024) E.g., Prompt-based RM (Ouyang et al., 
2022), Heuristic RM (Stiennon et al., 2020), Symbolic Reward Templates (Zhang et al., 2023) E.g., Adaptive Hybrid RM (Kim et al., 2024), Gating-based Hybrid RM (Wang et al., 2024), Rule-guided Preference Learning (Chen et al., 2024) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + token-level)Multi-granularity RM (e.g., response-level + token-level) Single-value reward modeling / multi-value reward modelingSingle-value reward modeling / multi-value reward modeling Response-level reward modeling / token-level reward modelingResponse-level reward modeling / token-level reward modeling Pairwise preference modeling / listwise preference modelingPairwise preference modeling / listwise preference modeling Pointwise reward modeling / preferencewise modelingPointwise reward modeling / preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Supervised learning (e.g., classification / regression, preference learning) Supervised learning (e.g., classification / regression, preference learning) How is the reward model constructed 
in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Nonnumerical RM Rule-based RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Reward model for LLM alignment Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Single-value reward modeling and multi-value reward modelingSingle-value reward modeling and multi-value reward modeling Response-level reward modeling and token-level reward modelingResponse-level reward modeling and token-level reward modeling Pairwise preference modeling and listwise preference modelingPairwise preference modeling and listwise preference modeling Pointwise reward modeling and preferencewise modelingPointwise reward modeling and preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement 
learning, generative adversarial Imitation Learning Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial Imitation Learning Supervised learning (e.g., classification, regression, preference learning) Supervised learning (e.g., classification, regression, preference learning) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Non-numerical RMNon-numerical RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Reward model for LLM alignment Rule-based RM Rule-based RM Fig. 1A hierarchical taxonomy of reward modeling in LLM alignment. examine RM design from three complementary perspectives: mathematical formula- tion, methodological modeling and construction, and functional roles under different optimization paradigms. Building on these perspectives, a high-level categorization framework is introduced, covering: (i) Numerical vs. Non-numerical RMs; (i) Rule- based, Data-driven, and Hybrid RMs; and (i) Explicit vs. Implicit RMs. Section 6 analyzes recent methodological trends in reward design and their impact on the evolving landscape of LLM alignment, highlighting the shift from RL-based to RL- free approaches and the increasing demand for methods that address multi-objective, multi-task, and multi-modal scenarios. Promising future directions for advancing reward design are also outlined. Finally, Section 7 concludes the paper. 
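The explicit vs. implicit RM distinction can be made concrete with DPO, in which the policy itself encodes the reward rather than a separately trained scorer. Up to a prompt-dependent constant, the implicit reward is r(x, y) = β log(π_θ(y|x)/π_ref(y|x)). The sketch below is illustrative only; the log-probabilities are toy numbers, not outputs of any real model, and the code is not taken from the survey.

```python
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO's implicit reward for one response.

    Equals beta * log(pi_theta(y|x) / pi_ref(y|x)); the prompt-dependent
    constant beta * log Z(x) is omitted, as it cancels in pairwise losses.
    """
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w_policy: float, logp_w_ref: float,
             logp_l_policy: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    """Pairwise DPO objective: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    margin = (implicit_reward(logp_w_policy, logp_w_ref, beta)
              - implicit_reward(logp_l_policy, logp_l_ref, beta))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy has raised the chosen response's log-likelihood
# above the reference's and lowered the rejected one's, so the margin is
# positive and the loss is below log(2) (the value at zero margin).
loss = dpo_loss(logp_w_policy=-10.0, logp_w_ref=-12.0,
                logp_l_policy=-14.0, logp_l_ref=-12.0)
```

No reward network is ever materialized here, which is exactly why such methods are classed as implicit (and RL-free): the feedback signal is absorbed directly into the policy's training objective.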
2 Conceptual framework
LLM alignment aims to steer models away from failure modes and toward generating outputs that are consistently helpful, honest, and harmless [7]. To elucidate the critical role of reward design in achieving effective LLM alignment, it is helpful to begin with a structured conceptual framework that captures the core components and their interactions. Drawing inspiration from [31], a medical metaphor is employed to illustrate the full pipeline from input to output, comprising the stages of feedback (diagnosis), reward design (prescription), and optimization (treatment).
Fig. 2 A novel conceptual framework views LLM alignment as a medical treatment process. In this analogy, the model (the patient) produces raw outputs, which are then evaluated (diagnosed) via human or automated feedback. Based on this diagnosis, a tailored reward function (prescription) is crafted to guide the model's optimization (treatment), ensuring that its behavior improves toward desired alignment goals (clinical targets). This analogy emphasizes the central role of reward design as the link between observation (feedback) and intervention (optimization), situating it at the core of the alignment pipeline.
Feedback (diagnosis): An LLM $M:\mathcal{X}\to\mathcal{Y}$ performs a specific task by mapping an input $x\in\mathcal{X}$ to an output text $\hat{y}\in\mathcal{Y}$. A feedback model $F:\mathcal{X}\times\mathcal{Y}\to\mathcal{Z}$ provides structured information about the quality or properties of a response. Given an input $x\in\mathcal{X}$ and a response $\hat{y}\in\mathcal{Y}$, which may come from a model, a human, or another agent, it computes $z=F(x,\hat{y})\in\mathcal{Z}$, where $\mathcal{Z}$ captures diagnostic signals such as error types, critiques, explanations, or evaluations that help interpret or improve the response. During this stage, the model's outputs are evaluated for any misalignments or undesired behaviors. Feedback plays a crucial role in identifying areas where the model deviates from its intended goals or ethical considerations. Feedback can take various forms, such as human-generated feedback, AI-generated feedback, binary feedback, preference feedback, pairwise feedback, and listwise feedback, which collectively inform the reward design process.
Reward design (prescription): An RM $R:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ assigns a scalar score to a model output. Given an input $x\in\mathcal{X}$ and a response $\hat{y}\sim M(x)$, it computes $r=R(x,\hat{y})\in\mathbb{R}$, where higher $r$ indicates better alignment with human preferences. This is a pivotal stage where a reward function is crafted to guide the model's behavior toward desired directions. Reward design provides structured incentives that steer the model toward generating outputs aligned with human values, ensuring that the content is safer and more ethical.
Optimization (treatment): Given a distribution over inputs $x\in\mathcal{X}$ and an RM $R:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, the optimization goal is to adjust the pretrained model $M$ that generates responses $\hat{y}\sim M(x)$ so as to maximize the expected reward, $\mathbb{E}_{x,\,\hat{y}\sim M(x)}[R(x,\hat{y})]$. In this stage, the model undergoes iterative refinement using optimization techniques, such as supervised learning (SL), RL, or ICL. The goal is to enhance the model's performance and alignment, ensuring it generates high-quality outputs that adhere to ethical and societal standards.
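The three stages above can be sketched as a toy pipeline. Everything here is an illustrative stand-in under made-up assumptions (a canned "model", a length-based "feedback" rule), not the paper's implementation; the point is only the interfaces $F$, $R$, and the expected-reward objective.

```python
# Toy sketch of the feedback -> reward design -> optimization pipeline.
# All functions are hypothetical placeholders for illustration only.
import random

def model(x):
    # Stand-in LLM M: X -> Y (here: a random canned response).
    return random.choice(["short answer", "a longer, more helpful answer"])

def feedback(x, y):
    # F: X x Y -> Z, a diagnostic signal (here: a trivial critique).
    return "too short" if len(y) < 15 else "ok"

def reward(x, y):
    # R: X x Y -> R, a scalar score distilled from the feedback signal.
    return 1.0 if feedback(x, y) == "ok" else 0.0

def expected_reward(x, n_samples=100):
    # Monte Carlo estimate of E_{y ~ M(x)}[R(x, y)], the optimization target.
    return sum(reward(x, model(x)) for _ in range(n_samples)) / n_samples

print(expected_reward("prompt"))
```

Optimization (treatment) then amounts to changing `model` so that this estimate increases, whether via RL, SL, or ICL.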
This framework also provides important insights for reward design: reward modeling in LLM alignment involves balancing multi-dimensional feedback and managing trade-offs among multiple objectives, while maintaining a task-oriented focus. Although multi-dimensional reward design is desirable, effective strategies should not aim for superficial comprehensiveness. Instead, they should adopt a task-driven, symptom-targeted approach, formulating precise reward strategies that address specific alignment deficiencies. Furthermore, while achieving high-quality outputs remains the ultimate goal, the stability and robustness of the optimization process are equally critical considerations. By incorporating intermediate optimization indicators and process-level feedback into reward design, the training process can achieve greater controllability and reliability, thereby reducing the risks of divergence, overfitting, and pathological behaviors, and ultimately enabling more robust and efficient model alignment.
3 Reward model: mathematical formulation
3.1 Numerical reward modeling
This section explores how variations in learning paradigms, input data formats, information granularity, and output forms drive the evolution and adaptation of the mathematical formulation of RMs.
3.1.1 Pointwise reward modeling and preferencewise modeling
A standard way to construct an RM is to regress a scalar function $r_\varphi(x,y)$ onto absolute human scores. Given a labeled corpus $\mathcal{D}=\{(x^{(i)},y^{(i)},s^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is the prompt, $y^{(i)}$ the system response, and $s^{(i)}\in\mathbb{R}$ a human quality score (e.g., a Likert value in $[0,1]$ or on a 1–5 scale), the model is trained by minimizing the mean-squared error
$$\mathcal{L}_{\mathrm{MSE}}(\varphi) = \frac{1}{N}\sum_{i=1}^{N}\left(r_\varphi(x^{(i)},y^{(i)})-s^{(i)}\right)^2. \quad (1)$$
This formulation treats the RM as a continuous regressor rather than a relative ranker.
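The pointwise objective in Eq. (1) can be sketched with a deliberately tiny model: a one-parameter linear reward head fit to made-up scores by gradient descent. The featurization and data are toy assumptions, not the paper's setup; a real RM would be a learned Transformer head.

```python
# Minimal sketch of pointwise reward regression (Eq. 1): fit a scalar reward
# r_phi to absolute human scores by minimizing mean-squared error.

# Toy corpus: each item is (a scalar feature of (x, y), human score s).
data = [(0.1, 0.2), (0.4, 0.5), (0.8, 0.7), (1.0, 0.9)]

def r_phi(phi, f):
    # Linear reward head: r_phi(x, y) = phi * f(x, y).
    return phi * f

def mse_loss(phi):
    # L_MSE(phi) = (1/N) * sum_i (r_phi(x_i, y_i) - s_i)^2
    return sum((r_phi(phi, f) - s) ** 2 for f, s in data) / len(data)

phi = 0.0
for _ in range(200):
    # Plain gradient descent on the single parameter phi.
    grad = sum(2 * (r_phi(phi, f) - s) * f for f, s in data) / len(data)
    phi -= 0.5 * grad

print(round(phi, 3), round(mse_loss(phi), 4))
```

For this linear case the loop converges to the closed-form least-squares solution, which makes the sketch easy to verify by hand.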
Despite providing explicit supervision, supervised scoring is susceptible to subjectivity and inconsistency in absolute ratings, and may fail to capture the full diversity of possible responses.
In the context of LLM alignment, reward modeling typically involves transforming human preference data into explicit scoring functions for optimization, as first introduced by [13]. In most implementations, RMs share the same Transformer backbone as the underlying language model, with the hidden representation of the final token projected through a linear layer to produce a scalar reward score [29]. To enable a systematic analysis of reward modeling approaches, two key dimensions are commonly considered: the structure of preference feedback (e.g., pairwise vs. listwise), and the output format of the RM (e.g., single-value vs. multi-value). Each dimension is described in detail below.
Given a dataset of human preference annotations $\mathcal{D}=\{(x^{(i)},y_w^{(i)},y_\ell^{(i)})\}_{i=1}^{N}$, where for each prompt $x$ the response $y_w$ is preferred to $y_\ell$, a scalar RM $r_\varphi(x,y)$ is trained by modeling the probability of the preferred response under the Bradley–Terry formulation [32]:
$$p^*(y_w \succ y_\ell \mid x) = \frac{\exp\left(r_\varphi(x,y_w)\right)}{\exp\left(r_\varphi(x,y_w)\right) + \exp\left(r_\varphi(x,y_\ell)\right)} = \sigma\left(r_\varphi(x,y_w) - r_\varphi(x,y_\ell)\right). \quad (2)$$
Taking the logarithm gives the log-probability $\log p^*(y_w \succ y_\ell \mid x) = \log\sigma\left(r_\varphi(x,y_w) - r_\varphi(x,y_\ell)\right)$. The parameters $\varphi$ are optimized by minimizing the negative log-likelihood over $\mathcal{D}$, yielding the loss:
$$\mathcal{L}_{\mathrm{pairwise}}(r_\varphi,\mathcal{D}) = -\mathbb{E}_{(x,y_w,y_\ell)\sim\mathcal{D}}\left[\log\sigma\left(r_\varphi(x,y_w) - r_\varphi(x,y_\ell)\right)\right]. \quad (3)$$
Minimizing $\mathcal{L}_{\mathrm{pairwise}}$ encourages the reward gap $r_\varphi(x,y_w) - r_\varphi(x,y_\ell)$ to be large whenever humans prefer $y_w$, thereby aligning the model's scores with human judgments.
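The per-example Bradley–Terry loss of Eq. (3) is short enough to write out directly. The reward scores below are illustrative numbers, not outputs of a trained RM; the sketch only shows that the loss shrinks as the reward gap between the preferred and rejected responses grows.

```python
# Sketch of the pairwise Bradley-Terry objective (Eq. 3):
# per-example loss = -log sigma(r(x, y_w) - r(x, y_l)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(r_w, r_l):
    # Negative log-likelihood of y_w being preferred over y_l.
    return -math.log(sigmoid(r_w - r_l))

# A larger reward gap yields a smaller loss (illustrative scores).
small_gap = pairwise_loss(0.5, 0.4)   # gap = 0.1
large_gap = pairwise_loss(2.0, -1.0)  # gap = 3.0
print(round(small_gap, 4), round(large_gap, 4))
```

When the two scores are equal the loss is $\log 2$, i.e., the model is maximally uncertain about the preference; training over a dataset averages this quantity, as in the expectation of Eq. (3).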
3.1.2 Pairwise preference modeling and listwise preference modeling
Given a dataset of human ranking annotations $\mathcal{D}=\{(x^{(i)},\langle y_1^{(i)},\ldots,y_K^{(i)}\rangle)\}_{i=1}^{N}$, where for each prompt $x$ the responses $\{y_j\}_{j=1}^{K}$ are ordered from most to least preferred, a scalar RM $r_\varphi(x,y)$ is trained by modeling the probability of the observed ranking under a Plackett–Luce model [33]:
$$p^*(y_1 \succ \cdots \succ y_K \mid x) = \prod_{j=1}^{K} \frac{\exp\left(r_\varphi(x,y_j)\right)}{\sum_{t=j}^{K}\exp\left(r_\varphi(x,y_t)\right)}. \quad (4)$$
Taking the logarithm yields the log-probability $\log p^*(y_1 \succ \cdots \succ y_K \mid x) = \sum_{j=1}^{K}\left[r_\varphi(x,y_j) - \log\sum_{t=j}^{K}\exp\left(r_\varphi(x,y_t)\right)\right]$. The parameters $\varphi$ are optimized by minimizing the negative log-likelihood over $\mathcal{D}$, giving the loss
$$\mathcal{L}_{\mathrm{listwise}}(r_\varphi,\mathcal{D}) = -\mathbb{E}_{(x,\mathbf{y})\sim\mathcal{D}}\left[\sum_{j=1}^{K}\left(r_\varphi(x,y_j) - \log\sum_{t=j}^{K}\exp\left(r_\varphi(x,y_t)\right)\right)\right]. \quad (5)$$
Minimization of $\mathcal{L}_{\mathrm{listwise}}$ drives the reward differences between higher-ranked and lower-ranked responses to be large, thus aligning the model's scores with full ranking judgments.
3.1.3 Response-level reward modeling and token-level reward modeling
Conventional RMs typically score entire responses, resulting in sparse feedback that may lead to unstable training and insufficient supervision of local quality. To address this, token-level RMs have been proposed to assign individual rewards to each token, offering a more detailed optimization signal. Given a dataset of human preference annotations $\mathcal{D}=\{(x^{(i)},y_w^{(i)},y_\ell^{(i)})\}_{i=1}^{N}$, where each pair $(y_w,y_\ell)$ represents a preferred and less-preferred response to the same prompt $x$, token-level reward modeling assigns a scalar reward to each token $y_t$ within a response sequence $y=(y_1,\ldots,y_T)$, conditioned on its left context and the prompt. Formally, a token-level RM outputs a vector of token-wise scores: $q_\varphi(x,y) = \left(q_\varphi(x,y_{\le 1}),\, q_\varphi(x,y_{\le 2}),\, \ldots,\, q_\varphi(x,y_{\le T})\right)$, where each $q_\varphi(x,y_{\le t})\in\mathbb{R}$ denotes the predicted reward for the token $y_t$, given the prefix $(y_1,\ldots,y_{t-1})$ and the prompt $x$.
To align token-level scores with human preferences, the aggregated sequence reward is defined by summing the token-wise scores:R φ (x,y) = P T t=1 q φ (x,y t |y <t ),and is used to model the pairwise preference probability under the Bradley–Terry formula- tion:p ∗ y w ≻y ℓ |x =σ R φ (x,y w )−R φ (x,y ℓ ) ,whereσ(·) denotes the sigmoid function. Taking the logarithm yields the log-probability: logp ∗ y w ≻y ℓ |x = logσ R φ (x,y w )−R φ (x,y ℓ ) ,and the corresponding training objective is defined as: L token-agg q φ ,D =−E (x,y w ,y ℓ )∼D logσ R φ (x,y w )−R φ (x,y ℓ ) .(6) MinimizingL token-agg encourages the cumulative reward of the preferred responsey w to exceed that of the less-preferred responsey ℓ , thus indirectly shaping token-level scores to reflect the quality contributions of individual tokens. An alternative step-wise loss directly compares token-level scores: L token-step (q φ ,D) =−E (x,y w ,y ℓ )∼D max(T w ,T ℓ ) X t=1 logσ q φ (x,y w,t )−q φ (x,y ℓ,t ) .(7) Building on these baselines, token-level reward modeling has been advanced and broadened to accommodate diverse application needs. Yoon et al. [34] introduced Token-Level Continuous Reward(TLCR) for RLHF, which used a discrimina- tor to distinguish positive and negative tokens and assigned context-aware continuous rewards based on its confidence, achieving consistent improvements over sequence- level and token-level discrete rewards. Xu et al. [35] enhanced LLM alignment through fine-grained token-level supervision by asking annotators to minimally edit less pre- ferred responses to create a refined dataset, which was used to train a token-level reward model and guide fine-grained PPO training. Zeng et al. [36] introducedToken- level Direct Preference Optimization(TDPO), which utilized the Bradley–Terry model for a token-based reward system to enhance KL divergence regulation while preserving simplicity without explicit reward modeling. Fu et al. 
[37] introduced the Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations for large vision-language models, using a perturbation-based method to generate synthetic hard negatives with token-level labels for training. Chen et al. [38] proposed the Q-function Reward Model (Q-RM) by decoupling reward modeling from language generation and optimizing a discriminative policy for token-level rewards. Overall, token-level reward modeling provides finer credit assignment and more stable training, improving safety, controllability, and interpretability, and better supporting real-world deployment.

3.1.4 Single-value reward modeling and multi-value reward modeling

Depending on the output information density of the numerical preference models, RMs can be categorized as either single-value RMs or multi-value RMs. Single-value RMs assign a scalar score to each candidate output, $r_\phi(x, y) \in \mathbb{R}$, where $x$ denotes the input (e.g., a prompt or instruction), $y$ is the model-generated response, and $r_\phi$ is a neural network parameterized by $\phi$ that produces a scalar utility score. Multi-value RMs generate a vector of scores to capture multiple quality dimensions simultaneously, $r_\phi(x, y) = \big[ r_\phi^{(1)}(x, y), r_\phi^{(2)}(x, y), \dots, r_\phi^{(K)}(x, y) \big] \in \mathbb{R}^{K}$, where each component $r_\phi^{(k)}(x, y)$ corresponds to a specific evaluation objective, such as helpfulness, harmlessness, or honesty. This formulation supports multi-objective alignment and enables fine-grained supervision from human feedback. Although single-value reward modeling already shows good performance in the aforementioned studies, there exist scenarios where its capability becomes limited. For instance, a single-value RM might overfit on the feedback data, have difficulty addressing multiple aspects of human preferences, or fall short in delivering appropriate preference signals for fine-grained portions of output sequences [29]. To overcome these limitations, multi-value reward modeling approaches have been introduced.
As an example, Christiano et al. [13] addressed overfitting by training multiple RMs on different random partitions of the preference dataset and averaging their individually normalized outputs to generate the final reward. Likewise, Coste et al. [39] reduced overfitting by combining models initialized with various random seeds, either by taking the minimum reward or applying a weighted penalty based on variance. In order to generate fine-grained reward signals, many works adopt multi-value reward modeling as a means to support alignment with multiple objectives in complex tasks. Rather than assigning individual scalar rewards to each objective, vectorized RMs embed the interdependencies among several quality metrics into a unified multidimensional format, which enables more coherent trade-offs and facilitates more efficient optimization across related goals. Liu et al. [40] utilized activation vectors extracted from LLMs fine-tuned on preferred versus non-preferred outputs as predictors of rewards within the model's representational space. In addition, Frans et al. [41] investigated how to encode latent reward functions into vectors derived from corresponding data samples, thereby enabling RL to operate across a variety of tasks using more flexible and generalizable reward signals.

3.2 Non-numerical reward modeling

Beyond numerical RMs, reward mechanisms that provide natural-language preference signals have also been explored. For example, Wang et al. [42] and Cui et al. [43] introduced critic networks that learn from human critique text to evaluate model outputs and offer corrective suggestions, producing detailed assessments in prose form [44]. The tool-augmented validation framework of Li et al. [45] likewise leveraged external utilities to generate natural-language feedback as part of the reward signal. Moreover, Akyürek et al.
[46] employed RL to train a critique-generation module, using similarity metrics between human-preferred outputs and those refined via generated critiques as the optimization signal. Some systems even interleave feedback within the generation process: for instance, Self-Refine [47] issued stepwise commentary during chain-of-thought reasoning to guide each subsequent generation rather than relying solely on a final aggregated score. Non-numerical RMs convey discrete labels or natural-language feedback that must be interpreted or mapped before optimization, delivering richer contextual guidance, improved interpretability, and enhanced human-in-the-loop refinement.

4 Reward model: methodological modeling and construction

The construction paradigm of RMs constitutes a foundational aspect of LLM alignment. It critically shapes the model's ability to capture and operationalize human preferences or normative objectives, while also influencing key properties such as generalization capacity, robustness, and training efficiency. Different paradigms, including rule-based, data-driven, and hybrid approaches, involve distinct trade-offs in terms of scalability, feedback dependency, interpretability, and adaptability to specific tasks. Therefore, the principled selection and design of construction methods are essential to ensure that RMs provide reliable and effective guidance for optimization toward aligned behavior. Table 1 summarizes the construction paradigms for RMs and their key characteristics.

4.1 Rule-based reward modeling and data-driven reward modeling

In the development of LLM alignment, reward modeling has followed two primary technical trajectories: rule-based design schemes and data-driven learning paradigms. Rule-based RMs were introduced in the early stages, focusing on the direct encoding of behavioral constraints into the system. These approaches relied on human experts to define explicit rule sets, offering high interpretability and precise control.
Such characteristics made them particularly suitable for high-stakes scenarios where well-defined behavioral boundaries were essential. For example, Mu et al. [48] proposed the Rule-Based Rewards (RBR) framework, which combined manually specified behavioral rules with verification from a large language model. This framework enabled effective alignment without relying on large-scale human preference data. In addition to reducing dependence on manual annotation, it allowed for rapid updates in response to evolving safety requirements. Rule-based methods demonstrated particular robustness in environments where generalization was unreliable or preference data was limited. By explicitly constraining the model's action space, these approaches also helped mitigate issues commonly observed in RLHF pipelines, such as reward hacking and overly conservative refusals.

Table 1: Construction paradigms for RMs and their key characteristics.

Rule-based RM
- Construction approach: rewards are explicitly defined using handcrafted rules, external knowledge bases, or predefined heuristics.
- Strengths: high interpretability and transparency; strong controllability in safety-critical tasks.
- Limitations: limited coverage and flexibility; difficult to scale in open-ended tasks; prone to rule misspecification.
- Typical examples: toxicity filters; safety constraints via Wikidata queries; syntax/logic rule checkers.
- Best suited scenarios: safety-critical applications; tasks requiring strict compliance with explicit norms.

Data-driven RM
- Construction approach: rewards are learned from labeled datasets through supervised learning, preference modeling, or inverse reinforcement learning.
- Strengths: better scalability to complex tasks; can generalize from data beyond human-written rules.
- Limitations: requires large, high-quality labeled data; may overfit to annotation biases; harder to interpret learned rewards.
- Typical examples: pairwise preference learning; reward modeling via classification/regression; inverse reinforcement learning.
- Best suited scenarios: tasks where human preference data is abundant; scenarios demanding nuanced semantic understanding.

Hybrid RM
- Construction approach: combines multiple reward signals (e.g., rule-based + data-driven), often through weighted fusion or multi-stage pipelines.
- Strengths: balances interpretability and expressiveness; increases robustness to reward misspecification; adapts to diverse alignment objectives.
- Limitations: increases system complexity; requires careful signal balancing and integration design; fusion strategies may introduce alignment trade-offs.
- Typical examples: multi-source RM (e.g., rule-based + data-driven); multi-granularity RM (e.g., response-level + sentence-level + token-level); multi-aspect RM (e.g., fluency + factuality + safety); multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring); multi-modal RM (e.g., text + images + audio).
- Best suited scenarios: complex alignment tasks where single-source rewards are insufficient; contexts with competing alignment objectives (e.g., quality vs. safety).
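To make the rule-based paradigm concrete, here is a minimal sketch (not from the paper) of how handcrafted rules can be scored directly as a reward; the rule set, term list, and weights are hypothetical placeholders:

```python
def rule_based_reward(prompt, response,
                      banned_terms=("violence", "slur"),
                      max_len=200):
    """Illustrative rule-based reward: handcrafted checks scored
    directly, with no learned parameters. Rules and weights here
    are hypothetical, chosen only to show the pattern."""
    reward = 0.0
    text = response.lower()
    # Safety rule: hard penalty for any banned term.
    if any(term in text for term in banned_terms):
        reward -= 1.0
    # Format rule: mild penalty for overly long answers.
    if len(response.split()) > max_len:
        reward -= 0.2
    # Compliance rule: small bonus for not refusing outright.
    if not text.startswith(("i cannot", "i can't")):
        reward += 0.5
    return reward
```

Such a function is fully interpretable, but, as the table notes, its coverage is limited to whatever the rule authors anticipated.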
In contrast, data-driven reward modeling did not initially focus on preference learning. Early methods primarily employed conventional classification or regression frameworks, where automated evaluators were used to assign quality scores to generated outputs. These scores served as training signals for learning reward functions. For instance, Chang et al. [49] introduced BLEUBERI, a method that leveraged BLEU as a simple yet effective reward signal for aligning models to follow instructions. Similarly, Xue et al. [50] proposed RLFC, which computed reward values based on the factual consistency between generated responses and gold-standard knowledge, as assessed by natural language inference (NLI) models or QA-style matching systems. These signals guided models toward producing more factually accurate outputs through RL. As model capabilities improved and large-scale human feedback became more readily available, the field gradually shifted toward preference learning, which is based on human-annotated rankings over multiple outputs, allowing models to learn implicit reward functions that better reflected nuanced user preferences. This approach became central to RLHF pipelines such as InstructGPT [14]. Despite its expressive power, preference learning proved highly sensitive to annotation quality and was often plagued by inconsistencies and subjective disagreements in human judgments. Moreover, collecting high-quality, fine-grained preference data at scale posed significant cost and feasibility challenges. To address these limitations, recent research has explored imitation-based alternatives, most notably inverse reinforcement learning (IRL). IRL aims to infer latent reward functions from expert demonstrations, thereby avoiding the need for explicit scoring or preference annotation. This reoriented the objective of reward modeling from selecting the best output to understanding the underlying rationale behind expert behavior.
For instance, Sun and van der Schaar [51] introduced a framework based on trajectory distribution alignment to recover implicit reward structures. Cai et al. [52] further advanced this line of work by proposing a variational Bayesian IRL objective to improve generalization across diverse task settings. Building upon these ideas, Cheng et al. [53] incorporated category-specific reward scaling into the IRL framework, enabling dynamic modulation of reward signals in response to contextual safety demands, an especially pertinent development for high-risk applications. This progression reflects a fundamental shift in the objectives of reward modeling: from enforcing fixed behavioral rules to uncovering the latent structures of human values. The transition from rule specification to preference learning, and ultimately to behaviorally grounded reward inference, reveals a deepening understanding of alignment not merely as behavioral conformity, but as the construction of adaptable, value-sensitive feedback systems capable of responding to diverse and evolving alignment contexts.

4.2 Hybrid reward modeling

Human preferences are highly diverse and context-dependent, varying significantly across tasks, demographics, cultures, and individual users. At the same time, different tasks impose distinct criteria for what constitutes high-quality output, and real-world applications involve a wide range of objectives. The combination of preference diversity, task variability, and objective heterogeneity makes it extremely challenging to design reward functions that can fully capture such nuanced expectations. Consequently, static, universal RMs are often impractical for aligning model behavior with the complex and variable nature of human values. Furthermore, hand-engineered metrics can conflict with each other. For example, generating shorter, more precise descriptions may improve BLEU [54] scores while reducing ROUGE [55] scores due to lower recall.
These trade-offs are often task-specific and user-dependent, revealing the limitations and fragility of static, one-size-fits-all reward designs. To address the limitations of RMs in terms of adaptability and expressiveness, researchers have proposed hybrid RMs. By devising targeted fusion strategies, hybrid RMs flexibly integrate complementary supervision signals across different granularities, sources, and modalities, enabling more efficient preference modeling and enhanced representational capacity. In complex tasks, multi-objective evaluation scenarios, or environments with noisy feedback, hybrid reward mechanisms have demonstrated superior robustness and generalization capabilities.

4.2.1 Sources for hybrid reward modeling

Human preferences are subjective, diverse, and context-dependent. Hybrid RMs allow for the integration of multiple supervision signals from different sources (e.g., human feedback, expert rules, user behavior, or model predictions), enabling better alignment with actual human preferences. Peng et al. [56] proposed agentic reward modeling, a reward system that combined RMs with verifiable correctness signals from multiple aspects to provide reliable supervision. They implemented a reward agent, RewardAgent, which combined human preference rewards with two verifiable signals, factuality and instruction following, to enhance the reliability of the reward signals. In complex tasks or environments with noisy feedback, single reward signals may lead to overfitting to noise or biases. Multi-source rewards introduce redundancy by incorporating signals from diverse sources, which enhances the model's robustness to anomalies and its generalization across varied inputs. Wang and Xiong [57] proposed AutoRule, a fully automated framework that extracted symbolic rules from preference feedback.
These rules were used to compute an auxiliary reward signal that complemented the learned RM, resulting in improved RLHF performance and reduced reward hacking. A single reward signal often covers only one aspect of a task, whereas hybrid rewards can combine signals across multiple objective dimensions to more comprehensively define what constitutes a "good" output. For example, in dialogue generation, the model needs to produce responses that are both fluent and factually accurate. In recommendation systems, it should match user interests while also maintaining diversity and novelty. Zhang et al. [58] introduced the Directional Preference Alignment (DPA) framework to overcome the limitations of scalar-reward RLHF in representing diverse user preferences. DPA incorporated multi-objective reward modeling and modeled user preferences as unit vectors in reward space, enabling user-dependent control over generation behavior. They fine-tuned LLMs using a preference-conditioned variant of Rejection Sampling Finetuning (RSF), which achieved improved performance trade-offs across various reward objectives. Zhang et al. [59] proposed MOSLIM, which employed a multi-head RM to classify question-answer pairs and mapped these classifications into scalar rewards to handle diverse objectives, enabling flexible control via prompting without requiring preference-specific training during the SFT phase. In real-world applications, evaluation metrics often exhibit trade-offs. For instance, increasing precision may reduce recall, generating concise outputs can compromise informational richness, and prioritizing safety may constrain creativity. Hybrid reward mechanisms offer a principled approach to balancing such competing objectives, enabling multi-objective optimization without sacrificing one goal for another. Considering that RL-based fine-tuning was unstable and resource-heavy, especially under diverse and conflicting objectives, Zhou et al.
[60] proposed Multi-Objective Direct Preference Optimization (MODPO) as an RL-free alternative to DPO for multi-objective alignment. MODPO folded language modeling directly into reward modeling, enabling language models to act as implicit collective RMs. Some tasks involve inputs from multiple modalities, such as text, images, audio, or video, or require optimization across both local and global levels. Hybrid reward mechanisms can combine signals across different modalities, granularities, and stages, making them suitable for more complex and realistic application scenarios. Sun et al. [61] proposed Factually Augmented RLHF, an approach that augmented the RM with additional factual information such as image captions and ground-truth multiple-choice options to mitigate the reward hacking problem. Multi-granularity hybrid rewards focus on integrating alignment signals across varying levels of granularity, from fine-grained token-level correctness to high-level discourse structures. This paradigm is crucial for aligning LLMs on tasks where coherence, style, and factuality must be maintained across local and global scopes. Liu et al. [62] proposed HAF-RM, a hybrid alignment framework for RM training that introduces an additional constraint on token-level policy probabilities alongside the conventional reward score. This framework enabled simultaneous supervision of the internal preference model at the token level while optimizing the reward mapping layer at the sequence level. Yu et al. [63] argued that process supervision relied on learned reward models requiring costly data and suffered reward misalignment, while outcome supervision failed on complex multi-step tasks. They proposed Outcome Refining Process Supervision, which unified process and outcome supervision via executable verification and tree search to refine reward signals.
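The multi-source pattern described in this subsection, combining a learned preference score with verifiable correctness checks, can be sketched as follows (an illustration in the spirit of agentic reward modeling, not the paper's implementation; the weights and signal names are hypothetical):

```python
def hybrid_reward(pref_score, fact_ok, instr_ok,
                  w_pref=1.0, w_fact=0.5, w_instr=0.5):
    """Combine a learned preference score with two verifiable
    binary signals (factuality, instruction following). The
    verifiable checks add redundancy that limits reward hacking
    on any single signal; weights are illustrative."""
    return (w_pref * pref_score
            + w_fact * (1.0 if fact_ok else -1.0)
            + w_instr * (1.0 if instr_ok else -1.0))

# A factually wrong response loses reward even if the learned
# preference model scores it highly.
honest = hybrid_reward(0.8, fact_ok=True, instr_ok=True)
hacked = hybrid_reward(0.8, fact_ok=False, instr_ok=True)
```

In a real pipeline the binary checks would come from external verifiers (e.g., an NLI model or a constraint checker) rather than being supplied directly.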
4.2.2 Strategies for hybrid reward modeling

Fusion mechanisms played a central role in hybrid reward modeling, determining how diverse reward signals were integrated into a unified optimization objective. These mechanisms were not merely technical components but strategic decisions that directly impacted a model's ability to synthesize multidimensional feedback for achieving stable and trustworthy alignment. One of the most commonly adopted approaches was weighted averaging, where reward signals were either combined using static weights or dynamically adjusted based on confidence, task context, or domain-specific attributes. While static weighting offered simplicity, it lacked the flexibility to handle conflicting or evolving alignment objectives. To address this, Liu et al. [64] proposed Adaptive Multi-objective Preference Optimization (AMoPO), which treated dimension-aware generation metrics as implicit rewards within a multi-objective optimization paradigm. They further introduced an adaptive weight assignment mechanism, where the generation space was modeled as a Gaussian distribution to enable dynamic prioritization across different preference dimensions. In addition to averaging, researchers also explored the temporal structure of reward fusion. Fusion often occurred sequentially, with rule-based signals used for initial filtering or scaffolding, followed by refinement through learned or adaptive RMs. This reflected a coarse-to-fine alignment strategy that became increasingly prominent in LLM alignment research. For example, Bai et al. [65] implemented a hybrid reward modeling framework by applying rule-based constitutional principles to remove unsafe responses before collecting AI feedback. The resulting RM, trained on revised and critiqued outputs, combined rule-based safety enforcement with preference alignment, improving harmlessness without relying solely on human labels.
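The weighted-averaging strategy above, in both its static and confidence-weighted variants, can be sketched in a few lines (an illustrative example, not any cited method; the confidence values are assumed to be supplied externally):

```python
def fused_reward(rewards, confidences):
    """Confidence-weighted averaging of reward signals: each
    signal's weight is proportional to its externally supplied
    confidence, so unreliable signals are down-weighted."""
    total = sum(confidences)
    if total == 0:
        raise ValueError("at least one signal needs confidence > 0")
    return sum(r * c for r, c in zip(rewards, confidences)) / total

# Static weighting is the special case of equal confidences;
# dynamic weighting shifts the fused value toward trusted signals.
static = fused_reward([1.0, 0.0], [1.0, 1.0])
dynamic = fused_reward([1.0, 0.0], [3.0, 1.0])
```

This makes the trade-off in the text concrete: static weights are simple but fixed, while adjusting the confidences per input gives the flexibility the adaptive methods pursue.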
Other studies introduced modular or hierarchical fusion strategies that enabled dynamic reward selection. These approaches employed separate discriminator or controller modules to choose appropriate reward signals based on task requirements or contextual information. Lai et al. [66] proposed ALaRM, a hierarchical reward modeling framework for RLHF. It decomposed alignment into holistic and aspect-specific sub-objectives. They introduced a consistency-based aggregation strategy to filter and combine multiple feedback signals. Their two-stage training process prioritized holistic rewards and incorporated aspect-specific feedback only when necessary to guide models toward better alignment. Later work explored neural fusion mechanisms such as attention-based and gating-based approaches. These methods adaptively routed reward signals according to the input context and supported more fine-grained fusion across tasks or reward types. Qiu et al. [67] proposed a sentence-level RM, which segmented responses into individual sentences and computed rewards based on the difference between the outputs at each sentence's start and end positions. To produce a final response-level reward, they introduced an attention mechanism that aggregated the sentence-level rewards. Similarly, Wang et al. [68] presented a two-stage interpretable reward modeling framework. They first trained a multi-aspect ArmoRM on human-aligned dimensions such as honesty, verbosity, and safety. Then they employed a Mixture-of-Experts architecture with a gating mechanism to automatically select the most suitable reward dimension based on input context, which significantly improved model stability and alignment fidelity. Dubois et al. [69] proposed Rewarded Soup, a multi-policy strategy that trained separate models on different proxy rewards and linearly interpolated their weights to achieve Pareto-optimal generalization across the space of human preferences.
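The gating-based fusion just described reduces, at its core, to a softmax over gate logits that routes weight to the most relevant reward dimensions. A minimal sketch (not the ArmoRM implementation; here the gate logits are supplied directly, whereas in the cited work they are produced from the input context):

```python
import math

def gated_reward(head_scores, gate_logits):
    """Gating-based fusion: a softmax over gate logits weights
    the per-dimension reward heads, so the fused reward tracks
    whichever dimensions the gate deems relevant."""
    m = max(gate_logits)  # subtract max for numerical stability
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    gates = [e / z for e in exps]
    return sum(g * s for g, s in zip(gates, head_scores))

# With one dominant gate logit, the fused reward follows that head;
# with uniform logits, it falls back to a plain average.
routed = gated_reward([0.9, 0.1, 0.5], [10.0, -10.0, -10.0])
uniform = gated_reward([1.0, 0.0], [0.0, 0.0])
```

Attention-based aggregation (as in the sentence-level RM above) follows the same pattern, with the softmax taken over segment scores instead of reward heads.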
In summary, the design of fusion mechanisms evolved from static combinations to more adaptive, modular, and context-sensitive architectures, highlighting an ongoing effort to balance reliability, generalizability, and controllability in the alignment of LLMs.

5 Reward model: functional roles under different optimization paradigms

This section examines the distinct functional roles of reward models (RMs) in different LLM alignment paradigms, categorizing them into explicit and implicit forms. In RLHF, explicit RMs are trained on preference triplets to score responses and guide policy optimization via RL algorithms such as PPO, making them central to the optimization process. In In-Context Learning (ICL), explicit RMs function as external evaluators, assigning utility scores to candidate prompts or outputs for selection and filtering without updating model parameters, thus serving solely as decision aids. In Direct Preference Optimization (DPO) [70], implicit RMs are embedded within the supervised optimization objective, integrating the reward signal directly into parameter updates without separate reward model training. Figure 3 illustrates this distinction, and Table 2 provides a comparative overview of how RMs are employed across these optimization paradigms.
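The implicit reward in DPO takes the form $r(x, y) = \beta \log\big(\pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)\big)$ up to a prompt-only constant, which cancels in the pairwise objective. A minimal sketch of the resulting per-pair loss, assuming summed token log-probabilities are given as toy scalars:

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward (up to a prompt-only constant):
    beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from
    summed token log-probabilities."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO objective for one preference pair: a Bradley-Terry
    loss on the implicit-reward margin between the preferred
    and less-preferred responses (log-probs are toy values)."""
    margin = (implicit_reward(logp_w, logp_ref_w, beta)
              - implicit_reward(logp_l, logp_ref_l, beta))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid

# Shifting policy mass toward the preferred response lowers the loss.
improved = dpo_loss(-5.0, -10.0, -20.0, -10.0)
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
```

This is why no separate reward model is trained in DPO: the policy's own log-probability ratios play the role the explicit RM plays in RLHF.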
[Figure 3: Functional roles of reward models across optimization paradigms. (a) Explicit RMs in the PPO-based pipeline: supervised fine-tuning of a base LLM, reward-model training from human or AI feedback (discriminative RM from human feedback in RLHF; generative RM from AI feedback in RLAIF), then RL optimization of the LM policy. (b) Implicit RMs in the DPO-based pipeline, where preference pairs are optimized directly by maximum likelihood without a separate reward model.]

[Figure: Trends in reward design for LLM alignment (2022-2024). Construction basis: from rule-based to learned RMs; format: from numerical to non-numerical rewards; expression: from explicit to implicit RMs; granularity: from general to fine-grained RMs (multi-objective, multi-stage, hierarchical, etc.).]

[Figure: Taxonomy of reward design in LLM alignment, illustrated via a doctor-patient treatment analogy. It spans feedback (human/AI; binary/preference; pairwise/listwise), reward design (rule-based/data-driven; numerical/non-numerical; explicit/implicit; general/fine-grained), optimization (RL-based/RL-free), application (multi-task and multi-modal scenes), and challenges (reward hacking, mode collapse, training instability, scarcity and high cost of human feedback, subjectivity and instability of human feedback).]
In-context learning Evaluation Curation Optimization Response FeedbackFeedback Human annotator Large language model Large language model Query Response Feedback Human annotator Large language model Query (c) DPO Reward model z Response Feedback Human annotator Large language model Query (c) DPO Reward model z Training objective Large language model Large language model Response Reinforecement learning Human annotator Optimization Large language model Large language model Reward model ResponseResponse Feedback Query (a) Reinforcement learning Large language model Response Reinforecement learning Human annotator Optimization Large language model Reward model Response Feedback Query (a) Reinforcement learning Input Output Feedback Reward design Optimization Rule-based RM/ Data-driven RM/ Hybrid RM Explicit RM/ Implicit RM RL-based optimization Pairwise feedback/ Listwise feedback Binary feedback/ Preference feedback SL-based optimization Ethics & Values Large languge model Doctor LLM alignment v Human feedback/ AI feedback Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Prescription Clinical targets Medical test Virus Threats ICL-based optimization Numerical RM /Non-numerical RM TreatmentTreatment DignosisDignosis Input Output Feedback Reward design Optimization Ethics & Values Large languge model Doctor LLM alignment v Bias & Fairness Safety & Harmfulness Factuality & Reliability Controllability & Interpretability Privacy & Security Harmless Honest Helpful Patient Clinical targets Medical test Symptoms TreatmentTreatment DignosisDignosis PrescriptionPrescription Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability Therapeutic dilemmas Reward hacking Mode collapse Subjectivity and instability of human feedback Training instability How does the reward model function? 
Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function in RL, ICL, and SL? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical reward modeling Nonnumerical reward modeling Rule-based reward modeling Data-driven reward modeling Hybrid reward modeling Explicit reward modeling Implicit reward modeling Standard SL-based LLM alignment DPO and its variants RL-based LLM alignment ICL-based LLM alignment How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Preference learning Classification / regression models Inverse reinforcement learning E.g., DPO (Rafailov et al., 2023), TDPO (Zeng et al., 2023), Step-DPO (LAi et al., 2024) E.g., Supervised Fine-tuning (Ouyang et al., 2022), Instruction Tuning (Wei et al., 2021), Constitutional AI (Bai et al., 2022a) E.g., Chain-of-Thought Feedback (Wei et al., 2022), Meta-RM for ICL (Liu et al., 2024), Rewarded In-Context Learning (Chen et al., 2024) E.g., InstructGPT (Ouyang et al., 2022), InstructGPT(), KTO (Yang et al., 2023) E.g., Inverse RL Alignment (Sun & van der Schaar, 2024), Variational IRL for LLM (Cai et al., 2024), Dynamic Reward Scaling IRL (Cheng et al., 2025) E.g., RLAIF (Bai et al., 2022b), Direct-RLAIF (Lee et al., 2024a), UltraFeedback (Cui et al., 2024) E.g., Reward Regression (Christiano et al., 2017), Feedback Regression (Ziegler et al., 2019), Score-based Reward Models (Kadavath et al., 2022) E.g., TLRM (Wu et al., 2023), LiPO (Lu et al., 2024), Variational RM (Cai et al., 2024) E.g., Contrastive Reward Modeling (Yuan et al., 2023), Ordinal Reward Modeling (Zheng et al., 2024), Ranking-based RM (Kim et al., 2024) E.g., Prompt-based RM (Ouyang et al., 
2022), Heuristic RM (Stiennon et al., 2020), Symbolic Reward Templates (Zhang et al., 2023) E.g., Adaptive Hybrid RM (Kim et al., 2024), Gating-based Hybrid RM (Wang et al., 2024), Rule-guided Preference Learning (Chen et al., 2024) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + token-level)Multi-granularity RM (e.g., response-level + token-level) Single-value reward modeling / multi-value reward modelingSingle-value reward modeling / multi-value reward modeling Response-level reward modeling / token-level reward modelingResponse-level reward modeling / token-level reward modeling Pairwise preference modeling / listwise preference modelingPairwise preference modeling / listwise preference modeling Pointwise reward modeling / preferencewise modelingPointwise reward modeling / preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial imitation learning) Supervised learning (e.g., classification / regression, preference learning) Supervised learning (e.g., classification / regression, preference learning) How is the reward model constructed 
in practice? Reward model: methodological modeling and construction (Section 4 ) How does the reward model function? Reward model: functional roles under different optimization paradigms (Section 5 ) Numerical RM Nonnumerical RM Rule-based RM Data-driven RM Hybrid RM Explicit RM Implicit RM How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Reward model for LLM alignment Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-stage RM (e.g., training-time supervision + inference-time post-hoc scoring) Multi-modal RM (e.g., text + images + audio) Multi-modal RM (e.g., text + images + audio) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-aspect RM (e.g., fluency + factuality + safety) Multi-source RM (e.g., rule-based + data-driven)Multi-source RM (e.g., rule-based + data-driven) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Multi-granularity RM (e.g., response-level + sentence-level + token- level) Single-value reward modeling and multi-value reward modelingSingle-value reward modeling and multi-value reward modeling Response-level reward modeling and token-level reward modelingResponse-level reward modeling and token-level reward modeling Pairwise preference modeling and listwise preference modelingPairwise preference modeling and listwise preference modeling Pointwise reward modeling and preferencewise modelingPointwise reward modeling and preferencewise modeling DPO and its variantsDPO and its variants ICL-based LLM alignment ICL-based LLM alignment RL-based LLM alignment RL-based LLM alignment Imitation learning (e.g., behavior cloning, inverse reinforcement 
learning, generative adversarial Imitation Learning Imitation learning (e.g., behavior cloning, inverse reinforcement learning, generative adversarial Imitation Learning Supervised learning (e.g., classification, regression, preference learning) Supervised learning (e.g., classification, regression, preference learning) Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Toxicity filters, Wikidata queries, length constraints, domain- specific symbolic rules, programmatic scoring scripts Non-numerical RMNon-numerical RM How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model constructed in practice? Reward model: methodological modeling and construction (Section 4 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) How is the reward model mathematically formulated? Reward model: mathematical formulation (Section 3 ) Reward model for LLM alignment Reward model for LLM alignment Rule-based RM Rule-based RM Fig. 3Explicit reward modeling and implicit reward modeling represent two distinct approaches for optimizing LLM alignment, differing in their learning paradigms and implementation frameworks. Table 2A comparison of LLM alignment methods from the perspective of optimization strategies. Method Input / Output Uses RM? Reward Category Updates Param- eters? Note SL (SFT) Input: queryx Output: human-written labely ∗ No–Yes Treatsy ∗ as optimal; model learns to imitate human reference answers. SL (DPO) Input:x, pair of responses (y + ,y − ) Output: preference directiony + ≻y − YesImplicit RMYes Trains model to prefer y + overy − . RL Input:x Output:y∼P θ (y|x), scored by RMR(x,y) YesExplicit RMYes Reward signal is used to train the model via policy optimization. 
| ICL | Input: demonstration set C = {(x_i, y_i)}_{i=1}^{k} and query x. Output: y ∼ P_θ(y∣C, x) | Optional | Explicit RM (if used) | No | The reward signal guides the selection of demonstrations and evaluates or re-ranks candidate outputs. |

5.1 Explicit reward modeling

5.1.1 RL-based LLM alignment frameworks

RLHF typically involves three phases: (i) SFT of a pre-trained model, (ii) RM training, and (iii) RL-based fine-tuning [71]. The RL fine-tuning phase then leverages the feedback from the trained RM to optimize the pre-trained SFT model. The objective is to fine-tune the policy to maximize the expected reward under the learned RM, while also incorporating a KL regularization term to ensure the updated policy remains close to the reference model. The overall objective can be expressed as:

$$\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(y|x)}\big[r_\phi(x, y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x)\big] \tag{8}$$

In this formulation:
• π_θ represents the policy of the language model, parameterized by θ.
• E_{x∼D, y∼π_θ(y|x)}[r_φ(x, y)] denotes the expected reward, where r_φ(x, y) is the RM that quantifies the alignment of the response y with human preferences given the prompt x.
• π_ref denotes the reference model, which serves as a baseline or starting point for the optimization.
• D_KL[π_θ(y|x) ∥ π_ref(y|x)] is the Kullback–Leibler (KL) divergence between the current policy π_θ and the reference model π_ref.
• β is a hyperparameter that balances reward maximization against the KL divergence penalty.

While RLHF has achieved promising results in aligning LLMs with human preferences, PPO, its most commonly used optimization algorithm, remains limited by instability and sample inefficiency, particularly when compared with SFT methods [72]. It is sensitive to hyperparameter configurations [73], susceptible to fluctuations in reward signals, and often fails to maintain consistent output characteristics such as response length. Furthermore, the reliance on an actor-critic architecture with a learnable value function increases the computational cost of training [74].
These limitations have hindered reproducibility, especially in open-source settings with constrained resources. In response, recent studies have proposed enhancements to PPO or explored alternative RL algorithms better suited for reward-driven fine-tuning.

• Improving PPO: Zheng et al. [75] clustered training samples into easy and hard categories using self-supervised similarity, and adjusted the KL regularization strength accordingly. This stratified reward shaping strategy enabled more stable and targeted optimization. CPPO [76] incorporated a reward-informed knowledge retention term into the PPO objective, which adjusted policy updates according to response novelty and prior alignment. This approach helped preserve desirable behaviors while encouraging adaptation to new reward signals. Wu et al. [77] proposed P3O to stabilize training by incorporating pairwise reward differences directly into the policy gradient computation, effectively reducing gradient variance arising from biased value estimation. Santacroce et al. [78] addressed memory constraints in large-scale RLHF by sharing a common LLM backbone across reward, policy, value, and reference models, updating only lightweight adapters. This parameter-sharing strategy maintained RM fidelity while significantly improving scalability. Shen et al. [79] introduced contrastive rewards, an enhanced reward formulation that improves RLHF robustness by dynamically adjusting rewards based on sampled baselines and incorporating a penalty term for uncertain signals. This approach automatically adapted to task difficulty while mitigating noise in human feedback, leading to more stable policy optimization and better alignment performance compared to standard RMs. Rafailov et al. [80] proposed PPO-Max, which enhances PPO's stability in RLHF by explicitly maximizing reward signals while maintaining training balance. The method demonstrated superior performance with sparse or noisy feedback.
• Substituting PPO: Rather than refining PPO, Li et al. [81] simplified the RLHF pipeline for auto-regressive language generation by removing non-essential components of PPO and reverting to a lightweight REINFORCE-style algorithm. They introduced ReMax, which leveraged entropy-regularized max-reward sampling to enhance training stability and sample efficiency, and outperformed PPO particularly in low-resource and off-policy regimes. Similarly, REINFORCE-based methods alleviated the computational burden of critic networks by eliminating them entirely. Ahmadian et al. [82] proposed BackRLHF, which employed REINFORCE Leave-One-Out (RLOO) in conjunction with an auxiliary inverse dynamics model to infer user intent, thereby improving learning efficiency in resource-constrained settings through a lightweight and modular architecture. Shao et al. [83] designed Group Relative Policy Optimization (GRPO) to replace absolute reward signals with group-based relative feedback, which produced stronger gradient signals and improved stability in critic-free training. Hu et al. [84] proposed REINFORCE++, a critic-free RLHF algorithm that improved advantage estimation by using batch-normalized rewards as baselines. This design mitigated overfitting and reward hacking, and achieved robust generalization across RMs and chain-of-thought settings.

5.2 ICL-based LLM alignment framework

ICL is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstrations [85]. Formally, given a query input text x and a set of candidate answers Y = {y_1, ..., y_m}, a pretrained language model takes the candidate answer with the maximum score as the prediction, conditioned on a demonstration set C. The set C contains an optional task instruction I and k demonstration examples, thus:

$$C = \{I, s(x_1, y_1), \ldots, s(x_k, y_k)\} \quad \text{or} \quad C = \{s'(x_1, y_1, I), \ldots, s'(x_k, y_k, I)\}, \tag{9}$$

where s'(x_i, y_i, I) is an example written in natural language according to the task.
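The demonstration-set construction in Equation 9 amounts to concatenating an optional instruction with formatted examples. A minimal sketch, where the Q/A template standing in for s(x, y) is a hypothetical choice:

```python
def build_context(demos, instruction=None):
    """Assemble the demonstration set C of Equation 9 as plain text.

    `demos` is a list of (x_i, y_i) pairs; `instruction` is the optional
    task instruction I. The Q/A template below plays the role of s(x, y).
    """
    parts = [instruction] if instruction else []
    for x_i, y_i in demos:
        parts.append(f"Q: {x_i}\nA: {y_i}")  # s(x_i, y_i)
    return "\n\n".join(parts)

context = build_context(
    demos=[("2+2?", "4"), ("3+5?", "8")],
    instruction="Answer the arithmetic question.",
)
```

A query x is then appended to the assembled context before decoding; the reward-guided variants discussed in Section 5.2.1 score alternative `demos` subsets or orderings and keep the highest-scoring one.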
$$\hat{y} = \arg\max_{y_j \in Y} R(y_j, C, x) \tag{10}$$

where ŷ denotes the final predicted label, i.e., the candidate answer with the highest probability, and R(y_j, C, x) is a scoring function that assesses the quality or relevance of each candidate y_j given the context C and the input query x. In addition to serving as a decision assessor for evaluating or re-ranking candidate outputs to improve alignment without parameter updates, RMs can also function as a utility function to guide prompt construction.

5.2.1 Explicit RMs for prompt construction in ICL

In ICL, prompt construction, including the selection, formatting, and ordering of demonstration examples, is a critical determinant of downstream performance. While earlier approaches to prompt design typically depended on human heuristics or embedding-based similarity, the integration of RMs introduces a data-driven paradigm, enabling optimization of prompt inputs based on feedback signals aligned with task-specific objectives. In this context, the RM acts as a utility estimator, transforming prompt construction from a static template-design problem into a dynamic selection and synthesis process. A prominent application of RMs lies in demonstration selection. For example, Ye et al. [86] leveraged Determinantal Point Processes (DPPs) as a reward proxy to select diverse and representative in-context examples. Du and Zhao [87] introduced an RL-driven context selection framework that treats prompt selection as a sequential decision-making task supervised by downstream reward signals. Suo and Lai [88] used segmentation accuracy as feedback to guide stepwise demonstration selection, achieving better context-efficiency. Beyond selection, RMs also aid in formatting and ordering demonstrations. Zhang et al. [89] proposed reward-weighted ordering strategies in visual ICL by optimizing token-level segmentation accuracy. Do et al.
[90] introduced a generator-discriminator setup (adv-ICL) in which the discriminator serves as an RM evaluating the correctness of model outputs elicited by generated prompts. Zhou et al. [91] extended this idea in Automatic Prompt Engineer (APE), where LLM-generated prompts are ranked using a task-specific RM to identify the most effective formulation. Qian et al. [92] explored submodular optimization for annotation-efficient prompt construction, treating reward as a function of informativeness and diversity. In all these methods, RMs provide a way to systematically search for or synthesize high-performing prompts, converting the input space into an optimizable reward landscape.

5.2.2 Explicit RMs for output evaluation in ICL

Beyond guiding input construction, RMs play an increasingly important role in evaluating and aligning model outputs in ICL. Since ICL does not involve model parameter updates, post-generation supervision via RMs becomes essential for identifying the most reliable and aligned completions. In this phase, RMs act as task-specific critics that evaluate generated outputs for correctness, helpfulness, consistency, or other alignment objectives. A fundamental use of RMs is in output re-ranking. Li et al. [93] introduced DORA, which used an RM to dynamically select completions that exhibit logical consistency and reasoning accuracy, even correcting high-confidence incorrect predictions. These examples illustrate how RMs can compensate for the lack of gradient-based learning in ICL by acting as post-hoc validators. RMs also serve as evaluation tools in benchmark construction. Chen et al. [94] developed ICLEval, a benchmark that used RM-like metrics to evaluate ICL abilities in rule induction, copying, and memorization. Yu et al. [95] proposed a two-dimensional evaluation framework considering both performance via reward and configuration cost, offering a more holistic assessment of ICL alignment. Honovich et al.
[96] used reward-driven evaluations to assess automatically generated task instructions, showing that RMs can serve as substitutes for human evaluation in instruction tuning. Recent works have further extended RMs to multi-pass reasoning and self-refinement. Yuan et al. [97] used a reward heuristic to select the best answer among multiple iterative completions in Self-Refine, while Lightman et al. [98] introduced stepwise rewards for intermediate reasoning steps to guide chain-of-thought alignment. Taken together, these studies illustrate the critical role that reward modeling plays in enhancing the robustness, transparency, and task alignment of outputs generated via ICL. By functioning as post-hoc evaluators, RMs provide a principled mechanism for output selection and validation, enabling improved performance without requiring parameter updates, thus offering a flexible and scalable alternative to traditional fully supervised fine-tuning approaches.

5.3 Implicit reward modeling

Explicit reward modeling in LLMs presents several inherent limitations, such as reward misspecification, high annotation costs, and optimization brittleness. As a result, an increasing number of researchers are exploring reward-free paradigms or alternative formulations that substitute or transform RMs into more efficient alignment objectives. To reduce the complexity and cost of alignment, some researchers have focused on the initial phase of RLHF, namely SFT, and proposed a range of sophisticated SFT variations aiming to achieve comparable performance to RLHF [99–101].
Omitting x for brevity, a general form of SFT alignment can be expressed as:

$$\arg\min_\theta\; -\mathbb{E}_{p(y_w, y_l)}\big[\log \pi_\theta(y_w) - \log \pi_\theta(y_l)\big] \;\propto\; \mathrm{KL}\big[p(y_w)\,\|\,\pi_\theta(y_w)\big] - \mathrm{KL}\big[p(y_l)\,\|\,\pi_\theta(y_l)\big], \tag{11}$$

where y_w and y_l denote the preferred (winning) and dispreferred (losing) responses respectively, sampled from the empirical joint preference distribution p(y_w, y_l), and π_θ(y) represents the model's policy parameterized by θ, which assigns a probability to each response y. As a reward-free paradigm, it does not involve explicit reward modeling but directly learns to imitate preferred behaviors while unlearning dispreferred ones [102]. Hence, further elaboration is omitted. It is worth noting that while reward design offers solutions to many alignment challenges, it is not indispensable for achieving effective alignment, with reward-free methods increasingly emerging as a key research trend in LLM alignment.

Recent advances such as DPO [70] and implicit reward modeling illustrate the trend toward integrating preference learning more tightly into the optimization objective, eliminating the need for explicit reward computation while still guiding model behavior effectively. DPO offers a streamlined alternative to RLHF by eliminating the need for explicit reward modeling. Instead of learning a separate RM, DPO implicitly encodes human preferences directly within its objective function. Concretely, DPO uses the log-ratio of policy likelihoods, comparing preferred and dispreferred responses relative to a reference policy, as a proxy for reward, thereby reparameterizing the optimization target in terms of preference. This formulation not only simplifies the alignment pipeline but also improves training stability and efficiency.
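The log-ratio proxy reward just described can be made concrete. A minimal sketch from summed log-probabilities, where β and all input values are hypothetical:

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward, up to the additive beta * log Z(x) constant:
    beta * log( pi_theta(y|x) / pi_ref(y|x) ), computed from log-probs."""
    return beta * (logp_policy - logp_ref)

# The policy has raised the likelihood of this response relative to the
# reference model, so its implicit reward is positive.
r = implicit_reward(logp_policy=-10.0, logp_ref=-12.0)
```

Because only reward differences enter the preference probability, the dropped β log Z(x) term never needs to be computed, which is what makes this proxy practical.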
Deriving from the KL-constrained reward maximization objective in Equation 8, DPO reformulates the optimization target as follows [103]:

$$\pi_r(y|x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y|x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big) \tag{12}$$

In this case, $Z(x) = \sum_y \pi_{\mathrm{ref}}(y|x)\, \exp\big(\frac{1}{\beta} r(x, y)\big)$ acts as the partition function that normalizes the policy distribution π_r(y|x). It is computed by summing the exponentiated reward scores, weighted by the reference model's distribution, across every possible response y. Such summation guarantees that π_r(y|x) is a valid probability distribution; however, enumerating all candidate responses for each input x is impractical, requiring substantial computational resources when the response space is large. In practice, the DPO algorithm removes the need for explicit reward-model training and for sampling from the language-model policy. By optimizing preferences directly, without the intermediate step of training an RM, DPO simplifies the training loop and reduces computational burden, offering a more efficient approach to language model alignment. More precisely, the objective in Equation 12 can be rearranged to isolate r(x, y), yielding:

$$r(x, y) = \beta \log \frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x) \tag{13}$$

This reparameterization expresses the reward in terms of the optimized policy π_r, the reference policy π_ref, and the normalization constant Z(x). Substituting into the original reward function r* gives:

$$r^*(x, y) = \beta \log \frac{\pi^*_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x) \tag{14}$$

Subsequently, the original loss of r* is substituted with this reparameterization from Equation 14. Since the Bradley–Terry preference model depends only on reward differences between two completions, the human preference probability becomes:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\!\Big(\beta \log \frac{\pi^*(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)} - \beta \log \frac{\pi^*(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)}\Big)}. \tag{15}$$

This simplifies via the sigmoid function as p*(y_1 ≻ y_2 | x) = σ(r*(x, y_1) − r*(x, y_2)), since Z(x) cancels out in the subtraction, resulting in a direct reward difference.
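The cancellation of Z(x) can be checked numerically. A small sketch comparing Equation 15 with the sigmoid of the implicit reward difference, using hypothetical summed log-probabilities:

```python
import math

beta = 0.1
# Hypothetical summed log-probabilities of two responses under the
# optimal policy and the reference model.
logp_pi = {"y1": -3.2, "y2": -4.1}
logp_ref = {"y1": -3.5, "y2": -3.9}

def log_ratio(y):
    # beta * log( pi*(y|x) / pi_ref(y|x) ): the implicit reward minus the
    # shared beta * log Z(x) term, which cancels in the difference below.
    return beta * (logp_pi[y] - logp_ref[y])

# Equation 15, written directly in terms of the log-ratios.
p_direct = 1.0 / (1.0 + math.exp(log_ratio("y2") - log_ratio("y1")))

# Sigmoid form: p*(y1 > y2 | x) = sigma( r*(x, y1) - r*(x, y2) ).
p_sigmoid = 1.0 / (1.0 + math.exp(-(log_ratio("y1") - log_ratio("y2"))))

assert abs(p_direct - p_sigmoid) < 1e-12
```

The two expressions are algebraically identical, which is precisely why the intractable partition function never needs to be evaluated.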
After this simplification, a maximum likelihood objective for DPO can be formulated. The resulting policy loss for DPO is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\Big)\Big]. \tag{16}$$

In this formulation:
• π_θ denotes the trainable policy of the target language model,
• π_ref is the fixed reference policy,
• y_w and y_l refer to the preferred and dispreferred responses, respectively.

This design enables DPO to model preferences directly through policy probabilities, without relying on explicit scalar rewards. As a result, alignment effectiveness is preserved while the training process is streamlined. Building upon this foundation, subsequent works have sought to refine DPO's objective function to strengthen the alignment signal. These improvements aim to enhance the granularity and expressiveness of the implicit reward, promote generalization, mitigate undesired behaviors such as reward hacking, and better reflect the structure of human preferences. α-DPO [104] employed an adaptive preference distribution to balance the policy model and the reference model, thereby achieving personalized reward margins. Experimental results demonstrated its effectiveness as a surrogate optimization objective and its capability to balance alignment and generation diversity through KL divergence control. β-DPO [105] dynamically adjusted the trade-off parameter β at the batch level based on data informativeness, and employed β-guided data filtering to mitigate the influence of noisy or outlier preference pairs. T-DPO [106] incorporated forward KL divergence constraints for each token, improving alignment and diversity; a Bradley–Terry model was utilized for a token-based reward system to preserve simplicity without the need for explicit reward modeling.
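A per-example version of the DPO loss in Equation 16 can be sketched as follows; the summed log-probabilities passed in are hypothetical:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss (Equation 16): -log sigma of the implicit
    reward margin between the preferred (w) and dispreferred (l) responses."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already prefers y_w relative to the reference, the
# margin is positive and the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0)
```

In practice the log-probabilities are sums over response tokens under the trainable and frozen reference models, and the loss is averaged over a batch of preference pairs.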
To address the lack of fine-grained process supervision in long-chain mathematical reasoning tasks, Step-DPO [107] treated each individual reasoning step as an independent unit for preference optimization. By shifting the optimization granularity from holistic answer-level evaluation to step-wise supervision, Step-DPO enabled the model to accurately identify subtle errors within complex reasoning chains, thereby enhancing its ability to align with human expectations in multi-step problem-solving scenarios. For a comprehensive overview of DPO and its variants as viewed through the evolution of reward modeling, interested readers may refer to [108]. Taken together, these advances demonstrate the potential of optimizing objective functions to model reward signals implicitly. Through richer representations of preference strength, fine-grained token-level comparisons, robustness to pathological behavior, and improved optimization strategies, DPO and its variants offer a lightweight yet effective approach to preference-based alignment without relying on explicitly learned RMs.

6 Discussion

Table 1 presents a systematic classification of representative LLM alignment studies, structured across three key dimensions: (i) feedback sources, (ii) optimization paradigms, and (iii) RM characteristics. Based on this classification, we observe several significant trends in the development of reward modeling techniques.

6.1 Progress of reward models

The analysis highlights several notable trends in the evolution of reward design. Initially, reward mechanisms were rule-based, manually encoding rewards based on predefined criteria. Over time, reward modeling shifted toward data-driven approaches, learning reward signals directly from behavioral or annotated data. This transition also introduced richer, multidimensional feedback expressed in natural language, moving beyond simple scalar rewards.
These richer signals allow models to better capture nuanced human preferences and foster more effective alignment. In parallel, reward functions have evolved from explicit formulations to more flexible, implicit mechanisms that adapt dynamically to diverse tasks without rigid predefinitions. Additionally, the granularity of RMs has evolved. Early reward functions were general and coarse, but fine-grained reward structures are now increasingly used, providing detailed feedback at various interaction levels, such as token-level, multi-objective, and hybrid rewards. This evolution makes models more sensitive to the specifics of human feedback and enhances their adaptability to complex tasks. Advancements in reward design have thereby driven two prominent trends in LLM alignment. First, there has been a notable shift from RL-based optimization (e.g., PPO-based RLHF) to RL-free alternatives, such as SL-based (e.g., DPO) and ICL-based alignment. These methods offer greater stability, efficiency, and scalability, especially in scenarios with limited or noisy feedback. Second, the improved flexibility and expressiveness of modern reward mechanisms have significantly enhanced the capacity of alignment frameworks to handle multi-modal and multi-task settings, where static, scalar, and task-specific rewards often fall short. Collectively, these shifts mark a fundamental transformation in aligning LLMs with human values under complex, real-world conditions.

6.2 Future directions of reward model

Looking beyond current trends, reward design for LLM alignment is expected to move from static rule-following to dynamic value co-creation. Future approaches will focus on learning values from collective human interactions, such as dialogues, feedback, and community input, rather than relying solely on static annotations. This shift calls for context-aware and temporally adaptive RMs that reflect evolving, diverse, and sometimes conflicting human values.
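The multi-objective reward structures discussed above are often folded into a single training signal by scalarization. The sketch below shows one common, minimal form of this, a weighted sum over per-objective scores; the objective names and weights are hypothetical examples, not values from the survey.

```python
def scalarize_rewards(objective_scores: dict, weights: dict) -> float:
    """Multi-objective reward sketch: combine per-objective scores
    (e.g. helpfulness, harmlessness) into one scalar via a weighted sum.
    Both dicts must cover the same set of objectives."""
    assert set(objective_scores) == set(weights), "objective sets must match"
    return sum(weights[k] * objective_scores[k] for k in objective_scores)
```

Adjusting the weight vector steers the trade-off between objectives, which is precisely why fixed scalarization falls short in the context-aware, temporally adaptive settings this section anticipates: the appropriate weights themselves depend on task, user, and time.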
Additionally, socially interactive and meta-reward learning frameworks may enable LLMs to continually update alignment objectives based on real-time feedback and cultural context. Rather than serving only as tools for task completion, reward mechanisms will increasingly support ethical reasoning and value negotiation. In high-stakes or multi-agent scenarios, RMs must also incorporate governance norms and collective decision-making principles. Ultimately, future reward design will play a central role in building AI systems that align not just with static preferences, but with the dynamic values of human society.

7 Conclusions

The challenges faced by LLMs and their alignment processes, including issues arising from the introduction of reward mechanisms, can be addressed through well-designed reward modeling. A review of recent studies indicates that the evolution of LLM alignment can be regarded as a continuous process of exploring, evaluating, and refining reward design strategies. As alignment scenarios grow more complex, spanning multi-objective, multi-task, and multi-modal contexts, the role of reward models will increasingly shift toward being adaptive, context-aware systems that bridge dynamic human values and autonomous model behaviors.

Table 1 Systematic analysis of LLM alignment studies: feedback source (human vs. AI feedback), optimization paradigm (SL, RL, ICL), and reward design (rule-based vs. data-driven, numerical vs. non-numerical, explicit vs. implicit RM), alphabetical by author. Studies covered: Ahmadian et al. [82]; Akyürek et al. [46]; Azar et al. [109]; Bai et al. [19]; Cao et al. [110]; Chan et al. [111]; Cheng et al. [112]; Coste et al. [39]; Cui et al. [43]; Dai et al. [113]; Dong et al. [114]; Ethayarajh et al. [115]; Hong et al. [116]; Kim et al. [117]; Lee et al. [118]; Li et al. [81]; Mahan et al. [119]; Moskovitz et al. [120]; Mu et al. [48]; Pang et al. [121]; Park et al. [122]; Rafailov et al. [123]; Rame et al. [124]; Santacroce et al. [78]; Scheid et al. [125]; Wu et al. [126]; Xu et al. [127]; Ye et al. [86]; Yang et al. [128]; Yin et al. [129]; Yoon et al. [34]; Zeng et al. [106]; Zhang et al. [89]; Zhang et al. [130]. (Per-study column markers are not recoverable from this extraction.)

References

[1] OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Anthropic: Introducing Claude. https://www.anthropic.com/index/introducing-claude (2023)
[3] Google DeepMind: Google Gemini: Our largest and most capable AI models. https://deepmind.google/technologies/gemini/ (2023)
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008 (2017)
[5] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R.B., Arora, S., Arx, S., et al.: On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021)
[6] Wang, S., Xu, T., Li, H., Zhang, C., Liang, J., Tang, J., Yu, P.S., Wen, Q.: Large language models for education: A survey and outlook. arXiv preprint arXiv:2403.18105 (2024)
[7] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Hernandez, E., Kaplan, J., Henighan, T., Legg, S., Milani, S., et al.: A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861 (2021)
[8] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., et al.: Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), 1–38 (2023)
[9] Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3356–3369 (2020). https://aclanthology.org/2020.findings-emnlp.301/
[10] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: Can language models be too big?
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pp. 610–623 (2021)
[11] Gabriel, I.: Artificial intelligence, values, and alignment. Minds and Machines 30(3), 411–437 (2020)
[12] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramèr, F., et al.: Extracting training data from diffusion models. In: Proceedings of the 32nd USENIX Security Symposium, pp. 5253–5270 (2023). https://arxiv.org/abs/2301.13188
[13] Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017)
[14] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, A., Ray, C., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
[15] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[16] Ziegler, D.M., Stiennon, N., Wu, J., Brown, T.B., Radford, A., Amodei, D.: Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019)
[17] Jaques, N., Ghandeharioun, A., Shen, J.H., Ferguson, C., Lapedriza, A., Jones, N., Gu, S., Picard, R.: Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456 (2019)
[18] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in ai safety. arXiv preprint arXiv:1606.06565 (2016)
[19] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al.: Training a helpful and harmless assistant with reinforcement learning from human feedback.
arXiv preprint arXiv:2204.05862 (2022)
[20] Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B., Liu, Y., Jin, S., Liu, Q., Zhou, Y., Xiong, L., Chen, L., Xi, Z., Xu, N., Lai, W., Zhu, M., Chang, C., Yin, Z., Weng, R., Cheng, W., Huang, H., Sun, T., Yan, H., Gui, T., Zhang, Q., Qiu, X., Huang, X.: Secrets of rlhf in large language models part i: Ppo. arXiv preprint arXiv:2307.04964 (2023)
[21] Skalse, J., Howe, N.H.R., Krasheninnikov, D., Krueger, D.: Defining and characterizing reward hacking. In: Advances in Neural Information Processing Systems, vol. 35 (2022). https://arxiv.org/abs/2209.13085
[22] Gao, L., Schulman, J., Hilton, J.: Scaling laws for reward model overoptimization. In: Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 10835–10866 (2023). https://proceedings.mlr.press/v202/gao23h.html
[23] Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S.M.A., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. In: Advances in Neural Information Processing Systems, vol. 34, pp. 200–212 (2021). https://arxiv.org/abs/2106.13884
[24] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: React: Synergizing reasoning and acting in language models. In: Proceedings of the 11th International Conference on Learning Representations (ICLR) (2023). https://arxiv.org/abs/2210.03629
[25] Xu, Z., Shen, Y., Huang, L.: Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 11445–11465 (2023). https://doi.org/10.18653/v1/2023.acl-long.641. https://aclanthology.org/2023.acl-long.641/
[26] Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y., Xiong, D.: Large language model alignment: A survey.
arXiv preprint arXiv:2309.15025 (2023)
[27] Wang, Y., Zhong, W., Li, L., Mi, F., Zeng, X., Huang, W., Shang, L., Jiang, X., Liu, Q.: Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023)
[28] Wang, Z., Bi, B., Pentyala, S.K., Ramnath, K., Chaudhuri, S., Mehrotra, S., Zhu, Z.J., Mao, X.-B., Asur, S., Cheng, N.C.: A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216 (2024)
[29] Jiang, R., Chen, K., Bai, X., He, Z., Li, J., Yang, M., Zhao, T., Nie, L., Zhang, M.: A survey on human preference learning for large language models. arXiv preprint arXiv:2406.11191 (2024)
[30] Zhong, J., Shen, W., Li, Y., Gao, S., Lu, H., Chen, Y., Zhang, Y., Zhou, W., Gu, J., Zou, L.: A comprehensive survey of reward models: Taxonomy, applications, challenges, and future. arXiv preprint arXiv:2504.12328 (2025)
[31] Pan, L., Saxon, M., Xu, W., Nathani, D., Wang, X., Wang, W.Y.: Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics 12, 484–506 (2024). https://doi.org/10.1162/tacl_a_00660
[32] Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
[33] Plackett, R.L.: The analysis of permutations. Applied Statistics 24(2), 193–202 (1975)
[34] Yoon, E., Yoon, H.S., Eom, S., Han, G., Nam, D.W., Jo, D., On, K.-W., Hasegawa-Johnson, M.A., Kim, S., Yoo, C.D.: Token-level continuous reward for fine-grained reinforcement learning from human feedback. arXiv preprint arXiv:2407.16574 (2024)
[35] Xu, D., Qiu, L., Kim, M., Ladhak, F., Do, J.: Aligning large language models via fine-grained supervision. arXiv preprint arXiv:2406.02756 (2024)
[36] Zeng, Y., Bai, Y., Wu, J., Li, X., Liu, J., Shi, Y., Wang, X.: Token-level direct preference optimization.
arXiv preprint arXiv:2404.11999 (2024)
[37] Fu, D., Xiao, T., Wang, R., Zhu, W., Zhang, P., Pang, G., Jia, R., Chen, L.: Token-level detective reward model for large vision language models. In: Proceedings of the International Conference on Learning Representations (ICLR) (2025). https://arxiv.org/abs/2410.04734
[38] Chen, H., Yang, T., Gao, S., Chen, R., Quan, X., Tian, H., Yao, T.: Discriminative policy optimization for token-level reward models. arXiv preprint arXiv:2505.23363 (2025)
[39] Coste, T., Anwar, U., Kirk, R., Krueger, D.: Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743 (2023)
[40] Liu, W., Wang, X., Wu, M., Li, T., Lv, C., Ling, Z., Zhu, J., Zhang, C., Zheng, X., Huang, X.: Aligning large language models with human preferences through representation engineering. arXiv preprint arXiv:2312.15997 (2023)
[41] Frans, K., Park, S., Abbeel, P., Levine, S.: Unsupervised zero-shot reinforcement learning via functional reward encodings. arXiv preprint arXiv:2402.17135 (2024)
[42] Wang, T., Yu, P., Tan, X.E., O'Brien, S., Pasunuru, R., Dwivedi-Yu, J., Golovneva, O., Zettlemoyer, L., Fazel-Zarandi, M., Celikyilmaz, A.: Shepherd: a critic for language model generation. arXiv preprint arXiv:2308.04592 (2023)
[43] Cui, G., Yuan, L., Ding, N., Yao, G., He, B., Zhu, W., Ni, Y., Xie, G., Xie, R., Lin, Y., Liu, Z., Sun, M.: Ultrafeedback: Boosting language models with scaled ai feedback. arXiv preprint arXiv:2310.01377 (2023)
[44] Richardson, C., Sundar, A., Heck, L.: Syndicom: improving conversational commonsense with error-injection and natural language feedback. arXiv preprint arXiv:2309.10015 (2023)
[45] Li, L., Chai, Y., Wang, S., Sun, Y., Tian, H., Zhang, N., Wu, H.: Tool-augmented reward modeling.
arXiv preprint arXiv:2310.01045 (2023)
[46] Akyürek, A.F., Akyürek, E., Madaan, A., Kalyan, A., Clark, P., Wijaya, D., Tandon, N.: Rl4f: generating natural language feedback with reinforcement learning for repairing model outputs. arXiv preprint arXiv:2305.08844 (2023)
[47] Madaan, A., Lin, X., Fu, Y., Wang, X., Yang, K., Yang, Y., Neubig, G.: Self-refine: Iterative refinement with self-feedback. In: Advances in Neural Information Processing Systems (NeurIPS) (2023). https://arxiv.org/abs/2303.17651
[48] Mu, T., Helyar, A., Heidecke, J., Achiam, J., Vallone, A., Kivlichan, I., Lin, M., Beutel, A., Schulman, J., Weng, L.: Rule based rewards for language model safety. In: Advances in Neural Information Processing Systems (NeurIPS) (2024). https://arxiv.org/abs/2411.01111
[49] Chang, Y., Kim, Y., Krumdick, M., Zadeh, A., Li, C., Tanner, C., Iyyer, M.: Bleuberi: Bleu is a surprisingly effective reward for instruction following. arXiv preprint arXiv:2505.11080 (2025)
[50] Xue, B., Wang, W., Wang, H., Mi, F., Wang, R., Wang, Y., Shang, L., Jiang, X., Liu, Q., Wong, K.: Improving factual consistency for knowledge-grounded dialogue systems via knowledge enhancement and alignment. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 7829–7844 (2023). https://aclanthology.org/2023.findings-emnlp.525/
[51] Sun, H., Schaar, M.: Inverse-rlignment: Large language model alignment from demonstrations through inverse reinforcement learning. arXiv preprint arXiv:2405.15624 (2024)
[52] Cai, Y., et al.: Approximated variational bayesian inverse reinforcement learning for large language model alignment. arXiv preprint arXiv:2411.09341 (2024)
[53] Cheng, R., et al.: Inverse reinforcement learning with dynamic reward scaling for llm alignment. arXiv preprint arXiv:2503.18991 (2025)
[54] Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: A method for automatic evaluation of machine translation.
In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318 (2002)
[55] Lin, C.-Y., Hovy, E.: Automatic evaluation of summaries using n-gram co-occurrence statistics. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pp. 150–157 (2003). https://aclanthology.org/N03-1020
[56] Peng, H., Qi, Y., Wang, X., Yao, Z., Xu, B., Hou, L., Li, J.: Agentic reward modeling: Integrating human preferences with verifiable correctness signals for reliable reward systems. arXiv preprint arXiv:2502.19328 (2025)
[57] Wang, T., Xiong, C.: Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning. arXiv preprint arXiv:2506.15651 (2025)
[58] Zhang, Y., Chen, L., Rao, J., Wang, Z., Liu, H.: Directional preference alignment: Fine-grained user control via multi-objective reward modeling. arXiv preprint arXiv:2508.01234 (2025)
[59] Zhang, Y., Jiang, W., Yang, Z.: Moslim: Align with diverse preferences in prompts through reward classification. arXiv preprint arXiv:2505.20336 (2025)
[60] Zhou, Z., Liu, J., Shao, J., Yue, X., Yang, C., Ouyang, W., Qiao, Y.: Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708 (2023)
[61] Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., Gan, C., Gui, L., Wang, Y.-X., Yang, Y., Keutzer, K., Darrell, T.: Aligning large multimodal models with factually augmented rlhf. In: Findings of the Association for Computational Linguistics: ACL 2024, pp. 13088–13110 (2024). https://doi.org/10.18653/v1/2024.findings-acl.775. https://aclanthology.org/2024.findings-acl.775/
[62] Liu, S., Shen, X., Lai, Y., Wang, S., Yue, S., Huang, Z., Huang, X., Wei, Z.: Haf-rm: A hybrid alignment framework for reward model training.
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), pp. 18874–18893 (2025). https://doi.org/10.48550/arXiv.2407.04185. https://aclanthology.org/2025.acl-long.924/
[63] Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., Ye, W., Zhang, S.: Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation (2024)
[64] Liu, Q., Gong, M., Xu, F., Gao, Z., Liu, H., Gong, M., Ma, X., Lin, Z.: Amopo: Adaptive multi-objective preference optimization without reward models and reference models. arXiv preprint arXiv:2506.07165 (2025)
[65] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al.: Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073 (2022)
[66] Lai, Y., Wang, S., Liu, S., Huang, X., Wei, Z.: Alarm: Align language models via hierarchical rewards modeling. arXiv preprint arXiv:2506.12345 (2025)
[67] Qiu, W., Li, Y.-C., Zhang, X., Zhang, T., Zhang, Y., Zhang, Z., Yu, Y.: Sentence-level reward model can generalize better for aligning llm from human preference. arXiv preprint arXiv:2503.04793 (2024)
[68] Wang, H., Xiong, W., Xie, T., Zhao, H., Zhang, T.: Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845 (2024)
[69] Dubois, Y., Ouyang, L., Brown, T.B., Ziegler, D.M., Hilton, J., Schneider, J., Leike, J., Amodei, D.: Rewarded soups: Towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv preprint arXiv:2305.17486 (2023)
[70] Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C.D., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. In: Advances in Neural Information Processing Systems (NeurIPS) (2023).
https://arxiv.org/abs/2305.18290
[71] Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., Zhang, T.: Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863 (2024)
[72] Choshen, L., Fox, L., Aizenbud, Z., Abend, O.: On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752 (2019)
[73] Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., Madry, A.: Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729 (2020)
[74] Zhong, H., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., Wang, L.: Dpo meets ppo: Reinforced token optimization for rlhf. arXiv preprint arXiv:2404.18922 (2024)
[75] Zheng, R., Shen, W., Hua, Y., Lai, W., Dou, S., Zhou, Y., Xi, Z., Wang, X., Huang, H., Gui, T., et al.: Improving generalization of alignment with human preferences through group invariant learning. arXiv preprint arXiv:2310.11971 (2023)
[76] Zhang, H., Lei, Y., Gui, L., Yang, M., He, Y., Wang, H., Xu, R.: Cppo: Continual learning for reinforcement learning with human feedback. In: The Twelfth International Conference on Learning Representations (2024)
[77] Wu, T., Zhu, B., Zhang, R., Wen, Z., Ramchandran, K., Jiao, J.: Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment. arXiv preprint arXiv:2310.00212 (2023)
[78] Santacroce, M., Lu, Y., Yu, H., Li, Y., Shen, Y.: Efficient rlhf: Reducing the memory usage of ppo. arXiv preprint arXiv:2309.00754 (2023)
[79] Shen, W., Zhang, X., Yao, Y., Zheng, R., Guo, H., Liu, Y.: Improving reinforcement learning from human feedback using contrastive rewards. arXiv preprint arXiv:2403.07708 (2024)
[80] Rafailov, R., Zhang, R.E., Ma, T., Liang, P., Hashimoto, T.B.: Secrets of rlhf in large language models part i: Ppo.
arXiv preprint arXiv:2307.04964 (2023)
[81] Li, Z., Xu, T., Zhang, Y., Lin, Z., Yu, Y., Sun, R., Luo, Z.-Q.: ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models. arXiv preprint arXiv:2310.10505 (2023)
[82] Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., Hooker, S.: Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740 (2024)
[83] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
[84] Hu, J., Liu, J.K., Shen, W.: Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. arXiv preprint arXiv:2501.03262 (2025)
[85] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., Sun, X., Sui, Z.: A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2023)
[86] Ye, D., Wang, Y., Li, Y., Lin, Y., Liu, Z., Sun, M.: Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698 (2023)
[87] Du, Y., Zhao, Y.: In-context learning with reinforcement learning for incomplete demonstrations. arXiv preprint arXiv:2408.13028 (2024)
[88] Suo, Y., Lai, J.: Suo: visual prompt selection for in-context learning segmentation. arXiv preprint arXiv:2407.10233 (2024)
[89] Zhang, J., Wang, B., Li, L., Nakashima, Y., Nagahara, H.: Instruct me more! random prompting for visual in-context learning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2597–2606 (2024)
[90] Do, X.L., Zhao, Y., Brown, H., Xie, Y., Zhao, J.X., Chen, N.F., Kawaguchi, K., Shieh, M., He, J.: Prompt optimization via adversarial in-context learning.
arXiv preprint arXiv:2312.02614 (2023)
[91] Zhou, D., Schuurmans, D., Le, Q.V., et al.: Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2023)
[92] Qian, X., Wang, Y., Li, Y., Lin, Y., Liu, Z., Sun, M.: Sub-sa: strengthen in-context learning via submodular selective annotation. arXiv preprint arXiv:2407.05693 (2024)
[93] Li, K., Zhao, T., Zhou, W., Hu, S.: Dora: Dynamic optimization prompt for continuous reflection of llm-based agent. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 7546–7557 (2025). https://aclanthology.org/2025.coling-main.504/
[94] Chen, W., Lin, Y., Zhou, Z., Huang, H., Jia, Y., Cao, Z., Wen, J.-R.: Icleval: Evaluating in-context learning ability of large language models. arXiv preprint arXiv:2406.14955 (2024)
[95] Yu, G., Liu, L., Yu, M., Yu, Y., Ao, X.: Rethinking the evaluation of in-context learning for llms. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 14068–14082 (2024). https://aclanthology.org/2024.emnlp-main.779/
[96] Honovich, O., Scialom, T., Levy, O., Schick, T.: Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689 (2022)
[97] Yuan, W., Pang, R.Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., Weston, J.: Self-rewarding language models. In: Proceedings of the 41st International Conference on Machine Learning, pp. 57905–57923 (2024). PMLR. https://proceedings.mlr.press/v235/yuan24d.html
[98] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let's verify step by step. arXiv preprint arXiv:2305.20050 (2023)
[99] Sun, Y., Deng, Y., Nie, Y., Ge, Y., Zhang, Y., Fan, X., Zhang, R., Zhang, R., Hou, L., Sun, M., et al.: Self-align: Aligning large language models with self-generated feedback.
arXiv preprint arXiv:2308.08914 (2023)
[100] Wang, Y., Khashabi, D., Min, S., Kordi, Y., Xiong, W., Sabharwal, A., Hajishirzi, H.: Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2023)
[101] Zhou, D., Muennighoff, N., Tay, Y., Fan, L., Mirzadeh, S.I., Raffel, C., Gupta, I., Liu, X., Liu, Y., Song, D., Radev, D.: Lima: Less is more for alignment. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023). https://arxiv.org/abs/2305.11206
[102] Xu, C., Gu, Y., Lin, Y., Liu, Z., Sun, M.: On the essence and prospect: an investigation of alignment approaches for big models. arXiv preprint arXiv:2403.04204 (2024)
[103] Xiao, W., Wang, Z., Gan, L., Zhao, S., He, W., Tuan, L.A., Chen, L., Jiang, H., Zhao, Z., Wu, F.: A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications. CoRR abs/2410.15595 (2024)
[104] Wu, J., Xie, Y., Yang, Z., Wu, J., Gao, J., Ding, B., Wang, X., He, X.: Alpha-dpo: Adaptive preference optimization with dynamic reward margins. arXiv preprint arXiv:2410.10148 (2024)
[105] Wu, J., Xie, Y., Yang, Z., Wu, J., Gao, J., Ding, B., Wang, X., He, X.: β-dpo: Direct preference optimization with dynamic β. arXiv preprint arXiv:2407.08639 (2024)
[106] Zeng, Y., Liu, G., Ma, W., Yang, N., Zhang, H., Wang, J.: Token-level direct preference optimization. arXiv preprint arXiv:2404.11999 (2024)
[107] Lai, X., Tian, Z., Chen, Y., Yang, S., Peng, X., Jia, J.: Step-dpo: Step-wise preference optimization for long-chain reasoning of llms. arXiv preprint arXiv:2406.18629 (2024)
[108] Xu, W., Zhang, Y., Liu, Y., Hu, Y., Gao, Y., Luo, X., Li, Z., Zhang, F., Su, H., Zhu, J.: A comprehensive survey of direct preference optimization: Datasets, theories, variants, and applications.
arXiv preprint arXiv:2410.15595 (2024)
[109] Azar, M.G., Guo, Z., Piot, B., Munos, R., Rowland, M., Valko, M., Calandriello, D.: A general theoretical paradigm to understand learning from human preferences. In: Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS). Proceedings of Machine Learning Research, vol. 238, pp. 4447–4455 (2024)
[110] Cao, M., Shu, L., Yu, L., Zhu, Y., Wichers, N., Liu, Y., Meng, L.: Beyond sparse rewards: Enhancing reinforcement learning with language model critique in text generation. arXiv preprint arXiv:2401.07382 (2024)
[111] Chan, A.J., Sun, H., Holt, S., Schaar, M.: Dense reward for free in reinforcement learning from human feedback. arXiv preprint arXiv:2402.00782 (2024)
[112] Cheng, P., Yang, Y., Li, J., Dai, Y., Hu, T., Cao, P., Du, N., Li, X.: Adversarial preference optimization: Enhancing your alignment via rm-llm game. arXiv preprint arXiv:2311.08045 (2023)
[113] Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., Wang, Y., Yang, Y.: Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773 (2023)
[114] Dong, Y., Wang, Z., Sreedhar, M.N., Wu, X., Kuchaiev, O.: Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. arXiv preprint arXiv:2310.05344 (2023)
[115] Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., Kiela, D.: Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306 (2024)
[116] Hong, J., Lee, N., Thorne, J.: ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691 (2024)
[117] Kim, S., Bae, S., Shin, J., Kang, S., Kwak, D., Yoo, K.M., Seo, M.: Aligning large language models through synthetic feedback.
arXiv preprint arXiv:2305.13735 (2023)
[118] Lee, H., Phatale, S., Mansoor, H., Lu, K.R., Mesnard, T., Ferret, J., Bishop, C., Hall, E., Carbune, V., Rastogi, A.: Rlaif: Scaling reinforcement learning from human feedback with ai feedback (2023)
[119] Mahan, D., Van Phung, D., Rafailov, R., Blagden, C., Lile, N., Castricato, L., Franken, J.-P., Finn, C., Albalak, A.: Generative reward models. arXiv preprint arXiv:2410.12832 (2024)
[120] Moskovitz, T., Singh, A.K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A.D., McAleer, S.: Confronting reward model overoptimization with constrained rlhf. arXiv preprint arXiv:2310.04373 (2023)
[121] Pang, R.Y., Yuan, W., Cho, K., He, H., Sukhbaatar, S., Weston, J.: Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733 (2024)
[122] Park, R., Rafailov, R., Ermon, S., Finn, C.: Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159 (2024)
[123] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024)
[124] Rame, A., Couairon, G., Dancette, C., Gaya, J.-B., Shukor, M., Soulier, L., Cord, M.: Rewarded soups: Towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 71095–71134 (2023)
[125] Scheid, A., Boursier, E., Durmus, A., Jordan, M.I., Ménard, P., Moulines, E., Valko, M.: Optimal design for reward modeling in rlhf. arXiv preprint arXiv:2410.17055 (2024)
[126] Wu, Z., Hu, Y., Shi, W., Dziri, N., Suhr, A., Ammanabrolu, P., Smith, N.A., Ostendorf, M., Hajishirzi, H.: Fine-grained human feedback gives better rewards for language model training.
Advances in Neural Information Processing Systems 36, 59008–59033 (2023)
[127] Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., Wu, Y.: Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719 (2024)
[128] Yang, K., Liu, Z., Xie, Q., Huang, J., Min, E., Ananiadou, S.: Selective preference optimization via token-level reward function estimation. arXiv preprint arXiv:2408.13518 (2024)
[129] Yin, Q., Leong, C.T., Zhang, H., Zhu, M., Yan, H., Zhang, Q., He, Y., Li, W., Wang, J., Zhang, Y., Yang, L.: Constrain alignment with sparse autoencoders. In: Proceedings of the 2025 International Conference on Machine Learning (ICML), vol. 267 (2025). https://arxiv.org/abs/2411.07618
[130] Zhang, L., Hosseini, A., Bansal, H., Kazemi, M., Kumar, A., Agarwal, R.: Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240 (2024)