← Back to papers

Paper deep dive

Aligning Large Language Models with Searcher Preferences

Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

Year: 2026 | Venue: arXiv preprint | Area: cs.CL | Type: Preprint | Embeddings: 73

Abstract

The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: failed | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 0%

Last extracted: 3/13/2026, 1:10:46 AM

OpenRouter request failed (402): {"error":{"message":"This request requires more credits, or fewer max_tokens. You requested up to 65536 tokens, but can only afford 58929. To increase, visit https://openrouter.ai/settings/keys and create a key with a higher monthly limit","code":402,"metadata":{"provider_name":null}},"user_id":"user_2shvuzpVFCCndDdGXIdfi40gIMy"}

Entities (0)

No extracted entities yet.

Relation Signals (0)

No relation signals yet.

Cypher Suggestions (0)

No Cypher suggestions yet.

Full Text

72,557 characters extracted from source content.


Aligning Large Language Models with Searcher Preferences

arXiv:2603.10473v1 [cs.CL] 11 Mar 2026

Wei Wu* (School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China; urara@mail.ustc.edu.cn); Peilun Zhou* (Xiaohongshu Inc., Beijing, China; zhoupeilun@xiaohongshu.com); Liyi Chen† (Xiaohongshu Inc., Beijing, China; chenliyi@xiaohongshu.com); Qimeng Wang (Xiaohongshu Inc., Beijing, China; qimengwang@xiaohongshu.com); Chengqiang Lu (Xiaohongshu Inc., Beijing, China; lunar@mail.ustc.edu.cn); Yan Gao (Xiaohongshu Inc., Beijing, China; wanjianyi@xiaohongshu.com); Yi Wu (Xiaohongshu Inc., Beijing, China; luyun2@xiaohongshu.com); Yao Hu (Xiaohongshu Inc., Beijing, China; xiahou@xiaohongshu.com); Hui Xiong† (Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; xionghui@ust.hk)

* Equal Contribution. † Corresponding Authors.

Abstract

The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.

Keywords: Generative Search, Large Language Models, Reinforcement Learning, Reward Modeling, Retrieval-Augmented Generation

1 Introduction

The rapid growth in both the scale and diversity of digital content on online platforms has resulted in severe information overload [6], making it increasingly challenging for users to efficiently locate accurate answers and make informed decisions. Despite significant advances in search engines that have improved retrieval precision by providing more relevant items [41], outputs of search engines typically remain as itemized lists. This format places a cognitive burden on users, who still need to further sift through and synthesize the results to fulfill their information needs.
Recently, large language models (LLMs) have enabled a new open-ended paradigm of generative search systems [2,5,8,30,34], which can comprehensively assess the relevance and validity of retrieved information and directly generate coherent, natural-language answers to user queries. This transition from closed-set item retrieval to open-ended answer generation fundamentally redefines the role of search engines from merely returning candidate items to delivering synthesized, user-centered solutions.

Nowadays, open-ended generative search has been increasingly integrated into widely adopted real-world online platforms and is used by hundreds of millions of users, positioning it as a critical interface for large-scale information access and sensemaking. Across short-video apps (e.g., Douyin [5]/TikTok [30]), super-app ecosystems (e.g., WeChat [34]), and general-purpose assistants (e.g., Gemini [8]/ChatGPT [2]), these systems ground responses in heterogeneous sources—videos, live streams, hybrid posts, official accounts/news, and the web—and synthesize them into direct conversational answers, enabling faster access, clearer summaries, and more interactive exploration.

[Figure 1: User interaction snapshots of open-ended generative search in RedNote. The bottom-right panel summarizes failure attribution from online user feedback.]

In the literature, industrial generative search systems are still largely item-centric and mainly focus on closed-set item generation in the e-commerce domain, where the goal is to generate or rank product identifiers rather than produce free-form answers. For example, OneSearch [3] enables production-scale generative retrieval via semantic ID tokenization with behavior-aware training; GRAM [22] generates product IDs auto-regressively in a unified query–item code space; and CRS [16] post-trains LLMs with SFT and RL to better capture user preferences for ranking.

However, such closed-set formulations are not directly applicable to open-ended generative search, where the system needs to synthesize grounded natural-language answers from heterogeneous and potentially noisy evidence under real-world user intents. Figure 1 shows the interface of open-ended generative search in RedNote; the bottom-right subfigure summarizes the attribution analysis of failure cases collected from online user feedback. Mitigating these issues requires aligning the underlying LLMs with three key requirements. R1: Robustness to noisy queries and evidence. LLMs should remain robust under ambiguous or underspecified queries and noisy, heterogeneous, outdated, redundant, or even conflicting retrieved evidence, while deciding when to infer intent, ask for clarification, or refuse unsafe requests. R2: Bottom-line guarantees on reliability and safety. Generated answers should satisfy strict reliability and safety guards, including factual grounding in provided evidence or widely accepted background knowledge, safety and policy compliance, logical consistency, and a controllable response format. R3: Alignment with user needs.
Answers should be optimized for user consumption and decision support, presenting key information early, reducing redundancy, and choosing an appropriate level of detail and structure for each query without weakening the constraints above. Therefore, how to effectively train an LLM for open-ended generative search is of central importance.

In this work, we introduce SearchLLM, the first LLM for open-ended generative search. We first propose a reward system that encodes the three requirements above, and train SearchLLM under this objective. Concretely, we consolidate the three requirements into a two-layer design that separates non-negotiable safeguards from user-facing quality objectives: a bottom-line layer that encodes R2 as hard constraints on factual grounding, safety, and response format, and a behavior optimization layer that, within this safe region, jointly captures R1 and R3 by shaping how the model responds to uncertain queries and noisy evidence and how it trades off brevity, coverage, and novelty. On top of this structural design, we instantiate a hybrid evaluation stack that combines deterministic rule-based evaluators with LLM-based judges, each designed to output fine-grained, interpretable scores on specific sub-dimensions of the metric space. We calibrate our evaluators via a human-in-the-loop process so that the resulting interpretable metrics remain faithful to searcher preferences and stable in deployment. The resulting multi-dimensional reward vector is then transformed by a Gated Aggregation Strategy that protects bottom-line dimensions and rebalances user-facing quality dimensions, and is subsequently used as the objective in a Group Relative Policy Optimization [27] (GRPO)-style reinforcement learning pipeline trained over large-scale search query logs. We deploy SearchLLM in the AI search entry of RedNote. Online A/B tests demonstrate that our method yields measurable gains in user satisfaction, specifically increasing the Valid Consumption Rate (VCR) by 1.03% and reducing the Re-search Rate (R) by 2.81% compared to the production baseline.

We summarize our main contributions as follows.
• We are the first to characterize the unique demands of open-ended generative search on large content platforms and introduce a dedicated LLM named SearchLLM, which can provide valuable insights for the community.
• We propose a multi-dimensional reward design for generative search that explicitly separates non-negotiable safeguards from user-facing quality objectives, and instantiate it with a hybrid stack of rule-based checks and LLM judges calibrated by experts.
• We develop an end-to-end reinforcement learning recipe for SearchLLM, optimizing the full workflow with GRPO-style training and a Gated Aggregation Strategy that prioritizes bottom-line constraints while improving robustness and utility.
• We deploy SearchLLM in RedNote's AI search entry, and online A/B tests show improved user experience, with +1.03% Valid Consumption Rate (VCR) and -2.81% Re-search Rate (R) over the production baseline.

2 Related Work

Large Language Models for Search. LLMs are gradually reshaping how search engines operate and interact with users [35]. At the component level, LLMs are extensively used to enhance query understanding via rewriting and expansion [23,33], improve document indexing through semantic embedding [4], and refine ranking by estimating relevance more accurately [20,24].
Beyond these component-level optimizations, recent industrial efforts in the e-commerce domain have explored end-to-end generative retrieval. Notable systems like OneSearch [3] employ semantic ID tokenization to perform generative retrieval at scale, while GRAM [22] and CRS [16] generate product identifiers auto-regressively in a unified query–item latent space. In parallel, the paradigm of retrieval-augmented generation (RAG) [7,41] has shifted the focus from returning ranked lists to synthesizing natural-language answers grounded in retrieved evidence [29,38]. However, in the literature, industrial applications predominantly focus on closed-set item ranking for e-commerce. Systematic studies on deploying open-ended generative search remain limited at scale.

[Figure 2: Overview of the alignment framework for open-ended generative search. The pipeline incorporates a multi-dimensional reward system that explicitly decouples non-negotiable bottom-line constraints (Layer I) from behavioral optimization objectives (Layer II). A hybrid evaluation stack, consisting of deterministic rules and human-calibrated LLM judges, computes fine-grained scores across multiple dimensions. These signals are synthesized via a gated aggregation mechanism to stabilize the learning signal for Group Relative Policy Optimization (GRPO). Panels: A. Generative Search System Overview; B. Multi-Dimensional Reward System & Gated Aggregation; C. End-to-End GRPO Optimization.]

Alignment of Large Language Models. Aligning LLMs with human values and safety standards is critical for deployment. Standard approaches typically utilize Reinforcement Learning from Human Feedback (RLHF) via Proximal Policy Optimization (PPO) [21,26], or employ more efficient offline variants such as Direct Preference Optimization (DPO) [25] and Group Relative Policy Optimization (GRPO) [27]. Regardless of the optimization algorithm, the efficacy of alignment relies on the design of the reward signal, which falls into three categories: (i) Scalar reward models [1,18,21], which map responses to a single quality score. While computationally efficient, they produce coarse-grained, opaque scores that act as black boxes, making it difficult to balance competing objectives like safety and helpfulness; (ii) LLM-as-a-judge approaches [14,40], which leverage powerful LLMs to scale up evaluation without human labelers, though they can be sensitive to prompting and inherent model biases; and (iii) Rubric-based rewards [9,15,32], which decompose complex human preferences into structured, interpretable criteria (e.g., checklists).
Building on these advances, we propose a two-layer, multi-dimensional reward system for open-ended generative search that explicitly separates non-negotiable safeguards from user-facing quality objectives, instantiated via a hybrid stack of rule-based checks and calibrated LLM judges.

3 Methodology

3.1 System Overview

We design an LLM-based open-ended generative search pipeline (illustrated in Figure 2) that enhances retrieval-augmented generation by using a single LLM for intent planning, evidence selection, and evidence-grounded generation in a unified workflow. Built on top of a standard multi-source retrieval infrastructure that provides a broad pool of candidate evidence from internal note search, web search, and real-time tools, our system focuses on reasoning over this candidate pool rather than on retrieval itself. Given a user query and its session history, it first constructs an explicit intent plan that structures the information need into several interpretable facets and specifies how each facet should be supported by evidence. Conditioned on this plan, the system then performs intent-aware evidence selection that filters the candidate pool into a compact evidence set that is relevant, non-redundant, and collectively covers the planned facets. This curated evidence set is finally consumed by the LLM to synthesize a grounded final response.
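The three stages above can be pictured as three calls to the same policy model. The Python sketch below is only illustrative: call_search_llm, the prompt strings, and the index-parsing logic are hypothetical stand-ins under stated assumptions, not the production interface described in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Evidence:
    source: str   # e.g. "internal_notes", "web_search", "realtime_tool"
    content: str

def call_search_llm(prompt: str) -> str:
    """Stub for one invocation of the SearchLLM policy model (pi_theta)."""
    raise NotImplementedError

def plan_intent(query: str, history: List[str]) -> str:
    # Stage 1: structure the information need into interpretable facets.
    return call_search_llm(
        f"Session history: {history}\nQuery: {query}\n"
        "List the facets this answer must cover and the evidence each facet needs."
    )

def select_evidence(plan: str, candidates: List[Evidence], k: int = 8) -> List[Evidence]:
    # Stage 2: filter the multi-source candidate pool into a compact, plan-aligned set.
    listing = "\n".join(f"[{i}] ({e.source}) {e.content[:200]}" for i, e in enumerate(candidates))
    reply = call_search_llm(
        f"Plan: {plan}\nCandidates:\n{listing}\n"
        f"Return the indices of up to {k} relevant, non-redundant items, comma-separated."
    )
    keep = [int(t) for t in reply.split(",") if t.strip().isdigit()]
    return [candidates[i] for i in keep[:k] if i < len(candidates)]

def answer(query: str, history: List[str], candidates: List[Evidence]) -> str:
    # Stage 3: synthesize the grounded final response y from the curated evidence set.
    plan = plan_intent(query, history)
    evidence = select_evidence(plan, candidates)
    context = "\n".join(f"({e.source}) {e.content}" for e in evidence)
    return call_search_llm(f"Query: {query}\nPlan: {plan}\nEvidence:\n{context}\nWrite a grounded answer:")
```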
3.2 Multi-Dimensional Reward System

Our reward system does not evaluate responses in isolation, but instead scores complete search interactions. Formally, let $x = (q, h, E)$ denote the input context comprising the query $q$, session history $h$, and retrieved evidence set $E$, and let $y$ denote the generated sequence, which encompasses both the intent plan and the final answer. To operationalize the critical requirements identified in Section 1, specifically the inherent conflict between enforcing strict bottom-line guarantees (R2) and optimizing for robustness and user alignment (R1, R3), we move beyond monolithic scalar reward models. Drawing inspiration from recent work on rubric-based rewards [9,10,15,32], we design a hierarchical and multi-dimensional reward system. This system explicitly decouples bottom-line constraints, which ensure non-negotiable safety and reliability, from behavioral optimization objectives, which target robustness and preference alignment. This separation ensures that the model prioritizes foundational guarantees while simultaneously optimizing the user's search experience across the entire pipeline, from intent planning and evidence selection to final response synthesis.

3.2.1 Reward Design. We formalize the evaluation of a model generation $y$ for a given search interaction context $x$ through a structured set of evaluation criteria $\mathcal{C}$. Unlike standard RLHF [26], which often conflates diverse quality signals into a single scalar, we categorize our criteria into two distinct layers:

Layer I: Bottom-line Constraints. This layer encodes Requirement R2 (Reliability and Safety). It consists of binary or near-binary criteria that serve as hard constraints. Failure in these dimensions renders a response unusable. We define a subset of criteria $\mathcal{C}_{BL} \subset \mathcal{C}$ covering three critical aspects:
• Hallucination & Factual Grounding: To mitigate misinformation, we rigorously assess hallucination at both the sentence and claim levels, ensuring that the generated answer is factually accurate. Additionally, we verify consistency with external knowledge bases and enforce strict refusal behaviors when retrieved evidence is insufficient to answer the query.
• Basic Answer Quality: This subset ensures the fundamental logic and readability of the response. It detects logical inconsistencies within a single turn, contradictions across multi-turn interactions, and filters out gibberish or low-quality text.
• Format Compliance: To ensure the response is structurally consumable, we enforce strict formatting rules, including adherence to Markdown and constraints on response length.

Layer II: Behavioral Objectives. Once bottom-line constraints are satisfied, the model should be optimized for robustness (R1) and user utility (R3). This layer, denoted as $\mathcal{C}_{BH} \subset \mathcal{C}$, specifically targets increasing user engagement with the generated responses. Key dimensions include:
• Robustness to Query & Evidence: This subset focuses on the complexity of open-ended generative search. On the query side, we evaluate intent alignment to ensure the model addresses the user's core question. On the evidence side, we assess how useful the utilized content is for answering the query and how likely it is to satisfy the user, while penalizing the inclusion of irrelevant or conflicting information. Furthermore, we evaluate the quality of planning and reasoning under uncertainty.
• Richness & Diversity: To enhance the depth of information and avoid overly narrow answers, we encourage the generation of diverse yet relevant claims, ensuring the response covers multiple perspectives or facets of the topic rather than repeating a single point.
• Conciseness & Usability: To optimize the consumption experience, we prioritize usability metrics. This includes the "answer-first" principle (placing the core answer at the beginning), reducing semantic redundancy, and minimizing off-topic or tangential content to ensure a high signal-to-noise ratio.

Table 5 (Appendix C) provides the detailed definitions for all criteria discussed above, alongside their specific implementation types, which we detail in the following section.
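As a concrete picture of this two-layer criteria set, the sketch below models the interpretable score vector and a hybrid evaluator registry. The dimension names, the simple rule-based format check, and the callable-based judge interface are assumptions for illustration; the paper's actual criteria are those listed in Table 5.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
import re

@dataclass
class SearchContext:
    query: str                  # q
    session_history: List[str]  # h
    evidence: List[str]         # retrieved evidence set E

@dataclass
class RewardVector:
    bottom_line: Dict[str, float] = field(default_factory=dict)  # Layer I, scores in [0, 1]
    behavioral: Dict[str, float] = field(default_factory=dict)   # Layer II, scores in [0, 1]

def rule_format_compliance(x: SearchContext, y: str, max_chars: int = 2000) -> float:
    """Deterministic (rule-based) check: length bound plus minimal Markdown structure."""
    length_ok = len(y) <= max_chars
    structured = bool(re.search(r"(?m)^(#{1,3} |[-*] )", y)) or len(y) < 300
    return float(length_ok and structured)

# LLM-judge dimensions share the same (context, response) -> [0, 1] interface; in
# production they would wrap calibrated judge models, which are stubbed out here.
Evaluator = Callable[[SearchContext, str], float]
BOTTOM_LINE_EVALUATORS: Dict[str, Evaluator] = {
    "format_compliance": rule_format_compliance,
    # "claim_hallucination": llm_judge_claim_hallucination,  # LLM-based judge (omitted)
}
BEHAVIORAL_EVALUATORS: Dict[str, Evaluator] = {
    # "answer_firstness": llm_judge_answer_firstness,        # LLM-based judge (omitted)
}

def score(x: SearchContext, y: str) -> RewardVector:
    """Interpretable score vector later consumed by the gated aggregation (Section 3.3)."""
    return RewardVector(
        bottom_line={name: fn(x, y) for name, fn in BOTTOM_LINE_EVALUATORS.items()},
        behavioral={name: fn(x, y) for name, fn in BEHAVIORAL_EVALUATORS.items()},
    )
```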
3.2.2 Implementation of Hybrid Evaluation Stack. To ensure our reward signals are both scalable and diagnostically precise, we operationalize a hybrid evaluation stack that integrates deterministic rules (e.g., n-gram statistics, regex constraints) for objective criteria with LLM-based judges for complex semantic dimensions. Aligning these LLM judges with human expert preferences is non-trivial; strictly relying on single-pass annotation often leads to noisy ground truth. To address this, we establish a rigorous human-in-the-loop calibration cycle. Figure 9 shows the annotation interface used for this process, where experts provide both fine-grained scores and holistic pairwise rankings. To strictly control annotation quality and mitigate cognitive inertia (anchoring bias), where annotators might over-trust the model's generated reasoning, we enforce a dual-track annotation protocol involving two distinct groups:
• Blind Group: Annotators score responses based on the query and answer text, without access to the reward system's internal reasoning or intermediate steps. This establishes an unbiased baseline that reflects the end-user's consumption experience.
• Assisted Group: Annotators review the response alongside the reward system's CoT and evidence citations. This setup enables them to verify logical consistency and detect subtle hallucinations that are factually incorrect but linguistically plausible.

Discrepancies between the Blind and Assisted scores serve as high-value signals for identifying policy gaps; these conflict cases are escalated to senior experts for adjudication to resolve edge-case ambiguities and unify evaluation standards. Finally, before any version update, the evaluator must demonstrate non-degradation on a frozen regression benchmark and undergo a stable "shadow run" in the production environment.

3.3 Reinforcement Learning Framework

Optimizing policy models against such a composite and fine-grained reward system presents a unique challenge. A naive linear combination of these diverse signals often exacerbates the "seesaw effect" [10], where the model exploits easier-to-optimize reward signals (e.g., length-based scores) at the expense of rigid bottom-line constraints such as factual consistency. To robustly operationalize this multi-dimensional feedback without succumbing to reward hacking, we introduce a Gated Aggregation Strategy. For a given model generation $y$, let $S_{BL} = \{s_1, \ldots, s_m\}$ be the normalized scores from the bottom-line criteria defined in Layer I, and let $S_{BH} = \{s'_1, \ldots, s'_n\}$ be the normalized scores from the behavioral criteria defined in Layer II, with associated expert-defined weights $W = \{w_1, \ldots, w_n\}$. We assume that each score lies in the unit interval, i.e., $s_i \in [0, 1]$ for $i = 1, \ldots, m$ and $s'_i \in [0, 1]$ for $i = 1, \ldots, n$. We compute the bottom-line factor via a $\delta$-smoothed geometric mean

$$B_\delta(x, y) = \exp\!\left( \frac{1}{m} \sum_{i=1}^{m} \log \frac{s_i + \delta}{1 + \delta} \right), \quad \delta > 0, \qquad (1)$$

which acts as a soft-AND gate while improving optimization stability. In the vanilla form (i.e., $\delta = 0$),

$$B_0(x, y) = \left( \prod_{i=1}^{m} s_i \right)^{1/m} = \exp\!\left( \frac{1}{m} \sum_{i=1}^{m} \log s_i \right),$$

the gate can become numerically unstable when computed in the log-domain due to $\log 0$ and exhibits unbounded sensitivity

$$\frac{\partial \log B_0(x, y)}{\partial s_i} = \frac{1}{m\, s_i}$$

when $s_i \to 0$. The $\delta$-smoothing avoids $\log 0$ and upper-bounds the gate sensitivity:

$$\frac{\partial \log B_\delta(x, y)}{\partial s_i} = \frac{1}{m\,(s_i + \delta)} \leq \frac{1}{m\,\delta}.$$

Moreover, when any bottom-line metric is near zero, $B_\delta$ remains strongly suppressed in a $\delta$-controlled manner, yielding a substantially reduced reward. Conversely, we compute the behavioral utility as a weighted arithmetic mean

$$U(x, y) = \frac{\sum_{i=1}^{n} w_i\, s'_i}{\sum_{i=1}^{n} w_i}, \qquad (2)$$

which allows for flexible trade-offs among softer objectives within the safe region. The final scalar reward is then defined as

$$R(x, y) = B_\delta(x, y)\, U(x, y), \qquad (3)$$

so that behavioral improvements act as multipliers when bottom-line scores are sufficiently high.
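A minimal numeric sketch of Eqs. (1)-(3) follows, assuming all dimension scores are already normalized to [0, 1]; the dimension names and weights are illustrative, while delta = 1e-2 matches the setting reported in Section 4.1.4.

```python
import math
from typing import Dict

def gated_reward(
    bottom_line: Dict[str, float],   # Layer I scores s_i in [0, 1]
    behavioral: Dict[str, float],    # Layer II scores s'_i in [0, 1]
    weights: Dict[str, float],       # expert-defined weights w_i for Layer II
    delta: float = 1e-2,
) -> float:
    """Gated Aggregation: R(x, y) = B_delta(x, y) * U(x, y)."""
    # delta-smoothed geometric mean over bottom-line scores (soft-AND gate, Eq. 1)
    m = len(bottom_line)
    log_gate = sum(math.log((s + delta) / (1.0 + delta)) for s in bottom_line.values()) / m
    b_delta = math.exp(log_gate)

    # weighted arithmetic mean over behavioral scores (Eq. 2)
    total_w = sum(weights[k] for k in behavioral)
    u = sum(weights[k] * s for k, s in behavioral.items()) / total_w

    # final scalar reward (Eq. 3): behavioral utility only pays off inside the safe region
    return b_delta * u

# If any bottom-line dimension collapses, the gate suppresses the whole reward:
r_safe = gated_reward({"hallucination": 1.0, "format": 1.0},
                      {"answer_firstness": 0.9, "richness": 0.8},
                      {"answer_firstness": 2.0, "richness": 1.0})
r_unsafe = gated_reward({"hallucination": 0.0, "format": 1.0},
                        {"answer_firstness": 0.9, "richness": 0.8},
                        {"answer_firstness": 2.0, "richness": 1.0})
assert r_unsafe < 0.2 * r_safe   # the soft-AND gate dominates behavioral gains
```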
To optimize the policy under this reward surface, we employ Group Relative Policy Optimization (GRPO) [27]. Unlike standard PPO [26], which relies on a value network, GRPO normalizes advantages within a group of sampled outputs for the same query, making it efficient for our large-scale setting. Formally, let $\pi_\theta$ denote the policy to be optimized and $\pi_{\theta_{\text{old}}}$ the behavior policy used to collect trajectories. For a query $x$, we sample a group of generations $\{y_i\}_{i=1}^{G}$, where each $y_i$ is a token sequence $y_i = (y_{i,1}, \ldots, y_{i,|y_i|})$. The GRPO objective is

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim D,\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\!\left( r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \lambda\, D_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right], \qquad (4)$$

where $\lambda$ is the coefficient controlling the KL divergence, and

$$r_{i,t}(\theta) = \frac{\pi_\theta\!\left( y_{i,t} \mid x, y_{i,<t} \right)}{\pi_{\theta_{\text{old}}}\!\left( y_{i,t} \mid x, y_{i,<t} \right)} \qquad (5)$$

is the per-token importance-weighted ratio. The token-wise advantages $\hat{A}_{i,t}$ are obtained by normalizing the group rewards:

$$\hat{A}_{i,t} = \frac{R(x, y_i) - \operatorname{mean}\!\left( \{R(x, y_j)\}_{j=1}^{G} \right)}{\operatorname{std}\!\left( \{R(x, y_j)\}_{j=1}^{G} \right)}, \quad t = 1, \ldots, |y_i|. \qquad (6)$$
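The group computation in Eqs. (4)-(6) reduces to a few tensor operations. The sketch below is deliberately simplified: it assumes equal-length completions, omits the padding mask and the KL penalty term, and the clipping constant is an assumed value rather than one reported in the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Eq. (6): normalize the G scalar rewards of one query's rollout group.

    rewards: shape (G,), i.e. R(x, y_i) for each sampled completion y_i.
    Returns one advantage per completion, broadcast later to every token of y_i.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_token_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate from Eq. (4) for one group (KL term left out for brevity).

    logp_new, logp_old: (G, T) token log-probs under pi_theta and pi_theta_old.
    advantages:         (G,)  group-relative advantages from Eq. (6).
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_{i,t}(theta), Eq. (5)
    adv = advantages.unsqueeze(-1)                                # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped)
    return -per_token.mean()                                      # negate to maximize the objective

# Example: 16 completions per prompt, matching the rollout configuration in Section 4.1.4.
rewards = torch.rand(16)
adv = group_relative_advantages(rewards)
```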
4 Experiments

In this section, we evaluate the effectiveness of our proposed reward system and RL training framework. We aim to answer the following research questions:
• RQ1: How effectively does our multi-dimensional reward system align with human expert judgments compared to state-of-the-art reward modeling baselines?
• RQ2: Does the proposed optimization strategy yield better offline generation quality?
• RQ3: How does the Gated Aggregation Strategy influence training dynamics and coordinate different reward dimensions?
• RQ4: What is the impact of the deployed model in online settings on real-world user engagement and safety metrics?

4.1 Experimental Setup

4.1.1 Datasets. To support the training and evaluation of our framework, we construct four distinct datasets derived from RedNote search logs: a Reward Training Dataset ($D_{\text{RM-Train}}$) for calibration, two diagnostic test sets ($D_{\text{Eval-Dim}}$ and $D_{\text{Eval-Holistic}}$) for reward validation, and a large-scale unlabeled RL Optimization Dataset ($D_{\text{RL}}$). Detailed statistics and construction protocols for these datasets are provided in Appendix D.

4.1.2 Baselines. To validate our contributions, we benchmark our approach against state-of-the-art methods from two perspectives: reward quality (RQ1) and policy optimization effectiveness (RQ2–4).

Reward Modeling Baselines. We compare our reward system against the most advanced reward modeling paradigms:
• GenRM (Generative Reward Model): This baseline trains an LLM to generate a Chain-of-Thought (CoT) reasoning trace before outputting final scores [18,39].
• Rubric (Rubric-Based Reward): Following [12], this baseline trains a rubric generator via supervised fine-tuning on a small set of expert-written query–rubric pairs, then uses it to generate query-specific rubrics for each reward dimension to compute the rubric-based reward.

Policy Baselines. All policy models are initialized from the same SFT checkpoint, which was trained on high-quality business data to ensure basic service capability. We compare the following optimization strategies:
• RFT (Rejection Sampling Fine-tuning): An iterative alignment method where the model generates multiple candidates per query, and the highest-scoring are selected to build a new dataset for this round of SFT [31,37].
• DPO (Direct Preference Optimization): An offline method that optimizes the policy directly on preference data [25], bypassing the explicit reward modeling step.
• GRPO-GenRM: GRPO [27] using the trained GenRM's outputs as the reward signal.
• GRPO-Linear: A variant of our framework that utilizes our multi-dimensional reward signals but aggregates them via a naive weighted sum.
• GRPO-Gated (Ours): Our proposed method utilizing the Gated Aggregation Strategy.

4.1.3 Evaluation Metrics. We employ a comprehensive set of metrics to assess both reward alignment and real-world performance. For offline evaluation, we utilize Accuracy (ACC) and Area Under the Curve (AUC) to measure agreement with human experts. For online deployment, we monitor user engagement via Valid Consumption Rate (VCR), Skip Rate (SR), Re-search Rate (R), and Bad Case Rate (BCR). Detailed definitions and mathematical formulations for these metrics are provided in Appendix E.

4.1.4 Implementation Details. All experiments are conducted on 18 nodes equipped with NVIDIA H800 GPUs, where 16 nodes are dedicated to the reward system deployment. We implement the training pipeline using the verl [28] library, where we leverage Megatron [19] for distributed training and vLLM [13] for efficient generation rollouts. The policy model is initialized from Qwen3-30B-A3B-Instruct-2507 [36], while the entire reward evaluation stack is powered by DeepSeek-R1 [11] to ensure reasoning-dense feedback. We adopt an on-policy optimization configuration with a global batch size of 128 and sample 16 completions per prompt at a temperature of 1.0. The optimization utilizes the AdamW [17] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.95$, plus a weight decay of 0.1, alongside a learning rate of $1 \times 10^{-6}$ and a KL-divergence coefficient $\lambda$ of 0.01. For the Gated Aggregation in Eq. 1, we set $\delta = 10^{-2}$.

Table 1: Comparison on pointwise constraint of our reward system against multiple baselines. Metric reported is Accuracy.

| Layer | Req. | Subset | Reward Dimension | GenRM | Rubric | Ours |
|---|---|---|---|---|---|---|
| Behavioral | R1 | Query | Query Satisfy | 71.52 | 73.31 | 87.24 |
| Behavioral | R1 | Evidence | Reference Conflict | 79.01 | 70.48 | 90.84 |
| Behavioral | R1 | Evidence | Reference Irrelevant | 88.92 | 83.63 | 94.96 |
| Bottom-line | R2 | Basic Quality | Answer Quality | 64.02 | 65.69 | 82.43 |
| Bottom-line | R2 | Basic Quality | Self Consistency | 59.68 | 64.22 | 78.20 |
| Bottom-line | R2 | Basic Quality | Multi-turn Repeat/Conflict | 80.60 | 83.40 | 94.20 |
| Bottom-line | R2 | Hallucination | Highlight Hallucination | 49.34 | 69.05 | 92.85 |
| Bottom-line | R2 | Hallucination | Claim Hallucination | 52.93 | 66.84 | 91.88 |
| Bottom-line | R2 | Hallucination | LLM Knowledge | 84.19 | 84.19 | 85.77 |

Table 2: Comparison on pairwise preference of our reward system against multiple baselines. Metric reported is AUC.

| Layer | Req. | Subset | Reward Dimension | GenRM | Rubric | Ours |
|---|---|---|---|---|---|---|
| Behavioral | R1 | Query | Planning Quality | 84.74 | 73.68 | 89.47 |
| Behavioral | R1 | Evidence | Reference Beneficiality | 79.56 | 72.41 | 94.83 |
| Behavioral | R1 | Evidence | Reference Diversity | 72.60 | 96.44 | 97.86 |
| Behavioral | R1 | Evidence | Reference Satisfaction | 76.04 | 82.29 | 84.38 |
| Behavioral | R3 | Richness | Claim Diversity | 77.69 | 87.40 | 96.07 |
| Behavioral | R3 | Concise/Usability | Answer Firstness | 80.47 | 95.05 | 97.66 |
| Behavioral | R3 | Concise/Usability | Answer Useful | 77.34 | 79.08 | 96.07 |
| Behavioral | R3 | Concise/Usability | Redundant Repetition | 78.01 | 85.86 | 98.43 |

Table 3: Comparison of holistic preference alignment.

| Method | AUC |
|---|---|
| GenRM | 70.90 |
| Rubric | 72.13 |
| Reward System (Ours) | 86.48 |

4.2 Validation of Reward System (RQ1)

To evaluate the alignment of our reward system with human expert preferences, we compare on $D_{\text{Eval-Dim}}$ and $D_{\text{Eval-Holistic}}$.

4.2.1 Dimension-Specific Alignment. Tables 1 and 2 show that our hybrid evaluation stack consistently outperforms standard GenRM and Rubric-based baselines across both bottom-line and behavioral layers. In safety and factual grounding (Table 1), our system achieves superior accuracy. Although the Rubric-based method improves upon GenRM by decomposing complex tasks into explicit criteria to reduce holistic scoring ambiguity, it still falls short of our approach. This gap persists because standard Rubrics struggle with ambiguous instructions and content variability, whereas our human-in-the-loop calibration ensures strict alignment with expert consensus. This enables us to enforce bottom-line guarantees that pure LLM judges occasionally miss. Furthermore, GenRM exhibits lower stability due to its high sensitivity to training data distribution, which leads to inconsistent performance on out-of-distribution safety violations.
Table 2 demonstrates our pairwise ranking capability for subjective behavioral objectives, where our approach shows a significant advantage in Robustness and Conciseness. Decoupling these sub-dimensions into interpretable criteria within a human-calibrated hybrid stack allows our system to provide more stable and fine-grained signals. In contrast, while GenRM generates reasoning traces, the Rubric baseline offers more structured diagnosis through explicit criteria yet lacks the continuous expert calibration loop required to align precisely with nuanced human preferences in open-ended generative search.

4.2.2 Holistic Preference Alignment. Table 3 presents holistic alignment results on $D_{\text{Eval-Holistic}}$. Our multi-dimensional reward system achieves the highest overall agreement with human preferences and surpasses GenRM and the Rubric-based baseline. This result validates the effectiveness of our two-layer design, which separates non-negotiable safeguards from user-facing quality objectives, in capturing the complexity of human judgments. The consistent superiority across all subsets confirms that the proposed reward system serves as a reliable objective for the GRPO-based policy optimization.

Table 4: Comparison of different optimization methods across all reward dimensions.

| Method | Query (R1) | Evidence (R1) | Basic (R2) | Hallu (R2) | Format (R2) | Rich (R3) | Usability (R3) |
|---|---|---|---|---|---|---|---|
| SFT | 0.9167 | 0.5809 | 0.9675 | 0.9176 | 0.9697 | 0.9222 | 0.7939 |
| RFT | 0.9720 | 0.5880 | 0.9930 | 0.9260 | 0.9890 | 0.9600 | 0.8950 |
| DPO | 0.9540 | 0.6120 | 0.9885 | 0.9510 | 0.9745 | 0.9690 | 0.8610 |
| GRPO-GenRM | 0.9630 | 0.5980 | 0.9870 | 0.9340 | 0.9670 | 0.9840 | 0.8450 |
| GRPO-Linear | 0.9636 | 0.5861 | 0.9906 | 0.9714 | 0.9730 | 0.9738 | 0.8604 |
| GRPO-Gated (Ours) | 0.9959 | 0.7089 | 0.9875 | 0.9836 | 0.9925 | 0.9832 | 0.9099 |

[Figure 3: Comparison on generation quality of our policy against multiple baselines evaluated by human experts. Preference rates: Ours vs SFT — 44% win / 44% tie / 12% lose; Ours vs DPO — 39% win / 40% tie / 21% lose.]

4.3 Offline Policy Evaluation (RQ2)

To verify whether our proposed optimization strategy translates better reward signals into superior generation quality, we conduct a comprehensive offline evaluation. Relying on the high alignment of our reward system with human judgment demonstrated in RQ1, we employ the system itself as an automated proxy evaluator alongside rigorous human expert assessment.

4.3.1 Automatic Evaluation via Reward System. Table 4 presents the performance of different policy variants evaluated by our multi-dimensional reward system. To facilitate comparison, we report average scores of dimensions within each subset. The results indicate that GRPO-Gated (Ours) achieves the best overall performance, significantly surpassing the SFT baseline and other optimization methods. While RFT and DPO improve upon the SFT baseline, they exhibit limitations in balancing conflicting objectives. Specifically, DPO tends to exploit easy-to-optimize patterns like response length while often failing to strictly adhere to complex safety constraints. Among the GRPO variants, GRPO-GenRM shows instability in safety dimensions due to the lack of hard constraints in its reward signal. GRPO-Linear, utilizing a weighted sum aggregation, suffers from the seesaw effect where improvements in behavioral scores come at the cost of degradation in bottom-line metrics.
In contrast, our Gated Aggregation Strategy ensures that the model optimizes behavioral utility only within the safe region, resulting in simultaneous improvements across both safety and utility dimensions.

4.3.2 Human Expert Evaluation. To corroborate the automatic metrics, we conducted a blind side-by-side human evaluation. We sampled 1,000 distinct queries from the test set and engaged a large pool of expert annotators to compare the responses generated by our method against representative baselines. The experts judged the responses based on critical criteria such as factual consistency, safety, richness, conciseness, and logic. The evaluation results (Figure 3) align consistently with the automatic metrics, showing that our model achieves a significantly higher win rate. Experts observed that our method produces answers that are not only robust to noisy retrieval and semantically rich but also strictly compliant with formatting and safety guidelines. This confirms that the gains observed in the reward scores genuinely reflect improvements in generation quality perceptible to human users.

4.4 Training Dynamics (RQ3)

We analyze the evolution of different reward dimensions during training, as illustrated in Figure 4, to understand how the Gated Aggregation Strategy resolves the inherent antagonism between conflicting objectives. In the early stages of training, we observe a natural seesaw effect in the baseline methods, particularly between Richness and Conciseness; as the model strives to cover more information, it inevitably tends towards verbosity, causing conciseness scores to drop. GRPO-Linear partially mitigates this issue by introducing expert priors through weighted summation, which forces the model to respect the relative importance of each dimension. However, it still treats safety constraints and behavioral objectives as parallel signals, leading to instability where the model occasionally sacrifices Bottom-line Constraints to chase marginal gains in utility. In contrast, GRPO-Gated fundamentally alters this dynamic by establishing a hierarchical dependency. The training curves show that the Gated strategy first "locks in" high scores on bottom-line constraints. Only after these safety metrics stabilize above the threshold does the model begin to effectively optimize for richness and diversity. This mechanism transforms the inter-dimensional conflict into a synergistic evolution, ensuring that the pursuit of user utility never compromises the foundational reliability of the generative search system.

4.5 Online Experiments (RQ4)

To evaluate real-world impact and robustness, we deployed the trained model in the AI search entry of RedNote. We implemented randomized traffic diversion by hashing User IDs and modulo bucketing. We allocated a consistent 10% of the total live traffic to each experimental group. All variants were evaluated synchronously over the same period to eliminate temporal fluctuations, thereby ensuring the fairness and reliability of the statistical conclusions. All reported uplifts are statistically significant (two-sided tests; $p < 0.05$). In particular, for VCR the typical 95% confidence interval is within ±0.1 percentage points, indicating that the observed gains are well above the noise level. Additionally, we explicitly excluded a small subset of vertical business domains from the training set to assess the generalization capability of the model across unseen distributions.
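For context, hash-and-modulo traffic bucketing of this kind can be sketched as follows; the hash function, salt, bucket count, and thresholds are hypothetical illustrations, not RedNote's production configuration.

```python
import hashlib

def assign_bucket(user_id: str, num_buckets: int = 100, salt: str = "searchllm_exp") -> int:
    """Deterministic hash-and-modulo bucketing; the salt decorrelates concurrent experiments."""
    digest = hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def variant_for(user_id: str) -> str:
    bucket = assign_bucket(user_id)
    if bucket < 10:        # 10% of live traffic -> treatment (deployed SearchLLM)
        return "treatment"
    if bucket < 20:        # another 10% -> control (production SFT baseline)
        return "control"
    return "holdout"       # remaining traffic is untouched
```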
[Figure 4: Training dynamics under different reward aggregation strategies. The curves illustrate the evolution of scores across distinct reward dimensions during training, comparing the Gated Aggregation strategy against the Linear baseline. Panels: Query Satisfy, Reference Conflict, Reference Beneficiality, Reference Diversity, Highlight Hallucination, Claim Hallucination, Format, Claim Diversity, Answer Firstness; x-axis: Training Step; y-axis: Reward Score; curves: GRPO-Gated, GRPO-Linear (Weighted Sum), GRPO-Linear (Sum).]

[Figure 5: Results of the online A/B test on the RedNote platform conducted in 2026. The chart displays the relative changes in key user engagement metrics for our deployed model compared to the production baseline (SFT). Panels: Valid Consumption Rate (VCR) ↑, Skip Rate (SR) ↓, Re-search Rate (R) ↓, Bad Case Rate (BCR) ↓; series: In-distribution Domains, Out-of-distribution Domains.]

The online A/B testing results, summarized in Figure 5, demonstrate substantial improvements across all core engagement metrics. Compared to the baseline, our model achieves a significant increase in Valid Consumption Rate (VCR), which indicates that users find the synthesized answers helpful and are willing to spend time reading them. Simultaneously, we observe a marked reduction in Skip Rate (SR) and Re-search Rate (R). These shifts suggest that the generated responses satisfy user needs in a single turn and reduce the necessity for query reformulation. Crucially, the Bad Case Rate (BCR) remains at a minimal level, which ensures the safety and reliability of the deployed system. Furthermore, the performance on the held-out domains confirms the robust generalization of our method. Although different business verticals exhibit distinct data biases regarding content format or information density, our model adapts effectively without explicit fine-tuning. It maintains high safety standards and response quality even in these zero-shot scenarios. This stability validates that our multi-dimensional reward system captures fundamental search principles rather than overfitting to specific training data patterns.

5 Conclusion

In this work, we introduced SearchLLM, an LLM optimized for open-ended generative search on large content platforms. We characterized the unique challenges of this setting, including robustness to noisy retrieval, safety guarantees, and alignment with diverse user needs. To solve these, we designed a multi-dimensional reward system that separates bottom-line constraints, such as factual grounding and format compliance, from behavioral objectives that drive user utility. Our approach uses the Gated Aggregation Strategy to decouple safety constraints from behavioral objectives, enabling effective policy optimization through Group Relative Policy Optimization (GRPO). The deployment of SearchLLM in RedNote, with over 150 million daily page views, was validated through online A/B tests. These tests showed improvements in generation quality and user engagement while maintaining safety standards.
Our approach offers a scalable solution to align open-ended generative search systems with real-world constraints. Future work will extend this optimization to multi-modal contexts and incorporate personalized long-term memory to further refine the user experience. Aligning Large Language Models with Searcher PreferencesKDD ’26, August 9–13, 2026, Jeju, Korea References [1]Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, and Dahua Lin. 2024. InternLM2 Technical Report. arXiv:2403.17297 [cs.CL] https://arxiv.org/abs/2403.17297 [2] ChatGPT. 2026. ChatGPT. https://chatgpt.com/ [3]Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, Huangyu Dai, Xing Xu, Tong Zhao, Mingcan Peng, Xiaoyang Zheng, Chao Wang, Qihang Zhao, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Yuqing Ding, Jing Chen, Chenyi Lei, Wenwu Ou, Han Li, and Kun Gai. 2025. OneSearch: A Preliminary Explo- ration of the Unified End-to-End Generative Framework for E-commerce Search. arXiv:2509.03236 [cs.IR] https://arxiv.org/abs/2509.03236 [4]Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Findings of the As- sociation for Computational Linguistics: ACL 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 2318–2335. doi:10.18653/v1/2024.findings-acl.137 [5] Douyin. [n. d.]. Douyin AI Search. https://search.douyin.com/ [6] Martin J. Eppler and Jeanne Mengis. 2004. The Concept of Information Overload: A Review of Literature from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The Information Society 20 (2004), 325 – 344. https: //api.semanticscholar.org/CorpusID:262540775 [7] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. ArXiv abs/2312.10997 (2023). https://api.semanticscholar.org/CorpusID:266359151 [8] Gemini. 2026. Gemini. https://gemini.google.com/ [9] Melody Y. 
Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. 2025. Deliberative Alignment: Reasoning Enables Safer Language Models. arXiv:2412.16339 [cs.CL] https://arxiv.org/abs/2412.16339 [10]Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. 2025. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. arXiv:2507.17746 [cs.LG] https://arxiv.org/abs/2507.17746 [11] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 8081 (Sept. 2025), 633–638. doi:10.1038/s41586-025-09422-z [12]Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, and Manaal Faruqui. 2025. AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following. 
arXiv:2511.10507 [cs.CL] https://arxiv.org/abs/2511.10507 [13]Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (Koblenz, Germany) (SOSP ’23). Association for Computing Machinery, New York, NY, USA, 611–626. doi:10.1145/3600006.3613165 [14]Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Fer- ret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. 2024. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. In Proceedings of the 41st Interna- tional Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 26874–26901. https://proceedings.mlr.press/v235/lee24t.html [15]Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, and Yingxue Zhang. 2025. Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling. arXiv:2507.14783 [cs.LG] https://arxiv.org/abs/2507.14783 [16]Zhiding Liu, Ben Chen, Mingyue Cheng, Enhong Chen, Li Li, Chenyi Lei, Wenwu Ou, Han Li, and Kun Gai. 2025. Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce. arXiv:2510.16925 [cs.IR] https://arxiv.org/ abs/2510.16925 [17] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https://openreview.net/ forum?id=Bkg6RiCqY7 [18]Dakota Mahan, Duy Van Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. 2024. Generative Reward Models. arXiv:2410.12832 [cs.LG] https://arxiv.org/abs/2410. 12832 [19] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron- LM. In Proceedings of the International Conference for High Performance Com- puting, Networking, Storage and Analysis (St. Louis, Missouri) (SC ’21). Asso- ciation for Computing Machinery, New York, NY, USA, Article 58, 15 pages. doi:10.1145/3458817.3476209 [20]Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR] https://arxiv.org/abs/1901.04085 [21] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 27730–27744. https://proceedings.neurips.c/paper_files/paper/2022/file/ b1efde53be364a73914f58805a001731-Paper-Conference.pdf [22]Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law, and Jingping Shao. 2025. 
Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval. arXiv:2504.01403 [cs.IR] https://arxiv.org/abs/2504.01403 [23]Wenjun Peng, Guiyang Li, Yue Jiang, Zilong Wang, Dan Ou, Xiaoyi Zeng, Derong Xu, Tong Xu, and Enhong Chen. 2024. Large Language Model based Long-tail Query Rewriting in Taobao Search. In Companion Proceedings of the ACM Web Conference 2024 (Singapore, Singapore) (W ’24). Association for Computing Machinery, New York, NY, USA, 20–28. doi:10.1145/3589335.3648298 [24] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Ben- dersky. 2024. Large Language Models are Effective Text Rankers with Pair- wise Ranking Prompting. In Findings of the Association for Computational Lin- guistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 1504–1518. doi:10.18653/v1/2024.findings-naacl.97 KDD ’26, August 9–13, 2026, Jeju, KoreaWei Wu, Peilun Zhou, et al. [25]Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Ste- fano Ermon, and Chelsea Finn. 2023.Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neu- ral Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 53728–53741. https://proceedings.neurips.c/paper_files/paper/2023/file/ a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf [26]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs.LG] https: //arxiv.org/abs/1707.06347 [27]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeek- Math: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL] https://arxiv.org/abs/2402.03300 [28]Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. HybridFlow: A Flexible and Efficient RLHF Framework. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys ’25). Association for Computing Machinery, New York, NY, USA, 1279–1297. doi:10.1145/3689031. 3696075 [29]Siddharth Suri, Scott Counts, Leijie Wang, Chacha Chen, Mengting Wan, Tara Safavi, Jennifer Neville, Chirag Shah, Ryen W. White, Reid Andersen, Georg Buscher, Sathish Manivannan, Nagu Rangan, and Longqi Yang. 2024. The Use of Generative Search Engines for Knowledge Work and Complex Tasks. arXiv:2404.04268 [cs.IR] https://arxiv.org/abs/2404.04268 [30] TikTok. 2026. TikTok. https://w.tiktok.com/ [31] Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. ReFT: Reasoning with Reinforced Fine-Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics, Bangkok, Thailand, 7601–7614. doi:10.18653/v1/2024. acl-long.410 [32]Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu. 2025. Checklists Are Better Than Reward Models For Aligning Language Models. arXiv:2507.18624 [cs.CL] https://arxiv.org/abs/ 2507.18624 [33] Liang Wang, Nan Yang, and Furu Wei. 2023. 
A Qualitative Case Studies
To provide a concrete understanding of the performance improvements, we present three qualitative case studies from real-world traffic. For readability, all case studies are presented in English translation; the original user queries and model responses were in Chinese.
These examples compare the responses generated by SearchLLM against the SFT baseline:
• Robustness to Outdated Evidence (R1): Figure 6 (Case Study 1) demonstrates how our model effectively handles noisy retrieval by filtering out factually correct but temporally outdated information, whereas the baseline fails to recognize the event's expiration.
• Bottom-line Safety Guarantees (R2): Figure 7 (Case Study 2) highlights our model's adherence to strict factual and safety constraints in medical domains, correcting the baseline's dangerous hallucination regarding pathogen classification.
• Alignment with User Needs (R3): Figure 8 (Case Study 3) illustrates how our model aligns with user intent for concise and logical solutions, eliminating the redundancy and logical conflicts observed in the baseline.
B Human Annotation Interface
To ensure the high quality and consistency of our reward signals, we developed a specialized human annotation platform designed to align judges with expert preferences. As illustrated in Figure 9, the interface is structured to support a rigorous review process. For each task, the annotator is presented with the raw user query and the candidate responses, alongside the intent analysis generated by the policy model and the retrieved reference notes. This context ensures that judgments are grounded in accurate information rather than annotator intuition. The annotation workflow consists of two granularities:
• Pointwise Evaluation: Annotators first evaluate each response independently against our bottom-line constraints, scoring specific dimensions such as factual grounding, safety, and format compliance.
• Pairwise Ranking: After individual scoring, annotators perform a side-by-side comparison of candidate responses to provide holistic preference signals, which are closely correlated with the remaining reward dimensions.
This dual-granularity feedback mechanism enables hierarchical governance of non-negotiable safety boundaries and preference-oriented quality metrics during the reward modeling phase.
C Reward Dimension Definitions
In this section, we provide detailed definitions and implementation methods for the multi-dimensional reward system described in Section 3.2. Table 5 lists all criteria used in our hybrid evaluation stack, distinguishing between Bottom-line Constraints (Layer I) and Behavioral Objectives (Layer II). The "Implementation" column indicates whether a metric is computed via deterministic rules (Rule-based) or by model judges (LLM-based). A schematic sketch of how the resulting score vector is gated and aggregated follows Table 5.
D Dataset Specifications
To support training and evaluation, we construct four distinct datasets from RedNote search logs:
• Reward Training Dataset ($D_{\text{RM-Train}}$): Comprising 40,000 samples, this dataset facilitates the calibration of our hybrid evaluation stack and the training of baseline reward modeling methods. Constructed via stratified sampling of random logs and "hard samples" (historical failure cases), it adopts distinct formats: pairwise comparisons $(x, y_w, y_l)$ for subjective preference criteria, and pointwise instances $(x, y, s)$ with binary expert labels ($s \in \{0, 1\}$) for objective constraints.
• Dimension-Specific Test Set ($D_{\text{Eval-Dim}}$): Designed to address RQ1, this held-out diagnostic set features samples explicitly tagged with specific quality issues (e.g., hallucination, formatting errors). Containing 800 to 3,600 examples per dimension, it allows us to evaluate alignment with human judgment on individual criteria.
• Holistic Preference Test Set ($D_{\text{Eval-Holistic}}$): Also serving RQ1, this dataset comprises 2,800 pairwise comparisons sourced from historical model checkpoints and human writing. It utilizes general "win/loss" labels to assess the holistic preference alignment of the reward system.
• RL Optimization Dataset ($D_{\text{RL}}$): Comprising 500,000 unlabeled tuples $(q, h, E)$, this dataset supports policy optimization (RQ2–4). The training set mixes random traffic with user complaint queries to enhance robustness.
E Evaluation Metrics
In this section, we provide the detailed definitions for the offline and online metrics used to assess our system; a minimal reference computation of these metrics is sketched at the end of the appendices.
Offline Evaluation. We validate the reward model's alignment with human experts using two standard metrics:
• Accuracy (ACC): Measures the agreement between reward signals and expert labels on binary constraint reward dimensions (e.g., Factuality, Format Compliance).
• Area Under the Curve (AUC): Evaluates the pairwise ranking capability on subjective preference dimensions (e.g., Richness, Usability). Given $N$ pairs $(y_w, y_l)$, where $y_w$ is the preferred response, and a reward score $s(\cdot)$, the AUC is computed as:
$$\mathrm{AUC} = \frac{1}{N} \sum_{i=1}^{N} \Big[ \mathbb{I}\big(s(y_w^{(i)}) > s(y_l^{(i)})\big) + 0.5 \cdot \mathbb{I}\big(s(y_w^{(i)}) = s(y_l^{(i)})\big) \Big]. \quad (7)$$
Online Evaluation. To measure real-world impact, we conducted online A/B testing and monitored the following metrics:
• Valid Consumption Rate (VCR): The percentage of sessions where users dwell on the generated answer for a meaningful duration (> 5s), indicating high utility.
$$\mathrm{VCR} = \frac{1}{M} \sum_{j=1}^{M} \mathbb{I}\big(T^{(j)}_{\mathrm{dwell}} > 5\,\mathrm{s}\big). \quad (8)$$
• Skip Rate (SR): The percentage of sessions where the answer is scrolled past immediately (< 1.5s), serving as a proxy for irrelevance or poor formatting.
$$\mathrm{SR} = \frac{1}{M} \sum_{j=1}^{M} \mathbb{I}\big(T^{(j)}_{\mathrm{dwell}} < 1.5\,\mathrm{s}\big). \quad (9)$$
• Re-search Rate (R): The frequency with which users issue a reformulated query immediately after viewing the result, indicating dissatisfaction.
• Bad Case Rate (BCR): The percentage of responses containing severe bottom-line violations (e.g., safety risks, obvious hallucinations). This metric is estimated via a daily human audit of sampled logs.
F Supplementary Training Dynamics
We present the evolution curves for the remaining reward dimensions that were not visualized in the main text due to space constraints. As shown in Figure 10, the Gated Aggregation Strategy demonstrates consistent stability across the full spectrum of reward signals, corroborating the conclusions drawn in Section 4.4.
User Query (Context: current date Jan 2026; the event took place in Aug 2025): When can I listen to the song "Skibidi" for free?
Response from SearchLLM (Ours) (Temporal Relevance): J Lin and Jackie Chan's song "Skibidi" has already been released (premiered on Aug 27, 2025). Since the radio premiere event has ended, you can currently listen to it for free on the following platforms:
• Video Platforms: Search for the official music video on Bilibili or YouTube [Doc 2].
• Music Apps: It is now available on Spotify, Apple Music, and KKBox (free tier supported) [Doc 1,6].
Response from Baseline (SFT) (Outdated Information): J Lin and Jackie Chan's new song "Skibidi" will premiere globally on August 27, 2025, at 10:00 AM via major radio stations [Doc 3,4]. If you want to listen for free, you can try the following methods:
• Radio Premiere: This is the official free channel. Please tune in to stations like Hit Fm, Singapore UFM100.3, or Malaysia MY FM at the specific time [Doc 3,4].
• Online Radio: Use the "Radio" feature in NetEase Cloud Music to find the station and collect it [Doc 6].
• Video Platforms: Although the premiere is on radio, users often upload recordings to video sites later [Doc 1,2].
Analysis & Critique. Robustness to Outdated Evidence (R1): The baseline model fails to validate the temporal applicability of the retrieved evidence. It treats a historical event (the 2025 premiere) as a valid solution for a current query (2026), resulting in a non-actionable response. SearchLLM successfully identifies that the premiere information is outdated and filters it out, prioritizing currently accessible streaming channels.
Figure 6: Case Study 1 highlighting improved temporal awareness for time-sensitive queries.
User Query: Why do my 'Peach Egg' (Graptopetalum amethystinum) leaves have tumor-like bumps?
Response from SearchLLM (Ours): The "tumor-like" growths on your Peach Egg leaves are likely symptoms of Anthracnose. Diagnosis: This is a fungal disease (usually caused by Colletotrichum) often triggered by high humidity or poor ventilation [Doc 2]. Treatment:
• Isolate: Separate the plant immediately to prevent spores from spreading.
• Prune: Cut off infected leaves with sterilized scissors to reduce the source of infection.
• Apply Fungicide: Use broad-spectrum fungicides like Carbendazim or Chlorothalonil [Doc 2]. Spray 2-3 times, ensuring coverage on both sides of the leaves.
Response from Baseline (SFT): The "tumor-like" growths on Peach Egg leaves are likely typical symptoms of Anthracnose. This is a disease caused by bacteria [Doc 2]. Initial symptoms include small black spots... How to handle:
• Immediate Isolation: Separate from healthy succulents to prevent bacterial transmission.
• Spray Medication: You can use agents specifically for bacterial diseases, such as "Flower Health" antibiotics [Doc 2]. Spray 2-3 times...
Daily Care Tips:
• Control Watering: Water only when the soil is dry ("better dry than wet").
• Ventilation: Keep the environment ventilated to avoid humidity.
• Avoid Wounds: Bacteria enter easily through wounds.
Analysis & Critique. Factual & Safety Guarantee (R2): The baseline model correctly identifies the disease name ("Anthracnose") but hallucinates the biological classification, labeling it as bacterial instead of fungal. Consequently, it recommends antibiotics ("Flower Health") which are ineffective against fungal pathogens. Our model correctly identifies the fungal nature and prescribes appropriate fungicides.
Figure 7: Case Study 2 demonstrating the correction of factual hallucinations in domain-specific queries.
User Query: Which gate should I enter for Baixi from Wanda Resort?
Response from SearchLLM (Ours): From Wanda Resort to the Changbai Mountain Baixi scenic area, you should enter through the Baixi Distribution Center Main Entrance. Key Steps:
(1) Entrance: Proceed directly to the main tourist hall for security check and ticket validation [Doc 1].
(2) Documents: You must bring your ID card (physical ID required) for pre-check [Doc 1].
(3) Timing: Arrive 30 minutes before your booked slot [Doc 1].
Response from Baseline (SFT): From Wanda to Changbai Mountain, it is usually recommended to enter from the West Slope Distribution Center main entrance [Doc 1]. This is the main pick-up point for buses.
Traffic and Cost:
• Taxi: From Wanda to Baixi is about 20 yuan more expensive than West Slope. Note: Taxis here do not use meters, so keep the driver's phone number [Doc 1].
Tour Tips:
• Motion Sickness: The mountain road has 18 bends, so prepare medicine [Doc 1].
• Clothing: Even in summer the wind is strong, so wear a windbreaker [Doc 1].
Analysis & Critique. Alignment with User Needs (R3): (1) Entity Error: The baseline confuses "Baixi" with the "West Slope Center". (2) Redundancy: It includes excessive irrelevant details (taxi scams, motion sickness) that dilute the answer. (3) Logic Error: It assumes the season is summer without user input. Our model aligns with the user's specific need for navigation instructions.
Figure 8: Case Study 3 showcasing the reduction of logical inconsistencies and redundancy in complex tasks.
Figure 9: Screenshot of the human annotation interface. The workbench presents the query, user intent analysis, and reference notes. Annotators provide dual-granularity feedback: fine-grained scores on specific dimensions and holistic pairwise rankings.
Table 5: Detailed Definitions of Multi-Dimensional Reward Criteria.
Layer | Requirement | Subset | Reward Dimension | Implementation | Definition / Role
Layer I (Bottom-line) | R2 | Hallucination | Highlight Hallucination | LLM-based | First sentence free of obvious hallucinations.
Layer I (Bottom-line) | R2 | Hallucination | Claim Hallucination | LLM-based | Non-hallucination at the claim level.
Layer I (Bottom-line) | R2 | Hallucination | LLM Knowledge | LLM-based | Checks knowledge correctness with external validation.
Layer I (Bottom-line) | R2 | Hallucination | No Supply Reject | LLM-based | Proper refusal behavior when no evidence is available.
Layer I (Bottom-line) | R2 | Basic Quality | Self Consistency | LLM-based | Logical self-consistency, no internal contradictions.
Layer I (Bottom-line) | R2 | Basic Quality | Answer Quality | LLM-based | Avoids gibberish or extremely low-quality answers.
Layer I (Bottom-line) | R2 | Basic Quality | Multi-round Repeat Conflict | LLM-based | No strong repetition or contradiction across turns.
Layer I (Bottom-line) | R2 | Format | Format | Rule-based | Enforces structured and well-formed output format.
Layer I (Bottom-line) | R2 | Format | Response Length | Rule-based | Keeps answer length in an acceptable range.
Layer II (Behavioral) | R1 | Query | Query Satisfaction | LLM-based | Basic intent alignment, avoids "answering B when asked A".
Layer II (Behavioral) | R1 | Query | Planning Quality | LLM-based | Quality of reasoning and planning under uncertainty.
Layer II (Behavioral) | R1 | Evidence | Reference Beneficiality | LLM-based | Evidence truly helps answer the question.
Layer II (Behavioral) | R1 | Evidence | Reference Conflict | LLM-based | Detects strong conflicts among evidence items.
Layer II (Behavioral) | R1 | Evidence | Reference Irrelevant | LLM-based | Detects evidence completely irrelevant to the query.
Layer II (Behavioral) | R1 | Evidence | Reference Satisfaction | LLM-based | Evidence covers primary and secondary needs.
Layer II (Behavioral) | R1 | Evidence | Reference Diversity | LLM-based | Promotes diverse evidence for robustness.
Layer II (Behavioral) | R3 | Richness | Claim Diversity | LLM-based | Rich and diverse answer claims.
Layer II (Behavioral) | R3 | Usability | Answer Useful | LLM-based | Low fraction of off-topic content; conciseness.
Layer II (Behavioral) | R3 | Usability | Answer Firstness | LLM-based | Places core answer early for quick consumption.
Layer II (Behavioral) | R3 | Usability | Redundant Repetition | LLM-based | Controls severe semantic redundancy and rambling.
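The dimensions in Table 5 can be read as a per-response score vector that is first screened against the Layer I constraints and only then aggregated over the Layer II objectives. The following Python sketch illustrates that two-layer layout under explicit assumptions: the dimension names mirror Table 5, but the snake_case keys, the format/length proxies, the 0.5 threshold, the helper names (rule_based_checks, gated_reward), and the zero-on-violation gating rule are illustrative stand-ins rather than the exact Gated Aggregation Strategy or production checks.

```python
# Minimal sketch of the two-layer reward layout in Table 5, assuming each
# response has already been scored per dimension in [0, 1]. All names and the
# gating rule below are illustrative assumptions, not the deployed system.

BOTTOM_LINE_DIMS = [  # Layer I: non-negotiable constraints
    "highlight_hallucination", "claim_hallucination", "llm_knowledge",
    "no_supply_reject", "self_consistency", "answer_quality",
    "multi_round_repeat_conflict", "format", "response_length",
]
BEHAVIORAL_DIMS = [  # Layer II: preference-oriented objectives
    "query_satisfaction", "planning_quality", "reference_beneficiality",
    "reference_conflict", "reference_irrelevant", "reference_satisfaction",
    "reference_diversity", "claim_diversity", "answer_useful",
    "answer_firstness", "redundant_repetition",
]


def rule_based_checks(response: str, max_chars: int = 2000) -> dict:
    """Deterministic scores for the Rule-based rows of Table 5 (crude proxies)."""
    looks_structured = response.lstrip().startswith(("#", "-", "*", "1."))
    return {
        "format": 1.0 if looks_structured else 0.0,
        "response_length": 1.0 if 0 < len(response) <= max_chars else 0.0,
    }


def gated_reward(scores: dict, threshold: float = 0.5) -> float:
    """Assumed gating: any violated bottom-line dimension collapses the training
    reward to 0; otherwise the behavioral dimensions are averaged."""
    if any(scores.get(dim, 1.0) < threshold for dim in BOTTOM_LINE_DIMS):
        return 0.0
    behavioral = [scores[dim] for dim in BEHAVIORAL_DIMS if dim in scores]
    return sum(behavioral) / len(behavioral) if behavioral else 0.0
```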
Figure 10: Supplementary training dynamics showing the evolution of the remaining reward dimensions (Planning Quality, Reference Irrelevant, Reference Satisfaction, Answer Quality, Self Consistency, Multi-round Repeat Conflict, LLM Knowledge, No Supply Reject, Response Length, Format, Answer Useful, and Redundant Repetition) over training steps, comparing GRPO-Gated, GRPO-Linear (Weighted Sum), and GRPO-Linear (Sum).
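For completeness, the offline and online metrics defined in Appendix E (Eqs. 7-9) admit a direct reference computation. The sketch below assumes scored response pairs and per-session dwell times are already available as plain Python sequences; the function names and toy inputs are ours and are not part of the deployed evaluation pipeline.

```python
from typing import Sequence, Tuple


def auc(pairs: Sequence[Tuple[float, float]]) -> float:
    """Eq. (7): pairwise ranking AUC over (s(y_w), s(y_l)) score pairs, ties count 0.5."""
    hits = sum(1.0 if sw > sl else 0.5 if sw == sl else 0.0 for sw, sl in pairs)
    return hits / len(pairs)


def valid_consumption_rate(dwell_times_s: Sequence[float]) -> float:
    """Eq. (8): share of sessions with dwell time above 5 seconds."""
    return sum(t > 5.0 for t in dwell_times_s) / len(dwell_times_s)


def skip_rate(dwell_times_s: Sequence[float]) -> float:
    """Eq. (9): share of sessions with dwell time below 1.5 seconds."""
    return sum(t < 1.5 for t in dwell_times_s) / len(dwell_times_s)


if __name__ == "__main__":
    # Toy numbers for illustration only (not results from the paper).
    print(auc([(0.9, 0.4), (0.7, 0.7), (0.3, 0.6)]))      # 0.5
    print(valid_consumption_rate([1.0, 6.2, 12.5, 4.9]))  # 0.5
    print(skip_rate([1.0, 6.2, 12.5, 4.9]))               # 0.25
```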