
Paper deep dive

A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

Weiming Wu, Zi-Jian Cheng, Jie Meng, Peng Zhen, Shan Huang, Qun Li, Guobin Wu, Lan-Zhe Guo

Year: 2026 | Venue: arXiv preprint | Area: cs.AI | Type: Preprint | Embeddings: 79

Abstract

The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity. Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links



Full Text

78,363 characters extracted from source content.


A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

Weiming Wu* (wuwm23@smail.nju.edu.cn, Nanjing University); Zi-Jian Cheng* (chengzj@lamda.nju.edu.cn, Nanjing University); Jie Meng (jmengjie@didiglobal.com, Didichuxing Co. Ltd); Peng Zhen (zhenpeng@didiglobal.com, Didichuxing Co. Ltd); Shan Huang (lattehuang@didiglobal.com, Didichuxing Co. Ltd); Qun Li (liquntracy@didiglobal.com, Didichuxing Co. Ltd); Guobin Wu (wuguobin@didiglobal.com, Didichuxing Co. Ltd); Lan-Zhe Guo† (guolz@lamda.nju.edu.cn, Nanjing University)

Abstract

The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between general visual semantics and rigorous evidentiary protocols, often leading to perceptual hallucinations and logical looseness. To address these systemic misalignments, we introduce RideJudge, a Progressive Visual-Logic-Aligned Framework. Instead of relying on generic pre-training, we bridge the semantic gap via SynTraj, a synthesis engine that grounds abstract liability concepts into concrete trajectory patterns. To resolve the conflict between massive regulation volume and limited context windows, we propose an Adaptive Context Optimization strategy that distills expert knowledge, coupled with a Chain-of-Adjudication mechanism to enforce active evidentiary inquiry. Furthermore, addressing the inadequacy of sparse binary feedback for complex liability assessment, we implement a novel Ordinal-Sensitive Reinforcement Learning mechanism that calibrates decision boundaries against hierarchical severity.
Extensive experiments show that our RideJudge-8B achieves 88.41% accuracy, surpassing 32B-scale baselines and establishing a new standard for interpretable adjudication.

CCS Concepts: • Information systems → Decision support systems.

Keywords: MLLM, Ride-Hailing Adjudication, LLM Reasoning

* Equal contribution. Work done during internship at Didichuxing Co. Ltd.
† Corresponding author.

1 Introduction

The rapid development of on-demand ride-hailing platforms has fundamentally revolutionized the landscape of urban mobility and transportation dynamics. Within this ecosystem, the adjudication of responsibility disputes serves as a cornerstone for maintaining fairness and marketplace stability. As these transportation systems increasingly rely on decentralized operational models, the volume of service disputes, ranging from route deviations to cancellation disagreements, has grown exponentially. Consequently, efficiently and objectively resolving these disputes has emerged as a vital challenge in the field of intelligent operational systems [19, 33].

Ride-hailing adjudication is a complex task that necessitates the integration of heterogeneous data sources, such as vehicle trajectories, passenger-driver behavioral records, and order metadata, to perform reasoning according to platform rules. Conventionally, automated solutions for this task have relied on traditional models [3, 13], which often suffer from limited accuracy when processing complex multimodal interactions. More critically, these methods typically yield only adjudication labels, lacking the capacity to provide the reasoning trails required to justify verdicts.

Recently, Multimodal Large Language Models (MLLMs) [1, 8, 22, 29] have emerged as a promising paradigm for automated liability adjudication. By leveraging their generative capabilities, these models theoretically enable the synthesis of heterogeneous data sources to derive interpretable verdicts.
However, the direct application of off-the-shelf MLLMs to this specialized domain is hindered by intrinsic misalignments between general pre-training objectives and the rigorous demands of judicial reasoning.

Why General MLLMs Fail? We identify three disconnects that prevent general models from functioning as reliable adjudicators:

General Perceptual vs. Domain-Specific Grounding. Existing vision encoders, predominantly pre-trained on natural image-caption pairs, exhibit a semantic rupture when interpreting abstract, schematic navigation maps. While these models can recognize basic primitives, they lack the specialized perceptual grounding required to translate visual anomalies, such as subtle trajectory drift or abnormal stops, into precise juridical liability concepts. This misalignment renders generic models incapable of distinguishing fine-grained violation patterns from standard driving fluctuations.

Probabilistic Generation vs. Deterministic Logical Deduction. Adjudication requires a rigorous reasoning schema involving identifying facts, selecting applicable rules, and deducing verdicts, which stands in contrast to the probabilistic nature of LLM generation. Despite their promise in open-ended tasks, MLLMs often lack the logical robustness necessary for multi-hop, evidentiary reasoning. In high-stakes scenarios, they struggle to maintain a rigorous evidentiary chain, frequently yielding verdicts that are either logically decoupled from established facts or internally inconsistent with the cited platform regulations.

Fixed Contextual Capacity vs. Open-World Rule Scaling. Unlike standard tasks with self-contained contexts, ride-hailing adjudication operates within a dynamic and expansive knowledge ecosystem comprising evolving regulations and historical precedents. This legal corpus often exceeds the finite context windows of standard MLLMs.

[arXiv:2603.17328v1 [cs.AI] 18 Mar 2026]
Moreover, general models lack the scenario-aware retrieval mechanisms needed to filter relevant statutes from massive repositories, leading to the injection of contextual noise and a degradation in the precision of the reasoning process.

To address these challenges, we propose a Progressive Visual-Logic-Aligned Framework for the ride-hailing adjudication task. Our core insight is to treat adjudication as a cognitive mirroring process, where the model learns to bridge the epistemic gap between raw spatiotemporal dynamics and high-level juridical reasoning, simulating the rigorous workflow of human experts.

To bridge the perception and reasoning gaps, we first introduce an automated data synthesis engine. This method includes a visual perception synthesis module, SynTraj, which employs programmatic simulation to generate fine-grained trajectory data rich in specific violation semantics. Additionally, it incorporates a logic synthesis module that utilizes a chain-of-adjudication reasoning process to construct rigorous reasoning trails with evidence that mirror expert decision-making patterns.

To resolve the utilization bottleneck of massive knowledge, we introduce an adaptive context optimization strategy. To handle the scale of rules and cases, this module employs a scenario-aware rule pruning mechanism to strictly filter relevant regulations and a dynamic case retrieval system to extract expert guidance from historical precedents, which allows the model to consult external knowledge and align with domain protocols without context overload.

Finally, addressing the limitations of the traditional GRPO answer reward, where sparse binary feedback fails to distinguish error severity, we design an Ordinal-Sensitive Reward. By assigning graded signals based on the semantic proximity between predictions and ground truths, this mechanism mitigates reward sparsity.
Capitalizing on this, we implement a three-stage progressive training paradigm to align the model from visual semantic understanding to complex, hierarchically consistent decision-making.

Our contributions can be summarized as follows: 1) We propose a specialized multimodal framework for ride-hailing adjudication that effectively aligns visual trajectory data with rigorous juridical rules; 2) We design an integrated solution featuring automated visual-logic synthesis, adaptive context optimization for knowledge integration, and a progressive training paradigm with an ordinal-sensitive reward; and 3) We validate the effectiveness and reliability of our method through extensive experiments on multiple real-world datasets, demonstrating superior performance in complex adjudication tasks.

2 Preliminaries

In this section, we formalize the operational logic of the ride-hailing platform, define the comprehensive data structures for orders, and formulate the core task of intelligent adjudication.

2.1 Ride-Hailing Order

We first establish the formal definitions for the data structures generated during the ride-hailing service lifecycle.

Definition 1: Ride-Hailing Order. A ride-hailing order is composed of textual metadata and visual spatial data.

1) Textual Information ($O_{text}$): Upon order acceptance, the system generates an initial static information tuple:

$o_{init} = (l_{driver}, l_{start}, l_{end}, P_{driver}, P_{pas})$  (1)

where $l_{driver}$, $l_{start}$, and $l_{end}$ denote the driver's acceptance point, the order start point, and the destination; $P_{driver}$ and $P_{pas}$ represent the generic profiles of the driver and passenger, respectively.

As the order progresses from the pickup to its final termination, resulting from either successful completion or abnormal cancellation, the system records dynamic behavioral features. We define the accumulated textual features as:

$O_{text} = (o_{init}, \mathcal{F}_{driver}, \mathcal{F}_{pas})$  (2)

Here, $\mathcal{F}_{driver} = \{d_1, d_2, \ldots, d_i\}$ represents accumulated driver behavioral statistics such as stationary duration and detour distance; $\mathcal{F}_{pas}$ represents the passenger behavioral statistics.

2) Visual Information ($O_{image}$): To effectively capture the spatial semantics and road network structure, we adopt a rasterization-based approach inspired by recent advances in visual trajectory modeling [2, 4, 20]. Specifically, we construct a visual representation $O_{image}$ by rendering both the driver's executed path $\mathcal{T}_{real}$ and the algorithmically planned route $\mathcal{T}_{nav}$ onto a roadmap that contains the detailed road layout. This spatial superposition serves as a strong visual prior and enables the model to intuitively capture fine-grained behavioral patterns of both the driver and the passenger, such as unexpected detours or abnormal stops [19], which are critical for liability assessment.

Finally, the comprehensive Multimodal Context is defined as $\mathcal{O} = (O_{text}, O_{image})$.

Definition 2: Disputed Order. A Disputed Order is defined as an order instance where the trip is cancelled by either the driver or the passenger abnormally. This cancellation event marks the termination of the service recording and flags the order for potential liability review.

2.2 The Ride-Hailing Adjudication Task

To maintain the ecological balance of the marketplace and ensure fairness [10], the platform must determine liability for every disputed order. This process requires the platform to arbitrate the responsibility based on explicit regulations, categorizing the driver and passenger behaviors into a hierarchical label space $\mathcal{Y}$ which consists of multiple levels representing different degrees of liability.

Given the comprehensive order context $\mathcal{O}$ and the external Adjudication Knowledge Base $\mathcal{K}$, our goal is to construct an intelligent
The model must effec- tively integrate the heterogeneous information in풪with specific rules in풦to derive a logically sound verdict. Conventionally, automated solutions for this task have relied on traditional discriminative models [ 3,13]. However, these ap- proaches often suffer from limited accuracy when processing com- plex multimodal spatial-temporal data. More critically, these meth- ods typically yield only classification labels, lacking the capacity to provide the transparent reasoning trails required to justify ver- dicts [23], and are prone to robustness issues when facing long-tail scenarios. To address these challenges and automate the workflow, we propose a Progressive Visual-Logic-Aligned Framework. As illus- trated in Figure1, our method consists of three components de- signed to bridge the gap between multimodal data and expert adju- dication logic. These components include an Automated Data Syn- thesis Framework for bridging domain gaps via visual-linguistic alignment and logical reasoning reconstruction [ 28], an Adaptive Context Optimization Strategy for dynamic rule pruning and ex- pert precedent extraction, and a Progressive Training Paradigm that aligns the model from semantic understanding to complex decision-making via ordinal-sensitive reinforcement learning. hai 3 Training Data Construction 3.1 SynTraj: Bridging Visual Semantics To establish a fundamental alignment between ride-hailing con- cepts and their visual representations, we propose an automated SyntheticTrajectory generation framework. SynTraj injects spa- tiotemporal priors into the model, ensuring it can perceptually ground abstract adjudication behaviors into concrete trajectory pat- terns. The pipeline consists of two tightly coupled processes named structural route planning and behavioral trajectory simulation. Navigation Route Planning.We first establish the ground truth by constructing valid navigation routes. 
We sample Point of Interest pairs from major cities in China and query standard navigation APIs¹ to obtain route data. Drawing upon standard notations in trajectory mining literature, we decompose the route into a synchronized geometric backbone and an instruction sequence.

The Geometric Sequence $S_{geo}$ is modeled as a continuous coordinate chain linking the start point $l_{start}$ and end point $l_{end}$, interspersed with critical intersection nodes $k$ and dense segment points $P$:

$S_{geo} = [l_{start}, P_1, k_1, P_2, k_2, \ldots, k_M, P_{M+1}, l_{end}]$  (4)

where $k_m$ represents the $m$-th critical intersection node and $P_m$ denotes the dense sequence of GPS coordinates constituting the road segment between $k_{m-1}$ and $k_m$. Synchronized with this is the Instruction Sequence $S_{inst}$ containing $M$ instructions. Each instruction $I_m$ specifies the driving behavior along segment $P_m$ and the required maneuver at node $k_m$.

¹ In this work, we utilize the open platform: https://lbs.amap.com.

Driver Trajectory Synthesis. Based on the planned route, we simulate diverse driver behaviors ranging from strict compliance to specific violations. For compliant behaviors, we simulate a driver faithfully following the platform guidance. We generate a trajectory that traverses $S_{geo}$ by applying Gaussian noise perturbations to the coordinates to simulate intrinsic GPS errors and natural driving fluctuations. This ensures the trajectory remains structurally aligned while exhibiting realistic sensor irregularities.

For abnormal behaviors, our engine supports multiple mutation strategies (detailed comprehensively in Appendix B.1). Here, we illustrate the mechanism using the "Unintentional Deviation" scenario as a representative example. We first randomly select an intersection node $k_j$ from $S_{geo}$ as the anchor. To simulate the deviation dynamics, we define the intended direction vector $\vec{v} = p_{next} - k_j$, where $p_{next}$ denotes the first point immediately following
We then apply a rotation operator ℛ 휃 to generate a deviation vector ⃗ 훿 =ℛ 휃 (⃗푣)with a randomized magnitude. The mutated anchor point is calculated as훾 = 푘 푗 + ⃗ 훿. We subsequently query the API for a new path from훾to the desti- nation and stitch this deviation path with the historical compliant segment. This process yields a final trajectory푇 푟푒푎푙 that maintains geometric consistency while injecting precise liability semantics: 푇 푟푒푎푙 = [푙 푠푡푎푟푡 , ... , 푘 푗 ] ⏟ History from푆 푔푒표 ⊕ [훾, ... , 푙 푒푛푑 ] ⏟ Mutated Path (5) Multimodal Pair Generation.Finally, we generate the aligned visual-textual pairs. For the visual input, we employ the protocol defined in the preliminary section to render the synthesized driver trajectory푇 푟푒푎푙 and the original navigation route푆 푔푒표 into an image. This ensures the visual difference explicitly reflects the behavioral patterns. For the textual label, we utilize the meta-information from the instruction sequence푆 푖푛푠푡 to instantiate natural language de- scriptions. This process yields a dataset of 12,585 high-quality image- caption pairs, enabling the model to learn the correspondence be- tween visual trajectory patterns and textual adjudication concepts without the noise inherent in real-world data. 3.2 Chain-of-Adjudication: Synthesizing Adjudication Reasoning To transcend the limitations of existing black box models, which of- ten decouple reasoning from raw evidence [ 7,17,23], we propose Chain-of-Adjudication(CoA), which synthesizes rigorous and evidence-backed reasoning chains mirroring professional judicial workflows. Unlike standard end-to-end approaches [18,30] that in- discriminately process all modalities, CoA is designed to simulate the procedural rigor of a human judge, inspired by recent advances in multimodal reasoning [ 11,24,28]. It begins from theAdjudicator, which is an LLM serving as the reasoning engine, to assimilating the textual context푂 푡푒푥푡 while concurrently retrieving specific liability clauses from풦. 
Crucially, the Adjudicator is intentionally isolated from direct visual input. This structural constraint forces the agent to transition from passive perception to active inquiry [12, 38], meaning that instead of hallucinating visual details from captions, the Adjudicator must formulate precise and hypothesis-driven queries to verify the factual predicates required by the retrieved adjudication rules.

[Figure 1: The pipeline consists of three pivotal phases: (1) Automated Data Synthesis (Sec. 3), which bridges domain gaps via two specialized modules: SynTraj Construction for visual-linguistic alignment and Chain-of-Adjudication Synthesis for logical reasoning reconstruction; (2) Knowledge-Aware Context Refinement (Sec. 4), capable of dynamic rule pruning and expert precedent extraction; and (3) Progressive Juridical Alignment (Sec. 5), a multi-stage training paradigm culminating in OS-rewarded reinforcement learning for precise decision boundary alignment.]

To satisfy these evidentiary inquiries, the framework incorporates a Visual Analyst functioning as an auxiliary perceptual anchor. The Analyst receives specific verification queries from the Adjudicator and examines $O_{image}$ to provide objective fact-based descriptions. This establishes a robust Iterative Verification Loop wherein the Adjudicator hypothesizes potential violations based on the rules and the Analyst validates or refutes them with trajectory evidence. This multi-turn interaction ensures that every logical step is explicitly anchored in spatiotemporal reality, effectively bridging the gap between abstract liability concepts and concrete visual patterns.

The raw interaction history typically retains conversational redundancies that are suboptimal for training. To address this, we deploy a Reasoning Refiner acting as a meta-cognitive editor to distill the fragmented dialogue into a coherent adjudication path.
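One way to picture the Adjudicator–Analyst–Refiner interaction described above is the skeleton below, with plain callables standing in for the three LLM roles. This is a hedged sketch of the control flow only, not the paper's implementation; the dict-based message protocol is an assumption.

```python
def chain_of_adjudication(o_text, o_image, rules, adjudicator, analyst, refiner,
                          max_turns=4):
    """Iterative Verification Loop: the text-only Adjudicator either issues a
    visual verification query or commits to a verdict; the Visual Analyst
    answers queries from the trajectory image; the Refiner distills the
    dialogue into a coherent reasoning chain. Assumes max_turns >= 1."""
    history = []
    for _ in range(max_turns):
        step = adjudicator(o_text, rules, history)  # {"query": ...} or {"verdict": ...}
        if "verdict" in step:
            break
        answer = analyst(o_image, step["query"])    # fact-based visual description
        history.append((step["query"], answer))
    return refiner(history, step)                   # distilled adjudication path
```

With real models, each callable would wrap an LLM call; the loop terminates either on a verdict or after `max_turns` rounds of evidentiary inquiry.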
Specifically, the Refiner restructures the content into a standardized four-stage format comprising: 1) Information Analysis, which systematically summarizes the order metadata and dispute context; 2) Visual Evidence Integration, which incorporates the objective trajectory facts verified by the Analyst; 3) Rule Grounding, which maps the established facts to specific liability clauses within $\mathcal{K}$; and 4) Comprehensive Adjudication, which performs the final logical deduction to derive the verdict.

Finally, we implement a rigorous Data Selection process to guarantee corpus reliability. We systematically filter out instances where the synthesized verdict $\hat{y}$ diverges from the human-annotated $y_{gt}$ or where the historical order is marked as Ambiguous. This stringent filtering yields a final corpus of 14,582 samples, ensuring that the dataset consists exclusively of high-confidence and evidence-supported reasoning trajectories.

4 Knowledge-Aware Context Refinement

Ride-hailing adjudication requires consulting a massive Knowledge Base ($\mathcal{K}$) and referencing extensive historical precedents. However, directly feeding this extensive corpus into the model exceeds the context window limits of standard MLLMs and introduces irrelevant noise, which distorts the reasoning process. To address this, we propose an Adaptive Context Optimization strategy, inspired by recent advances in agentic context engineering [34] and optimization-based reasoning [9]. This approach selectively filters relevant rules and distills expert consensus from historical data to construct a precise and information-rich context for the final reasoning stage.

4.1 Scenario-Aware Rule Calibration

To efficiently identify applicable statutes from the fine-grained Rule Base $\mathcal{K} = \{r_1, \ldots, r_N\}$, we propose a Decomposed Ensemble Calibrator.
Filter Training. We formulate the rule calibration task as a multi-label classification problem via a binary decomposition strategy [27]. We first construct a specialized dataset $\mathcal{D}_{cal}$ consisting of representative order samples annotated with binary applicability vectors. We decompose the multi-label task into an ensemble of $N$ independent binary classifiers $\mathcal{E} = \{f_1, \ldots, f_N\}$. Each classifier $f_i$, implemented via scalable tree boosting models [3], determines the binary relevance of a specific rule $r_i$ based on a generic input order $O$:

$f_i(O) = \begin{cases} 1 & \text{if rule } r_i \text{ is applicable} \\ 0 & \text{otherwise} \end{cases}$  (6)

This decomposition ensures robustness against statistical dependencies between distinct regulations and allows for the flexible use of various classifiers.

Application. During the inference phase, given a specific query order $O_{query}$, we employ this trained ensemble to prune the knowledge base. We aggregate the outputs to construct a scenario-specific rule subset $\mathcal{K}'$ by retaining only the rules predicted as positive:

$\mathcal{K}' = \{r_i \in \mathcal{K} \mid f_i(O_{query}) = 1\}$  (7)

By strictly filtering out irrelevant clauses, this mechanism significantly reduces context redundancy and prevents the reasoning model from generating hallucinations based on inapplicable regulations.

4.2 Retrieval-Augmented Insight Extraction

To leverage tacit adjudication knowledge, we implement a Retrieve-then-Extract paradigm, integrating Retrieval-Augmented Generation [15] with Case-Based Reasoning.

Vectorized Precedent Retrieval. We maintain a dynamically updated historical repository $\mathcal{D}_{hist} = \{(O^{(j)}_{text}, y^{(j)})\}_{j=1}^{N}$. Crucially, to prevent data leakage, this repository consists strictly of orders processed prior to the timestamp of the current query.
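The two context-refinement mechanisms of this section, scenario-aware rule pruning (Eqs. 6–7) and Top-K precedent retrieval by cosine similarity, reduce in essence to the following sketch. The classifiers and embeddings are assumptions here (the paper uses tree-boosting filters and a dense retriever); any callables and vectors with the same shapes would do.

```python
import math

def prune_rules(order, rule_base, classifiers):
    """Eq. (7): keep only rules whose binary classifier f_i fires on the order.
    `classifiers` is the ensemble E = {f_1, ..., f_N}, aligned with rule_base."""
    return [r for r, f in zip(rule_base, classifiers) if f(order) == 1]

def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if degenerate)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve_top_k(query_emb, repository, k):
    """Top-K precedents N_K from D_hist, ranked by cosine similarity between
    the query embedding and each stored (embedding, verdict) pair. The
    repository must contain only orders processed before the query's timestamp."""
    ranked = sorted(repository, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    return ranked[:k]
```

The retrieved cohort would then be handed to the summarization LLM $g$ to produce the Meta-Insight, which this sketch leaves abstract.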
Given a query $O^{query}_{text}$, we employ a dense retriever to identify the Top-$K$ semantic neighbors $\mathcal{N}_K$ based on vector similarity:

$\mathcal{N}_K = \arg\max_{\mathcal{S} \subset \mathcal{D}_{hist}, |\mathcal{S}|=K} \sum_{j \in \mathcal{S}} \frac{E(O^{query}_{text}) \cdot E(O^{(j)}_{text})}{\|E(O^{query}_{text})\|_2 \, \|E(O^{(j)}_{text})\|_2}$  (8)

where $E(\cdot)$ denotes the embedding function, and $\|\cdot\|_2$ represents the $L_2$ norm.

Meta-Insight Abstraction. To avoid information sparsity arising from raw text concatenation, we employ a Summary Agent powered by an LLM. Drawing inspiration from verbal reinforcement learning mechanisms [25], this agent analyzes the retrieved cohort $\mathcal{N}_K$ to identify statistical commonalities and adjudication patterns. It outputs a concise Meta-Insight $I_{syn}$ via a dedicated summarization LLM, denoted as $g$:

$I_{syn} = g(\mathcal{N}_K)$  (9)

This synthesized insight bridges the gap between raw history and current decision-making, providing the model with expert references that stabilize the reasoning process in complex scenarios.

5 Progressive Alignment Framework

To bridge the gap between general multimodal capabilities and the rigorous demands of adjudication, we propose a three-stage progressive training framework.

Stage 1: Visual Concept Alignment. To establish the fundamental alignment between ride-hailing concepts and visual representations, we employ the SynTraj dataset. In this stage, we freeze the language model while optimizing the vision encoder and projector. This process enables the model to perceptually ground abstract adjudication rules into concrete trajectory patterns without altering the pre-trained knowledge base.

Stage 2: Adjudication Logic Alignment. We subsequently conduct Supervised Fine-Tuning on the multimodal reasoning dataset synthesized via the Chain-of-Adjudication framework. We fine-tune the LLM backbone to maximize the likelihood of the expert reasoning chain. This stage aligns the reasoning process of the model with professional adjudication protocols and strengthens its instruction-following capabilities.
Stage 3: Reinforcement and Exploration. In the final stage, we employ Reinforcement Learning to further enhance the model's robustness and explore the decision boundaries of complex cases.

Divergence-Aware Data Selection. To maximize reinforcement learning efficiency, we implement a difficulty-aware selection strategy. We assess the complexity of each instance by performing $N = 10$ stochastic rollouts using the Stage 2 model. This process generates a set of correctness scores, where 1 represents a correct verdict and 0 represents an incorrect one. We calculate the average consistency score $S_{avg}$ and strictly retain instances that satisfy the criterion $0.2 \leq S_{avg} \leq 0.8$. This filtering mechanism effectively eliminates both trivial and intractable cases to ensure the model focuses on samples where it currently exhibits reasoning ambiguity. Consequently, we retain approximately 2,000 filtered samples for the subsequent training phase.

Ordinal-Sensitive Reward. We utilize the DAPO algorithm [31] for optimization. To address the limitations of sparse binary feedback in adjudication, we propose a novel Ordinal-Sensitive Reward mechanism. Deviating from traditional methods that treat all errors equally, we assign graded reward signals based on the semantic proximity between predictions and ground truths. This approach effectively mitigates reward sparsity and guides the model to distinguish between error levels, ensuring precise and hierarchically consistent decision-making.

We formalize the liability label space as an ordered set $\mathcal{Y} = \{y_1 < y_2 < \cdots < y_K\}$, where the rank mapping $r(y_k)$ corresponds to the ordinal index $k$. The answer reward $R_{ans}$ is defined as:

$R_{ans}(y_{pred}, y_{gt}) = \begin{cases} 1.0 & \text{if } y_{pred} = y_{gt} \\ d(r(y_{pred}), r(y_{gt})) & \text{if } y_{pred} \neq y_{gt} \wedge y_{pred} \in \mathcal{Y} \\ 0 & \text{if } y_{pred} \notin \mathcal{Y} \end{cases}$  (10)

Here $d(\cdot)$ represents a semantic distance function that quantifies the proximity between the predicted rank and the ground truth rank.
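A minimal sketch of the ordinal-sensitive answer reward of Eq. (10), plus its weighted combination with a format reward as used in the final training objective, might look as follows. The four-label ordered set and the linear decay chosen for $d(\cdot)$ are illustrative assumptions; the paper leaves the exact distance function unspecified.

```python
# Illustrative ordered label space Y = {y_1 < y_2 < ... < y_K}; not the
# platform's actual liability taxonomy.
LABELS = ["no_fault", "minor_fault", "major_fault", "malicious"]

def ordinal_reward(y_pred, y_gt, labels=LABELS):
    """Eq. (10) sketch: 1.0 for an exact match, a graded partial reward that
    decays with ordinal distance for valid-but-wrong labels, 0 for labels
    outside Y. The decay 0.5 - 0.2*(gap-1) is a hypothetical choice of d."""
    if y_pred not in labels:
        return 0.0
    if y_pred == y_gt:
        return 1.0
    gap = abs(labels.index(y_pred) - labels.index(y_gt))  # ordinal distance
    return max(0.0, 0.5 - 0.2 * (gap - 1))

def total_reward(r_ans, r_fmt, lam_ans=0.9, lam_fmt=0.1):
    """Weighted sum of answer and format rewards; the coefficients are
    placeholders for the paper's lambda_ans and lambda_fmt."""
    return lam_ans * r_ans + lam_fmt * r_fmt
```

Any monotonically decreasing $d(\cdot)$ in $[0, 1)$ preserves the intended property: a near-miss on the liability scale earns more reward than a distant or malformed prediction.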
Furthermore, to ensure the model adheres to the structural constraints of the output format, we incorporate a format reward $R_{fmt}$. Consequently, the final training objective maximizes the cumulative reward defined as:

$R_{total} = \lambda_{ans} R_{ans} + \lambda_{fmt} R_{fmt}$  (11)

where $\lambda_{ans}$ and $\lambda_{fmt}$ are the balancing coefficients for the answer accuracy and format compliance, respectively.

6 Experiments

Baselines. We benchmark our model against three LLMs and five MLLMs to provide a comprehensive comparison. For rigorous evaluation, we employ both standard prompting and Chain-of-Thought prompting strategies across all baseline models. Note that the LLMs only receive the textual order information $O_{text}$ as input due to their inherent modality constraints.

Implementation Details. We train two variants, RideJudge-8B and RideJudge-4B, which are initialized from Qwen3-VL-8B-Instruct and Qwen3-VL-4B-Instruct, respectively. The supervised fine-tuning stages are implemented using the LLaMA-Factory framework [37], while the reinforcement learning stage utilizes EasyR1 [36]. All training experiments are conducted on 8× NVIDIA H200 GPUs. Further hyperparameter details are provided in Appendix C.

Evaluation Metrics. We primarily utilize Accuracy to measure the global correctness of the model across all liability types. To enable a more granular analysis regarding the ordinal level, we further categorize the fine-grained responsibility verdicts into two hierarchical levels named Normal and Malicious. The Normal category targets basic liability disputes to distinguish whether the driver is at fault or holds no responsibility. The Malicious category focuses on identifying severe violations where the driver exhibits intentionally malicious behaviors. For these specific binary classification tasks, we report the Precision and Recall to evaluate the sensitivity and exactness of the model in different contexts.
6.1 Performance on Ride-Hailing Adjudications

Evaluation Benchmarks. To validate the effectiveness of our framework in real-world settings, we conduct evaluations on datasets collected from DiDi Chuxing, one of the largest ride-hailing platforms globally. We select three distinct test sets representing critical and complex adjudication scenarios.

Appeal: This set contains 1,007 challenging samples in which drivers formally appealed the initial platform verdict after order cancellation.

Driver-Cancel: This set comprises 453 selected hard samples involving reservation orders cancelled by drivers.

Passenger-Cancel: This set includes 1,249 selected hard samples involving reservation orders initiated by passengers.

Table 1 presents the comprehensive results across these three distinct benchmarks. Our framework secures the first or second best result on the majority of evaluation metrics, demonstrating robust generalization capabilities. Specifically, our RideJudge-8B model achieves an overall accuracy of 88.41% on the combined test set, significantly surpassing the same-series Qwen3-VL-32B-Instruct, which scores 65.55%, and the reasoning-enhanced text model DeepSeek-V3.1, which attains 75.25%. A rigorous analysis of the baselines reveals a notable trend: text-only LLMs generally outperform standard MLLM baselines. This disparity stems from the fact that adjudication is inherently a logic-intensive task requiring strict adherence to complex platform rules. Text-only models like DeepSeek-V3.1 leverage their superior reasoning capabilities to infer verdicts from textual metadata, whereas general-purpose MLLMs lack pre-training on domain-specific trajectory maps, leading to severe hallucinations. However, text-only models eventually hit a performance ceiling due to their inability to access the visual modality, which is essential for verifying spatial proofs.
It is worth noting that certain baselines exhibit abnormally high precision in specific scenarios; Qwen3-VL-8B-Instruct, for example, achieves 100.00% precision on the Driver-Cancel Normal task. This metric is misleading, as it is accompanied by a negligible recall of 9.60%, indicating that the model is heavily biased towards negative predictions and fails to retrieve valid cases. Similarly, several general multimodal models fail to identify malicious intent entirely, yielding zero precision and recall in the malicious categories, which stems from their lack of domain-specific alignment. RideJudge effectively bridges these gaps by integrating the logical rigor of LLMs with the precise visual grounding of trajectory data, resulting in superior performance across both normal and malicious adjudication tasks.

Table 1: Main Results on Three Adjudication Benchmarks. We report Precision (P) and Recall (R) on the Normal (N) and Malicious (M) levels, and Accuracy (Ac.), for Appeal, Driver-Cancel (D-C), and Passenger-Cancel (P-C), plus the Overall accuracy. The first six rows are LLM baselines (text-only), the next ten are MLLM baselines, and the final two are our models.

| Backbone | Method | Appeal N-P | N-R | M-P | M-R | Ac. | D-C N-P | N-R | M-P | M-R | Ac. | P-C N-P | N-R | M-P | M-R | Ac. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3.1 | Standard | 70.91 | 95.49 | 80.70 | 20.63 | 57.89 | 80.69 | 86.67 | 64.00 | 84.21 | 86.15 | 88.57 | 66.77 | 89.12 | 65.40 | 85.30 | 75.25 |
| DeepSeek-V3.1 | CoT | 71.15 | 93.59 | 86.79 | 20.63 | 58.09 | 77.99 | 91.85 | 58.33 | 73.68 | 86.36 | 86.75 | 66.46 | 88.38 | 66.54 | 85.30 | 75.36 |
| Qwen3-32B-Instruct | Standard | 62.76 | 89.08 | 27.07 | 63.23 | 38.93 | 32.67 | 36.30 | 23.08 | 15.79 | 25.97 | 30.73 | 41.23 | 53.98 | 23.19 | 17.26 | 26.77 |
| Qwen3-32B-Instruct | CoT | 64.83 | 88.39 | 30.24 | 62.78 | 40.81 | 39.55 | 39.26 | 42.86 | 31.58 | 19.91 | 30.33 | 39.38 | 53.66 | 16.73 | 16.72 | 26.21 |
| QwQ-32B | Standard | 72.77 | 75.04 | 58.60 | 48.88 | 60.28 | 38.19 | 40.74 | 25.00 | 21.05 | 26.62 | 32.47 | 46.46 | 53.91 | 23.57 | 19.44 | 35.82 |
| QwQ-32B | CoT | 76.88 | 77.82 | 66.66 | 36.77 | 61.57 | 47.50 | 70.37 | 30.30 | 52.63 | 22.51 | 38.03 | 91.38 | 53.71 | 54.37 | 18.51 | 35.18 |
| Qwen3-VL-8B-Instruct | Standard | 63.09 | 25.48 | 27.34 | 15.70 | 39.82 | 100.00 | 9.60 | 50.00 | 15.80 | 71.00 | 92.90 | 8.00 | 92.30 | 9.10 | 73.60 | 60.61 |
| Qwen3-VL-8B-Instruct | CoT | 65.57 | 24.09 | 51.35 | 8.52 | 41.81 | 90.00 | 13.30 | 54.50 | 31.60 | 71.60 | 95.70 | 13.80 | 91.50 | 16.30 | 75.00 | 62.09 |
| Qwen3-VL-8B-Thinking | Standard | 75.00 | 4.16 | 0.89 | 77.78 | 2.78 | 50.00 | 2.20 | 0.00 | 0.00 | 13.90 | 66.70 | 1.20 | 33.30 | 0.40 | 8.70 | 7.37 |
| Qwen3-VL-8B-Thinking | CoT | 91.30 | 3.64 | 100.00 | 5.83 | 3.48 | 100.00 | 0.70 | 0.00 | 0.00 | 13.00 | 0.00 | 0.00 | 0.00 | 0.00 | 11.60 | 8.82 |
| Qwen3-VL-32B-Instruct | Standard | 60.61 | 79.20 | 41.88 | 52.02 | 43.20 | 93.50 | 43.00 | 45.80 | 57.90 | 78.40 | 84.60 | 37.20 | 81.50 | 38.40 | 78.90 | 65.55 |
| Qwen3-VL-32B-Instruct | CoT | 60.57 | 83.36 | 38.29 | 16.14 | 41.71 | 91.00 | 52.60 | 53.60 | 78.90 | 72.10 | 83.60 | 50.20 | 81.30 | 52.90 | 68.50 | 59.14 |
| Qwen3-VL-32B-Thinking | Standard | 75.85 | 69.67 | 75.00 | 36.32 | 49.75 | 34.76 | 48.15 | 0.00 | 0.00 | 54.55 | 27.59 | 51.69 | 0.00 | 0.00 | 39.74 | 45.94 |
| Qwen3-VL-32B-Thinking | CoT | 76.18 | 78.16 | 75.93 | 36.77 | 57.30 | 32.29 | 22.96 | 0.00 | 0.00 | 61.04 | 27.52 | 27.69 | 0.00 | 0.00 | 54.98 | 56.86 |
| MiniCPM-V | Standard | 67.44 | 5.03 | 0.00 | 0.00 | 14.90 | 53.33 | 17.78 | 0.00 | 0.00 | 33.12 | 36.84 | 4.31 | 66.67 | 1.52 | 18.66 | 19.68 |
| MiniCPM-V | CoT | 70.18 | 6.93 | 0.00 | 0.00 | 16.48 | 44.00 | 8.15 | 0.00 | 0.00 | 16.88 | 35.85 | 5.85 | 45.45 | 1.90 | 8.71 | 12.96 |
| RideJudge-4B | - | 94.55 | 93.24 | 93.06 | 90.13 | 90.86 | 72.41 | 93.33 | 51.42 | 94.74 | 80.95 | 84.50 | 85.54 | 87.69 | 86.69 | 86.63 | 87.25 |
| RideJudge-8B | - | 94.60 | 94.11 | 91.52 | 91.93 | 91.86 | 78.85 | 91.11 | 41.86 | 94.73 | 83.55 | 89.76 | 80.92 | 86.48 | 85.71 | 87.40 | 88.41 |

Table 2: Main Results Comparison. We report the overall Precision (P), Recall (R), and Accuracy (Acc.) of selected baselines and our method.

| Model | P | R | Acc. |
|---|---|---|---|
| QwQ-32B | 29.73 | 51.27 | 34.00 |
| Qwen3-VL-8B-Instruct | 29.68 | 51.69 | 33.40 |
| Qwen3-VL-8B-Thinking | 29.67 | 52.77 | 33.90 |
| Qwen3-VL-32B-Instruct | 30.22 | 53.39 | 34.40 |
| Qwen3-VL-32B-Thinking | 30.12 | 53.88 | 33.40 |
| Ours | 53.17 | 65.34 | 60.20 |

6.2 Ablation and In-Depth Study

To dissect the contribution of each component in our framework, we conduct a comprehensive ablation study on the Appeal benchmark using the Qwen3-VL-8B backbone. As presented in Table 3, we evaluate four variants by systematically removing key modules.

Effect of Training Strategy. We first examine the progressive impact of our training paradigm. The transition from the Baseline to the 1st stage yields a marginal improvement of 1.42%. This limited gain suggests that while Visual Concept Alignment successfully grounds geometric patterns, it fails to verify complex liability scenarios due to the absence of reasoning capabilities. In contrast, the introduction of the 2nd stage brings a substantial performance leap, raising accuracy to 68.81%. This confirms that the Juridical Logic Alignment derived from our Chain of Adjudication effectively bridges the gap between perception and decision-making. Finally, the 3rd stage, Reinforcement Learning Tuning, propels the model to its peak performance of 91.86%, demonstrating that exploring decision boundaries via Group Relative Policy Optimization significantly consolidates the robustness of the model.

Impact of Key Modules. We further investigate the specific contributions of our architectural designs. Comparing No.5 and No.6 reveals that the structured Chain of Adjudication outperforms standard unstructured supervision. The CoA framework improves Recall on Malicious tasks by over 20%, indicating that decomposed reasoning steps help the model uncover subtle fraud patterns that end-to-end learning overlooks. Furthermore, the comparison between No.9 and No.10 highlights the criticality of our Ordinal-Sensitive Reward. Removing this mechanism and relying solely on standard binary rewards causes a drastic accuracy drop of 21.35%. This sharp decline verifies that penalizing logical deviations according to their ordinal level is essential for stabilizing the reinforcement learning process in high-stakes adjudication tasks. Similarly, the removal of the KACR module in No.7 leads to a significant performance regression, which further validates the necessity of rigorous knowledge alignment and data selection for high-confidence predictions.

Table 3: Ablation Study on the Appeal Benchmark. Δ indicates the performance gap between the full method and the ablated version.

| No. | Settings | N-P | N-R | M-P | M-R | Acc. |
|---|---|---|---|---|---|---|
| 1 | Baseline | 63.57 | 24.09 | 51.35 | 8.52 | 44.49 |
| 2 | + 1st stage | 64.29 | 28.08 | 48.84 | 9.42 | 45.91 |
| 3 | + 2nd stage | 80.92 | 76.43 | 68.82 | 57.40 | 68.81 |
| 4 | + 3rd stage | 94.60 | 94.11 | 91.52 | 91.93 | 91.86 |
| 5 | SFT w/o CoA | 76.88 | 77.82 | 67.31 | 36.77 | 63.75 |
| 6 | SFT w/ CoA | 80.92 | 76.43 | 68.82 | 57.40 | 68.81 |
| Δ | | +4.04 | -1.39 | +1.51 | +20.63 | +5.06 |
| 7 | w/o KACR | 84.17 | 78.34 | 70.05 | 68.16 | 70.56 |
| 8 | w/ KACR | 94.60 | 94.11 | 91.52 | 91.93 | 91.86 |
| Δ | | +10.43 | +15.77 | +21.47 | +23.77 | +21.30 |
| 9 | DAPO w/o OS Reward | 85.97 | 66.90 | 72.83 | 56.50 | 70.51 |
| 10 | DAPO w/ OS Reward | 94.60 | 94.11 | 91.52 | 91.93 | 91.86 |
| Δ | | +8.63 | +27.21 | +18.69 | +35.43 | +21.35 |

Scaling and Stability. To assess the scalability of our synthesized data, we trained the Stage 2 model using 30%, 50%, and 70% subsets of the CoA dataset. As shown in Figure 3 (left), model performance steadily improves as data volume increases, confirming the high quality and efficacy of our synthesis pipeline. Furthermore, to verify robustness in real-world scenarios, we sampled 5,000 real appeal orders from ten major Chinese cities. Figure 3 (right) illustrates that RideJudge-8B maintains consistently high accuracy across diverse urban environments, demonstrating strong geographical generalization and stability.

Case Study. Figure 2 visualizes the reasoning of RideJudge-8B compared to Qwen3-VL-32B-Instruct.
Our model sequentially performs Information Analysis (green), Visual Evidence Integration (red), and Rule Grounding (blue). This structured reasoning allows the model to accurately associate the case with specific rules and derive the correct verdict.

[Figure 2: Qualitative case studies on the Appeal benchmark, contrasting RideJudge-8B with Qwen3-VL-32B-Instruct on three cancelled orders (ground truths: Malicious Fault, Normal Fault, and No Fault). In each case, RideJudge-8B grounds its verdict in the trajectory evidence (e.g., detecting an intentional deviation that lengthened the pickup, or movement during the required waiting period), while the baseline misreads the evidence and reaches the wrong verdict. To preserve privacy, sensitive textual regions in the images, as well as specific numerical values and location names within the reasoning chains, have been masked. We highlight the Information Analysis process in green, the Visual Evidence Integration process in red, and the Rule Grounding process in blue.]

[Figure 3: Left: Performance scaling with our CoA synthetic data used in Stage 2, with accuracy rising from 45.91% at 0% of the data to 68.81% at 100%. Right: Stability analysis of RideJudge-8B across ten major cities, including Guangzhou, Shanghai, Beijing, Shenzhen, Chengdu, Chongqing, Xi'an, Dalian, and Nanjing.]

6.3 Performance on Other Multimodal Adjudication Tasks

To demonstrate the generalization of our framework, we extend our evaluation to the PetFinder Prediction benchmark. We contend that this task represents a generalized form that shares an intrinsic similarity with dispute adjudication: both require synthesizing heterogeneous evidence, comprising unstructured textual descriptions, structured tabular metadata (e.g., age, health), and visual imagery, to render a judgment against a backdrop of implicit valuation rules and historical precedents. The objective is to predict the speed of adoption, classified into discrete categories.
As presented in Table 2, our trained model demonstrates robust performance that significantly outperforms standard multimodal baselines across all metrics. The results indicate that our core mechanisms, specifically the reasoning synthesis and dynamic context optimization, are not confined to trajectory data but generalize effectively to other heterogeneous data structures. This empirical evidence underscores the versatility and strong transferability of our Visual-Logic-Aligned architecture in broader multimodal decision-making scenarios. For more comprehensive implementation details and experimental settings, please refer to Appendix D.

7 Related Work

7.1 Large Language Models for Automated Adjudication

The application of LLMs in the legal domain has transitioned from generic text processing to specialized juridical reasoning frameworks. Early domain-specific models, such as ChatLaw [6] and LawGPT [39], primarily focused on injecting legal vocabulary and retrieving statutory knowledge to handle complex queries. To enhance logical robustness, recent frameworks like SaulLM [5] and DISC-LawLLM [32] have integrated prompting strategies to simulate the procedural reasoning of human judges. Furthermore, the "LLM-as-a-Judge" paradigm [14, 40] has formalized the use of models for impartial evaluation and scoring. However, a critical limitation persists across these works: they operate solely on textual precedents. In real-world ride-hailing adjudication, liability assessment fundamentally relies on physical evidence, specifically the alignment between spatial-temporal trajectories and map semantics. Existing legal LLMs lack the visual perception capabilities required to ground abstract liability regulations into concrete trajectory patterns, a gap our framework addresses through a progressive visual-logic alignment mechanism.
7.2 Multimodal Intelligence in Spatio-Temporal Decision Making

Our work also intersects with the field of MLLMs for urban and autonomous driving applications. Recent research has demonstrated the efficacy of integrating spatiotemporal dependencies into transformer architectures. For instance, UrbanGPT [16] models city-wide flow dynamics for traffic prediction, while TrafficGPT [35] explores LLMs as control agents for traffic signal optimization. In the context of visual understanding for driving, NuScenes-QA [21] and DriveLM [26] have established benchmarks for Graph Visual Question Answering. Despite these advancements, existing models focus primarily on predictive tasks or perceptual Q&A. They lack the specific capability to perform forensic analysis, which requires auditing historical behaviors against strict regulatory frameworks to attribute liability.

8 Conclusion

In this paper, we presented RideJudge, a Progressive Visual-Logic-Aligned Framework for ride-hailing adjudication. By integrating automated data synthesis, adaptive context optimization, and a three-stage training paradigm, our approach effectively bridges the systemic disconnects between general MLLMs and rigorous adjudication requirements. Extensive results on real-world tasks demonstrate the effectiveness of the proposal. This work offers a scalable blueprint for deploying specialized MLLMs in broader, complex, rule-governed decision-making tasks.

Acknowledgments

This work is supported by the Key Program of the Jiangsu Science Foundation (Grant No. BK20243012), the National Natural Science Foundation of China (Grant No. 62306133), and the CCF-DiDi GAIA Collaborative Research Funds for Young Scholars. We also appreciate the guidance provided by the engineers at Didichuxing Co. Ltd.

References

[1] Shuai Bai, Yuxuan Cai, Keming Zhu, Shijie Wang, Zhihao Fan, Jinze Bai, et al. 2025. Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025).
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316 (2016).
[3] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[4] Xingyi Cheng, Ruiqing Zhang, Jie Zhou, and Wei Xu. 2018. DeepTransport: Learning Spatial-Temporal Dependency for Traffic Condition Forecasting. 1–8. doi:10.1109/IJCNN.2018.8489600
[5] Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, Andre F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, and Michael Desa. 2024. SaulLM-7B: A Pioneering Large Language Model for Law. arXiv preprint arXiv:2403.03883 (2024).
[6] Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases. arXiv preprint arXiv:2306.16092 (2023).
[7] Finale Doshi-Velez and Been Kim. 2017. Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608 (2017).
[8] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024).
[9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, et al. 2025. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 645, 8081 (2025), 633–638.
[10] Xiaotong Guo, Hanyong Xu, Dingyi Zhuang, Yunhan Zheng, and Jinhua Zhao. 2023. Fairness-Enhancing Vehicle Rebalancing in the Ride-hailing System. arXiv preprint arXiv:2401.00093.
[11] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. 2026. Vision-R1: Eliciting Strong Reasoning Capabilities in Vision-Language Models. In Proceedings of the Fourteenth International Conference on Learning Representations.
[12] Pu Jian, Donglei Yu, and Jiajun Zhang. 2024. Large Language Models Know What is Key Visual Entity: An LLM-Assisted Multimodal Retrieval for VQA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 10939–10956.
[13] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems.
[14] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwanjun Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, and Minjoon Seo. 2024. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. In The Twelfth International Conference on Learning Representations.
[15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33.
[16] Zhonghang Li, Lianghao Xia, Jiabin Tang, Yong Xu, Lei Shi, Long Xia, Dawei Yin, and Chao Huang. 2024. UrbanGPT: Spatio-Temporal Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5351–5362.
[17] Zachary C. Lipton. 2018. The Mythos of Model Interpretability. Commun. ACM (2018), 36–43.
[18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 26296–26306.
[19] Yaxi Liu, Kai Zhao, Gao Cong, and Zhifeng Li. 2020. Online Anomalous Trajectory Detection with Deep Generative Sequence Modeling. In 2020 IEEE 36th International Conference on Data Engineering. IEEE, 949–960.
[20] Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. 2017. Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction. Sensors 17, 818. MDPI.
[21] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. 2024. NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 4542–4550.
[22] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, et al. 2024. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530 (2024).
[23] Cynthia Rudin. 2019. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[24] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. 2025. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv preprint arXiv:2504.07615 (2025).
[25] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 36.
[26] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Han, Chengen Zhang, Tuan-Hung Tran, Xiangwei Li, Andreas Geiger, and Hongyang Li. 2024. DriveLM: Driving with Graph Visual Question Answering. In European Conference on Computer Vision.
[27] Grigorios Tsoumakas and Ioannis Katakis. 2007. Multi-label Classification: An Overview. International Journal of Data Warehousing and Mining 3, 3 (2007), 1–13.
[28] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
[29] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chen-Wei Lin, Zicheng Liu, and Lijuan Wang. 2023. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv:2309.17421 (2023).
[30] Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu. 2024. CapsFusion: Rethinking Image-Text Data at Scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14022–14032.
[31] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. 2025. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. In Advances in Neural Information Processing Systems.
[32] Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. 2023. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv preprint arXiv:2309.11325 (2023).
[33] Daqing Zhang, Nan Li, Zhi-Hua Zhou, Chao Chen, Lin Sun, and Shijian Li. 2011. iBAT: Detecting Anomalous Taxi Trajectories from GPS Traces. In Proceedings of the 13th International Conference on Ubiquitous Computing. 99–108.
[34] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, et al. 2026. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618.
[35] Siyao Zhang, Daocheng Fu, Zhao Zhang, Bin Yu, and Pinlong Cai. 2024. TrafficGPT: Viewing, Processing and Interacting with Traffic Foundation Models. In Proceedings of the AAAI Conference on Artificial Intelligence.
[36] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. 2025. EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework. https://github.com/hiyouga/EasyR1.
[37] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
[38] Jingqi Zhou, Sheng Wang, Jingwei Dong, Lei Li, Jiahui Gao, Jiyue Jiang, Lingpeng Kong, and Chuan Wu. 2024. ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom. arXiv preprint arXiv:2410.14138 (2024).
[39] Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song, Xiao-Wen Yang, Yi-Xuan Jin, Lan-Zhe Guo, and Yu-Feng Li. 2024. LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model. arXiv:2406.04614 (2024).
[40] Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. JudgeLM: Fine-tuned Large Language Models are Scalable Judges. arXiv preprint arXiv:2310.17631 (2023).

A Ethical Statement

To promote reproducibility and facilitate further research in automated adjudication, we are committed to open-sourcing the complete code base for our proposed methods. Additionally, we will publicly release SynTraj, a high-quality synthetic trajectory dataset.

Regarding the proprietary ride-hailing data used in this work, we strictly adhere to ethical guidelines on data privacy and user protection throughout this study.
All adjudication data utilized in our experiments has undergone a rigorous de-identification process to ensure that no Personally Identifiable Information of drivers or passengers is exposed. The usage of this data is strictly restricted to the research and development of platform adjudication algorithms. Furthermore, all data collection was conducted in full compliance with the platform's data governance regulations, and explicit informed consent regarding data usage for service improvement was obtained from all involved parties prior to acquisition. Consequently, all model evaluations were performed using local deployments to ensure no sensitive data was transmitted to external third-party APIs.

Due to these strict privacy protocols, we are unable to release the real-world order dataset at this time. However, the extensive evaluation performed on the publicly available SynTraj dataset effectively validates the generalization capability and robustness of our approach.

B Details of Methods

B.1 SynTraj Mutation

In this section, we provide the formal construction protocols for the liability samples in SynTraj. We denote the base compliant geometric plan as $S_{geo} = [p_1, \ldots, p_N]$, where $p_1 = l_s$ (start) and $p_N = l_e$ (end). We define the navigation oracle $\Psi(x, y)$ as a function returning the coordinate sequence of the shortest path connecting location $x$ to $y$ via the road network.

B.1.1 Trajectory Drift. To simulate realistic sensor imperfections and GPS jitter, we apply a stochastic perturbation mechanism. Unlike topological violations, this mutation operates directly at the coordinate level without altering the route logic.

• Noise Injection: For a trajectory sequence $\mathcal{T} = [q_1, \ldots, q_T]$, we generate the drifted observation $\mathcal{T}_{drift} = [q'_1, \ldots, q'_T]$, where each point is perturbed independently:

$$q'_t = q_t + \vec{\epsilon}_t, \quad \vec{\epsilon}_t \sim \mathcal{N}(0, \sigma^2 I) \quad (12)$$

We set $\sigma \approx 10$–$15\,m$ to simulate urban canyon effects.
This forces the model to learn robust geometric alignment features rather than overfitting to perfectly smooth synthetic lines.

B.1.2 Unintentional Deviation. This mutation simulates a driver diverging from the planned route at an intermediate intersection, potentially to take a shortcut or due to a navigation error, but eventually returning to the destination.

- Step I: Anchor & Waypoint Generation. We sample a split node $p_t \in S_{geo}$ ($1 < t < N$) and compute the intended heading vector $\vec{v} = p_{t+1} - p_t$. We generate an off-route waypoint $w_{dev}$ via rotation:
$$w_{dev} = p_t + \lambda \cdot \mathcal{R}_\theta(\vec{v}), \quad \theta \in \{90^\circ, 270^\circ\} \tag{13}$$
where $\mathcal{R}_\theta$ is the rotation matrix and $\lambda$ is the deviation magnitude.
- Step II: Re-routing and Stitching. We query the oracle $\Psi$ to bridge the topological gap. The final trajectory $\mathcal{T}_{yaw}$ is synthesized by stitching the compliant history, the deviation path, and the recovery path:
$$\mathcal{T}_{yaw} = \underbrace{S_{geo}[1:t]}_{\text{History}} \oplus \underbrace{\Psi(p_t, w_{dev})}_{\text{Deviation}} \oplus \underbrace{\Psi(w_{dev}, l_e)}_{\text{Recovery}} \tag{14}$$

B.1.3 Reverse Driving. To synthesize high-risk retrograde motion, we construct a trajectory that visually overlaps with the planned route but flows against the topological direction of the lane.

- Step I: Vector Inversion. At a selected critical node $p_t$, we calculate the intended heading $\vec{v}$ and derive a "wrong-way" target $w_{rev}$ using an obtuse rotation angle:
$$w_{rev} = p_t + \lambda \cdot \mathcal{R}_\phi(\vec{v}), \quad \phi \in [150^\circ, 210^\circ] \tag{15}$$
- Step II: Truncated Synthesis. We generate the violation segment targeting $w_{rev}$ but truncate it to a distance limit $\delta$ to simulate the onset of the violation. The final trajectory is defined as:
$$\mathcal{T}_{rev} = \underbrace{S_{geo}[1:t]}_{\text{History}} \oplus \underbrace{\mathrm{Trunc}(\Psi(p_t, w_{rev}), \delta)}_{\text{Reverse Segment}} \tag{16}$$
This creates a sharp semantic conflict where the visual motion opposes the map's permitted lane directionality.

B.1.4 Arrival-then-Leave. This scenario models the "refusal to operate" behavior, where a driver completes the order but subsequently departs from the pickup location $l_e$ instead of waiting.
- Step I: Escape Target Generation. Upon validating the full execution of the geometric plan $S_{geo}$, we generate an escape target $w_{esc}$ located at a Euclidean distance $d > \tau_{thresh}$ from $l_e$ with a random bearing.
- Step II: Sequence Extension. The final violation trajectory is constructed as the concatenation of the arrival phase and the unauthorized departure phase:
$$\mathcal{T}_{leave} = \underbrace{S_{geo}}_{\text{Arrival}} \oplus \underbrace{\Psi(l_e, w_{esc})}_{\text{Departure}} \tag{17}$$
This introduces a multimodal mismatch: the instruction sequence $S_{inst}$ terminates at $l_e$ (implying a stop), while the visual trajectory $\mathcal{T}_{leave}$ continues to evolve spatially.

B.2 Chain-of-Adjudication

In our Chain-of-Adjudication framework, we assign specific model architectures to each role to maximize performance while maintaining computational efficiency. Both the Adjudicator and the Reasoning Refiner are instantiated using the QwQ-32B model, leveraging its advanced logical reasoning capabilities for rule interpretation and text summarization. The Visual Analyst is powered by Qwen3-VL-32B-Instruct, enabling precise interpretation of spatiotemporal trajectory data. To ensure the reasoning process remains focused and to prevent infinite recursive loops during the inquiry phase, we strictly limit the maximum number of interaction turns between the Adjudicator and the Visual Analyst to 8. Below we provide the specific system prompts designed for each agent in the framework.

The prompt for the Adjudicator

You are a ride-hailing marketplace adjudication expert responsible for determining which liability features and final liability a driver's behavior matches in a cancelled order, based on the information provided after the driver accepts the request. Please provide the reasoning process and final result according to the adjudication rules and order information below.
During the adjudication process, if you believe map-related information is necessary, you may use <map>map-related question</map> to ask. Please note that your final liability determination must be consistent with the human-annotated result.

The prompt for the Visual Analyst

You are a Map Expert responsible for liability adjudication in the ride-hailing marketplace. Your task is to analyze orders cancelled by drivers after acceptance and answer questions from an Adjudication Expert to determine which specific fault indicators the driver's behavior matches and the final liability. We will provide the order details, the map, and the expert's questions. Please carefully analyze the order information in the context of the map and return your answer within <answer></answer> tags.

B.3 Knowledge-Aware Context Refinement

In this section we elaborate on the training protocols and hyperparameter configurations for the Context Refinement module.

B.3.1 Scenario-Aware Rule Calibration Details. The Decomposed Ensemble Calibrator is designed to handle the high-dimensional and imbalanced nature of rule applicability. For each rule $r_i \in \mathcal{K}$, we train a dedicated binary classifier $f_i$.

Hybrid Feature Representation. To capture both the structured metadata and the unstructured semantic context of an order, we construct a hybrid feature space.

- Tabular Features: We utilize intrinsic order attributes, such as time of day, location coordinates, and cancellation reason codes, as dense numerical features.
- Semantic Features: We concatenate textual fields, including passenger complaints and driver appeals, and encode them into 1024-dimensional embeddings using the pre-trained bge-large-zh-v1.5 model.

These features are concatenated to form the input vector for the classifiers.

The prompt for the Reasoning Refiner

You are a Reasoning Refinement Specialist in the ride-hailing adjudication domain.
Your task is to act as a meta-cognitive editor, distilling the raw, fragmented interaction history between an Adjudicator and a Visual Analyst into a coherent, standardized adjudication log. Based on the provided conversation history, order metadata, and adjudication rules, please reconstruct the reasoning process and output the final conclusions following this strict format:

1. Reasoning Chain: Enclose your detailed reconstruction of the adjudication path within <reason>...</reason> tags. Inside this tag, you must structure the content into four distinct stages: (1) Information Analysis: Systematically summarize the key order metadata and dispute context. (2) Visual Evidence Integration: Synthesize the objective trajectory facts verified by the Visual Analyst. (3) Rule Grounding: Explicitly map the established facts to the specific liability clauses. (4) Comprehensive Adjudication: Perform the final logical deduction.
2. Scenario Identification: Output the specific adjudication scenario or fault category within <judge>...</judge> tags.
3. Final Verdict: Output the final liability determination within <result>...</result> tags.

Please ensure that the refined reasoning path is logically fluid and that the final determination aligns strictly with the provided ground truth.

Adaptive Model Selection Strategy. Recognizing that different rules exhibit varying statistical distributions, we do not rely on a single algorithm. Instead, we implement an automated model search framework. For each rule $r_i$, we train three gradient boosting variants: XGBoost, LightGBM, and CatBoost. During the validation phase, we prioritize the Recall metric over Accuracy. This is a critical design choice for adjudication, as missing an applicable rule is significantly more detrimental than retrieving a marginally relevant one.
Consequently, for every rule $r_i$, the final classifier $f_i$ is selected as follows:
$$f_i = \arg\max_{m \in \{\mathrm{XGB},\, \mathrm{LGB},\, \mathrm{Cat}\}} \mathrm{Recall}_{val}(m) \tag{18}$$
To address the extreme class imbalance, where rule applicability is often a rare event, we employ a dynamic down-sampling strategy on the majority class during training, ensuring a balanced distribution for gradient optimization.

B.3.2 Retrieval-Augmented Insight Extraction. Strict Temporal Partitioning. To ensure the rigorous evaluation of the system's predictive capability, we enforce a strict temporal cutoff for the historical repository $\mathcal{D}_{hist}$. When processing a query order at timestamp $t_{query}$, the repository is restricted to:
$$\mathcal{D}^{valid}_{hist} = \{(O^{(j)}, y^{(j)}) \mid t^{(j)}_{order} < t_{query}\} \tag{19}$$
This prevents any form of data leakage where future adjudication outcomes could influence current decisions.

Hyperparameters and Model Configuration. Based on empirical ablation studies, we set the retrieval depth to $K = 4$, which offers the optimal trade-off between context richness and input noise. The Summary Agent responsible for synthesizing the Meta-Insight $I_{syn}$ is instantiated using QwQ-32B. Consistent with the calibration module, we employ the bge-large-zh-v1.5 encoder for vectorizing historical precedents, ensuring semantic space alignment.

C Training Details

In this section, we provide the detailed hyperparameter configurations and training protocols for our three-stage Progressive Alignment Framework.

Overview. The supervised fine-tuning stages are implemented using the LLaMA-Factory framework, while the reinforcement learning stage utilizes EasyR1. All training experiments are conducted on a cluster of 8 NVIDIA H200 GPUs.

C.1 Supervised Fine-Tuning Stages

The first two stages focus on concept grounding and logic alignment through supervised learning. The specific hyperparameter settings are compared in Table 4.
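The per-rule selection criterion of Eq. (18) above reduces to an argmax over validation recall. The following is a minimal, dependency-free sketch under stated assumptions: the three boosted models are replaced by hypothetical stand-in predictors, and `select_rule_classifier` is a name we introduce for illustration.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def select_rule_classifier(candidates, X_val, y_val):
    """Eq. (18): per rule, keep the candidate with the highest
    validation recall. `candidates` maps a model name ('XGB',
    'LGB', 'Cat') to a predict function over the validation set."""
    scored = {name: recall(y_val, predict(X_val))
              for name, predict in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored

# Toy stand-ins for the three gradient-boosting variants.
y_val = [1, 1, 0, 1, 0]
candidates = {
    "XGB": lambda X: [1, 0, 0, 0, 0],  # misses two applicable cases
    "LGB": lambda X: [1, 1, 0, 1, 1],  # catches all applicable cases
    "Cat": lambda X: [1, 1, 0, 0, 0],  # misses one applicable case
}
best, scores = select_rule_classifier(candidates, None, y_val)
```

Note how the recall criterion tolerates the false positive produced by "LGB": for adjudication, retrieving a marginally relevant rule is cheaper than missing an applicable one.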
Stage 1: Visual Concept Alignment. We utilize the SynTraj dataset to align visual representations with ride-hailing concepts. Consistent with our goal of grounding abstract rules into visual patterns, we unfreeze both the vision tower and the multimodal projector while keeping the language model frozen. This ensures the visual encoder adapts to the synthetic trajectory domain without altering the pre-trained linguistic knowledge.

Stage 2: Adjudication Logic Alignment. We subsequently freeze the vision tower to preserve the learned visual features and perform full-parameter fine-tuning on the language model backbone. This stage aligns the reasoning chains of the model with professional adjudication protocols.

Table 4: Hyperparameter settings for Stage 1 Visual Alignment and Stage 2 Logic Alignment.

Hyperparameter         | Stage 1  | Stage 2
-----------------------|----------|---------
Vision Tower           | Unfrozen | Frozen
Projector              | Unfrozen | Frozen
Language Model         | Frozen   | Unfrozen
Precision              | BF16     | BF16
Optimizer              | AdamW    | AdamW
Learning Rate          | 5.0e-6   | 5.0e-6
LR Scheduler           | Cosine   | Cosine
Warmup Ratio           | 0.1      | 0.1
Per-Device Batch Size  | 16       | 2
Gradient Accumulation  | 2        | 2
Num Epochs             | 4        | 8

C.2 Reinforcement Learning Stage

In the final stage, we employ the DAPO algorithm to explore decision boundaries and enhance robustness. We initialize the policy model from the Stage 2 checkpoint. DAPO is specifically selected for its ability to handle group-wise optimization without requiring a separate value model, making it highly efficient for reasoning tasks. The hyperparameters are detailed in Table 5. The total reward is defined as the weighted sum of our OS answer reward and the format compliance reward, with balancing coefficients 0.8 and 0.2, respectively. The format reward enforces structural compliance.

Table 5: Hyperparameter settings for Stage 3 Reinforcement Learning via DAPO.
Hyperparameter       | Value
---------------------|------------
Algorithm            | DAPO
Actor Learning Rate  | 1.0e-6
Critic Learning Rate | 1.0e-6
Weight Decay         | 1.0e-2
KL Coefficient       | 0.01
KL Penalty Type      | Low Var KL
Global Batch Size    | 128
Rollout N            | 5
Max Grad Norm        | 1.0
Temperature / Top-p  | 1.0 / 1.0

D Experimental Details on Multimodal Adjudication Tasks

To evaluate the generalization capability of our framework, we conducted experiments on the PetFinder Adoption Prediction benchmark. This section provides a formal definition of the task, details the data partition strategy used to simulate a retrieval-augmented adjudication scenario, and describes the construction of the domain-specific rule base.

D.1 Task Definition

The experiment utilizes the public dataset from the PetFinder.my Adoption Prediction challenge. We formulate the problem as a multimodal ordinal regression task involving heterogeneous input modalities. Let $\mathcal{X}$ denote the input space consisting of tuples $(I, T, M)$, where $I$ represents visual data from pet images, $T$ denotes unstructured textual descriptions provided by rescuers, and $M$ represents structured tabular metadata including attributes such as age, breed, gender, and health condition.

The objective is to predict a target label $y \in \{0, 1, 2, 3, 4\}$, representing the adoption speed category. The labels correspond to ordinal time intervals ranging from same-day adoption to cases where the pet remains unadopted after 100 days. Unlike traditional classification approaches, we treat this as an adjudication task where the model must synthesize evidence from visual and textual modalities to determine the adoptability score.

To enable a rigorous evaluation of the model's discriminative capability, we further define a binary classification sub-task. Specifically, we aggregate the ordinal labels into two distinct categories: labels $\{0, 1, 2\}$ are remapped as Positive Samples (representing short adoption times), while labels $\{3, 4\}$ are remapped as Negative Samples (indicating prolonged stays or unadopted cases).
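The ordinal-to-binary remapping described above can be sketched as follows (a minimal illustration; the label sequences and helper names are hypothetical, not drawn from the dataset):

```python
def binarize_adoption_speed(labels):
    """Remap ordinal AdoptionSpeed labels: {0, 1, 2} -> 1 (positive,
    short adoption time), {3, 4} -> 0 (negative, prolonged stay or
    unadopted)."""
    return [1 if y in (0, 1, 2) else 0 for y in labels]

def precision_recall(y_true, y_pred):
    """Binary precision and recall over the remapped labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Hypothetical ordinal ground truth and predictions.
y_true = binarize_adoption_speed([0, 2, 4, 1, 3, 4])
y_pred = binarize_adoption_speed([1, 2, 4, 3, 3, 2])
prec, rec = precision_recall(y_true, y_pred)
```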
This binarization allows us to report binary Precision and Recall metrics, thereby quantifying the model's effectiveness in distinguishing high-adoptability pets from those facing adoption challenges.

D.2 Dataset Construction

The original dataset contains 14,993 samples. To align with the case-based reasoning mechanism proposed in our framework, we restructured the dataset into three distinct subsets based on index ordering. This approach prevents information leakage and simulates a realistic scenario where past cases are used to adjudicate current inquiries. We partitioned the data as follows:

(1) Historical Case Library: We designated the first 3,000 samples as the historical repository. This subset serves as the retrieval corpus $\mathcal{D}_{retrieval}$ for the Retrieval-Augmented Generation module. During inference, the model retrieves relevant cases from this fixed library to support its decision-making process.
(2) Training Set: The subsequent 10,000 samples, specifically indices 3,000 through 13,000, were utilized as the training set $\mathcal{D}_{train}$. This subset is used to optimize the parameters of the multimodal encoder and the reasoning policy.
(3) Test Set: The final 1,000 samples were reserved strictly for evaluation. This test set $\mathcal{D}_{test}$ contains the target cases for adjudication, ensuring that the performance metrics reflect the model's ability to generalize to unseen profiles using the established historical library.

D.3 Construction of Rule Knowledge Base

A core component of our architecture is the integration of explicit domain knowledge to guide the reasoning process. We constructed a specialized rule base containing 48 distinct heuristics governing pet adoption adjudication. The rule generation process employed a collaborative approach combining human domain expertise with Large Language Models.
First, human experts identified fundamental biological and medical constraints derived from animal welfare guidelines. These core constraints formed the initial seed set. Second, we utilized a Large Language Model to analyze high-confidence samples from the Historical Case Library, extracting latent correlations between multimodal features and adoption speeds. These candidate rules were then formalized into logical implications. Finally, the generated rules underwent a manual verification process to eliminate redundancy and ensure logical consistency. The resulting set of 48 rules was compiled into the final knowledge base used during the cross-modal reasoning stage.
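As an illustration of how such heuristics can be formalized as logical implications, the following sketch encodes one rule as a condition over case features. The rule, its name, and the feature keys are invented for illustration only; they are not among the paper's 48 rules, which are not enumerated here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class AdjudicationRule:
    """A heuristic as a logical implication: if `condition` holds on
    the case features, the rule asserts `implication`."""
    name: str
    condition: Callable[[Dict[str, Any]], bool]
    implication: str

# A hypothetical heuristic of the kind the seed set might contain.
rule = AdjudicationRule(
    name="senior_untreated_medical",
    condition=lambda c: c["age_months"] >= 96 and c["health"] == "serious",
    implication="predict prolonged adoption time",
)

# The rule fires on a matching case and stays silent otherwise.
case = {"age_months": 120, "health": "serious"}
fired = rule.condition(case)
```

Representing rules as explicit condition/implication pairs keeps the manual verification step tractable: redundancy and contradiction checks reduce to comparing conditions and conclusions pairwise.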