Paper deep dive
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/26/2026, 2:48:21 AM
Summary
SpatialReward is a verifiable reward model for text-to-image generation that improves fine-grained spatial consistency by combining prompt decomposition, expert object/text detection, and vision-language chain-of-thought reasoning. It introduces SpatRelBench, a comprehensive benchmark for evaluating spatial relationships, object attributes, and rendered text placement, demonstrating improved alignment with human judgments in Stable Diffusion and FLUX models.
Entities (6)
Relation Signals (4)
SpatialReward → evaluates → Spatial Consistency
confidence 95% · SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images.
SpatialReward → integratedinto → Flow-GRPO
confidence 95% · We integrate our SpatialReward model into the Flow-GRPO framework
Prompt Decomposer → partof → SpatialReward
confidence 95% · SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities...
SpatRelBench → evaluates → Stable Diffusion
confidence 90% · Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency
Cypher Suggestions (2)
Find all components of the SpatialReward framework · confidence 90% · unvalidated
MATCH (c:Component)-[:PART_OF]->(r:RewardModel {name: 'SpatialReward'}) RETURN c.name
Identify generative models evaluated by SpatialReward · confidence 90% · unvalidated
MATCH (m:GenerativeModel)-[:EVALUATED_BY]->(r:RewardModel {name: 'SpatialReward'}) RETURN m.name
Abstract
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
Tags
Links
- Source: https://arxiv.org/abs/2603.22228v1
- Canonical: https://arxiv.org/abs/2603.22228v1
Full Text
53,759 characters extracted from source content.
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Sashuai Zhou 1,2*, Qiang Zhou 2*, Junpeng Ma 3*, Yue Cao 2, Ruofan Hu 1, Ziang Zhang 1, Xiaoda Yang 1, Zhibin Wang 2, Jun Song 2†, Cheng Yu 2, Bo Zheng 2, Zhou Zhao 1†
1 Zhejiang University, 2 Alibaba Group, 3 Fudan University
Abstract
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models. The project page is available at: https://github.com/LivingFutureLab/SpatialReward
1.
Introduction
* Equal contribution. † Corresponding author.
arXiv:2603.22228v1 [cs.CV] 23 Mar 2026
Recent advances in text-to-image generation [5, 15, 39, 49, 51, 62] have been increasingly fueled by reinforcement learning techniques [14, 17, 34, 40, 57, 74, 83]. Among these, GRPO-based approaches [34, 40, 74] have demonstrated notable effectiveness in improving generative performance.
Figure 1. Performance comparison of SD3.5-M [16] optimized via RL using SpatialReward versus Baseline Rewards.
Figure 2 example prompt: "A photo of a cell phone right of a chair, the words 'WELCOME' are displayed on the cell phone screen, and the chair has a small tag on its back that reads 'SIT'."
A central component in these methods is the pre-trained reward model (RM) [19, 30, 65, 70, 71], which assesses both visual quality and semantic alignment, providing crucial feedback for policy-gradient optimization. Such reward-driven training has led to images that better match human preferences in global appearance and content. Despite these advances, existing RMs primarily focus on global semantics and coarse visual quality, while neglecting fine-grained spatial relationships. As a result, current T2I models still struggle to preserve spatial consistency among objects within a scene [1, 3, 6, 20, 27, 36]. These spatial inconsistencies degrade the realism of generated images and compromise faithful adherence to prompt semantics. We hypothesize that further improvements in T2I spatial generation depend more on verifiable, spatially-aware reward models than on refinements to RL training strategies. In particular, existing approaches suffer from two major deficiencies. Prompt-side rigidity: structured evaluation methods [19, 26, 27], such as Geneval [19], rely on fixed-format prompts and predefined object detectors. For instance, they can handle template-based inputs like "a photo of a purple backpack" but fail to generalize to
complex, compositional prompts frequently found in open-ended generation tasks.
Figure 2. Overall framework of our approach. (a) Standard Flow-GRPO [40] reinforcement learning pipeline for text-to-image generation. (b) The proposed SpatialReward, which parses prompts into structured spatial and attribute constraints, verifies them on generated images via expert detection, and uses vision-language chain-of-thought reasoning to produce the final reward score.
Vision-side overlooking: holistic evaluation methods based on CLIPScore [30, 48, 70, 71] or vision-language models [2, 64, 65, 73] can handle arbitrary prompts and capture global semantics, but without fine-grained spatial verification they fail to detect positional errors, often rewarding scenes that are visually plausible yet spatially wrong [9, 38, 43, 59, 80]. In this paper, we propose SpatialReward, a verifiable reward model explicitly built for fine-grained evaluation of spatial layouts. Regarding the prompt side, we introduce a Prompt Decomposer to extract core entities, attributes, and spatial relations from arbitrary free-form prompts. This normalization enables expert detectors to operate regardless of prompt format, thereby enriching the model's textual perception and supporting performance improvements in diverse generation scenarios [7, 56, 78, 81]. Regarding the visual side, we draw inspiration from the success of rule-based, verifiable rewards in logical reasoning [21, 42, 77], where explicit and checkable rewards have been shown to significantly improve complex inference performance. We extend this concept to visual spatial evaluation with a collaborative verification mechanism. First, object-related metadata extracted by the Prompt Decomposer is passed to open-set detectors [11, 12, 33, 41], which produce factual and highly verifiable information on object attributes and locations, thereby reducing hallucinations. Since spatial relationship assessment requires a sequence of reasoning steps that link detected facts to relative positions and overall layouts, we incorporate the verified grounding into a chain-of-thought (CoT) process [58, 60, 69] within a vision-language model. This explicit reasoning enables robust reward estimation for complex layouts and achieves greater flexibility than conventional rule-based checks. To enrich the evaluation dimensions for spatial consistency in T2I models, we introduce SpatRelBench.
This benchmark extends assessment beyond simple positional or color attributes to include object orientation, multi-object 3D positioning, complex spatial arrangements, and the placement of rendered text. We integrate our SpatialReward model into the Flow-GRPO [40] framework, an RL approach built upon GRPO, using Stable Diffusion [16] and FLUX [31] as base models. Experimental results show that our approach significantly enhances the spatial consistency of generated images. Our main contributions are as follows:
• We present SpatialReward, a verifiable spatial reward model combining prompt decomposition, expert detection, and chain-of-thought reasoning.
• We introduce SpatRelBench, a benchmark extending spatial evaluation to fine-grained object attributes, inter-object relations, and rendered-text placement.
• We conduct extensive experiments to verify that reinforcement learning with a verifiable reward model can significantly enhance spatial consistency in T2I generation.
2. Related Work
2.1. RL-based T2I Optimization
Diffusion-based T2I models, such as Stable Diffusion [16] and FLUX [31], have achieved high visual quality, while reinforcement learning has been shown to effectively improve generation performance across diverse prompts. Typical approaches train preference-based reward models to guide generation [4, 72], and recent work has further adapted GRPO-style optimization to text-to-image generation [40, 74, 79], improving text-image alignment and generalization. These advances are broadly relevant to multimodal applications that require fine-grained understanding, semantic alignment, and knowledge-enhanced reasoning [24, 25, 61, 76, 82]. This trend motivates our design of a verifiable reward model specialized for fine-grained spatial relationship assessment in generated images.
2.2.
Reward Models for T2I Generation
Existing reward models can be broadly classified into structured methods [19, 26, 27] and holistic scorer methods [30, 48, 65, 70, 71, 73]. Structured approaches, such as Geneval [19], rely on fixed-format prompts and predefined object detectors to verify attributes and spatial arrangements, achieving high precision within narrow templates but generalizing poorly to free-form inputs. Holistic scorers include CLIP-based regressors such as ImageReward [71], PickScore [30], and HPSv2 [70], which fine-tune CLIP to predict human-preference scores. More recently, vision-language model (VLM) backbones [2, 32] have been adopted to capture richer semantic and spatial cues, with representative methods including VisionReward [73] and UnifiedReward [63, 65]. While VLM-based RMs improve flexibility over structured pipelines, they often overlook detailed spatial inconsistencies, highlighting the importance of verifiable reward models with robust spatial reasoning to ensure trustworthy evaluation across diverse prompts.
2.3. Benchmarks for T2I Evaluation
Early evaluation of image generation models relied on generic metrics such as Fréchet Inception Distance (FID) [22], Inception Score (IS) [52], and CLIPScore [30, 48, 70], which, while effective for assessing overall image quality, fall short in capturing fine-grained image-text alignment and complex spatial relations. To remedy these issues, specialised benchmarks have emerged. GenEval [19] and T2I-CompBench [26, 27] are object-centric benchmarks focusing on fundamental aspects such as object attributes and positions, typically using object detectors for automated scoring. More recent benchmarks [18, 44, 45] employ advanced VLMs [2, 73] to evaluate specific reasoning skills. However, current spatial benchmarks tend to ignore fine-grained inter-object relations, including orientation, 3D spatial positioning, and text placement.
These gaps motivated us to develop SpatRelBench, a benchmark with richer spatial relation coverage.
3. SpatialReward
SpatialReward operates in three stages: Section 3.1 parses the prompt into structured spatial and attribute constraints; Section 3.2 verifies these constraints on the generated image using expert detection models, yielding verified rewards; Section 3.3 leverages a vision-language model with chain-of-thought reasoning to infer spatial relations and aggregate the results into the final reward score.
3.1. Prompt Decomposition
For RL-based training to generalize effectively, it is essential to handle prompts describing complex spatial arrangements of multiple objects [19, 20, 26]. The spatial reward in our framework aims to evaluate such cases by extracting accurate inter-object positional relations for subsequent verification using expert detection models. However, reliable detection requires that the prompt content be first decomposed into structured elements: separating each subject, its attributes, and the spatial relations involved. Free-form prompts in the wild often contain irrelevant context or merge descriptions of different objects, introducing ambiguity that undermines detection accuracy. To mitigate this, we introduce a Prompt Decomposer D that transforms a free-form prompt P into a structured constraint set C:

C = D(P) = (tag, C_inc, C_exc),  (1)

where tag denotes the primary evaluation category (e.g., counting, orientation, spatial relation), C_inc contains inclusion constraints, and C_exc contains exclusion constraints. Each atomic constraint specifies object category, quantity, attributes, spatial relations, or textual inscriptions. Inspired by metadata-driven evaluation frameworks [19, 26], we constructed a dataset of approximately 100k multi-object metadata instances, explicitly defining attributes, counts, spatial relations, and associated text for each subject.
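As a concrete illustration, the constraint set C = (tag, C_inc, C_exc) can be represented as a small data structure. The field names below are illustrative assumptions rather than the paper's actual schema; the example decomposes the running cell-phone/chair prompt from Figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical sketch of the constraint set C = (tag, C_inc, C_exc) produced
# by the Prompt Decomposer. Field names are illustrative, not the paper's
# actual schema.

@dataclass
class AtomicConstraint:
    obj: str                                    # object category
    count: int = 1                              # target quantity
    color: Optional[str] = None                 # attribute constraint
    orientation: Optional[str] = None           # e.g. "facing left"
    text: Optional[str] = None                  # required rendered text
    relation: Optional[Tuple[str, str]] = None  # e.g. ("right of", "chair")

@dataclass
class ConstraintSet:
    tag: str                                    # counting / orientation / spatial relation
    inclusions: List[AtomicConstraint] = field(default_factory=list)
    exclusions: List[AtomicConstraint] = field(default_factory=list)

# Decomposing the running example prompt from Figure 2:
C = ConstraintSet(
    tag="spatial relation",
    inclusions=[
        AtomicConstraint(obj="cell phone", text="WELCOME",
                         relation=("right of", "chair")),
        AtomicConstraint(obj="chair", text="SIT"),
    ],
)
print(C.tag, len(C.inclusions), len(C.exclusions))  # spatial relation 2 0
```

In the paper this decomposition is produced by a fine-tuned Qwen2.5-VL-7B model rather than hand-written rules; the structure above only fixes the target format the decomposer must emit.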
Using GPT-4o [46], we generated diverse natural-language prompts from these metadata, yielding (prompt, metadata) pairs for supervised training. We fine-tuned a Qwen2.5-VL-7B [2] model to accurately extract core meta-attributes from unrestricted prompts. This structured decomposition provides a reliable foundation for subsequent fine-grained image evaluation in our pipeline.
3.2. Fine-grained Verifiable Rewards
While the capabilities of VLMs are advancing rapidly, existing work [29, 38, 43, 59, 67, 80] has shown that even state-of-the-art models struggle with compositional text prompts involving multiple objects, attribute binding,
Figure 3 summarizes SpatRelBench: five task categories (complex spatial relations, object orientation, three-dimensional relations, text-position accuracy, and text-counting consistency), a construction pipeline that draws metadata from COCO-80, Objects-365, and ImageNet-1k, rewrites prompts with Gemini-2.5-Pro, generates images with a T2I model, and applies expert checks plus manual review, over roughly 2,000 samples.
Figure 3.
Overview of SpatRelBench, depicting benchmark tasks and their data distribution (a), the construction pipeline (b), and the evaluation methodology (c) designed to assess spatial relation understanding in text-to-image models.
spatial-action relationships, counting, and logical reasoning. Relying solely on VLMs for these tasks makes it difficult to obtain stable and verifiable objective reward scores. Fortunately, modern open-domain object detection [10, 35, 41, 50, 66] and Optical Character Recognition (OCR) models [12, 33, 68] demonstrate accuracy that significantly surpasses the judgmental capabilities of VLMs, providing objective scores that closely align with human evaluation standards. Leveraging the constraints extracted by the Prompt Decomposer in Section 3.1, we integrate these specialized detectors to perform precise, criterion-specific verification. This process yields a sub-reward for each positive constraint c_i ∈ C_inc, providing a verifiable and quantitatively accurate signal for assessing spatial positions, object attributes, and other fine-grained relationships.
Object Attribute and Presence Reward. For each inclusion constraint c ∈ C_inc that refers to an object, we evaluate whether the generated image I satisfies the specified visual and spatial properties, including object category, color, target count, orientation, and depth ordering. These properties are extracted via specialized detection models [10, 41, 66, 75]. Given a target category in c, an object detector F_det [10, 41] is applied to the generated image I, yielding candidate bounding boxes

D_c = {(B_j, s_j)}_{j=1}^{k},  (2)

where B_j denotes a bounding box and s_j ∈ [0, 1] its confidence score.
Applying a confidence threshold τ_det produces the verified set of detections

B_c = {B_j | (B_j, s_j) ∈ D_c ∧ s_j ≥ τ_det},  (3)

whose cardinality N̂_c = |B_c| forms the presence reward

R_presence(c) = I(N̂_c > 0),  (4)

as well as the count reward

R_count(c) = exp(−|N̂_c − N*_c|),  (5)

in which N*_c is the target count given in c. Beyond category and quantity, object attributes are also verified. The color reward is defined as

R_color(c) = sim_color(C_det, C*),  (6)

where C_det is obtained via a CLIP-based [48] classifier that evaluates cropped object regions against prompt templates combining each candidate color with the object class name. The top-scoring color is then compared to the target C* using sim_color(·). Similarly, the orientation reward assesses angular consistency as

R_ori(c) = I(|θ_det − θ*| ≤ δ_θ),  (7)

where the detected orientation θ_det is obtained from an orientation-sensitive model [66], θ* is the target in c, and δ_θ specifies the tolerance. For 3D spatial reasoning, the depth reward evaluates whether the relative depth ordering matches the target:

R_depth(c) = exp(−|d_rank − d*_rank|),  (8)

where d_rank is the rank order inferred via monocular depth estimation [75] and d*_rank is the target ordering. This formulation enforces comprehensive verification of every inclusion constraint c ∈ C_inc against the generated image, capturing both appearance fidelity and fine-grained spatial consistency.
Text Content and Localization Reward. For prompts that require rendering specific textual content within objects, the reward must jointly assess semantic correctness and spatial placement. Given a target object B_obj and required text T*, a global OCR model F_ocr [12, 68] extracts a set of detected text-box pairs T_rec = {(T′_j, B′_j)}_{j=1}^{m}, where B′_j and B_obj denote bounding boxes for detected text regions and the target object, respectively.
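Before the text reward is formalized, the object-level sub-rewards of Eqs. (3)-(8) can be sketched in a few lines. This is a minimal illustration assuming detections arrive as (box, score) pairs; the threshold and tolerance defaults (τ_det, δ_θ) are placeholder values, not the paper's settings.

```python
import math

# Minimal sketch of the detection-based sub-rewards in Eqs. (3)-(8). The
# detector outputs and the defaults for tau_det / delta_theta are
# illustrative stand-ins for the paper's expert models and settings.

def verified_boxes(detections, tau_det=0.5):
    """Eq. (3): keep boxes whose confidence clears the threshold."""
    return [box for box, score in detections if score >= tau_det]

def presence_reward(detections, tau_det=0.5):
    """Eq. (4): 1 if at least one verified detection exists."""
    return 1.0 if verified_boxes(detections, tau_det) else 0.0

def count_reward(detections, target_count, tau_det=0.5):
    """Eq. (5): exp(-|N_hat - N*|) decays with the count error."""
    n_hat = len(verified_boxes(detections, tau_det))
    return math.exp(-abs(n_hat - target_count))

def orientation_reward(theta_det, theta_target, delta_theta=15.0):
    """Eq. (7): 1 if the detected angle is within tolerance."""
    return 1.0 if abs(theta_det - theta_target) <= delta_theta else 0.0

def depth_reward(rank_det, rank_target):
    """Eq. (8): exp(-|d_rank - d*_rank|) penalizes depth-order errors."""
    return math.exp(-abs(rank_det - rank_target))

# Two chair candidates, only one above threshold; the prompt asks for one:
dets = [((5, 60, 360, 370), 0.92), ((400, 80, 120, 200), 0.31)]
print(presence_reward(dets))   # 1.0
print(count_reward(dets, 1))   # 1.0 (one verified box, target count one)
```

The color reward of Eq. (6) is omitted here since it depends on a CLIP classifier; the remaining sub-rewards are pure arithmetic over detector outputs, which is what makes them directly verifiable.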
The textual reward is defined by identifying the text instance that best matches T* in content and is correctly localized within B_obj:

R_text(T*, B_obj) = max_{(T′_j, B′_j) ∈ T_rec} sim(T*, T′_j) · IoA(B′_j, B_obj),  (9)

where sim(·) measures normalized lexical similarity and IoA quantifies the degree of spatial containment:

IoA(B_text, B_obj) = Area(B_text ∩ B_obj) / Area(B_text).  (10)

A high text reward is given only when the generated text matches the target string and appears within the correct object bounding box, ensuring prompt fidelity in tasks involving embedded text.
3.3. Spatial Chain-of-Thought Reasoning
Although the fine-grained object and text rewards provide reliable verification for individual attributes, determining complex spatial relations between multiple entities requires higher-level reasoning [8, 20, 43]. Simple rule-based geometric checks often struggle with nuanced semantics (e.g., distinguishing "on" from "above") and with context-dependent layouts that cannot be resolved from geometry alone. To address this, we adopt Qwen2.5-VL [2] as our Chain-of-Thought reasoning backbone, using verified detection-based signals as grounding to reduce hallucination. For the spatial relation between entities e_A and e_B, we construct a CoT prompt P_CoT comprising: (1) the target relation r, (2) their detected bounding boxes B_A and B_B, and (3) the set of attribute rewards for each object from previous stages {R_pres, ..., R_ori, R_depth, R_text}. By explicitly providing these verifiable signals, the VLM is guided to reason step-by-step: first interpreting each attribute reward relative to the bounding boxes, then performing geometric analysis, and finally inferring whether the relation r holds. The output of F_vlm is restricted to a structured format containing the reasoning trace and a final score.
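The text reward of Eqs. (9)-(10) reduces to a max over OCR hits of lexical similarity weighted by containment. The sketch below assumes (x, y, w, h) boxes and uses difflib's SequenceMatcher as a stand-in for the unspecified normalized similarity sim(·).

```python
from difflib import SequenceMatcher

# Sketch of the text reward in Eqs. (9)-(10). Boxes are (x, y, w, h); the
# lexical-similarity choice (SequenceMatcher) is an assumption, as the paper
# only specifies a normalized similarity sim(.).

def ioa(b_text, b_obj):
    """Eq. (10): intersection area over the text box's own area."""
    x1, y1, w1, h1 = b_text
    x2, y2, w2, h2 = b_obj
    ix = max(0, min(x1 + w1, x2 + w2) - max(x1, x2))
    iy = max(0, min(y1 + h1, y2 + h2) - max(y1, y2))
    return (ix * iy) / (w1 * h1) if w1 * h1 > 0 else 0.0

def text_reward(target, ocr_results, b_obj):
    """Eq. (9): best OCR hit, weighted by containment in the object box."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return max((sim(target, t) * ioa(b, b_obj) for t, b in ocr_results),
               default=0.0)

# OCR found "WELCOME" fully inside the phone box and "SIT" elsewhere:
phone_box = (380, 240, 60, 100)
ocr = [("WELCOME", (380, 275, 60, 30)), ("SIT", (155, 110, 60, 40))]
print(round(text_reward("WELCOME", ocr, phone_box), 2))  # 1.0
```

Because IoA normalizes by the text box's own area, a text region fully inside the object box scores 1 even when the object box is much larger, which is exactly the containment behavior Eq. (10) encodes.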
We parse the score using a function P_score to yield the spatial-consistency reward:

R_spatial = P_score(F_vlm(P_CoT(r, B_A, B_B, attributes))),  (11)

where attributes denotes the verified property scores (e.g., presence, orientation, depth, text) for both entities from earlier detection stages. To enhance robustness and avoid overfitting to positive cases, we incorporate explicit penalties for satisfied exclusion constraints. For each inclusion constraint c ∈ C_inc and exclusion constraint c ∈ C_exc, both derived from the earlier Prompt Decomposer, the CoT module produces R_spatial(c) and calculates the aggregated spatial score:

R_total = Σ_{c ∈ C_inc} R^+_spatial(c) − Σ_{c ∈ C_exc} R^−_spatial(c).  (12)

This formulation rewards satisfaction of required relations while penalizing the presence of undesired ones, producing a spatial score that is both semantically informed and grounded in verifiable evidence.
4. SpatRelBench
To enable more fine-grained and comprehensive evaluation of spatial relationships in generated images, we introduce SpatRelBench, a benchmark specifically designed to assess spatial consistency in complex scenarios. The evaluation protocol covers five primary dimensions: (1) Complex spatial relations, which assess the relative positioning and arrangement of multiple objects within a scene; (2) Object orientation, which checks whether each object is depicted facing the correct direction; (3) Three-dimensional relations, which evaluate depth ordering and 3D layout consistency among objects; (4) Text-position accuracy, which verifies whether rendered text appears at the correct location relative to the associated object, using optical character recognition; and (5) Text-counting consistency, which determines whether the quantity of rendered text across multiple objects matches the prompt specification. As illustrated in Fig.
3, we extend the evaluation category set beyond the COCO-80 [37] classes to include ImageNet-1k [13] and Objects365 [54] categories, covering both common objects and fine-grained subcategories. Prompts are generated using Gemini-2.5-Pro [28] and then manually validated to ensure both diversity and correctness. During evaluation, domain-specific expert models are employed to score each dimension. For every sub-task, a binary decision (correct or incorrect) is recorded, and the overall accuracy is computed by normalizing the number of correct judgments over the total number of requirements. The current release contains approximately 2,000 annotated entries. Built on a modular pipeline, SpatRelBench is designed to facilitate future expansion to additional categories, tasks, or spatial dimensions, thus providing a challenging and discriminative benchmark for assessing complex spatial consistency in T2I models.

Table 1. Quantitative comparison of T2I generation models aligned with different reward models. Results are on GenEval (80-Obj) and SpatRelBench (1k-Obj), where parentheses indicate the number of object categories. S-Obj: single object, T-Obj: two objects, Cnt: counting, Pos: positions, Attr-C: attribute (color), P-Text: position-text OCR, C-Text: counting-text OCR, Cpx: complex spatial relations, Ori: orientation, 3DRel: 3D spatial relations, Overall: average score over all metrics in each dataset. Bold denotes the best score, and underline denotes the second best.

| Reward Model | S-Obj | T-Obj | Cnt | Color | Pos | Attr-C | GenEval Overall | P-Text | C-Text | Cpx | Ori | 3DRel | SpatRel Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary T2I models | | | | | | | | | | | | | |
| GPT Image 1 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 | 0.84 | 0.50 | 0.22 | 0.53 | 0.15 | 0.45 | 0.37 |
| Seedream 3.0 | 0.99 | 0.96 | 0.91 | 0.93 | 0.47 | 0.80 | 0.84 | 0.25 | 0.20 | 0.11 | 0.07 | 0.41 | 0.21 |
| Qwen-Image | 0.99 | 0.92 | 0.90 | 0.88 | 0.76 | 0.77 | 0.87 | 0.22 | 0.21 | 0.32 | 0.07 | 0.32 | 0.23 |
| Based on Stable Diffusion 3.5 | | | | | | | | | | | | | |
| SD3.5-M | 0.98 | 0.79 | 0.59 | 0.80 | 0.28 | 0.58 | 0.67 | 0.40 | 0.13 | 0.22 | 0.07 | 0.36 | 0.23 |
| + TextOCR | 0.99 | 0.87 | 0.59 | 0.81 | 0.33 | 0.58 | 0.70 | 0.48 | 0.24 | 0.23 | 0.09 | 0.38 | 0.28 |
| + PickScore | 0.98 | 0.94 | 0.77 | 0.83 | 0.31 | 0.62 | 0.74 | 0.36 | 0.15 | 0.28 | 0.06 | 0.33 | 0.24 |
| + QwenVL | 0.99 | 0.58 | 0.48 | 0.81 | 0.24 | 0.55 | 0.61 | 0.32 | 0.21 | 0.30 | 0.12 | 0.37 | 0.26 |
| + ImageReward | 0.95 | 0.92 | 0.52 | 0.92 | 0.70 | 0.77 | 0.80 | 0.42 | 0.23 | 0.32 | 0.11 | 0.42 | 0.30 |
| + UnifiedReward | 1.00 | 0.98 | 0.90 | 0.86 | 0.81 | 0.78 | 0.89 | 0.46 | 0.26 | 0.40 | 0.12 | 0.40 | 0.33 |
| + SpatialReward | 1.00 | 0.99 | 0.86 | 0.96 | 0.98 | 0.91 | 0.95 | 0.51 | 0.33 | 0.43 | 0.26 | 0.55 | 0.42 |
| Imp. over Baseline | +0.02 | +0.20 | +0.27 | +0.16 | +0.70 | +0.33 | +0.28 | +0.11 | +0.20 | +0.21 | +0.19 | +0.19 | +0.19 |
| Based on FLUX | | | | | | | | | | | | | |
| FLUX1-dev | 0.97 | 0.88 | 0.48 | 0.82 | 0.67 | 0.76 | 0.76 | 0.49 | 0.25 | 0.23 | 0.04 | 0.38 | 0.28 |
| + SpatialReward | 1.00 | 0.99 | 0.89 | 0.98 | 0.99 | 0.94 | 0.97 | 0.63 | 0.40 | 0.52 | 0.32 | 0.45 | 0.46 |
| Imp. over Baseline | +0.03 | +0.11 | +0.41 | +0.16 | +0.32 | +0.18 | +0.21 | +0.14 | +0.15 | +0.29 | +0.28 | +0.17 | +0.18 |

5. Experiments
5.1. Implementation Details
Training Configuration. We apply reinforcement learning to SD3.5-M [16] and FLUX1-dev [31], adopting the GRPO approach provided by the Flow-GRPO [40] framework. During training, we employ a sampling timestep of T = 10, a group size of G = 24, a noise level of a = 0.7, and a fixed image resolution of 512 × 512. For evaluation, we increase the timestep to T = 40. Parameter-efficient tuning is enabled by LoRA with a rank r = 32 and a scaling factor α = 64. The KL regularization coefficient β is set to 0.04. All models are trained on 16 NVIDIA L20 GPUs.
Evaluation Models and Baselines.
We evaluate the pro- posed SpatialReward model by comparing it against a set of established reward models under identical exper- imental conditions.The baseline set comprises Tex- tOCR [12], PickScore [30], Qwen2.5-VL [2], ImageRe- ward [71], and UnifiedReward [65]. To ensure fairness, all reward models are trained on the same dataset, consist- ing of 100k spatial-relation prompts automatically gener- ated by GPT-4o and verified for correctness and diversity. The backbones are optimized via Flow-GRPO framework with identical hyperparameter settings. 5.2. Quantitative Comparison Evaluation on Spatial-Consistency Benchmarks As shown in Table 1, integrating the proposed SpatialReward into both Stable Diffusion and FLUX models yields consis- tent and substantial improvements across all evaluation di- mensions in the standard GenEval benchmark and Spatial- RelBench. On GenEval, SpatialReward enhances perfor- mance not only on common object-level metrics but also on complex compositional tasks, while on SpatialRelBench the SD3.5-M SD3.5-M + SpatialReward SD3.5-M + Qwen2.5-VL SD3.5-M + UnifiedReward SD3.5-M + Geneval A toy train standing in front of a vintage couch, with the couch positioned behinda large and bright orange carrot. Three distinct bookson a polished wooden shelf, left to right: the left book spine reads “Story”, middle book “Adventure” andrightbook “Knowledge”. Four white ceramic sinksin a row: the second from the left inscribed "Clean", the third from the left in scribed "Wash". A manfacing leftis positioned beside a vending machine in a casual setting. Figure 4. Qualitative comparison of generated image quality across different methods. Table 2. General-Purpose comparison on Wise, DPG, Aesthetic, and PickScore metrics. 
ModelWiseDPGAestheticPickScore SD3.5-M0.4583.965.3922.34 + SpatialReward0.4684.085.2322.52 Flux1-dev0.5083.846.1322.45 + SpatialReward0.5284.196.1523.22 gains are particularly notable in the challenging dimensions that require fine-grained spatial reasoning, such as multi- object relation understanding, orientation accuracy, depth ordering, and text–position alignment. The improvements manifest across both benchmarks and for different back- bone models, demonstrating that SpatialReward effectively enforces spatial coherence and semantic fidelity in gener- ated images. These results confirm the robustness and gen- eralizability of our approach, and highlight the discrimi- native power of SpatialRelBench in revealing performance gaps that remain hidden under single-dimensional evalua- tion protocols. Table 3. Correlation and accuracy with human spatial-consistency judgments; accuracy measured at threshold τ = 0.8. Reward ModelSpearman ρPearson rAccuracy CLIPScore0.420.400.68 ImageReward0.480.450.70 UnifiedReward0.510.490.72 VisionReward0.550.530.74 SpatialReward0.630.610.79 Evaluation on General-Purpose Metrics To assess the generalization of SpatialReward beyond spatial-focused benchmarks, we evaluated it using widely adopted metrics: overall fidelity (Wise [45], DPG [23]), visual appeal (Aes- thetic [53]), and prompt–image alignment (PickScore [30]). With both Stable Diffusion and FLUX backbones, Spatial- Reward matched or exceeded baseline performance across all metrics, except for a slight drop in Aesthetic. This sug- gests reward optimization maintains overall visual quality and semantic alignment. Gains on Wise and PickScore re- flect improved holistic prompt–image alignment, while sta- ble Aesthetic and DPG scores indicate enhanced spatial rea- soning without perceptual loss. Table 4. Ablation results for SpatialReward. Scores (accuracy) are reported for GenEval, SpatRel: SpatialRelBench, and T2IComp: T2I-CompBench. 
T2I-CompBench values are averaged over its 2D and 3D spatial-consistency tasks.

Removed Component         GenEval   SpatRel.   T2IComp.
Full SpatialReward        95.2      37.1       50.1
– Exclusion Constraints   90.5      25.9       45.9
– Expert Detection        70.3      21.6       39.2
– CoT Reasoning           94.2      27.9       47.5

5.3. Human Alignment on Spatial Consistency

We evaluated whether SpatialReward aligns more closely with human perception of spatial consistency through a study on 500 prompt–image pairs from SpatialRelBench, each generated by GPT-4o and labeled by annotators as correct or incorrect in depicting the intended spatial relations. Compared with representative general-purpose reward models, SpatialReward showed the highest correlation with human judgments (Spearman’s ρ [55], Pearson’s r [47]) and the best classification accuracy at a fixed threshold τ = 0.8, confirming that explicitly modeling fine-grained spatial relations produces assessments more consistent with human perception and serves as a reliable metric for complex spatial tasks in text-to-image generation.

5.4. Ablation Study

To assess the contribution of each component within SpatialReward, we conduct ablation experiments by selectively removing three key modules: Exclusion Constraints, Expert Detection, and Chain-of-Thought Reasoning. The resulting accuracy variations across GenEval, SpatialRelBench, and T2I-CompBench are summarized in Table 4.

Effect of Exclusion Constraints Removing the exclusion constraint, which introduces negative samples to penalize undesired spatial configurations, consistently reduces accuracy across all benchmarks. Incorporating such penalties improves reward robustness, mitigates over-optimization, and prevents reward hacking that could degrade real-world performance. This mechanism is particularly beneficial for prompts containing distractor objects, preventing performance degradation in realistic settings.
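The exclusion-constraint idea described above can be sketched as a penalty term on a verifiable reward. The scoring scheme, penalty weight, and function names below are illustrative assumptions, not the paper's exact formulation:

```python
# Hedged sketch of an exclusion-aware spatial reward: verified positive
# constraints (relations, attributes, rendered text) raise the score,
# while detections matching excluded configurations (e.g., distractor
# objects) subtract a penalty. The 0.5 weight and clamping to [0, 1]
# are illustrative choices, not values from the paper.

def spatial_reward(satisfied, total, excluded_hits, penalty=0.5):
    """Fraction of satisfied positive constraints, minus a fixed
    penalty per violated exclusion constraint, clamped to [0, 1]."""
    base = satisfied / total if total else 1.0
    score = base - penalty * excluded_hits
    return max(0.0, min(1.0, score))

# All four constraints met, but one excluded configuration detected:
# the reward drops from 1.0 to 0.5, discouraging reward hacking.
hacked = spatial_reward(4, 4, excluded_hits=1)  # 0.5
clean = spatial_reward(3, 4, excluded_hits=0)   # 0.75
```

The point of the penalty term is that an image cannot achieve a high score by satisfying the positive constraints while also realizing a configuration the prompt implicitly rules out.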
Effect of Expert Detection Omitting the expert detection stage, which performs fine-grained verification of object presence, attributes, and rendered text, results in the largest accuracy decline across all benchmarks. This outcome substantiates our core principle that spatial reasoning benefits significantly from verified reward signals, as domain-specific detectors provide reliable and interpretable evidence that general-purpose vision–language models alone cannot consistently ensure.

Figure 5. Effect of CoT reasoning in spatial relations. CoT combines bounding boxes, orientation, and scene semantics, yielding correct classifications where detector-only matching fails (e.g., a frog perched above a sleeping bag is misread as “inside” by bounding-box containment alone, and a dog facing a fire hydrant is resolved correctly only with full-image semantics).

Effect of Chain-of-Thought Reasoning Eliminating the Chain-of-Thought reasoning process, which integrates verified object attributes with geometric analysis to infer spatial relations, results in moderate accuracy reductions. Although the quantitative decline is smaller than that observed for other components, qualitative case studies (Fig. 5) highlight CoT’s critical role in complex scenarios.
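The failure mode of geometry-only matching shown in Fig. 5 is easy to reproduce with a bounding-box rule. A minimal sketch (the coordinates, rule, and precedence order are illustrative; y grows downward, as in image space):

```python
# Geometric-only baseline: classify a spatial relation from bounding
# boxes alone, with containment taking precedence over vertical order.
# Boxes are (x1, y1, x2, y2); y increases downward (image coordinates).

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def bbox_relation(a, b):
    if contains(b, a):
        return "inside"
    if a[3] <= b[1]:  # a's bottom edge lies above b's top edge
        return "above"
    return "other"

# A frog resting on the visible top surface of a sleeping bag: the
# frog's box is fully contained in the bag's box, so the geometric rule
# answers "inside", whereas reasoning over surface contact and scene
# semantics recovers "above".
frog = (40, 30, 60, 50)
bag = (20, 25, 90, 70)
print(bbox_relation(frog, bag))  # prints "inside"
```

The Spatial CoT stage augments exactly this kind of geometric evidence with orientation estimates and full-image semantics before assigning the final reward, which is why removing it hurts most on relations that boxes alone cannot disambiguate.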
Examples include distinguishing “above” from “inside” and resolving intricate multi-object arrangements that rule-based heuristics cannot effectively handle.

6. Conclusion

We presented SpatialReward, a verifiable spatial-consistency reward model, and SpatialRelBench, a benchmark for fine-grained evaluation of spatial relations in text-to-image generation. By combining constraint parsing, expert-based verification, and chain-of-thought reasoning, SpatialReward effectively enforces spatial fidelity in complex scenes. Experiments on both SpatialRelBench and general evaluation metrics, along with human alignment studies, show that our method surpasses existing reward models in capturing spatial correctness without compromising overall image quality. The results support the use of verifiable spatial reward as a reliable and adaptable objective for future text-to-image generation models.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant No. U24A20326, and the Joint Fund Project “End-Cloud Collaborative Lightweight Autonomous Intelligence for Content Generation”.

References

[1] Sumukh K Aithal, Pratyush Maini, Zachary Lipton, and J Zico Kolter. Understanding hallucinations in diffusion models through mode interpolation. Advances in Neural Information Processing Systems, 37:134614–134644, 2024. 1
[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. 2, 3, 5, 6
[3] Lital Binyamin, Yoad Tewel, Hilit Segev, Eran Hirsch, Royi Rassin, and Gal Chechik. Make it count: Text-to-image generation with an accurate number of objects. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13242–13251, 2025. 1
[4] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine.
Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023. 3
[5] Boyuan Cao, Jiaxin Ye, Yujie Wei, and Hongming Shan. RepLDM: Reprogramming pretrained latent diffusion models for high-quality, high-efficiency, high-resolution image generation. In Adv. Neural Inform. Process. Syst. 1
[6] Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, et al. Getting it right: Improving spatial consistency in text-to-image models. In European Conference on Computer Vision, pages 204–222. Springer, 2024. 1
[7] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 2
[8] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025. 5
[9] Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. arXiv preprint arXiv:2402.03190, 2024. 2
[10] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16901–16911, 2024. 4
[11] Jaeyoung Chung, Jeongtaek Oh, and Kyoung Mu Lee. Depth-regularized optimization for 3D Gaussian splatting in few-shot images. In CVPR, 2024.
2
[12] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. PaddleOCR 3.0 technical report, 2025. 2, 4, 5, 6
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009. 5
[14] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023. 1
[15] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. 1
[16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 1, 2, 3, 6
[17] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In NeurIPS, 2023. 1
[18] Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, and Dan Roth. Commonsense-T2I challenge: Can text-to-image generation models understand commonsense? arXiv preprint arXiv:2406.07546, 2024. 3
[19] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. NIPS, 36:52132–52152, 2023.
1, 3
[20] Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022. 1, 3, 5
[21] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017. 3
[23] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024. 7
[24] Hai Huang, Yan Xia, Shengpeng Ji, Shulei Wang, Hanting Wang, Minghui Fang, Jieming Zhu, Zhenhua Dong, Sashuai Zhou, and Zhou Zhao. Enhancing multimodal unified representations for cross modal generalization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2353–2366, Vienna, Austria, 2025. Association for Computational Linguistics. 3
[25] Hai Huang, Yan Xia, Sashuai Zhou, Hanting Wang, Shulei Wang, and Zhou Zhao. Bridging domain generalization to multimodal domain generalization via unified representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22488–22498, 2025. 3
[26] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. NIPS, 36:78723–78747, 2023. 1, 3
[27] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An enhanced and comprehensive benchmark for compositional text-to-image generation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1, 3
[28] Yichen Huang and Lin F Yang. Gemini 2.5 Pro capable of winning gold at IMO 2025. arXiv preprint arXiv:2507.15855, 2025. 5
[29] Amita Kamath, Jack Hessel, and Kai-Wei Chang. Text encoders bottleneck compositionality in contrastive vision-language models. arXiv preprint arXiv:2305.14897, 2023. 3
[30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. NIPS, 36:36652–36663, 2023. 1, 2, 3, 6, 7
[31] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024. 2, 3, 6
[32] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaVA-OneVision: Easy visual task transfer. arXiv:2408.03326, 2024. 3
[33] Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, et al. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001, 2022. 2, 4
[34] Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv preprint arXiv:2507.21802, 2025. 1
[35] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022. 4
[36] Long Lian, Boyi Li, Adam Yala, and Trevor Darrell. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023. 1
[37] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
5
[38] Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, and Deva Ramanan. Revisiting the role of language priors in vision-language models. arXiv preprint arXiv:2306.01879, 2023. 2, 3
[39] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 1
[40] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025. 1, 2, 3, 6
[41] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024. 2, 4
[42] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 2
[43] Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. CREPE: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10910–10921, 2023. 2, 3, 5
[44] Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. PhyBench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802, 2024. 3
[45] Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025. 3, 7
[46] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 3
[47] K. Pearson.
Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242, 1895. 8
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021. 2, 3, 4
[49] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021. 1
[50] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024. 4
[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022. 1
[52] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016. 3
[53] Christoph Schuhmann. LAION aesthetics. https://github.com/LAION-AI/aesthetic-predictor, 2022. 7
[54] X. Shang, D. Wang, G. Chen, et al. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 5
[55] C. Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15(1):72–101, 1904. 8
[56] Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6619–6628, 2019.
2
[57] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In CVPR, pages 8228–8238, 2024. 1
[58] Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, et al. VR-Thinker: Boosting video reward models through thinking-with-image reasoning. arXiv preprint arXiv:2510.10518, 2025. 2
[59] Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Equivariant similarity for vision-language foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11998–12008, 2023. 2, 3
[60] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022. 2
[61] Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Dong Wang, Mingwei Sun, Yongliang Wang, Niclas Zeller, and Daniel Cremers. LADB: Latent aligned diffusion bridges for semi-supervised domain translation. In DAGM German Conference on Pattern Recognition, pages 221–236. Springer, 2025. 3
[62] Xuqin Wang, Tao Wu, Yanfeng Zhang, Lu Liu, Mingwei Sun, Yongliang Wang, Niclas Zeller, and Daniel Cremers. GeodesicNVS: Probability density geodesic flow matching for novel view synthesis. arXiv preprint arXiv:2603.01010, 2026. 1
[63] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025. 3
[64] Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.
arXiv preprint arXiv:2505.03318, 2025. 2
[65] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025. 1, 2, 3, 6
[66] Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, and Zhou Zhao. Orient Anything: Learning robust object orientation estimation from rendering 3D models, 2024. 4
[67] Zehan Wang, Sashuai Zhou, Shaoxuan He, Haifeng Huang, Lihe Yang, Ziang Zhang, Xize Cheng, Shengpeng Ji, Tao Jin, Hengshuang Zhao, et al. SpatialCLIP: Learning 3D-aware image representations from spatially discriminative language. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29656–29666, 2025. 3
[68] Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704, 2024. 4, 5
[69] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. 2
[70] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023. 1, 2, 3
[71] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023. 1, 2, 3, 6
[72] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation.
Advances in Neural Information Processing Systems, 36:15903–15935, 2023. 3
[73] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024. 2, 3
[74] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025. 1, 3
[75] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024. 4, 5
[76] Xiaoda Yang, Jiayang Xu, Kaixuan Luan, Xinyu Zhan, Hongshun Qiu, Shijun Shi, Hao Li, Shuai Yang, Li Zhang, Checheng Yu, et al. OmniCam: Unified multimodal video generation via camera control. arXiv preprint arXiv:2504.02312, 2025. 3
[77] Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025. 2
[78] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. ReCo: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255, 2023. 2
[79] Zilyu Ye, Zhiyang Chen, Tiancheng Li, Zemin Huang, Weijian Luo, and Guo-Jun Qi. Schedule on the fly: Diffusion time prediction for faster and better image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23412–23422, 2025.
3
[80] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022. 2, 3
[81] Yang Zhang, Teoh Tze Tzun, Lim Wei Hern, and Kenji Kawaguchi. Enhancing semantic fidelity in text-to-image synthesis: Attention regulation in diffusion models. In European Conference on Computer Vision, pages 70–86. Springer, 2024. 2
[82] Chunzheng Zhu, Yangfang Lin, Jialin Shao, Jianxin Lin, and Yijun Wang. Pathology-aware prototype evolution via LLM-driven semantic disambiguation for multicenter diabetic retinopathy diagnosis. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 9196–9205, 2025. 3
[83] Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, and Jianxin Lin. MedEyes: Learning dynamic visual focus for medical progressive diagnosis. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 13916–13924, 2026. 1