
Paper deep dive

Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen

Year: 2026 · Venue: arXiv preprint · Area: cs.AI · Type: Preprint · Embeddings: 70

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/22/2026, 6:07:12 AM

Summary

AFS-Search is a training-free, closed-loop framework for Text-to-Image (T2I) generation built on FLUX.1-dev. It addresses the limitations of open-loop sampling by integrating a Vision-Language Model (VLM) as a semantic critic to perform Parallel Rollout Search (PRS) and Agentic Flow Steering (AFS). This approach allows for real-time diagnosis and correction of intermediate latents, enabling precise spatial grounding and relational reasoning without altering model weights.

Entities (6)

AFS-Search · framework · 100%
FLUX.1-dev · generative-model · 100%
Agentic Flow Steering · methodology · 95%
Parallel Rollout Search · methodology · 95%
VLM · model-architecture · 95%
SAM3 · segmentation-model · 90%

Relation Signals (4)

AFS-Search built upon FLUX.1-dev

confidence 100% · a training-free closed-loop framework built upon FLUX.1-dev.

AFS-Search implements Parallel Rollout Search

confidence 95% · AFS-Search incorporates a training-free closed-loop parallel rollout search

AFS-Search utilizes VLM

confidence 95% · leverages a Vision-Language Model (VLM) as a semantic critic

AFS-Search integrates SAM3

confidence 90% · dynamically steer the velocity field via precise spatial grounding leveraging SAM3

Cypher Suggestions (2)

Identify the base model for AFS-Search. · confidence 95% · unvalidated

MATCH (f:Framework {name: 'AFS-Search'})-[:BUILT_UPON]->(m:Model) RETURN m.name

Find all components and methods used by the AFS-Search framework. · confidence 90% · unvalidated

MATCH (f:Framework {name: 'AFS-Search'})-[r]->(e) RETURN f, r, e

Abstract

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

Tags

ai-safety (imported, 100%) · csai (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

69,231 characters extracted from source content.


Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen¹, Daoxuan Zhang¹, Xiangming Wang¹, Yungeng Liu¹, Haijin Zeng¹⋆, and Yongyong Chen¹⋆

¹ Harbin Institute of Technology, Shenzhen, China · zenghj@hit.edu.cn · cyy2020@hit.edu.cn · ⋆ Corresponding authors.

arXiv:2603.18627v1 [cs.AI] 19 Mar 2026

Fig. 1: From a visual perspective, our AFS-Search provides a closed-loop generation paradigm to achieve precise spatially grounded generation. (Panels compare FLUX.1-dev with AFS-Search (ours) on prompts such as "A red backpack and a blue book", "A mouse on the side of a key", and a scene of a chef holding a pizza tray while a small robot arm pours red wine.)

Abstract. Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities along the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

Keywords: T2I Generation · Vision-Language Model · AFS-Search

1 Introduction

The field of Text-to-Image (T2I) generation has witnessed a paradigm shift with the emergence of Diffusion Models (DMs) [14, 30, 33] and Flow Matching (FM) [20, 21] architectures. Models such as SDXL [23], FLUX.1 [17], and Qwen-Image [37] have demonstrated an unparalleled ability to synthesize high-fidelity images that capture intricate artistic styles and textures. By leveraging large-scale pre-trained text encoders such as T5 [27] and CLIP [26], these models have moved beyond simple object depiction toward generating complex scenes from natural language descriptions. Recently, VLM-based frameworks such as RPG [44], SILMM [25], and AgentComp [46] have emerged, showing great potential to further improve T2I generation via VLM perception and even agentic actions.

Despite these impressive strides, achieving precise spatial grounding and relational reasoning remains a persistent challenge. As shown in Fig. 2, we identify two primary bottlenecks in conventional T2I pipelines: (1) Static text encoders often exhibit an expressive bottleneck when processing complex relational semantics.
Specifically, they struggle to distinguish detailed spatial instructions, which results in underspecified semantic embeddings that fail to capture the nuanced spatial relationships required for accurate image synthesis. (2) Traditional models, as well as recent agent-guided frameworks such as RPG [44], Layout-Guidance [34], and AgentComp [46], follow an open-loop sampling paradigm. While existing agent-based frameworks enhance spatial control through pre-generation planning, they solve an Ordinary Differential Equation (ODE) along a pre-defined trajectory without any internal feedback. Consequently, even minor semantic ambiguities in the initial phase are irreversibly amplified through the discrete integration steps, ultimately leading to stochastic deviations where the final output fails to satisfy the original spatial constraints.

Fig. 2: Motivation of our AFS-Search. Open-loop generation follows a fixed, feed-forward sampling trajectory without intermediate feedback or correction, while closed-loop generation introduces real-time visual feedback (branches whose score falls below a threshold are corrected).

To bridge this gap, we introduce AFS-Search, a training-free closed-loop framework designed to transform T2I generation from a passive sampling process into an active, decision-making procedure. Our core insight is that a generative model should iteratively assess and adjust its generation process rather than producing outputs in a one-shot manner. By integrating a Vision-Language Model (VLM) as a Semantic Critic, our framework enables the system to perceive intermediate generation states and rectify potential errors in real time. Specifically, we propose Agentic Flow Steering (AFS), a novel steering mechanism that diagnoses semantic drifts at critical timestamps and dynamically steers the velocity field of the flow model via precise spatial grounding leveraging SAM3 [1].

Additionally, going beyond simple correction, our AFS-Search framework incorporates Parallel Rollout Search (PRS) to effectively navigate the complex latent landscape. At key bifurcation points, the agent performs lookahead simulations by exploring multiple potential trajectories, including corrective steering and random exploration. By evaluating these branches through VLM-guided rewards, the model selects the optimal path that maximizes alignment with the user's intent. This strategy effectively leverages test-time computation to overcome the inherent randomness of the diffusion process, thereby ensuring robust performance. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. The main contributions are as follows:

– We propose AFS-Search, a training-free framework that reformulates T2I generation as a closed-loop decision-making process instead of a passive sampling task. We provide two versions of AFS-Search to balance performance and real-time computational efficiency.
– We introduce an Agentic Flow Steering mechanism that dynamically rectifies the ODE trajectory via VLM feedback and spatial anchoring via SAM3.
– We implement a Parallel Rollout Search strategy using lookahead simulations to effectively resolve spatial and semantic conflicts during sampling.
– Experimental results show that our AFS-Search-Pro achieves state-of-the-art results across three benchmarks, while AFS-Search-Fast also significantly enhances performance while meeting the requirements for fast generation.
2 Related Work

2.1 Compositional Text-to-Image Generation

T2I generation has achieved remarkable progress, evolving from early GAN-based approaches [29, 42, 47] to the current state-of-the-art Diffusion Models [30, 32] and Flow Matching architectures [17, 20, 21]. Despite their ability to synthesize high-quality textures, standard models often struggle with compositional generation, specifically with correctly binding attributes to objects and strictly following spatial instructions. This limitation stems from the cross-attention mechanism's tendency to mix semantic information across different spatial regions, leading to attribute leakage or catastrophic neglect of objects.

Existing solutions can be broadly categorized into training-based and training-free approaches. Training-based methods [31, 48] fine-tune the backbone or introduce additional adapters to enforce spatial constraints. However, in the era of billion-parameter foundation models trained on massive internet-scale datasets, fine-tuning on limited domain-specific data often yields diminishing returns. More critically, it causes the model to lose its open-world generalizability.

Consequently, training-free methods have gained prominence. Early works utilized attention manipulation [2, 13] to re-weight or mask cross-attention maps. While effective for simple layouts, these heuristic-based methods lack high-level reasoning and often fail in complex scenarios requiring logical planning. Distinct from these, our AFS-Search adopts a test-time search paradigm. We argue that pre-trained models already possess the necessary visual priors; the challenge lies not in learning new features, but in navigating the latent space to locate the correct composition without altering model weights.

2.2 Vision Language Models and Agentic Frameworks

The rapid evolution of Large Multimodal Models (LMMs) [6, 22] has catalyzed a new wave of agentic generation frameworks that leverage the reasoning capabilities of LLMs/VLMs to control T2I synthesis. Early works focused on prompt enhancement and layout planning. For instance, DALL-E 3 [18] and Promptist [12] employ LLMs to rewrite user queries into descriptive captions. Moving beyond text, LayoutGPT [8] and VPGen [5] utilize LLMs to generate intermediate scene layouts to guide diffusion models. While effective for initial grounding, these methods operate in an open-loop manner, lacking mechanisms to verify whether the generated layout is actually respected.

To address this, recent research has pivoted towards feedback-driven iterative frameworks. Idea2Img [45] introduces a multi-turn dialogue where an LMM iteratively revises prompts based on generated drafts. RPG [44] proposes a chain-of-thought planning strategy to decompose complex prompts into sub-regions. AgentComp [46] integrates multiple agent roles to iteratively refine compositional details via an external feedback loop. However, a common limitation of these state-of-the-art agents is their reliance on an external-loop paradigm: they treat the generative model as a black box, rectifying errors solely by modifying the textual input and triggering a full re-generation. This trial-and-error process is computationally inefficient and often struggles to correct fine-grained local attributes without altering the global structure. In contrast, our approach shifts the paradigm from external re-prompting to internal state intervention.
Instead of discarding the entire image upon failure, AFS-Search intervenes directly within the flow-matching trajectory. By performing Parallel Rollout Search on the intermediate latents, we achieve precise, localized corrections without the computational overhead of iterative re-generation.

Fig. 3: The framework of AFS-Search. The pipeline operates in four phases: (1) Prompt Optimization. A VLM rewrites the user prompt into an explicit instruction (the figure's example expands "four seasons in one picture" into a detailed 2x2 window-pane layout). (2) Generating Initial Structure. The FLUX.1-dev model generates an intermediate state up to a bifurcation point. (3) Parallel Rollout Search. A VLM Critic diagnoses the intermediate state to guide search; the system explores a Base Branch, an Exploration Branch, and a Corrective Branch driven by AFS with a SAM3 mask M over the preview x̂₀. (4) Output and Feedback. The optimal trajectory is selected based on VLM scores; if the scores fall below a threshold, a global redesign loop is triggered.

3 Method

3.1 Overview

As illustrated in Fig. 3, AFS-Search reformulates text-to-image generation as a closed-loop decision-making process comprising four integrated phases: (1) Prompt Optimization, where a VLM rewrites abstract user inputs into detailed, spatially explicit instructions to minimize initial ambiguity; (2) Initial Structure Generation, where the FLUX.1-dev [17] model synthesizes latents up to a critical bifurcation point to establish a malleable global layout; (3) Parallel Rollout Search, the core phase where a VLM Critic diagnoses intermediate defects to guide lookahead simulations across three branches (Base, Exploration, and Corrective, the last of which employs SAM3-based Agentic Flow Steering via contrastive guidance) and ultimately selects the optimal trajectory based on reward scores; and (4) Global Feedback, which concludes the generation and triggers a redesign loop if the final output score falls below a safety threshold.

3.2 Preliminaries: Flow Matching and the Open-Loop Limitation

Based on FLUX.1-dev [17], we build our method upon the Flow Matching paradigm, which models the generation process as a Continuous Normalizing Flow (CNF). Given a data distribution $q(x_1)$ and a prior distribution $p(x_0) = \mathcal{N}(x_0; 0, I)$, the flow is defined by a time-dependent vector field $v_t(x)$. The generation process involves solving the following ODE:

$$dx_t = v_\theta(x_t, t, y)\,dt, \quad (1)$$

where $v_\theta$ is a neural network parameterized by $\theta$ and conditioned on the text embedding $y$. Standard sampling integrates this ODE from $t = 0$ to $t = 1$. However, this open-loop integration is prone to error accumulation: without feedback, any semantic misalignment in intermediate steps propagates to the final output.
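To make the open-loop failure mode concrete, the minimal sketch below integrates Eq. (1) with a fixed Euler solver, following this section's convention (noise at $t = 0$, data at $t = 1$). The `velocity` stub is a hypothetical stand-in for the pretrained flow transformer, not the actual FLUX.1-dev interface; the structural point is that no step ever receives feedback, so an early error is carried unchanged to the output.

```python
import torch

def velocity(x, t, y):
    """Hypothetical stand-in for v_theta(x_t, t, y); in the paper this is a
    12B rectified-flow transformer conditioned on the text embedding y."""
    return -x  # toy dynamics, for illustration only

def open_loop_sample(y, shape=(1, 16, 64, 64), num_steps=50):
    """Open-loop Euler integration of Eq. (1) from t = 0 (noise) to t = 1
    (data). No intermediate feedback is used, so any early semantic
    misalignment propagates irreversibly to the final latent."""
    x = torch.randn(shape)              # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + velocity(x, t, y) * dt  # blind Euler step, never corrected
    return x
```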
Fig. 4: Motivation for Parallel Rollout Search. While the standard open-loop trajectory (Base Branch, "keep going") yields a sub-optimal alignment (Score: 8.5), our search mechanism actively explores alternative futures. By comparing the Corrective Branch (guided by AFS, Score: 9.5) and the Exploration Branch ("add noise", Score: 9.0) against the baseline, the agent identifies and selects the optimal trajectory that best matches the prompt ("A brown bear and a red book").

3.3 Parallel Rollout Search (PRS)

Motivation. Standard open-loop sampling suffers from stochastic failure, where early semantic errors accumulate irreversibly. To address this, as shown in Fig. 4, we reformulate generation as a navigable decision-making process rather than a fixed probabilistic trajectory. Crucially, we adopt a training-free strategy. Instead of fine-tuning, which risks catastrophic forgetting of the backbone's open-world knowledge, we leverage test-time computation to explore the latent space. This approach unlocks the pre-trained model's inherent capability to follow complex spatial instructions without altering model parameters.

Prompt Optimization. The process begins with a VLM Prompt Optimizer. Given a user prompt $y_{\text{raw}}$, a VLM rewrites it into a comprehensive instruction $y_{\text{refined}}$ with defined logic constraints (e.g., object counts and precise colors). This ensures that constraints are explicitly defined before the denoising process begins. The agent's prompt is provided in Supplement A.

Latent Space Search Tree. We perform standard denoising until a critical decision point $t_{\text{split}}$, such as 60% of total steps. At this state $x_{t_{\text{split}}}$, the VLM acts as a supervisor to diagnose defects such as object count errors, color mismatches, or spatial deviations. Based on this diagnosis, we construct a search tree with action space $\mathcal{A} = \{a_{\text{base}}, a_{\text{steer}}, a_{\text{explore}}\}$. Specifically, the three branches are as follows (a minimal sketch of this decision step is given after the list):

– Baseline Branch ($a_{\text{base}}$) continues the basic trajectory without intervention.
– Corrective Branch ($a_{\text{steer}}$) activates the Agentic Flow Steering module (see Sec. 3.4) to rectify specific semantic defects.
– Exploration Branch ($a_{\text{explore}}$) introduces stochastic perturbations to escape potential local semantic optima. By injecting controlled Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ into the latent state $x_{t_{\text{split}}}$, this branch forces the ODE solver to diverge from the current deterministic path, allowing the model to re-sample alternative global layouts or object poses while preserving the broad context.
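The sketch below illustrates one PRS decision step under stated assumptions: `simulate`, `decode_preview`, `vlm_score`, and `afs_steer` are hypothetical callables standing in for the short-horizon ODE continuation, the linear preview projection of Sec. 3.4, the critic of Fig. 5, and the AFS module, respectively. None of these names come from the paper's code.

```python
import torch

def parallel_rollout_search(x_split, t_split, prompt,
                            simulate, decode_preview, vlm_score, afs_steer,
                            sigma=0.1, horizon=5):
    """One PRS decision: spawn Base / Corrective / Exploration branches at
    t_split, simulate each for a few steps, and keep the branch the VLM
    critic rewards most."""
    branches = {
        "base": x_split,                                    # a_base: no intervention
        "corrective": afs_steer(x_split, t_split, prompt),  # a_steer: AFS correction
        "explore": x_split + sigma * torch.randn_like(x_split),  # a_explore: noise kick
    }
    scores = {}
    for name, x in branches.items():
        x_sim = simulate(x, t_split, horizon)         # short-horizon lookahead
        scores[name] = vlm_score(decode_preview(x_sim), prompt)
    best = max(scores, key=scores.get)                # VLM-guided reward selection
    return branches[best], best, scores
```

Because the three branches are independent, they can be simulated in parallel; the extra cost is test-time compute rather than any change to the backbone.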
Simulation and Selection. For each branch, we perform a short-horizon simulation, such as 5 steps. A VLM critic evaluates the resulting previews, assigning a reward via the scoring mechanism illustrated in Fig. 5, which covers the core requirements of T2I generation. The optimal trajectory is selected to continue generation.

Fig. 5: VLM's scoring mechanism. A VLM-driven evaluation framework (ranging from -10 to +10) that balances prompt adherence (50%), relational logic (30%), and visual integrity (20%). It employs a granular penalty-bonus system to enforce semantic precision and reward aesthetic quality. The full prompt is in Supplement A.

3.4 Agentic Flow Steering (AFS)

Unlike passive sampling in standard ODE solvers, our AFS functions as an active optimal controller. Instead of shifting the latent states, we formulate the generative intervention as an energy-minimization problem over the velocity field. This is achieved through three steps: Linear Trajectory Projection, Contrastive Energy Formulation, and Time-Scaled Velocity Modulation. The whole pipeline is demonstrated in Fig. 6, and the detailed analysis is as follows.

Fig. 6: Illustration of the AFS pipeline. Given an intermediate preview x̂₀, the VLM diagnoses the defect and generates a spatial mask M via SAM3. We formulate a contrastive energy function over CLIP embeddings of positive and negative prompts and project its gradient back to the velocity field $v_t$. This time-scaled gradient modulation steers the trajectory toward the target concept while strictly confining the intervention within the masked region. (The figure's example, for "a blue backpack and a brown cow", shows a diagnosis with segmentation keyword "blue backpack", a detailed target object and positive/negative concepts, and target bbox [0.25, 0.4, 0.5, 0.75]; the corrected output scores 9.0.)

Linear Trajectory Projection. A fundamental challenge in guiding Flow Matching models is that intermediate latents $z_t$ are noisy and lack decodable semantics. However, Rectified Flow architectures (e.g., FLUX.1) are optimized to follow near-linear optimal transport trajectories mapping a prior noise distribution to the data distribution. Formally, given data $z_0$ and Gaussian noise $z_1 \sim \mathcal{N}(0, I)$, the forward probability path is constructed via linear interpolation:

$$z_t = t z_1 + (1 - t) z_0, \quad t \in [0, 1], \quad (2)$$

where $z_0$ is the clean latent representation. The corresponding ground-truth vector field driving this flow is the time derivative of the path:

$$u_t(z_t) = \frac{dz_t}{dt} = z_1 - z_0. \quad (3)$$

By substituting Eq. (3) into Eq. (2), $z_t$ can be rewritten as $z_t = t u_t + z_0$. Leveraging this approximately constant-velocity property, we can project the current noisy state back to the data manifold in latent space. Given the predicted velocity $v_t \approx u_t$, the estimated latent at any step $t$ is $\hat{z}_0 = z_t - t \cdot v_t$, and the corresponding preview image is $\hat{x}_0 = \text{Decoder}(\hat{z}_0)$. This deterministic projection allows the agent to peer into the "future" of the ODE trajectory, evaluating noisy latents directly on the image manifold.

Contrastive Semantic Energy Formulation. To operationalize the VLM's diagnosis, we define a contrastive energy function $E(\hat{z}_0)$ by passing the decoded image $\hat{x}_0$ through CLIP. We compute the energy as:

$$E(\hat{z}_0) = \cos(\text{CLIP}(\hat{x}_0), e_{\text{neg}}) - \cos(\text{CLIP}(\hat{x}_0), e_{\text{pos}}), \quad (4)$$

where $e_{\text{neg}}$ and $e_{\text{pos}}$ are the text embeddings of the defective and target concepts, respectively. Minimizing this energy actively repels the trajectory from the flawed local optimum while pulling it toward the correct semantic basin.

Time-Scaled Velocity Modulation. To steer the generation, we must map the energy gradient $\nabla_{\hat{z}_0} E$ back to the ODE vector field $v_t$. Using the chain rule, the energy gradient with respect to the velocity field is:

$$\nabla_{v_t} E = \frac{\partial \hat{z}_0}{\partial v_t} \nabla_{\hat{z}_0} E = -t \cdot \nabla_{\hat{z}_0} E. \quad (5)$$

Note that $\nabla_{\hat{z}_0} E$ implicitly incorporates the Jacobian of the Decoder through backpropagation: $\nabla_{\hat{z}_0} E = J_{\text{Dec}}^\top \nabla_{\hat{x}_0} E$. Therefore, our gradient-descent update on the velocity field is confined by the spatial mask $M$ provided by SAM3 [1]:

$$v_t^{\text{corrected}} = v_t - \eta \cdot \nabla_{v_t} E \odot M = v_t + \eta t \cdot \nabla_{\hat{z}_0} E \odot M. \quad (6)$$

This derivation unveils a key theoretical property of AFS: the correction applied to the velocity field is directly proportional to the latent-space energy gradient, scaled by a time-decaying factor $\eta t$. When $t$ is large (early stages), the guidance is strong, facilitating aggressive semantic correction. As $t \to 0$ (late stages), the term $\eta t$ naturally vanishes, ensuring that the intervention does not introduce high-frequency artifacts or disrupt fine-grained texture synthesis as the flow converges to the data manifold. Notably, if a wrong mask is provided by SAM3, the affected branch is simply ignored at the selection step.
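Putting Eqs. (2)-(6) together, the PyTorch-style sketch below shows one AFS correction step under stated assumptions: `decoder` and `clip_image` are hypothetical differentiable stand-ins for the VAE decoder and CLIP image encoder, `e_pos` and `e_neg` are precomputed CLIP text embeddings of the target and defective concepts, and `mask` is the SAM3 region $M$.

```python
import torch
import torch.nn.functional as F

def afs_velocity_correction(z_t, v_t, t, decoder, clip_image,
                            e_pos, e_neg, mask, eta=1.0):
    """One AFS step (a sketch, not the paper's released code).

    z_t: noisy latent at time t; v_t: predicted velocity; mask: SAM3 mask M.
    decoder and clip_image must be differentiable so gradients flow back
    through the preview to the velocity field.
    """
    v = v_t.detach().requires_grad_(True)
    z0_hat = z_t.detach() - t * v          # linear trajectory projection
    x0_hat = decoder(z0_hat)               # decodable "future" preview
    img_emb = clip_image(x0_hat)
    # Contrastive semantic energy, Eq. (4): attract e_pos, repel e_neg.
    energy = (F.cosine_similarity(img_emb, e_neg, dim=-1)
              - F.cosine_similarity(img_emb, e_pos, dim=-1)).mean()
    (grad_v,) = torch.autograd.grad(energy, v)
    # Masked descent on the velocity field, Eq. (6).
    return v_t - eta * grad_v * mask
```

Because autograd differentiates through $\hat{z}_0 = z_t - t v_t$, the returned gradient already carries the $-t$ factor of Eq. (5), so the plain descent step above realizes the time-scaled update of Eq. (6): strong guidance early, vanishing intervention as $t \to 0$.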
3.5 Global Feedback Loop

Finally, we implement a global safety mechanism. Upon completion, if the final image score falls below a quality threshold, a Redesign Loop is triggered. The VLM analyzes the failure mode, refines the prompt to address the specific issues, and restarts the Parallel Rollout Search process with a new random seed, ensuring high reliability for complex queries.

4 Experiment

4.1 Experimental Setup

Base Model Settings. In our experiments, we employ FLUX.1-dev as the base text-to-image model. FLUX.1-dev is a 12B-parameter rectified-flow transformer capable of generating high-quality images from text descriptions. All images are generated at a resolution of 1024 × 1024 pixels. For the VLM supervisor, we utilize Qwen-VL-MAX for AFS-Search-Pro and Qwen2.5-VL-7B for AFS-Search-Fast; the supervisor serves as the "brain" of the agent for prompt refinement, image diagnosis, and scoring. It is worth noting that, unless otherwise specified, AFS-Search below refers to AFS-Search-Pro.

Table 1: Quantitative comparison on T2I-CompBench. In the original table, red indicates the best performance, blue the second best; "-" indicates closed-source. Attribute Binding covers Color, Shape, and Texture; Object Relationship covers Spatial and Non-Spatial.

| Method | Color ↑ | Shape ↑ | Texture ↑ | Spatial ↑ | Non-Spatial ↑ | Complex ↑ | Average ↑ | Time (s) ↓ |
|---|---|---|---|---|---|---|---|---|
| General T2I Models | | | | | | | | |
| DALL-E 2 | 0.5750 | 0.5464 | 0.6374 | 0.1283 | 0.3043 | 0.3696 | 0.4268 | - |
| SDXL | 0.6369 | 0.5408 | 0.5637 | 0.2032 | 0.3179 | 0.4091 | 0.4453 | 4.6 |
| PixArt-α | 0.6886 | 0.5582 | 0.7044 | 0.2082 | 0.3179 | 0.4117 | 0.4815 | 6.0 |
| FLUX | 0.7736 | 0.5112 | 0.6325 | 0.2747 | 0.3077 | 0.3622 | 0.4770 | 11.7 |
| Qwen-Image | 0.7835 | 0.5401 | 0.6816 | 0.3647 | 0.3109 | 0.3530 | 0.5056 | 32.2 |
| SDv3.5 | 0.7717 | 0.6050 | 0.7250 | 0.2286 | 0.3176 | 0.3729 | 0.5035 | 42.8 |
| Agentic Frameworks | | | | | | | | |
| ConPreDiff | 0.7019 | 0.5637 | 0.7021 | 0.2362 | 0.3195 | 0.4184 | 0.4903 | - |
| RPG | 0.6406 | 0.4903 | 0.5597 | 0.2714 | 0.3047 | 0.3128 | 0.4299 | 104.2 |
| EvoGen | 0.7104 | 0.5457 | 0.7234 | 0.2176 | 0.3308 | 0.4252 | 0.4922 | 125.3 |
| T2I-R1 | 0.8130 | 0.5852 | 0.7243 | 0.3378 | 0.3090 | 0.3993 | 0.5281 | 83.2 |
| MCCD | 0.6278 | 0.4832 | 0.5647 | 0.2350 | 0.3132 | 0.3348 | 0.4265 | 132.2 |
| AgentComp | 0.8743 | 0.6681 | 0.8142 | 0.4748 | 0.3196 | 0.4261 | 0.5962 | - |
| AFS-Search-Fast (Ours) | 0.8132 | 0.6121 | 0.7607 | 0.5416 | 0.4832 | 0.5208 | 0.6219 | 32.5 |
| AFS-Search-Pro (Ours) | 0.8847 | 0.6292 | 0.7609 | 0.6250 | 0.5305 | 0.6185 | 0.6748 | 62.3 |

The system is configured with a multi-stage exploration strategy (collected into an illustrative configuration after this list):
(1) Prompt Refinement: the VLM first optimizes the raw user prompt.
(2) Initial Generation: standard sampling is performed for the first 40% of the diffusion process (from t = 1.0 to t = 0.6).
(3) Parallel Rollout Search-based Branching: at the decision point (t = 0.6), the VLM diagnoses the intermediate latent and proposes multiple execution branches, including standard continuation and corrective steering.
(4) Simulation and Selection: each branch is simulated for a short horizon (3 or 5 steps), and the branch with the highest reward is selected for completion.
(5) Global Retry: a failure recovery mechanism is triggered if the confidence score falls below a threshold (7.5/10), prompting a redesign of the instruction and a restart of the generation process (up to 1 to 2 retries).
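For reference, the reported settings collect into a small configuration; the dict below is an illustrative summary with our own key names, not an official API.

```python
# Illustrative configuration mirroring the reported experimental settings.
AFS_SEARCH_CONFIG = {
    "base_model": "FLUX.1-dev",        # 12B rectified-flow transformer
    "resolution": (1024, 1024),
    "vlm_pro": "Qwen-VL-MAX",          # supervisor for AFS-Search-Pro
    "vlm_fast": "Qwen2.5-VL-7B",       # supervisor for AFS-Search-Fast
    "t_split": 0.6,                    # branch after the first 40% of steps
    "simulation_horizon": (3, 5),      # short-horizon lookahead steps
    "score_threshold": 7.5,            # out of 10; below this, retry
    "max_retries": 2,                  # global redesign loop budget (1-2)
}
```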
Benchmark and Baseline Models. To evaluate the effectiveness of our proposed AFS-Search, we first conduct experiments on T2I-CompBench [15], a comprehensive suite designed to evaluate the compositional capabilities of T2I models across various attribute dimensions, including color binding, shape consistency, and spatial relationships. We select general T2I models such as DALL-E 2 [28], SDXL [23], PixArt-α [3], FLUX.1-dev [17], Qwen-Image [37], and SDv3.5 [7], as well as agentic frameworks such as ConPreDiff [43], RPG [44], EvoGen [11], T2I-R1 [16], MCCD [19], and AgentComp [46] as our baselines. We further conduct experiments on GenEval [10], an object-focused framework for evaluating compositional image properties such as object co-occurrence, position, count, and color, and compare with FLUX.1-dev. Furthermore, we apply the recent benchmark R2I-Bench [4], a comprehensive benchmark designed to assess the reasoning capabilities of T2I generation models. We test our model on five main dimensions (Causal, Logical, Commonsense, Compositional, and Mathematical) scored with ChatGPT-4o, against baseline models including Lumina-Image 2.0 [24], Sana-1.5 [40], Lumina-T2I [9], Omnigen [39], EMU3 [36], Janus-Pro-7B [38], LlamaGen [35], Show-o [41], and FLUX.1-dev [17].

Fig. 7: Visual comparison on T2I-CompBench. We conducted tests along three dimensions (attribute binding, object relationship, and complex prompts) and obtained good results in all, demonstrating the validity of our method. (Example prompts include "A mouse near a bowl", "A blue banana and a green vase", "A diamond brooch and a teardrop bracelet", "A cat is lazily lounging in a sunny windowsill", "The brown dog was lying on the green mat", and a fiery sunset scene, each comparing the baseline with AFS-Search (ours).)

4.2 Main Results

The quantitative results on T2I-CompBench are provided in Table 1, and the visual results in Fig. 7. Our AFS-Search-Pro achieves state-of-the-art performance, improving the average score by 7.86% and showing effective gains on Object Relationship and Complex tasks, demonstrating the success of our training-free agentic search framework. Additionally, we provide AFS-Search-Fast for quicker inference, which still performs strongly, improving the average by 2.57%. Although our model's inference speed is slower than that of FLUX.1-dev, it surpasses the other agentic frameworks and remains suitable for real-world applications.
Additionally, as shown in Fig. 8, our AFS-Search performs well across GenEval, further validating the effectiveness and superiority of our framework. Moreover, we apply R2I-Bench; the visual and quantitative results are provided in Fig. 9 and Table 2. Our AFS-Search significantly improves reasoning capabilities thanks to the VLM, and achieves state-of-the-art results on the average metric, demonstrating powerful generation ability.

Fig. 8: Results on GenEval. We conducted tests on objects, counting, color, position, and overall, and obtained good results in all, proving the validity of our method. (Example prompts: "A photo of a clock", "A photo of a tv and a cell phone", comparing the baseline with AFS-Search.)

4.3 Ablation Study

In this section, we conduct ablation experiments to evaluate the contribution of each core component of AFS-Search and to investigate the impact of hyperparameter configurations on generation performance.

Effectiveness of Core Components. We evaluate the performance of our framework by systematically removing key modules:
(1) FLUX: the base T2I model without any agentic intervention.
(2) w/o Optimization: disabling the VLM-based prompt refinement and using raw user prompts directly.
(3) w/o Self-Correction: disabling the AFS mechanism.
(4) w/o Parallel Rollout Search: disabling the multi-branch search, effectively performing greedy generation with a single path.
(5) w/o Repair: disabling the retry loop.

Fig. 9: Visual comparison on R2I-Bench. We further explore the reasoning capabilities of our method; our agentic framework helps the original FLUX.1-dev think and react across the whole pipeline. (Example prompts include a low-quality smartphone dropped onto concrete, a bookshelf with a vase and a lamp placed a short distance from the wall, a person casting a sharp midday shadow, "If 2 × 2 = 5, then generate an image of 2 × 2 birds", a bicycle with a helmet and a water bottle, and a mountain waterfall defying gravity.)
Table 2: Quantitative comparison on the R2I-Bench benchmark. Bold in the original indicates the best performance.

| Method | Causal | Logical | Commonsense | Compositional | Mathematical | Average |
|---|---|---|---|---|---|---|
| Lumina-Image 2.0 | 0.40 | 0.56 | 0.49 | 0.65 | 0.13 | 0.45 |
| Sana-1.5 | 0.21 | 0.49 | 0.49 | 0.67 | 0.13 | 0.40 |
| Lumina-T2I | 0.18 | 0.38 | 0.38 | 0.49 | 0.13 | 0.31 |
| Omnigen | 0.34 | 0.51 | 0.43 | 0.60 | 0.18 | 0.41 |
| EMU3 | 0.41 | 0.61 | 0.44 | 0.62 | 0.09 | 0.43 |
| Janus-Pro-7B | 0.36 | 0.46 | 0.45 | 0.64 | 0.07 | 0.40 |
| LlamaGen | 0.12 | 0.35 | 0.38 | 0.49 | 0.07 | 0.28 |
| Show-o | 0.30 | 0.57 | 0.42 | 0.56 | 0.12 | 0.39 |
| FLUX.1-dev | 0.35 | 0.37 | 0.39 | 0.48 | 0.05 | 0.33 |
| AFS-Search (ours) | 0.45 | 0.58 | 0.51 | 0.66 | 0.20 | 0.48 |

Table 3: Ablation study on T2I-CompBench verifying the effectiveness of each component, showing that our core components improve the performance of FLUX.1-dev.

| Method | Color ↑ | Shape ↑ | Texture ↑ | Spatial ↑ | Non-Spatial ↑ | Complex ↑ | Average ↑ |
|---|---|---|---|---|---|---|---|
| AFS-Search (ours) | 0.8847 | 0.6292 | 0.7609 | 0.6250 | 0.5305 | 0.6185 | 0.6748 |
| FLUX | 0.7736 | 0.5112 | 0.6325 | 0.2747 | 0.3077 | 0.3622 | 0.4770 |
| w/o Optimization | 0.8251 | 0.5614 | 0.6843 | 0.3654 | 0.3521 | 0.5011 | 0.5482 |
| w/o Parallel Rollout Search | 0.7932 | 0.5421 | 0.6587 | 0.3315 | 0.3359 | 0.4132 | 0.5124 |
| w/o Self-Correction | 0.8012 | 0.5631 | 0.6496 | 0.4031 | 0.3468 | 0.4463 | 0.5350 |
| w/o Repair | 0.7831 | 0.5358 | 0.6932 | 0.4321 | 0.4012 | 0.4321 | 0.5463 |

As shown in Table 3, each component contributes significantly to the final performance. Parallel Rollout Search and AFS are the most critical, particularly for complex spatial and shape-related prompts, where they provide a gain of approximately 15-20% over the base model. The Global Repair mechanism further improves reliability by recovering from initial failures through prompt redesign. By employing a VLM, this design gives the existing T2I framework a degree of understanding while remaining training-free, and provides a closed-loop T2I generation paradigm.

Analysis of Search Strategy. We further investigate two key parameters governing the Parallel Rollout Search: the Search Timing Strategy (when to branch) and the Simulation Horizon (how deep to simulate each branch).

Search Timing Strategy ($t_{\text{split}}$). We compare branching at different stages of the diffusion process: early (t = 0.8), mid (t = 0.6), late (t = 0.4), and a Multi-Stage adaptive strategy. As shown in Fig. 10, early branching is more effective for Spatial and Shape, as the global layout is determined early in the denoising process. In contrast, Color and Texture benefit more from later intervention (t = 0.4), where fine-grained attributes are finalized. Our Multi-Stage strategy achieves the best balance by allowing flexible intervention across categories.

Fig. 10: Ablation study of the search strategy. (a) Search Timing: searching at multiple stages is more effective than searching at any single time point. (b) Simulation Steps: looking ahead with more simulation steps is essential for selecting the best path and avoiding errors.

Simulation Horizon (Steps). We vary the number of simulation steps performed for each branch before selection: Greedy (0 steps), Standard (3-5 steps), and Deep (15 steps). The results indicate a clear positive correlation between simulation depth and performance, demonstrating that longer lookahead simulations provide more reliable reward signals for the VLM to select the truly optimal path, albeit at the cost of increased inference time.

5 Conclusion

In this paper, we propose a training-free closed-loop framework featuring Agentic Flow Steering and Parallel Rollout Search based on FLUX.1-dev. Our core insight is that a generative model should iteratively assess and adjust its generation process rather than producing outputs in a one-shot manner. The framework first optimizes prompts, then explores multiple branches during real-time inference search, and finally uses VLM scoring to select among them, forming a closed-loop agent framework. Our framework has been tested on three different benchmarks against many baseline models, achieving state-of-the-art performance. Additionally, we provide AFS-Search-Pro and AFS-Search-Fast to trade off performance against computational speed.
Our ablation experiments also highlight the necessity and practicality of the core components of our framework, providing a training-free, closed-loop, thinking-capable T2I generation framework paradigm.

References

1. Carion, N., Gustafson, L., Hu, Y., et al.: SAM 3: Segment anything with concepts. CoRR abs/2511.16719 (2025). https://doi.org/10.48550/ARXIV.2511.16719
2. Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-Excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. 42(4), 148:1-148:10 (2023). https://doi.org/10.1145/3592116
3. Chen, J., Yu, J., Ge, C., et al.: PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In: ICLR 2024. https://openreview.net/forum?id=eAKmQPe3m1
4. Chen, K., Lin, Z., Xu, Z., et al.: R2I-Bench: Benchmarking reasoning-driven text-to-image generation. In: EMNLP 2025, pp. 12595-12630. ACL (2025). https://doi.org/10.18653/V1/2025.EMNLP-MAIN.636
5. Cho, J., Zala, A., Bansal, M.: Visual programming for step-by-step text-to-image generation and evaluation. In: NeurIPS 2023. https://papers.nips.cc/paper_files/paper/2023/hash/13250eb13871b3c2c0a0667b54bad165-Abstract-Conference.html
6. DeepSeek-AI: DeepSeek-V3 technical report. CoRR abs/2412.19437 (2024). https://doi.org/10.48550/ARXIV.2412.19437
7. Esser, P., Kulal, S., Blattmann, A., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: ICML 2024, PMLR 235, pp. 12606-12633. https://proceedings.mlr.press/v235/esser24a.html
8. Feng, W., Zhu, W., Fu, T., et al.: LayoutGPT: Compositional visual planning and generation with large language models. In: NeurIPS 2023. https://papers.nips.cc/paper_files/paper/2023/hash/3a7f9e485845dac27423375c934cb4db-Abstract-Conference.html
9. Gao, P., Zhuo, L., Liu, D., et al.: Lumina-T2X: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. CoRR abs/2405.05945 (2024). https://doi.org/10.48550/ARXIV.2405.05945
10. Ghosh, D., Hajishirzi, H., Schmidt, L.: GenEval: An object-focused framework for evaluating text-to-image alignment. In: NeurIPS 2023 Datasets and Benchmarks Track (2023)
11. Han, X., Jin, L., Liu, X., Liang, P.P.: Progressive compositionality in text-to-image generative models. In: ICLR 2025. https://openreview.net/forum?id=S85P4xjFD
12. Hao, Y., Chi, Z., Dong, L., Wei, F.: Optimizing prompts for text-to-image generation. In: NeurIPS 2023
13. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-Prompt image editing with cross-attention control. In: ICLR 2023. https://openreview.net/forum?id=_CDixzkzeyb
14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: NeurIPS 2020. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html
15. Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. In: NeurIPS 36, pp. 78723-78747 (2023)
16. Jiang, D., Guo, Z., Zhang, R., et al.: T2I-R1: Reinforcing image generation with collaborative semantic-level and token-level CoT. CoRR abs/2505.00703 (2025). https://doi.org/10.48550/ARXIV.2505.00703
17. Labs, B.F., Batifol, S., Blattmann, A., et al.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space (2025). https://arxiv.org/abs/2506.15742
18. Lai, Z., Zhu, X., Dai, J., Qiao, Y., Wang, W.: Mini-DALLE3: Interactive text to image by prompting large language models. CoRR abs/2310.07653 (2023). https://doi.org/10.48550/ARXIV.2310.07653
19. Li, M., Hou, X., Liu, Z., et al.: MCCD: Multi-agent collaboration-based compositional diffusion for complex text-to-image generation. In: CVPR 2025, pp. 13263-13272. https://doi.org/10.1109/CVPR52734.2025.01238
20. Li, T., Sun, Q., Fan, L., He, K.: Fractal generative models. Trans. Mach. Learn. Res. (2025). https://openreview.net/forum?id=Qk9kn6lOlW
21. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR 2023. https://openreview.net/forum?id=PqvMRDCJT9t
22. OpenAI: GPT-4 technical report. CoRR abs/2303.08774 (2023). https://doi.org/10.48550/ARXIV.2303.08774
23. Podell, D., English, Z., Lacey, K., et al.: SDXL: Improving latent diffusion models for high-resolution image synthesis. In: ICLR 2024
24. Qin, Q., Zhuo, L., Xin, Y., et al.: Lumina-Image 2.0: A unified and efficient image generative framework. CoRR abs/2503.21758 (2025). https://doi.org/10.48550/ARXIV.2503.21758
25. Qu, L., Li, H., Wang, W., et al.: SILMM: Self-improving large multimodal models for compositional text-to-image generation. In: CVPR 2025, pp. 18497-18508. https://doi.org/10.1109/CVPR52734.2025.01724
26. Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: ICML 2021, PMLR 139, pp. 8748-8763. http://proceedings.mlr.press/v139/radford21a.html
27. Raffel, C., Shazeer, N., Roberts, A., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1-140:67 (2020). https://jmlr.org/papers/v21/20-074.html
28. Ramesh, A., Pavlov, M., Goh, G., et al.: Zero-shot text-to-image generation. In: ICML 2021, PMLR 139, pp. 8821-8831. http://proceedings.mlr.press/v139/ramesh21a.html
29. Reed, S.E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML 2016, JMLR W&CP 48, pp. 1060-1069. http://proceedings.mlr.press/v48/reed16.html
30. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR 2022, pp. 10674-10685. https://doi.org/10.1109/CVPR52688.2022.01042
31. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR 2023, pp. 22500-22510. https://doi.org/10.1109/CVPR52729.2023.02155
32. Saharia, C., Chan, W., Saxena, S., et al.: Photorealistic text-to-image diffusion models with deep language understanding. In: NeurIPS 2022. https://papers.nips.cc/paper_files/paper/2022/hash/ec795aeadae0b7d230fa35cbaf04c041-Abstract-Conference.html
33. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: ICLR 2021. https://openreview.net/forum?id=St1giarCHLP
34. Song, Y., Long, Z., Lan, M., et al.: Semantic attention and LLM-based layout guidance for text-to-image generation. In: ICASSP 2025, pp. 1-5. https://doi.org/10.1109/ICASSP49660.2025.10890155
35. Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., Yuan, Z.: Autoregressive model beats diffusion: LLaMA for scalable image generation. CoRR abs/2406.06525 (2024). https://doi.org/10.48550/ARXIV.2406.06525
36. Wang, X., Zhang, X., Luo, Z., et al.: Emu3: Next-token prediction is all you need. CoRR abs/2409.18869 (2024). https://doi.org/10.48550/ARXIV.2409.18869
37. Wu, C., Li, J., Zhou, J., et al.: Qwen-Image technical report. CoRR abs/2508.02324 (2025). https://doi.org/10.48550/ARXIV.2508.02324
38. Wu, C., Chen, X., Wu, Z., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. In: CVPR 2025, pp. 12966-12977. https://doi.org/10.1109/CVPR52734.2025.01210
39. Xiao, S., Wang, Y., Zhou, J., et al.: OmniGen: Unified image generation. In: CVPR 2025, pp. 13294-13304. https://doi.org/10.1109/CVPR52734.2025.01241
40. Xie, E., Chen, J., Zhao, Y., et al.: SANA 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. In: ICML 2025, PMLR 267
41. Xie, J., Mao, W., Bai, Z., et al.: Show-o: One single transformer to unify multimodal understanding and generation. In: ICLR 2025. https://openreview.net/forum?id=o6Ynz6OIQ6
42. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: CVPR 2018, pp. 1316-1324. https://doi.org/10.1109/CVPR.2018.00143
43. Yang, L., Liu, J., Hong, S., et al.: Improving diffusion-based image synthesis with context prediction. In: NeurIPS 2023. https://papers.nips.cc/paper_files/paper/2023/hash/7664a7e946a84ac5e97649a967717cf2-Abstract-Conference.html
44. Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. In: ICML 2024, PMLR 235, pp. 56704-56721
45. Yang, Z., Wang, J., Li, L., Lin, K., Lin, C., Liu, Z., Wang, L.: Idea2Img: Iterative self-refinement with GPT-4V(ision) for automatic image design and generation. CoRR abs/2310.08541 (2023). https://doi.org/10.48550/ARXIV.2310.08541
46. Zarei, A., Pan, J., Gwilliam, M., Feizi, S., Yang, Z.: AgentComp: From agentic reasoning to compositional mastery in text-to-image models. CoRR abs/2512.09081 (2025). https://doi.org/10.48550/ARXIV.2512.09081
47. Zhang, H., Xu, T., Li, H.: StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV 2017, pp. 5908-5916. https://doi.org/10.1109/ICCV.2017.629
48. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV 2023, pp. 3813-3824. https://doi.org/10.1109/ICCV51070.2023.00355
A Full Prompt Templates and Agentic Pipeline

This section provides the verbatim system prompts used by the Vision-Language Model (VLM) in our agentic generation pipeline, covering the structural optimization, surgical diagnosis, analytical scoring, and debugging-oriented redesign phases.

A.1 Prompts

System Prompt 1: Structural Prompt Refinement (Optimizer)

Role/Strategy: You are a master FLUX.1 prompt engineer. Transform the user's intent into a "Structural Prompt" that minimizes model ambiguity.

Detailed Strategy Instructions:
1. Subject Specification: Define the core subject with high-quality adjectives (e.g., "photorealistic", "matte finish").
2. Spatial Layout: Use explicit coordinates or clear prepositional phrases (e.g., "In the foreground center...", "On the far left...").
3. Technical Parameters: Mention lighting (e.g., "cinematic lighting", "rim light"), camera angle (e.g., "top-down view"), and atmosphere.
4. Entity Separation: If there are multiple items, use distinct descriptions for each to prevent concept bleeding.

Strict Constraint: DO NOT change core colors, shapes, or quantities. If the user asks for a "red square", do not output a "red circle".

Output Format: Output ONLY the refined prompt text. No explanations.

System Prompt 2: Surgical Visual Diagnosis Agent (Supervisor)

Role: You are an expert visual generation supervisor. Your task is to perform a surgical analysis of the current image against the User Prompt.

Diagnosis Categories:
– 1. Presence & Count: Verify if every object requested exists and the quantity is EXACT (e.g., "3 apples" must not be 2 or 4).
– 2. Attribute Fidelity: Check colors (hues, saturation), textures, and materials (e.g., "translucent blue glass" must not be "opaque cyan plastic").
– 3. Relational Geometry: Analyze spatial prepositions: 'above', 'inside', 'to the left of', 'perfectly centered', 'tangent to'.
– 4. Structural Integrity: For diagrams/grids, check if lines connect correctly and regions are logically partitioned.

Critical Failures (Automatic High Priority):
– Concept Bleeding: Color of object A leaking into object B.
– Spatial Swapping: Object A is on the right when it should be on the left.
– Missing/Extra Entities: Any deviation in the count of primary subjects.

Instructions for JSON Fields:
– "segmentation_keyword": Must be the MOST UNIQUE noun phrase for the defect. If the prompt has "a red cat and a blue dog" and the dog is red, use "dog".
– "positive_concept": A clear, descriptive instruction for correction (e.g., "A vibrant cobalt blue dog with fur texture").
– "negative_concept": A precise description of the error to be removed (e.g., "red-colored dog, reddish fur").
– "target_bbox": Precise [ymin, xmin, ymax, xmax] (0.0-1.0).

Output Format (JSON ONLY):
{"needs_correction": boolean, "segmentation_keyword": "string", "target_object": "string", "positive_concept": "string", "negative_concept": "string", "target_bbox": [float, float, float, float]}
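The Supervisor's JSON contract above can be mirrored as a typed structure on the caller's side. The sketch below is our own illustrative wrapper (field names taken from the prompt; the dataclass and validation logic are not part of the paper's released code).

```python
import json
from dataclasses import dataclass

@dataclass
class Diagnosis:
    """Mirrors the Supervisor's JSON output schema (System Prompt 2)."""
    needs_correction: bool
    segmentation_keyword: str
    target_object: str
    positive_concept: str
    negative_concept: str
    target_bbox: list  # [ymin, xmin, ymax, xmax], normalized to [0, 1]

def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse and sanity-check the Supervisor's reply before handing the
    keyword/bbox to SAM3 and the concepts to the CLIP energy."""
    d = Diagnosis(**json.loads(raw))
    assert len(d.target_bbox) == 4
    assert all(0.0 <= v <= 1.0 for v in d.target_bbox)
    return d
```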
System Prompt 3: Analytical Multi-Dimensional Scoring (Critic)

Role: You are a highly analytical AI art critic and visual QA specialist. Evaluate the image against the instruction using the following strict Scoring Rubric. Range: -10.0 (Total failure) to +10.0 (Perfect adherence).

Scoring Categories:
1. Prompt Adherence (0 to +5.0): Are all requested subjects present? Is the color, shape, and count correct?
2. Relational Logic (0 to +3.0): Are spatial relationships (left, right, above, below) correctly executed? Is the interaction between objects realistic/as specified?
3. Visual Integrity (0 to +2.0): Is the image high-quality? (e.g., sharp, no distorted anatomy, no weird artifacts).

Strict Deductions (Mandatory):
- Missing primary subject: -5.0 per object.
- Wrong color/attribute for a subject: -3.0.
- Incorrect count: -4.0.
- Severe concept bleeding (colors mixing inappropriately): -2.5.
- Incorrect spatial placement (e.g., "blue cube on left" is on right): -3.0.
- Visible AI artifacts (extra fingers, blurry blobs): -2.0 to -5.0.

Bonus Points:
- Exceptional lighting/aesthetic: +1.0.
- Perfect spatial symmetry (if implied): +1.0.

Output Format (JSON ONLY):
{
  "score": float,
  "reason": "Detailed point-by-point breakdown of the score"
}
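The critic's reply is likewise machine-consumed: each branch's preview image is scored and the number becomes that branch's reward. A minimal sketch of extracting the reward from the {"score", "reason"} JSON, assuming the same raw-string reply convention (critic_score is a hypothetical helper; the clamping range follows the rubric above):

import json

def critic_score(vlm_reply: str, floor: float = -10.0, ceil: float = 10.0) -> float:
    """Extract and clamp the critic's score; treat malformed replies as failures.

    The rubric itself (deductions, bonuses) is applied by the VLM, so the host
    code only validates the JSON and clamps the value into [-10.0, +10.0].
    """
    try:
        data = json.loads(vlm_reply)
        score = float(data["score"])
    except (KeyError, TypeError, ValueError):
        return floor  # an unparseable reply is scored as a total failure
    return max(floor, min(ceil, score))

Scoring unparseable feedback at the floor means a branch never wins on the strength of a broken critic reply, which keeps the selection step conservative.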
System Prompt 4: Debugging-Oriented Prompt Redesign (Recovery)

Role/Mission: You are a debugging expert for text-to-image pipelines. The previous generation failed.

Variables:
- User Intent: original_prompt
- VLM Error Analysis: failure_reason

Your Task:
1. Identify the "Confusion Point": Why did the model fail? (e.g., too many subjects, conflicting adjectives).
2. Simplify & Isolate: Break down complex requests into simpler, more direct instructions.
3. Reinforce Weak Points: If the failure was spatial, use stronger layout cues. If attributes failed, use repetitive reinforcement (e.g., "The cube is blue. It is a blue cube.").
4. Style Anchor: Add a consistent style tag to help stabilize the generation.

Output Format: Output ONLY the new refined prompt text. No preamble.

Fig. 11: Additional Visual Results. The prompts used for the additional qualitative examples are:
- A breathtaking epic fantasy landscape. A giant, translucent jellyfish made of stardust and galaxies is floating majestically in the night sky above a glowing bioluminescent forest. The long, glowing tentacles of the jellyfish reach down and gently touch the tops of the glowing blue trees. A tiny silhouette of a traveler stands on a cliff in the foreground, looking up in awe.
- A wide cinematic shot of a futuristic cyberpunk street in Tokyo during a heavy rainstorm. In the center of the frame, a large, vibrant pink and cyan neon sign hanging from a dark building clearly reads "AGENTIC" in a bold, futuristic font. The wet pavement perfectly reflects the neon glow and the silhouettes of people walking with transparent umbrellas.
- A cinematic close-up of a steampunk mechanical owl perched on a stack of old leather-bound books. The owl is made of polished brass and copper gears, with glowing amber light emanating from its intricate eye sockets. A small, glowing blue crystal shard rests delicately near one of its talons. The background is a blurred Victorian library with warm candlelight.
- A colossal celestial dragon with scales forged from hammered gold and cooling obsidian, coiled tightly around a jagged volcanic peak. Its eyes are twin suns glowing with blinding white intensity. Molten lava cascades down the mountain like waterfalls of fire. In the background, a blood-red solar eclipse hangs in a dark, ashen sky. Hyper-realistic textures, cinematic wide-angle shot, volumetric smoke, embers swirling in the air, epic fantasy masterpiece.
- An ancient dark sovereign seated on a gargantuan throne carved from the monolithic bones of forgotten gods. The throne room is a decaying gothic cathedral with vaulted ceilings lost in shadow. Eerie emerald moonlight streams through shattered stained-glass windows, illuminating swirling spectral mist on the floor. The sovereign wears a crown of jagged black ice. Dark surrealism, intricate bone textures, ominous atmosphere, Rembrandt lighting, highly detailed oil painting style.
- A futuristic cyborg samurai standing in the center of a rain-drenched Neo-Tokyo street. He wears intricate matte-black carbon-fiber armor with glowing crimson internal circuitry. He is unsheathing a translucent plasma katana that crackles with blue electricity. The surrounding skyscrapers are covered in massive, flickering holographic koi fish and neon kanji signs. Ray-traced reflections in street puddles, cinematic bokeh, cyberpunk aesthetic, high-octane atmosphere, sharp focus on the blade.
- A majestic golden Chinese dragon soaring in the starry night sky. The dragon is tightly clutching a glowing red paper lantern in its left claw. Below the dragon are the curved roofs of a traditional Chinese pavilion covered in thick white snow.
- A delicate, translucent green jade carving of a koi fish. The jade fish is placed inside an open, intricately carved dark wooden box. The box sits on top of a vibrant blue silk fabric with golden cloud embroidery.
- An ancient Chinese general standing proudly on a misty battlefield. He is wearing heavy black iron armor with golden lion-shaped shoulder guards and a flowing red silk cape. In his left hand, he holds a silver helmet decorated with a blue feather. In his right hand, he firmly grasps a long silver spear with a red tassel tied near the blade. Behind the general, to his right, stands a massive white warhorse wearing leather armor.

A.2 AFS Pipeline

The whole AFS pipeline is summarized in Algorithm 1.

Algorithm 1: AFS-Search Pipeline
Require: user prompt y, flow model ε_θ, VLM V, steps N, split time t_split
 1: y' ← OptimizePrompt(y, V)
 2: x_T ~ N(0, I)
 3: x_split ← ODE(x_T, y', T → t_split)          ▷ Phase 1: Initial Denoise
 4: x̂_0 ← Lookahead(x_split, v_split)
 5: D ← V(x̂_0, y')                               ▷ Phase 2: Diagnosis
 6: if D indicates an error then
 7:     B ← {Base, Steer(D), Explore}
 8:     for b ∈ B do
 9:         x_sim ← Simulate(x_split, b, Δt)      ▷ Phase 3: Simulation
10:         s_b ← V(Decode(x_sim))
11:     end for
12:     b* ← argmax_b s_b                         ▷ Selection
13:     x_0 ← ODE(x_sim^{b*}, y', t_sim → 0)
14: else
15:     x_0 ← ODE(x_split, y', t_split → 0)
16: end if
17: if Score(x_0) < τ then
18:     goto Step 1 with refined y'               ▷ Global Redesign Loop
19: end if
20: return x_0
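To make the control flow concrete, here is a hedged Python sketch of Algorithm 1. Every helper (optimize_prompt, ode_integrate, lookahead, simulate_branch, decode) and the VLM wrapper methods (diagnose, score, redesign, last_reason) are hypothetical stand-ins for the components named above, not real library calls or the authors' code:

import torch

def afs_search(y, flow_model, vlm, t_split=0.7, dt=0.1, tau=5.0, max_rounds=3):
    """Sketch of Algorithm 1: refine, split-denoise, diagnose, branch, select."""
    x0 = None
    for _ in range(max_rounds):                        # global redesign loop (steps 17-19)
        y_ref = optimize_prompt(y, vlm)                # step 1 (System Prompt 1)
        x_t = torch.randn(flow_model.latent_shape)     # step 2: x_T ~ N(0, I)
        x_split, v_split = ode_integrate(flow_model, x_t, y_ref, 1.0, t_split)  # step 3
        preview = decode(lookahead(x_split, v_split, t_split))   # step 4: one-shot x̂_0
        diagnosis = vlm.diagnose(preview, y_ref)       # step 5 (System Prompt 2)
        if diagnosis and diagnosis.needs_correction:   # steps 6-13
            branches = ["base", ("steer", diagnosis), "explore"]
            t_sim = t_split - dt                       # short-horizon simulation time
            sims = [simulate_branch(flow_model, x_split, b, y_ref, dt) for b in branches]
            scores = [vlm.score(decode(s), y_ref) for s in sims]  # System Prompt 3
            best = sims[scores.index(max(scores))]     # step 12: argmax reward
            x0, _ = ode_integrate(flow_model, best, y_ref, t_sim, 0.0)   # step 13
        else:
            x0, _ = ode_integrate(flow_model, x_split, y_ref, t_split, 0.0)  # step 15
        if vlm.score(decode(x0), y_ref) >= tau:        # step 17: accept
            return x0
        y = vlm.redesign(y, vlm.last_reason)           # System Prompt 4: recovery
    return x0

Note how the "base" branch is always in the candidate set: even when a correction is attempted, the unsteered continuation can still win the argmax, which is what makes a faulty intervention recoverable.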
B Text-to-Mask Pipeline and Reliability

Our framework is inherently robust to mask errors. If SAM3 generates an incorrect mask, the subsequent AFS gradient will distort the image. Crucially, our VLM critic evaluates the short-horizon previews of all branches: an image corrupted by a faulty mask receives a low reward score and is naturally discarded in favor of the Baseline or Exploration branch.

B.1 Self-Correction and Pruning Mechanism

Even if the Segment Anything Model (SAM3) generates a noisy or incorrect mask due to an imprecise bounding box from the VLM, the system remains robust. As shown in the pipeline:
1. Branch Simulation: When a Corrective branch is initialized with a noisy mask, the subsequent AFS gradient will likely distort the image, creating visible artifacts or failing to improve the target object.
2. Analytical Evaluation: The VLM Critic (System Prompt 3) performs a short-horizon evaluation. Any image corrupted by a faulty mask triggers a heavy deduction under the Visual Integrity category (-2.0 to -5.0 for AI artifacts).
3. Natural Selection: In the selection phase, the Parallel Rollout logic compares the scores of all active branches. A branch with a failed intervention yields a significantly lower reward than the Baseline (Continue) or Exploration branches.
4. Pruning: The system naturally discards the corrupted branch and resumes generation from the most stable path, effectively pruning the failed intervention and preventing error propagation.

B.2 Formal Logic for Mask Error Recovery

The selection criterion is $b^* = \arg\max_b R(b)$, where $R(b)$ is the reward score from the VLM Critic. If $R(\text{Corrective}) < R(\text{Baseline})$ due to a noisy mask, the system reverts to the baseline, ensuring that a failed fix never degrades the final output.

C Deep Dive into Agentic Flow Steering (AFS)

In this section, we provide a more rigorous mathematical treatment of the Agentic Flow Steering (AFS) module and analyze PRS behavior within the Rectified Flow framework.

C.1 Mathematical Derivation of the Velocity Gradient

As established in Section 3.4, the core of AFS is the mapping of semantic energy $E$ from the image manifold back to the velocity field $v_t$ of the ODE. The total gradient flow can be decomposed via the chain rule as

$\nabla_{v_t} E = \underbrace{\frac{\partial E}{\partial \hat{x}_0}}_{\text{Semantic Loss}} \cdot \underbrace{\frac{\partial \hat{x}_0}{\partial \hat{z}_0}}_{\text{Decoder Jacobian}} \cdot \underbrace{\frac{\partial \hat{z}_0}{\partial v_t}}_{\text{Trajectory Projection}}$,    (7)

where:
- Semantic Loss Gradient ($\nabla_{\hat{x}_0} E$): computed by backpropagating the CLIP-based contrastive loss through the vision-language encoder. It represents the "pixel-wise direction" for semantic correction.
- Decoder Jacobian ($J_{\mathrm{Dec}}^{\top}$): since the VLM operates on decoded images $\hat{x}_0$, the gradient must pass through the VAE decoder. We utilize the adjoint method to efficiently compute $\nabla_{\hat{z}_0} E = J_{\mathrm{Dec}}^{\top} \nabla_{\hat{x}_0} E$.
- Trajectory Projection Gradient: from the linear projection $\hat{z}_0 = z_t - t\,v_t$, we derive the sensitivity of the future state to the current velocity: $\partial \hat{z}_0 / \partial v_t = -t\,I$.

Substituting these into Eq. (7), we obtain the final steering update:

$v_t^{\mathrm{corrected}} = v_t + \eta\, t \,\big(J_{\mathrm{Dec}}^{\top} \nabla_{\hat{x}_0} E\big) \odot M$,    (8)

where $M$ denotes the spatial mask restricting the correction to the diagnosed region (cf. the text-to-mask pipeline in Appendix B).
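A minimal PyTorch-style sketch of the Eq. (8) update, under stated assumptions: decoder is a differentiable VAE decoder, semantic_energy is a scalar CLIP-style loss to be minimized (e.g., negative similarity to the positive concept), and the default eta mirrors the η = 200 setting recommended in the sensitivity study below. This is an illustration, not the authors' code:

import torch

def steer_velocity(v_t, z_t, t, mask, decoder, semantic_energy, eta=200.0):
    """Apply the AFS steering update of Eq. (8).

    v_t:  current velocity field (latent-shaped tensor)
    z_t:  current latent state
    t:    current flow time in (0, 1]
    mask: spatial mask M at latent resolution (e.g., derived from SAM3)
    """
    # Lookahead projection: ẑ_0 = z_t − t·v_t (detached so only E is differentiated)
    z0_hat = (z_t - t * v_t).detach().requires_grad_(True)
    energy = semantic_energy(decoder(z0_hat))   # E(x̂_0) with x̂_0 = Dec(ẑ_0)
    energy.backward()                           # adjoint pass: ∇_ẑ0 E = J^T_Dec ∇_x̂0 E
    grad_z0 = z0_hat.grad
    # Eq. (8): v_corrected = v_t + η·t·(J^T_Dec ∇_x̂0 E) ⊙ M
    return v_t + eta * t * grad_z0 * mask

The single backward pass through the decoder is exactly the adjoint computation described above: the decoder Jacobian is never materialized, only its transpose-vector product.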
Fig. 12: Conceptual Illustration of Vector Field Warping. (Left) Standard flow: the trajectory follows a straight line but ends in a region with incorrect attributes (e.g., wrong color). (Right) AFS-steered flow: the velocity field is locally warped by the semantic gradient, bending the trajectory toward the target basin while maintaining the overall linear transport properties of the flow.

C.2 Geometric Interpretation: Vector Field Warping

The AFS update does not merely jump between latent states. Instead, it performs Vector Field Warping: by modifying $v_t$, we are essentially changing the momentum of the generation. This soft intervention is significantly more stable than hard latent modifications, as it preserves the cumulative ODE integration history while steering future evolution.

C.3 Extended Hyperparameter Sensitivity Analysis

To further investigate the robustness and controllable generation capabilities of AFS-Search, we conducted extensive ablation studies on three core hyperparameters: step size (η), guidance scale (σ), and search width (W). All experiments were evaluated with Qwen-VL-Max on the CompBench subset (complex instructions) with single-round intervention.

C.4 Quantitative Results

The impact of these hyperparameters on Success Rate (%) across different semantic dimensions is summarized in Fig. 13.

Fig. 13: Extra experiments on hyperparameters. Success rates are computed by Qwen-VL-Max using the scoring rubric of System Prompt 3.

- Optimal Momentum (η): The step size η governs the magnitude of gradient-based intervention in the latent space. We observe that η = 200 serves as the "sweet spot." A smaller η (50) fails to provide sufficient momentum to move the latents out of local semantic minima within the limited denoising steps, while an excessive η (400) introduces instability into the diffusion ODE, occasionally leading to over-correction.
- Guidance Fidelity Trade-off (σ): The scale parameter σ balances the VLM-guided gradient signal against the original diffusion prior. At σ = 0.1, the agent effectively corrects attribute errors without compromising image realism. At higher values (σ ≥ 0.2), while the agent remains semantically focused, we observe a slight decline in Spatial scores, suggesting that aggressive guidance can disrupt the precise geometric layout.
- Search-Performance Pareto Frontier (W): Increasing the search width W yields consistent performance gains. By branching the generation process, AFS-Search explores multiple denoising trajectories, significantly mitigating the risk of "hallucinatory" generation. However, this comes with a linear increase in computational cost. W = 3 provides an optimal balance, achieving a >10% improvement over the greedy baseline (W = 1) while maintaining reasonable inference latency (≈65 s).

C.5 Conclusion

The empirical results validate the robustness of AFS-Search. The agent demonstrates high sensitivity to the search width, confirming that its MCTS-inspired exploration strategy is key to solving complex T2I tasks. For optimal performance, we recommend the configuration η = 200, σ = 0.1, W = 3.

D Limitations and Future Works

While AFS-Search establishes a new closed-loop paradigm in the T2I field, several limitations remain to be addressed.

Slow Inference Speed. The agentic architecture is naturally slower at inference time than traditional generative models; most of the overhead comes from the VLM's reasoning time, the native model's inference time, and tool usage. Since our paradigm builds the entire framework into a self-evolving closed-loop system, future work can consider integrating LLMs or VLMs more deeply into the native T2I model. This would not only speed up inference but also give the T2I model a deeper semantic understanding capability.

Deeper Research in Latent Space. For AFS-Search, decoding intermediate images also consumes a considerable share of inference time. Future work should strengthen VLMs' understanding of the intermediate latent space of Flow Matching and Diffusion Models: if VLMs can interpret intermediate latents directly, they can guide the native models better while skipping the decoding step.

Deeper Research in Flow Steering. In general, flow steering on a native model is performed either through a search strategy in a training-free framework or through reinforcement learning. The search strategy is simple and effective, but it slows inference and slightly increases memory usage; the reinforcement-learning approach hinges on designing the reward function and quantitatively analyzing the latent space within the flow-matching generation framework, an area that remains sparsely studied.