Paper deep dive
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/26/2026, 2:35:15 AM
Summary
SAGE-GRPO (Stable Alignment via Exploration) is a reinforcement learning framework for video generation that addresses the stability-plasticity dilemma in Group Relative Policy Optimization (GRPO). It introduces a manifold-aware SDE with logarithmic curvature correction and a gradient norm equalizer at the micro-level, and a dual trust region with periodic moving anchors at the macro-level to prevent off-manifold drift and ensure reliable reward-guided updates.
Entities (5)
Relation Signals (3)
SAGE-GRPO → evaluatedon → HunyuanVideo1.5
confidence 100% · We evaluate SAGE-GRPO on HunyuanVideo1.5
SAGE-GRPO → improves → GRPO
confidence 95% · SAGE-GRPO addresses the stability-plasticity dilemma in Group Relative Policy Optimization (GRPO).
SAGE-GRPO → uses → VideoAlign
confidence 95% · We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model
Cypher Suggestions (2)
Identify models evaluated using SAGE-GRPO. · confidence 95% · unvalidated
MATCH (a:Algorithm {name: 'SAGE-GRPO'})-[:EVALUATED_ON]->(m:Model) RETURN m.name
Find all algorithms that improve upon GRPO for video generation. · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:IMPROVES]->(g:Algorithm {name: 'GRPO'}) RETURN a.name
Abstract
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at this https URL.
Tags
Links
- Source: https://arxiv.org/abs/2603.21872v1
- Canonical: https://arxiv.org/abs/2603.21872v1
Full Text
64,840 characters extracted from source content.
Manifold-Aware Exploration for Reinforcement Learning in Video Generation
Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang
Abstract
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available here.
Machine Learning, ICML
Figure 1: Illustration of SAGE-GRPO.
(Left) (a.1) At a higher noise region, Euler-style discretization introduces a purple region of extra energy (discretization error) beyond the true integral; we focus on the true integral region below, not this extra energy. (a.2) Our precise SDE removes unnecessary noise energy in high-noise regions, enabling more precise exploration and a better-learned data manifold. (Right) (b) Our method with improved exploration yields more stable and better-aligned generations compared with DanceGRPO (Xue et al., 2025), FlowGRPO (Liu et al., 2025b), and CPS (Wang and Yu, 2025). Figure 2: Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, ignoring signal decay curvature and causing off-manifold drift that results in temporal jitter and artifacts. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise is concentrated closer to the flow trajectory and the video manifold, reducing off-manifold drift. 1 Introduction Group Relative Policy Optimization (GRPO) is a direct way to align video generation models with reward signals (Ho et al., 2020; Song et al., 2020b, a; Ma et al., 2025; Kong et al., 2024; Wu et al., 2025; Wan et al., 2025; Gao et al., 2025), but it has not yet been as reliable for video as it is for language models and images (Guo et al., 2025; Shao et al., 2024; Achiam et al., 2023; Shen et al., 2025). In GRPO training for video generation, we must draw a group of rollouts by converting the deterministic ODE sampler into an SDE sampler so that the policy can explore through diverse samples (Li et al., 2025a). Video generation has a large, structured solution space, so this exploration is easily disturbed. 
Current video GRPO baselines such as DanceGRPO and FlowGRPO rely on an Euler-style discretization and first-order approximations when deriving the SDE noise standard deviation (as shown in Table 1) (Black et al., 2023; Liu et al., 2025b; Xue et al., 2025). The resulting first-order truncation error can inject excess noise energy during sampling (shown in Figure 1(a.1)), which lowers rollout quality in high-noise steps and makes reward evaluation less reliable. This raises the following question: how can we obtain an accurate sampling path that improves rollout quality and stabilizes GRPO for video generation?
Table 1: Comparison of SDE noise injection strategies used in video GRPO.

| Method | Standard deviation $\Sigma_t^{1/2}$ |
|---|---|
| DanceGRPO | $\eta\sqrt{\sigma_t-\sigma_{t+1}}$ |
| FlowGRPO | $\eta\sqrt{\frac{\sigma_t}{1-\sigma_t}(\sigma_t-\sigma_{t+1})}$ |
| Ours (Precise) | $\eta\sqrt{-(\sigma_t-\sigma_{t+1})+\log\frac{1-\sigma_{t+1}}{1-\sigma_t}}$ |

Flow-matching video generators parameterized by $\theta$ induce trajectories that are constrained by a pre-trained video generation model (Liu et al., 2022; Lipman et al., 2022; Wang et al., 2024). We treat this model as defining a valid data manifold $\mathcal{M}\subset\mathbb{R}^D$. Because the pre-trained parameters $\theta_0$ are not yet sufficient for the target reward, GRPO must update $\theta$ through exploration while keeping trajectories within the vicinity of $\mathcal{M}$ so that rollouts remain valid. As shown in Figure 2, FlowGRPO-style SDE exploration can overestimate the noise variance (red), push $z_t$ away from $\mathcal{M}$, and produce temporal jitter. We therefore define the core problem of GRPO for video generation as how to constrain exploration within the vicinity of the data manifold so that each update improves rollouts while keeping reward evaluation reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which organizes exploration at both micro and macro levels around the manifold. At the micro level, we refine the discrete SDE and couple it with a gradient norm equalizer as part of micro-scale exploration.
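The three noise schedules in Table 1 can be compared numerically. The sketch below is our own illustrative code, not from the paper; `eta` and the sigma values are arbitrary. It evaluates each standard-deviation formula and illustrates that the first-order rule over-injects noise relative to the exact integral:

```python
import math

def std_dancegrpo(s_t, s_next, eta=1.0):
    # Constant-coefficient rule: Var ~ eta^2 * (sigma_t - sigma_{t+1})
    return eta * math.sqrt(s_t - s_next)

def std_flowgrpo(s_t, s_next, eta=1.0):
    # First-order rule: Var ~ eta^2 * sigma_t/(1 - sigma_t) * (sigma_t - sigma_{t+1})
    return eta * math.sqrt(s_t / (1.0 - s_t) * (s_t - s_next))

def std_precise(s_t, s_next, eta=1.0):
    # Exact integral of eta^2 * sigma/(1 - sigma) over [sigma_{t+1}, sigma_t] (Table 1, last row)
    return eta * math.sqrt(-(s_t - s_next) + math.log((1.0 - s_next) / (1.0 - s_t)))

# The integrand sigma/(1 - sigma) is increasing, and the first-order rule evaluates it
# at the upper endpoint sigma_t, so it always over-injects noise relative to the integral:
comparison = [(s, std_flowgrpo(s, s - 0.1), std_precise(s, s - 0.1)) for s in (0.9, 0.5, 0.2)]
```

The gap is largest at high-noise steps (large sigma), matching the paper's claim that excess energy is injected early in sampling.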
Concretely, instead of using an area-based first-order variance approximation, we compute the noise variance by integrating diffusion coefficients over each step and add a logarithmic correction $\log\frac{1-\sigma_{t+\Delta t}}{1-\sigma_t}$, which yields a more accurate variance for ODE-to-SDE exploration. As in Figure 1(a.1), this corresponds to integrating only the effective energy under the curve rather than the extra discretization area, and Figure 1(a.2) shows that the resulting precise SDE uses smaller variance while staying closer to the underlying video manifold. Even with this corrected SDE, the diffusion process still has an inherent signal-to-noise imbalance across timesteps: gradients vanish at high noise ($t\to 1$) and explode at low noise ($t\to 0$), which biases learning toward certain phases. The Gradient Norm Equalizer normalizes optimization pressure across timesteps so that updates remain comparable in magnitude, which makes micro-level exploration more precise and stable. With precise micro-level exploration, the policy tends to move closer to the data manifold after every N update steps; periodically updating a reference model from this trajectory therefore creates a trust region centered at a more manifold-consistent policy. This reduces long-horizon drift and helps avoid off-manifold local optima, as suggested by the red region in Figure 2. Traditional fixed KL constraints $D_{KL}(\pi_\theta\,\|\,\pi_0)$ anchor the policy to the initial model $\pi_0$, but as training progresses the optimal policy $\pi^*$ may be far from $\pi_0$, which causes underfitting. Step-wise KL constraints $D_{KL}(\pi_\theta\,\|\,\pi_{k-1})$ limit the magnitude of parameter updates per step (velocity control), ensuring smooth local transitions, but they only constrain the instantaneous update direction $\nabla_\theta$ and do not bound the cumulative displacement $\|\theta_k-\theta_0\|$ from the initial parameters.
This allows unbounded drift: even if each step is small, the policy can move slowly but consistently away from the manifold over many steps, eventually leading to degradation or reward hacking. To counteract drift while preserving plasticity, we introduce a Periodical Moving Anchor that updates the reference policy $\pi_{ref}$ every N steps, creating a dynamic trust region that repeatedly recenters exploration near a manifold-consistent policy. We combine the moving anchor with step-wise constraints into a Dual Trust Region objective that provides position control towards the manifold and velocity control between successive policies, forming a position-velocity controller that enables sustained plasticity. We evaluate SAGE-GRPO on HunyuanVideo1.5 (Wu et al., 2025) using the original VideoAlign evaluator (Liu et al., 2025c) (no reward-model fine-tuning) and observe consistent gains over baselines such as DanceGRPO (Xue et al., 2025), FlowGRPO (Liu et al., 2025b), and CPS (Wang and Yu, 2025) in both overall reward and temporal fidelity. Extensive ablations confirm that both the micro-level design (precise manifold-aware SDE with temporal gradient equalization) and the macro-level Dual Trust Region objective are necessary to reduce the stability–plasticity gap. Our main contributions are as follows:
• We formulate GRPO for video generation as a manifold-constrained exploration problem and show that the ODE-to-SDE conversions used in existing methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
• At the micro-level, we constrain exploration with a Precise Manifold-Aware SDE and a Gradient Norm Equalizer, so that sampling noise stays manifold-consistent and updates are balanced across timesteps.
• At the macro-level, we constrain long-horizon exploration with a Dual Trust Region with moving anchors and step-wise constraints, so that the trust region tracks more manifold-consistent checkpoints and prevents drift. 2 Related Work Reinforcement Learning for Diffusion and Flow Matching Models. Reinforcement learning has been adapted to fine-tune diffusion and flow matching models (Liu et al., 2025b; Xue et al., 2025; Xu et al., 2023; Jiang et al., 2025; Wallace et al., 2024; Xu et al., 2025; Lan et al., 2025; Jin et al., 2025; Lin et al., 2025a, b, c; Zhang et al., 2026) for alignment with human preferences. Early approaches such as DDPO (Black et al., 2023) and DPOK (Fan et al., 2023) treated the denoising process as a Markov Decision Process to enable policy gradient estimation. Inspired by GRPO in language models (Shao et al., 2024; Guo et al., 2025), FlowGRPO (Liu et al., 2025b) and DanceGRPO (Xue et al., 2025) adapted GRPO to visual generation via ODE-to-SDE conversion for stochastic exploration (Li et al., 2025a). However, existing methods rely on first-order noise approximations that can drive exploration off the data manifold and overlook the inherent gradient imbalance across timesteps. (a) DanceGRPO (b) FlowGRPO (c) CPS (d) Ours Figure 3: Temporal gradient balancing ablation across SDE formulations. Overall VideoAlign reward curves comparing runs with and without the Gradient Norm Equalizer. Without balancing, low-noise timesteps dominate optimization, leading to unstable or plateaued rewards. With balancing, reward curves become smoother with consistent improvement, and gradient scale variation is reduced from more than one order of magnitude to within a small constant factor. Preference Alignment for Video Generation. Aligning video generation models with human preferences is an active research area (Zheng et al., 2024; Long et al., 2025; Huang et al., 2024; Lu et al., 2025; He et al., 2025). 
Building on video diffusion models (Wan et al., 2025; Kong et al., 2024; Gao et al., 2025), researchers have developed video reward models (Liu et al., 2025c; Xu et al., 2024; Mi et al., 2025; Zhang et al., 2025) and alignment algorithms (Li et al., 2024; Gambashidze et al., 2024; Yu et al., 2024; Zhou et al., 2025; Jia et al., 2025). DanceGRPO (Xue et al., 2025) extends image-based RL to video, while Self-paced GRPO (Li et al., 2025b) proposes curriculum learning that dynamically adjusts reward weights. However, current alignment frameworks face a stability-plasticity dilemma: strict constraints (e.g., fixed KL anchored to initialization) limit plasticity, while relaxed constraints trigger reward hacking or catastrophic forgetting (Liu et al., 2025a; Li et al., 2025c). Unlike existing approaches that rely on heuristic scheduling or static anchors, our method integrates manifold-aware dynamics with a dual trust region to resolve this tension.
Figure 4: Empirical gradient norm imbalance across noise levels. Observed norms (blue) decrease rapidly as σ increases and match the predicted relationship (red) $\|\nabla\log\pi\|\propto 1/\Sigma_t^{1/2}$, leading to vanishing gradients at high noise ($\sigma\to 1$) and exploding gradients at low noise ($\sigma\to 0$).
Figure 5: The SAGE-GRPO Framework. Our method resolves the stability-plasticity dilemma with three coupled components: (Left) a manifold-aware SDE that keeps exploration noise tangent to the video manifold, (Middle) a Temporal Gradient Equalizer that balances optimization across timesteps, and (Right) a Dual Trust Region that combines moving anchors and step-wise KL constraints for long-term stable alignment.
3 Methodology
We formulate the problem of video alignment as maximizing the expected reward $J(\theta)=\mathbb{E}_{x_0\sim\pi_\theta}[R(x_0)]$ within a Group Relative Policy Optimization (GRPO) framework.
However, a standard application of GRPO to video diffusion models faces specific challenges in maintaining stable and effective exploration on the video manifold. SAGE-GRPO addresses these challenges by designing a unified exploration strategy that operates from micro-level noise injection to macro-level policy constraints, so that every exploration step remains valid and balanced across the diffusion process.
3.1 Preliminaries: Flow Matching and Group Relative Policy Optimization
Flow Matching and Rectified Flow. Flow Matching models generation as transport along a probability path $p_t(x)$ via an ordinary differential equation (ODE):
$\frac{dx_t}{dt}=v_\theta(x_t,t)$, (1)
where $v_\theta$ is a neural velocity field. Rectified Flow uses the linear interpolation path:
$x_t=(1-\sigma_t)x_0+\sigma_t z_1$, (2)
which implies the velocity field:
$v_\theta(x_t,t)=\frac{dx_t}{dt}=-\frac{d\sigma_t}{dt}(x_0-z_1)=\frac{1}{1-\sigma_t}(x_t-x_0)$. (3)
Group Relative Policy Optimization (GRPO). Given a prompt c, GRPO samples a group of G rollouts and optimizes the diffusion policy $\pi_\theta$ using a group-normalized advantage:
$\mathcal{L}_{GRPO}(\theta)=-\frac{1}{G}\sum_{i=1}^{G}A_i\cdot\sum_{t=1}^{T}\log\pi_\theta(x_{t-1}^{(i)}\mid x_t^{(i)},c)$, (4)
where T is the number of diffusion steps. We defer the reward composition, advantage normalization, and the stochastic rollout formulation to Appendix A and only keep the key equations in the corresponding modules.
Figure 6: Qualitative comparison against baselines. Three prompts illustrate our core gains: (Top) Reduced temporal jitter while preserving accurate visual contents; (Middle) Enhanced alignment and photorealism under occlusion and lighting changes; (Bottom) Stronger semantic alignment with consistent prompt matching across frames.
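A minimal NumPy sketch of the group-relative objective in Eq. (4), with the group-normalized advantage of Eq. (10) folded in; the log-probabilities and rewards below are toy stand-ins (the actual method scores rollouts with VideoAlign):

```python
import numpy as np

def grpo_loss(logp, rewards, eps=1e-8):
    """Eq. (4) with the group-normalized advantage A_i of Eq. (10).
    logp: (G, T) per-step log pi_theta(x_{t-1} | x_t, c); rewards: (G,)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # A_i
    return float(-np.mean(adv * logp.sum(axis=1)))            # -1/G sum_i A_i sum_t log pi

rng = np.random.default_rng(0)
logp = rng.normal(size=(4, 8))             # G=4 rollouts, T=8 diffusion steps (toy values)
rewards = np.array([0.2, 0.5, -0.1, 0.4])  # stand-ins for composite reward scores
loss = grpo_loss(logp, rewards)
```

Because advantages are normalized within the group, a uniformly rewarded group contributes no gradient signal, which is the intended group-relative behavior.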
3.2 SAGE-GRPO Framework
3.2.1 Micro-Level Exploration: Precise SDE and Gradient Equalization
To enable stochastic exploration for GRPO, we perturb Rectified Flow with a marginal-preserving SDE whose noise stays aligned with the video manifold $\mathcal{M}\subset\mathbb{R}^D$ (Figure 2). The key challenge is computing the correct noise standard deviation $\Sigma_t^{1/2}$ during discrete SDE discretization. For a marginal-preserving SDE with diffusion coefficient $\varepsilon_t=\eta\sqrt{\sigma_t/(1-\sigma_t)}$, we integrate the variance over the interval $[\sigma_{t+1},\sigma_t]$:
$\Sigma_t=\int_{\sigma_{t+1}}^{\sigma_t}\varepsilon_s^2\,ds=\eta^2\left[-(\sigma_t-\sigma_{t+1})+\log\frac{1-\sigma_{t+1}}{1-\sigma_t}\right]$, (5)
where η is the exploration scaling factor. The logarithmic term accounts for the geometric contraction of the signal coefficient $(1-\sigma_t)$, which linear approximations fail to capture. Taking the square root yields the noise standard deviation:
$\Sigma_t^{1/2}=\eta\sqrt{-(\sigma_t-\sigma_{t+1})+\log\frac{1-\sigma_{t+1}}{1-\sigma_t}}$. (6)
Applying Euler–Maruyama discretization with timestep $\Delta t=\sigma_t-\sigma_{t+1}$:
$x_{t+\Delta t}=x_t+v_\theta(x_t,t)\,\Delta t+\frac{\Sigma_t}{2}s_\theta(x_t)+\Sigma_t^{1/2}\,\epsilon$, (7)
where $\epsilon\sim\mathcal{N}(0,I)$ injects stochasticity and $s_\theta(x_t)\approx-(x_t-\hat{x}_0)/\sigma_t^2$ is the score function estimate. Since $\Sigma_t$ is the variance already integrated over $[\sigma_{t+1},\sigma_t]$, the stochastic term uses $\Sigma_t^{1/2}$ directly, without an additional $\Delta t$ factor. The Itô correction term $\frac{\Sigma_t}{2}s_\theta(x_t)$ ensures consistency with Rectified Flow marginals; a detailed derivation is provided in Appendix A.1. As shown in Figure 2, our method creates a smaller, manifold-aligned exploration region (blue ellipsoid) that stays tangent to the flow trajectory, whereas conventional methods create larger, off-manifold exploration regions (red sphere) that cause state drift. This geometric insight ensures that every exploration step remains within the legal video space, preventing temporal artifacts.
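Under the assumption that the velocity prediction and a clean-video estimate are available as plain arrays (in practice both come from the network), one Euler–Maruyama step of Eq. (7) can be sketched as follows; this is illustrative code, not the authors' implementation:

```python
import numpy as np

def precise_sde_step(x_t, v_t, x0_hat, s_t, s_next, eta, rng):
    """One step of Eq. (7). v_t stands in for v_theta(x_t, t) and x0_hat for the
    clean-video estimate used in the score s_theta(x_t) ~ -(x_t - x0_hat)/sigma_t^2."""
    dt = s_t - s_next                                                          # Delta t
    var = eta ** 2 * (-(s_t - s_next) + np.log((1.0 - s_next) / (1.0 - s_t)))  # Sigma_t, Eq. (5)
    score = -(x_t - x0_hat) / s_t ** 2                                         # score estimate
    noise = rng.standard_normal(x_t.shape)
    # Drift + Ito correction + noise; Sigma_t already integrates over the step,
    # so the noise is scaled by Sigma_t^{1/2} with no extra Delta t factor.
    return x_t + v_t * dt + 0.5 * var * score + np.sqrt(var) * noise

rng = np.random.default_rng(0)
x_next = precise_sde_step(np.zeros((2, 3)), np.ones((2, 3)), np.zeros((2, 3)),
                          s_t=0.9, s_next=0.8, eta=0.7, rng=rng)
```

Setting `eta=0` recovers the deterministic ODE step, which is the sense in which the SDE is a controlled perturbation of the flow.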
Even with correct noise injection, the diffusion process has an inherent signal-to-noise imbalance across timesteps: gradient norms vary by orders of magnitude (Figure 4), following a variance–gradient inverse relationship. For a Gaussian transition $\pi(x_{t-1}\mid x_t)=\mathcal{N}(\mu_\theta,\Sigma_t I)$:
$\|\nabla_\mu\log\pi\|\propto\frac{1}{\Sigma_t^{1/2}}$, (8)
causing gradients to vanish at high noise ($t\to 1$) and explode at low noise ($t\to 0$), biasing learning toward certain phases. To counteract this imbalance, we estimate a per-timestep gradient scale $N_t$ from the SDE parameters (Appendix A.5) and apply a robust normalization:
$S_t=\frac{\mathrm{Median}(\{N_\tau\}_{\tau=1}^{T})}{N_t+\epsilon}$, (9)
where $\epsilon$ is a small constant. This equalization normalizes optimization pressure across timesteps so that structural and textural updates contribute equally; empirical validation is provided in Figure 3 and Appendix A.5.
GRPO With Composite Reward and Group-Normalized Advantage. We score each rollout $x_0$ by a composite reward $R(x_0)$ and compute the group-normalized advantage $A_i$:
$A_i=\frac{r_i-\mu_R}{\sigma_R+\epsilon}$, (10)
where $r_i=R(x_0^{(i)})$, $\mu_R=\frac{1}{G}\sum_{j=1}^{G}r_j$, and $\sigma_R^2=\frac{1}{G}\sum_{j=1}^{G}(r_j-\mu_R)^2$. Full definitions and implementation-aligned details are in Appendix A.4.
Table 2: Main Comparison on Video Generation Benchmarks. Comparison of SAGE-GRPO with baselines under two reward settings. The first row reports the original HunyuanVideo 1.5 performance. For each method, we report results without KL regularization (w/o KL) and with their Fixed KL constraints (w/ Fixed KL). For SAGE-GRPO, we demonstrate the w/ Dual Moving KL mechanism. Bold, underline, and gray indicate the best, second best, and third best results, computed across both settings (A+B).
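The equalizer of Eq. (9) can be sketched as below. Since the paper derives the gradient scale $N_t$ in Appendix A.5 (not reproduced here), we substitute the Eq. (8) proxy $N_t\propto 1/\Sigma_t^{1/2}$; the noise schedule is an arbitrary toy choice:

```python
import numpy as np

def step_variance(s_t, s_next, eta=1.0):
    # Per-step integrated variance Sigma_t of Eq. (5)
    return eta ** 2 * (-(s_t - s_next) + np.log((1.0 - s_next) / (1.0 - s_t)))

def equalizer_weights(sigmas, eps=1e-8):
    """S_t of Eq. (9), using ||grad log pi|| ~ 1/Sigma_t^{1/2} (Eq. (8)) as a
    stand-in for the per-timestep gradient scale N_t (an assumption here)."""
    sig = np.asarray(sigmas)
    n = 1.0 / np.sqrt(step_variance(sig[:-1], sig[1:]))  # proxy N_t
    return np.median(n) / (n + eps)                      # S_t

sigmas = np.linspace(0.95, 0.05, 11)   # toy decreasing noise schedule
weights = equalizer_weights(sigmas)    # down-weights low-noise steps, up-weights high-noise steps
```

Multiplying each per-timestep gradient by its weight pins the effective scale to the median, which is the "comparable magnitude across timesteps" property the equalizer targets.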
| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| HunyuanVideo 1.5 (Original) | – | 0.0654 | -0.7539 | -0.5870 | 1.4063 | 0.5409 | 0.7397 |

Setting A: Averaged Rewards ($w_{vq}=1.0$, $w_{mq}=1.0$, $w_{ta}=1.0$)

| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| DanceGRPO | w/o KL | 0.2768 | -0.7589 | -0.3852 | 1.4209 | 0.5386 | 0.7378 |
| DanceGRPO | w/ Fixed KL | 0.0979 | -0.8077 | -0.5091 | 1.4147 | 0.5403 | 0.7355 |
| FlowGRPO | w/o KL | 0.2733 | -0.7151 | -0.5286 | 1.5170 | 0.5443 | 0.7394 |
| FlowGRPO | w/ Fixed KL | 0.1880 | -0.6771 | -0.5912 | 1.4563 | 0.5431 | 0.7407 |
| CPS | w/o KL | 0.6343 | -0.4855 | -0.4021 | 1.5219 | 0.5479 | 0.7412 |
| CPS | w/ Fixed KL | 0.0928 | -0.7156 | -0.5825 | 1.3908 | 0.5479 | 0.7369 |
| SAGE-GRPO | w/o KL | 0.4859 | -0.6104 | -0.4141 | 1.5104 | 0.5423 | 0.7360 |
| SAGE-GRPO | w/ Fixed KL | 0.2244 | -0.7438 | -0.5320 | 1.5001 | 0.5446 | 0.7382 |
| SAGE-GRPO | w/ Dual Mov KL | 0.2173 | -0.7881 | -0.4249 | 1.4303 | 0.5430 | 0.7452 |

Setting B: Alignment-Focused ($w_{vq}=0.5$, $w_{mq}=0.5$, $w_{ta}=1.0$)

| Method | Configuration | Overall | VQ | MQ | TA | CLIPScore | PickScore |
|---|---|---|---|---|---|---|---|
| DanceGRPO | w/o KL | -0.2172 | -0.8854 | -0.6218 | 1.2901 | 0.5439 | 0.7352 |
| DanceGRPO | w/ Fixed KL | 0.1290 | -0.7739 | -0.5083 | 1.4112 | 0.5452 | 0.7276 |
| FlowGRPO | w/o KL | 0.4773 | -0.5671 | -0.4731 | 1.5175 | 0.5403 | 0.7349 |
| FlowGRPO | w/ Fixed KL | 0.2103 | -0.6654 | -0.5506 | 1.4263 | 0.5427 | 0.7408 |
| CPS | w/o KL | 0.3694 | -0.6650 | -0.5325 | 1.5669 | 0.5479 | 0.7311 |
| CPS | w/ Fixed KL | 0.3705 | -0.6121 | -0.4787 | 1.4613 | 0.5458 | 0.7364 |
| SAGE-GRPO | w/o KL | -0.1222 | -0.8720 | -0.6046 | 1.3544 | 0.5404 | 0.7357 |
| SAGE-GRPO | w/ Fixed KL | 0.2857 | -0.7062 | -0.4425 | 1.4344 | 0.5414 | 0.7377 |
| SAGE-GRPO | w/ Dual Mov KL | 0.8066 | -0.4765 | -0.2384 | 1.5216 | 0.5484 | 0.7420 |

(a) VQ reward (b) MQ reward (c) TA reward
Figure 7: KL weight ablation on VideoAlign rewards. Comparison of three KL weight schedules: fixed $10^{-5}$ (green), two-stage $10^{-7}\to 10^{-5}$ (red), and two-stage $10^{-7}\to 10^{-6}$ (yellow). The two-stage schedule $10^{-7}\to 10^{-5}$ achieves the strongest and most consistent gains across VQ, MQ, and TA, supporting gradually increasing $\lambda_{KL}$ to tighten the trust region (Appendix A.6).
3.2.2 Macro-Level Exploration: Dual Trust Region Optimization
With micro-level exploration stabilized, we aim to prevent the policy model from drifting away from the data manifold and getting stuck in off-manifold local optima (Figure 2). We frame KL divergence as a dynamic anchoring mechanism that constrains exploration towards the data manifold.
KL Divergence as Dynamic Anchor. For a Gaussian policy $\pi(x_{t-1}\mid x_t)=\mathcal{N}(\mu_\theta,\Sigma_t I)$, the KL divergence between the current policy $\pi_\theta$ and a reference policy $\pi_{ref}$ is:
$D_{KL}(\pi_\theta\,\|\,\pi_{ref})=\mathbb{E}_{x_t\sim\pi_\theta}\!\left[\frac{(\mu_\theta-\mu_{ref})^2}{2\Sigma_t^2}\right]\approx\frac{(\mu_\theta-\mu_{ref})^2}{2\Sigma_t^2}$, (11)
where $\mu_\theta$ and $\mu_{ref}$ are the mean predictions of the current and reference policies, respectively. KL divergence acts as a distance metric in policy space, anchoring the current policy to the reference. The choice of reference determines the constraint nature: a fixed reference creates a hard constraint, while a moving reference enables adaptive exploration.
Fixed KL: Hard Constraint Limiting Optimality. Traditional approaches use a fixed reference policy $\pi_{ref}=\pi_0$ from the pretrained video generation model. The constraint $D_{KL}(\pi_\theta\,\|\,\pi_0)$ forces the policy to remain close to the initial distribution. However, as training progresses, the optimal policy $\pi^*$ may be far from $\pi_0$, and forcing $D_{KL}(\pi_\theta\,\|\,\pi_0)$ to be small prevents reaching $\pi^*$, leading to underfitting; this is too restrictive for long-term optimization, where the policy needs to explore regions far from initialization.
Step-wise KL: Velocity Constraint. Step-wise KL uses the previous step's policy as reference: $\pi_{ref}=\pi_{k-1}$, where k denotes the optimization step. This constraint $D_{KL}(\pi_\theta\,\|\,\pi_{k-1})$ acts as a velocity limit, restricting the magnitude of parameter updates per step:
$\|\nabla_\theta D_{KL}(\pi_\theta\,\|\,\pi_{k-1})\|\propto\|\mu_\theta-\mu_{k-1}\|/\Sigma_t$, (12)
ensuring smooth local transitions.
However, velocity control alone only limits the magnitude of $\nabla_\theta$ (the update direction) but does not bound the cumulative displacement $\|\theta_k-\theta_0\|$ from the initial parameters. This allows unbounded drift: the policy can move slowly but consistently away from the manifold, eventually leading to degradation or reward hacking.
Periodical Moving KL: Position Control via Dynamic Trust Region. To counteract drift while maintaining plasticity, we introduce Periodical Moving KL, which uses a periodically updated reference policy $\pi_{ref}=\pi_{k-N}$, where N is the update interval. Every N optimization steps, we update the reference model $\pi_{ref}\leftarrow\pi_\theta$, creating a resetting anchor mechanism. This allows the model to perform local exploration within N steps, then establish the new position as a safe region:
$D_{KL}(\pi_\theta\,\|\,\pi_{ref\_N})=\frac{(\mu_\theta-\mu_{ref\_N})^2}{2\Sigma_t^2}$, (13)
where $\mu_{ref\_N}$ is the mean prediction from the reference model updated N steps ago. This creates a dynamic trust region that periodically resets the safe zone, similar to a multi-stage relaxed version of TRPO (Schulman et al., 2015), enabling the model to climb the reward landscape in stages (plasticity) while tethered to a valid distribution (stability).
Dual KL: Position-Velocity Controller. We combine these two mechanisms into a dual KL objective that provides both position and velocity control:
$\mathcal{L}_{KL}=\beta_{pos}\cdot D_{KL}(\pi_\theta\,\|\,\pi_{ref\_N})+\beta_{vel}\cdot D_{KL}(\pi_\theta\,\|\,\pi_{k-1})$, (14)
where $\beta_{pos}$ and $\beta_{vel}$ are weighting coefficients. The position term $D_{KL}(\pi_\theta\,\|\,\pi_{ref\_N})$ provides the primary directional anchor, preventing long-term drift by constraining the policy to remain within a reasonable distance from a recent valid distribution. The velocity term $D_{KL}(\pi_\theta\,\|\,\pi_{k-1})$ acts as a damping factor, smoothing instantaneous updates and preventing abrupt policy changes.
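The dual trust-region penalty and its periodic anchor refresh can be sketched as follows. This uses the standard equal-covariance Gaussian KL for the mean-difference terms; the flat arrays standing in for policy mean predictions, and the values of `beta_pos`, `beta_vel`, `N`, and `var`, are all illustrative assumptions:

```python
import numpy as np

def gaussian_kl(mu_a, mu_b, var):
    # Closed-form KL between N(mu_a, var*I) and N(mu_b, var*I): the mean-difference term of Eq. (11)
    return float(np.sum((mu_a - mu_b) ** 2) / (2.0 * var))

def dual_kl(mu_theta, mu_anchor, mu_prev, var, beta_pos=1.0, beta_vel=0.1):
    """Eq. (14): position term against the periodically refreshed anchor pi_ref_N,
    velocity term against the previous-step policy pi_{k-1}."""
    return (beta_pos * gaussian_kl(mu_theta, mu_anchor, var)
            + beta_vel * gaussian_kl(mu_theta, mu_prev, var))

# Periodic anchor refresh: every N optimization steps the trust region is re-centered.
N = 5
mu_anchor = np.zeros(4)
mu_prev = np.zeros(4)
for k in range(1, 11):
    mu_theta = mu_prev + 0.01                          # placeholder for an actual optimizer step
    penalty = dual_kl(mu_theta, mu_anchor, mu_prev, var=0.25)
    mu_prev = mu_theta
    if k % N == 0:
        mu_anchor = mu_theta.copy()                    # pi_ref <- pi_theta
```

The position term grows with drift accumulated since the last refresh, while the velocity term only sees the most recent step, matching the position-velocity controller reading of Eq. (14).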
In practice, we compute the step-wise KL using log-probability differences from the rollout phase:
$D_{KL}(\pi_\theta\,\|\,\pi_{k-1})\approx\mathbb{E}\left[\log\pi_{k-1}(x_{t-1}\mid x_t)-\log\pi_\theta(x_{t-1}\mid x_t)\right]$, (15)
where the expectation is taken over samples generated with the previous policy $\pi_{k-1}$. The full SAGE-GRPO objective that combines the GRPO policy loss, temporal equalization, and Dual KL regularization is provided in Appendix A.6.
4 Experiments
4.1 Experimental Setup
Implementation Details. We conduct all experiments on HunyuanVideo 1.5 (Kong et al., 2024) with per-GPU batch size 2 and 4 gradient accumulation steps (effective batch size 8). Each video contains 81 frames, and we apply GRPO updates every 20 sampling steps along the diffusion trajectory. Following (Liu et al., 2025b), we use VideoAlign (Liu et al., 2025c) as the reward oracle, evaluating Visual Quality (VQ), Motion Quality (MQ), and Text Alignment (TA), with overall reward $R=w_{vq}S_{vq}+w_{mq}S_{mq}+w_{ta}S_{ta}$. We compare SAGE-GRPO against DanceGRPO (Xue et al., 2025), FlowGRPO (Liu et al., 2025b), and CPS (Wang and Yu, 2025). The KL regularization weight is scheduled in $\lambda_{KL}\in[10^{-7},10^{-5}]$ according to Appendix A.6.
4.2 Main Results
We consider two reward configurations (Table 2): averaged ($w_{vq}=1.0$, $w_{mq}=1.0$, $w_{ta}=1.0$) and alignment-focused ($w_{vq}=0.5$, $w_{mq}=0.5$, $w_{ta}=1.0$). All rewards use the original VideoAlign model as a frozen evaluator (no reward-model fine-tuning), which ensures consistent evaluation across methods. Since current video GRPO baselines are implemented with substantial differences in engineering optimizations, directly reusing them would confound algorithmic effects with infrastructure choices. To obtain a fair comparison, we implement a unified training framework on HunyuanVideo1.5 with shared infrastructure across all methods and vary only the GRPO algorithm itself.
Under the averaged-reward setting that matches Longcat-Video (Team et al., 2025), adding KL regularization typically improves visual performance but yields worse reward behavior, which we attribute to reward hacking in the reward model as discussed in previous work (Li et al., 2025b). We compare previous methods and SAGE-GRPO under both averaged and alignment-focused rewards, and evaluate variants with and without KL regularization, as summarized in Table 2. We further study how placing more weight on semantic alignment can reduce reward hacking artifacts. In the alignment-focused setting (Setting B), SAGE-GRPO with Dual Moving KL achieves the best Overall, VQ, MQ, and CLIPScore while remaining close to the best TA, and overall Table 2 suggests that emphasizing alignment provides a more reliable optimization target and yields more stable gains in both reward and visual metrics.
4.3 Qualitative Analysis
We provide qualitative examples that complement the quantitative trends. Figure 6 highlights the improvement in coherence, photorealism, and semantic alignment over baselines, especially for prompts that require precise object interactions and long-range motion. Additional visual comparisons demonstrating superior alignment with emotional descriptions in text prompts are presented in Appendix Figure 10.
4.4 User Study
To corroborate our automatic metrics, we conducted a user preference study with 29 evaluators on 32 prompts, comparing SAGE-GRPO with baselines (all at iter 100, sampling step 40, Setting B) across Visual Quality, Motion Quality, and Semantic Alignment. Table 3 reports the pairwise win rates of SAGE-GRPO against each baseline.
Table 3: User Preference Study. Win rates of SAGE-GRPO against baselines. Results indicate a strong human preference for our method, especially in Motion Quality, confirming that automatic metrics align with perceptual quality.

| SAGE-GRPO vs. | Visual Quality | Motion Quality | Semantic Alignment |
|---|---|---|---|
| DanceGRPO | 85.9% | 75.8% | 79.2% |
| FlowGRPO | 83.8% | 79.2% | 71.9% |
| CPS | 80.2% | 70.8% | 67.9% |

4.5 Ablation Studies
4.5.1 Impact of Temporal Gradient Equalizer
To evaluate the effectiveness of the Temporal Gradient Equalizer in Section 3.2.1, we compare training dynamics with and without per-timestep balancing across three SDE formulations and CPS. Figure 3 shows the overall VideoAlign reward curves for baselines and our method.
4.5.2 KL Strategy Ablation
We next study the effect of different KL strategies introduced in Section 3.2.2. Figure 8 reports both the mean reward and standard deviation for four KL strategies, with qualitative comparisons in Appendix Figures 11 and 12.
(a) Mean reward (b) Std (exploration)
Figure 8: KL strategy ablation. (a) Dual Moving KL achieves the highest and most stable reward, supporting the position-velocity control interpretation (Equation (14)). (b) Moving KL attains high exploration in early training steps but the exploration level drops in later stages. Dual Moving KL maintains a higher and more stable exploration level throughout training.
Figure 8(a) shows that Dual Moving KL consistently outperforms other variants in both convergence speed and final reward while avoiding the collapse observed in aggressive step-wise updates. Figure 8(b) shows that Moving KL explores quickly initially but exploration falls off; Dual Moving KL maintains higher exploration stably, validating the position-velocity controller interpretation in Equation (14).
4.5.3 KL Weight Sensitivity
We compare three KL weight schedules: fixed $10^{-5}$, two-stage $10^{-7}\to 10^{-5}$, and milder $10^{-7}\to 10^{-6}$. Figure 7 shows that the two-stage schedule yields higher rewards and smoother trajectories across VQ, MQ, and TA, consistent with gradually increasing $\lambda_{KL}$ to tighten the trust region. Implementation details are in Appendix A.6.
5 Conclusion We presented SAGE-GRPO, a manifold-aware GRPO framework for stable reinforcement learning for video generation. The core challenge is to design exploration strategies that respect the manifold structure, where each exploration step stays within the vicinity of the manifold rather than drifting into high-noise regions. At the micro-level, we derive a Precise Manifold-Aware SDE that keeps exploration noise closer to the flow trajectory, and introduce a Gradient Norm Equalizer that normalizes optimization pressure across timesteps. At the macro-level, we propose a Dual Trust Region mechanism combining position and velocity control to reduce off-manifold local optima while enabling sustained plasticity. Experiments on HunyuanVideo1.5 with VideoAlign reward show consistent improvements over strong baselines and validate the contribution of each component through ablations. Impact Statement This paper presents a method for more stable reinforcement learning alignment of text-to-video generation models. By improving temporal consistency and text alignment under a fixed reward model, our work may strengthen creative tools, scientific communication, and educational content that rely on controllable video synthesis. At the same time, stronger video generation systems can exacerbate existing concerns about misinformation, deepfakes, biased or harmful content, and the computational cost of large-scale training and sampling. Our experiments are conducted in a research setting on an existing model and evaluator, and our user study involves 29 voluntary evaluators rating 32 prompts comparing SAGE-GRPO against baselines in terms of visual quality, motion quality, and semantic alignment; there is no collection of personal data, but any future deployment should include safeguards such as content moderation, dataset auditing, and human oversight to reduce these risks. References J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. 
Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: §1. K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine (2023) Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301. Cited by: §1, §2. Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee (2023) Dpok: reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems 36, p. 79858–79885. Cited by: §2. A. Gambashidze, A. Kulikov, Y. Sosnin, and I. Makarov (2024) Aligning diffusion models with noise-conditioned perception. arXiv preprint arXiv:2406.17636. Cited by: §2. Y. Gao, H. Guo, T. Hoang, W. Huang, L. Jiang, F. Kong, H. Li, J. Li, L. Li, X. Li, et al. (2025) Seedance 1.0: exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113. Cited by: §1, §2. D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: §1, §2. D. He, G. Feng, X. Ge, Y. Niu, Y. Zhang, B. Ma, G. Song, Y. Liu, and H. Li (2025) Neighbor grpo: contrastive ode policy optimization aligns flow models. arXiv preprint arXiv:2511.16955. Cited by: §2. J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in neural information processing systems 33, p. 6840–6851. Cited by: §1. T. Huang, G. Jiang, Y. Ze, and H. Xu (2024) Diffusion reward: learning rewards via conditional video diffusion. In European Conference on Computer Vision, p. 478–495. Cited by: §2. Z. Jia, Y. Nan, H. Zhao, and G. Liu (2025) Reward fine-tuning two-step diffusion models via learning differentiable latent-space surrogate reward. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 12912–12922. Cited by: §2. D. Jiang, D. Liu, Z. 
Wang, Q. Wu, L. Li, H. Li, X. Jin, D. Liu, Z. Li, B. Zhang, et al. (2025) Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649. Cited by: §2. D. Jin, R. Xu, J. Zeng, R. Lan, Y. Bai, L. Sun, and X. Chu (2025) Semantic context matters: improving conditioning for autoregressive models. arXiv preprint arXiv:2511.14063. Cited by: §2. W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, et al. (2024) Hunyuanvideo: a systematic framework for large video generative models. arXiv preprint arXiv:2412.03603. Cited by: §1, §2, §4.1. R. Lan, Y. Bai, X. Duan, M. Li, D. Jin, R. Xu, L. Sun, and X. Chu (2025) Flux-text: a simple and advanced diffusion transformer baseline for scene text editing. arXiv preprint arXiv:2505.03329. Cited by: §2. J. Li, W. Feng, W. Chen, and W. Y. Wang (2024) Reward guided latent consistency distillation. arXiv preprint arXiv:2403.11027. Cited by: §2. J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong (2025a) Mixgrpo: unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802. Cited by: §1, §2. R. Li, Y. Liang, Z. Ni, H. Huang, C. Zhang, and X. Li (2025b) Growing with the generator: self-paced grpo for video generation. arXiv preprint arXiv:2511.19356. Cited by: §2, §4.2. Y. Li, Y. Wang, Y. Zhu, Z. Zhao, M. Lu, Q. She, and S. Zhang (2025c) Branchgrpo: stable and efficient grpo with structured branching in diffusion models. arXiv preprint arXiv:2509.06040. Cited by: §2. Y. Lin, Z. Lin, H. Chen, P. Pan, C. Li, S. Chen, K. Wen, Y. Jin, W. Li, and X. Ding (2025a) Jarvisir: elevating autonomous driving perception with intelligent image restoration. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 22369–22380. Cited by: §2. Y. Lin, Z. Lin, K. Lin, J. Bai, P. Pan, C. Li, H. Chen, Z. Wang, X. Ding, W. Li, et al. (2025b) JarvisArt: liberating human artistic creativity via an intelligent photo retouching agent. 
arXiv preprint arXiv:2506.17612. Cited by: §2. Y. Lin, L. Wang, K. Lin, Z. Lin, K. Gong, W. Li, B. Lin, Z. Li, S. Zhang, Y. Peng, et al. (2025c) JarvisEvo: towards a self-evolving photo editing agent with synergistic editor-evaluator optimization. arXiv preprint arXiv:2511.23002. Cited by: §2. Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747. Cited by: §1. H. Liu, H. Huang, J. Wang, C. Liu, X. Li, and X. Ji (2025a) DiverseGRPO: mitigating mode collapse in image generation via diversity-aware grpo. arXiv preprint arXiv:2512.21514. Cited by: §2. J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang (2025b) Flow-grpo: training flow matching models via online rl. arXiv preprint arXiv:2505.05470. Cited by: Figure 1, Figure 1, §1, §1, §2, §4.1. J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wang, et al. (2025c) Improving video generation with human feedback. arXiv preprint arXiv:2501.13918. Cited by: §A.4, §1, §2, §4.1. X. Liu, C. Gong, and Q. Liu (2022) Flow straight and fast: learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003. Cited by: §1. Z. Long, M. Zheng, K. Feng, X. Zhang, H. Liu, H. Yang, L. Zhang, Q. Chen, and Y. Ma (2025) Follow-your-shape: shape-aware image editing via trajectory-guided region control. arXiv preprint arXiv:2508.08134. Cited by: §2. Y. Lu, Y. Zeng, H. Li, H. Ouyang, Q. Wang, K. L. Cheng, J. Zhu, H. Cao, Z. Zhang, X. Zhu, et al. (2025) Reward forcing: efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678. Cited by: §2. Y. Ma, K. Feng, Z. Hu, X. Wang, Y. Wang, M. Zheng, B. Wang, Q. Wang, X. He, H. Wang, et al. (2025) Controllable video generation: a survey. arXiv preprint arXiv:2507.16869. Cited by: §1. X. Mi, W. Yu, J. Lian, S. Jie, R. Zhong, Z. Liu, G. Zhang, Z. Zhou, Z. Xu, Y. Zhou, et al. 
(2025) Video generation models are good latent reward models. arXiv preprint arXiv:2511.21541. Cited by: §2. A. Nair, A. Gupta, M. Dalal, and S. Levine (2020) Awac: accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359. Cited by: §A.6. J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, p. 1889–1897. Cited by: §3.2.2. Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §1, §2. X. Shen, Z. Li, Z. Yang, S. Zhang, Y. Zhang, D. Li, C. Wang, Q. Lu, and Y. Tang (2025) Directly aligning the full diffusion trajectory with fine-grained human preference. arXiv preprint arXiv:2509.06942. Cited by: §1. J. Song, C. Meng, and S. Ermon (2020a) Denoising diffusion implicit models. International Conference on Learning Representations. Cited by: §1. Y. Song, J. N. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2020b) Score-based generative modeling through stochastic differential equations. International Conference On Learning Representations. Cited by: §1. M. L. Team, X. Cai, Q. Huang, Z. Kang, H. Li, S. Liang, L. Ma, S. Ren, X. Wei, R. Xie, et al. (2025) Longcat-video technical report. arXiv preprint arXiv:2510.22200. Cited by: §4.2. B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik (2024) Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 8228–8238. Cited by: §2. T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025) Wan: open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314. Cited by: §1, §2. F. Wang and Z. 
Yu (2025) Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952. Cited by: Figure 1, Figure 1, §1, §4.1. F. Wang, L. Yang, Z. Huang, M. Wang, and H. Li (2024) Rectified diffusion: straightness is not your need in rectified flow. arXiv preprint arXiv:2410.07303. Cited by: §1. B. Wu, C. Zou, C. Li, D. Huang, F. Yang, H. Tan, J. Peng, J. Wu, J. Xiong, J. Jiang, et al. (2025) Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870. Cited by: §1, §1. J. Xu, Y. Huang, J. Cheng, Y. Yang, J. Xu, Y. Wang, W. Duan, S. Yang, Q. Jin, S. Li, et al. (2024) Visionreward: fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059. Cited by: §2. J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong (2023) ImageReward: learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, p. 15903–15935. Cited by: §2. R. Xu, D. Jin, Y. Bai, R. Lan, X. Duan, L. Sun, and X. Chu (2025) Scalar: scale-wise controllable visual autoregressive learning. arXiv preprint arXiv:2507.19946. Cited by: §2. Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang, et al. (2025) DanceGRPO: unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818. Cited by: Figure 1, Figure 1, §1, §1, §2, §2, §4.1. X. Yu, C. Bai, H. He, C. Wang, and X. Li (2024) Regularized conditional diffusion model for multi-task preference alignment. Advances in Neural Information Processing Systems 37, p. 139968–139996. Cited by: §2. S. Zhang, Z. Zhang, C. Dai, and Y. Duan (2026) E-grpo: high entropy steps drive effective reinforcement learning for flow models. arXiv preprint arXiv:2601.00423. Cited by: §2. T. Zhang, C. Da, K. Ding, H. Yang, K. Jin, Y. Li, T. Gao, D. Zhang, S. Xiang, and C. 
Pan (2025) Diffusion model as a noise-aware latent reward model for step-level preference optimization. arXiv preprint arXiv:2502.01051. Cited by: §2. M. Zheng, Y. Xu, H. Huang, X. Ma, Y. Liu, W. Shu, Y. Pang, F. Tang, Q. Chen, H. Yang, et al. (2024) VideoGen-of-thought: step-by-step generating multi-shot video with minimal manual intervention. arXiv preprint arXiv:2412.02259. Cited by: §2. Y. Zhou, P. Ling, J. Bu, Y. Wang, Y. Zang, J. Wang, L. Niu, and G. Zhai (2025) Fine-grained grpo for precise preference alignment in flow models. arXiv preprint arXiv:2510.01982. Cited by: §2.

Appendix A

A.1 Derivation of Manifold-Aware SDE

Variance Derivation. To enable stochastic exploration for GRPO, we need to convert the deterministic Rectified Flow ODE into a stochastic differential equation (SDE) that preserves the marginal probability distribution at each timestep. Recall the general form of a marginal-preserving SDE for flow matching:

dz_t = ( v_θ(x_t, t) − (1/2) ε_t² s_θ(x_t) ) dt + ε_t dw_t,   (16)

where ε_t is the diffusion coefficient (a function of t), w_t is a Brownian motion, and s_θ(x_t) ≈ −(x_t − x̂_0)/σ_t² is the score function estimate. The Itô correction term (1/2) ε_t² s_θ(x_t) ensures that the SDE preserves the same marginal distribution as the deterministic ODE. To discretize this SDE, we assume that v_θ(x_t, t) and s_θ(x_t) remain approximately constant over the interval [σ_{t+1}, σ_t], where σ_t is the noise level at timestep t. The key challenge is to compute the integrated variance Σ_t for the stochastic term. We define:

Σ_t := ∫_{σ_{t+1}}^{σ_t} ε_s² ds.   (17)

For Rectified Flow, we choose ε_t = η √( σ_t / (1 − σ_t) ) to match the geometric structure of the flow trajectory, where η is the exploration scaling factor.
Substituting this form and integrating:

Σ_t = ∫_{σ_{t+1}}^{σ_t} η² σ_s/(1−σ_s) ds   (18)
    = η² ∫_{σ_{t+1}}^{σ_t} ( 1/(1−σ_s) − 1 ) ds   (19)
    = η² [ −log(1−σ_s) − σ_s ]_{σ_{t+1}}^{σ_t}   (20)
    = η² [ −(σ_t − σ_{t+1}) + log( (1−σ_{t+1})/(1−σ_t) ) ].   (21)

Taking the square root, we obtain the noise standard deviation:

Σ_t^{1/2} = η √( −(σ_t − σ_{t+1}) + log( (1−σ_{t+1})/(1−σ_t) ) ).   (22)

The logarithmic term log((1−σ_{t+1})/(1−σ_t)) accounts for the geometric contraction of the signal coefficient (1−σ_t), which linear approximations fail to capture. Applying Euler–Maruyama discretization with timestep Δt = σ_t − σ_{t+1}, the discretized SDE becomes:

x_{t+Δt} = x_t + v_θ(x_t, t) Δt + (Σ_t/2) s_θ(x_t) + Σ_t^{1/2} ε,   (23)

where ε ∼ N(0, I) is the injected stochasticity. Note that Σ_t is already the integrated variance over the interval [σ_{t+1}, σ_t], so the stochastic term uses Σ_t^{1/2} directly without an additional Δt factor.

Problem Formulation. Let the noise level at timestep t be σ_t. In a Rectified Flow setting, the trajectory connects pure noise (σ = 1) to data (σ = 0). We aim to find the precise variance Σ_t required for the stochastic step such that the marginal distribution is preserved up to second order. Let Δσ = σ_t − σ_{t+1} > 0. We analyze the terms inside the square root of our proposed Eq. 7. Let V_t denote the variance term:

V_t = −(σ_t − σ_{t+1}) + log( (1−σ_{t+1})/(1−σ_t) )   (24)

Taylor Expansion Analysis. First, we express the logarithmic term using Δσ:

log( (1−σ_{t+1})/(1−σ_t) ) = log( (1−(σ_t−Δσ))/(1−σ_t) ) = log( 1 + Δσ/(1−σ_t) )   (25)

Let x = Δσ/(1−σ_t). Since step sizes are small, |x| < 1.
We apply the Taylor expansion log(1+x) ≈ x − x²/2 + O(x³):

log( 1 + Δσ/(1−σ_t) ) ≈ Δσ/(1−σ_t) − (1/2) ( Δσ/(1−σ_t) )²   (26)

Substituting this back into V_t:

V_t ≈ −Δσ + ( Δσ/(1−σ_t) − (1/2) Δσ²/(1−σ_t)² )   (27)
    = Δσ ( 1/(1−σ_t) − 1 ) − (1/2) Δσ²/(1−σ_t)²   (28)
    = Δσ · σ_t/(1−σ_t) − O(Δσ²)   (29)

The leading term Δσ · σ_t/(1−σ_t) represents the ideal variance scaling for a geometric schedule, which linear approximations fail to capture.

A.2 Standard Deviation Comparison: Ours vs. FlowGRPO

Figure 9 compares the per-step noise standard deviation of our precise SDE and FlowGRPO under three parameterization regimes, to understand how different noise handling strategies affect exploration behavior.

Regime (a): Both methods using FlowGRPO's σ schedule. When both methods use FlowGRPO's default σ schedule (where σ_t is set to σ_max at early steps), our precise SDE exhibits near-zero standard deviation at the first step. This occurs because our method computes the noise variance via integration: Σ_t = ∫_{σ_{t+1}}^{σ_t} ε_s² ds. When both endpoints are equal (σ_t = σ_{t+1} = σ_max), the integration interval collapses, and the logarithmic term log((1−σ_{t+1})/(1−σ_t)) evaluates to zero, yielding Σ_t ≈ 0. This demonstrates that our integral-based formulation is sensitive to the σ schedule and requires proper boundary handling.

Regime (b): Both methods using aggressive clamping at 1 − 3×10^-3. When both methods apply the same clamping threshold (1−σ) ≥ 3×10^-3 (equivalently, σ ≤ 1 − 3×10^-3), FlowGRPO exhibits explosive behavior at the first step, with standard deviation reaching values around 3.0. This instability arises because FlowGRPO's noise computation involves the ratio σ/(1−σ); when (1−σ) is clamped to a small constant while σ remains large, the denominator becomes artificially small, causing the ratio to explode.
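The clamped-ratio blow-up described above can be illustrated numerically. In the sketch below, `precise_std` implements Equation (22), while `ratio_std` is a hypothetical stand-in for a σ/(1−σ)-dependent noise term (an assumption; FlowGRPO's exact formula is not reproduced here):

```python
import math

def precise_std(sigma_t: float, sigma_next: float, eta: float = 1.0) -> float:
    """Noise std of the manifold-aware SDE, Eq. (22)."""
    v = -(sigma_t - sigma_next) + math.log((1 - sigma_next) / (1 - sigma_t))
    return eta * math.sqrt(max(v, 0.0))  # clamp guards tiny negative round-off

def ratio_std(sigma_t: float, sigma_next: float, eta: float = 1.0) -> float:
    """Hypothetical sigma/(1-sigma)-based std (illustrative assumption)."""
    d_sigma = sigma_t - sigma_next
    return eta * math.sqrt(sigma_t / (1 - sigma_t)) * math.sqrt(d_sigma)

# Regime (b): first step with sigma clamped at 1 - 3e-3.
s0, s1 = 1 - 3e-3, 0.95
print(precise_std(s0, s1))  # stays O(1): the log term grows only slowly
print(ratio_std(s0, s1))    # blows up as (1 - sigma) -> 0
```

The precise formula stays bounded at the clamped first step because the divergence enters only through a logarithm, whereas the ratio-based term diverges like 1/(1−σ); for small steps away from σ = 1, both reduce to the leading-order scaling Δσ·σ/(1−σ) derived in Equation (29).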
In contrast, our precise SDE maintains a stable and controlled standard deviation throughout, starting around 1.0 and decaying smoothly, demonstrating that our manifold-aware formulation inherently handles low-noise regimes more robustly.

Regime (c): Each method using its default implementation. Under their respective default configurations, FlowGRPO uses its standard σ schedule, while our method applies clamping at (1−σ) ≥ 3×10^-3. Our method maintains a lower standard deviation than FlowGRPO across most of the diffusion trajectory, particularly in later steps. This demonstrates that our precise SDE effectively reduces the injected noise magnitude, leading to more refined exploration along the data manifold.

Across all three regimes, our method consistently achieves smaller or more stable standard deviation than FlowGRPO. This supports the main-figure narrative (Figure 1): we remove unnecessary high-frequency noise energy in high-noise regions, enabling more precise exploration that stays closer to the data manifold. This behavior aligns with the micro-level exploration design of our SDE in Section 3.2.1.

Figure 9: Step-wise std comparison: our precise SDE vs. FlowGRPO. (a) When both use FlowGRPO's σ schedule, our integral-based formulation yields near-zero std at the first step due to equal endpoints. (b) When both are clamped at (1−σ) ≥ 3×10^-3, FlowGRPO explodes at step 1, while ours remains stable. (c) Under default implementations, ours maintains lower std across most steps. This supports that we remove ineffective high-frequency noise and explore more precisely along the manifold (Section 3.2.1).

Figure 10: Qualitative comparison highlighting emotional alignment.
Two prompts illustrate SAGE-GRPO's ability to better align with emotional descriptions: (Top) A teenage boy in a coffee shop, where SAGE-GRPO captures the "calm, contemplative expression" and the gentle motion of lowering the mug, while baselines show neutral expressions and abrupt movements. (Bottom) An older chef in a kitchen, where SAGE-GRPO consistently renders the "lines of fatigue" and "somber mood" through deep side-lighting shadows, while baselines fail to convey the intended emotional depth. Our manifold-aware exploration enables precise alignment with subtle emotional and action cues.

A.3 Theoretical Gradient Norm Analysis

Here we derive the relationship between the gradient norm and the noise schedule. For a Gaussian policy π(x_{t−1} | x_t) = N(μ_θ, Σ_t I), the gradient of the log-probability with respect to the drift parameter μ_θ is:

∇_μ log π = (x_sample − μ_θ) / Σ_t   (30)

Since x_sample ∼ N(μ_θ, Σ_t I), the expected norm is proportional to the noise standard deviation divided by the variance:

E[‖∇_μ log π‖] ∝ Σ_t^{1/2} / Σ_t = 1 / Σ_t^{1/2}   (31)

Given our derived Manifold-Aware variance Σ_t ≈ η² Δσ · σ_t/(1−σ_t), the gradient norm scales as:

‖∇‖ ∝ √( (1−σ_t) / (σ_t Δσ) )   (32)

This confirms that as σ_t → 0 (low noise), the gradient norm explodes, necessitating our proposed Gradient Equalizer.

A.4 GRPO Reward and Advantage Details

Here we provide the implementation-aligned definitions of the reward composition and the group-normalized advantage used in Equation (4). Following VideoAlign (Liu et al., 2025c), we construct a composite reward for a generated video x_0:

R(x_0) = w_vq S_vq(x_0) + w_mq S_mq(x_0) + w_ta S_ta(x_0),   (33)

where S_vq, S_mq, and S_ta score visual quality, motion quality, and text alignment, and w_vq, w_mq, w_ta are fixed scalar weights. Given a prompt c, GRPO samples a group of G rollouts {x_0^(i)}_{i=1}^G and computes rewards r_i = R(x_0^(i)).
We use the group mean and standard deviation as a baseline:

μ_R = (1/G) ∑_{j=1}^G r_j,   σ_R = √( (1/G) ∑_{j=1}^G (r_j − μ_R)² ),   (34)

and define the normalized advantage:

A_i = (r_i − μ_R) / (σ_R + ε),   (35)

where ε is a small constant for numerical stability.

A.5 Temporal Gradient Equalizer: Derivation of N_t

We outline how to obtain a per-timestep gradient scale proxy N_t that is compatible with the SDE transition used in Section 3.2.1. Consider a Gaussian transition π(x_{t−1} | x_t) = N(μ_θ, Σ_t I) parameterized through the network output (e.g., the velocity/denoiser prediction) and a noise variance Σ_t determined by the chosen SDE. The log-probability gradient with respect to the mean parameter satisfies:

∇_μ log π = (x_sample − μ_θ) / Σ_t.   (36)

Since x_sample − μ_θ ∼ N(0, Σ_t I), its magnitude is O(Σ_t^{1/2}) in expectation, yielding the inverse relationship:

E[‖∇_μ log π‖] ∝ 1 / Σ_t^{1/2}.   (37)

In practice, the network does not directly parameterize μ_θ; instead, μ_θ is obtained by composing the network prediction with the SDE/solver update rule, introducing an additional sensitivity factor. Let λ_t denote the scalar sensitivity from the solver mapping (the details depend on the SDE type and discretization). We use the proxy:

N_t = λ_t / Σ_t^{1/2},   (38)

and define the Temporal Gradient Equalizer (Equation (9)) as a robust normalization:

S_t = Median({N_τ}_{τ=1}^T) / (N_t + ε).   (39)

This produces approximately uniform gradient scales across timesteps, aligning with the empirical observation in Figure 4 and the training-curve improvement in Figure 3.

A.6 SAGE-GRPO Objective and Adaptive KL Weighting

We provide the complete objective used in SAGE-GRPO, combining GRPO, the Temporal Gradient Equalizer, and Dual KL regularization, together with a principled schedule for the overall KL coefficient.
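Two ingredients of this objective that are fully specified above, the group-normalized advantage (Equations (34)–(35)) and the equalizer weight (Equation (39)), can be sketched as:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-normalized advantages, Eqs. (34)-(35): A_i = (r_i - mu)/(sigma + eps)."""
    g = len(rewards)
    mu = sum(rewards) / g
    sigma = (sum((r - mu) ** 2 for r in rewards) / g) ** 0.5  # population std
    return [(r - mu) / (sigma + eps) for r in rewards]

def equalizer_weights(n: list[float], eps: float = 1e-8) -> list[float]:
    """Temporal Gradient Equalizer, Eq. (39): S_t = median(N) / (N_t + eps)."""
    med = statistics.median(n)
    return [med / (n_t + eps) for n_t in n]
```

By construction the advantages of a group sum to approximately zero, and timesteps whose gradient proxy N_t exceeds the median are down-weighted (S_t < 1) while weak-gradient timesteps are boosted.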
At each optimization step, we sample a group of G rollouts and compute advantages {A_i}_{i=1}^G as in Appendix A.4.

Dual KL regularizer. We use two reference policies to implement a position–velocity controller in policy space (Section 3.2.2). The regularizer is

L_KL = β_pos · D_KL(π_θ ‖ π_{ref_N}) + β_vel · D_KL(π_θ ‖ π_{k−1}),   (40)

where π_{k−1} is the previous policy and π_{ref_N} is a periodically refreshed anchor. The term D_KL(π_θ ‖ π_{k−1}) constrains the instantaneous update (velocity control), while D_KL(π_θ ‖ π_{ref_N}) constrains the cumulative displacement from the anchor (position control). This separation is important for long-horizon training: velocity-only constraints can still accumulate drift, whereas a single fixed anchor can be overly restrictive.

Adaptive KL weighting. We interpret the overall KL coefficient λ_KL as a Lagrange multiplier associated with a trust-region constraint E[D_KL(π_θ ‖ π_ref)] ≤ δ. Instead of fixing λ_KL, we adapt it online so that the realized KL remains close to a target scale, analogous in spirit to adaptive behavior regularization in AWAC (Nair et al., 2020), where a temperature parameter is adapted from advantage statistics.

Warm-up (two-stage increase). Let λ_min = 10^-7 and λ_max = 10^-5 denote the minimum and maximum KL coefficients. During the first K = 100 optimization steps, we use a linear warm-up:

λ_KL(k) = λ_min + (λ_max − λ_min) · k/K,   k ≤ K,   (41)

which corresponds to the two-stage schedules reported in Figure 7. This design keeps the trust region weak early to avoid underfitting, and gradually strengthens it as the policy improves.

Conservative feedback control. After warm-up, we apply a proportional feedback update based on the recent KL history, similar to the P-term of a PID controller.
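The warm-up of Equation (41) together with the proportional feedback rule of Equation (42) can be sketched as a single update function; `d_target` and the sample values are illustrative assumptions:

```python
def kl_coeff(step: int, kl_history: list[float], lam: float, *,
             lam_min: float = 1e-7, lam_max: float = 1e-5,
             K: int = 100, d_target: float = 1e-3, H: int = 10) -> float:
    """Warm-up (Eq. 41) followed by proportional feedback (Eq. 42).

    d_target is an illustrative placeholder for the desired KL scale.
    """
    if step <= K:  # linear warm-up over the first K steps
        return lam_min + (lam_max - lam_min) * step / K
    recent = kl_history[-H:]                 # last H observed KL values
    d_bar = sum(recent) / len(recent)
    if d_bar > 1.5 * d_target:               # KL too large: relax the penalty
        lam *= 0.9
    elif d_bar < 0.5 * d_target:             # KL too small: tighten the trust region
        lam *= 1.1
    return min(max(lam, lam_min), lam_max)   # clip to the warm-up bounds
```

Note the direction of the feedback follows the paper's Equation (42): an overly large empirical KL relaxes the constraint rather than tightening it, and clipping keeps λ_KL within [λ_min, λ_max] at all times.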
Let D̄_KL be the mean of the last H = 10 observed KL values and let D_target be the desired KL scale. We define the KL error e_KL = D̄_KL − D_target and update:

λ_KL ← 0.9 λ_KL if D̄_KL > (1 + 0.5) D_target;   1.1 λ_KL if D̄_KL < (1 − 0.5) D_target;   λ_KL otherwise,   with λ_KL ∈ [λ_min, λ_max],   (42)

where the clipping enforces the same bounds as the warm-up stage. Intuitively, if the empirical KL is much larger than D_target, the controller reduces λ_KL to relax the constraint; if it is much smaller, the controller increases λ_KL to tighten the trust region. This combination of warm-up and feedback control stabilizes the effective trust-region radius and explains the smooth reward trajectories observed in Figure 7.

Full SAGE-GRPO loss. Combining GRPO, the Temporal Gradient Equalizer, and the adaptively weighted Dual KL regularizer yields:

L_SAGE-GRPO(θ) = −(1/G) ∑_{i=1}^G A_i · ∑_{t=1}^T S_t · log π_θ(x_{t−1}^(i) | x_t^(i), c) − λ_KL · L_KL.   (43)

A.7 Additional Qualitative Results

We include additional qualitative visualizations to complement the quantitative experiments in Section 4. Figure 10 demonstrates SAGE-GRPO's superior ability to align with emotional descriptions in text prompts, capturing subtle facial expressions and mood cues that baselines often miss. To further validate the effectiveness of the different KL strategies discussed in Section 3.2.2, we provide qualitative comparisons across five variants: no KL regularization, Fixed KL (anchored to the initial model π_0), Step-wise KL (velocity control only), Moving KL (position control only), and Dual Moving KL (combining both position and velocity control).
Figures 11 and 12 show that Dual Moving KL consistently produces more realistic details, better temporal consistency, and stronger alignment with prompt descriptions compared to other KL strategies, which aligns with the quantitative findings in Section 4.2. Figure 11: KL strategy ablation: qualitative comparison (Case 1). Visual comparison across different KL strategies (no KL, Fixed KL, Step-wise KL, Moving KL, Dual Moving KL) on a prompt describing a fatigued soldier. Dual Moving KL produces more realistic facial details, better dirt and grime rendering, and maintains temporal consistency across frames compared to other variants. Figure 12: KL strategy ablation: qualitative comparison (Case 2). Additional visual comparison demonstrating how different KL strategies affect generation quality. Dual Moving KL consistently achieves better photorealism and alignment with prompt descriptions compared to alternatives, validating the position-velocity control mechanism discussed in Section 3.2.2.