
Paper deep dive

WorldCache: Content-Aware Caching for Accelerated Video World Models

Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 81

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/26/2026, 2:49:00 AM

Summary

WorldCache is a training-free, perception-constrained dynamical caching framework designed to accelerate Diffusion Transformer (DiT) based video world models. It addresses the limitations of existing zero-order hold caching methods—which cause ghosting and motion artifacts—by introducing motion-adaptive thresholds, saliency-weighted drift estimation, optimal feature approximation via blending and warping, and phase-aware threshold scheduling. WorldCache achieves a 2.3x inference speedup on Cosmos-Predict2.5-2B while maintaining 99.4% of baseline quality.

Entities (6)

Cosmos-Predict2.5 · model · 100%
Diffusion Transformers · model-architecture · 100%
PAI-Bench · benchmark · 100%
WorldCache · framework · 100%
DiCache · method · 95%
FasterCache · method · 95%

Relation Signals (3)

WorldCache accelerates Diffusion Transformers

confidence 100% · WorldCache: Content-Aware Caching for Accelerated Video World Models

WorldCache evaluated on PAI-Bench

confidence 100% · On the Physical AI Bench (PAI-Bench) [51], WorldCache achieves a 2.3× speedup

WorldCache outperforms DiCache

confidence 95% · substantially outperforming prior training-free caching approaches

Cypher Suggestions (2)

Find all methods compared against WorldCache · confidence 90% · unvalidated

MATCH (w:Framework {name: 'WorldCache'})-[:OUTPERFORMS]->(m:Method) RETURN m.name

Identify benchmarks used for evaluation · confidence 90% · unvalidated

MATCH (f:Framework {name: 'WorldCache'})-[:EVALUATED_ON]->(b:Benchmark) RETURN b.name

Abstract

Abstract: Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal feature approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed on World-Cache (this https URL).

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

81,187 characters extracted from source content.


WorldCache: Content-Aware Caching for Accelerated Video World Models

Umair Nawaz 1⋆, Ahmed Heakl 1, Ufaq Khan 1, Abdelrahman M. Shaker 1, Salman Khan 1, and Fahad Shahbaz Khan 1,2

1 Mohamed bin Zayed University of Artificial Intelligence, UAE
2 Linköping University, Sweden

Abstract. Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our project can be accessed on: World-Cache.

1 Introduction

World models predict future visual states that are physically consistent and useful for downstream decision-making, enabling agents to plan and act within simulated environments [50]. Large-scale Diffusion Transformers (DiTs) have become the dominant backbone for such models [9, 47, 48], because spatio-temporal attention over latent tokens captures the long-range dependencies central to world consistency (e.g., object permanence and causal motion).
However, this expressiveness comes at a steep computational cost: world-model rollouts require many frames, and each frame is produced by sequentially invoking deep transformer blocks across dozens of denoising steps [10, 35]. The resulting latency is the primary obstacle to interactive world simulation and closed-loop deployment. A natural remedy is to exploit redundancy along the denoising trajectory. Consecutive steps often produce only small changes in intermediate features [13], so recomputing every block at every step is wasteful. Training-free caching methods exploit this observation: they estimate a step-to-step drift using a lightweight probe, then skip expensive layers when drift falls below a threshold, reusing cached activations instead.

⋆ Corresponding author: umair.nawaz@mbzuai.ac.ae · arXiv:2603.22286v1 [cs.CV] 23 Mar 2026

Fig. 1: Qualitative and quantitative comparison of acceleration methods on video world model generation using Cosmos-Predict2.5-2B. Left: Visual comparison of Baseline (no acceleration), FasterCache, DiCache, and WorldCache (ours) across three representative timesteps (T_1, ..., T_N) of a driving scene from the City Street domain. FasterCache achieves 1.6× speedup but introduces severe visual artifacts and scene hallucinations. DiCache (1.3×) better preserves scene fidelity but exhibits noticeable spatial artifacts at later timesteps (red dashed boxes). WorldCache (ours) achieves 2.3× speedup while faithfully reproducing scene content, motion, and background structure across all timesteps (green dashed boxes). Right: Relative latency reduction (%) over Baseline for Text2World (T2W) and Image2World (I2W) tasks, showing WorldCache achieves up to 55.5% reduction compared to 40.5% for FasterCache and 27.9% for DiCache. All results are reported on a single H200 GPU.
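The generic probe-then-cache loop described above can be sketched in a few lines. This is a minimal illustration rather than any author's implementation: `probe`, `deep`, the toy sampler update, and the threshold value are all illustrative placeholders.

```python
import numpy as np

def probe_then_cache(z0, steps, probe, deep, tau=0.08, eps=1e-8):
    """Toy skip-and-reuse loop: run the cheap probe every step, and skip
    the deep blocks whenever relative L1 probe drift stays below tau."""
    z, prev_probe, cached_deep, hits = z0, None, None, 0
    for _ in range(steps):
        p = probe(z)                      # shallow (probe) features
        if prev_probe is not None and cached_deep is not None:
            drift = np.abs(p - prev_probe).sum() / (np.abs(prev_probe).sum() + eps)
            if drift < tau:               # cache hit: reuse stale deep output
                out, hits = cached_deep, hits + 1
            else:                         # cache miss: recompute and refresh
                out = deep(p)
                cached_deep = out
        else:                             # first step: always compute fully
            out = deep(p)
            cached_deep = out
        prev_probe = p
        z = z - 0.1 * out                 # stand-in for the sampler update
    return z, hits
```

Reusing `cached_deep` verbatim on a hit is exactly the zero-order hold that the paper later argues against; WorldCache keeps the gate but replaces the reuse rule.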
FasterCache [33] applies this idea to video DiTs with a fixed skip schedule, and DiCache [6] makes it adaptive via shallow-layer probes that decide both when and how to reuse cached states. For world models, however, this "skip-and-reuse" paradigm fails precisely where it matters most: scenes with significant motion [22] and salient interactions [26]. The failure has a single root cause. Existing methods treat cache reuse as a zero-order hold: when probe drift is small, they copy stale features verbatim into the next step. Under motion, this produces ghosting, semantic smearing, and incoherent trajectories (as shown in Fig. 1), exactly the artifacts that break world-model rollouts, where errors compound across autoregressive generation. Three specific blindspots make the problem worse. First, global drift metrics average over the entire spatial map, so a static background can mask large foreground changes, causing the method to skip when it should recompute. Second, all spatial locations are weighted equally, even though errors on salient entities (agents, hands, manipulated objects) dominate both perceptual and functional quality. Third, a single static threshold ignores that early denoising steps establish global structure while late steps only refine high-frequency detail; a threshold tuned for the early phase becomes wastefully conservative in the late phase. We propose WorldCache, a training-free caching framework that replaces the zero-order hold with a perception-constrained dynamical approximation designed for DiT-based world models. WorldCache addresses each blindspot above with a lightweight, composable module. Causal Feature Caching (CFC) adapts the skip threshold to latent motion magnitude, preventing stale reuse during fast dynamics.
Saliency-Weighted Drift (SWD) reweights the probe signal toward perceptually important regions, so caching decisions reflect foreground fidelity rather than background noise. Optimal Feature Approximation (OFA) replaces verbatim copying with least-squares optimal blending and motion-compensated warping, reducing approximation error when skipping does occur. Adaptive Threshold Scheduling (ATS) progressively relaxes the threshold during late denoising, where aggressive reuse is both safe and highly effective. Together, these modules convert caching from a brittle shortcut into a controlled approximation strategy aligned with world-model requirements. On the Physical AI Bench (PAI-Bench) [51], WorldCache achieves a 2.3× speedup on Cosmos-Predict2.5 (2B) while preserving 99.4% of baseline quality, outperforming both DiCache and FasterCache in the speed–quality trade-off. Our contributions are:

1. We formalize feature caching for DiT-based world models as a dynamical approximation problem and identify the zero-order hold assumption in prior methods as the primary source of ghosting, blur, and motion incoherence in dynamic rollouts.
2. We introduce WorldCache, a unified framework that improves both when to skip (motion- and saliency-aware decisions) and how to approximate (optimal blending and motion compensation), while adapting to the denoising phase.
3. We demonstrate state-of-the-art training-free acceleration on multiple DiT backbones, achieving up to 2.3× speedup with 99.4% quality retention on Cosmos-Predict2.5, and show that the approach transfers across model scales and conditioning modalities.

2 Related Work

Video diffusion and world simulators. Diffusion models have become a leading approach for high-fidelity video generation, from early formulations [17] to scalable latent/cascaded pipelines [3, 15, 16] and large-scale text-to-video systems [42].
Recently, video generation models have also been studied as world simulators, evaluated for physical consistency and action-relevant prediction [37, 38]. In this direction, NVIDIA's Cosmos platform/Cosmos-Predict target physical AI simulation [1, 36], with benchmarks such as PAI-Bench to assess physical plausibility and controllability [51]. Related efforts include interactive environment world models [4] and large token-based models for video generation [24].

Efficient diffusion inference. A common acceleration axis is reducing sampling cost via fewer or cheaper denoising steps. Training-free methods include alternative samplers such as DDIM [43] and fast solvers such as DPM-Solver/DPM-Solver++ [30, 31], while distillation compresses many-step teachers into few-step students [40]. WorldCache instead keeps the base model and schedule, and reduces compute via safe reuse of internal activations.

Caching and reuse in diffusion transformers. Caching methods exploit redundancy across timesteps and guidance passes. DeepCache [34] shows that reusing high-level features across adjacent steps (mainly for U-Nets) can accelerate diffusion inference. For video diffusion transformers, FasterCache accelerates inference by reusing attention features across timesteps and introducing a CFG cache that exploits conditional/unconditional redundancy to reduce guidance overhead [33]. DiCache further makes caching adaptive with an online probe to decide when to refresh and a trajectory-aligned reuse strategy to decide how to combine cached states [6]. Despite strong gains, caching can still be brittle when motion, fine textures, or semantically important regions cause cached states to drift.
Motion-compensated and perception-aware reuse. Feature reuse has also been explored in video recognition via propagation with optical flow [54] and multi-rate update schedules [41], motivating alignment-aware reuse rather than fixed-coordinate copying. Classic and modern flow methods (Lucas–Kanade, RAFT) [32, 44] illustrate the accuracy/efficiency trade-off for motion compensation. Perceptual quality can be tracked with deep perceptual metrics and structure/texture-aware measures [11, 49], while Laplacian pyramids provide a classical multi-scale view of high-frequency detail [7]. WorldCache builds on these ideas with motion-aligned reuse, saliency-aware monitoring, and principled temporal extrapolation inspired by system identification [29].

3 Method

3.1 Preliminaries: DiT Denoising in World Models

We consider a DiT-based world model that predicts future visual states by iteratively denoising a latent video representation. Let z_t ∈ R^{B×T×H×W×D} denote the latent tensor at denoising step t (not to be confused with the video frame index), where B is the batch size, T is the number of latent frames, H×W is the spatial resolution, and D is the channel dimension. The denoiser is a stack of N transformer blocks {F_i}_{i=1}^{N}, producing z_t^{(i)} = F_i(z_t^{(i−1)}), with z_t^{(0)} = z_t and z_t^{(N)} used by the sampler to obtain z_{t+1}. Throughout this section, superscripts in parentheses denote layer indices and subscripts denote denoising steps.

3.2 Foundation: Probe-Then-Cache

WorldCache inherits its architectural skeleton from the probe-then-cache paradigm introduced by DiCache [6], but replaces both its skip criterion and its reuse mechanism. We first describe the shared skeleton, then identify the two components we redesign.

Fig. 2: A DiT world model denoises latent video states z_t by running probe blocks (1...k) followed by deep blocks (k+1...N).
WorldCache inserts a decision gate that skips deep blocks when the saliency-weighted probe drift (δ_SWD) is below a motion- and step-adaptive threshold τ_ATS(t) (computed by CFC+ATS), enabling cache hits. On a cache hit, OFA reuses computation by aligning cached residuals via optimal interpolation and optional motion-compensated warping to approximate z_t^{(N)}. On a cache miss, the model executes the deep blocks, updates the residual cache in a ping-pong buffer, and continues the denoising loop for t = 1...T.

Probe (inherited). At each step t, only the first k blocks (the probe depth) are evaluated to obtain z_t^{(k)}. A drift indicator approximates the deep-layer change:

δ_t = ∥z_t^{(k)} − z_{t−1}^{(k)}∥_1 / (∥z_{t−1}^{(k)}∥_1 + ε).  (1)

If δ_t falls below a threshold, blocks k+1, ..., N are skipped and cached deep states are reused; otherwise the full network executes and the cache is refreshed.

Skip criterion (replaced). DiCache uses a fixed global threshold on δ_t. WorldCache replaces this with motion-adaptive, saliency-weighted decisions (Secs. 3.4–3.5).

Reuse mechanism (replaced). DiCache estimates a scalar blending coefficient from L1 residual ratios and interpolates between cached states from steps t−1 and t−2. This captures the magnitude of feature evolution but discards directional information. WorldCache replaces this with a vector-projection-based approximation and optional motion-compensated warping (Sec. 3.6).

3.3 WorldCache Overview

Fig. 2 summarizes the full pipeline. At each denoising step t, the probe computes shallow features z_t^{(k)}. CFC (Sec. 3.4) and SWD (Sec. 3.5) jointly determine whether to skip by combining a motion-adaptive threshold with a saliency-weighted drift signal. On a cache hit, OFA (Sec. 3.6) approximates the deep output via least-squares optimal blending and optional spatial warping. ATS (Sec.
3.7) modulates the skip threshold across the denoising trajectory, tightening it during structure-formation steps and relaxing it during late refinement. All four modules are training-free and add negligible overhead to the probe computation.

3.4 Causal Feature Caching (CFC): Motion-Adaptive Decisions

When is reuse safe? In world-model video, the amount of motion varies substantially across prompts and across denoising steps. A fixed threshold is overly permissive during fast motion (risking ghosting) and overly conservative during static intervals (missing speedups). CFC adapts the skip threshold using an inexpensive motion proxy derived from the raw latent input. We define a "velocity" as the normalized two-step input change:

v_t = ∥z_t^{(0)} − z_{t−2}^{(0)}∥_1 / (∥z_{t−2}^{(0)}∥_1 + ε).  (2)

We use a two-step gap because step t−1 may itself be a cached approximation; anchoring to t−2 (the most recent fully-computed input) yields a more reliable velocity estimate. The motion-adaptive threshold is:

τ_CFC(v_t) = τ_0 / (1 + α · v_t),  (3)

where τ_0 is the base threshold and α controls sensitivity. When dynamics are fast (v_t large), τ_CFC tightens, making skips less likely; when dynamics are slow, τ_CFC ≈ τ_0. We maintain a ping-pong buffer (two alternating cache slots indexed by step parity) so that reuse is always anchored to one of the two most recent fully-computed states.

3.5 Saliency-Weighted Drift (SWD): Perception-Aware Probing

Is the drift signal measuring the right thing? The global drift δ_t (Eq. 1) treats every spatial location equally, so it cannot distinguish between harmless background fluctuation and critical foreground change. SWD reweights drift toward perceptually important regions, ensuring that the method recomputes when salient content changes and skips when only the background drifts.
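Before formalizing the saliency map, note that the CFC gate of Eqs. 2–3 reduces to a few lines of numpy. A minimal sketch using the defaults reported in Sec. 4.1 (τ_0 = 0.08, α = 2.0); the function name and tensor shapes are illustrative:

```python
import numpy as np

def cfc_threshold(z_in_t, z_in_tm2, tau0=0.08, alpha=2.0, eps=1e-8):
    """CFC: normalized two-step input 'velocity' (Eq. 2) and the
    motion-adaptive skip threshold (Eq. 3), which tightens under motion."""
    v_t = np.abs(z_in_t - z_in_tm2).sum() / (np.abs(z_in_tm2).sum() + eps)
    tau_cfc = tau0 / (1.0 + alpha * v_t)
    return tau_cfc, v_t
```

For a static latent (v_t ≈ 0) the threshold stays at τ_0; doubling the input magnitude (v_t = 1) shrinks it to τ_0/3, making skips correspondingly harder to trigger.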
We define a spatial saliency map from the channel-wise variance of probe features:

S_{h,w} = Var_d( z̄_t^{(k)}[h, w, :] ),  (4)

where z̄_t^{(k)} is the probe output averaged over the batch and temporal axes, and the variance is taken over the channel dimension d. High channel variance indicates spatially complex, information-rich regions (edges, textures, object boundaries) where caching errors are most perceptually visible (Fig. 3).

Fig. 3: Saliency overlay (Cosmos-Predict2.5, 2B). Channel-variance saliency from step 40 overlaid on four video frames. High saliency (yellow) marks structurally complex regions where caching errors are most visible; low saliency (purple) marks smooth areas where reuse is safe. SWD reweights drift using this signal to prioritize fidelity on detail-rich content.

We normalize to Ŝ ∈ [0, 1] and define the saliency-weighted drift:

δ_t^{SWD} = (1 / HW) Σ_{h,w} ∥z_t^{(k)}(h, w) − z_{t−1}^{(k)}(h, w)∥_1 · (1 + β_s Ŝ_{h,w}),  (5)

where β_s controls saliency emphasis. The weighting term (1 + β_s Ŝ_{h,w}) amplifies drift contributions from salient regions and attenuates those from featureless backgrounds. Consequently, a scene where only the static sky changes produces a low δ_t^{SWD} (safe to skip), while one where a foreground agent moves, even slightly, produces a high δ_t^{SWD} (triggering recomputation). The final skip decision combines SWD with the motion-adaptive threshold from CFC:

skip ⟺ δ_t^{SWD} < τ_CFC(v_t).  (6)

3.6 Optimal Feature Approximation (OFA): Improved Reuse Quality

When we skip, can we produce a better approximation? CFC and SWD decide when to skip.
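The saliency-weighted drift of Eqs. 4–5 and the gate of Eq. 6 can be sketched compactly, assuming probe features laid out as (H, W, D); β_s = 0.12 follows Sec. 4.1, everything else is illustrative:

```python
import numpy as np

def swd_drift(p_t, p_tm1, beta_s=0.12, eps=1e-8):
    """Saliency-weighted drift: channel-variance saliency (Eq. 4)
    reweights the per-location L1 drift (Eq. 5).
    p_t, p_tm1: probe features of shape (H, W, D)."""
    s = p_t.var(axis=-1)                               # Eq. 4: variance over channels
    s_hat = (s - s.min()) / (s.max() - s.min() + eps)  # normalize to [0, 1]
    l1 = np.abs(p_t - p_tm1).sum(axis=-1)              # per-location L1 change
    return (l1 * (1.0 + beta_s * s_hat)).mean()        # Eq. 5: mean over H, W

def should_skip(p_t, p_tm1, tau_cfc):
    """Eq. 6: skip deep blocks iff weighted drift stays below tau_cfc."""
    return swd_drift(p_t, p_tm1) < tau_cfc
```

Locations with high channel variance (edges, objects) contribute up to (1 + β_s)× their raw drift, so a small foreground change can outweigh a larger but featureless background fluctuation.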
OFA improves what is produced on a cache hit, via two complementary operators: one temporal (least-squares optimal blending) and one spatial (motion-compensated warping).

Optimal State Interpolation (OSI). On a cache hit, the deep output z_t^{(N)} must be approximated from cached history. DiCache [6] estimates a scalar blending coefficient γ from L1 distance ratios between probe residuals. This captures the magnitude of feature evolution but discards directional information: when motion causes the feature trajectory to curve, the scalar ratio extrapolates along a stale direction, and the resulting errors accumulate over consecutive cache hits. We reformulate the estimation as a least-squares vector projection. Define the deep computation residual:

r_t = z_t^{(N)} − z_t^{(0)},  (7)

and on a cache hit, let r̃_t = z_t^{(k)} − z_t^{(0)} be the probe-derived partial residual. We seek a gain γ* that best aligns the recent residual trajectory with the current probe signal:

Δ_tgt = r̃_t − r_{t−2},  Δ_src = r_{t−1} − r_{t−2},  (8)

γ* = argmin_γ ∥Δ_tgt − γ Δ_src∥² = ⟨Δ_tgt, Δ_src⟩ / (∥Δ_src∥² + ε).  (9)

We clamp γ* to [0, γ_max] (we use γ_max = 2) to prevent blow-up when ∥Δ_src∥ is small. The deep output is approximated as:

ẑ_t^{(N)} = z_t^{(0)} + r_{t−2} + γ* (r_{t−1} − r_{t−2}).  (10)

The inner product in Eq. 9 is the key difference from scalar-ratio methods. When the feature trajectory curves (e.g., a moving object changes direction), the dot product naturally attenuates γ*, preventing extrapolation along a stale direction. When the trajectory is linear, OSI recovers the same estimate as scalar-ratio methods. OSI thus generalizes scalar-ratio alignment; we verify the improvement empirically in the ablation study.

Motion-Compensated Feature Warping. OSI corrects temporal misalignment in the residual trajectory, but cached features from step t−1 may also be spatially misaligned when the scene contains motion.
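Before turning to warping, the OSI gain of Eqs. 8–10 can be sketched directly. A minimal numpy version operating on residual arrays of any shape; the function name is illustrative, and γ_max = 2 follows the text:

```python
import numpy as np

def osi_approximate(z0_t, r_tilde_t, r_tm1, r_tm2, gamma_max=2.0, eps=1e-8):
    """OSI: project the recent residual step onto the probe-observed change
    (Eqs. 8-9), then extrapolate the cached residuals (Eq. 10)."""
    d_tgt = r_tilde_t - r_tm2                 # Eq. 8: target change
    d_src = r_tm1 - r_tm2                     # Eq. 8: source change
    # Eq. 9: least-squares gain via inner product (np.vdot flattens arrays)
    gamma = np.vdot(d_tgt, d_src) / (np.vdot(d_src, d_src) + eps)
    gamma = float(np.clip(gamma, 0.0, gamma_max))  # clamp to [0, gamma_max]
    return z0_t + r_tm2 + gamma * (r_tm1 - r_tm2)  # Eq. 10
```

On a perfectly linear residual trajectory the projection recovers the scalar-ratio estimate; when the trajectory curves, the inner product attenuates γ* instead of extrapolating along the stale direction.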
OFA optionally warps cached features to the current coordinate frame before applying OSI. We estimate a displacement field between consecutive latent inputs via multi-scale correlation in latent space (no external network):

u_{t→t−1} = LatentCorr( z_t^{(0)}, z_{t−1}^{(0)} ),  (11)

which adds less than 3% overhead per cached step. The cached deep features are then warped:

z̃_{t−1}^{(N)} = Warp( z_{t−1}^{(N)}, u_{t→t−1} ),  (12)

and z̃_{t−1}^{(N)} replaces z_{t−1}^{(N)} in the residual computation of Eq. 10. That is, OSI operates on the spatially-corrected residuals r̃_{t−1} = z̃_{t−1}^{(N)} − z_{t−1}^{(0)}, reducing the compound spatial drift that is especially harmful in autoregressive world-model rollouts. We disable warping during the first five denoising steps, where the low signal-to-noise ratio makes displacement estimation unreliable.

Fig. 4: Comparison of static vs. adaptive caching strategies. (a) A fixed threshold (τ = 0.12) fails to accommodate the naturally increasing drift (δ) in later denoising steps, resulting in frequent recomputations and a low skip rate (36%). (b) Our adaptive threshold (ATS) dynamically scales τ_ATS with the expected drift. This allows ATS to capture significantly more cache hits (green bars) in later stages, nearly doubling the overall skip rate (68%) while maintaining generation quality.

3.7 Adaptive Threshold Scheduling (ATS): Phase-Aware Reuse

Can we push acceleration further without breaking fidelity? The preceding modules (CFC, SWD, OFA) establish a perception-aware caching infrastructure, but they operate with a fixed base threshold τ_0. The denoising trajectory, however, has two distinct phases. During structure formation (early steps, high noise), the network makes large, semantically critical updates that establish global layout and motion. During detail refinement (late steps, low noise), updates become small, high-frequency corrections.
A static threshold calibrated for the early phase becomes unnecessarily conservative in the late phase: empirically, the cache hit rate drops sharply after step 20 (out of 35) because probe drift falls below the detection floor while the threshold remains unchanged (Fig. 4). ATS addresses this mismatch with a step-dependent relaxation:

τ_ATS(t) = τ_CFC(v_t) · (1 + β_d · t / T),  (13)

where t ∈ [0, T] is the denoising step, T is the total number of steps, and β_d controls the relaxation rate. At an early step (e.g., t = 2, T = 35, β_d = 4.0), the multiplier is ≈ 1.2, keeping the threshold tight and forcing full execution for structure-critical updates. At a late step (e.g., t = 32), the multiplier reaches ≈ 4.6, aggressively relaxing the threshold. Since the scene geometry is already established and the network produces only fine texture corrections, the cached OFA approximation is safely reused across consecutive steps. This aggressive late-stage relaxation is the primary source of WorldCache's speedup gains. The ablation in Sec. 4.3 confirms that it reduces quality by less than 0.6% relative to the baseline.

Table 1: Text2World (T2W) generation results on PAI-Bench across two model scales. We evaluate four methods, Baseline (no acceleration), DiCache, FasterCache, and WorldCache (ours), on Cosmos-Predict2.5 at both 2B and 14B parameter scales. Domain Score aggregates seven semantic categories (CS: City Street, AV: Aerial View, RO: Road, IN: Indoor, HU: Human, PH: Physics, MI: Mixed), while Quality Score spans eight perceptual and fidelity dimensions (SC: Scene Consistency, BC: Background Consistency, MS: Motion Smoothness, AQ: Aesthetic Quality, IQ: Image Quality, OC: Object Consistency, IS: Imaging Subject, IB: Imaging Background). Avg. denotes the overall score averaged across both Domain and Quality metrics. Lat. reports wall-clock inference latency (seconds), and Speedup is relative to the unaccelerated Baseline.
∆ rows quantify the per-metric gain of WorldCache over DiCache, the strongest competing method. WorldCache achieves the best accuracy–efficiency trade-off, consistently outperforming DiCache while delivering up to 2.10× speedup at 2B and 2.14× at 14B with negligible quality degradation.

Cosmos-Predict2.5 – 2B:

| Method | CS | AV | RO | IN | HU | PH | MI | Dom. Avg ↑ | SC | BC | MS | AQ | IQ | OC | IS | IB | Qual. Avg ↑ | Avg. ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.759 | 0.643 | 0.724 | 0.820 | 0.769 | 0.859 | 0.846 | 0.767 | 0.909 | 0.929 | 0.979 | 0.501 | 0.712 | 0.199 | 0.788 | 0.808 | 0.728 | 0.748 | 54.34 | 1.0× |
| DiCache | 0.756 | 0.631 | 0.707 | 0.799 | 0.773 | 0.849 | 0.833 | 0.759 | 0.902 | 0.925 | 0.978 | 0.493 | 0.705 | 0.197 | 0.780 | 0.838 | 0.727 | 0.743 | 40.82 | 1.3× |
| FasterCache | 0.675 | 0.553 | 0.549 | 0.691 | 0.652 | 0.719 | 0.745 | 0.629 | 0.849 | 0.909 | 0.970 | 0.405 | 0.594 | 0.176 | 0.709 | 0.796 | 0.676 | 0.652 | 34.51 | 1.6× |
| WorldCache | 0.759 | 0.639 | 0.735 | 0.810 | 0.760 | 0.845 | 0.839 | 0.763 | 0.903 | 0.927 | 0.979 | 0.492 | 0.703 | 0.196 | 0.782 | 0.826 | 0.727 | 0.745 | 26.28 | 2.1× |
| ∆ (WC−DC) | +0.003 | +0.008 | +0.028 | +0.011 | −0.013 | −0.004 | +0.006 | +0.004 | +0.001 | +0.002 | +0.001 | −0.001 | −0.002 | −0.001 | +0.002 | −0.012 | +0.000 | +0.002 | −14.54 | +0.80× |

Cosmos-Predict2.5 – 14B:

| Method | CS | AV | RO | IN | HU | PH | MI | Dom. Avg ↑ | SC | BC | MS | AQ | IQ | OC | IS | IB | Qual. Avg ↑ | Avg. ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.782 | 0.643 | 0.762 | 0.828 | 0.794 | 0.900 | 0.880 | 0.792 | 0.940 | 0.948 | 0.988 | 0.518 | 0.719 | 0.202 | 0.806 | 0.846 | 0.746 | 0.769 | 216.25 | 1.0× |
| DiCache | 0.795 | 0.645 | 0.757 | 0.819 | 0.790 | 0.906 | 0.880 | 0.792 | 0.939 | 0.949 | 0.988 | 0.518 | 0.714 | 0.201 | 0.806 | 0.845 | 0.745 | 0.768 | 148.36 | 1.4× |
| FasterCache | 0.707 | 0.564 | 0.584 | 0.710 | 0.677 | 0.773 | 0.785 | 0.659 | 0.884 | 0.930 | 0.979 | 0.427 | 0.604 | 0.180 | 0.731 | 0.821 | 0.694 | 0.676 | 126.60 | 1.7× |
| WorldCache | 0.792 | 0.659 | 0.751 | 0.838 | 0.794 | 0.908 | 0.879 | 0.795 | 0.940 | 0.948 | 0.987 | 0.517 | 0.718 | 0.201 | 0.804 | 0.856 | 0.746 | 0.771 | 98.61 | 2.14× |
| ∆ (WC−DC) | −0.003 | +0.014 | −0.006 | +0.019 | +0.004 | +0.002 | −0.001 | +0.003 | +0.001 | −0.001 | −0.001 | −0.001 | +0.004 | +0.000 | −0.002 | +0.011 | +0.001 | +0.003 | −49.75 | +0.74× |

4 Experiments

4.1 Experimental Setup

Base models. We evaluate WorldCache on the Cosmos-Predict2.5 family of Video World Models [1] at two scales: the 2B-parameter and 14B-parameter variants.
Both models employ a Video Diffusion Transformer backbone with 3D Rotary Positional Encoding (RoPE) and are trained with a Flow Matching (velocity-prediction) objective. Each sampling run generates 93 frames (∼5.8 s at 16 FPS), corresponding to 24 latent frames, using 35 denoising steps with Euler scheduling. To demonstrate that WorldCache is not specific to a single backbone, we additionally evaluate on WAN2.1 [45], a DiT-based video generation model, using the official inference configuration for the corresponding checkpoints.

Benchmark. All methods are evaluated on PAI-Bench (Physical AI Benchmark) [51], a comprehensive evaluation suite designed specifically for Video World Models. PAI-Bench spans six physical domains, Robot, Autonomous Vehicles, Human Activity, Industry, Common Sense, and Physics, and reports a Domain Score (physical plausibility), a Quality Score (visual fidelity), and their average as the Overall Score. We report results on both the Text-to-World (T2W) and Image-to-World (I2W) generation tasks. More details on each evaluation metric are provided in the supplementary material.

Baselines. We compare against two state-of-the-art training-free caching baselines: DiCache [6], which employs an online probe profiling scheme with trajectory-aligned residual reuse; and FasterCache [33], which uses a fixed step-skipping schedule combined with CFG-Cache for unconditional branch reuse. All methods are applied as drop-in replacements on the same base model checkpoints. Wall-clock latency is measured on a single NVIDIA H200 GPU with identical batch size and precision settings.

Table 2: Image2World (I2W) generation results on PAI-Bench across two model scales. We evaluate four methods, Baseline (no acceleration), DiCache, FasterCache, and WorldCache (ours), on Cosmos-Predict2.5 at both 2B and 14B parameter scales under the Image2World setting, where a conditioning image guides video world generation. ∆ rows quantify the per-metric gain of WorldCache over DiCache. Compared to the T2W setting, I2W scores are generally higher owing to the additional visual grounding provided by the conditioning image. WorldCache achieves the best accuracy–efficiency trade-off, consistently outperforming DiCache while delivering up to 2.3× speedup at 2B and 2.18× at 14B with negligible quality degradation across all domain and quality dimensions.

Cosmos-Predict2.5 – 2B:

| Method | CS | AV | RO | IN | HU | PH | MI | Dom. Avg ↑ | SC | BC | MS | AQ | IQ | OC | IS | IB | Qual. Avg ↑ | Avg. ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.919 | 0.694 | 0.811 | 0.877 | 0.840 | 0.909 | 0.886 | 0.845 | 0.896 | 0.929 | 0.982 | 0.505 | 0.674 | 0.212 | 0.936 | 0.952 | 0.761 | 0.803 | 55.04 | 1.0× |
| DiCache | 0.899 | 0.697 | 0.791 | 0.876 | 0.828 | 0.887 | 0.909 | 0.835 | 0.885 | 0.923 | 0.980 | 0.492 | 0.660 | 0.212 | 0.927 | 0.940 | 0.752 | 0.794 | 39.68 | 1.4× |
| FasterCache | 0.855 | 0.676 | 0.697 | 0.829 | 0.739 | 0.851 | 0.847 | 0.772 | 0.800 | 0.872 | 0.974 | 0.432 | 0.577 | 0.197 | 0.888 | 0.919 | 0.708 | 0.740 | 32.75 | 1.7× |
| WorldCache | 0.912 | 0.708 | 0.796 | 0.876 | 0.833 | 0.893 | 0.890 | 0.840 | 0.892 | 0.926 | 0.982 | 0.496 | 0.661 | 0.212 | 0.931 | 0.948 | 0.756 | 0.798 | 24.48 | 2.3× |
| ∆ (WC−DC) | +0.013 | +0.011 | +0.005 | +0.000 | +0.005 | +0.006 | −0.019 | +0.005 | +0.007 | +0.003 | +0.002 | +0.004 | +0.001 | +0.000 | +0.004 | +0.008 | +0.004 | +0.004 | −15.20 | +0.9× |

Cosmos-Predict2.5 – 14B:

| Method | CS | AV | RO | IN | HU | PH | MI | Dom. Avg ↑ | SC | BC | MS | AQ | IQ | OC | IS | IB | Qual. Avg ↑ | Avg. ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.920 | 0.716 | 0.826 | 0.905 | 0.849 | 0.922 | 0.924 | 0.860 | 0.912 | 0.935 | 0.988 | 0.510 | 0.665 | 0.213 | 0.958 | 0.966 | 0.769 | 0.814 | 210.07 | 1.0× |
| DiCache | 0.913 | 0.716 | 0.826 | 0.886 | 0.844 | 0.920 | 0.921 | 0.855 | 0.911 | 0.935 | 0.988 | 0.509 | 0.658 | 0.212 | 0.956 | 0.965 | 0.767 | 0.811 | 146.04 | 1.4× |
| FasterCache | 0.856 | 0.688 | 0.715 | 0.842 | 0.743 | 0.869 | 0.862 | 0.782 | 0.813 | 0.873 | 0.975 | 0.437 | 0.567 | 0.195 | 0.906 | 0.930 | 0.712 | 0.747 | 123.75 | 1.7× |
| WorldCache | 0.923 | 0.727 | 0.824 | 0.901 | 0.845 | 0.925 | 0.909 | 0.859 | 0.912 | 0.935 | 0.988 | 0.509 | 0.664 | 0.213 | 0.957 | 0.966 | 0.768 | 0.813 | 99.25 | 2.18× |
| ∆ (WC−DC) | +0.010 | +0.011 | −0.002 | +0.015 | +0.001 | +0.005 | −0.012 | +0.004 | +0.001 | +0.000 | +0.000 | +0.000 | +0.006 | +0.001 | +0.001 | +0.001 | +0.001 | +0.002 | −46.79 | +0.74× |

WorldCache configuration.
Unless otherwise stated, the full WorldCache pipeline combines four modules: Causal Feature Caching (CFC), Saliency-Weighted Drift (SWD), Optimal Feature Approximation (OFA), and Adaptive Threshold Scheduling (ATS). We set the base threshold τ₀ = 0.08, motion sensitivity α = 2.0, saliency weight β_s = 0.12, and β_d = 4.0 across the diffusion trajectory. All hyperparameters are fixed across models and tasks without per-prompt tuning. Unless otherwise stated, we also apply the same caching pipeline and hyperparameter selection protocol to WAN2.1 as in Cosmos, and report results on the same PAI-Bench [51] tasks for direct comparison.

4.2 Main Results

Text2World (T2W) on Cosmos-Predict2.5. Table 1 shows that WorldCache provides the strongest accuracy–efficiency trade-off at both scales. On Cosmos-2B, WorldCache reduces latency from 54.34 s to 26.28 s (2.1×) while preserving the overall average score (0.745 vs. 0.748; ∼99.6% retention). DiCache is substantially less aggressive (40.82 s, 1.3×) and slightly lower in Avg. (0.743), while FasterCache is faster than DiCache (34.51 s, 1.6×) but suffers a large quality drop (Avg. 0.652), reflecting degraded world-model fidelity. The Δ row highlights that WorldCache improves over DiCache most clearly in the Domain categories tied to dynamic scene structure (e.g., RO and IN) while maintaining essentially identical Quality Avg. (0.727). On Cosmos-14B, WorldCache again achieves the best frontier point, reaching 98.61 s (2.14×) and improving Avg. to 0.771 compared to 0.769 for the unaccelerated baseline. In contrast, DiCache achieves 1.4× (148.36 s) with Avg. 0.768, and FasterCache reaches 1.7× but with substantially lower Avg. (0.676).

Image2World (I2W) on Cosmos-Predict2.5. As shown in Table 2, I2W scores are higher overall due to additional visual grounding, and WorldCache remains the best speed–quality trade-off. On Cosmos-2B, WorldCache achieves 2.3× speedup (55.04 s → 24.48 s) with Avg. 0.798, close to the baseline 0.803. DiCache is slower (1.4×, 39.68 s) and lower in Avg. (0.794), while FasterCache is faster (1.7×) but drops sharply in Avg. (0.740). The Δ row indicates that WorldCache improves over DiCache across most Domain categories and consistently boosts the perceptual consistency metrics (SC/BC/MS) while maintaining similar OC. On Cosmos-14B, WorldCache delivers 2.18× speedup (210.07 s → 99.25 s) with negligible change in Avg. relative to the baseline (0.813 vs. 0.814), outperforming DiCache (1.4×, Avg. 0.811) and FasterCache (1.7×, Avg. 0.747).

Table 3: WAN2.1 results on PAI-Bench. We report Domain Score (Avg), Quality Score (Avg), Overall Score (computed as the mean of Domain Avg and Quality Avg), wall-clock latency (seconds), and speedup relative to the unaccelerated baseline. Best accelerated results per task/column are in bold.

| Task | Method | Domain Avg ↑ | Quality Avg ↑ | Overall ↑ | Latency (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|
| WAN2.1-1.3B T2W | Baseline | 0.7862 | 0.7592 | 0.7727 | 120.04 | 1.00× |
| | DiCache | 0.7841 | 0.7564 | 0.7703 | 61.57 | 1.96× |
| | WorldCache (Ours) | 0.7853 | 0.7589 | 0.7721 | 50.84 | 2.36× |
| | Δ (WC−DC) | +0.0012 | +0.0025 | +0.0018 | −10.73 | +0.40× |
| WAN2.1-14B I2W | Baseline | 0.7065 | 0.7703 | 0.7384 | 475.60 | 1.00× |
| | DiCache | 0.6949 | 0.7672 | 0.7311 | 291.91 | 1.53× |
| | WorldCache (Ours) | 0.7069 | 0.7707 | 0.7388 | 206.73 | 2.31× |
| | Δ (WC−DC) | +0.0120 | +0.0035 | +0.0077 | −85.18 | +0.78× |

Transfer to WAN2.1. Table 3 demonstrates that the benefits of WorldCache transfer beyond Cosmos to WAN2.1. On WAN2.1-1.3B (T2W), WorldCache improves over DiCache in both efficiency and score: 61.57 s → 50.84 s (1.96× → 2.36×) while increasing the overall score from 0.7703 to 0.7721.
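The speedup and retention figures above follow directly from the reported latencies and scores; a quick sanity check using the WAN2.1-1.3B (T2W) numbers from Table 3:

```python
def speedup(baseline_latency_s: float, method_latency_s: float) -> float:
    """Relative speedup over the unaccelerated baseline (higher is faster)."""
    return baseline_latency_s / method_latency_s

def retention(method_score: float, baseline_score: float) -> float:
    """Fraction of baseline quality preserved by the accelerated method."""
    return method_score / baseline_score

# WAN2.1-1.3B (T2W): baseline 120.04 s / overall 0.7727,
# WorldCache 50.84 s / overall 0.7721 (Table 3).
print(round(speedup(120.04, 50.84), 2))           # 2.36 (x), as reported
print(round(100 * retention(0.7721, 0.7727), 1))  # ~99.9 (% of baseline quality)
```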
On WAN2.1-14B (I2W), WorldCache yields a large latency reduction over DiCache (291.91 s → 206.73 s; 1.53× → 2.31×) and recovers the overall score to 0.7388, even slightly surpassing the baseline's 0.7384. These results confirm that WorldCache consistently improves the DiCache frontier, trading negligible fidelity for substantially greater speed across both conditioning modalities and model families. More results on the EgoDex benchmark [18] are included in the supplementary material.

Qualitative Results. Fig. 5 demonstrates that DiCache introduces temporal artifacts (object ghosting, inconsistent motion) in dynamic scenes, highlighted by red dashed boxes, whereas WorldCache maintains coherent object appearance and consistent trajectories across frames (green dashed boxes).

Fig. 5: Qualitative Image2World comparison (PAI-Bench). Given the same conditioning image and text description, we show evenly-spaced frames from the generated rollout for (a) Baseline, (b) DiCache, and (c) WorldCache. DiCache exhibits temporal artifacts under dynamic scene evolution, e.g., object ghosting/deformation and inconsistent motion of the vehicle and foreground structures (highlighted with red dashed boxes). WorldCache preserves the scene layout and maintains more coherent object appearance and trajectories over time (highlighted with green dashed boxes), while achieving substantial inference acceleration. Best viewed zoomed in.

This qualitative comparison on PAI-Bench shows that WorldCache achieves both better temporal consistency and inference acceleration over the baseline and DiCache. More qualitative examples are presented in the supplementary material. Overall, we observe gains in the speed–quality boundary across both T2W and I2W and at both 2B and 14B scales.
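To make the interplay of the modules concrete, the sketch below composes a WorldCache-style reuse decision from the ingredients described in the paper, using the reported hyperparameters (τ₀ = 0.08, α = 2.0, β_s = 0.12, β_d = 4.0). The specific functional forms here — the division by 1 + α·motion, the exponential late-phase relaxation, and the additive saliency weighting — are illustrative assumptions of ours, not the paper's exact equations:

```python
import math

# Reported hyperparameters (Sec. 4.1). The formulas below are an
# illustrative guess at how such a decision rule could look, not the
# authors' exact math.
TAU_0, ALPHA, BETA_S, BETA_D = 0.08, 2.0, 0.12, 4.0

def saliency_weighted_drift(drift, saliency):
    """SWD-style drift (assumed form): up-weight salient pixels so that
    background drift does not dominate the reuse decision."""
    weights = [1.0 + BETA_S * s for s in saliency]
    return sum(w * d for w, d in zip(weights, drift)) / sum(weights)

def reuse_threshold(motion, progress):
    """CFC tightens the threshold under fast motion; ATS relaxes it late
    in denoising (progress in [0, 1]), where refinement updates are small."""
    tau = TAU_0 / (1.0 + ALPHA * motion)                       # motion-adaptive (CFC)
    return tau * math.exp(BETA_D * max(0.0, progress - 0.5))   # phase-aware (ATS)

def reuse_cached_features(drift, saliency, motion, progress):
    """Cache hit iff the weighted drift stays below the adaptive threshold."""
    return saliency_weighted_drift(drift, saliency) < reuse_threshold(motion, progress)

# Static scene early in denoising: small drift, so cached features are reused.
print(reuse_cached_features([0.05] * 4, [0.5] * 4, motion=0.0, progress=0.2))  # True
# Same drift under fast motion: the tightened threshold forces a recompute.
print(reuse_cached_features([0.05] * 4, [0.5] * 4, motion=2.0, progress=0.2))  # False
```

The intuition matches the ablation narrative: CFC makes reuse safer under motion, SWD focuses the drift estimate on salient regions, and ATS spends the accumulated quality margin in the low-noise phase.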
The WAN2.1 results further indicate that these benefits transfer across model families, supporting the view that perception- and dynamics-constrained caching is a generally useful inference primitive for world-model generation. Overall, our findings point to a practical recipe for accelerating world-model rollouts: invest in decision/approximation quality early, then exploit the refinement phase to harvest large speedups with minimal impact on temporal coherence.

4.3 Ablation Study

To isolate the contribution of each WorldCache module, we conduct incremental ablations on the 2B model under the I2W setting (Table 4). We provide further ablations of the key components in the supplementary material.

Table 4: Incremental ablation on Cosmos-Predict2.5-2B (I2W). Each row adds one module to the previous configuration. Legend: CFC = Causal Feature Caching (motion-adaptive thresholding); SWD = Saliency-Weighted Drift; OFA = Optimal Feature Approximation (Online System Identification); ATS = Adaptive Threshold Scheduling (dynamic threshold decay).

| Configuration | Domain ↑ | Quality ↑ | Overall ↑ | Speedup ↑ | Latency (s) ↓ |
|---|---|---|---|---|---|
| Base | 0.8447 | 0.7607 | 0.8027 | 1.00× | 55 |
| + CFC | 0.8457 | 0.7583 | 0.8020 | 1.52× | 36 |
| + CFC + SWD | 0.8414 | 0.7592 | 0.8003 | 1.67× | 33 |
| + CFC + SWD + OFA | 0.8468 | 0.7602 | 0.8035 | 1.49× | 37 |
| + CFC + SWD + OFA + ATS (WorldCache) | 0.8395 | 0.7559 | 0.7977 | 2.3× | 25 |

CFC improves safety under motion. Adding CFC yields a 1.52× speedup (55 → 36 s) with an essentially unchanged overall score (0.8020 vs. 0.8027). This indicates that motion-adaptive thresholding is an effective first-order control signal: it preserves fidelity by tightening reuse during fast dynamics, while still enabling frequent cache hits in stable intervals.

SWD increases cache hits by focusing on salient regions. With SWD, speed increases further to 1.67× (33 s).
Although the global scores remain close to baseline, SWD improves decision quality by preventing background drift from dominating the skip criterion, thereby yielding additional cache hits without sacrificing foreground stability.

OFA "invests" in approximation quality. Introducing OFA yields the highest-fidelity configuration (Overall 0.8035), slightly exceeding the baseline. The improved approximation, least-squares-optimal blending (and motion-aligned reuse when enabled), reduces error accumulation on cache hits. This comes with added overhead and a more conservative hit pattern, which explains the reduced net speedup (1.49×, 37 s). In other words, OFA intentionally trades some throughput to raise the quality margin.

ATS "spends" the quality margin for speed. Finally, ATS unlocks the largest acceleration by relaxing thresholds late in denoising, where refinement updates are small. With the stabilizing effects of CFC+SWD+OFA in place, ATS increases cache hits in the low-noise phase, improving speed to 2.30× (25 s) while keeping the overall score within 0.6% of baseline (0.7977 vs. 0.8027), matching the intended invest-and-spend effect.

5 Conclusion

We presented WorldCache, a unified framework for perception-constrained dynamical caching in DiT-based video generation. By identifying the Zero-Order Hold assumption as the core source of artifacts in prior diffusion caching methods, we redesigned caching around motion-aware decisions (CFC), saliency-aligned drift estimation (SWD), improved approximation operators (OFA), and denoising-phase scheduling (ATS). WorldCache requires no training and no architectural changes, making it immediately deployable for accelerating next-generation video world models.

References

1. Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical AI.
arXiv preprint arXiv:2511.00062 (2025)
2. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)
3. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023). https://arxiv.org/abs/2311.15127
4. Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., de Freitas, N., Singh, S., Rocktäschel, T.: Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391 (2024). https://arxiv.org/abs/2402.15391
5. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision 61(3), 211–231 (2005)
6. Bu, J., Ling, P., Zhou, Y., Wang, Y., Zang, Y., Lin, D., Wang, J.: DiCache: Let diffusion model determine its own cache. In: International Conference on Learning Representations (ICLR) (2026). https://openreview.net/forum?id=kflYZjGumW
7. Burt, P.J., Adelson, E.H.: The Laplacian pyramid as a compact image code. IEEE Transactions on Communications 31(4), 532–540 (1983). https://persci.mit.edu/pub_pdfs/pyramid83.pdf
8. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9650–9660 (2021)
9.
Chen, S., Xu, M., Ren, J., Cong, Y., He, S., Xie, Y., Sinha, A., Luo, P., Xiang, T., Perez-Rua, J.M.: GenTron: Diffusion transformers for image and video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6441–6451 (2024)
10. Chi, X., Ge, K., Liu, J., Zhou, S., Jia, P., He, Z., Liu, Y., Li, T., Han, L., Han, S., et al.: MinD: Learning a dual-system world model for real-time planning and implicit risk analysis. arXiv preprint arXiv:2506.18897 (2025)
11. Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unifying structure and texture similarity. arXiv preprint arXiv:2004.07728 (2020). https://arxiv.org/abs/2004.07728
12. Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)
13. Fuest, M., Ma, P., Gui, M., Schusterbauer, J., Hu, V.T., Ommer, B.: Diffusion models and representation learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)
14. Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.C., Dong, Y., Mo, K., Lin, C.H., et al.: DreamDojo: A generalist robot world model from large-scale human videos. arXiv preprint arXiv:2602.06949 (2026)
15. He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221 (2022). https://arxiv.org/abs/2211.13221
16. Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., Salimans, T.: Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022). https://arxiv.org/abs/2210.02303
17.
Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video diffusion models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022). https://arxiv.org/abs/2204.03458
18. Hoque, R., Huang, P., Yoon, D.J., Sivapurapu, M., Zhang, J.: EgoDex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709 (2025)
19. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)
20. Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., et al.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
21. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)
22. Khan, U., Nawaz, U., Khan, M., Gueaieb, W., El Saddik, A.: DeepSkinFormer: Skin lesion segmentation using hierarchical transformers and edge enhancement. In: 2024 IEEE International Conference on Image Processing (ICIP). pp. 3868–3874. IEEE (2024)
23. Khan, U., Nawaz, U., Sheikh, T.T., Hanif, A., Yaqub, M.: Guardian: Guarding against uncertainty and adversarial risks in robot-assisted surgeries. In: International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging. pp. 59–69. Springer (2024)
24.
Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Schindler, G., Hornung, R., Birodkar, V.N., Yan, J., Chiu, M.C., Somandepalli, K., Akbari, H., Alon, Y., Cheng, Y., Dillon, J.V., Gupta, A., Hahn, M., Hauth, A., Hendon, D., Martinez, A., Minnen, D., Sirotenko, M., Sohn, K., Yang, X., Adam, H., Yang, M.H., Essa, I., Wang, H., Ross, D.A., Seybold, B., Jiang, L.: VideoPoet: A large language model for zero-shot video generation. In: International Conference on Machine Learning (ICML) (2024). https://icml.cc/virtual/2024/poster/34296
25. LAION-AI: aesthetic-predictor. https://github.com/LAION-AI/aesthetic-predictor (2022), GitHub repository
26. Li, X., He, X., Zhang, L., Wu, M., Li, X., Liu, Y.: A comprehensive survey on world models for embodied AI. arXiv preprint arXiv:2510.16732 (2025)
27. Li, Z., Zhu, Z.L., Han, L.H., Hou, Q., Guo, C.L., Cheng, M.M.: AMT: All-pairs multi-field transforms for efficient frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9801–9810 (2023)
28. Liu, F., Zhang, S., Wang, X., Wei, Y., Qiu, H., Zhao, Y., Zhang, Y., Ye, Q., Wan, F.: Timestep embedding tells: It's time to cache for video diffusion model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7353–7363 (2025)
29. Ljung, L.: System Identification: Theory for the User. Prentice Hall, 2nd edn. (1999)
30. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In: Advances in Neural Information Processing Systems (NeurIPS) (2022). https://arxiv.org/abs/2206.00927
31. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095 (2022).
https://arxiv.org/abs/2211.01095
32. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI). vol. 2, pp. 674–679 (1981)
33. Lv, Z., Si, C., Song, J., Yang, Z., Qiao, Y., Liu, Z., Wong, K.Y.K.: FasterCache: Training-free video diffusion model acceleration with high quality. In: International Conference on Learning Representations (ICLR) (2025). https://openreview.net/forum?id=W49UjcpGxx
34. Ma, X., Fang, G., Wang, X.: DeepCache: Accelerating diffusion models for free. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024). https://arxiv.org/abs/2312.00858
35. Ma, Z., Zhang, Y., Jia, G., Zhao, L., Ma, Y., Ma, M., Liu, G., Zhang, K., Ding, N., Li, J., et al.: Efficient diffusion models: A comprehensive survey from principles to practices. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
36. NVIDIA: Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575 (2025). https://arxiv.org/abs/2501.03575
37. OpenAI: Video generation models as world simulators. Technical report (2024). https://openai.com/index/video-generation-models-as-world-simulators/
38. Qin, Y., Shi, Z., Yu, J., Wang, X., Zhou, E., Li, L., Yin, Z., Liu, X., Sheng, L., Shao, J., Bai, L., Ouyang, W., Zhang, R.: WorldSimBench: Towards video generation models as world simulators. arXiv preprint arXiv:2410.18072 (2024). https://arxiv.org/abs/2410.18072
39.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
40. Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. In: International Conference on Learning Representations (ICLR) (2022). https://arxiv.org/abs/2202.00512
41. Shelhamer, E., Rakelly, K., Hoffman, J., Darrell, T.: Clockwork convnets for video semantic segmentation. In: European Conference on Computer Vision (ECCV) (2016). https://arxiv.org/abs/1608.03609
42. Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., Parikh, D., Gupta, S., Taigman, Y.: Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022). https://arxiv.org/abs/2209.14792
43. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: International Conference on Learning Representations (ICLR) (2021). https://arxiv.org/abs/2010.02502
44. Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision (ECCV) (2020). https://arxiv.org/abs/2003.12039
45. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
46. Wang, Y., He, Y., Li, Y., Li, K., Yu, J., Ma, X., Li, X., Chen, G., Chen, X., Wang, Y., et al.: InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942 (2023)
47.
Wang, Z., Xia, X., Chen, R., Yu, D., Wang, C., Gong, M., Liu, T.: LaVin-DiT: Large vision diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20060–20070 (2025)
48. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
49. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018). https://arxiv.org/abs/1801.03924
50. Zhao, C., Zhang, R., Wang, J., Zhao, G., Niyato, D., Sun, G., Mao, S., Kim, D.I.: World models for cognitive agents: Transforming edge intelligence in future networks. arXiv preprint arXiv:2506.00417 (2025)
51. Zhou, F., Huang, J., Li, J., Ramanan, D., Shi, H.: PAI-Bench: A comprehensive benchmark for physical AI. arXiv preprint arXiv:2512.01989 (2025). https://arxiv.org/abs/2512.01989
52. Zhou, F., Huang, J., Li, J., Ramanan, D., Shi, H.: PAI-Bench: A comprehensive benchmark for physical AI. arXiv preprint arXiv:2512.01989 (2025)
53. Zhou, X., Liang, D., Chen, K., Feng, T., Chen, X., Lin, H., Ding, Y., Tan, F., Zhao, H., Bai, X.: Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching. arXiv preprint arXiv:2507.02860 (2025)
54. Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017). https://openaccess.thecvf.com/content_cvpr_2017/papers/Zhu_Deep_Feature_Flow_CVPR_2017_paper.pdf

WorldCache: Content-Aware Caching for Accelerated Video World Models
Supplementary Material

Overview

This supplementary material provides the following additional details:
– Evaluation Metrics: Efficiency measures (latency/speedup), PAI-Bench-G quality protocol, and EgoDex-Eval fidelity metrics (PSNR/SSIM/LPIPS).
– Implementation Details & Runtime: Training-free inference setup, evaluated backbones, hardware, and benchmark runtime considerations.
– Baseline Comparisons: Extended comparisons on PAI-Bench beyond the main paper.
– Robotic Manipulation Results: WorldCache evaluation on EgoDex-Eval.
– Extended Technical Details: Motion-compensated feature warping (OFA) and adaptive threshold scheduling (ATS).
– Additional Ablations: Hyperparameter analysis and denoising-step budget effects.
– Qualitative Results: Caching failure modes and WorldCache mitigation examples.
– Limitations & Future Work: Discussion of limitations and potential future directions.

A Evaluation Metrics

We report two complementary aspects of performance: (i) quality (how faithful and coherent the generated world rollout is) and (ii) efficiency (how fast the rollout can be generated). We evaluate quality on PAI-Bench using Domain and Quality scores with fine-grained sub-metrics, and additionally evaluate reconstruction-style fidelity on EgoDex-Eval using PSNR/SSIM/LPIPS. Efficiency is reported consistently across benchmarks using wall-clock latency and relative speedup for world generation.

A.1 Efficiency Metrics

Latency.
We report Latency as the end-to-end wall-clock time (in seconds) to generate one video sample under a fixed sampling configuration (same number of denoising steps, same resolution, same number of frames, same hardware and batch size). Latency includes the complete inference pipeline (probe computation, cache decision logic, approximation overhead, and any motion-compensation steps when enabled).

Speedup. We report Speedup relative to the unaccelerated baseline:

\mathrm{Speedup} = \frac{\mathrm{Latency(Baseline)}}{\mathrm{Latency(Method)}}.  (14)

Higher speedup indicates faster generation.

A.2 PAI-Bench Quality Metrics

PAI-Bench [52] provides a multi-track evaluation framework for Physical AI. In this work we focus on the generation track (PAI-Bench-G), which measures a world model's ability to synthesize coherent future videos under both Text-to-World (T2W) and Image-to-World (I2W) conditioning. Following the official protocol, we evaluate on the full PAI-Bench-G suite comprising 1,044 samples and report both quality and efficiency metrics. Quality is decomposed into a Quality Score (eight perceptual dimensions) and a Domain Score (seven semantic/physical categories); the two are averaged into an Overall score.

Quality Score. The Quality Score follows the VBench/VBench++ evaluation suite [19,20] and comprises eight sub-metrics. Throughout, we denote a generated video as a sequence of T frames f_1, f_2, ..., f_T.

Subject Consistency (SC). This metric evaluates identity stability of the primary subject across frames. We extract per-frame DINO [8] features d_t (unit-normalised) and compute:

SC = \frac{1}{T-1} \sum_{t=2}^{T} \frac{1}{2}\left( \langle d_1, d_t \rangle + \langle d_{t-1}, d_t \rangle \right),  (15)

where ⟨·,·⟩ denotes cosine similarity. The first term captures long-range consistency with the initial frame, while the second captures local (frame-to-frame) stability.

Background Consistency (BC).
Background stability is assessed analogously, using CLIP [39] image features c_t instead:

BC = \frac{1}{T-1} \sum_{t=2}^{T} \frac{1}{2}\left( \langle c_1, c_t \rangle + \langle c_{t-1}, c_t \rangle \right).  (16)

Motion Smoothness (MS). Motion plausibility is quantified via a frame-interpolation consistency test. The video is subsampled by dropping all odd-indexed frames, which are then reconstructed using a pre-trained frame interpolation model [27]. Let \hat{f}_{2k-1} denote the reconstructed version of the original frame f_{2k-1}. The raw error is:

S_{\mathrm{smooth}} = \frac{1}{\lfloor T/2 \rfloor} \sum_{k=1}^{\lfloor T/2 \rfloor} \left\| f_{2k-1} - \hat{f}_{2k-1} \right\|_1,  (17)

which is normalised to [0, 1] and inverted so that higher values indicate smoother motion:

MS = 1 - \frac{S_{\mathrm{smooth}}}{255}.  (18)

Aesthetic Quality (AQ). Visual appeal, encompassing composition, colour harmony, and artistic quality, is scored per frame using the LAION aesthetic predictor [25] on a [0, 10] scale. Scores are linearly mapped to [0, 1] and averaged:

AQ = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{LAION}(f_t)}{10}.  (19)

Imaging Quality (IQ). Low-level fidelity (noise, blur, exposure artefacts) is evaluated with the MUSIQ predictor [21], yielding per-frame scores in [0, 100]. The video-level metric is:

IQ = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathrm{MUSIQ}(f_t)}{100}.  (20)

Overall Consistency (OC). Semantic alignment between the generated video and the textual prompt is measured using the ViCLIP [46] video–text similarity score:

OC = \mathrm{ViCLIP}(f_1, \ldots, f_T, \mathrm{prompt}).  (21)

I2V Subject (IS). For image-conditioned generation, subject fidelity is measured by comparing the conditioning image f_{ref} to generated frames using DINO features s_t:

IS = \frac{1}{T-1} \sum_{t=2}^{T} \frac{1}{2}\left( \langle s_{\mathrm{ref}}, s_t \rangle + \langle s_{t-1}, s_t \rangle \right).  (22)

I2V Background (IB). Background and layout fidelity in image-conditioned generation is computed using DreamSim [12] features b_t, which are sensitive to spatial layout:

IB = \frac{1}{T-1} \sum_{t=2}^{T} \frac{1}{2}\left( \langle b_{\mathrm{ref}}, b_t \rangle + \langle b_{t-1}, b_t \rangle \right).  (23)

Quality aggregation.
The Quality Avg is computed as the arithmetic mean of the eight sub-metrics:

\mathrm{Quality\ Avg} = \frac{1}{8} \sum_{m \in \{SC,\, BC,\, MS,\, AQ,\, IQ,\, OC,\, IS,\, IB\}} m.  (24)

Domain Score (Semantic/Physical Plausibility). While the Quality Score captures perceptual fidelity, it does not assess whether generated dynamics are physically plausible. The Domain Score addresses this gap through an MLLM-as-Judge protocol: a strong vision-language model, specifically Qwen3-VL-235B-A22B-Instruct [2], is queried with uniformly sampled frames and a curated set of Q binary verification questions \{q_j\}_{j=1}^{Q} that encode expected physical and semantic constraints (e.g., "Does the robotic arm lift the object?"). For each question the judge emits a binary response \hat{a}_j ∈ {YES, NO}. The Domain Score is then the accuracy of these responses against the ground-truth labels a_j:

\mathrm{Domain} = \frac{1}{Q} \sum_{j=1}^{Q} \mathbb{1}\left[ \hat{a}_j = a_j \right].  (25)

Questions are organised into seven semantic categories reflecting distinct facets of world-model reasoning: CS (Common Sense), AV (Autonomous Vehicle), RO (Robot), IN (Industry), HU (Human), PH (Physics), and MI (Miscellaneous). We report per-category accuracy and compute Domain Avg as their mean:

\mathrm{Domain\ Avg} = \frac{1}{7} \sum_{c \in \{CS,\, AV,\, RO,\, IN,\, HU,\, PH,\, MI\}} c.  (26)

Overall Score. Following the PAI-Bench convention, we summarise generation quality with a single scalar that equally weights perceptual fidelity and physical plausibility:

\mathrm{Overall} = \frac{1}{2}\left( \mathrm{Domain\ Avg} + \mathrm{Quality\ Avg} \right).  (27)

A.3 EgoDex-Eval Fidelity Metrics

To complement PAI-Bench with reconstruction-style fidelity measures, we also evaluate on EgoDex-Eval [18], reporting standard full-reference image/video metrics alongside efficiency.

PSNR. Peak Signal-to-Noise Ratio measures pixel-level reconstruction fidelity between generated and reference frames. Higher PSNR indicates lower distortion.

SSIM. Structural Similarity Index measures perceived structural similarity (luminance, contrast, and structure) between generated and reference frames.
Higher SSIM indicates better structural preservation.

LPIPS. Learned Perceptual Image Patch Similarity measures perceptual distance using deep features. Lower LPIPS indicates closer perceptual similarity to the reference.

B Implementation Details and Runtime Setup

Training-free inference acceleration. WorldCache is a strictly training-free and plug-and-play method. It does not modify model weights and can be enabled/disabled at runtime. All improvements are obtained by (i) deciding when to reuse cached intermediate activations and (ii) applying a lightweight cache-hit approximation, both of which can be toggled on/off at runtime.

Evaluated backbones. Cosmos-Predict2.5 [1] is a diffusion transformer (DiT) based world model that supports both Text2World and Image2World generation within a unified sampling pipeline. WAN2.1 [45] is an open large-scale diffusion-transformer video model suite, released at multiple parameter scales (e.g., 1.3B/14B). DreamDojo [14] is a robot-oriented interactive world model trained on large-scale human egocentric videos, released with pretrained/post-trained checkpoints (e.g., 2B/14B) and evaluation sets. In our EgoDex-Eval [18] experiments, we evaluate WorldCache on the provided model checkpoints without modifying training.

Hardware and codebases. All experiments are run on a single NVIDIA H200 (140 GB) GPU, except for the calculation of the Domain score in PAI-Bench, which uses 4 NVIDIA H200 140 GB GPUs because Qwen3-VL-235B-A22B-Instruct [2] serves as the judge. We use the official inference codebases and released checkpoints for all evaluated backbones: Cosmos-Predict2.5, WAN2.1, and DreamDojo. Unless otherwise specified, we keep the generation configuration fixed across methods within each backbone (same resolution, video length, denoising steps, scheduler, guidance setting, and batch size).
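A minimal harness for the latency protocol described here (end-to-end wall-clock time under a fixed configuration, identical across methods) might look as follows; `generate` is a stand-in for the full sampling pipeline, and on a GPU one would additionally synchronize the device (e.g., `torch.cuda.synchronize()`) before reading the clock:

```python
import time

def measure_latency(generate, n_warmup=1, n_runs=3):
    """End-to-end wall-clock latency in seconds for one generation call.

    Runs a few warmup iterations (to exclude one-time setup costs), then
    returns the median over timed runs. `generate` must encapsulate the
    complete pipeline: probe computation, cache decisions, approximation
    overhead, denoising, and decoding.
    """
    for _ in range(n_warmup):
        generate()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]

# Usage with a stubbed 10 ms "pipeline":
latency = measure_latency(lambda: time.sleep(0.01))
print(f"{latency:.3f} s")
```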
Reported numbers are reproduced from our runs under the same hardware and software environment. All compared caching baselines are run under identical generation settings and hardware. Hyperparameters for WorldCache are selected on a held-out subset and then fixed for the full benchmark. No method uses additional training data or model updates during evaluation.

PAI-Bench-G evaluation (world-model generation). For world-model generation quality, we follow the PAI-Bench-G track [52] and evaluate on the full suite of 1,044 samples under both T2W and I2W. We report the benchmark-defined quality metrics (Domain/Quality/Overall, as described in Sec. A.2) and efficiency metrics (Latency/Speedup). Latency is measured end-to-end from the start of the denoising loop to the completion of decoding, and includes probe computation, cache decision logic, and any approximation overhead (e.g., OFA warping). To highlight evaluation-scale impact, we also report the total runtime to process the entire benchmark:

T_{\mathrm{PAI}} = 1044 \times T_{\mathrm{avg}},  (28)

where T_{\mathrm{avg}} is the mean per-sample latency in our tables. For example, on Cosmos-2B (I2W), the baseline runtime corresponds to 55.04 × 1044 ≈ 16 hours, whereas WorldCache at 24.48 s/sample corresponds to ≈ 7.1 hours, saving ≈ 9 hours per full PAI-Bench-G run on the same hardware.

EgoDex-Eval evaluation (ground-truth-conditioned robotics video). To evaluate WorldCache on downstream robotics prediction with ground-truth video, we use EgoDex-Eval, an egocentric manipulation benchmark derived from the EgoDex dataset [18]. Following the standard protocol used in robot world-model evaluation, we condition on the first frame and generate a rollout (e.g., 81 frames in our WAN2.1-14B setup), then compute frame-level full-reference metrics [23]: PSNR↑, SSIM↑, and LPIPS↓, together with Latency/Speedup. As with PAI-Bench, we measure end-to-end wall-clock latency for the full generation pipeline.
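The evaluation-scale arithmetic of Eq. (28) is easy to verify from the per-sample latencies reported for Cosmos-2B (I2W):

```python
N_SAMPLES = 1044  # full PAI-Bench-G suite

def benchmark_hours(avg_latency_s: float) -> float:
    """Total benchmark runtime in hours: T_PAI = 1044 x T_avg (Eq. 28)."""
    return N_SAMPLES * avg_latency_s / 3600.0

baseline = benchmark_hours(55.04)    # ~16.0 h unaccelerated
worldcache = benchmark_hours(24.48)  # ~7.1 h with WorldCache
print(round(baseline, 1), round(worldcache, 1), round(baseline - worldcache, 1))
```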
When reporting evaluation-scale time, we use:

T_EgoDex = N_eval × T_avg,    (29)

where N_eval is the number of evaluation episodes and T_avg is the mean per-episode latency.

C More Baseline Comparison

To further contextualize WorldCache against recent training-free caching baselines, we compare with EasyCache [53] and TeaCache [28] under the same Cosmos-Predict2.5-2B setup for both Image2World (I2W) and Text2World (T2W) on PAI-Bench (Table 5). We additionally include DiCache as a strong probe-based caching baseline. In Table 5, bold/underline indicate the best/second-best results among accelerated methods only (excluding the unaccelerated baseline).

Text2World (T2W). For T2W, TeaCache (Slow) achieves the best accelerated Overall (0.7454) and the second-best Quality (0.7274), but again is conservative (49.40 s, 1.1×). EasyCache yields the best accelerated Domain (0.7641), while DiCache provides the strongest speed among non-WorldCache baselines (40.82 s, 1.3×). WorldCache reduces latency to 26.28 s (2.10×) with Overall 0.7450, matching the quality of the strongest baseline variants while delivering substantially higher acceleration.

Image2World (I2W). Among prior caching baselines, EasyCache attains the best Domain score (0.8399), while TeaCache (Slow) achieves the strongest quality and overall among non-WorldCache methods (Quality 0.7562, Overall 0.7979) but at conservative speed (49.59 s, 1.1×). DiCache and TeaCache (Fast) are faster (∼40–41 s, 1.3×) but have lower overall scores (0.7941–0.7965). WorldCache provides a substantially better efficiency point with a 2.30× speedup (55.04 s → 24.48 s) while keeping the overall score competitive (0.7977): it is close to the best accelerated overall score and far faster than all other caching methods.
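The evaluation-scale arithmetic of Eq. (28) applied to the I2W latencies above can be checked in a few lines (a quick sketch; the helper name is ours):

```python
def pai_runtime_hours(t_avg_s: float, n_samples: int = 1044) -> float:
    """Total PAI-Bench-G runtime T_PAI = n_samples * T_avg (Eq. 28), in hours."""
    return n_samples * t_avg_s / 3600.0

baseline_h = pai_runtime_hours(55.04)    # Cosmos-2B (I2W) baseline: ~16.0 h
worldcache_h = pai_runtime_hours(24.48)  # WorldCache: ~7.1 h
saved_h = baseline_h - worldcache_h      # ~8.9 h saved per full benchmark run
```

This recovers the ≈16 h vs. ≈7.1 h figures quoted in Sec. B, i.e., roughly 9 hours saved per full benchmark pass on the same hardware.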
Overall, existing caching baselines cluster around ∼1.3× speedup (or trade speed for slightly higher scores), whereas WorldCache consistently exceeds 2× speedup while remaining within the same quality band on both I2W and T2W.

Table 5: PAI-Bench comparison with EasyCache and TeaCache (Cosmos-2B). We report Domain, Quality, Overall, latency (s), and speedup vs. baseline for both Image2World (I2W) and Text2World (T2W). Bold and underline denote the best and second-best among accelerated methods only (excluding Baseline) within each block.

Text2World (T2W) — Cosmos-Predict2.5-2B

| Method | Domain ↑ | Quality ↑ | Overall ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|
| Baseline | 0.7670 | 0.7280 | 0.7475 | 54.34 | 1.0× |
| EasyCache | 0.7641 | 0.7262 | 0.7451 | 41.41 | 1.3× |
| DiCache | 0.7590 | 0.7272 | 0.7431 | 40.82 | 1.3× |
| TeaCache (Fast) | 0.7616 | 0.7266 | 0.7448 | 41.07 | 1.4× |
| TeaCache (Slow) | 0.7634 | 0.7274 | 0.7454 | 49.40 | 1.1× |
| WorldCache (Ours) | 0.7630 | 0.7270 | 0.7450 | 26.28 | 2.10× |

Image2World (I2W) — Cosmos-Predict2.5-2B

| Method | Domain ↑ | Quality ↑ | Overall ↑ | Lat. (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|
| Baseline | 0.8450 | 0.7610 | 0.8030 | 55.04 | 1.0× |
| EasyCache | 0.8399 | 0.7552 | 0.7975 | 40.25 | 1.3× |
| DiCache | 0.8352 | 0.7522 | 0.7941 | 39.68 | 1.3× |
| TeaCache (Fast) | 0.8381 | 0.7549 | 0.7965 | 41.00 | 1.3× |
| TeaCache (Slow) | 0.8396 | 0.7562 | 0.7979 | 49.59 | 1.1× |
| WorldCache (Ours) | 0.8395 | 0.7559 | 0.7977 | 24.48 | 2.30× |

D Evaluation Results on Robotic Manipulation: EgoDex-Eval

To evaluate WorldCache on a downstream robotics setting with ground-truth video supervision, we benchmark on EgoDex-Eval [18], an egocentric robot manipulation dataset. We condition each model on the first frame of an episode and generate rollouts, reporting frame-level PSNR and SSIM (↑) and LPIPS (↓) against the ground-truth video, along with end-to-end latency and speedup (Table 6).

WAN2.1-14B [45] (I2V). WorldCache achieves a 2.30× speedup (391.9 s → 171.6 s) while remaining close to baseline quality: PSNR drops marginally (13.19 vs. 13.30; ∼99.2% retention), SSIM remains high (0.498 vs. 0.503), and LPIPS is nearly unchanged (0.460 vs. 0.459).
DiCache is also faster than baseline (1.88×, 208.6 s) but exhibits a larger fidelity gap across all three metrics (12.95 PSNR, 0.491 SSIM, 0.461 LPIPS). This setting is particularly challenging for caching because egocentric manipulation contains continuous hand–object contact and fine-grained motion, where stale reuse can accumulate errors. The results indicate that WorldCache better preserves motion and appearance while accelerating inference.

Cosmos-2.5-2B [1] (I2V). On Cosmos-Predict-2.5-2B, WorldCache improves both efficiency and fidelity relative to DiCache. It reaches a 1.62× speedup (70.01 s → 43.24 s) while preserving PSNR (12.82 vs. 12.87) and matching the best LPIPS (0.518, equal to baseline). Notably, WorldCache also attains a higher SSIM (0.466) than both the baseline (0.455) and DiCache (0.445), suggesting improved structural stability under caching. DiCache provides a smaller 1.34× speedup (51.97 s) and shows larger degradation in PSNR and LPIPS (12.63 PSNR, 0.531 LPIPS). Overall, this block demonstrates that WorldCache's caching strategy transfers beyond a single backbone under ground-truth-conditioned evaluation.

Table 6: EgoDex-Eval results (I2V) across backbones. We report frame-level PSNR/SSIM/LPIPS against ground-truth videos, along with end-to-end latency and speedup relative to the unaccelerated baseline. Bold denotes the best value per block (including Baseline), and underline denotes the best accelerated result (DiCache/WorldCache).

WAN2.1-14B (I2V)

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Speedup ↑ | Lat. (s) ↓ |
|---|---|---|---|---|---|
| Baseline | 13.30 | 0.503 | 0.459 | 1.00× | 391.9 |
| DiCache | 12.95 | 0.491 | 0.461 | 1.88× | 208.6 |
| WorldCache | 13.19 | 0.498 | 0.460 | 2.30× | 171.6 |

Cosmos-Predict-2.5-2B (I2V)

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Speedup ↑ | Lat. (s) ↓ |
|---|---|---|---|---|---|
| Baseline | 12.87 | 0.455 | 0.518 | 1.00× | 70.01 |
| DiCache | 12.63 | 0.445 | 0.531 | 1.34× | 51.97 |
| WorldCache | 12.82 | 0.466 | 0.518 | 1.62× | 43.24 |

DreamDojo-2B (I2V)

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Speedup ↑ | Lat. (s) ↓ |
|---|---|---|---|---|---|
| Baseline | 23.63 | 0.775 | 0.226 | 1.00× | 19.73 |
| DiCache | 20.41 | 0.734 | 0.252 | 1.58× | 12.46 |
| WorldCache | 23.69 | 0.737 | 0.251 | 1.90× | 10.36 |

DreamDojo-2B [14] (I2V).
For DreamDojo-2B, WorldCache also delivers a favorable trade-off, achieving a 1.90× speedup (19.73 s → 10.36 s) while preserving PSNR (23.69 vs. 23.63) with moderate changes in SSIM and LPIPS (0.737 vs. 0.775; 0.251 vs. 0.226). In contrast, DiCache yields a smaller 1.58× speedup (12.46 s) and suffers a pronounced fidelity loss (PSNR 20.41, SSIM 0.734, LPIPS 0.252). Overall, across WAN2.1, Cosmos, and DreamDojo, WorldCache consistently outperforms DiCache, achieving higher speedups with lower quality degradation under EgoDex-Eval.

E Extended Technical Details

E.1 Motion-Compensated Feature Warping in OFA

In Section 3.6, we described the Optimal Feature Alignment (OFA) mechanism, which estimates a displacement field to warp cached features from the previous timestep to the current coordinate frame. Because computing dense optical flow directly on high-resolution, deep semantic feature maps can introduce significant computational overhead and susceptibility to high-frequency activation noise, we implement a spatial flow-scaling technique. Specifically, we introduce a spatial downsampling factor, s_flow, applied prior to the displacement calculation. During inference, the latent features z_t^(0) and z_{t−1}^(0) are first downsampled to a spatial resolution of (s_flow · H) × (s_flow · W) via bilinear interpolation. The Lucas-Kanade optical flow [5] equations are solved on this lower-resolution grid to produce a coarse displacement field. Finally, this continuous displacement field is upsampled back to the original resolution, and its vector magnitudes are scaled by 1/s_flow to ensure correct geometric mapping during the final warping operation. This design choice serves two critical functions.
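The downsample → coarse-flow → upsample-and-rescale pipeline can be sketched with numpy as follows. This is a simplified illustration under our own assumptions: the helper names are ours, the coarse flow field is taken as given (standing in for the Lucas-Kanade solve), and the pooling/upsampling/warping all use block or nearest-neighbor sampling rather than bilinear interpolation.

```python
import numpy as np

def downsample(x, s):
    """Average-pool a (H, W) map by integer factor 1/s (e.g., s=0.2 -> 5x5 blocks)."""
    k = int(round(1.0 / s))
    H, W = x.shape
    return x[:H - H % k, :W - W % k].reshape(H // k, k, W // k, k).mean(axis=(1, 3))

def upsample_and_rescale_flow(flow_lr, s):
    """Nearest-neighbor upsample a coarse (h, w, 2) flow field and scale its
    vectors by 1/s so displacements are expressed in full-resolution pixels."""
    k = int(round(1.0 / s))
    flow_hr = np.repeat(np.repeat(flow_lr, k, axis=0), k, axis=1)
    return flow_hr / s

def warp(feat, flow):
    """Backward-warp a (H, W) feature map with a per-pixel (H, W, 2) flow
    given in (dy, dx) order, using nearest-neighbor sampling."""
    H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, W - 1)
    return feat[src_y, src_x]
```

The 1/s_flow rescaling is the key step: a one-pixel displacement measured on the coarse grid corresponds to 1/s_flow pixels at full resolution, so the upsampled vectors must be magnified before the warp.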
First, it acts as a strong spatial low-pass filter, forcing the flow estimation to focus on macroscopic structural motion rather than the microscopic, high-frequency signal fluctuations inherent to deep transformer representations. Second, by computing the correlation matrix on feature maps with 1/25th the spatial area, the runtime complexity of the Lucas-Kanade solver is drastically reduced, ensuring the alignment mechanism adds less than 3% computational overhead to the caching pipeline.

E.2 Adaptive Threshold Scheduling (ATS)

In Section 3.7 of the main text, we introduced the Adaptive Threshold Scheduling (ATS) mechanism, which relaxes the caching threshold τ_ATS(t) as the generative process transitions from global structure formation to high-frequency detail refinement. In our practical implementation, this relaxation is controlled by a temporal decay factor applied dynamically at each step.

Quadratic Threshold Decay. While Equation 13 describes the conceptual linear relaxation governed by β_d, our exact implementation employs a quadratic scaling function to provide a smoother transition across the denoising trajectory. Let N be the total number of diffusion steps (e.g., worldcache_num_steps = 35 in our standard configuration), and let t ∈ [0, N − 1] be the current forward sampling step index. We define a normalized step-budget variable u = N/35.0 to ensure the decay curve remains scale-invariant regardless of the user's chosen total number of steps. The base multiplier coefficient C(u) is derived via a quadratic fit:

C(u) = u^2/6 + u/2 + 10/3.    (30)

At step t, the step ratio r_t = t/N is computed, and the final dynamic decay factor D(t) applied to the base threshold is calculated as:

D(t) = 1.0 + C(u) · r_t.    (31)

Thus, the final threshold at step t is strictly given by τ_ATS(t) = τ_base · D(t).

Boundary Behavior.
This specific quadratic fit was empirically designed to hit key operational targets: it tightly bounds the threshold scaling near 1.0 at t = 0 (enforcing rigorous structural computation) and smoothly accelerates the relaxation multiplier to approximately 5.0 as t → N (when N = 35). This ensures that during the final ∼20% of the generation process, where latent updates are minimal, the network aggressively reuses cached features, yielding maximum acceleration without degrading spatial fidelity.

Table 7: Hyperparameter sensitivity for skip decisions (stage-wise). Each block varies a single hyperparameter while keeping the rest fixed in the corresponding stage configuration: CFC is swept with only CFC enabled; SWD is swept with CFC fixed to α = 0.2; ATS is swept with CFC+SWD+OFA fixed to defaults. Results: Cosmos-Predict2.5-2B (I2W). Bold indicates the chosen default.

| Hyperparameter | Value | Domain ↑ | Quality ↑ | Overall ↑ |
|---|---|---|---|---|
| CFC: α | 0.1 | 0.8389 | 0.7362 | 0.7876 |
| | 0.2 | 0.8457 | 0.7583 | 0.8020 |
| | 0.4 | 0.8417 | 0.7374 | 0.7896 |
| | 0.5 | 0.8362 | 0.7356 | 0.7859 |
| SWD: β_s | 0.05 | 0.8483 | 0.7335 | 0.7909 |
| | 0.12 | 0.8414 | 0.7592 | 0.8003 |
| | 0.5 | 0.8420 | 0.7402 | 0.7911 |
| | 1 | 0.8390 | 0.7344 | 0.7867 |
| ATS: β_d | 2 | 0.8330 | 0.7486 | 0.7908 |
| | 4 | 0.8395 | 0.7559 | 0.7977 |
| | 6 | 0.8314 | 0.7483 | 0.7899 |
| | 8 | 0.8289 | 0.7392 | 0.7841 |

Table 8: Hyperparameter sensitivity for cache-hit approximation (stage-wise). OFA operator and warp-scale sweeps are run with CFC and SWD fixed to defaults. Results: Cosmos-Predict2.5-2B (I2W). Bold indicates the chosen default.

| Hyperparameter | Value | Domain ↑ | Quality ↑ | Overall ↑ |
|---|---|---|---|---|
| OFA operator | OSI only | 0.8478 | 0.7360 | 0.7919 |
| | Warp only | 0.8402 | 0.7297 | 0.7850 |
| | OSI + Warp | 0.8468 | 0.7602 | 0.8035 |
| Warp scale | 0.2 | 0.8385 | 0.7309 | 0.7847 |
| | 0.5 | 0.8495 | 0.7420 | 0.7958 |
| | 1.0 | 0.8408 | 0.7357 | 0.7883 |

F Additional Ablations

F.1 Hyperparameter Selection

To avoid ambiguity, we report hyperparameter studies as two separate ablations: (i) skip-decision sensitivity (Table 7) and (ii) cache-hit approximation sensitivity (Table 8).
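The schedule of Eqs. (30)–(31) and the boundary values discussed above (D(0) = 1 and D(N) ≈ 5 for N = 35) can be written and sanity-checked in a few lines; the default `tau_base` value here is an illustrative placeholder of ours:

```python
def ats_threshold(t: int, n_steps: int, tau_base: float = 0.1) -> float:
    """Phase-aware caching threshold tau_ATS(t) = tau_base * D(t), Eqs. (30)-(31)."""
    u = n_steps / 35.0                      # scale-invariant step-budget normalization
    c = u * u / 6.0 + u / 2.0 + 10.0 / 3.0  # quadratic coefficient C(u); C(1) = 4
    r = t / n_steps                         # step ratio r_t = t / N
    return tau_base * (1.0 + c * r)         # D(t) = 1 + C(u) * r_t
```

At N = 35 the threshold thus relaxes linearly in t from τ_base up to 5·τ_base, matching the boundary behavior above and permitting aggressive reuse only in late refinement.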
In all cases, rows within a block are not applied sequentially; each row corresponds to a separate run in which only the listed hyperparameter is changed and all other settings are held fixed.

Skip-decision hyperparameters. Table 7 varies the parameters that control when caching is allowed. CFC benefits from increasing motion sensitivity up to α = 0.2, after which performance drops, so we select α = 0.2. SWD performs best at β_s = 0.12, indicating that moderate saliency weighting improves skip decisions while overly strong weighting reduces robustness. For ATS, β_d = 4 yields the best trade-off, as larger values over-relax the reuse criterion and degrade both Domain and Quality, consistent with excessive skipping in late denoising.

Approximation hyperparameters. Table 8 studies the approximation used on cache hits. OSI+Warp yields the strongest overall fidelity compared to OSI-only or warp-only, so we adopt it as the default OFA operator. For warping, a moderate scale (0.5) performs best, balancing alignment benefits with the noise sensitivity of high-resolution flow/warp signals. Unless stated otherwise, these selected defaults are used for all main results.

F.2 Effect of the number of denoising steps.

We vary the denoising step budget from 35 to 140 and report end-to-end wall-clock latency (Fig. 6).

[Fig. 6 plot: end-to-end latency (0–200 s) vs. denoising steps (35/70/100/140); WorldCache speedups 2.3×, 2.9×, 3.1×, 3.0× vs. DiCache speedups 1.5×, 1.6×, 1.7×, 2.3×.]

Fig. 6: Effect of the denoising step budget. End-to-end latency and speedup of WorldCache vs. baselines when varying the number of denoising steps.
As expected, the unaccelerated baseline scales roughly linearly with the number of steps (57.0 s at 35 steps, increasing to 199.1 s at 140 steps), since each step requires a full DiT forward pass. WorldCache substantially reduces the effective per-step cost via cache reuse, yielding much lower latencies (25.0 s, 34.2 s, 45.5 s, and 66.0 s). The relative benefit is stable and becomes even more pronounced for longer trajectories, improving from 2.3× at 35 steps to a maximum of 3.1× in the 70–140-step range. This indicates that caching opportunities grow with trajectory length, particularly in late refinement, where feature updates are small and the additional decision/approximation overhead remains minor.

Fig. 7: Qualitative Image2World comparison on PAI-Bench (Cosmos-2B). The conditioning image (top-left) and text description (top-right) specify a dashcam scene in which the ego vehicle slows at a T-intersection as a woman and a child cross in front. We show evenly spaced frames from rollouts generated by (a) Baseline, (b) DiCache, and (c) WorldCache. DiCache exhibits temporal inconsistency on salient, moving entities: the pedestrians' appearance and positions become unstable and partially ghosted near the end of the rollout (red dashed box). WorldCache maintains more coherent pedestrian identity and motion while preserving the scene layout (green dashed box), producing rollouts closer to the baseline at substantially lower inference latency. Best viewed zoomed in.

G Additional Qualitative Results

We provide additional Image2World examples on PAI-Bench that highlight the failure modes of naive cache reuse under dynamics and fine-grained interactions. In the driving scene (Fig. 7), DiCache exhibits motion-related inconsistencies on salient moving pedestrians, including unstable appearance and slight ghosting near the end of the rollout, whereas WorldCache maintains more coherent identity and trajectories while preserving the global scene layout.
In the manipulation scene (Fig. 8), the errors are even more localized and challenging: DiCache produces visible distortions around the hands and the carried plate, regions with fast, articulated motion and strong perceptual saliency, while WorldCache preserves object boundaries and hand-pose consistency across frames. Together, these examples illustrate that WorldCache improves temporal coherence, particularly in the regimes most important for world models (foreground entities, contact-rich motion, and articulated interactions), aligning with our motion-aware decisions and improved cache-hit approximation.

Fig. 8: Qualitative Image2World comparison on PAI-Bench (Cosmos-14B). Given the conditioning image (top-left) and description (top-right), we generate a short kitchen rollout in which two people prepare food: one spreads avocado on a nori sheet while the other holds a plate of vegetables. We visualize evenly spaced frames from (a) the unaccelerated baseline, (b) DiCache, and (c) WorldCache. While the baseline struggles with hand pose and object geometry, DiCache also exhibits noticeable temporal instability in the most salient, high-frequency regions, particularly around the hands and the plate: edges wobble, fine structures deform, and the appearance of the plate/hand region changes inconsistently over time (red dashed boxes). WorldCache largely avoids these artifacts, preserving sharper boundaries and more coherent local motion in the interaction region (green dashed boxes), yielding outputs closer to the baseline at substantially reduced latency. Best viewed zoomed in.

Moreover, Figs. 9 and 10 provide additional robotics-centric I2V rollouts showing that WorldCache maintains stable scene structure and coherent robot and object motion over long horizons, including contact-rich interactions and cluttered environments.
These examples complement the PAI-Bench metrics by visually confirming that acceleration does not introduce drift in object geometry or kinematic consistency as the rollout advances.

Fig. 9: Additional qualitative results. We show representative frames at increasing rollout times (T_0 → T_N) from PAI-Bench (top to bottom): a robot hand interacting with a refrigerator door, an egocentric manipulation scene near a tabletop, and an arm–human interaction setup. Across all examples, the generated videos maintain stable scene layout and consistent object/robot geometry as motion evolves over time, demonstrating that WorldCache preserves temporal continuity in contact-rich and interaction-heavy rollouts. Best viewed zoomed in.

Fig. 10: Additional qualitative results. We show evenly spaced frames (T_0 → T_N) for three representative scenarios: a robotic arm performing kitchen activities, opening a refrigerator door, and interaction within a cluttered toolbox. The sequences highlight sustained temporal coherence: robot pose, object geometry, and scene layout remain stable as the action unfolds. Best viewed zoomed in.

H Limitations and Future Work

WorldCache is a training-free inference method. While our motion- and saliency-aware constraints make caching conservative in difficult regimes, extremely abrupt scene changes (e.g., rapid viewpoint jumps, heavy occlusions) can occasionally reduce cache hit rates. A natural extension is to learn or adapt caching policies online, e.g., using lightweight predictors to estimate drift/saliency more accurately or to select per-layer/per-token reuse budgets. We also plan to integrate stronger motion estimation and uncertainty-aware warping to improve cache-hit approximation under high-speed dynamics and occlusions.