← Back to papers

Paper deep dive

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 65

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/26/2026, 2:48:46 AM

Summary

ThinkJEPA is a latent world modeling framework that integrates a VLM-thinker branch with a dense JEPA-style dynamics branch. It addresses the limitations of dense-only prediction (short temporal context) and VLM-only prediction (compute-driven sparse sampling and language-output bottlenecks) by using a dual-temporal pathway and a hierarchical pyramid representation extraction module to inject long-horizon semantic guidance into fine-grained latent forecasting.

Entities (4)

ThinkJEPA · framework · 100%
V-JEPA2 · model · 98%
Hierarchical pyramid representation extraction module · component · 95%
Qwen3-VL (Thinking) · model · 95%

Relation Signals (3)

ThinkJEPA uses Qwen3-VL (Thinking)

confidence 98% · we use Qwen3-VL (Thinking) in our implementation

Hierarchical pyramid representation extraction module injects guidance into JEPA predictor

confidence 95% · These VLM signals are injected into the JEPA predictor to improve semantic grounding

ThinkJEPA integrates VLM-thinker branch

confidence 95% · ThinkJEPA couples a dense JEPA branch for fine-grained latent dynamics modeling with a uniformly sampled VLM-thinker branch

Cypher Suggestions (2)

Find all components of the ThinkJEPA framework · confidence 90% · unvalidated

MATCH (f:Framework {name: 'ThinkJEPA'})-[:HAS_COMPONENT]->(c) RETURN c.name, c.entity_type

Identify models used as guidance providers in the framework · confidence 90% · unvalidated

MATCH (m:Model)-[:PROVIDES_GUIDANCE_TO]->(f:Framework {name: 'ThinkJEPA'}) RETURN m.name

Abstract

Abstract: Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

64,851 characters extracted from source content.


Affiliations: Northeastern University (zhang.haich, lu.jiang, yunfu@northeastern.edu) · University of California San Diego (yijiangli@ucsd.edu) · University of Maryland (shwaihe, angliece@umd.edu) · The University of Texas at Austin (tushar.nagarajan@utexas.edu) · University of Washington (lasiafly@uw.edu)

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

Abstract. Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM thinker branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction.
Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

1 Introduction

World models aim to learn predictive abstractions of the environment that support forecasting, planning, and control. Among them, latent world models are particularly appealing: by predicting in representation space, they avoid generating photorealistic pixels or detailed 3D geometry, which can be computationally expensive and often unnecessary for downstream decision making. This paradigm, exemplified by JEPA-style methods (e.g., V-JEPA2 [4]), promises improved efficiency and encourages the model to emphasize higher-level structure (e.g., dynamics and physical constraints) rather than overfitting to appearance.

Despite strong progress in V-JEPA2 [4] and its variants, existing JEPA-style latent world models still face two key limitations. (1) Limited temporal perspective for prediction. Most approaches rely on a short observation window consisting of densely sampled frames to predict future latents. While dense sampling captures fine-grained motion, it restricts temporal context and can bias the predictor toward local dynamics, missing longer-horizon semantics and event-level cues that are critical for robust forecasting. (2) Weak semantic grounding and general knowledge alignment. The latent space is typically learned via self-supervised visual representation learning (often related to masked reconstruction/prediction objectives), which yields motion-sensitive features but provides limited alignment to open-vocabulary concepts and compositional knowledge. As a result, the predictor may model how things move without understanding what the entities are and which attributes or relations matter, limiting generalization beyond a narrow domain (e.g., a single manipulation dataset).
A natural alternative is to leverage modern vision-language models (VLMs), which excel at high-level video understanding [30, 7] and reasoning due to large-scale pretraining and multimodal alignment. When applied to uniformly sampled frames with a larger temporal stride, VLMs can capture long-range context, recognize entities and their attributes, and draw upon general world knowledge [33] that is often missing from purely visual latent predictors. This complementary capability motivates a promising direction: using a VLM as a thinker to guide latent world modeling. However, directly using VLMs as standalone dense predictors is often impractical, and their representations can be suboptimal for fine-grained dynamics.

Compute-driven sparsity. Video VLMs operate under quadratic attention cost and GPU memory constraints, and thus typically process only a small number of uniformly sampled frames. This design provides long-horizon context but makes it difficult to model high-FPS, fine-grained dynamics crucial for physical interaction and manipulation.

Language-output bottleneck [26]. Most VLM pipelines ultimately produce language outputs (e.g., captions, rationales, or action descriptions). To generate text, visual information is progressively transformed through stacked transformer layers toward language-generation objectives and discrete token prediction. This induces an output bottleneck: fine-grained spatial details and continuous interaction states (e.g., contact, precise trajectories, fast motions) are compressed into a language-compatible representation, which is effective for semantic recognition but often inadequate for accurate physical forecasting. Consequently, language-based planning with VLM outputs can be coherent in text yet physically inconsistent.

Data regime mismatch.
Moreover, deploying VLMs for domain-specific prediction or control often requires adaptation to relatively small, domain-specific datasets [31], where naïve fine-tuning can hurt general knowledge and semantic capabilities (e.g., catastrophic forgetting [32]).

These observations suggest that VLMs are best used as semantic and knowledge-guidance providers rather than standalone dense predictors. We therefore propose to integrate a VLM-thinker branch into a JEPA-style latent world model, combining dense-frame dynamics modeling with long-horizon semantic guidance in a unified framework. Specifically, we retain the dense-frame observation pathway of V-JEPA-style models to preserve fine-grained motion and interaction cues, while introducing a second branch that feeds uniformly sampled frames to a VLM to obtain long-horizon, knowledge-rich guidance. These VLM signals are injected into the JEPA predictor to improve semantic grounding and enhance the generalization of future latent prediction.

A further challenge is how to extract useful guidance from a VLM. Using only the final-layer VLM features is often suboptimal: deeper layers are increasingly shaped toward language-generation objectives, while intermediate layers can contain richer visual reasoning signals with better spatial sensitivity. Motivated by this observation, we introduce a hierarchical pyramid representation extraction module that aggregates multi-depth VLM representations and distills them into guidance features compatible with the JEPA predictor, enabling the predictor to benefit from the VLM's progressive reasoning process rather than a single terminal representation.

Our contributions are summarized as follows:

• We propose a VLM-guided JEPA-style latent world model that integrates a VLM as a thinker to provide semantic grounding and general knowledge guidance for future latent prediction.
• We design a dual-temporal pathway: (i) a dense-frame JEPA pathway for fine-grained dynamics modeling, and (ii) a uniformly sampled VLM pathway with a larger temporal stride to capture long-horizon context and high-level concepts.

• We introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM features to better preserve visual reasoning cues and inject them effectively into the JEPA predictor.

• Extensive experiments demonstrate improved representation quality and stronger downstream performance compared to both a V-JEPA predictor baseline and a state-of-the-art open-source VLM baseline (Qwen3-VL (Thinking)), with particularly large gains on hand-manipulation trajectory prediction.

2 Related Works

2.1 Latent World Models and Predictive Representation Learning

Latent world models [9, 10, 11] aim to learn predictive abstractions of the environment that support forecasting, planning, and control. By modeling dynamics in a learned representation space, these approaches enable efficient prediction of future states without explicitly generating high-dimensional observations. Recent advances in predictive representation learning further strengthen this paradigm. In particular, JEPA-style approaches [16, 3] learn representations through predictive objectives that encourage models to capture higher-level structure such as motion patterns and physical interactions. Recent systems such as V-JEPA2 demonstrate the scalability of this approach and show promising results for video understanding and world modeling tasks. Despite these advances, most latent world models are learned solely from visual signals and lack alignment with open-vocabulary semantics or external knowledge, which can limit their ability to incorporate higher-level cues for complex forecasting scenarios.
2.2 Vision-Language Models for Multimodal Understanding

Vision-language models (VLMs) have achieved remarkable progress in multimodal representation learning by aligning visual and textual modalities using large-scale image-text data [27, 19, 18, 35, 34]. Early approaches focus on joint representation learning and multimodal understanding tasks such as image captioning and visual question answering. More recent multimodal large language models (MLLMs) extend pretrained language models to process visual tokens, enabling instruction following and multimodal reasoning capabilities [2, 14, 20]. Representative systems such as the LLaVA series [22, 17] integrate vision encoders with large language models through projection layers or cross-attention mechanisms. While these models demonstrate strong semantic reasoning and multimodal understanding capabilities, they are primarily designed for perception and reasoning tasks, and are not optimized for modeling structured physical dynamics.

2.3 Multimodal Fusion and Language-Guided Prediction

Language has increasingly been used as a high-level control signal for visual generation and decision-making systems. Text-conditioned generative models enable natural language prompts to guide image synthesis and editing, as demonstrated by diffusion-based approaches such as DALL·E, Imagen, and Diffusion Transformers (DiT) [28, 29, 24]. Language guidance has also been explored in embodied decision-making frameworks, where large language models provide high-level instructions or goals for perception and action [1]. These works highlight the potential of language as a flexible interface for controlling visual and embodied systems. However, leveraging language signals to guide structured physical forecasting remains relatively underexplored.

JEPA-style predictors with VLMs. Recent work has explored combining language models with JEPA-style representations, but largely in directions that differ from latent world modeling.
For example, VL-JEPA [6] incorporates language signals into a joint-embedding predictive framework, and other approaches use V-JEPA representations as inputs to large language models for video understanding [4]. While effective for multimodal understanding, these designs often shift the primary output interface toward language generation or do not explicitly maintain a latent forecasting interface for downstream world-model tasks. In contrast, ThinkJEPA retains JEPA-style latent forecasting and leverages VLM semantics as guidance by injecting VLM-derived features into the JEPA predictor, preserving dense latent prediction while adding long-horizon semantic cues.

Figure 1: Overall Architecture of ThinkJEPA. ThinkJEPA couples a dense JEPA branch for fine-grained latent dynamics modeling with a uniformly sampled VLM-thinker branch that provides long-horizon semantic guidance. The VLM guidance, including visual tokens from the ViT visual tokenizer and intermediate hidden states from the language model, is distilled by a pyramidal representation extraction module and injected into the V-JEPA predictor via layer-wise modulation. Concretely, guidance derived from language-model layers L_0, …, L_N is mapped to modulation parameters for predictor layers T_0, …, T_K. The predicted future latents are concatenated with past teacher latents to form the full latent sequence, which is then fed into a task head to produce downstream trajectory predictions.

3 Methodology

3.1 Problem Definition

3.1.1 Basic Settings

Given a video clip v with N frames, our goal is to forecast future latent representations that support downstream tasks; in this work, we focus on 3D hand trajectory prediction. We adopt a JEPA-style latent world modeling paradigm: a visual backbone encodes video frames into latent tokens, and a transformer predictor forecasts future latent tokens from past observations.
To improve semantic grounding and long-horizon reasoning, we further condition the predictor on cached features from a video VLM thinking model (we use Qwen3-VL (Thinking) in our implementation), which serves as a thinker providing knowledge-rich guidance.

3.1.2 Long-Horizon Latent Forecasting via Recursion

For long videos where the forecasting horizon exceeds the clip length supported by a single forward pass, we adopt the standard recursive rollout strategy commonly used in JEPA-style predictors. Concretely, the predictor takes the latent tokens forecast in the previous step as input for the next step, enabling iterative rollout of future latents beyond the original window. Although recursion allows arbitrarily long-horizon forecasting, it is susceptible to error accumulation over time. Accordingly, we evaluate both one-shot forecasting and recursive rollouts in our experiments, and analyze robustness under long-horizon prediction.

3.2 Dual-Temporal Perception Field Sampling Architecture

A central challenge in combining VLM reasoning with latent world modeling is the mismatch between (i) the dense temporal signal required for accurate dynamics forecasting and (ii) the long-horizon temporal context required for semantic understanding and event-level reasoning. Dense sampling preserves high-frequency motion and interaction cues but typically covers only a short time span, whereas sparse uniform sampling covers a long time span but discards dense motion details. To reconcile this trade-off under practical compute and memory budgets, ThinkJEPA adopts a dual-temporal perception-field design that explicitly assigns these two roles to two complementary branches.
Given an input video clip v = {I_t}_{t=1}^{N} with N frames, we construct two temporally sampled inputs: (i) a uniformly sampled clip v_u for the VLM-thinker branch, providing a large temporal perception field for global context and semantics; and (ii) a densely sampled clip v_d for the JEPA branch, providing high-frequency temporal cues for fine-grained latent forecasting. The two branches are synchronized at the sample level (derived from the same v) and later fused through layer-wise guidance injection (Sec. 3.4).

3.2.1 Large temporal perception field sampling for the VLM thinker branch.

Video VLMs are powerful for semantic grounding because they can identify entities, attributes, and event-level relationships by leveraging large-scale multimodal pretraining. However, applying transformer-based VLMs to long videos is constrained by quadratic attention cost and GPU memory usage, which typically limits the number of frames that can be processed in a single forward pass. As a result, VLMs commonly adopt uniform temporal sampling: a small set of frames is selected to span a long time horizon. Although this choice inevitably discards dense motion details, it maximizes temporal coverage and enables the VLM to reason over long-range context. In ThinkJEPA, we follow this practice and use the VLM branch specifically for long-horizon semantics and knowledge guidance (rather than dense dynamics prediction). We use Qwen3-VL (Thinking) as the VLM thinker and cache its intermediate representations for efficient conditioning of the latent predictor. Formally, we define the uniformly sampled clip

v_u = {I_{s_i}}_{i=1}^{N_u},  s_i = ⌊1 + (i−1) · (N−1)/(N_u−1)⌋,  (1)

where N_u is the number of sampled frames for the VLM thinker branch. This sampling spans the entire clip, providing a large temporal perception field under limited compute.

3.2.2 Dense frame sampling for the JEPA branch.
In contrast, JEPA-style latent world modeling requires dense temporal observations to accurately forecast future latents. Fine-grained dynamics, contact changes, and subtle interactions are often expressed as high-frequency temporal signals that are poorly captured by sparse sampling. Therefore, ThinkJEPA uses a dense sampling strategy for the JEPA branch and restricts it to a shorter observation window, where all frames are retained. Formally, we define an observation window starting at frame index t_0 and construct the dense clip

v_d = {I_t}_{t=t_0}^{t_0+N_d−1},  (2)

where N_d is the number of densely sampled frames. The V-JEPA backbone encodes v_d into per-frame patch tokens, producing past latent tokens F^past. A JEPA-style predictor then forecasts future latent tokens F̂^fut from F^past. These predicted latents serve as the target representation for downstream heads (e.g., trajectory regression), while the VLM branch provides complementary long-horizon semantic guidance to improve grounding and generalization.

3.2.3 Why dual-temporal sampling matters.

The uniform VLM sampling and dense JEPA sampling are not redundant: they target different failure modes. Uniform sampling enables the VLM thinker to access long-range context and semantics that are difficult to infer from a short dense window, whereas dense sampling enables accurate modeling of high-frequency dynamics that sparse VLM inputs cannot represent reliably. By coupling these two perception fields and injecting VLM guidance into the JEPA predictor, ThinkJEPA benefits from both long-horizon semantic context and fine-grained dynamic cues in future latent forecasting.
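As a concrete illustration, the two index sets in Eqs. (1)–(2) can be computed directly. The sketch below is a minimal plain-Python version (function names are ours, not from the paper):

```python
import math

def uniform_indices(n_frames: int, n_u: int) -> list[int]:
    """Uniform sampling for the VLM-thinker branch, Eq. (1):
    s_i = floor(1 + (i - 1) * (N - 1) / (N_u - 1)), spanning the whole clip."""
    if n_u == 1:
        return [1]
    return [math.floor(1 + (i - 1) * (n_frames - 1) / (n_u - 1))
            for i in range(1, n_u + 1)]

def dense_indices(t0: int, n_d: int) -> list[int]:
    """Dense observation window for the JEPA branch, Eq. (2):
    all frames {t0, ..., t0 + N_d - 1} are retained."""
    return list(range(t0, t0 + n_d))

# A 64-frame clip: 8 uniform frames cover frames 1..64 for long-horizon
# context, while a 16-frame dense window captures high-frequency motion.
print(uniform_indices(64, 8))  # [1, 10, 19, 28, 37, 46, 55, 64]
print(dense_indices(1, 16))    # frames 1..16
```

Note how the uniform indices always include the first and last frame, so the thinker's temporal perception field spans the full clip regardless of N_u.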
3.3 JEPA-style latent tokenization and forecasting

The visual backbone encodes a densely sampled clip into per-frame spatial tokens F ∈ ℝ^{B×T×P×D}, where B is the batch size, T is the number of frames in the observation window, P is the number of spatial tokens per frame, and D is the backbone latent dimension. We split the clip into past and future segments and use a masked-token transformer predictor to forecast future latent tokens from past tokens. The predictor operates in an internal dimension D_p and projects its outputs back to the backbone latent space of dimension D.

3.3.1 Rollout of the JEPA branch

Densely sampled inputs provide strong motion and interaction cues, but they also limit the temporal duration that can be processed in a single forward pass due to compute and memory constraints. For videos whose length exceeds the JEPA observation window, we therefore perform recursive rollout by repeatedly forecasting the next segment and feeding the predicted latents into the subsequent step. Let W denote the number of frames per JEPA window (e.g., W = T_p + T_f), and let k index rollout steps. At step k, the predictor takes past latent tokens F^past_k and outputs future latent tokens F̂^fut_k:

F̂^fut_k = g(F^past_k),  (3)

where g(·) is the JEPA-style predictor. For the next step, we set the past tokens to be the previously predicted future tokens (or a shifted window that includes them):

F^past_{k+1} ← F̂^fut_k.  (4)

By iterating Eqs. (3)–(4), we can roll out arbitrarily long-horizon latent forecasts. While rollout enables long-horizon prediction, it is susceptible to error accumulation and remains limited by the local temporal context within each window. This motivates incorporating VLM-thinker guidance, which provides complementary long-horizon semantic context to stabilize forecasting and improve generalization (Sec. 3.2).
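The recursion in Eqs. (3)–(4) reduces to a simple loop once the predictor is treated as a black-box function. A minimal sketch, with a toy stand-in for the predictor g (the real predictor is a masked-token transformer):

```python
def rollout(predictor, f_past, num_steps):
    """Recursive JEPA rollout (Eqs. 3-4): at each step the predictor maps
    past latents to future latents, and the predicted future latents become
    the next step's past latents. Errors compound with rollout depth."""
    futures = []
    for _ in range(num_steps):
        f_fut = predictor(f_past)  # Eq. (3): F̂^fut_k = g(F^past_k)
        futures.append(f_fut)
        f_past = f_fut             # Eq. (4): F^past_{k+1} <- F̂^fut_k
    return futures

# Toy check with linear "dynamics" g(x) = x + 1 applied elementwise:
steps = rollout(lambda x: [v + 1 for v in x], [0.0, 0.0], num_steps=3)
print(steps)  # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```

Because each step consumes its own prediction, any per-step bias is fed back into the input, which is the error-accumulation behavior the paper measures with horizon-specific metrics.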
3.4 VLM Thinker: Hierarchical Pyramid Representation Extraction

3.4.1 Complementarity via injecting VLM guidance into JEPA

Prior work has explored combining language and JEPA-style representations in different directions. For example, VL-JEPA [6] and approaches that feed V-JEPA features into LLMs for video understanding [4] primarily treat JEPA features as inputs to a language model. While effective for video-to-text understanding, this design shifts the output space toward language generation and does not directly preserve a latent world model interface for downstream prediction. In contrast, our goal is to retain JEPA-style latent forecasting while leveraging VLM semantics as guidance. This is non-trivial because the VLM must provide useful long-horizon semantic context without replacing the dense dynamics modeling of the JEPA predictor. As discussed in Sec. 3.2, uniform sampling enables the VLM thinker to access long-range context and event-level semantics under limited compute, whereas dense sampling provides the JEPA branch with high-frequency temporal signals for fine-grained dynamics. We combine these two pathways by injecting VLM guidance into the JEPA predictor in a layer-wise manner. Concretely, given a uniformly sampled clip v_u and a densely sampled clip v_d, the predictor forecasts future latent tokens conditioned on both VLM guidance and an optional text prompt:

F̂^fut = g(F^past(v_d); φ(v_u), p),  (5)

where F^past(v_d) are past latent tokens extracted by the V-JEPA backbone from the dense clip, φ(v_u) denotes VLM-derived guidance features from the uniform clip, p denotes the text prompt provided to the VLM thinker, and g(·;·) is the V-JEPA predictor.
In practice, the VLM thinker prompt p is generated from a general summarization request, with its content populated from the clip metadata (e.g., task name and scene description), which helps the thinker focus on relevant entities and events.

3.4.2 Hierarchical pyramid representation extraction

A key question is which VLM representations are most suitable for guiding latent forecasting. Using only the final-layer VLM features can be suboptimal, since deeper layers are increasingly shaped toward language-generation objectives, while intermediate layers often retain richer visual reasoning cues and better spatial sensitivity. This observation is supported by prior analyses showing that aggregating intermediate LLM representations can outperform using a single terminal layer for downstream tasks (e.g., [33]). Moreover, visual tokenizer outputs may lose fine-grained cues after passing through multimodal fusion and language decoding stages. Motivated by these findings, we propose a hierarchical pyramid representation extraction module that aggregates multi-depth VLM signals. Specifically, we combine (i) visual tokens from the VLM visual encoder (ViT tokenizer) and (ii) intermediate hidden states from selected language-model layers, forming a depth-wise pyramid over the VLM. These multi-depth features are pooled and projected into the predictor space, yielding guidance features φ(v_u) that preserve both low-level visual cues and high-level semantic reasoning traces.

3.4.3 Layer-wise guidance injection

We inject the extracted thinker guidance into the JEPA predictor via feature-wise linear modulation (FiLM) [25]. For predictor block ℓ, the guidance produces modulation parameters (γ_ℓ, β_ℓ), and we modulate the block input as

FiLM(z; γ_ℓ, β_ℓ) = γ_ℓ ⊙ z + β_ℓ.  (6)

This yields layer-wise, sample-specific conditioning that injects semantic and knowledge cues into latent forecasting without requiring the VLM to act as a dense predictor.

3.4.4 Joint prediction for downstream regression

For the basic setting, we follow standard V-JEPA downstream protocols [4] by feeding the predicted latent tokens into a task head for trajectory regression. For long-horizon forecasting with recursive rollout (Sec. 3.3.1), we concatenate the past latents and the predicted future latents into a full-length token sequence, which is then fed to the temporal regression head to produce the target trajectories.

3.5 Implementation Details

Backbone. We use a V-JEPA-L backbone (vit_large_rope) to extract per-frame patch tokens with latent dimension D = 1024.

VLM-injected V-JEPA predictor. We implement a V-JEPA predictor operating in an internal dimension D_p = 384 and inject VLM-thinker guidance into the predictor via layer-wise FiLM modulation. We condition each predictor block using (γ_ℓ, β_ℓ) derived from cached Qwen3-VL (Thinking) representations. The cache provides encoder tokens and autoregressive (AR) tokens, which are projected to D_p, pooled, and mapped to per-layer FiLM parameters using lightweight MLP adapters. For hierarchical pyramid extraction, we cache intermediate hidden states from VLM layers ℒ = {0, 4, 8, 12, 16, 20, 24, 27}.

Trajectory head. We use a lightweight temporal trajectory regression head for downstream prediction. The head first aggregates spatial tokens within each frame via attention pooling with a learnable query, producing a per-frame representation. It then applies temporal MLP blocks to model cross-frame dependencies, followed by stride-2 temporal downsampling to align the temporal resolution with the prediction horizon. Finally, a linear projection regresses 3D trajectories with output shape 32 × 52 × 3.
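To make the layer-wise injection concrete, the sketch below implements Eq. (6) together with a deliberately simplified stand-in for the guidance-to-parameter adapters. The paper uses lightweight MLP adapters over pooled pyramid features; the mean-pool and near-identity linear mapping here are illustrative assumptions, not the authors' implementation:

```python
def film(z, gamma, beta):
    """Feature-wise linear modulation, Eq. (6):
    FiLM(z; gamma_l, beta_l) = gamma_l ⊙ z + beta_l (elementwise)."""
    return [g * x + b for g, x, b in zip(gamma, z, beta)]

def guidance_to_film_params(pyramid_feats, num_predictor_layers):
    """Hypothetical adapter sketch: mean-pool cached multi-depth VLM features
    into a single guidance vector, then map it to one (gamma, beta) pair per
    predictor block. Keeping gamma near 1 and beta near 0 makes the
    modulation close to identity at initialization."""
    dim = len(pyramid_feats[0])
    pooled = [sum(layer[d] for layer in pyramid_feats) / len(pyramid_feats)
              for d in range(dim)]
    return [([1.0 + 0.1 * p for p in pooled],  # gamma_l
             [0.1 * p for p in pooled])        # beta_l
            for _ in range(num_predictor_layers)]

# Two cached VLM layers with feature dim 2, conditioning 3 predictor blocks:
feats = [[0.2, -0.4], [0.6, 0.0]]
params = guidance_to_film_params(feats, num_predictor_layers=3)
gamma0, beta0 = params[0]
print(film([1.0, 1.0], gamma0, beta0))  # ≈ [1.08, 0.96]
```

The key design point survives the simplification: the same pooled guidance conditions every predictor block, but each block gets its own (γ_ℓ, β_ℓ), so the injection is layer-wise and sample-specific.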
4 Experiments

4.1 Datasets

We evaluate ThinkJEPA on two egocentric video benchmarks: EgoDex [13] and EgoExo4D [8]. EgoDex is a large-scale benchmark for egocentric dexterous manipulation, providing egocentric video paired with 3D hand (and finger) pose annotations, which naturally fits our latent forecasting and trajectory regression setting [13]. EgoExo4D is a large-scale multimodal, multiview dataset of skilled human activities with synchronized egocentric and exocentric videos, and extensive annotations including 3D body pose, 3D hand pose, and gaze, enabling evaluation of human motion from egocentric video [8].

4.2 Evaluation Metrics

We report standard trajectory errors and latent forecasting diagnostics.

Trajectory metrics. Let Ŷ ∈ ℝ^{B×T_f×J×3} and Y ∈ ℝ^{B×T_f×J×3} denote predicted and ground-truth 3D trajectories over T_f future frames and J joints. We compute: ADE [13] (Average Displacement Error), the mean Euclidean distance over all future frames and joints, averaged over the batch; FDE [13] (Final Displacement Error), the mean Euclidean distance on the final future frame, averaged over joints and batch; and Accuracy, the fraction of predicted joint positions with Euclidean error below 0.05 m, aggregated over time and joints.

Latent forecasting metrics. To complement trajectory-level evaluation (ADE↓, FDE↓, Acc↑), we report representation-level forecasting quality using three distance-based metrics computed between predicted and target latents: FD↓ (feature ℓ2 distance), SL1↓ (SmoothL1 distance), and CD↓ (cosine distance, defined as 1 − cos(·)). These metrics provide an interpretable view of latent prediction quality and directly reflect how well the model forecasts V-JEPA features in representation space.

Rollout metrics. For recursive rollout evaluation, we report horizon-specific trajectory errors using A@H↓ and F@H↓, which denote ADE and FDE at rollout horizon H ∈ {4, 8, 16, 32}, respectively.
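The trajectory metrics above can be sketched for a single sample as follows; this is a plain-Python illustration of ADE/FDE/Accuracy over a [T_f][J][3] nested list, and the function name and input layout are our assumptions rather than the paper's evaluation code:

```python
import math

def ade_fde_acc(pred, gt, thresh=0.05):
    """ADE: mean Euclidean error over all future frames and joints.
    FDE: mean Euclidean error on the final future frame only.
    Acc: fraction of joint predictions within `thresh` meters,
    aggregated over time and joints. Inputs are [T_f][J][3] lists."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    errs = [dist(p, g) for pf, gf in zip(pred, gt) for p, g in zip(pf, gf)]
    final_errs = [dist(p, g) for p, g in zip(pred[-1], gt[-1])]
    ade = sum(errs) / len(errs)
    fde = sum(final_errs) / len(final_errs)
    acc = sum(e < thresh for e in errs) / len(errs)
    return ade, fde, acc

# Two future frames, one joint: per-frame errors 0.03 m and 0.05 m give
# ADE 0.04, FDE 0.05, and Acc 0.5 (only the 0.03 m error is below 0.05 m).
pred = [[[0.03, 0.0, 0.0]], [[0.05, 0.0, 0.0]]]
gt = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
print(ade_fde_acc(pred, gt))
```

Batched evaluation (the B dimension) would simply average these quantities over samples.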
4.3 Baselines and Variants

We compare ThinkJEPA against both strong single-branch baselines and controlled ablations. Since our goal is to endow JEPA-style latent world models with VLM-level semantic reasoning, we include baselines that isolate (i) the contribution of the VLM thinker alone and (ii) the contribution of the JEPA latent predictor alone, as well as ablations that probe which VLM signals are necessary.

ThinkJEPA. Our full model uses dense-frame V-JEPA tokens for latent forecasting and injects VLM-thinker guidance derived from both encoder tokens and autoregressive (AR) tokens (Sec. 3.4).

Qwen3-VL Thinking (VLM-only). To isolate the contribution of the VLM thinker, we disable the dense JEPA input by zeroing the visual latent tokens while keeping the VLM branch unchanged. We then train the same downstream head on the resulting VLM-derived representations. This baseline tests whether long-horizon VLM reasoning alone can support accurate dense trajectory prediction. We use Qwen3-VL (Thinking) [5] as a strong VLM baseline and extract intermediate representations to form the guidance embedding; to avoid reliance on a single terminal layer, we use multi-layer representations consistent with the pyramid design.

V-JEPA Predictor (JEPA-only). To isolate the contribution of the JEPA latent world model, we train a V-JEPA predictor and the same downstream head following the standard JEPA-style protocol [4]. This baseline represents a strong dense latent forecasting model without any VLM conditioning.

Ablations: token sources. To assess which VLM token sources contribute to guidance, we evaluate variants that selectively enable: (i) encoder tokens + V-JEPA, (ii) encoder tokens only, (iii) AR tokens + V-JEPA, and (iv) AR tokens only.
We also include a variant that removes the thinker-guidance module while keeping the rest of the architecture unchanged (i.e., disabling VLM guidance and reducing to the V-JEPA predictor), which serves as a control without dual-temporal guidance.

Ablations: layer selection. To study the role of hierarchical pyramid extraction, we compare guidance derived from different VLM layer selections (e.g., last-layer vs. mid-layer guidance), using the same training/evaluation protocol.

4.4 Training and Experimental Settings

Unless otherwise specified, we use the same architecture and hyperparameters as reported in Tab. 11 of the supplementary material. We train with learning rate 10⁻³ and predictor learning rate 10⁻⁴, using batch size 14 for training and 6 for evaluation. We set the random seed to 42 and use 2 dataloader workers. Our main forecasting setting uses a past/future split of 32/32 frames.

4.5 Long-Horizon Rollout Evaluation

To evaluate long-horizon forecasting behavior beyond a single prediction window, we perform recursive rollout. We use a short-window predictor configuration with T_p = 4 and T_f = 4 for each step and recursively roll out to horizons H ∈ {4, 8, 16, 32} steps. We report ADE@H, FDE@H, and Accuracy@H computed after the autoregressive rollout, as well as latent-distance metrics (L2/SmoothL1/Cosine) aggregated over the rollout trajectory.

Dataset    Model               ADE↓    FDE↓    Acc↑    FD↓      SL1↓    CD↓
EgoDex     Qwen3-VL Thinking   0.142   0.144   0.084   99.538   1.656   0.615
EgoDex     V-JEPA Predictor    0.071   0.066   0.471   74.223   1.252   0.317
EgoDex     ThinkJEPA           0.061   0.056   0.596   74.032   1.248   0.315
EgoExo4D   Qwen3-VL Thinking   0.661   0.690   0.038   104.548  1.756   0.690
EgoExo4D   V-JEPA Predictor    0.659   0.636   0.074   89.244   1.520   0.469
EgoExo4D   ThinkJEPA           0.622   0.597   0.171   79.654   1.364   0.359

Table 1: Quantitative comparison across datasets. We report trajectory metrics (ADE/FDE/Acc) and latent forecasting metrics (FD/SL1/CD).
FD/SL1/CD denote V-JEPA feature distance, latent SmoothL1, and latent cosine distance, respectively. All values are reported with three decimal places.

4.6 Quantitative Comparison

Tab. 1 reports the main comparison on the EgoDex and EgoExo4D datasets. ThinkJEPA consistently outperforms both single-branch baselines in trajectory prediction, achieving substantially lower ADE/FDE and markedly higher Acc. Compared to the V-JEPA Predictor, injecting VLM-thinker guidance improves semantic grounding while preserving dense dynamics cues, leading to a large gain in downstream trajectory accuracy. Compared to Qwen3-VL Thinking, ThinkJEPA avoids relying on sparse, language-oriented representations as a standalone predictor and instead uses the VLM as guidance, yielding a more physically grounded forecast. In addition to trajectory metrics, ThinkJEPA also improves latent forecasting quality (lower FD/SL1/CD), indicating that guidance injection benefits representation prediction rather than only the downstream head. Overall, these results show that ThinkJEPA can surpass both a strong VLM baseline and a strong latent world model baseline by integrating long-horizon VLM reasoning with dense JEPA-style latent forecasting.

Abl.                        ADE↓    FDE↓    Acc↑    FD↓       SL1↓    CD↓
Encoder+V-JEPA predictor    0.128   0.129   0.100   78.869    1.340   0.360
Encoder-only                0.143   0.145   0.086   102.910   1.700   0.615
AR+V-JEPA predictor         0.128   0.130   0.098   78.514    1.333   0.356
AR-only                     0.142   0.144   0.086   102.910   1.700   0.615
No-dual-temporal sampling   0.128   0.130   0.099   78.862    1.340   0.360
ThinkJEPA                   0.061   0.056   0.596   74.747    1.263   0.324

Table 2: Ablation studies. We vary the VLM token sources and the thinker module. Encoder denotes encoder tokens, AR denotes autoregressive tokens, and V-JEPA denotes the V-JEPA predictor; the no-dual-temporal-sampling row removes the thinker module.
We abbreviate latent metrics as FD (feature distance), SL1 (SmoothL1), and CD (cosine distance). Single seed (42), best-epoch selection by minimum validation ADE.

4.7 Trajectory Prediction Baselines

Following EgoDex, we compare against six trajectory prediction baselines formed by combining two Transformer architectures (decoder-only and encoder-decoder) with three policy representations: Behavior Cloning (BC) [23], Denoising Diffusion Probabilistic Models (DDPM) [12], and Flow Matching (FM) [21]. These baselines, implemented following [15], are trained on the EgoDex trajectory prediction benchmark under a 2-second horizon and serve as strong task-specific references for egocentric hand trajectory forecasting. As shown in Tab. 3, ThinkJEPA outperforms all trajectory prediction baselines reported in EgoDex in terms of both ADE and FDE. Compared with the strongest task-specific baselines based on Behavior Cloning, ThinkJEPA reduces the average displacement error from 0.0767/0.0774 to 0.0610 and the final displacement error from 0.0818/0.0924 to 0.0560. The improvement over DDPM- and Flow-Matching-based baselines is even larger. These results suggest that VLM-guided latent forecasting provides a stronger trajectory representation than directly predicting trajectories with conventional decoder-only or encoder-decoder policy heads.

Group                  Model                                ADE↓     FDE↓
Trajectory Baselines   Decoder-only + Behavior Cloning      0.0767   0.0818
                       Decoder-only + DDPM                  0.1148   0.1238
                       Decoder-only + Flow Matching         0.1527   0.1574
                       Encoder-decoder + Behavior Cloning   0.0774   0.0924
                       Encoder-decoder + DDPM               0.1272   0.1245
                       Encoder-decoder + Flow Matching      0.1736   0.1557
Latent Forecasting     Qwen3-VL Thinking                    0.1420   0.1440
                       V-JEPA Predictor                     0.0710   0.0660
                       ThinkJEPA                            0.0610   0.0560

Table 3: Comparison with EgoDex trajectory prediction baselines on EgoDex.
We compare ThinkJEPA against the trajectory prediction baselines reported in EgoDex, including decoder-only and encoder-decoder architectures with Behavior Cloning, DDPM, and Flow Matching. ThinkJEPA achieves the best ADE/FDE among all compared methods.

4.8 Ablation on VLM Token Sources

Tab. 2 studies the contribution of different VLM token sources. Using only one token set (encoder tokens or AR tokens) provides limited benefit over the V-JEPA Predictor, and using tokens alone (without the dense JEPA branch) reduces to the Qwen3-VL Thinking baseline. In contrast, ThinkJEPA achieves the strongest performance when combining both token sources with the dense JEPA pathway, suggesting that the two token types provide complementary signals for guidance: encoder tokens carry visual content summaries, while AR tokens capture generation-side reasoning traces. Removing the thinker module collapses performance back to the V-JEPA Predictor level, confirming that the gains come from the injected guidance rather than incidental changes in training or evaluation.

Variant                  ADE↓    FDE↓    Acc↑    FD↓      SL1↓    CD↓
Last-layer               0.128   0.130   0.099   78.858   1.340   0.360
Mid-layer                0.128   0.131   0.098   78.517   1.333   0.356
All layers (ThinkJEPA)   0.061   0.056   0.596   74.747   1.263   0.324

Table 4: VLM layer selection on EgoDex. We compare guidance derived from different VLM layer selections. FD denotes V-JEPA feature distance, SL1 denotes latent SmoothL1, and CD denotes latent cosine distance. Single seed (42), best-epoch selection by minimum validation ADE.

4.9 Ablation on VLM Layer Selection

Tab. 4 compares guidance derived from different VLM layer selections. We observe a small trade-off: last-layer guidance slightly improves trajectory metrics (ADE/FDE/Accuracy), whereas mid-layer guidance yields better latent forecasting quality (lower feature distance / SmoothL1 / cosine distance).
This is consistent with the intuition that deeper layers are increasingly shaped toward language-generation objectives, while intermediate layers can retain richer visual reasoning cues. These results motivate hierarchical (multi-depth) guidance extraction and justify our pyramid design for robust guidance transfer.

4.10 Recursive Rollout: Trajectory Errors vs. Horizon

We evaluate long-horizon behavior via recursive rollout in Tab. 5. Qwen3-VL Thinking degrades sharply under rollout, exhibiting large errors at longer horizons, which supports our motivation that VLMs are ill-suited as standalone dense predictors for physically grounded forecasting. The V-JEPA Predictor remains stable but accumulates error gradually as the horizon increases. ThinkJEPA achieves the best performance across all horizons, indicating that VLM-thinker guidance improves long-horizon forecasting while maintaining dense dynamics modeling. Notably, the improvement becomes more pronounced as the rollout horizon increases, suggesting that semantic guidance helps stabilize iterative prediction and mitigate compounding errors.

Model               A@4     A@8     A@16    A@32    F@4     F@8     F@16    F@32
Qwen3-VL Thinking   0.140   0.819   1.375   1.026   0.143   2.850   0.286   1.092
V-JEPA Predictor    0.121   0.126   0.134   0.142   0.124   0.136   0.149   0.153
ThinkJEPA           0.071   0.078   0.092   0.111   0.073   0.090   0.118   0.136

Table 5: Recursive rollout on EgoDex: trajectory error vs. horizon. We perform autoregressive rollout for horizons H ∈ {4, 8, 16, 32}. A@H and F@H denote ADE@H and FDE@H, respectively; lower is better.

4.11 Qualitative Results

As shown in Fig. 2, we visualize predicted future hand trajectories by decoding the forecasted V-JEPA latents with the downstream task head and overlaying the resulting 3D joints on a reference frame.
Overall, ThinkJEPA produces more plausible and accurate trajectories: the final endpoint (deep red) aligns more closely with the hand in the reference frame, and the temporal progression is smoother and more diverse over time. In contrast, as highlighted by the yellow circles, the V-JEPA baseline often exhibits temporally collapsed predictions, where blue points concentrate in a small region, indicating that multiple timesteps and joints are predicted to overlap. In the first example, the VLM-only baseline hallucinates a non-existent left hand, while the V-JEPA baseline yields less accurate joint localization and noisier motion compared to our method. Figure 2: Qualitative results. Predicted future hand-manipulation trajectories visualized as heat maps overlaid on the reference frame. Colors indicate temporal progression from blue (earlier) to red (later). Ideally, trajectories transition smoothly from blue to red, indicating coherent motion over time. ThinkJEPA produces smoother trajectories with better temporal consistency and joint alignment. 5 Conclusion We presented ThinkJEPA, a VLM-guided JEPA-style latent world modeling framework that integrates long-horizon semantic reasoning from a vision–language thinker with dense latent dynamics forecasting. ThinkJEPA adopts a dual-temporal perception design—uniform sampling for the VLM thinker and dense sampling for the JEPA branch—and injects pyramid-extracted, multi-depth VLM representations into the JEPA predictor via layer-wise modulation. This complementary integration preserves the latent forecasting interface required by downstream world-model tasks while enriching predictions with knowledge-aware guidance. 
Extensive experiments on egocentric hand-manipulation trajectory prediction demonstrate that ThinkJEPA improves both representation-level forecasting quality and downstream performance, outperforming a strong VLM baseline (Qwen3-VL (Thinking)) and a V-JEPA predictor baseline, and exhibiting robust long-horizon rollout behavior. Future work includes extending the framework to broader embodied tasks and exploring more scalable guidance mechanisms for longer videos and more diverse interaction scenarios. 6 Supplementary Materials Unless otherwise specified, all experiments in this section follow the same training backbone and data pipeline, and only differ in the conditioning signal or temporal sampling strategy. The supplementary suite consists of five controlled studies. All experiments are conducted on EgoDex using the same cached visual backbone features and the same downstream trajectory prediction protocol, unless otherwise noted. 6.1 Prompt + Video to VLM-Conditioned Features Experimental setting. This study evaluates a prompt-conditioned VLM feature design, where the predictor takes cached visual features as the primary input and uses language-modulated VLM features as external conditioning. The training setup follows the same backbone and downstream trajectory head as the main paper, while changing only the conditioning path. Experimental details. The input video is first represented by cached ViT-style spatiotemporal features, which serve as the main predictive substrate. In parallel, video frames together with a text prompt are passed through Qwen3-VL (Thinking) to extract VLM conditioning features. The conditioning features include two complementary streams: encoder-side representations and autoregressive generation-side representations. These VLM features are injected into the predictor rather than replacing the visual backbone. Analysis. As shown in Tab. 6, prompt-conditioned VLM features provide a competitive design choice. 
Compared with the full ThinkJEPA model, this variant achieves slightly weaker trajectory prediction (ADE/FDE/Acc: 0.069/0.062/0.495 vs. 0.061/0.056/0.596) but slightly better latent forecasting metrics (FD/SL1/CD: 74.007/1.248/0.315 vs. 74.747/1.263/0.324). These results suggest that prompt-conditioned VLM features are effective for representation guidance, while the full ThinkJEPA design yields a stronger overall downstream trade-off.

Variant                          ADE↓    FDE↓    Acc↑    FD↓      SL1↓    CD↓
Prompt + video → VLM condition   0.069   0.062   0.495   74.007   1.248   0.315
ThinkJEPA                        0.061   0.056   0.596   74.747   1.263   0.324

Table 6: Prompt + video to VLM-conditioned features. The predictor uses cached visual features as the main trajectory backbone and VLM-derived features as external conditioning.

6.2 Temporal Stride Ablation

Experimental setting. This study examines the role of temporal sampling granularity in the dual-temporal design. We compare two temporal strides while keeping the predictor architecture, training budget, and conditioning mechanism fixed.

Experimental details. EgoDex trajectories are first represented as 64 uniformly sampled temporal points over each episode. The prediction protocol uses 32 past points and 32 future points. For stride 1, no temporal decimation is applied, so the predictor observes all 64 sampled points. For stride 2, the temporal sequence is subsampled before the past/future split, resulting in a coarser temporal representation. Since the original sequence is already uniformly sampled to 64 points, stride 1 corresponds to denser temporal coverage, while stride 2 reduces temporal resolution.

Analysis. The results in Tab. 7 show that denser temporal sampling improves both trajectory prediction and latent forecasting quality. Stride 1 outperforms stride 2 on all reported metrics. Compared with both stride variants, the full ThinkJEPA model further improves downstream trajectory performance, achieving the best ADE/FDE/Acc overall on this split.
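The stride protocol above can be sketched as follows. This is a minimal illustration under the assumption that decimation happens before an equal past/future split; the exact bookkeeping in the paper's pipeline may differ.

```python
import numpy as np

def split_with_stride(traj, stride):
    """Temporally decimate a uniformly sampled trajectory, then split the
    remaining points into equal past/future halves.

    traj: array of shape (T, J, 3); with T = 64, stride 1 keeps all 64
    points (32 past / 32 future), while stride 2 halves the temporal
    resolution before the split.
    """
    sub = traj[::stride]       # keep every `stride`-th time point
    half = sub.shape[0] // 2   # equal past/future split
    return sub[:half], sub[half:]
```

With a 64-point EgoDex episode, stride 1 yields the dense 32/32 coverage used in the main setting of Sec. 4.4, while stride 2 produces a 16/16 split over the same wall-clock duration.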
Stride              ADE↓    FDE↓    Acc↑    FD↓      SL1↓    CD↓
Temporal stride 1   0.071   0.064   0.471   73.920   1.246   0.314
Temporal stride 2   0.073   0.071   0.458   74.266   1.247   0.317

Table 7: Temporal stride ablation. Denser temporal sampling improves trajectory prediction and latent forecasting quality.

6.3 Conditioning Mechanism Ablation

Experimental setting. This study compares three conditioning operators under the same backbone, data split, and training budget. Only the conditioning mechanism is varied.

Experimental details. We compare three ways of injecting VLM guidance into the predictor: (i) FiLM-style affine modulation, (ii) cross-attention conditioning, and (iii) AdaLN-style adaptive normalization. All variants consume the same cached VLM features and the same base visual representation stream, so differences can be attributed to the conditioning operator itself.

Analysis. Tab. 8 shows that all three conditioning mechanisms are competitive. FiLM provides the strongest latent forecasting quality among the three variants, while cross-attention and AdaLN remain close alternatives. Compared with these controlled variants, the full ThinkJEPA model achieves substantially better trajectory prediction (ADE/FDE/Acc), indicating that the final design used in the paper offers the strongest downstream performance under the current setting.

Conditioning   ADE↓     FDE↓    Acc↑    FD↓      SL1↓    CD↓
FiLM           0.0706   0.064   0.471   73.878   1.245   0.314
Cross-attn     0.0707   0.066   0.475   73.965   1.247   0.315
AdaLN          0.0708   0.065   0.474   74.280   1.253   0.317

Table 8: Conditioning mechanism ablation. We compare FiLM, cross-attention, and AdaLN under the same training setup.

6.4 Direct Visual Conditioning and Deepstack-Token Removal

Experimental setting. This study compares two variants: (i) removing the VLM branch entirely and conditioning only on direct visual features, and (ii) keeping the VLM branch but removing the deepstack/thinking-token contribution. We further compare both variants against the full ThinkJEPA model.

Experimental details.
For direct visual conditioning, the predictor removes all VLM conditioning and operates only on visual backbone features. This serves as a controlled visual-only baseline within the same predictor family. For deepstack-token removal, the VLM branch is preserved, but the generation-side thinking/deepstack token contribution is explicitly dropped before conditioning is consumed by the predictor. This removal is implemented using token filtering and hard zeroing, ensuring that the removed tokens do not leak through the conditioning path. Analysis. The results in Tab. 9 show that both ablations remain competitive. Dropping deepstack tokens yields slightly stronger latent forecasting quality than direct visual conditioning alone, suggesting that the full VLM branch contributes non-trivial information. However, both variants are weaker than the full ThinkJEPA model in downstream trajectory performance, and ThinkJEPA achieves the best ADE/FDE/Acc overall. This indicates that the complete VLM guidance pathway is most effective when used as part of the full model design. Why FiLM as the default conditioning operator. Although we compare multiple conditioning operators in the supplementary experiments, we choose FiLM as the default design in ThinkJEPA because our primary goal is to improve latent feature prediction, rather than only optimizing the downstream regression head. FiLM performs feature-wise modulation directly in the predictor latent space, allowing the VLM thinker to refine the predicted representation while preserving the JEPA-style latent forecasting interface. Compared with cross-attention, FiLM is lighter-weight and introduces less structural change to the predictor, making it easier to attribute gains to guidance rather than additional token interactions. 
Compared with normalization-based conditioning such as AdaLN, FiLM provides more direct channel-wise control over the latent features themselves, which is particularly aligned with our objective of improving representation-level prediction quality. For this reason, we adopt FiLM as the main conditioning operator in the paper, while including the other variants as complementary ablations.

Variant                      ADE↓    FDE↓    Acc↑
Direct visual conditioning   0.071   0.066   0.475
Drop deepstack tokens        0.072   0.066   0.464
ThinkJEPA                    0.061   0.056   0.596

Table 9: Direct visual conditioning vs. deepstack-token removal. Both ablations remain competitive, while ThinkJEPA achieves the strongest downstream trajectory performance.

6.5 Pure Prompt-Only VLM Baseline

Experimental setting. This study evaluates a pure VLM baseline without any task-specific prediction head. Unlike the VLM-only baseline in the main paper, which uses a trained downstream head on top of VLM-derived features, this study directly prompts the VLM with video and text and asks it to output future 3D trajectories in structured form. Its purpose is to provide a zero-shot reference point for direct prompting without task-specific adaptation.

Experimental details. We use Qwen3-VL (Thinking) as a prompt-only baseline. The model observes only the past segment of the video and is instructed to predict future hand trajectories in world coordinates. It outputs a small set of future waypoints in JSON format, which are then interpolated to the full prediction horizon for evaluation. No trajectory head is trained, making this a zero-shot, prompt-only baseline.

Analysis. As shown in Tab. 10, the pure prompt-only baseline performs dramatically worse than ThinkJEPA, with ADE/FDE of 10.855/10.927 compared to 0.061/0.056 for our method. This large gap confirms that direct prompting of a general-purpose VLM is insufficient for fine-grained metric-space trajectory prediction.
In addition, parsing success is poor in this setting, indicating that structured trajectory generation itself is unstable under pure prompting. We therefore regard this baseline as an intentionally weak but informative reference point, rather than a competitive predictor for this benchmark.

Implication for the main-paper VLM-only baseline. The result in Tab. 10 also clarifies why the VLM-only baseline in the main paper is implemented with a trained task head rather than direct prompting. A general-purpose VLM that has not been fine-tuned for future trajectory prediction performs very poorly in this setting, even though it possesses strong general semantic reasoning ability. This indicates that zero-shot prompting alone is insufficient for fine-grained metric-space forecasting of hand motion. Therefore, the main-paper VLM-only baseline is intentionally designed as a fairer and stronger comparison: it uses the same task-specific training protocol and downstream prediction head, while removing the JEPA latent forecasting pathway. In this way, the comparison in the main paper isolates the benefit of JEPA-style latent prediction versus VLM-based features under matched supervision and optimization, rather than comparing against an intentionally weak zero-shot prompt baseline.

Baseline               ADE↓     FDE↓     Acc↑
Qwen3-VL prompt-only   10.855   10.927   0.000
ThinkJEPA              0.061    0.056    0.596

Table 10: Pure prompt-only VLM baseline. We directly prompt Qwen3-VL (Thinking) to predict future trajectories from video and text, without any task-specific fine-tuning or trained prediction head. The large performance gap to ThinkJEPA indicates that zero-shot prompting is not sufficient for fine-grained metric-space trajectory forecasting. This study is included as a weak reference point only; the VLM-only baseline in the main paper is a substantially fairer comparison because it is trained with the same task-specific supervision and downstream head.
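The waypoint-to-horizon interpolation used to score this prompt-only baseline can be sketched as below. This is our own illustration: the JSON schema (a `"waypoints"` key), the assumption of uniformly spaced waypoints, and the use of linear interpolation are all assumptions, not details confirmed by the paper.

```python
import json
import numpy as np

def waypoints_to_trajectory(json_str, t_future=32):
    """Parse a small set of predicted 3D waypoints from the VLM's JSON
    output and linearly interpolate them to the full prediction horizon.

    Returns an array of shape (t_future, 3) suitable for ADE/FDE scoring.
    """
    # Hypothetical schema: {"waypoints": [[x, y, z], ...]} in world coordinates.
    pts = np.asarray(json.loads(json_str)["waypoints"], dtype=float)  # (K, 3)
    src = np.linspace(0.0, 1.0, len(pts))   # assumed-uniform waypoint timestamps
    dst = np.linspace(0.0, 1.0, t_future)   # dense evaluation grid
    # Interpolate each spatial coordinate independently.
    return np.stack([np.interp(dst, src, pts[:, d]) for d in range(3)], axis=-1)
```

If the JSON fails to parse at all (the unstable case noted above), the evaluation has no trajectory to score, which is why parsing success matters for this baseline.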
Hyperparameter                   Value
Input frames (T)                 64
Past/Future split (T_p/T_f)      32/32
Input resolution                 256×256
Backbone                         V-JEPA-L (vit_large_rope)
Backbone depth / dim             24 / 1024
Patch embedding                  Conv3d kernel/stride (2,16,16)
Predictor                        VLM-injected V-JEPA predictor
Predictor dim (D_p)              384
Predictor depth / heads          12 / 6
RoPE / mask tokens               enabled / 2
VLM thinker                      Qwen3-VL (Thinking) (cached)
VLM token dim (D_c)              2048
Cache clips (N_c)                8
Encoder token length (L_enc)     480
AR token length (L_ar)           15
Pyramid layers (ℒ)               {0, 4, 8, 12, 16, 20, 24, 27}
Guidance injection               layer-wise FiLM
Temporal downsampling            AvgPool stride 2 (64 → 32)
Output shape                     32×52×3

Table 11: Key architectural hyperparameters and tensor dimensions.

7 Implementation Details

Shared implementation setting. Tab. 11 summarizes the key architectural hyperparameters and tensor dimensions used throughout the supplementary experiments. Unless otherwise specified, all experiments share the same base configuration: a 64-frame input clip at resolution 256×256, a V-JEPA-L backbone for latent token extraction, and a VLM-injected V-JEPA predictor operating in a latent dimension of D_p = 384. The VLM thinker is instantiated with cached Qwen3-VL (Thinking) features, including both encoder tokens and autoregressive tokens, and multi-depth VLM representations are extracted from the pyramid layer set ℒ = {0, 4, 8, 12, 16, 20, 24, 27}. Guidance is injected into the predictor via layer-wise FiLM modulation, and the final latent sequence is decoded through temporal downsampling to produce 32×52×3 trajectory outputs. This table is provided to clarify the common experimental backbone shared by the controlled ablations in the supplementary material.

References

[1] M. Ahn, A. Brohan, N. Brown, et al. (2022) Do as I can, not as I say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
[2] J. Alayrac, J. Donahue, P. Luc, et al. (2022) Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems.
[3] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas (2023) Self-supervised learning from images with a joint-embedding predictive architecture. arXiv:2301.08243.
[4] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. (2025) V-JEPA 2: self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.
[5] S. Bai, Y. Cai, R. Chen, et al. (2025) Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
[6] D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, A. Bolourchi, Y. LeCun, and P. Fung (2025) VL-JEPA: joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942.
[7] Y. Feng, Y. Li, W. Zhang, S. Zheng, H. Luo, Z. Yue, and Z. Lu (2025) VideoOrion: tokenizing object dynamics in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20401–20412.
[8] K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, et al. (2024) Ego-Exo4D: understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19383–19400.
[9] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122.
[10] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2020) Dream to control: learning behaviors by latent imagination. arXiv:1912.01603.
[11] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba (2020) Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193.
[12] J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
[13] R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang (2025) EgoDex: learning dexterous manipulation from large-scale egocentric video. arXiv:2505.11709.
[14] S. Huang, L. Dong, W. Wang, et al. (2023) Language is not all you need: aligning perception with language models. In Thirty-seventh Conference on Neural Information Processing Systems.
[15] X. Jia, A. Donat, X. Huang, X. Zhao, D. Blessing, H. Zhou, H. A. Wang, H. Zhang, Q. Wang, R. Lioutikov, et al. (2025) X-IL: exploring the design space of imitation learning policies. arXiv preprint arXiv:2502.12330.
[16] Y. LeCun et al. (2022) A path towards autonomous machine intelligence, version 0.9.2. Open Review 62 (1), pp. 1–62.
[17] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li (2024) LLaVA-OneVision: easy visual task transfer. arXiv preprint arXiv:2408.03326.
[18] J. Li, D. Li, S. Savarese, and S. C. H. Hoi (2023) BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning.
[19] J. Li, D. Li, C. Xiong, and S. C. H. Hoi (2022) BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning.
[20] Y. Li, Q. Gao, T. Zhao, B. Wang, H. Sun, H. Lyu, R. D. Hawkins, N. Vasconcelos, T. Golan, D. Luo, et al. (2024) Core knowledge deficits in multi-modal language models. arXiv preprint arXiv:2410.10855.
[21] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022) Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
[22] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. arXiv preprint arXiv:2304.08485.
[23] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu (2024) RoboCasa: large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523.
[24] W. S. Peebles and S. Xie (2022) Scalable diffusion models with transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4172–4182.
[25] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[26] I. Pikabea, I. Lacunza, O. P. Velasco, C. Escolano, A. Gonzalez-Agirre, J. Hernando, and M. Villegas (2025) Breaking language barriers in visual language models via multilingual textual regularization. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 299–337.
[27] A. Radford, J. W. Kim, C. Hallacy, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
[28] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
[29] C. Saharia, W. Chan, S. Saxena, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS ’22).
[30] Y. Tang, J. Bi, S. Xu, L. Song, S. Liang, T. Wang, D. Zhang, J. An, J. Lin, R. Zhu, et al. (2025) Video understanding with large language models: a survey. IEEE Transactions on Circuits and Systems for Video Technology.
[31] J. Xiao, N. Huang, H. Qin, D. Li, Y. Li, F. Zhu, Z. Tao, J. Yu, L. Lin, T. Chua, et al. (2025) VideoQA in the era of LLMs: an empirical study. International Journal of Computer Vision 133 (7), pp. 3970–3993.
[32] Y. Zhai, S. Tong, X. Li, M. Cai, Q. Qu, Y. J. Lee, and Y. Ma (2023) Investigating the catastrophic forgetting in multimodal large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
[33] H. Zhang, Y. Lu, L. Wang, Y. Li, D. Chen, Y. Xu, and Y. Fu (2025) LinkedOut: linking world knowledge representation out of video LLM for next-generation video recommendation. arXiv preprint arXiv:2512.16891.
[34] W. Zhang, Y. Feng, H. Luo, Y. Li, Z. Yue, S. Zheng, and Z. Lu (2025) Unified multimodal understanding via byte-pair visual encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12976–12986.
[35] W. Zhang, Z. Xie, Y. Feng, Y. Li, X. Xing, S. Zheng, and Z. Lu (2024) From pixels to tokens: byte-pair encoding on quantized visual modalities. arXiv preprint arXiv:2410.02155.