Paper deep dive
UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/26/2026, 2:48:51 AM
Summary
UniMotion is a unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images. It introduces a continuous motion paradigm using a Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders within a shared LLM backbone. The framework employs Dual-Posterior KL Alignment (DPA) for visual-semantic injection and Latent Reconstruction Alignment (LRA) for self-supervised pre-training to establish a stable motion-aware foundation.
Entities (5)
Relation Signals (3)
UniMotion → utilizes → CMA-VAE
confidence 98% · UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality... A novel Cross-Modal Aligned Motion VAE (CMA-VAE)...
UniMotion → implements → DPA
confidence 95% · To inject visual-semantic priors into motion representations... we propose Dual-Posterior KL Alignment (DPA)
UniMotion → implements → LRA
confidence 95% · we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy
Cypher Suggestions (2)
Find all components and strategies associated with the UniMotion framework. · confidence 90% · unvalidated
MATCH (f:Framework {name: 'UniMotion'})-[:UTILIZES|IMPLEMENTS]->(c) RETURN f, c
Identify models that perform similar tasks to UniMotion. · confidence 85% · unvalidated
MATCH (m:Model)-[:PERFORMS_TASK]->(t:Task)<-[:PERFORMS_TASK]-(u:Framework {name: 'UniMotion'}) RETURN m
Abstract
Abstract: We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
Tags
Links
- Source: https://arxiv.org/abs/2603.22282v1
- Canonical: https://arxiv.org/abs/2603.22282v1
Full Text
106,851 characters extracted from source content.
UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
Ziyi Wang 1,∗, Xinshun Wang 1,∗, Shuang Chen 2,∗, Yang Cong 3, and Mengyuan Liu 1,†
1 Peking University, 2 Donghua University, 3 South China University of Technology
https://wangzy01.github.io/UniMotion
[Figure 1 omitted: per-task radar comparison (Text-to-Motion, Vision-to-Motion, Motion-to-Text, Vision-to-Text, Motion Prediction, Motion Editing, Motion-Guided Image Editing; farther from the center is better) and representative task demonstrations.]
Fig. 1: Left: Overview and performance comparison of UniMotion, a unified framework for any-to-any Motion, Text, and Vision understanding, generation, and editing. UniMotion is the first model to support all seven tri-modal tasks and achieves consistent superiority over existing methods. Right: Representative task demonstrations.
Abstract. We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture.
Existing unified models handle only restricted modality subsets (e.g., Motion–Text or static Pose–Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem—where text supervision alone is too sparse to calibrate the newly introduced motion pathway—we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
⋆ Equal contribution. † Corresponding author: liumengyuan@pku.edu.cn.
arXiv:2603.22282v1 [cs.CV] 23 Mar 2026
Keywords: Motion Generation · Unified Framework · MLLMs
1 Introduction
Recent advances in unified multimodal large language models (MLLMs) have demonstrated remarkably strong and increasingly general capabilities in jointly understanding and generating text and images [27,35–38,46], exhibiting strong cross-modal reasoning within a shared semantic space.
However, human motion—a critical dynamic modality—has not been systematically integrated into such unified frameworks. Motion sequences encode rich temporal dynamics and spatial structure indispensable for game animation, embodied intelligence, virtual reality, and medical rehabilitation [20,33]. Constructing a single framework that unifies the Motion-Text-RGB tri-modality while supporting both understanding and generation remains an unresolved key problem. Existing approaches have not truly addressed this challenge. One line of work, exemplified by MotionGPT [17], unifies Motion and Text through VQ-VAE tokenization [29], but lacks the ability to perceive or generate visual information. Another line, represented by UniPose [21], integrates human pose (single-frame body configuration) with vision and language, but is restricted to static estimation and image understanding, with no generation capability. Both rely on discrete tokenization, which inevitably introduces quantization errors [3] that disrupt temporal continuity and structural fidelity; moreover, the asymmetry between discrete tokens and the continuous RGB feature space complicates cross-modal alignment. In summary, prior methods handle only partial modality subsets, and none forms a unified generation-understanding system spanning continuous motion sequences, natural language, and real images. To this end, we propose UniMotion: the first multimodal framework for unified understanding and generation across the Motion-Text-RGB tri-modality. As summarized in Fig. 1, UniMotion covers understanding, generation, and editing across the three modalities within one model, together with a unified performance comparison against prior partial solutions. The core design philosophy is to treat motion as a continuous modality on equal footing with RGB and construct symmetric continuous pathways for both.
Unlike discrete-tokenization methods, we represent motion with a continuous Cross-Modal Aligned Motion VAE (CMA-VAE) and, through a dual-path embedder, provide complementary semantic-awareness and fine-grained generation channels that enable natural cross-modal alignment inside the shared LLM backbone. Within the backbone, we further employ lightweight modality-routed LoRA—routing each token to a modality-specific low-rank adapter so that motion and text/RGB modalities can adapt the shared parameters independently—and hybrid attention that reconciles motion's need for bidirectional temporal interaction with text's autoregressive constraints, alongside modality-specific flow heads that predict velocity fields in the respective latent spaces. This symmetric continuous design eliminates quantization errors at the architecture level, naturally unifying the understanding and generation of motion and images. Beyond the architecture, we propose two key innovations for cross-modal semantic alignment under heterogeneous supervision. First, we design Dual-Posterior KL Alignment (DPA) for CMA-VAE by jointly training a Vision-Fused Motion Encoder q_ψ(z | motion, image) and a Motion Encoder q_φ(z | motion), minimizing the KL divergence between their posteriors. This allows the Motion Encoder to absorb image-provided semantic supervision during training while requiring only motion at inference. Datasets lacking paired images participate by omitting the DPA loss. Second, we identify a supervision mismatch: the motion the model must generate is dense and kinematically rich, yet the text it learns from is much sparser—capturing only coarse action semantics while omitting details such as stride length, limb coordination, and subtle temporal transitions. Training the generation pathway solely from such under-specified signals leads to ambiguous learning, instability, and degraded fidelity.
Yet the CMA-VAE latent z—the model's own continuous motion encoding—already preserves the full kinematic structure faithfully in compact form. This raises a natural question: before learning from sparse text, can the model first learn to generate motion from its own most informative encoding? We answer this with Latent Reconstruction Alignment (LRA), a simple yet effective self-supervised pre-training strategy that treats z embeddings as "dense motion prompts" and trains the model to reconstruct z from noise in latent space. This self-reconstruction provides precise, unambiguous geometric supervision that jointly calibrates the embedder, LLM backbone, and flow head, establishing a robust motion-aware pathway as the shared foundation for all downstream tasks. Thanks to these designs, UniMotion achieves state-of-the-art results across virtually all downstream tasks, supporting true any-to-any understanding and generation among Motion, Text, and RGB. It demonstrates especially prominent advantages on cross-modal compositional tasks. Our main contributions are:
1. We propose UniMotion, the first framework to unify Motion, Text, and RGB understanding and generation in a single architecture, overcoming the modality-coverage and task-direction limitations of prior methods.
2. We propose a fully continuous motion paradigm: CMA-VAE encodes motion into a visual-semantically enriched continuous latent space, and symmetric dual-path embedders with hybrid attention and modality-routed LoRA construct parallel continuous pathways for Motion and RGB, eliminating quantization bottlenecks at the architecture level.
3. We propose Dual-Posterior KL Alignment (DPA) and Latent Reconstruction Alignment (LRA) as complementary alignment strategies: DPA injects visual-semantic supervision into the motion encoder via posterior alignment, while LRA leverages dense motion latents for self-supervised pathway calibration—jointly constructing a well-aligned tri-modal space that improves training stability and performance across all tasks.
Table 1: Comparison of UniMotion with representative methods from a task perspective. UniMotion uniquely unifies comprehensive understanding, generation, and editing across Motion, Text, and RGB modalities within a single continuous motion-aware MLLM.

| Method | Venue | Motion Repr. |
|---|---|---|
| MotionGPT [17] | NeurIPS'23 | VQ-VAE (discrete) |
| MG-MotionLLM [34] | CVPR'25 | VQ-VAE (discrete) |
| UniPose [21] | CVPR'25 | VQ-VAE pose tokens (discrete) |
| HMVLM [15] | NeurIPS'25 | VQ-VAE body-part (discrete) |
| Show-o2 [38] | NeurIPS'25 | — |
| UniMotion (Ours) | — | CMA-VAE (continuous) |

2 Related Work
Table 1 summarizes the key distinctions between UniMotion and representative prior methods. We now discuss each related area below.
2.1 Human Motion Generation and Understanding
Human motion modeling covers text-driven generation (Text-to-Motion) and semantic understanding (Motion-to-Text), evolving from task-specific models to unified frameworks. Discrete tokenization paradigm. Discretizing motion into codebook indices via VQ-VAE [29] has become mainstream. MotionGPT [17] treated motion as a "foreign language" for unified generation and understanding; MG-MotionLLM [34] advanced multi-granularity motion modeling. However, discretization inevitably introduces quantization errors, causing temporal jitter and detail loss, while codebook collapse limits diversity. Continuous representation paradigm.
To overcome quantization limitations, another line models motion in continuous latent space [39,45]. MLD [39] performs diffusion in the VAE [19] latent space, balancing quality and efficiency. Recent approaches combine motion VAEs with diffusion-based heads for smoother, more realistic generation while preserving semantic control. Despite progress on Motion-Text tasks (Table 1), these methods remain confined to the Motion-Text subspace, lacking visual perception and generation needed for Vision-to-Motion or motion-guided editing.
2.2 Vision-Pose Multimodal Large Language Models
With the rise of MLLMs such as LLaVA [24], researchers have begun integrating human pose into these models to enhance fine-grained behavior understanding. UniPose [21] discretizes 3D pose into tokens via VQ-VAE for unified pose understanding and generation. HMVLM [15] introduces MoE LoRA-based instruction tuning, yet still relies on discrete body-part tokenization and pairwise modality connections rather than a unified latent space. These methods share common limitations: (1) most handle only static pose or pairwise connections without modeling continuous motion; (2) discrete quantization introduces precision bottlenecks; (3) visual interaction is typically unidirectional, lacking image generation capability.
[Figure 2 omitted: UniMotion architecture diagram showing the text tokenizer, CMA-VAE encoder/decoder, vision encoder/decoder, spatial-temporal fusion, semantic and generation branches, motion and vision flow heads, modality-routed LoRA, and the pose-aware vision backbone with bilinear interpolation and pose-fusion projection.]
Fig. 2: Overview of UniMotion.
(Left) UniMotion unifies motion, text, and RGB through symmetric continuous pathways: motion and images are encoded into continuous latents (via CMA-VAE and a vision VAE), mapped by a dual-path embedder that separates semantic abstraction from detail-preserving generation, and processed by a shared backbone for both multimodal understanding and modality-specific flow-based synthesis. (Right) Latent Reconstruction Alignment (LRA) pre-trains the motion pathway with a self-supervised Motion-to-Motion task, using motion latents as dense, unambiguous conditions to reconstruct motion from noise, thereby co-calibrating the embedder, backbone, and motion head before all downstream tri-modal learning.
2.3 Unified Multimodal Understanding and Generation
Unifying understanding and generation of arbitrary modalities within a single architecture is a fundamental direction toward building general artificial intelligence. Show-o [37] pioneered a single Transformer fusing autoregressive and discrete diffusion for joint text-image understanding and generation. Janus-Pro [6] substantially improved performance through data and model scale expansion; Show-o2 [38] further advanced cross-modal capabilities. While these works have achieved significant advances in the Text-RGB domain, human motion—encoding human intent and behavioral logic—remains absent from the core of unified frameworks. Treating motion as video pixels is inefficient and ignores skeletal topology and kinematic constraints.
3 Method
3.1 Overview
UniMotion aims to unify modeling of Motion, Text, and RGB within a single framework, simultaneously supporting understanding and generation tasks (including T2M, M2T, Motion Prediction, Motion Editing, Vision-to-Motion, Vision-to-Text, and Motion-guided Image Editing (MGIE)). The core design philosophy is to treat motion as a continuous modality on equal footing with images and construct symmetric continuous pathways for both.
Unlike prior methods that rely on discrete VQ-VAE tokenization for representing motion dynamics (e.g., MotionGPT, UniPose), we adopt continuous VAE representations for motion, which offers two key advantages: (1) it avoids irreversible information loss during quantization, preserving the temporal continuity and structural fidelity of motion; (2) continuous motion latents share a symmetric representational form with the continuous RGB pathway, enabling cross-modal alignment naturally at the architecture level. The overall framework, illustrated in Fig. 2, comprises three core components: (1) CMA-VAE (Sec. 3.2): encodes motion sequences into continuous latent representations with implicit visual-semantic supervision injected via Dual-Posterior KL Alignment (DPA); (2) Unified Multimodal Architecture (Sec. 3.3): built upon Show-o2 [38], with symmetric dual-path embedders for Motion and RGB, a shared LLM backbone, and modality-specific flow [23] heads; (3) Latent Reconstruction Alignment (LRA) (Sec. 3.4): using dense motion prompts to first warm up the motion pathway (the embedder, flow head, and motion-routed adaptation), followed by progressive multi-stage fine-tuning.
3.2 Cross-Modal Aligned Motion VAE (CMA-VAE)
[Figure 3 omitted: CMA-VAE diagram showing the Motion Encoder, the Vision-Fused Motion Encoder with joint pooling over vision-backbone features, the shared Motion Decoder recovering the skeleton, and the L_recon, L_KL, and L_align losses.]
Fig. 3: CMA-VAE with DPA. CMA-VAE learns a continuous motion latent space using a motion-only encoder for inference and a vision-fused encoder for training-time visual supervision. When paired images are available, motion-guided visual features are fused with motion and distilled via DPA, enabling the shared decoder to learn visually informed motion latents without requiring images at inference.
CMA-VAE encodes variable-length motion sequences into continuous low-dimensional latent representations while injecting implicit visual-semantic supervision through Dual-Posterior KL Alignment (DPA). As shown in Fig. 3, its core consists of three components: Motion Encoder q_φ(z | m), Vision-Fused Motion Encoder q_ψ(z | m, v), and a shared Motion Decoder p_ξ(m | z).
Motion Encoder. The Motion Encoder is used at inference. Given a motion sequence m ∈ R^(T×D_m), a linear layer maps each frame to the latent space, followed by learnable positional encodings and a SkipTransformer Encoder. A linear head predicts Gaussian parameters (μ_φ, log σ²_φ), and the latent code is obtained via reparameterization:
z = μ_φ + σ_φ ⊙ ε,  ε ∼ N(0, I),  (1)
where z ∈ R^(T_z×d).
Vision-Fused Motion Encoder. Used only during training, this encoder fuses motion and visual information. Its motion branch shares the front-end with the Motion Encoder, producing h_motion ∈ R^(T_z×d). The vision branch extracts spatial features from RGB image v via a frozen HRNet [26], then applies bilinear grid sampling at motion-guided 2D joint positions j_2d(m)—skeleton projections derived from the motion sequence:
f_vis = VisionEnc(GridSample(HRNet(v), j_2d(m))) ∈ R^(T_z×d_v).  (2)
After joint-dimension pooling, motion and visual features are concatenated and processed by an independent SkipTransformer Encoder to yield (μ_ψ, log σ²_ψ) and sample z_fused.
Motion Decoder. The decoder accepts latent z, passes it through positional encoding and SkipTransformer layers, and maps back to the D_m-dimensional motion space via a linear layer.
Dual-Posterior KL Alignment (DPA). The core idea of DPA is: constrain the Motion Encoder posterior q_φ(z | m) to approximate the Vision-Fused posterior q_ψ(z | m, v), so that the Motion Encoder implicitly absorbs visual-semantic supervision during training while requiring only motion input at inference.
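The reparameterization in Eq. (1) can be written in a few lines. Below is a minimal NumPy sketch (illustrative only, not the authors' code); the toy shapes stand in for T_z latent frames and d channels:

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), as in Eq. (1).

    In a real autodiff framework the gradients flow through mu and logvar,
    while the randomness stays in the parameter-free eps draw.
    """
    sigma = np.exp(0.5 * logvar)          # logvar = log(sigma^2)
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    return mu + sigma * eps

# Toy shapes: T_z = 16 latent frames, d = 64 channels (illustrative values).
rng = np.random.default_rng(0)
z = reparameterize(np.zeros((16, 64)), np.zeros((16, 64)), rng)
```

Predicting log σ² rather than σ keeps the variance positive without a constraint, which is the standard VAE parameterization the equation implies.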
The total CMA-VAE training objective is:
L_VAE = L_recon + λ_KL (L^φ_KL + L^ψ_KL) + λ_align · L_align.  (3)
Reconstruction loss. For samples with paired images, z_fused from the Vision-Fused Encoder is used for decoding; otherwise z_motion from the Motion Encoder is used:
L_recon = (1/|M|) Σ_{i∈M} SmoothL1(m̂_i, m_i).  (4)
KL regularization. Both encoder posteriors are independently regularized toward the standard normal prior:
L^⋆_KL = (1/2) Σ_{k=1}^{d} (μ²_{⋆,k} + σ²_{⋆,k} − log σ²_{⋆,k} − 1),  ⋆ ∈ {φ, ψ}.  (5)
DPA alignment loss. Computed only for samples with paired images, this term distills visual-semantic knowledge into the Motion Encoder:
L_align = D_KL(q_φ(z | m) ∥ q_ψ(z | m, v)).  (6)
For two diagonal Gaussians N(μ_φ, σ²_φ) and N(μ_ψ, σ²_ψ), the closed-form KL is:
D_KL = (1/2) Σ_{k=1}^{d} [ log(σ²_{ψ,k}/σ²_{φ,k}) + (σ²_{φ,k} + (μ_φ,k − μ_ψ,k)²)/σ²_{ψ,k} − 1 ].  (7)
L_align uses a linear warm-up schedule [10] to prevent strong alignment from disrupting reconstruction in early training. We adopt D_KL(q_φ ∥ q_ψ) with the Vision-Fused posterior q_ψ detached as the alignment target. In the knowledge-distillation sense (student → teacher), this is the reverse KL, which is mode-seeking: it drives q_φ to concentrate on the most salient semantic modes of q_ψ, yielding compact, high-confidence motion representations that capture the core visual-semantic information while filtering out view-specific visual noise. The forward direction D_KL(q_ψ ∥ q_φ) would be mode-covering, forcing q_φ to spread over all modes of q_ψ—including noisy or irrelevant visual modes—leading to an over-dispersed posterior that dilutes representational precision.
Flexible data utilization. For datasets without paired images (e.g., HumanML3D), L_align is simply dropped. For datasets with paired images (e.g., Human3.6M), all three losses apply jointly. At inference, the Vision-Fused Encoder is entirely discarded, ensuring no overhead.
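The closed-form KL of Eq. (7) is simple to implement directly. A NumPy sketch (illustrative, not the paper's implementation) follows; note that Eq. (5) falls out as the special case where the second Gaussian is the standard normal prior N(0, I):

```python
import numpy as np

def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians (Eq. 7),
    with p = q_phi (motion-only) and q = q_psi (vision-fused, detached)."""
    var_p, var_q = np.exp(logvar_p), np.exp(logvar_q)
    return 0.5 * np.sum(
        logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def kl_to_prior(mu, logvar):
    """Eq. (5): KL against the standard normal prior N(0, I)."""
    return kl_diag_gaussians(mu, logvar, np.zeros_like(mu), np.zeros_like(logvar))
```

Swapping the argument order yields the mode-covering forward KL D_KL(q_ψ ∥ q_φ) that the paper argues against; keeping q_φ in the first slot gives the mode-seeking reverse KL used for DPA.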
3.3 Unified Multimodal Architecture
UniMotion is built upon Show-o2 [38], extending it with a Motion modality pathway architecturally symmetric to RGB.
Dual-Path Embedder. Given the CMA-VAE latent z ∈ R^(T_z×d), two parallel branches process z: a Semantic branch (MLP + Transformer Encoder layers) extracts high-level semantic features, mirroring SigLIP [41] on the vision side; a Generation branch (MLP + learnable positional encodings) maps z directly to the LLM hidden dimension, preserving fine-grained motion details and mirroring the vision PatchEmbed. The two outputs are concatenated and projected to the unified LLM hidden dimension via RMSNorm + MLP. All tasks uniformly use the fused embeddings. This dual-path design is particularly advantageous for motion-conditioned synthesis tasks (e.g., Motion Editing and Prediction), where the two branches structurally decouple semantic comprehension from fine-grained detail preservation. To further strengthen visual-motion alignment, the RGB pathway is additionally equipped with a pose-aware vision backbone: initialized from a pretrained human body encoder [7] and kept frozen, it extracts body-structure-aware features that complement SigLIP's global visual semantics. Both streams are fused into the same RGB token representation via projection, maintaining architectural symmetry with the Motion embedder (see Supplementary for details).
Hybrid Attention and Modality-Routed LoRA. Hybrid attention maintains global causal ordering at the sequence level while enabling bidirectional full attention within each motion span, reconciling the flow matching objective—which requires simultaneous velocity field prediction across the entire motion latent—with text autoregressive generation. Modality-routed LoRA assigns separate low-rank adaptation branches for Text/RGB and Motion tokens in each attention layer, enabling modality-specific adaptation with only ~2% additional parameters while preserving the LLM's existing capabilities.
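The hybrid attention pattern (globally causal, bidirectional inside each motion span) can be illustrated with a boolean mask builder. This is a simplified NumPy sketch of the idea, not the model's actual implementation:

```python
import numpy as np

def hybrid_attention_mask(is_motion):
    """Boolean mask where mask[i, j] = True means position i may attend to j:
    causal everywhere, but fully bidirectional inside each contiguous run
    of motion tokens (so the flow head sees the whole motion span)."""
    n = len(is_motion)
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    start = None
    for i in range(n + 1):
        inside = i < n and bool(is_motion[i])
        if inside and start is None:
            start = i                        # a motion span opens
        elif not inside and start is not None:
            mask[start:i, start:i] = True    # full attention within the span
            start = None
    return mask

# Example sequence: text, motion, motion, text.
mask = hybrid_attention_mask(np.array([False, True, True, False]))
```

In the example, the two motion tokens attend to each other in both directions, while text tokens keep strict left-to-right causality.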
Modality-Specific Flow Heads. For generation tasks, LLM hidden states are transformed back to the target modality latent space via modality-specific flow heads. Motion flow head. A lightweight AdaLN-conditioned structure (Modulated Attention Blocks + MotionFinalLayer) maps backbone features to velocity predictions in the motion latent space:
v̂_m ∈ R^(T_z×d_m).  (8)
Timestep conditioning is injected via timestep embedding c_t; the output layer is zero-initialized to stabilize early training. Vision flow head. An isomorphic design whose output dimension aligns with the vision latent representation. This "shared backbone + modality-specific head" design balances parameter efficiency and cross-modal adaptation.
3.4 Latent Reconstruction Alignment (LRA)
The cold-start problem and dense self-supervision. After DPA pre-training, the motion pathway is still uncalibrated: the embedder, motion flow head, and motion-routed adaptation have not yet been jointly aligned. Direct multi-task training from this state causes clear degradation (T2M R@3 only 0.801 vs. our final 0.841, see Sec. 4.3). While text descriptions are inherently sparse (one-to-many mapping), the CMA-VAE latent z ∈ R^(T_z×d) is a dense, lossless encoding whose self-reconstruction constitutes an unambiguous one-to-one mapping—an ideal zero-cost pre-training signal for bootstrapping the motion pathway.
M2M task design. We instantiate this via a Motion-to-Motion (M2M) self-reconstruction task. The CMA-VAE Encoder produces z; the dual-path embedder projects z into the LLM, whose hidden states pass through the motion flow head to reconstruct z from noise:
L_M2M = E_{z_0∼N(0,I), t∼p(t)} ‖ v_θ(z_t, t | Embed_fused(z)) − (z − z_0) ‖²,  (9)
where z_t = t·z + (1−t)·z_0. Critically, the LLM receives Embed_fused(z) as conditioning rather than the noised z_t (injected only into the flow head via AdaLN), ensuring the motion pathway learns to structurally encode motion semantics rather than merely acting as a denoiser.
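Eq. (9) pairs a linear interpolation path with a constant velocity target. A minimal NumPy sketch of one training step follows; the callable `v_theta` stands in for the LLM-plus-flow-head, and `cond=z` abbreviates Embed_fused(z) (both names are illustrative, not the paper's API):

```python
import numpy as np

def m2m_flow_loss(z, v_theta, rng):
    """One-sample sketch of the LRA Motion-to-Motion objective (Eq. 9).

    z:       clean CMA-VAE latent, shape (T_z, d).
    v_theta: callable (z_t, t, cond) -> predicted velocity field; `cond`
             stands in for the Embed_fused(z) conditioning fed to the LLM.
    """
    z0 = rng.standard_normal(z.shape)   # noise sample z_0 ~ N(0, I)
    t = rng.uniform()                   # timestep t ~ p(t), here uniform
    z_t = t * z + (1.0 - t) * z0        # linear interpolation path
    target = z - z0                     # ground-truth velocity along the path
    pred = v_theta(z_t, t, cond=z)
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))        # toy latent shape
loss = m2m_flow_loss(z, lambda z_t, t, cond: np.zeros_like(z_t), rng)
```

Because z_t = t·z + (1−t)·z_0, the target z − z_0 is exactly the time derivative of the path, which is what makes the supervision geometrically unambiguous compared with text conditioning.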
Co-calibration and cross-task transfer. M2M simultaneously co-calibrates the embedder (compressing z into LLM-readable tokens), the motion-routed adaptation in the shared backbone (extracting structural cues), and the flow head (mastering latent-space geometry) in a tightly coupled manner—a calibration that sparse text supervision alone cannot provide due to ambiguous geometric feedback. Once calibrated, this pathway becomes the shared foundation for all downstream tasks: T2M benefits from the pre-calibrated flow head, M2T reuses the embedder's semantic compression, and Vision-to-Motion leverages the aligned pipeline so the LLM can focus purely on cross-modal mapping. Crucially, LRA does not degenerate into a trivial identity mapping: the dual-path embedder compresses z into LLM-compatible tokens via non-trivial projection and Transformer encoding, and the flow head must predict velocity fields from Gaussian noise conditioned on LLM hidden states—a probabilistic mapping that is architecturally incapable of bypassing the LLM. More design and implementation details on the unified 269-dimensional motion representation, CMA-VAE visual guidance, dual-path embedder and RGB pathway, hybrid attention and modality-routed LoRA, flow matching and auxiliary supervision, as well as the multi-stage training pipeline, are provided in the Supplementary Material.
Table 2: Unified multi-task comparison using one representative metric per task. "N/A" indicates the method does not support the task. UniMotion is the only method covering all seven tasks.

| Method | T2M R@3↑ | M2T BertScore↑ | MotionPred ADE↓ | MotionEdit R@3↑ | V2M MPJPE↓ | V2T BLEU-4↑ | MGIE Mot.Acc↑ |
|---|---|---|---|---|---|---|---|
| MotionGPT [17] | 0.778 | 32.4 | 4.745 | N/A | N/A | N/A | N/A |
| MG-MotionLLM [34] | 0.802 | 36.7 | N/A | 73.23 | N/A | N/A | N/A |
| UniPose [21] | N/A | N/A | N/A | N/A | 81.8 | 17.3 | N/A |
| Show-o2 [38] | N/A | N/A | N/A | N/A | N/A | 12.1 | N/A |
| UniMotion (Ours) | 0.841 | 41.2 | 3.172 | 84.94 | 75.0 | 21.9 | 0.67 |
4 Experiments

Our experiments span three dimensions: (1) Unification: systematic comparison with SOTA on tasks spanning Motion, Text, and RGB (Sec. 4.2); (2) Continuity: ablation validating the superiority of CMA-VAE (Sec. 4.3); (3) Alignment: ablation confirming the key contributions of DPA and LRA (Sec. 4.3).

4.1 Experimental Setup

Datasets. We evaluate UniMotion on HumanML3D [12] (Text-to-Motion, Motion-to-Text, Motion Prediction), MotionFix [1] (Motion Editing), Human3.6M [16] (Vision-to-Motion, Vision-to-Text), and MoVid [5] (Vision-to-Text). We also construct a triplet evaluation set from Human3.6M and 3DPW for Motion-guided Image Editing (MGIE). We adopt a unified 269-dimensional motion representation to support both generation and body recovery tasks.

Implementation details. The unified backbone is based on Show-o2 1.5B [38], equipped with a dual-path motion embedder and a motion flow head. Modality-routed LoRA (rank 32) is applied to all attention layers. We train with AdamW and use an Euler ODE solver for generation inference. Due to space limits, complete dataset preparation details, hyperparameters, and evaluation metrics are provided in the supplementary material. All training is conducted on 4×A6000 GPUs.

4.2 Comparison with State-of-the-Art

Unified Multi-task Comparison. Table 2 provides a unified comparison across all seven tasks using one representative metric per task for transparent cross-task evaluation. UniMotion is the only method covering all tasks.

Table 3: Text-to-Motion generation on HumanML3D. → indicates closer to Real is better. UniMotion† is the single-task model; UniMotion is the unified multi-task model. Bold: best; underline: second best.

| Type | Method | R@1↑ | R@2↑ | R@3↑ | FID↓ | MMDist↓ | DIV→ |
|---|---|---|---|---|---|---|---|
| — | Real | 0.511 | 0.703 | 0.797 | 0.002 | 2.974 | 9.503 |
| Gen. only | T2M-GPT [43] | 0.491 | 0.680 | 0.775 | 0.116 | 3.118 | 9.761 |
| | DiverseMotion [25] | 0.515 | 0.706 | 0.802 | 0.072 | 2.941 | 9.683 |
| | MoMask [11] | 0.521 | 0.713 | 0.807 | 0.045 | 2.958 | 9.620 |
| | UniMotion† | 0.528 | 0.726 | 0.830 | 0.231 | 2.845 | 9.649 |
| Gen. & Und. | TM2T [14] | 0.424 | 0.618 | 0.729 | 1.501 | 3.467 | 8.589 |
| | MotionGPT [17] | 0.492 | 0.681 | 0.778 | 0.232 | 3.096 | 9.528 |
| | HMVLM [15] | 0.463 | 0.646 | 0.744 | 0.156 | 3.328 | 9.544 |
| | MG-MotionLLM [34] | 0.516 | 0.706 | 0.802 | 0.303 | 2.952 | 9.960 |
| | UniMotion | 0.557 | 0.749 | 0.841 | 0.194 | 2.715 | 9.583 |

Table 4: Motion-to-Text understanding on HumanML3D. Evaluation protocol follows [14]. UniMotion† is the single-task model. Bold: best; underline: second best.

| Method | R@1↑ | R@3↑ | MMDist↓ | Bleu@1↑ | Bleu@4↑ | Rouge↑ | CIDEr↑ | BertScore↑ |
|---|---|---|---|---|---|---|---|---|
| Real | 0.523 | 0.828 | 2.901 | — | — | — | — | — |
| TM2T [14] | 0.516 | 0.823 | 2.935 | 48.9 | 7.00 | 38.1 | 16.8 | 32.2 |
| MotionGPT [17] | 0.543 | 0.827 | 2.821 | 48.2 | 12.47 | 37.4 | 29.2 | 32.4 |
| LaMP-M2T [22] | 0.547 | 0.831 | 2.808 | 47.8 | 13.04 | 37.1 | 28.9 | 32.7 |
| MG-MotionLLM [34] | 0.592 | 0.866 | 2.581 | — | 8.06 | — | — | 36.7 |
| UniMotion† | 0.547 | 0.843 | 2.562 | 56.9 | 18.6 | 44.7 | 35.4 | 38.5 |
| UniMotion | 0.562 | 0.857 | 2.481 | 60.4 | 20.7 | 46.5 | 39.3 | 41.2 |

Fig. 4: Qualitative comparison on T2M and M2T. (M2T examples from the figure: GT "a person who walked in a turn to their right" vs. MotionGPT "a person leans forward, and rolls. the person then proceeds to run forwards." vs. Ours "a person walking curving to the right."; GT "person walks forward, appears to be pushed, recovers, then continues to walk forward." vs. MotionGPT "a person dances and spins around in the air." vs. Ours "a person walks forward, stumbles to the side, then walks forward again." T2M prompts: "a person walks forward, and repeatedly reaches down then shakes something"; "a person standing up strikes their hands together well above their head.")

Text-to-Motion Generation. Table 3 compares UniMotion with SOTA on HumanML3D. UniMotion† denotes the task-specific variant sharing the same architecture and training recipe, trained only on the evaluated task without cross-task supervision; UniMotion is the unified multi-task model. UniMotion achieves the best R-Precision and MMDist by clear margins across compared methods, establishing strong semantic alignment between text and generated motion.
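The generation results above are produced by integrating the learned velocity field with the fixed-step Euler ODE solver noted in the setup. A minimal sketch of such a sampler (the `velocity_fn` signature is an illustrative assumption; in the full system the resulting latent would then be decoded by the motion VAE):

```python
import numpy as np

def euler_sample(velocity_fn, cond, dim, steps=50, rng=None):
    """Integrate dz/dt = v(z, t, cond) from t=0 (Gaussian noise)
    to t=1 (motion latent) with a fixed-step Euler solver."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(dim)              # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t, cond)  # Euler update
    return z
```

Fewer steps trade fidelity for speed; a higher-order solver (e.g., Heun) could be swapped in without changing the interface.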
Notably: (1) the substantial gains on these alignment metrics reflect CMA-VAE's continuous representation and DPA's effectiveness in binding motion-text semantics; (2) the unified multi-task model consistently outperforms the single-task variant (UniMotion†), demonstrating positive cross-modal transfer from RGB supervision; (3) while single-task discrete methods obtain lower FID (MoMask: 0.045), reflecting their advantage in single-distribution fitting, UniMotion's leading semantic alignment highlights the cross-modal reasoning enabled by continuous representations within a unified framework. Furthermore, Diversity (9.583) closely matches the real data (9.503), confirming that alignment improvements preserve sample variability. As qualitatively confirmed in Fig. 4, MoMask's low FID does not prevent it from missing critical spatial constraints (e.g., hands only reaching shoulder level instead of above the head), while UniMotion faithfully renders fine-grained modifiers and produces more accurate M2T captions than MotionGPT.

Table 5: (a) Motion prediction on AMASS (HumanML3D subset), following MotionGPT [17]. (b) Text-conditioned motion editing on MotionFix; R@k is generated-to-target retrieval precision (%).

(a) Motion Prediction

| Method | FID↓ | ADE↓ | FDE↓ |
|---|---|---|---|
| Real | 0.002 | — | — |
| MDM [28] | 6.031 | 5.446 | 8.561 |
| MotionGPT [17] | 0.905 | 4.745 | 6.040 |
| UniMotion | 0.871 | 3.172 | 5.068 |

(b) Motion Editing

| Method | R@1↑ | R@3↑ | FID↓ |
|---|---|---|---|
| GT | 100.0 | 100.0 | — |
| MDM [28] | 39.10 | 54.84 | 0.917 |
| MG-MotionLLM [34] | 47.96 | 73.23 | 0.409 |
| UniMotion | 63.81 | 84.94 | 0.170 |

Motion-to-Text Understanding. As shown in Table 4, UniMotion leads all caption-quality metrics by substantial margins (BertScore: 41.2 vs. 36.7; CIDEr: 39.3 vs. 29.2; Bleu@4: 20.7 vs. 13.04), indicating semantically faithful and detail-rich descriptions rather than generic templates. We attribute this to CMA-VAE's continuous latents preserving kinematic nuances (e.g., stride patterns, joint articulations) that discrete tokenization discards.
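The R-Precision (R@k) and MMDist numbers cited throughout are standard retrieval-based alignment metrics computed in a joint text-motion embedding space (the common protocol uses fixed 32-way candidate batches; the sketch below simplifies to ranking each text's ground-truth motion against all candidates):

```python
import numpy as np

def retrieval_metrics(text_emb, motion_emb, k=3):
    """R@k and MMDist for N paired (text, motion) embeddings.
    Row i of each array is assumed to be a matched pair."""
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # (N, N) pairwise distances
    ranks = np.argsort(dist, axis=1)          # per-text candidate ranking
    hits = [i in ranks[i, :k] for i in range(len(text_emb))]
    r_at_k = float(np.mean(hits))             # ground-truth match in top-k
    mmdist = float(np.mean(np.diag(dist)))    # mean matched-pair distance
    return r_at_k, mmdist
```

Lower MMDist and higher R@k both indicate tighter text-motion alignment, which is why the two metrics move together in Tables 3 and 4.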
MG-MotionLLM achieves higher retrieval R@k, reflecting stronger embedding-level matching, while UniMotion's advantage lies in generation quality. The unified model further surpasses the single-task baseline (BertScore: 41.2 vs. 38.5), confirming positive cross-modal transfer from DPA's visual-semantic supervision.

Motion Prediction and Text-Conditioned Motion Editing. UniMotion outperforms all baselines on both tasks (Table 5). Beyond CMA-VAE's continuous latents for preserving temporal dynamics, DPA and visual co-training inject visual-geometry priors (e.g., plausible body configurations, global balance) that are especially helpful for forecasting future poses and editing local joints while maintaining whole-body coherence. For motion prediction, consistent gains in FID (0.871 vs. 0.905), ADE (3.172 vs. 4.745), and FDE (5.068 vs. 6.040) reflect these advantages. For motion editing, UniMotion substantially surpasses MG-MotionLLM (R@1: 63.81 vs. 47.96; R@3: 84.94 vs. 73.23; FID: 0.170 vs. 0.409), confirming that continuous representation combined with visual grounding enables precise text-guided modification at the joint level. Architecturally, the dual-path embedder is critical for these tasks: its semantic-generative decoupling enables the source motion to simultaneously provide high-level structural conditioning and fine-grained kinematic detail for the target.

Table 6: Comparison of different motion representation methods. APE and AVE denote mean-pose Absolute Position Error and Average Velocity Error, respectively, in cm and cm/s; APE, AVE, and FID measure reconstruction quality. All variants were trained independently during the pure reconstruction phase, followed by multi-task fine-tuning under the same LLM backbone. Bold: best; underline: second best.

| Motion Repr. | APE↓ | AVE↓ | FID↓ | T2M R@3↑ | M2T BertScore↑ | Pred ADE↓ | Edit R@3↑ |
|---|---|---|---|---|---|---|---|
| VQ-VAE (MotionGPT [17]) | 17.15 | 0.813 | 0.0674 | 0.771 | 35.0 | 4.713 | 69.71 |
| MLD-VAE (MLD [39]) | 9.28 | 0.981 | 0.1283 | 0.810 | 37.2 | 3.784 | 77.46 |
| CMA-VAE (Ours) | 3.53 | 0.428 | 0.0282 | 0.841 | 41.2 | 3.172 | 84.94 |

Table 7: Vision-to-Motion on Human3.6M (H3.6M). Bold: best in category for each group and evaluation setting under the same protocol.

| Method | Type | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|
| HMR [18] | Spec. | 100.7 | 67.7 |
| PyMAF [42] | Spec. | 64.2 | 44.9 |
| SMPLer [40] | Spec. | 50.8 | 37.3 |
| HMR2.0 [9] | Spec. | 52.2 | 37.1 |
| TokenHMR [7] | Spec. | 52.4 | 36.8 |
| Zolly [32] | Spec. | 55.0 | 35.9 |
| ChatPose [8] | MLLM | 146.7 | 92.4 |
| UniPose [21] | MLLM | 81.8 | 50.9 |
| UniMotion | MLLM | 75.0 | 46.1 |

Table 8: Motion-guided Image Editing (MGIE). UP+So2 is a two-stage text-mediated pipeline; OP+CN renders SMPL-derived skeletons as spatial conditions. Mot. Acc. is the hit rate based on HMR2.0-estimated pose, using PA-MPJPE ≤ 100.0 mm as the success threshold.

| Method | FID↓ | CLIP↑ | Mot.Acc↑ |
|---|---|---|---|
| UniPose+Show-o2 [21,38] | 26.16 | 0.22 | 0.50 |
| OpenPose+ControlNet [4,44] | 22.34 | 0.29 | 0.59 |
| UniMotion | 18.92 | 0.31 | 0.67 |

Table 9: Ablation of DPA and LRA on T2M, M2T, Motion Prediction, Motion Editing, and Vision-to-Motion.

| Setting | T2M R@3↑ | M2T BertScore↑ | MotionPred ADE↓ | MotionEdit R@3↑ | V2M MPJPE↓ |
|---|---|---|---|---|---|
| w/o DPA | 0.818 | 38.4 | 3.654 | 80.35 | 83.1 |
| w/o LRA | 0.801 | 38.1 | 3.777 | 78.72 | 84.3 |
| Full UniMotion | 0.841 | 41.2 | 3.172 | 84.94 | 75.0 |

Vision-to-Motion and Motion-guided Image Editing (MGIE). As reported in Table 7, among MLLM methods, UniMotion significantly surpasses the strong UniPose baseline on both MPJPE (75.0 vs. 81.8) and PA-MPJPE (46.1 vs. 50.9), benefiting from DPA's visual-geometry priors and CMA-VAE's precise continuous encoding. The remaining gap to specialist methods is expected for a general-purpose framework. For Motion-guided Image Editing (MGIE), UniMotion is the first unified motion-aware MLLM to support end-to-end motion-conditioned image generation within a shared latent space, without requiring explicit skeleton rendering or text mediation.
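The pose and trajectory errors used above (MPJPE, Procrustes-aligned PA-MPJPE, ADE, FDE) are standard and easy to state precisely; minimal numpy implementations, with array shapes as assumptions:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) pose sequences."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def pa_mpjpe(pred, gt):
    """MPJPE after per-frame Procrustes alignment (translation/rotation/scale)."""
    errs = []
    for p, g in zip(pred, gt):
        p0, g0 = p - p.mean(0), g - g.mean(0)   # remove translation
        u, sv, vt = np.linalg.svd(p0.T @ g0)    # Kabsch: H = P^T G = U S V^T
        sign = np.sign(np.linalg.det(u @ vt))   # guard against reflections
        sv[-1] *= sign
        vt[-1] *= sign
        scale = sv.sum() / (p0 ** 2).sum()      # optimal similarity scale
        aligned = scale * p0 @ u @ vt           # rotate + scale pred onto gt
        errs.append(np.mean(np.linalg.norm(aligned - g0, axis=-1)))
    return float(np.mean(errs))

def ade_fde(pred, gt):
    """Average / final displacement error for (T, D) trajectories."""
    d = np.linalg.norm(pred - gt, axis=-1)
    return float(d.mean()), float(d[-1])
```

PA-MPJPE discounts global rotation, translation, and scale, which is why it is the more forgiving of the two pose metrics in Table 7.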
We compare against UniPose+Show-o2 [21,38] (a text-mediated two-stage pipeline) and OpenPose+ControlNet [4,44] (skeleton-conditioned editing). As shown in Table 8, UniMotion outperforms both on all metrics, with the largest margin on Motion Accuracy (0.67 vs. 0.59), confirming that end-to-end latent-space reasoning avoids the information loss of staged approaches.

4.3 Ablation Studies

CMA-VAE Motion Representation Comparison. To systematically evaluate CMA-VAE, we compare against VQ-VAE (MotionGPT [17]) and MLD-VAE (MLD [39]) in Table 6. VQ-VAE suffers the worst absolute position accuracy (APE=17.15) due to irreversible quantization errors, and transfers weakest to downstream tasks (T2M R@3=0.771, Edit R@3=69.71); its seemingly lower AVE (0.813) and FID (0.0674) compared to MLD-VAE reflect the bounded codebook artificially limiting deviations rather than faithful reconstruction. MLD-VAE improves global positioning (APE: 9.28) yet exhibits a position-velocity trade-off (AVE: 0.981, FID: 0.1283), confirming that a plain continuous VAE without cross-modal anchoring loosens temporal dynamics. Nevertheless, MLD-VAE substantially outperforms VQ-VAE on all downstream tasks, demonstrating that continuous latent spaces inherently facilitate LLM cross-modal alignment regardless of absolute reconstruction quality. CMA-VAE, calibrated by DPA, resolves this trade-off and achieves the best results on all metrics, both reconstruction (APE=3.53, AVE=0.428, FID=0.0282) and downstream transfer (T2M R@3=0.841, Edit R@3=84.94). Adding DPA to the same architecture improves all tasks (T2M R@3: 0.818→0.841; Edit R@3: 80.35→84.94), confirming that implicit visual-semantic supervision is the core driver of CMA-VAE's superiority.

DPA and LRA Ablation. As shown in Table 9, removing either component causes consistent performance drops across all tasks.
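At its core, DPA's posterior alignment minimizes a KL divergence between two diagonal Gaussian posteriors (the vision-fused teacher and the motion-only student), which has a closed form. A numpy sketch of that term (the KL direction and any stop-gradient treatment of the fused posterior are the paper's design choices, detailed in the supplementary, and are not asserted here):

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between diagonal Gaussians
    q = N(mu_q, exp(logvar_q)) and p = N(mu_p, exp(logvar_p))."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q
                + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return float(np.sum(kl))
```

Because the term is zero only when both mean and variance match, it distills not just the teacher's point estimate but its uncertainty structure into the motion-only encoder.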
Without LRA, the motion pathway is insufficiently calibrated by sparse cross-modal supervision alone, leading to drops on T2M (R@3: 0.801 vs. 0.841), M2T (BertScore: 38.1 vs. 41.2), Motion Prediction (ADE: 3.777 vs. 3.172), Motion Editing (R@3: 78.72 vs. 84.94), and Vision→M (MPJPE: 84.3 vs. 75.0). Without DPA, the representation loses explicit visual-semantic alignment and likewise remains consistently below the full model across all tasks. Notably, DPA's gains extend to HumanML3D-only tasks (T2M, M2T) despite lacking paired images in that dataset, evidencing implicit knowledge transfer through shared encoder parameters. The two mechanisms are complementary: LRA calibrates the motion pathway's geometric capacity via dense self-supervision, while DPA enriches representational semantics through cross-modal posterior alignment.

More results, including qualitative comparisons across T2M/M2T/Prediction/Editing/Vision-to-Motion, Vision-to-Text evaluations, the architecture design validation, as well as further analyses of DPA/LRA and zero-shot 3DPW generalization, are provided in the Supplementary Material.

5 Conclusion

We presented UniMotion, to our knowledge the first continuous Motion-Text-RGB unified framework in the motion-LLM setting for human-centric multimodal reasoning and unified understanding and generation within a single architecture. By treating motion as a continuous modality on equal footing with RGB, UniMotion constructs symmetric continuous pathways through CMA-VAE and a dual-path embedder, avoiding the quantization artifacts of discrete tokenization. Dual-Posterior KL Alignment (DPA) and Latent Reconstruction Alignment (LRA) jointly establish a well-aligned tri-modal latent space, yielding strong results across diverse tasks with clear advantages on cross-modal compositional tasks. Limitations are discussed in the supplementary material.
References

1. Athanasiou, N., Cseke, A., Diomataris, M., Black, M.J., Varol, G.: MotionFix: Text-driven 3d human motion editing. In: SIGGRAPH Asia 2024 Conference Papers. pp. 1–11 (2024)
2. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv (2023)
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
4. Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., Sheikh, Y.A.: OpenPose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
5. Chen, L.H., Lu, S., Zeng, A., Zhang, H., Wang, B., Zhang, R., Zhang, L.: MotionLLM: Understanding human behaviors from human motions and videos. arXiv preprint arXiv:2405.20340 (2024)
6. Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)
7. Dwivedi, S.K., Sun, Y., Patel, P., Feng, Y., Black, M.J.: TokenHMR: Advancing human mesh recovery with a tokenized pose representation. In: CVPR. pp. 1323–1333 (2024)
8. Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: ChatPose: Chatting about 3d human pose. In: CVPR. pp. 2093–2103 (2024)
9. Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4D: Reconstructing and tracking humans with transformers. In: CVPR. pp. 14783–14794 (2023)
10. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
11. Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3d human motions. In: CVPR. pp. 1900–1910 (2024)
12. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: CVPR. pp. 5152–5161 (June 2022)
13. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: CVPR. pp. 5152–5161 (2022)
14. Guo, C., Zuo, X., Wang, S., Cheng, L.: TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. In: ECCV (2022)
15. Hu, L., Ye, Y., Xia, S.: HMVLM: Human motion-vision-language model via MoE LoRA. arXiv preprint arXiv:2511.01463 (2025)
16. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI 36(7), 1325–1339 (2013)
17. Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., Chen, T.: MotionGPT: Human motion as a foreign language. Advances in Neural Information Processing Systems 36, 20067–20079 (2023)
18. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR. pp. 7122–7131 (2018)
19. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
20. Li, P., Wang, Z., Yuan, Y., Liu, H., Meng, X., Yuan, J., Liu, M.: UST-SSM: Unified spatio-temporal state space models for point cloud video modeling. In: ICCV. pp. 6738–6747 (2025)
21. Li, Y., Hou, R., Chang, H., Shan, S., Chen, X.: UniPose: A unified multimodal framework for human pose comprehension, generation and editing. In: CVPR. pp. 27805–27815 (2025)
22. Li, Z., Yuan, W., He, Y., Qiu, L., Zhu, S., Gu, X., Shen, W., Dong, Y., Dong, Z., Yang, L.T.: LaMP: Language-motion pretraining for motion generation, retrieval, and captioning. arXiv preprint arXiv:2410.07093 (2024)
23. Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: ICLR (2023), https://openreview.net/forum?id=PqvMRDCJT9t
24. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
25. Lou, Y., Zhu, L., Wang, Y., Wang, X., Yang, Y.: DiverseMotion: Towards diverse human motion generation via discrete diffusion. arXiv preprint arXiv:2309.01372 (2023)
26. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR. pp. 5693–5703 (2019)
27. Team, C.: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024)
28. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Bermano, A.H., Cohen-Or, D.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
29. Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in Neural Information Processing Systems 30 (2017)
30. Von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3d human pose in the wild using IMUs and a moving camera. In: ECCV. pp. 601–617 (2018)
31. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
32. Wang, W., Ge, Y., Mei, H., Cai, Z., Sun, Q., Wang, Y., Shen, C., Yang, L., Komura, T.: Zolly: Zoom focal length correctly for perspective-distorted human mesh reconstruction. In: ICCV. pp. 3925–3935 (2023)
33. Wang, Z., Li, P., Liu, H., Deng, Z., Wang, C., Liu, J., Yuan, J., Liu, M.: Recognizing actions from robotic view for natural human-robot interaction. In: ICCV. pp. 14218–14227 (2025)
34. Wu, B., Xie, J., Shen, K., Kong, Z., Ren, J., Bai, R., Qu, R., Shen, L.: MG-MotionLLM: A unified framework for motion comprehension and generation across multiple granularities. In: CVPR. pp. 27849–27858 (2025)
35. Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., et al.: Janus: Decoupling visual encoding for unified multimodal understanding and generation. arXiv preprint arXiv:2410.13848 (2024)
36. Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: Any-to-any multimodal LLM. In: ICML (2024)
37. Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal understanding and generation. arXiv (2024)
38. Xie, J., Yang, Z., Shou, M.Z.: Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564 (2025)
39. Xin, C., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., Yu, J., Yu, G.: Executing your commands via motion diffusion in latent space. In: CVPR (June 2023)
40. Xu, X., Liu, L., Yan, S.: SMPLer: Taming transformers for monocular 3d human shape and pose estimation. TPAMI (2023)
41. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV. pp. 11975–11986 (2023)
42. Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: PyMAF: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: ICCV (2021)
43. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. In: CVPR (2023)
44. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: ICCV. pp. 3836–3847 (2023)
45. Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., Liu, Z.: MotionDiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(6), 4115–4128 (2024)
46. Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039 (2024)

In this supplementary material, we provide comprehensive additional results, detailed architectural specifications, and in-depth analyses to further validate the effectiveness and reproducibility of our proposed framework, UniMotion. The content is organized as follows:

– Sec. A (Additional Experimental Results) presents extended quantitative results for Vision-to-Text and exhaustive ablation studies on the Dual-Path embedder, motion representations (CMA-VAE vs. VQ), hybrid attention, and modality-routed LoRA.
– Sec. B (Qualitative Results) provides extensive visualizations across all seven tasks, including text-to-motion, motion-to-text, prediction, editing, and vision-conditioned synthesis, alongside a multi-domain (spatial, frequency, and temporal) analysis of motion reconstruction quality.
– Sec. C (Further Analysis) details the mathematical intuition behind DPA (mode-seeking distillation), proves the non-triviality of LRA via information bottleneck analysis, and evaluates the model's zero-shot generalization to out-of-distribution datasets like 3DPW.
– Sec. D (Architecture and Implementation Details) specifies the 269-dimensional unified motion representation, the CMA-VAE design (including motion-guided sampling), and the formulation of our flow matching generation heads and pose-aware vision backbone.
– Sec. E (Training Pipeline, Data, and Evaluation) outlines the progressive multi-stage training configuration, details our tri-modal data construction strategies (H3.6M, MotionFix, etc.), and clarifies the evaluation protocols for all benchmark metrics.
– Sec. F (Limitations and Broader Impact) discusses the current constraints regarding computational overhead and domain-specific visual alignment, while reflecting on the potential societal benefits and ethical considerations of motion-aware AI.

A Additional Experimental Results

A.1 Vision-to-Text

Table 10: Vision-to-Text on H3.6M. Bold: best; underline: second best.

| Method | Param. | BLEU-4↑ | ROUGE-L↑ | METEOR↑ |
|---|---|---|---|---|
| Show-o2 [38] | 1.5B | 12.1 | 33.9 | 35.7 |
| Qwen-2.5-VL [2] | 7B | 16.5 | 37.4 | 39.6 |
| UniPose [21] | 7B | 17.3 | 34.9 | 38.6 |
| UniMotion (Ours) | 1.5B | 21.9 | 38.0 | 41.7 |

We first report Vision-to-Text on H3.6M, which probes whether the visual pathway can extract human pose semantics from visual observations and translate them into natural language. UniMotion substantially surpasses both general-purpose MLLMs and the specialized UniPose on all metrics with a significantly smaller parameter count (1.5B vs. 7B). The improvement over UniPose (BLEU-4: 21.9 vs. 17.3) is attributed to the CMA-VAE motion encoder learning fine-grained visual-semantic priors via DPA training, combined with the pose-aware vision backbone providing body-structure-aware features.
We then extend the same Vision-to-Text interface from the H3.6M setting to temporally richer video inputs. UniMotion uniformly samples frames from each video and processes them through the image pathway. We evaluate this on the MoVid dataset [5], which contains diverse real-world human motion videos with descriptive captions. The corresponding qualitative visualizations are presented later in Sec. B.6, where we separately show H3.6M Vision-to-Text examples (single-frame) and MoVid Vision-to-Text examples (multi-frame) to distinguish the static and dynamic settings.

A.2 Architecture Design Validation

We ablate two orthogonal design dimensions in the motion processing pipeline. (1) Embedder design: Gen-Branch Only retains only the Generation Branch (MLP direct projection + learnable positional encoding, mirroring PatchEmbed on the RGB side, preserving fine-grained kinematic details at the cost of semantic abstraction); Sem-Branch Only retains only the Semantic Branch (MLP + N_s = 4 Transformer Encoder layers, mirroring SigLIP, providing high-level semantic features at the cost of detail preservation); Dual-Path combines both via RMSNorm+MLP fusion (our full design, Sec. 3.3 of the main paper). (2) Motion representation: VQ uses MotionGPT-style discrete tokenization (K=512 codebook); VAE (w/o DPA) is a plain continuous VAE sharing CMA-VAE's architecture but trained without the DPA alignment loss; CMA-VAE is our full Cross-Modal Aligned Motion VAE trained with DPA. VQ representations are architecturally incompatible with DPA's continuous Gaussian posterior alignment; comparisons are therefore within equivalent conditions per representation class.

Four observations emerge from Table 11. (1) Continuous representations consistently outperform VQ across all embedder designs.
Replacing VQ with CMA-VAE yields substantial gains at every level of path complexity, confirming that quantization errors fundamentally constrain both generation fidelity and cross-modal alignment regardless of the embedder architecture. (2) The two branches exhibit complementary functional specialization. Sem-Branch Only achieves higher M2T BertScore (40.1 vs. 36.3 for Gen-Branch Only), reflecting the Transformer encoder’s stronger semantic compression; Gen- Branch Only leads on T2M R@3 (0.824 vs. 0.798) and Motion Editing (70.40 vs. 68.85), where fine-grained kinematic detail must be preserved for the flow head. Neither branch alone approaches Dual-Path performance, confirming functional complementarity rather than redundancy. (3) DPA provides targeted gains beyond VAE continuity. Dual-Path + VAE (w/o DPA) already achieves T2M R@3 0.818 and Edit R@3 80.35; adding DPA pushes these to 0.841 and 84.94, with the largest incremental gains on tasks that benefit from injected visual-semantic priors. (4) Dual-Path is essential for motion-conditioned synthesis. The performance gap between single-branch and Dual-Path variants is 20Z. Wang et al. Table 11: Architecture design ablation using representative metrics. Gen-Branch Only: Generation Branch only (MLP + PosEmbed). Sem-Branch Only: Semantic Branch only (MLP + Transformer Encoder). Dual-Path: both branches fused (ours). VAE (w/o DPA): plain continuous VAE, same architecture as CMA-VAE but without DPA training. CMA-VAE: with DPA. VQ is incompatible with DPA by design. All models share the same hyperparameters and training duration. Embedder Design Repr. 
T2M R@3↑ M2T BertScore↑ Edit R@3↑ Pred ADE↓ Gen-Branch Only VQ-VAE0.75234.264.57 5.128 Gen-Branch Only CMA-VAE0.82436.370.40 4.862 Sem-Branch Only CMA-VAE0.79840.168.85 4.181 Dual-PathVQ-VAE0.77135.069.71 4.713 Dual-PathMLD-VAE (MLD [39]) 0.81037.277.46 3.784 Dual-PathVAE (w/o DPA)0.81838.480.35 3.654 Dual-Path (Ours)CMA-VAE0.84141.284.943.172 largest for Editing and Prediction—tasks where motion serves simultaneously as conditioning input and generation target—because the semantic branch preserves high-level structural intent while the generation branch maintains fine-grained joint-level detail. Gen-Branch Only + CMA-VAE on Motion Prediction (ADE 4.862) underperforms MotionGPT (4.745), while Dual-Path + CMA-VAE (3.172) substantially surpasses it, confirming that the semantic–generative decoupling of the dual-path design is architecturally essential. A.3 Hybrid Attention Ablation Table 12: Hybrid attention ablation on T2M, M2T, and motion editing using repre- sentative metrics. Attention Strategy T2M R@3↑ M2T BertScore↑ Edit R@3↑ Global Causal Only0.82539.379.6 Hybrid: Causal + Intra-Motion Full (Ours)0.84141.284.94 Global causal attention restricts motion tokens to unidirectional temporal interaction, lowering T2M alignment (R@3: 0.825 vs. 0.841) and motion editing precision (R@3: 79.6 vs. 84.94). Hybrid attention reconciles the needs of motion generation and language modeling: full attention within motion spans matches the flow matching training objective, while global causal ordering preserves text autoregressive modeling and improves motion-to-text understanding (BertScore: 41.2 vs. 39.3). UniMotion21 Table 13: Modality-routed LoRA ablation on four representative tasks. LoRA Strategy T2M R@3↑ M2T BertScore↑ V2M MPJPE↓ V2T BLEU-4↑ LLM Frozen (No LoRA) 0.75234.199.613.7 Shared LoRA0.81838.890.418.8 Routed LoRA (Ours)0.84141.275.021.9 A.4 Modality-Routed LoRA Ablation Routed LoRA achieves the best results across all tasks. 
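Modality-routed LoRA keeps a single frozen base weight and deterministically selects a per-modality low-rank update. A minimal numpy sketch of the forward pass (the routing-by-modality-id interface and names are illustrative assumptions, not the paper's API):

```python
import numpy as np

def routed_lora_forward(x, modality, w_base, adapters, alpha=1.0):
    """y = x W^T + alpha * (x A_m^T) B_m^T, where (A_m, B_m) is the
    rank-r adapter pair selected by the token's modality id.
    Shapes: x (n, d_in), w_base (d_out, d_in), A (r, d_in), B (d_out, r)."""
    a, b = adapters[modality]              # deterministic routing: no gating net
    return x @ w_base.T + alpha * (x @ a.T) @ b.T
```

Following standard LoRA practice, B starts at zero so each adapter initially leaves the frozen layer unchanged; with rank 32 and one adapter set per modality, the added parameters stay around the ∼2% overhead quoted above.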
The LLM Frozen baseline under-adapts to motion-centric supervision, while Shared LoRA couples heterogeneous gradients from different modalities within a single parameter branch. Routed LoRA decouples modality-specific adaptation with deterministic routing and only ∼2% additional parameters.

B Qualitative Results

B.1 Text-to-Motion Generation

Figure 5 highlights a consistent qualitative pattern across diverse prompts. In the first row, UniMotion better preserves explicit limb constraints such as "with his hands out", "hands together above their head", and "extending their right leg", whereas the baselines often under-execute the target arm or leg motion. In the lower examples, UniMotion also more faithfully captures temporal modifiers and action composition, such as walking forward then sitting and walking while repeatedly reaching down, indicating stronger control over both local pose details and global motion progression.

B.2 Motion-to-Text Understanding

Figure 6 further illustrates that UniMotion tends to recover the dominant motion semantics while avoiding the generic or incorrect descriptions produced by MotionGPT. For example, UniMotion correctly identifies walking in a counterclockwise circle, curving to the right, and strumming a guitar, whereas the baseline either drops the directional cue, introduces unrelated actions, or misclassifies the action entirely. This qualitative behavior is consistent with the quantitative gains in caption-quality metrics reported in the main paper.

B.3 Motion Prediction

Figure 7 shows representative motion forecasting examples. Given the observed prefix, UniMotion extrapolates future body trajectories with stable global balance and temporally smooth limb evolution. Compared with discrete baselines, the predicted continuation better preserves action trend consistency, especially in turning, stepping, and arm-swing phases where accumulated drift is most visible.
a person who is standing with his hands out from his sides takes four slow steps forward and stops. a person standing up strikes their handstogether well above their head. the person extending their right leg. A person slowly walked forward, and sit downon somewhere a person walks forward, and repeatedly reaches down then shakes something MotionGPT MoMask Ours MotionGPT MoMask Ours Fig. 5: Qualitative comparison of text-driven motion generation on HumanML3D [12]. Each column corresponds to one text prompt, and rows show outputs from Mo- tionGPT [17], MoMask [11], and UniMotion. Red text highlights the key prompt constraints, while red dashed boxes mark prompt–motion mismatches in the baseline outputs, including missing body-part constraints, incorrect motion trajectories, and weak temporal modifiers. UniMotion produces motions with closer prompt correspon- dence and more coherent temporal transitions. UniMotion23 a person walks in a counterclockwise circle. a person walks around and stops. a person walks in a counterclockwise circle and then stops. a person who walked in a turn to their right a person leans forward, and rolls. the person then proceeds to run forwards. a person walking curving to the right. a person briefly strums a guitar. a person is walking backwards. a person strums a guitar. Input Motion GT MotionGPT Ours Fig. 6: Qualitative comparison of motion captioning (Motion-to-Text). Each column shows the input motion, the ground-truth caption, the MotionGPT prediction, and the UniMotion prediction. Red phrases in the ground-truth and UniMotion captions highlight the key motion semantics, while purple phrases in the MotionGPT output indicate misaligned descriptions or hallucinated actions. UniMotion more precisely translates fine-grained joint articulations and temporal sequence states into accurate, fluent natural language. Input Motion Input+Pred Motion Fig. 7: Qualitative visualization of motion prediction. 
Each column shows the observed input prefix (top) and the predicted continuation overlaid on the input motion (bottom), where the input segment is shown in yellow and the forecasted future is shown in purple. UniMotion extrapolates future motion with more consistent global trajectory direction, body balance, and temporally smooth joint evolution over long horizons.

B.4 Text-Conditioned Motion Editing

Figure 8 presents qualitative results on MotionFix-style editing instructions. UniMotion performs targeted motion modification while retaining the unedited content of the source sequence. The edits are more localized and semantically precise, particularly for instructions involving limb-specific changes, speed modulation, and posture refinement.

Fig. 8: Qualitative comparison of text-conditioned motion editing. Each column shows the editing instruction, the original motion, and the edited result from UniMotion. Red text highlights the target edit attributes in the instruction, and red dashed boxes mark the body regions primarily affected by the edit. UniMotion executes the requested change more accurately while better preserving the original motion content outside the edited regions.

B.5 Vision-to-Motion

Figure 9 visualizes representative Vision-to-Motion results on Human3.6M. UniMotion recovers body structure with more accurate limb orientation and torso alignment, while maintaining strong consistency across sampled visual observations. The qualitative gains are most apparent in asymmetric poses and cases with large arm articulation.
B.6 Vision-to-Text: Image and Video Inputs

Vision-to-Text spans two temporal regimes under the same vision-conditioned language interface: single-frame pose description on H3.6M, and multi-frame dynamic action understanding on MoVid. Figures 10 and 11 present these two settings separately: the first focuses on H3.6M Vision-to-Text from a single image, while the second shows how the same interface scales to video-level Vision-to-Text on MoVid.

Fig. 9: Qualitative Vision-to-Motion results. For each example, we show a sequence of sampled visual inputs together with the corresponding motion outputs recovered by UniMotion. The predicted body structure remains consistent across observations, with accurate limb orientation, stable torso alignment, and clear whole-body geometry.

In Figure 10, UniMotion consistently captures static body configuration cues that are visually prominent in the input frame, including facing direction, arm extension, seated vs. standing posture, and coarse limb placement. The examples show that the model preserves the core pose semantics even when the generated wording is not identical to the reference, which is particularly important for human-centric visual understanding where multiple valid descriptions may exist for the same pose.

Figure 11 shows that the same interface extends beyond static pose description to dynamic action understanding from sparsely sampled video frames. In these examples, UniMotion not only identifies the action category (e.g., animal-mimicking motion, squat, and running-jump sequences), but also reconstructs temporally ordered sub-actions such as lunging, lowering the hips, airborne kicking, and landing. This suggests that the model is not merely recognizing isolated poses, but inferring the motion phase transition across frames.
GT: The person stands upright on a red mat, facing away, arms slightly bent and moving, suggesting rhythmic motion, possibly dancing or exercising, with legs straight and feet shoulder-width apart.
Ours: Person stands upright, facing away, arms slightly bent at sides. Legs straight, feet shoulder-width apart. Posture neutral, suggesting readiness or stillness. No visible motion, implying static pose or paused movement in a controlled environment.

GT: The person stands barefoot on a red carpet, arms extended horizontally to the sides, maintaining a stable, symmetrical posture with a slight forward lean, suggesting a controlled, stationary pose possibly for motion capture or exercise.
Ours: Person stands upright, arms extended horizontally to sides, palms facing down. Legs straight, feet shoulder-width apart. Body centered, facing forward. Appears to be in a static pose, possibly for motion capture or exercise demonstration.

GT: The person sits on a chair, leaning forward with head down, arms resting on knees, feet flat on floor, body hunched in a static, low posture suggesting rest or contemplation.
Ours: The person sits on a chair, leaning forward with head down, hands resting on knees, legs bent, feet flat on floor, appearing to be in a static, contemplative posture with minimal visible motion.

GT: The person stands upright on a red carpet, arms slightly bent and moving, wearing a blue shirt, green shorts, and black shoes, with motion capture markers visible on their body and headband.
Ours: The person stands upright on a red carpet, arms slightly bent and hands near chest, wearing a red top and green shorts with motion-capture markers, suggesting readiness for movement or performance.

Fig. 10: Qualitative Vision-to-Text results on H3.6M. Each example shows the visual input together with the ground-truth description and the caption generated by UniMotion. Red phrases highlight the key pose-aware semantics shared by the reference and the prediction.
UniMotion converts visual input into concise language that accurately describes body configuration, limb orientation, and local articulation from static visual evidence.

UniMotion: The man mimics a chimpanzee by lunging forward. He simultaneously swings his arms back and drives his knees upward to leap onto a chair. Landing in a deep squat, he stabilizes himself by gripping the seat with his hands and feet.

UniMotion: The person performs a sumo squat. Starting from a wide stance with toes pointed out, she lowers her hips by bending the knees and flexing the hips. Simultaneously, she brings her hands together at chest height, then extends her legs to return to a standing position.

UniMotion: A man runs forward, launches into the air by jumping off his left leg, and performs a mid-air front kick with his right leg while extending both arms outward. He then retracts his leg, lands on his right foot, and continues running forward.

Fig. 11: Qualitative Vision-to-Text results on MoVid (video inputs). Each example shows multiple sampled video frames followed by the description generated by UniMotion, with red phrases highlighting the key temporal action components. Given sparse frame observations, UniMotion produces temporally coherent descriptions of the full action, capturing action category, posture evolution, and phase transitions.

B.7 Motion Reconstruction Quality

We systematically compare the reconstruction quality of CMA-VAE, VQ-VAE (MotionGPT [17]), and MLD-VAE (MLD [39]) through four complementary analyses spanning the statistical, spatial, frequency, and temporal domains. All visualizations are computed on the common sample set—the intersection of successfully reconstructed samples across all three models on the HumanML3D test split—ensuring strictly fair comparison. Joint positions follow the standard HumanML3D 22-joint convention; all errors are reported in millimeters (mm). As reported in Table 6 of the main paper, CMA-VAE achieves the best reconstruction across all metrics (APE=3.53 cm, AVE=0.428 cm/s, FID=0.0282), substantially outperforming both VQ-VAE (APE=17.15, AVE=0.813, FID=0.0674) and MLD-VAE (APE=9.28, AVE=0.981, FID=0.1283). The following visualizations provide deeper insight into the nature of these differences.

Dataset-Level Error Distribution. Figure 12 shows the cumulative distribution function (CDF) of per-sequence MPJPE. The left panel zooms into the 0–100 mm range where the majority of CMA-VAE sequences concentrate, while the right panel reveals the full distributional picture up to 3000 mm. CMA-VAE’s curve is uniformly left-shifted: at a 40 mm threshold, approximately 70% of CMA-VAE sequences fall below it, compared to ∼30% for VQ-VAE and ∼35% for MLD-VAE. In the full-range view, VQ-VAE exhibits a prominent heavy tail extending beyond 2000 mm, whereas CMA-VAE converges almost entirely within 600 mm. This confirms that CMA-VAE’s advantage is a dataset-wide distributional shift rather than gains on a narrow subset of easy samples.

Fig. 12: Cumulative distribution of per-sequence reconstruction MPJPE for CMA-VAE, VQ-VAE, and MLD-VAE. Left: zoomed view (0–100 mm) with reference thresholds at 40, 60, and 80 mm. Right: full-range view (0–3000 mm), where the green dashed box indicates the region enlarged in the left panel. CMA-VAE’s CDF is uniformly left-shifted, indicating lower reconstruction error across the entire test set. VQ-VAE exhibits a long heavy tail (>2000 mm), reflecting catastrophic quantization failures on a subset of sequences.

Joint-wise Spatial Error Analysis. Figure 13 provides a per-joint breakdown of the dataset-averaged reconstruction error.
Each column corresponds to one model, with all 22 joints listed vertically and the numerical error (mm) annotated. End-effector joints (ankles, feet, wrists, marked with ⋆) are highlighted with black borders, as they accumulate the largest kinematic-chain errors. Three key observations emerge: (1) CMA-VAE achieves uniformly low error across all joints (37–46 mm), with minimal variation between trunk and end-effectors—evidence that the continuous latent space preserves both global pose structure and distal articulation fidelity. (2) VQ-VAE exhibits systematically elevated error (155–213 mm), with wrists suffering the most (left/right wrist: 212.9/211.7 mm), confirming that discrete codebook quantization disproportionately degrades high-amplitude end-effector dynamics. (3) MLD-VAE falls between the two (90–140 mm), but its wrist errors (139.1/139.9 mm) are notably higher than its trunk errors (∼91 mm), indicating that a plain continuous VAE without cross-modal anchoring still struggles with end-effector fidelity.

Residual Frequency Spectrum Analysis. To complement the spatial analyses above, we examine reconstruction fidelity in the frequency domain.
For each sequence, we compute the residual signal $r_{j,t} = p^{\mathrm{pred}}_{j,t} - p^{\mathrm{gt}}_{j,t}$ across all joints and coordinate axes, apply the FFT, and average the resulting power spectra over up to 1000 randomly sampled sequences.

Fig. 13: Joint-wise spatial error heatmap (dataset average, mm). Each column shows one model, and rows correspond to the 22 HumanML3D joints, with the numerical error annotated in each cell. Color encodes error magnitude (blue = low, red = high), and end-effector joints (⋆) are highlighted with black borders. CMA-VAE maintains uniformly low error across all joints, while VQ-VAE and MLD-VAE exhibit disproportionately elevated end-effector errors.
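The residual-spectrum procedure just described can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the 22-joint, 20 fps shapes follow the text, while the toy signal is invented to show how band energies are read off.

```python
import numpy as np

def residual_power_spectrum(pred, gt, fps=20):
    """Mean power spectrum of the reconstruction residual.

    pred, gt: (T, J, 3) joint positions for one sequence.
    Returns (freqs, power): one-sided frequencies in Hz and the residual
    power averaged over all joints and coordinate axes.
    """
    r = pred - gt                          # residual signal r_{j,t}
    spec = np.fft.rfft(r, axis=0)          # FFT along the time axis
    power = np.abs(spec) ** 2              # power spectrum
    freqs = np.fft.rfftfreq(r.shape[0], d=1.0 / fps)
    return freqs, power.mean(axis=(1, 2))

def band_energy(freqs, power, lo, hi):
    """Total residual energy within the frequency band [lo, hi) Hz."""
    mask = (freqs >= lo) & (freqs < hi)
    return power[mask].sum()

# Toy example: a 3 Hz sinusoidal error concentrates in the mid band (2-6 Hz).
T, J = 200, 22
t = np.arange(T) / 20.0
gt = np.zeros((T, J, 3))
pred = gt + 0.01 * np.sin(2 * np.pi * 3.0 * t)[:, None, None]
freqs, power = residual_power_spectrum(pred, gt)
low = band_energy(freqs, power, 0, 2)
mid = band_energy(freqs, power, 2, 6)
assert mid > low
```

At 20 fps the one-sided spectrum ends exactly at the 10 Hz Nyquist limit, matching the band partition used in the analysis.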
The frequency axis ranges from 0 to the Nyquist limit of 10 Hz (HumanML3D operates at 20 fps), partitioned into three physically meaningful bands: low (0–2 Hz), capturing gross locomotion; mid (2–6 Hz), corresponding to limb swing and gait cycles; and high (6–10 Hz), encoding rapid articulation and contact dynamics.

As shown in Figure 14, CMA-VAE achieves the lowest residual spectral energy across all frequency bands, with approximately one order of magnitude separation from VQ-VAE throughout the mid- and high-frequency ranges. The high-frequency inset (6–10 Hz) is particularly revealing: VQ-VAE’s residual energy plateaus at $\sim 2\times 10^{-2}$, indicating that discrete codebook switching injects spurious high-frequency artifacts into the reconstructed motion—precisely the “temporal jitter” described qualitatively in the main paper. MLD-VAE shows moderately lower high-frequency residuals than VQ-VAE but remains substantially above CMA-VAE, consistent with its intermediate reconstruction quality. CMA-VAE’s uniformly low spectral residuals confirm that DPA-calibrated continuous representations faithfully preserve motion dynamics at all temporal scales.

Fig. 14: Residual frequency spectrum analysis. The main plot shows the mean residual spectral energy (log scale) versus frequency (Hz), with background shading denoting low (0–2 Hz), mid (2–6 Hz), and high (6–10 Hz) bands. The inset provides a magnified view of the high-frequency range (6–10 Hz). CMA-VAE (blue) maintains the lowest residual energy across all bands, while VQ-VAE (red) exhibits elevated high-frequency residuals from discrete codebook quantization artifacts.

Temporal Velocity Fidelity and Jitter Analysis. Finally, we quantify temporal smoothness by analyzing the acceleration signal (second-order finite difference of joint position), whose standard deviation serves as a direct measure of motion jitter. Figure 15 plots the per-frame acceleration for the right wrist and right ankle on a high-dynamic sample (M009968, selected from the top-5 fastest sequences by 95th-percentile velocity). CMA-VAE’s acceleration profile closely tracks the ground truth: on the right wrist, jitter std = 32.85 vs. GT = 33.06 (0.6% relative deviation); on the right ankle, jitter std = 58.91 vs. GT = 59.21 (0.5% deviation). In contrast, VQ-VAE produces noticeably jittery acceleration with higher peak magnitudes, reflecting abrupt code-switching during rapid transitions. MLD-VAE exhibits over-smoothed acceleration with systematically reduced peak amplitudes, consistent with its higher AVE (0.981 cm/s) caused by temporal detail loss. These results provide direct time-domain evidence that CMA-VAE preserves both the magnitude and temporal structure of motion dynamics with near-ground-truth fidelity.

C Further Analysis

C.1 DPA Stability and Reverse KL Analysis

Reverse KL direction: mode-seeking distillation. The DPA alignment loss adopts $D_{\mathrm{KL}}(q_\phi(z \mid m) \,\|\, q_\psi(z \mid m, v))$ with $q_\psi$ detached (stop-gradient) as the alignment target. In the knowledge-distillation convention where $q_\psi$ (vision-fused) is the teacher and $q_\phi$ (motion-only) is the student, this is the reverse KL (student→teacher). Minimizing $D_{\mathrm{KL}}(q_\phi \| q_\psi)$ w.r.t. $q_\phi$ is mode-seeking: $q_\phi$ is penalized heavily for placing probability mass where $q_\psi$ assigns low density (the $\log q_\psi$ term in the KL diverges toward $-\infty$), driving $q_\phi$ to concentrate on the most prominent semantic modes of the vision-fused posterior.
This yields compact, high-confidence motion representations that capture the core visual-semantic information while naturally filtering out view-specific or noise-related visual modes irrelevant to motion semantics.

Fig. 15: Acceleration-based jitter analysis on a high-dynamic sample (M009968). Top: right wrist acceleration (mm/frame²). Bottom: right ankle acceleration. Each panel overlays GT (gray), CMA-VAE (blue), MLD-VAE (orange), and VQ-VAE (red), and the inset box reports the acceleration standard deviation used as the jitter metric. CMA-VAE most closely matches the ground-truth dynamics, while VQ-VAE introduces high-frequency jitter and MLD-VAE over-smooths peak accelerations.

In contrast, the forward KL $D_{\mathrm{KL}}(q_\psi \| q_\phi)$ would be mode-covering: optimizing $q_\phi$ under this objective would force it to assign probability mass wherever $q_\psi$ has support—including noisy or view-dependent visual modes—resulting in an over-dispersed motion posterior. For our diagonal-Gaussian posteriors, this manifests as overestimated variance in $q_\phi$, diluting representational precision. The reverse KL instead produces tighter variance, which is preferable since the motion encoder at inference should yield precise, confident encodings without image input.

Training dynamics and stability. The DPA loss $\mathcal{L}_{\text{align}}$ uses a linear warm-up schedule over the first 10k training steps. This prevents the alignment constraint from dominating during early training when the VAE’s basic reconstruction ability is still unstable.
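The reverse-vs-forward asymmetry above can be made concrete with the closed-form KL between diagonal Gaussians. This is an illustrative sketch only: the student/teacher statistics are invented, and no gradients are involved (the teacher is treated as a fixed target, as with the stop-gradient in DPA).

```python
import numpy as np

def kl_diag_gaussian(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Invented statistics: a tight student posterior (motion-only encoder)
# sitting inside a broader teacher posterior (vision-fused encoder).
mu_s, var_s = np.array([0.0, 0.0]), np.array([0.5, 0.5])
mu_t, var_t = np.array([0.2, -0.1]), np.array([1.0, 2.0])

reverse_kl = kl_diag_gaussian(mu_s, var_s, mu_t, var_t)  # DPA direction
forward_kl = kl_diag_gaussian(mu_t, var_t, mu_s, var_s)  # mode-covering

# The reverse direction tolerates a concentrated, low-variance student;
# the forward direction penalizes it for failing to cover the teacher.
assert reverse_kl < forward_kl
```

This matches the argument in the text: under the forward KL the student would be pushed toward larger variance to cover the teacher's support, whereas the reverse KL rewards a tight, high-confidence posterior.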
As training progresses, $\mathcal{L}_{\text{align}}$ gradually increases while $\mathcal{L}_{\text{recon}}$ and $\mathcal{L}_{\text{KL}}$ continue to decrease—the three losses converge cooperatively rather than competing. The posterior variances $\mathbb{E}[\sigma^2_\phi]$ and $\mathbb{E}[\sigma^2_\psi]$ remain in a healthy range (not collapsing toward zero) throughout training, confirming that no posterior collapse occurs.

Generalization to unpaired datasets. HumanML3D data does not participate in $\mathcal{L}_{\text{align}}$ (no paired images). However, because the Motion Encoder’s front-end parameters are shared with the Vision-Fused Encoder, DPA’s gradients from H36M data indirectly improve representations for HumanML3D as well. This explains why removing DPA degrades T2M performance (R@3: 0.841→0.818), even though T2M evaluation uses only HumanML3D data. The shared-parameter design enables implicit knowledge transfer across datasets with different modality coverage.

C.2 LRA Non-Triviality Analysis

A natural concern is whether LRA’s M2M self-reconstruction degenerates into a trivial identity mapping. We provide three layers of evidence that this is not the case.

(1) Architectural impossibility. Three non-trivial transformations separate the conditioning input from the reconstruction target:

– Dimension remapping: $z \in \mathbb{R}^{T_z \times 256}$ is projected and compressed by the dual-path embedder into LLM tokens of dimension $d_h = 1536$, involving non-linear MLP projections and 4-layer Transformer encoding in the semantic branch.

– LLM processing: The LLM backbone operates on the fused tokens with causal attention and LoRA adaptation—a highly non-linear transformation.

– Probabilistic flow matching: The flow head predicts velocity fields starting from pure Gaussian noise $z_0 \sim \mathcal{N}(0, I)$, conditioned on LLM hidden states via AdaLN. The prediction target depends on both the noise realization and the timestep $t$—the model must “understand” motion structure to produce correct velocity directions, rather than copying input features.

(2) Information bottleneck during training.
Beyond architectural constraints, we apply explicit information bottleneck regularization to the conditioning input during M2M training:

– Temporal subsampling: Randomly retain only 20–50% of temporal frames from the motion latent conditioning, requiring the model to interpolate missing temporal information.

– Feature dropout: Randomly zero out 15% of feature dimensions in the conditioning tokens.

– Gaussian perturbation: Add noise with σ=0.02 to the conditioning latents.

The model receives a degraded version of $z$ as conditioning but must reconstruct the complete, clean target $z$—identity mapping is strictly impossible since input ≠ target. This design is analogous to masked image modeling and denoising autoencoders, which are well-established as non-trivial self-supervised objectives.

(3) Cross-task downstream transfer. The strongest evidence against trivial memorization is cross-task generalization: the M2M-calibrated pathway achieves T2M R@3=0.841 (vs. 0.801 without LRA) and simultaneously improves Vision→M MPJPE (84.3→75.0), even though M2M training involves no text or image inputs. An identity mapping of motion latents cannot explain gains on text-driven generation or image-driven estimation—the pathway must have learned transferable structural representations of motion dynamics.

Shuffled-condition control experiment. To directly verify that the model learns condition-target correspondence rather than dataset-level statistics, we evaluate the trained LRA model with mismatched conditions: using motion A’s embedding to guide reconstruction of motion B’s target. Under matched conditions, reconstruction FID is 0.008; under shuffled conditions, FID degrades to 2.34 (∼300× worse), confirming that the model has learned a structured condition→target mapping rather than a generic motion prior.
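The three corruptions can be sketched as a single conditioning-degradation step. The percentages come from the list above; the function name, latent shape, and dict of defaults are illustrative, not the paper's code.

```python
import numpy as np

def corrupt_condition(z, rng, keep_frac=(0.2, 0.5), drop_p=0.15, sigma=0.02):
    """Degrade a motion latent z of shape (T, D) used as M2M conditioning.

    Applies the three bottlenecks: temporal subsampling (keep 20-50% of
    frames), feature dropout (zero 15% of dims), and Gaussian perturbation
    (sigma=0.02). The clean z remains the reconstruction target, so an
    identity mapping cannot solve the task.
    """
    T, D = z.shape
    # Temporal subsampling: keep a random 20-50% subset of frames.
    keep = max(1, int(T * rng.uniform(*keep_frac)))
    idx = np.sort(rng.choice(T, size=keep, replace=False))
    z_c = z[idx]
    # Feature dropout: zero out ~15% of feature dimensions.
    mask = rng.random(D) >= drop_p
    z_c = z_c * mask
    # Gaussian perturbation of the conditioning latents.
    z_c = z_c + rng.normal(0.0, sigma, size=z_c.shape)
    return z_c, idx

rng = np.random.default_rng(0)
z = rng.normal(size=(60, 256))           # clean target latent
z_c, idx = corrupt_condition(z, rng)
assert 12 <= z_c.shape[0] <= 30          # 20-50% of 60 frames survive
assert not np.array_equal(z_c, z[idx])   # conditioning != target
```

The key invariant is the final assertion: the conditioning tensor can never equal the corresponding slice of the target, which is what rules out the trivial identity solution.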
C.3 Out-of-Distribution Generalization

To evaluate generalization beyond the training distribution, we test UniMotion’s Vision-to-Motion capability on 3DPW [30] without any fine-tuning. Despite training exclusively on H36M (indoor, controlled setting), UniMotion achieves competitive performance on 3DPW’s in-the-wild scenarios:

Table 14: Zero-shot Vision-to-Motion on 3DPW (no fine-tuning).

Method | MPJPE↓ | PA-MPJPE↓
UniPose [21] (zero-shot) | 99.4 | 65.8
UniMotion (zero-shot) | 93.6 | 58.3

This demonstrates that DPA’s visual-motion alignment generalizes beyond the training domain. The relative improvement over UniPose is consistent with in-domain results, suggesting that CMA-VAE’s continuous representations capture transferable body structure priors rather than overfitting to H36M-specific visual patterns.

D Architecture and Implementation Details

D.1 Unified 269-Dimensional Motion Representation

A core issue in prior work is representational inconsistency across tasks. Text-driven motion generation typically uses HumanML3D-style 263-dimensional representations, while image-driven body recovery operates directly in the raw SMPL parameter space. As a result, motion generation and body recovery conventionally use two separate interfaces, making end-to-end unified training within a single framework difficult.

UniMotion adopts a single 269-dimensional representation to bridge this gap. This representation maintains backward compatibility with the conventional 263-dim representation and additionally introduces 6-dim global orientation information to support body-structure supervision and cross-modal alignment. For each frame, we define a 269-dimensional vector:

1. Root motion increment: 3-dim (1-dim rotation increment + 2-dim planar translation increment).
2. Root height: 1-dim.
3. Relative joint positions: 63-dim (21 joints × 3).
4. Local joint rotations (continuous 6D form): 126-dim (21 joints × 6).
5. Local joint velocities: 66-dim (22 joints × 3).
6.
Foot contact states: 4-dim.
7. Global orientation (continuous 6D form): 6-dim.

The first six components sum to 263-dim, consistent with mainstream motion generation representations. The newly added 6-dim component maintains global orientation consistency in visual-motion tasks. The existing HumanML3D evaluation protocol can be applied directly to the first 263 dimensions.

Construction per dataset. HumanML3D: We strictly preserve the official 263-dim features unchanged, appending only the 6-dim global orientation computed under the same kinematic conventions. MotionFix: We first align the coordinate system and joint conventions to a unified semantic, then construct sequence-level 269-dim features in both time-difference mode (compatible with HumanML3D) and frame-level SMPL-semantic consistent mode. Human3.6M: Data is converted from SMPL annotations to the same 22-joint convention, then encoded into 269-dim features with unified coordinate normalization.

D.2 CMA-VAE Architecture Details

Motion-guided joint sampling. The 2D joint positions $j_{2d}(m)$ used for grid sampling in the Vision-Fused Encoder are derived from motion in one of three ways, evaluated in priority order:

1. Precomputed skeleton projections (primary path, used for H36M): 2D joint coordinates are pre-projected from ground-truth 3D skeleton annotations under full camera calibration and stored offline. This provides the most geometrically precise sampling grid.
2. Dataset-provided image-space annotations (camera_params[‘joint3d_image’]): pixel-space joint coordinates from dataset annotations, used when precomputed files are unavailable.
3. Runtime forward-kinematics recovery (fallback): 3D joint positions are recovered from the 269-dim motion representation via recover_from_ric and projected using weak-perspective camera parameters.
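The priority order above amounts to a simple fallback chain. The sketch below is hypothetical: only `camera_params['joint3d_image']` and `recover_from_ric` are named in the text, so the recovery and projection steps are stood in by injected callables and the other `sample` keys are invented for illustration.

```python
def select_joints_2d(sample, recover_fn, project_fn):
    """Pick the 2D joint source for grid sampling, in priority order.

    sample: dict with optional keys mirroring the three paths above.
    recover_fn / project_fn stand in for recover_from_ric and a
    weak-perspective projection. Returns (joints_2d, source_tag).
    """
    # 1. Precomputed skeleton projections under full camera calibration.
    if sample.get("precomputed_joints_2d") is not None:
        return sample["precomputed_joints_2d"], "precomputed"
    # 2. Dataset-provided image-space joint annotations.
    cam = sample.get("camera_params") or {}
    if cam.get("joint3d_image") is not None:
        return cam["joint3d_image"], "annotation"
    # 3. Fallback: recover 3D joints from the 269-dim representation
    #    and project them with weak-perspective camera parameters.
    joints_3d = recover_fn(sample["motion_269"])
    return project_fn(joints_3d, cam), "fk_fallback"

# Minimal check of the fallback order with stand-in callables.
recover = lambda m: "j3d"
project = lambda j, cam: "j2d"
assert select_joints_2d({"motion_269": None}, recover, project)[1] == "fk_fallback"
assert select_joints_2d(
    {"camera_params": {"joint3d_image": "ann"}}, recover, project
)[1] == "annotation"
```

Whichever branch fires, the returned coordinates are the same joints that define the motion representation, which is what grounds the grid sampling in the skeleton.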
All three paths share the key property that the grid-sampling positions are structurally grounded in the motion skeleton: the same joint coordinates that define the motion representation also determine which image regions are attended to. This motion-guided coupling is what gives the Vision-Fused Encoder genuine cross-modal sensitivity—visual features are extracted specifically at body-relevant locations rather than global image regions.

Reference-frame visual guidance. Since H36M training samples consist of individual frames paired with motion sequences, the Vision-Fused Encoder extracts visual features from the reference frame and broadcasts them across the full temporal sequence. This design is justified by the DPA objective: DPA aligns the distributional statistics of the two encoder posteriors, which requires injecting body-identity and scene-context semantics rather than per-frame visual detail. A single reference frame is sufficient to supply this semantic supervision, while per-frame extraction would multiply HRNet’s cost by sequence length (∼60–200 frames) with no commensurate benefit to the distributional alignment objective.

D.3 Unified Architecture: Detailed Formulations

This section provides the full architectural details for the components introduced in Sec. 3.3 of the main paper.

Dual-Path Embedder. Given the CMA-VAE latent $z \in \mathbb{R}^{T_z \times d}$, the embedder contains two parallel branches. The Semantic branch maps $z$ to semantic feature dimension $d_s = 512$ via MLP; after adding learnable positional encodings, $N_s = 4$ Transformer Encoder layers extract high-level semantic features:

$$e_{\text{und}} = \mathrm{TransformerEnc}\big(\mathrm{MLP}(z) + \mathrm{PosEmbed}\big) \in \mathbb{R}^{T_z \times d_s}. \tag{10}$$

This branch mirrors the SigLIP Encoder on the vision side and captures global motion semantics.
The Generation branch directly maps $z$ to the LLM hidden dimension $d_h$ via MLP, with independent learnable positional encodings preserving low-level motion details:

$$e_{\text{gen}} = \mathrm{MLP}(z) + \mathrm{PosEmbed}_{\text{gen}} \in \mathbb{R}^{T_z \times d_h}. \tag{11}$$

This branch mirrors the PatchEmbed on the vision side. The two branch outputs are concatenated along the channel dimension and projected to the unified LLM hidden dimension via RMSNorm + MLP:

$$e_{\text{fused}} = \mathrm{FusionProj}\big([e_{\text{und}} \,\|\, e_{\text{gen}}]\big) \in \mathbb{R}^{T_z \times d_h}. \tag{12}$$

All tasks (understanding and generation) uniformly use fused embeddings, consistent with the image modality.

Vision Pathway and Pose-Aware Vision Backbone. For visual inputs, UniMotion reuses Show-o2’s native image processing pipeline. Input images are first encoded by the pre-trained WAN2.1 VAE [31] into continuous latents $l_{\text{img}} \in \mathbb{R}^{C \times H' \times W'}$ (C=16, spatial compression factor 8×), then processed through the symmetric dual-path design: the Semantic Branch uses PatchEmbed followed by the SigLIP Vision Encoder; the Generation Branch uses an independent PatchEmbed to preserve spatial details. The two branches are fused via FusionProj before entering the LLM. For 432×432 input resolution, the WAN2.1 VAE produces 54×54 latents, and with patch_size=2 this yields 27×27 = 729 image tokens—exactly matching the SigLIP position encoding.

To provide fine-grained human body structure understanding for human-centric tasks, the RGB pathway is additionally equipped with a pose-aware vision backbone. This backbone employs a ViT-H backbone (1280-dim, 32 layers, 16 heads) initialized from a pretrained human body encoder [7] and kept frozen throughout training. Given an RGB image, it extracts body-structure-aware spatial features $f_{\text{pose}} \in \mathbb{R}^{d_p \times h_p \times w_p}$ ($d_p = 1280$), which are bilinearly interpolated to match the image token spatial resolution ($h' = w' = 27$) and flattened into N=729 tokens.
These pose-aware features are then concatenated with the fused SigLIP+PatchEmbed image embeddings and re-projected via a lightweight fusion layer:

$$e_{\text{rgb}} = \mathrm{PoseFusionProj}\big([e_{\text{img\_fused}} \,\|\, f_{\text{pose}}]\big) \in \mathbb{R}^{N \times d_h}, \tag{13}$$

where PoseFusionProj is RMSNorm($d_h + d_p$) → Linear → GELU → Linear. This design maintains architectural symmetry with the Motion side: the Motion dual-path embedder provides both semantic (global) and generation (detail-preserving) features, and the visual pathway similarly combines SigLIP’s general visual semantics with the pose-aware vision backbone’s human-centric geometric features. Quantitatively, this combination yields the best Vision→M performance reported in Table 7, while Table 9 shows that the aligned motion pathway already provides a strong base before the final visual refinements.

Hybrid Attention. We design a hybrid attention mechanism that maintains global causal constraints at the sequence level while applying intra-motion full attention within motion token spans. Formally, given a mixed sequence of length $L$ comprising text tokens $\mathcal{T}$, image tokens $\mathcal{I}$, and the $k$-th motion span $\mathcal{M}_k$, the attention mask $\mathbf{M} \in \{0, -\infty\}^{L \times L}$ is defined as:

$$\mathbf{M}_{ij} = \begin{cases} 0 & \text{if } i \in \mathcal{T},\ j \le i \\ 0 & \text{if } i \in \mathcal{M}_k,\ j \in \mathcal{M}_k \\ 0 & \text{if } i \in \mathcal{M}_k,\ j \notin \mathcal{M}_k,\ \mathrm{pos}(j) < \mathrm{start}(\mathcal{M}_k) \\ 0 & \text{if } i \in \mathcal{I},\ j \in \mathcal{I} \\ 0 & \text{if } i \in \mathcal{I},\ j \notin \mathcal{I},\ \mathrm{pos}(j) < \mathrm{start}(\mathcal{I}) \\ -\infty & \text{otherwise} \end{cases} \tag{14}$$

No information leakage guarantee. Motion and image tokens share the same attention pattern: each can only attend to (1) all tokens within its own span (bidirectional full attention), and (2) all tokens preceding that span (unidirectional). They cannot attend to any token beyond the span boundary. Text tokens follow standard causal attention ($j \le i$). Consequently, when generating the $i$-th text token, it sees only positions $\le i$ and all completed motion/image spans before it. No future information leaks into the generation process.
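The mask of Eq. (14) can be built explicitly. The sketch below is a minimal numpy construction under an assumed segment layout (the `segments` list format is invented for illustration); it reproduces the three rules: causal text, bidirectional attention within a motion/image span, and span-level access to everything before the span starts.

```python
import numpy as np

def hybrid_attention_mask(segments):
    """Build the hybrid attention mask of Eq. (14).

    segments: list of (kind, length) with kind in {"text", "image", "motion"}.
    Returns an (L, L) float mask: 0 = attend, -inf = blocked.
    """
    L = sum(n for _, n in segments)
    mask = np.full((L, L), -np.inf)
    pos = 0
    for kind, n in segments:
        start, end = pos, pos + n
        for i in range(start, end):
            if kind == "text":
                mask[i, : i + 1] = 0.0      # standard causal attention
            else:
                mask[i, start:end] = 0.0    # bidirectional within the span
                mask[i, :start] = 0.0       # all tokens before the span
        pos = end
    return mask

# Text(3) + Motion(4) + Text(2): motion tokens see each other and the
# prompt, but prompt text cannot peek forward into the motion span.
m = hybrid_attention_mask([("text", 3), ("motion", 4), ("text", 2)])
assert m[3, 6] == 0.0         # first motion token attends to last motion token
assert m[2, 3] == -np.inf     # prompt text cannot see future motion tokens
assert m[7, 7] == 0.0 and m[7, 8] == -np.inf  # trailing text stays causal
```

Adding the mask to the attention logits before the softmax zeroes the blocked positions, which is the standard way such a pattern is consumed.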
This design reconciles the flow matching objective—which requires simultaneous velocity field prediction across the entire motion or image latent sequence—with text’s autoregressive constraints.

Modality-Routed LoRA. For each Q, K, V, O projection matrix in every attention layer, we attach two Low-Rank Adaptation branches: LoRA-A for Text/RGB tokens and LoRA-B for Motion tokens, each with rank 32. During the forward pass, a deterministic modality mask routes each token to its corresponding LoRA branch based on modality identity (always known from sequence construction), eliminating the gating uncertainty of learned MoE routing [15]. This introduces only ∼2% additional parameters while enabling modality-specific parameter adaptation. Unlike HMVLM’s MoE-LoRA, which requires load-balancing losses and introduces routing noise during training, our deterministic design is more appropriate for the tri-modal setting where modality identity is unambiguous.

D.4 Flow Matching Generation

UniMotion adopts the flow matching framework [23] for Motion and RGB generation. Given a data point $x_1$ (target Motion Latent or Image Latent), noise $x_0 \sim \mathcal{N}(0, I)$ is sampled from a standard normal distribution, and the linear interpolation path is defined as:

$$x_t = t \cdot x_1 + (1 - t) \cdot x_0, \quad t \in [0, 1], \tag{15}$$

with corresponding velocity field $u_t = x_1 - x_0$. The model is trained to predict the velocity field $v_\theta(x_t, t)$ with MSE loss:

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{x_0, x_1, t}\, \|v_\theta(x_t, t) - u_t\|^2, \tag{16}$$

where $t$ is sampled from a Logit-Normal distribution ($\mu=0$, $\sigma=1$) to increase sampling density at intermediate timesteps where the velocity field is most informative. The full generation loss is:

$$\mathcal{L}_{\text{gen}} = \lambda_{\text{ntp}} \cdot \mathcal{L}_{\text{NTP}} + \lambda_{\text{flow}} \cdot \mathcal{L}_{\text{flow}}, \tag{17}$$

where $\lambda_{\text{ntp}}=1.0$ and $\lambda_{\text{flow}}=0.8$ across all stages; $\mathcal{L}_{\text{NTP}}$ is the next-token prediction loss for text tokens. For pure understanding tasks, only $\mathcal{L}_{\text{NTP}}$ is computed.
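The training-sample construction of Eqs. (15)–(16) can be sketched in a few lines. The logit-normal timestep draw and the constant velocity target follow the text; the toy latent shape and the model-free loss check are illustrative assumptions.

```python
import numpy as np

def flow_matching_target(x1, rng):
    """Sample one training tuple (x_t, t, u_t) for rectified flow matching.

    x1: target latent (e.g. a motion latent), any shape.
    t ~ Logit-Normal(mu=0, sigma=1), concentrating samples at
    intermediate timesteps as described in the text.
    """
    x0 = rng.normal(size=x1.shape)                    # noise sample
    t = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.0)))   # logit-normal in (0, 1)
    x_t = t * x1 + (1.0 - t) * x0                     # linear path, Eq. (15)
    u_t = x1 - x0                                     # velocity target
    return x_t, t, u_t

def flow_loss(v_pred, u_t):
    """MSE between predicted and target velocity fields, Eq. (16)."""
    return np.mean((v_pred - u_t) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(16, 256))          # toy motion latent
x_t, t, u_t = flow_matching_target(x1, rng)
assert 0.0 < t < 1.0
assert flow_loss(u_t, u_t) == 0.0        # a perfect prediction gives zero loss
```

In training, `v_pred` would come from the flow head conditioned on the LLM hidden states and the timestep embedding; here the loss is only verified at its optimum.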
At inference, we start from $x_0 \sim \mathcal{N}(0, I)$ and integrate along the learned velocity field with the Euler ODE solver in $N_{\mathrm{step}} = 50$ steps, with Classifier-Free Guidance (CFG):

$$\hat{v} = v_{\mathrm{uncond}} + s \cdot (v_{\mathrm{cond}} - v_{\mathrm{uncond}}), \tag{18}$$

where $s = 3.0$ is the guidance scale. During training, condition dropout (probability 10%) replaces the conditioning tokens with null embeddings to enable CFG at inference. The generated latent is decoded by the corresponding VAE decoder. Time shifting [38] with factor 3.0 is applied for improved sample quality.

Motion Flow Head. The motion flow head consists of $N_d = 6$ Modulated Attention Blocks with hidden dimension $d_h = 1536$ and 16 attention heads, followed by a MotionFinalLayer. Timestep conditioning is injected via a sinusoidal timestep embedding $c_t \in \mathbb{R}^{d_h}$ through Adaptive Layer Normalization (AdaLN): each block modulates the hidden states using shift and scale parameters predicted from $c_t$. The output layer's weights and biases are zero-initialized to ensure the flow head produces near-zero velocity predictions at the start of training, stabilizing the early training dynamics.

D.5 Joint-Level Auxiliary Supervision

For the Vision-to-Motion task, we additionally apply a fine-grained joint reconstruction loss on the decoded motion features. Given the CMA-VAE decoded output $\hat{m} \in \mathbb{R}^{T \times 269}$, a SmoothL1 loss is computed on the first 67 dimensions (encoding root position, root height, and relative joint positions) against the ground truth:

$$\mathcal{L}_{\mathrm{joint}} = \mathrm{SmoothL1}(\hat{m}_{1:67}, m_{1:67}). \tag{19}$$

This provides dense per-joint spatial supervision that complements the flow matching objective $\mathcal{L}_{\mathrm{flow}}$, encouraging geometrically precise joint localization. The frozen pose-aware vision backbone extracts features; only the motion branch and LoRA parameters receive gradients from $\mathcal{L}_{\mathrm{joint}}$.
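A minimal sketch of this auxiliary objective, assuming the standard SmoothL1 formulation with threshold $\beta = 1$ (the paper does not state $\beta$, so this value is an assumption):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Standard SmoothL1 (Huber-style) loss: quadratic below beta, linear above,
    averaged over all elements. beta=1.0 is an assumed default."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def joint_loss(m_hat, m):
    """Eq. (19): supervise only the first 67 of the 269 motion dims
    (root position, root height, relative joint positions)."""
    return smooth_l1(m_hat[:, :67], m[:, :67])
```

The remaining 202 dimensions (velocities, rotations, etc.) are left to the flow matching objective alone.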
This auxiliary loss is applied in all training stages involving the Vision-to-Motion task (Stages 2–3), with loss weight $\lambda_{\mathrm{joint}} = 0.1$ and a linear warm-up over the first 5000 steps.

E Training Pipeline, Data Construction, and Evaluation

E.1 Multi-Stage Training Configuration

Table 15: Multi-stage training configuration. Each stage builds on parameters from the previous stage.

| Stage    | Name               | Tasks             | Steps | LR   | Trainable Params           |
|----------|--------------------|-------------------|-------|------|----------------------------|
| CMA-VAE  | DPA Pre-training   | Recon. + DPA      | 210k  | 1e-4 | CMA-VAE full (∼45M)        |
| Stage 0  | LRA Pre-training   | M2M               | 80k   | 6e-5 | Embedder, FlowHead, LoRA-B |
| Stage 1a | Motion-Text Warmup | T2M               | 40k   | 6e-5 | All new motion components  |
| Stage 1b | Motion-Text Align. | T2M+M2T+Pred+Edit | 130k  | 3e-5 | All new motion components  |
| Stage 2  | Cross-modal Ext.   | +V2M+V2T+MGIE     | 130k  | 4e-5 | Motion + image components  |
| Stage 3  | Full Multi-task FT | All tasks         | 200k  | 3e-5 | All (LLM partial unfreeze) |

Table 15 summarizes the complete training pipeline. All stages use the AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay = 0, $\varepsilon$ = 1e-8) with bf16 mixed precision and constant-with-warmup LR scheduling. Training is conducted on 4×A6000 GPUs. Figure 16 provides a visual overview of this progressive pipeline, illustrating which modules are trainable or frozen at each stage and how supervision signals evolve across stages.

CMA-VAE Pre-training. Full CMA-VAE training with Dual-Posterior KL Alignment on HumanML3D, MotionFix, and H36M. The DPA alignment loss $\mathcal{L}_{\mathrm{align}}$ is applied only to samples with paired images (H36M), using a linear warm-up schedule. For HumanML3D and MotionFix samples without paired images, only $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{KL}}$ are applied.

Stage 0 – LRA Pre-training. M2M self-reconstruction only.
To prevent trivial shortcut learning and improve generalization, we apply a lightweight information bottleneck to the conditioning input: temporal subsampling (randomly retaining 20–50% of frames), feature dropout ($p = 0.15$), and low-level Gaussian perturbation ($\sigma = 0.02$). The dual-path embedder, motion flow head, and modality-routed LoRA-B (Motion branch) participate in training. Training data includes HumanML3D, H36M, and MotionFix motion sequences. Warmup steps: 2000.

Stage 1a – Motion-Text Warmup. T2M using concise fixed prompt templates, establishing basic text-to-motion generation capability. Warmup steps: 2000.

Stage 1b – Motion-Text Alignment. Joint training of T2M + M2T + Motion Prediction + Motion Editing with rich and diverse instruction templates. Warmup steps: 4000.

Stage 2 – Cross-modal Extension. Three vision-modality tasks are added: Vision-to-Motion, Vision-to-Text, and Motion-guided Image Editing (MGIE). Visual inputs are processed via the WAN2.1 VAE [31] and Show-o2's native image pipeline, with the pose-aware vision backbone additionally activated. Stage 2 proceeds in two phases: first img2pose + img2text, then MGIE + img2pose + img2text. Warmup steps: 2000.

Stage 3 – Full Multi-task Fine-tuning. All tasks are jointly trained with partial LLM backbone unfreezing in the final full multi-task stage. Warmup steps: 4000.

[Figure 16 omitted.]
Fig. 16: Multi-stage training pipeline of UniMotion. (a) Stage 0 – LRA Pre-training: the LLM is frozen; motion components are trained via M2M self-reconstruction with information bottlenecks to calibrate the motion pathway. (b) Stage 1a – Motion-Text Warmup: basic T2M generation capability is established using concise fixed prompt templates. (c) Stage 1b – Motion-Text Alignment: T2M, M2T, prediction, and editing are jointly trained with diverse instruction templates to achieve bidirectional motion-language alignment. (d) Stage 2 – Cross-modal Extension: vision tasks (V2M, V2T, MGIE) are added, activating the image pathway and pose-aware backbone to achieve tri-modal coverage. (e) Stage 3 – Full Multi-task Fine-tuning: all parameters, including the partially unfrozen LLM backbone, are jointly optimized across all seven tasks for global cross-modal integration. CMA-VAE pre-training precedes Stage 0 as an independent phase.

E.2 Cross-Modal Task Data Construction

This section details the data construction for cross-modal tasks in the later training stages (Stages 2–3).

Vision-to-Motion (i2m) and Vision-to-Text (i2t). Both tasks are built on Human3.6M visual inputs and corresponding 269-dim motion representations, using online sample construction.
In the current H3.6M setup, each sample uses one reference frame, with unified resolution (432×432) and normalization applied before training. Task-specific quality filtering ensures that i2m covers the broadest possible set of visual-motion samples, while i2t additionally requires available text supervision.

Motion-guided Image Editing (MGIE). MGIE targets generation from "source image + reference motion + text instruction" to produce a target image. We perform temporal pairing within the same video sequence:

– Candidate target frames are sampled using a discrete time-interval set (5, 8, 10 frames), covering short-to-medium motion changes.
– A pose difference threshold (based on $L_2$ distance in the 269-dim motion space) filters out pairs with minimal changes.
– At most one target frame is retained per source frame to control sample correlation.

Under this configuration, we obtain approximately 134k training pairs and 40.6k test pairs.

Instruction templates. Diverse instruction template sets are constructed for each task, with random sampling during training. MGIE instructions explicitly constrain the model to "execute the motion edit while preserving the scene/subject," reinforcing decoupled modeling of motion change and appearance preservation.

E.3 Evaluation Protocols

We detail the evaluation protocols for all tasks to ensure reproducibility.

Text-to-Motion (T2M). Evaluated on the HumanML3D test split (4,646 samples). We compute FID, R-Precision (R@1/2/3), MMDist, and Diversity following the standard protocol [12]. Motion features are extracted using the text-motion feature extractor from Guo et al. [13]. Each metric is averaged over 20 runs with different random seeds.

Motion-to-Text (M2T). Evaluated on the HumanML3D test split following [14]. We report retrieval metrics (R@1/3, MMDist) and caption quality metrics (Bleu@1/4, Rouge-L, CIDEr, BertScore).

Motion Prediction. Evaluated on the AMASS subset following MotionGPT [17].
Given the first half of a motion, the model predicts the remaining half. We report FID, ADE (Average Displacement Error), and FDE (Final Displacement Error).

Motion Editing. Evaluated on the MotionFix test split (2,847 samples). Given a source motion and a text editing instruction, the model generates the edited motion. We report FID and generated-to-target retrieval precision (R@1/3) following [1]. Motion Editing FID is generally lower than T2M FID because the editing task has dual constraints (source motion + text instruction), producing distributions closer to the ground truth.

Vision-to-Motion. Evaluated on the Human3.6M test split following [21]. We report MPJPE and PA-MPJPE. The same data splits and evaluation code as UniPose are used for fair comparison.

Vision-to-Text. Evaluated on the Human3.6M test split. We report BLEU-4, ROUGE-L, and METEOR for pose description quality. All baselines (UniPose, Show-o2, Qwen-2.5-VL) are evaluated on the same split with the same tokenizer settings.

Motion-guided Image Editing (MGIE). Evaluated on 40.6k test pairs constructed from H3.6M and 3DPW [30]. We report FID (Inception-v3 features), CLIP Score (text-image alignment), and Motion Accuracy (Mot.Acc). Specifically, Mot.Acc measures the pose-condition execution success rate rather than a classification accuracy. For each generated image, we use HMR2.0 [9] to extract its human pose and compute the Procrustes-Aligned Mean Per Joint Position Error (PA-MPJPE) against the target reference motion. A generation is counted as a successful "hit" if its PA-MPJPE is at most a strict threshold (100.0 mm). Mot.Acc is defined as the percentage of hits across all samples, directly reflecting how reliably the generated image conforms to the given motion condition. Training and evaluation sets have no overlap.

F Limitations and Broader Impact

Limitations.
UniMotion inherits the computational overhead of a 1.5B-parameter backbone, which may limit deployment in resource-constrained settings. The pose-aware vision backbone relies on a frozen pretrained human body encoder [7]; robustness to severe occlusion, camera motion, and diverse in-the-wild scenarios remains to be thoroughly validated, as the visual-motion alignment is primarily established on indoor datasets (Human3.6M). The current Vision-to-Motion evaluation on Human3.6M uses frame-based visual inputs; extending the same interface to richer video-level temporal reasoning is a promising direction. We hope UniMotion opens a practical direction toward motion-aware multimodal intelligence.

Broader impact. UniMotion's unified motion-language-vision interface can benefit animation, AR/VR, robotics, and medical rehabilitation. Potential risks include synthetic media misuse and privacy concerns if trained on identifiable human videos. We commit to clear dataset licensing, anonymization, and responsible use.