← Back to papers

Paper deep dive

OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 58

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%

Last extracted: 3/22/2026, 6:07:02 AM

Summary

OpenT2M is a large-scale, high-quality, open-source motion dataset containing over one million sequences and 2800 hours of human motion, designed to address data quality and generalization issues in text-to-motion (T2M) generation. The authors also introduce MonoFrill, a no-frill T2M model utilizing a novel 2D-PRQ motion tokenizer that captures spatiotemporal dependencies by decomposing the human body into five parts, achieving superior reconstruction and zero-shot performance.

Entities (5)

2D-PRQ · tokenizer · 100%
MonoFrill · model · 100%
OpenT2M · dataset · 100%
HumanML3D · dataset · 95%
Motion-X · dataset · 95%

Relation Signals (3)

MonoFrill trained on OpenT2M

confidence 100% · Building upon OpenT2M, we introduce MonoFrill

MonoFrill uses 2D-PRQ

confidence 100% · Its core component is 2D-PRQ, a novel motion tokenizer

OpenT2M improves T2M models

confidence 90% · OpenT2M significantly improves generalization of existing T2M models

Cypher Suggestions (2)

Identify the tokenizer used by a specific model · confidence 95% · unvalidated

MATCH (m:Model {name: 'MonoFrill'})-[:USES]->(t:Tokenizer) RETURN t.name

Find all models trained on the OpenT2M dataset · confidence 90% · unvalidated

MATCH (m:Model)-[:TRAINED_ON]->(d:Dataset {name: 'OpenT2M'}) RETURN m.name

Abstract

Abstract:Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

57,627 characters extracted from source content.


OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao 1,2,3, Sipeng Zheng 6, Hao Luo 5, Boyuan Li 4, Jing Liu 1,2,†, Zongqing Lu 5,6,†
1 CASIA · 2 UCAS · 3 BAAI · 4 RUC · 5 PKU · 6 BeingBeyond
https://research.beingbeyond.com/opent2m

[Figure 1 panels: (a) HumanML3D, (b) Motion-X, (c) HumanML3D*, (d) Motion-X*, with out-of-domain points highlighted] Figure 1: (Left) Visualization of text embeddings for the training and validation sets of HumanML3D and Motion-X. A substantial overlap between the splits indicates data leakage. To avoid this risk, we remove the overlap via data repartition (versions denoted with *). (Right) However, we observe a drastic performance drop when experimenting on this repartitioned benchmark, which reveals the limited generalization capability of current methods when faced with out-of-domain data.

Abstract

Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as "frills". Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biology parts.
Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.

‡ Correspondence to <jliu@nlpr.ia.ac.cn> and <zongqing.lu@pku.edu.cn>
arXiv:2603.18623v1 [cs.CV] 19 Mar 2026
Date: March 19, 2026

1 Introduction

Recent years have seen remarkable progress in generating human motion according to text descriptions for video games, movies, and humanoid robots. However, current state-of-the-art methods [12,16], which depend heavily on motion-capture data [11,26], struggle to create novel motions beyond what they've seen during training. We argue that this limited generalization in text-to-motion (T2M) models arises from a fundamental bottleneck in existing motion datasets: they lack both diversity and scale. In fact, we suppose that many reported improvements on standard benchmarks may simply reflect overfitting to the training distribution rather than practical advances. To support this claim, we first perform a systematic statistical analysis. Specifically, we plot the distribution of text descriptions in two widely-used benchmarks, HumanML3D and Motion-X [20], using CLIP [31]. We observe significant overlap between training and validation sets (Figure 1). Specifically, 10.62% and 16.97% of validation texts appear word-for-word in the training sets of these two datasets; most of them correspond to quite similar motions. We also find duplicate descriptions within the validation sets themselves. This data contamination seriously undermines how we evaluate T2M models. To fix this problem, we create cleaned versions of both (marked with *). As expected, models perform poorly on the new cleaned benchmarks.
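The word-for-word overlap statistic above (10.62% and 16.97%) can be sketched as a simple exact-match check between splits. This toy illustration uses invented captions and ignores the CLIP-embedding comparison the authors also perform:

```python
def exact_overlap_ratio(train_texts, val_texts):
    """Fraction of validation texts that appear word-for-word in the training set."""
    train_set = {t.strip().lower() for t in train_texts}
    hits = sum(1 for t in val_texts if t.strip().lower() in train_set)
    return hits / len(val_texts)

# Toy example: 1 of 4 invented validation captions is a verbatim duplicate.
train = ["a person walks forward", "a man waves his left hand", "someone jumps twice"]
val = ["a person walks forward", "a woman does a cartwheel",
       "a man kicks a ball", "someone sits down"]
print(exact_overlap_ratio(train, val))  # 0.25
```

On real benchmarks one would run this over the full caption lists of each split; a nonzero ratio on a supposedly held-out validation set signals the leakage shown in Figure 1.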
Another concerning issue is that modern T2M methods typically need hundreds of training epochs to converge, a sign of overfitting, suggesting that existing performance metrics are artificially inflated. One straightforward remedy is to create larger and more diverse motion datasets. However, progress in high-quality human motion data has stalled since AMASS was released, mainly because professional motion-capture equipment and facilities are extremely expensive. To avoid these costs, recent works [7,38] have tried extracting motion from internet videos using off-the-shelf estimation tools [32]. While web videos provide access to diverse motion patterns, this approach brings additional noise. Most importantly, a large portion of motions extracted from videos contain physically unrealistic artifacts like foot sliding, body drifting, and limb intersections, which severely limit their usefulness for training reliable motion generation models [15]. To solve these problems, we introduce OpenT2M, a large-scale, high-quality human motion dataset containing over one million sequences. Our dataset focuses on bridging the quality gap towards motion-capture databases like HumanML3D while being much larger in scale. The key advantage of OpenT2M is that it is freely available to researchers and uses a carefully designed curation process. Unlike previous large-scale video-based motion datasets, which are either not publicly available [38] or lack proper physics-aware quality control, we make our dataset open-source with an effective refinement pipeline. OpenT2M offers four key improvements over existing datasets. (1) Physically Feasible Validation: We validate that all motion sequences are physically feasible and can be simulated, making them suitable for training models that control humanoid robots. (2) Multi-granularity Quality Filtering: We remove sequences with occlusions or partial body captures, ensuring that the full human body is visible throughout each motion sequence.
(3) Second-wise Descriptions: We generate detailed textual labels for each second of motion, combining them into comprehensive descriptions that accurately capture all actions in the video. (4) Long-horizon Motions: Our dataset includes extended motion sequences that enable models to generate realistic, long-horizon movements from complex text descriptions.

In addition, the increasing scale of motion datasets poses a challenge for motion tokenizers in accurately reconstructing motions. Inspired by residual vector quantization (RQ) techniques [12,19] and MotionBook [38], we propose a novel motion tokenizer, named 2D-PRQ, that shows superior reconstruction performance and strong zero-shot ability. Our contributions are summarized as follows:

• A Large-scale, High-quality Motion Database. We curate OpenT2M, containing over one million sequences. Our dataset ensures that all motions are physically realistic through multi-granularity quality filtering and manual validation. It also includes long-horizon motion sequences that enable T2M models to generate complex movements from detailed text descriptions.

• A New Robust Foundation Benchmark. In addition to improving the generalization of current T2M models, OpenT2M provides a reliable benchmark for fairly evaluating existing methods.

• An Effective No-frill T2M Model. We develop a powerful yet "no-frill" motion generation model that achieves excellent T2M performance without complicated designs or technical tricks. Built simply on a novel motion tokenizer named 2D-PRQ, MonoFrill effectively captures how motion unfolds both in space and over time. After pretraining on OpenT2M, it shows outstanding performance, especially when tested under zero-shot setups.

2 Related Work

Human Motion Dataset. Datasets are the foundation of building a robust T2M model. Pioneering datasets, like KIT [29] and AMASS [26], adopt motion-capture devices to obtain human motion data with manual text annotation.
The scale and diversity of these datasets are limited. BABEL [30] provides frame-level text annotation on AMASS and serves as a long-horizon motion generation benchmark. HumanML3D [11] expands datasets to 14.6K motions and 44.9K texts by merging AMASS and HumanAct12 [10]. Motion-X [21] further scales up by extracting motions from monocular videos and annotating them with PoseScript [5], resulting in a motion dataset comprising 81.1K sequences. Wang et al. [38] introduce the first million-level motion dataset, MotionLib, and highlight the importance of scaling datasets. HuMo100M [3] is the largest motion dataset, featuring 5M motion sequences with multi-granularity text annotation. However, the scarcity of large-scale, high-quality, and open-source datasets hinders building a generalizable T2M model. In this work, we introduce OpenT2M, a large-scale, high-quality, and open-source dataset that improves the generalization ability of current T2M models.

Motion Tokenization. Building an effective motion tokenizer is crucial for high-quality motion generation. A motion tokenizer contains a motion encoder, a motion decoder, and a quantizer. T2M-GPT [42] adapts VQ-VAE to discretize motion into motion tokens by applying 1D convolutions and a single embedding to represent the whole-body feature. Furthermore, to reduce reconstruction error, Lee et al. [19] introduce residual quantization (RQ), utilizing multiple layers to quantize motion sequences iteratively. Recently, emerging research has explored fine-grained motion tokenization. Chen et al. [4] decouple the human body into the upper body and lower body, and Cao et al. [3] decouple the human body into five parts. However, these methods encode and quantize different body parts independently, without skeletal constraints. This limitation motivates us to design 2D-PRQ, a novel motion tokenizer capturing spatial and temporal dependencies and showing superior zero-shot performance.
3 The OpenT2M Dataset

The development of robust T2M models is hindered by the lack of large-scale, high-quality data. Prior datasets suffer from insufficient diversity, often leading to the artifact where R@1 exceeds R@1 Real [28,33,46], indicating an ambiguous, one-to-many text-motion mapping. To address this challenge, we introduce OpenT2M, an open-source dataset created through a rigorous curation pipeline designed with several key steps (Figure 2):

Physically Feasible Validation. Motion capture (MoCap) data provides high-quality human motion sequences, valued for its inherent accuracy and adherence to physical constraints [13,26]. However, MoCap data is difficult to scale up. To leverage more abundant but noisier video-based motion data, we introduce an RL-based filter to ensure physical plausibility. We train a robust policy, π_refine [25], on AMASS, using it to track motions extracted from web videos. By retaining only the motions that our policy can successfully track, we eliminate artifacts like jittering and foot sliding to guarantee physical feasibility. In our data, more than 63% of the extracted motions pass this physically feasible validation. Note that this process not only optimizes the motion quality, but also preserves highly dynamic motion sequences (e.g., dancing, fencing, and pitching). To prove this, we provide additional visualization examples in Appendix D. Compared with previous works [7,23], this process ensures the extracted motions adhere to physical constraints, significantly enhancing realism and quality.

[Figure 2: Data Curation pipeline. (a) Motion Data Curation: a two-stage pipeline with physically feasible validation (Step 1) and a multi-granularity filter (Step 2: keypoints filter, bbox-ratio filter, duration filter). (b) Long-horizon Motion Curation: Slerp interpolation with orientation and global-coordinate alignment (Step 1), followed by physical refinement via avatar trajectories (Step 2). (c) Second-wise Text Annotation: Gemini-2.5 produces temporally aligned per-second labels (e.g., "bending at the waist and preparing to swing a golf club", "swinging the golf club downwards towards the ball", "has completed the swing with the club extended"), which are merged into a precise, semantically rich description ("performing golf swings, involving bending at the waist, swinging the club downwards").]

Table 1: Comparison with existing human motion datasets, where "#physically-feasible" refers to motion sequences that comply with physical laws and "#long-horizon" denotes datasets that can serve as a long-horizon benchmark.

| Dataset | #Clips | #Hours | #Avg. Length | #long-horizon | #physically-feasible | #vision |
| BABEL [30] | 52.9K | 33.2 | 2.3s | ✓ | ✓ | ✓ |
| KIT [29] | 5.7K | 11.2 | 9.5s | ✗ | ✓ | ✗ |
| HumanML3D [11] | 29.2K | 28.6 | 7.1s | ✗ | ✗ | ✗ |
| Motion-X [21] | 81.1K | 144.2 | 6.4s | ✗ | ✗ | ✓ |
| MotionLib [38] | 1.2M | 1456.4 | - | ✗ | ✓ | ✓ |
| MotionMillion [7] | 2M | - | - | ✗ | ✗ | ✓ |
| HuMo100M [3] | 5.7M | 8508.3 | 5.3s | ✓ | ✓ | ✓ |
| OpenT2M | 1M | 2815.6 | 10.1s | ✓ | ✓ | ✓ |

Multi-granularity Filtering. Although web videos can serve as a rich source of human motion [9,17,39], their motion quality is often compromised by issues like occlusions, blur, and low resolution.
To avoid these, after extracting 2D keypoints with a pre-trained detector [41], we apply a set of quality criteria to ensure high motion fidelity, including: (1) a minimum keypoint count per frame to maintain structural completeness and remove occluded or partial-body sequences; (2) a lower-bound ratio of human bounding boxes so that human bodies are sufficiently visible and detailed for precise motion estimation and text annotation; (3) a shortest temporal duration to exclude fragmented clips and retain continuous activities. These criteria help build a high-quality motion extraction pipeline with clear video clips and semantically rich text labels. Additional details about their setups can be found in Appendix A.

Second-wise Text Annotation. The quality of text annotations, in both semantic richness and precision, is critical for dataset integrity and motion generation fidelity. Unlike prior works that use a single-stage approach [3,38] to generate a single, coarse description for an entire video clip, which fails to capture all activities within the video, our method avoids omitting crucial details and obtains finer-grained motion alignment. We achieve this with a two-stage pipeline: Gemini-2.5-pro [34] first produces temporally precise, second-by-second descriptions of human motions, including fine-grained limb movements. These fine-grained motion descriptions are then synthesized into a coherent summary for the entire clip. This process captures comprehensive action details, providing reliable text annotations for pretraining a robust motion generation model.

Long-horizon Motion Curation. Most existing motion data is predominantly short-duration, limiting its utility for long-horizon benchmarks. While works like BABEL [30] have explored addressing this issue, their scale and duration remain constrained. Here, we devise a strategy to synthesize massive long-horizon motion sequences. First, we concatenate raw motions via interpolation with orientation and global-coordinate alignment. However, such an intuitive operation can lead to physically implausible transitions. Therefore, we follow it with a two-step refinement: an RL-based policy filters out untrackable motions, and the avatars' trajectories are used to ensure physically feasible transitions. In addition, previous works [3] create long-horizon text by direct concatenation as well, which introduces noise and inefficiency due to motion-irrelevant content. Instead, we use Gemini-2.5-pro to merge refined concise commands (e.g., "wave left hand.") into clean, user-friendly descriptions. As far as we know, OpenT2M is the first dataset with an average motion length exceeding 10 seconds. A statistical comparison with counterparts is illustrated in Table 1.

[Figure 3: Model Overview. We propose an extendable, autoregressive (AR) and discrete T2M model with no frills. (Left) Our core design, 2D-PRQ, divides the entire body into five parts via a part divider, a 2D encoder/decoder, and a part-level motion codebook, quantizing motion into discrete part-level tokens such as <left_arm_0><left_arm_1><left_arm_2>, <torso_0><torso_1><torso_2>, and so on for each limb. (Right) The AR language model takes text as input (e.g., "A man lifts his right arm.") and predicts part-level motion tokens, which a motion detokenizer decodes. We call this model "MonoFrill" to show its simplicity.]
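The Slerp interpolation used when stitching raw motions could be sketched for a single joint rotation as follows. This is a plain-numpy sketch with quaternions in (w, x, y, z) order; the RL-based refinement and alignment steps are omitted:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1 at t in [0, 1]."""
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # flip to take the shorter arc on the 4-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to lerp + renormalize
        out = q0 + t * (q1 - q0)
        return out / np.linalg.norm(out)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    s0 = np.sin((1.0 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)
    return s0 * q0 + s1 * q1

# Interpolate a transition frame halfway between two poses of one joint.
q_identity = np.array([1.0, 0.0, 0.0, 0.0])                          # no rotation
q_90z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])   # 90° about z
q_mid = slerp(q_identity, q_90z, 0.5)                                # ≈ 45° about z
```

A full transition would apply this per joint across the interpolation window; the paper then discards any stitched sequence its tracking policy cannot reproduce.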
4 The MonoFrill Model

Overview. Inspired by large language models' success in multimodal understanding [24,36,43,45], we frame human motion as a specialized "language". Our approach, illustrated in Figure 3, uses a motion tokenizer to discretize sequences into tokens, which are then generated autoregressively by an LLM. To integrate motion tokens into the LLM backbone, we expand the LLM's vocabulary by incorporating the K discrete codes. We also introduce special tokens such as <mot> and </mot> to delimit motion sequences. The overall training pipeline consists of two phases. First, we train a motion tokenizer to discretize motion features into motion tokens while minimizing reconstruction error. This is followed by text-motion alignment training via motion instruction tuning [16], conducted on OpenT2M to achieve robust and general-purpose text-motion alignment. We name our model "MonoFrill" to denote its simplicity and extendable capability without any complex design.

Motion Instruction Training. Achieving robust text-motion alignment is essential for developing a generalizable motion generation model. In the text-alignment training phase, 2D-PRQ first encodes and quantizes the continuous raw motion feature sequence M ∈ R^(T×D) into discrete motion tokens V ∈ R^(n×p×l), using a temporal downsampling ratio of n/T. Here, p = 5 represents the number of body parts, n is the number of temporal tokens, l is the number of residual layers in the quantization process, and K is the size of the motion codebook. In addition to common motion tokens, we also introduce two further special tokens, <part> and </part>, to separate body-part-specific subsequences and structure the input effectively. To enable autoregressive prediction of motion tokens conditioned on descriptions, we design a standardized template for all text-motion pairs:

Input I: The person performs a salute and then shakes hands with another person.
Answer M: <mot> <part_1><motion_token> ... </part_1> ...
</mot>

To train our large motion model, we optimize the negative log-likelihood loss over the predicted tokens as follows:

L(Θ) = − Σ_{j=1}^{L} log P_Θ(y_j | desc, ŷ_{1:j−1}),    (1)

where ŷ and y denote the input and target token sequences, respectively, Θ represents the model parameters, L is the length of the target sequence, and desc represents the text input.

2D-PRQ: Towards Generalized Motion Tokenization. The increasing scale of motion datasets demands more effective encoding. Current VQ-based methods [42,44] use 1D temporal convolutions and a single embedding for the whole body, leading to information loss and limited generalization. In this work, we propose 2D-PRQ, a novel tokenizer that captures spatiotemporal dependencies by decomposing the body into parts. Given a motion sequence m_{1:T} ∈ R^(T×D), 2D-PRQ first splits it into part-level features m̃_{1:T} ∈ R^(T×p×d), where d is the part-level feature dimension and p = 5 represents the body parts: left arm, left leg, torso, right leg, right arm. More details about how to split the whole-body feature into independent parts can be found in Appendix B. Unlike methods that process parts in isolation [4], we conceptualize the sequence as a 2D image: time as width and body parts as height. This design allows us to use a 2D convolution block for motion encoding [14], capturing both temporal correlations across frames and spatial dependencies between body parts, which is crucial for maintaining whole-body coordination and consistency. The encoder outputs a latent sequence b̃_{1:p;1:n} with a downsampling ratio of n/T. Each latent vector b̃_{i,j} is quantized via residual quantization [19] using a shared codebook C, producing the token sequence [b^k_{1:p;1:n}]_{k=0}^{K}, where b^k denotes the code sequence at layer k. For decoding, a symmetric 2D decoder reconstructs the part-level features m̂_{1:p;1:n}, which are aggregated to restore the raw motion feature m̂.
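The residual quantization at the heart of 2D-PRQ can be illustrated with a toy numpy sketch: each layer snaps the current residual to its nearest codebook entry and passes the remainder on to the next layer. The random codebook and dimensions here are illustrative, not the learned ones:

```python
import numpy as np

def residual_quantize(latent, codebook, num_layers):
    """Greedy residual quantization: at each layer, snap the current residual
    to its nearest codebook entry (L2 distance) and carry the remainder onward."""
    residual = np.asarray(latent, dtype=float).copy()
    codes, recon = [], np.zeros_like(residual)
    for _ in range(num_layers):
        dists = np.linalg.norm(codebook - residual, axis=1)  # distance to every code
        idx = int(np.argmin(dists))
        codes.append(idx)
        recon += codebook[idx]       # running reconstruction = sum of chosen codes
        residual -= codebook[idx]    # remaining error for the next layer
    return codes, recon

# Toy run: 1024 random codes of dimension 8 stand in for a learned codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))
latent = rng.normal(size=8)
codes, recon = residual_quantize(latent, codebook, num_layers=4)
```

In 2D-PRQ this step runs per body part and per temporal latent against a shared codebook, so a sequence becomes an n × p × l grid of code indices.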
The reconstruction loss is:

L = ‖m − m̂‖₁ + Σ_{i=1}^{p} ‖m_i − m̂_i‖₁ + β Σ_{k=1}^{K} Σ_{i=1}^{p} ‖r_i^k − sg[b_i^k]‖₂²    (2)

5 Experiments

5.1 Experimental Setup

Datasets. To evaluate the performance and generalization capabilities of our model, we conduct experiments on three diverse motion datasets: HumanML3D [11], Motion-X [21], and our collected OpenT2M. HumanML3D is a widely adopted benchmark for text-to-motion generation, comprising 14,616 high-quality motion sequences paired with 44,970 textual annotations derived from sources like the AMASS dataset. Motion-X extends this scale with approximately 81,000 motion sequences, incorporating multi-modal data (e.g., video and audio cues) to enhance diversity in complex interactions and long-horizon motions. For further validation at an even larger scale, we utilize OpenT2M, a comprehensive dataset with over 1 million motion sequences sourced from real-world human activities, covering a broad spectrum of activities such as walking, dancing, and sports, making it ideal for assessing motion synthesis from diverse language descriptions. Following established protocols, we partition each dataset into training, validation, and test splits using an 80%, 5%, and 15% ratio, respectively.

Evaluation Metrics. Our experiments center on two primary tasks to comprehensively assess MonoFrill on text-to-motion (T2M) generation and 2D-PRQ on motion reconstruction. For T2M generation, we adopt standard metrics from the literature [11], including Motion-retrieval Precision (R-Precision), Multimodal Distance (MMDist), and Frechet Inception Distance (FID). In addition, the effect of motion tokenizers is assessed by the motion reconstruction task, which reconstructs input motions through the tokenizer to verify discretization quality. We employ FID to measure overall sequence realism and Mean Per Joint Position Error (MPJPE) to quantify geometric accuracy. Details of these metrics can be found in Appendix C.

Implementation Details.
For the motion reconstruction task, we implement a motion encoder with a temporal downsampling rate of α = 4 for fair comparison. The motion tokenizer is trained with a learning rate of 2e-4 and a batch size of 256. We implement our MonoFrill-2D-PRQ_4 with three sizes of LLMs: GPT2-medium [18], LLaMA2-7B [37], and LLaMA3.1-8B [6]. Full-parameter training is performed on 8× A800 GPUs with a learning rate of 2e-4 and a batch size of 1024 over 5000 steps on OpenT2M.

Table 2: Comparison of zero-shot performance on OpenT2M_zero using different datasets for training. Models trained on OpenT2M consistently present significant OOD improvements.

| Model | Training data | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ | DIV ↑ |
| Real | - | 0.316 | 0.495 | 0.621 | - | 3.771 | 7.749 |
| MDM | HumanML3D | 0.065 | 0.126 | 0.180 | 51.307 | 7.642 | 3.040 |
| MDM | Motion-X | 0.055 | 0.107 | 0.160 | 56.257 | 8.008 | 3.019 |
| MDM | OpenT2M | 0.194 | 0.338 | 0.447 | 8.153 | 4.889 | 7.136 |
| T2M-GPT | HumanML3D | 0.070 | 0.130 | 0.186 | 62.036 | 8.093 | 2.586 |
| T2M-GPT | Motion-X | 0.063 | 0.120 | 0.173 | 53.464 | 7.770 | 2.957 |
| T2M-GPT | OpenT2M | 0.159 | 0.271 | 0.357 | 5.566 | 5.072 | 6.921 |
| Being-M0 | HumanML3D | 0.073 | 0.134 | 0.190 | 58.541 | 7.956 | 2.932 |
| Being-M0 | Motion-X | 0.057 | 0.109 | 0.157 | 46.222 | 7.652 | 3.220 |
| Being-M0 | OpenT2M | 0.155 | 0.266 | 0.356 | 5.811 | 5.110 | 7.090 |
| MonoFrill-2D-PRQ_4 | HumanML3D | 0.061 | 0.119 | 0.173 | 60.177 | 8.059 | 2.674 |
| MonoFrill-2D-PRQ_4 | Motion-X | 0.052 | 0.110 | 0.152 | 55.470 | 7.841 | 2.433 |
| MonoFrill-2D-PRQ_4 | OpenT2M | 0.240 | 0.399 | 0.512 | 1.475 | 4.281 | 7.563 |

Table 3: Comparison of motion instruction tuning on HumanML3D. We apply a limited number of training steps to avoid overfitting. Models with pretraining consistently achieve significant improvements across diverse LLM backbones.
| Model | LLM backbone | Pretrain | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ | DIV ↑ |
| Real | - | - | 0.519 | 0.710 | 0.801 | - | 3.176 | 10.954 |
| MonoFrill | GPT2-medium | - | 0.078 | 0.148 | 0.212 | 61.809 | 8.803 | 4.810 |
| MonoFrill | LLaMA2-7B | - | 0.472 | 0.645 | 0.741 | 0.619 | 3.572 | 11.226 |
| MonoFrill | LLaMA3-8B | - | 0.503 | 0.694 | 0.792 | 0.546 | 3.224 | 11.104 |
| MonoFrill | GPT2-medium | ✓ | 0.215 | 0.316 | 0.377 | 17.91 | 7.129 | 8.372 |
| MonoFrill | LLaMA2-7B | ✓ | 0.485 | 0.676 | 0.773 | 0.435 | 3.386 | 11.373 |
| MonoFrill | LLaMA3-8B | ✓ | 0.518 | 0.704 | 0.798 | 0.238 | 3.172 | 11.216 |

5.2 Effectiveness of OpenT2M Dataset

While previous works have introduced large-scale datasets [7,38,40], their impact on model capabilities remains inadequately explored. To validate the effectiveness of OpenT2M, we conduct a rigorous T2M evaluation focusing on the following key aspects: (1) zero-shot generalization to out-of-domain cases, (2) adaptation to novel motion activities via instruction tuning, and (3) long-horizon motion generation.

Zero-shot Motion Generalization. To rigorously assess the generalization of T2M models to unseen data, we curate a held-out evaluation set, OpenT2M_zero, comprising 12,000 motions excluded from the training data, including HumanML3D and OpenT2M, ensuring no domain overlap between the evaluation and training sets. This OOD benchmark enables zero-shot evaluation, where models generate motions for novel text prompts without task-specific fine-tuning. We benchmark three representative baselines: MDM [35], T2M-GPT [42], and Being-M0 [38], as well as our MonoFrill. As shown in Table 2, models trained on HumanML3D and Motion-X exhibit limited zero-shot performance, with metrics like FID and R-Precision revealing degraded semantic alignment and motion diversity on OOD sequences. In contrast, training on OpenT2M yields substantial improvements across all baselines, underscoring its role in enhancing generalization through diverse, large-scale coverage of motion primitives and contexts.

Motion Instruction Tuning.
Inspired by the two-stage training paradigm in multimodal vision-language models [22], we adopt a similar pipeline for T2M generation: an initial pretraining phase on our large-scale OpenT2M dataset to foster robust text-motion alignment, followed by targeted fine-tuning on downstream benchmarks. Specifically, we fine-tune the pre-trained model on HumanML3D for a limited 50 epochs. Unlike previous works that train for up to 300 epochs on the same dataset, potentially leading to in-domain overfitting, we intentionally restrict the number of training steps. This allows us to assess inherent generalization capabilities without conflating them with the effects of prolonged training, a potential confound in prior evaluations. As shown in Table 3, models pre-trained on OpenT2M consistently outperform their non-pre-trained counterparts, indicating that pre-training equips the model with generalized motion patterns.

Table 4: Comparison on OpenT2M_long, where "text refinement" refers to converting raw texts into cleaned user commands and "long-horizon" denotes incorporating long-horizon motion data into OpenT2M.

| Model | Text refinement | Long-horizon | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ | DIV ↑ |
| Real | ✓ | ✓ | 0.573 | 0.740 | 0.822 | - | 2.842 | 10.450 |
| MonoFrill | - | - | 0.091 | 0.165 | 0.226 | 36.837 | 7.976 | 5.871 |
| MonoFrill | - | ✓ | 0.484 | 0.648 | 0.738 | 0.430 | 3.520 | 10.682 |
| MonoFrill | ✓ | ✓ | 0.510 | 0.677 | 0.765 | 0.297 | 3.322 | 10.748 |

[Figure 4: Visualization of generated long-horizon motions, demonstrating the ability to generate long-horizon motion sequences that accurately align with complex texts. Example input texts: "Hold right hand out. Take a photograph of something."; "Wave with left hand, then wave with right hand. Do a low kick."; "Do three jumping jacks. Take one large side step to the right."; "Cross arms. Pick up something."; "Stand and swing arms in circles. Do knee squats."; "Wave right hand over head. Walk forward and stop."]

Long-horizon Motion Generation.
Before introducing long-horizon benchmark, we first conduct text refinement. Text annotations in existing datasets, such as HumanML3D, contain considerable redundant details. Directly concatenating texts to construct long-horizon benchmark will introduce noise and inefficiency due to motion-irrelevant content. To mitigate this issue, we design a specific prompt and utilize Gemini-2.5 [34] to conduct text refinement: (1) removing motion-irrelevant details; (2) converting text annotations into cleaned and precise user commands. As illustrated in Table 5, this text refinement results in an improvement in R-Precision, achieving a better alignment between the refined text and motion sequences. Table 5: Ablation of text refinement on HumanML3D #text refinementR@1 ↑R@2 ↑R@3 ↑ -0.5200.7090.801 ✓0.5330.7200.808 Following text refinement, we introduceOpenT2M long , a long-horizon benchmark built with our cu- ration pipeline to evaluate T2M models on extended sequence generation. Our evaluation of a leading model,MonoFrill, reveals a significant struggle to produce satisfied performance without training on long-horizon motion data. In addition, text refinement further substantially improves this ability by enhancing text-motion alignment. Visualizations of the generated sequences are provided in Figure 4, and a detailed comparison with the BABEL dataset is available in Appendix A.2. 5.3 Effectiveness of 2D-PRQ Comparison of Motion Reconstruction. As shown in Table 6, 2D-PRQ outperforms previous methods, including PRQ, on large-scale datasets. Under a consistent configuration (codebook size 1024, feature dim 8 Table 6: Comparison of motion reconstruction on three benchmarks. Subscripts denote the number of quantization layers. 
| Motion Tokenizer | Codebook Size | HumanML3D FID ↓ | HumanML3D MPJPE ↓ | Motion-X FID ↓ | Motion-X MPJPE ↓ | OpenT2M FID ↓ | OpenT2M MPJPE ↓ |
|---|---|---|---|---|---|---|---|
| VQ-VAE_1 | 1024 | 0.358 | 83.902 | 0.127 | 115.382 | 3.130 | 178.534 |
| FSQ_1 | 65536 | 0.151 | 70.480 | 0.828 | 110.021 | 1.962 | 165.084 |
| RQ-VAE_6 | 1024 | 0.031 | 48.696 | 0.013 | 67.390 | 0.080 | 96.753 |
| RQ-VAE_8 | 1024 | 0.021 | 45.633 | 0.020 | 65.484 | 0.062 | 84.655 |
| PRQ_4 | 1024 | 0.003 | 28.703 | 0.012 | 73.989 | 0.094 | 95.743 |
| PRQ_6 | 1024 | 0.005 | 25.485 | 0.009 | 58.155 | 0.029 | 67.569 |
| 2D-PRQ_4 | 1024 | 0.003 | 28.628 | 0.011 | 54.493 | 0.022 | 49.134 |
| 2D-PRQ_6 | 1024 | 0.005 | 25.417 | 0.008 | 48.099 | 0.021 | 37.922 |

Table 7: Comparison of T2M on OpenT2M under different model parameters and motion tokenizers.

| Model | LLM | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | MMDist ↓ | DIV ↑ |
|---|---|---|---|---|---|---|---|
| MonoFrill-VQ_1 | GPT2-medium | 0.257 | 0.410 | 0.513 | 11.226 | 5.146 | 7.393 |
| MonoFrill-VQ_1 | LLaMA2-7B | 0.345 | 0.534 | 0.656 | 3.005 | 3.955 | 8.463 |
| MonoFrill-VQ_1 | LLaMA3-8B | 0.345 | 0.534 | 0.656 | 2.979 | 3.960 | 8.437 |
| MonoFrill-2D-PRQ_4 | GPT2-medium | 0.357 | 0.534 | 0.645 | 8.880 | 4.316 | 7.905 |
| MonoFrill-2D-PRQ_4 | LLaMA2-7B | 0.491 | 0.675 | 0.777 | 0.475 | 2.962 | 9.450 |
| MonoFrill-2D-PRQ_4 | LLaMA3-8B | 0.478 | 0.668 | 0.777 | 0.552 | 3.012 | 8.901 |

512, except for FSQ [27], whose codebook size is 65536), 2D-PRQ achieves substantially lower reconstruction error on Motion-X and OpenT2M while using a simpler architecture. The key advantage lies in its 2D convolutional design, which jointly models spatial and temporal dependencies. This leads to marginal gains on HumanML3D (MPJPE: 25.417 vs. 25.485) but dramatically larger improvements as the dataset scale increases, as evidenced by results on Motion-X (MPJPE: 54.493 vs. 73.989) and OpenT2M (MPJPE: 49.134 vs. 95.743).

Comparison of Text-to-Motion Generation. The choice of motion tokenizer is critically dependent on the scale of the training data. As shown in Table 2, replacing VQ-VAE with our 2D-PRQ tokenizer in the Being-M0 model leads to a performance drop when training on smaller datasets like HumanML3D and Motion-X. We attribute this to the increased number of motion tokens in 2D-PRQ, which requires large-scale data for effective training.
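The residual-quantization idea behind RQ-VAE and PRQ can be illustrated in a few lines: each layer quantizes the residual left over by the previous layers, so deeper stacks reconstruct more faithfully. The following is a generic numpy sketch, not the authors' implementation; the codebooks here are random rather than learned.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual
    left over from all previous layers (as in RQ-VAE / PRQ)."""
    residual = x.astype(float)
    codes, recon = [], np.zeros_like(residual, dtype=float)
    for cb in codebooks:
        # nearest codeword (squared L2) for every residual vector
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return codes, recon

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                                # 16 feature vectors of dim 8
codebooks = [rng.normal(size=(1024, 8)) for _ in range(4)]  # 4 quantization layers
codes, recon = residual_quantize(x, codebooks)
```

With learned codebooks, the per-layer code indices are the discrete motion tokens consumed by the LLM backbone.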
This hypothesis is confirmed when training on the large-scale OpenT2M: here, the MonoFrill-2D-PRQ_4 model achieves superior zero-shot performance, even exceeding strong baselines like T2M-GPT, Being-M0, and MDM. This result, also evident in Table 7, underscores that 2D-PRQ unlocks the full potential of large datasets and highlights the critical role of a well-designed motion representation. In Table 7, we observe that scaling the LLM from GPT2-medium to LLaMA2-7B brings significant gains. However, further scaling to LLaMA3-8B yields diminishing returns. This phenomenon is observed for both VQ-VAE and 2D-PRQ, indicating that the performance plateau is unrelated to the interaction between the backbone and the tokenizer. Furthermore, we perform hyperparameter tuning and observe that scaling up the backbone from LLaMA2-7B to LLaMA3-8B still does not yield significant gains. Therefore, we hypothesize a saturation point beyond which performance becomes less dependent on LLM size.

Table 8: Zero-shot comparison of motion tokenizers.

| Motion Tokenizer | HumanML3D FID ↓ | HumanML3D MPJPE ↓ | Motion-X FID ↓ | Motion-X MPJPE ↓ |
|---|---|---|---|---|
| VQ-VAE_1 | 25.525 | 237.702 | 44.889 | 293.301 |
| PRQ_4 | 2.169 | 135.964 | 5.020 | 167.508 |
| 2D-PRQ_4 | 0.107 | 77.695 | 1.606 | 108.921 |

Comparison under Zero-shot Setup. Previous work primarily adopts the VQ-VAE_1 tokenizer and trains it on limited-scale datasets for extensive periods (e.g., 200K steps), which leads to overfitting and fails to assess the tokenizer's inherent zero-shot performance. In contrast, we pre-train various tokenizers on the large-scale OpenT2M dataset and evaluate their zero-shot performance on HumanML3D and Motion-X. As shown in Table 8, 2D-PRQ_4 shows significantly superior zero-shot performance compared with VQ-VAE_1. Furthermore, compared with PRQ_4, 2D-PRQ_4 reduces the per-frame reconstruction error on HumanML3D from 135.964 to 77.695, demonstrating superior generalization and effectiveness in mitigating tokenizer overfitting.
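MPJPE, the reconstruction metric used throughout these comparisons (and defined in Appendix C), is simply the mean L2 joint error over all frames and joints. A minimal sketch, with an illustrative 22-joint skeleton:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average L2 distance between
    predicted and ground-truth joint positions over frames and joints."""
    # pred, gt: arrays of shape (frames, joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((4, 22, 3))   # 4 frames, 22 joints
pred = gt.copy()
pred[..., 0] = 3.0          # every joint offset by (3, 4, 0)
pred[..., 1] = 4.0
print(mpjpe(pred, gt))      # -> 5.0
```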
6 Conclusion

This paper introduces OpenT2M, a large-scale, high-quality human motion dataset with physical-feasibility validation, multi-granularity filtering, and second-wise annotation. We also introduce a pipeline that synthesizes long-horizon motion autonomously, combining motion connection and text connection to equip T2M models with the capability to generate complex, long-horizon motion sequences. Leveraging OpenT2M, we introduce MonoFrill, a pretrained T2M model achieving superior performance without complicated "frills". As the core component of MonoFrill, 2D-PRQ, a novel motion tokenizer, decouples human body features into five parts and captures spatiotemporal dependencies by applying 2D convolution, showing superior reconstruction performance on large-scale datasets and strong zero-shot ability. Comprehensive experiments demonstrate that OpenT2M improves generalization on unseen motion sequences and benefits motion instruction tuning. We hope that our findings and the release of OpenT2M will benefit this field.

References

[1] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.

[2] Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, and Ziwei Liu. Playing for 3D human recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

[3] Bin Cao, Sipeng Zheng, Ye Wang, Lujie Xia, Qianshan Wei, Qin Jin, Jing Liu, and Zongqing Lu. Being-M0.5: A real-time controllable vision-language-motion model. arXiv preprint arXiv:2508.07863, 2025.

[4] Changan Chen, Juze Zhang, Shrinidhi K Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, and Ehsan Adeli. The language of motion: Unifying verbal and non-verbal language of 3D human motion.
In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6200–6211, 2025.

[5] Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, and Grégory Rogez. PoseScript: 3D human poses from natural language. In European Conference on Computer Vision, pages 346–362. Springer, 2022.

[6] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.

[7] Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. arXiv preprint arXiv:2507.07095, 2025.

[8] Mihai Fieraru, Mihai Zanfir, Silviu Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. AIFit: Automatic 3D human-interpretable feedback models for fitness training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9919–9928, 2021.

[9] Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024.

[10] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2Motion: Conditioned generation of 3D human motions. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2021–2029, 2020.

[11] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3D human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.

[12] Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng.
MoMask: Generative masked modeling of 3D human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024.

[13] Chuan Guo, Inwoo Hwang, Jian Wang, and Bing Zhou. SnapMoGen: Human motion generation from expressive texts. arXiv preprint arXiv:2507.09122, 2025.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[15] Daniel Holden. Animation quality blog. https://theorangeduck.com/page/animation-quality, 2024.

[16] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. MotionGPT: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023.

[17] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[18] Klemens Lagler, Michael Schindelegger, Johannes Böhm, Hana Krásná, and Tobias Nilsson. GPT2: Empirical slant delay model for radio space geodetic techniques. Geophysical Research Letters, 40(6):1069–1073, 2013.

[19] Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022.

[20] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-X: A large-scale 3D expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36:25268–25280, 2023.

[21] Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-X: A large-scale 3D expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.

[22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee.
Improved baselines with visual instruction tuning, 2023.

[23] Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, and Ruimao Zhang. ScaMo: Exploring the scaling law in autoregressive motion generation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27872–27882, 2025.

[24] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.

[25] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023.

[26] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442–5451, 2019.

[27] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: VQ-VAE made simple. arXiv preprint arXiv:2309.15505, 2023.

[28] Mathis Petrovich, Michael J Black, and Gül Varol. TMR: Text-to-motion retrieval using contrastive 3D human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488–9497, 2023.

[29] Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset. Big Data, 4(4):236–252, 2016.

[30] Abhinanda R Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and Michael J Black. BABEL: Bodies, action and behavior with English labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 722–731, 2021.

[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[32] Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. WHAM: Reconstructing world-grounded humans with accurate 3D motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2070–2080, 2024.

[33] Shinichi Tanaka, Zhao Wang, Yoichi Kato, and Jun Ohya. Unlocking pretrained LLMs for motion-related multimodal generation: A fine-tuning approach to unify diffusion and next-token prediction. arXiv preprint arXiv:2503.06119, 2025.

[34] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[35] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.

[36] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[37] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[38] Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Qin Jin, and Zongqing Lu. Quo vadis, motion generation? From large language models to large motion models. arXiv preprint arXiv:2410.03311, 2024.

[39] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al.
InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023.

[40] Liang Xu, Shaoyang Hua, Zili Lin, Yifan Liu, Feipeng Ma, Yichao Yan, Xin Jin, Xiaokang Yang, and Wenjun Zeng. MotionBank: A large-scale video motion benchmark with disentangled rule-based annotations. arXiv preprint arXiv:2410.13790, 2024.

[41] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems, 35:38571–38584, 2022.

[42] Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14730–14740, 2023.

[43] Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, and Zongqing Lu. From pixels to tokens: Byte-pair encoding on quantized visual modalities. arXiv preprint arXiv:2410.02155, 2024.

[44] Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, and Wanli Ouyang. MotionGPT: Finetuned LLMs are general-purpose motion generators. arXiv preprint arXiv:2306.10900, 2023.

[45] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024.

[46] Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, and Richard Hartley. Motion anything: Any to motion generation. arXiv preprint arXiv:2503.06955, 2025.

Appendix

In this supplementary material, we provide additional details of OpenT2M in Section A, details on dividing the human body into independent parts in Section B, details of the evaluation metrics in Section C, and visualization examples of OpenT2M in Section D.
A Additional Analysis of OpenT2M

A.1 Data Distribution

Figure 5a shows the number of motion sequences across different subsets in OpenT2M on a logarithmic scale, demonstrating variations in dataset sizes. OpenT2M integrates 21 curated subsets, amounting to a comprehensive collection of 1 million motion sequences. A substantial portion of the motions in OpenT2M are extracted from web videos, such as Kinetics-700 [17] and InternVid [39], utilizing motion estimation models [32]. These motions undergo rigorous physical-feasibility validation and multi-granularity filtering. We require at least 8 visible keypoints, where the whole body corresponds to 17 keypoints, and each motion sequence must account for over 50% of the duration of the corresponding original video, ensuring temporal consistency and semantic validity. OpenT2M also integrates open-source human motion datasets [1, 2, 8], such as Motion-X [21]. Leveraging the proposed long-horizon motion curation pipeline, we construct 190K long-horizon motion sequences. OpenT2M-long comprises motions spliced from two, three, four, and five individual motion sequences. Figure 5b shows the average sequence length of OpenT2M across different subsets. The subset with the shortest average sequence length is PoseTrack, comprising merely 16.12 frames, while 3DPW exhibits the longest average length, exceeding 500 frames. Following a meticulous curation process, OpenT2M exhibits a substantially longer average length compared with previous work [3].
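The two filtering thresholds above (at least 8 of the 17 whole-body keypoints visible, and the motion covering more than half of the source video's duration) can be expressed as a simple predicate. The function name and interface below are ours, for illustration only:

```python
def passes_quality_filter(n_visible_kpts: int, motion_sec: float, video_sec: float,
                          min_visible: int = 8, min_coverage: float = 0.5) -> bool:
    """Sketch of the multi-granularity filter: enough visible keypoints
    (out of 17 whole-body keypoints) and sufficient temporal coverage
    of the source video.  Thresholds follow the paper."""
    if n_visible_kpts < min_visible:
        return False
    return motion_sec > min_coverage * video_sec

print(passes_quality_filter(12, 6.0, 10.0))  # -> True
```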
Figure 5: Statistics of the OpenT2M dataset. (a) Distribution of motion sequences across the different subsets (log scale), ranging from 190,538 sequences in OpenT2M-long down to 86 in 3DPW. (b) Average motion length across subsets, ranging from 572.92 frames for 3DPW down to 16.12 frames for PoseTrack.

A.2 Comparison of Long-horizon Datasets

We first detail the pipeline for long-horizon motion curation. Two different motion sequences are initially aligned in orientation by rotating the initial frame of the second sequence to match the facing direction of the last frame of the first sequence. Subsequently, the entire second sequence is translated spatially so that its position aligns with that of the last frame of the first sequence. Finally, a fixed transition duration is applied, during which spherical linear interpolation is performed between the last frame of the first motion and the initial frame of the second motion to ensure smooth kinematic continuity. To ensure that long-horizon motion sequences adhere to physical constraints, we use the concatenated motion sequence as reference poses for an RL policy, driving an avatar in IsaacGym to track the reference motion.
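The splicing step can be sketched in numpy. This is a minimal illustration assuming root rotations as unit quaternions and planar root positions; it omits the facing-direction alignment and the IsaacGym physics refinement described above, and the function names are ours:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0, q1 = q0 / np.linalg.norm(q0), q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:            # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:         # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def connect(first_xy, first_quat, second_xy, second_quat, n_trans=8):
    """Sketch of one splice: translate the second clip so its first root
    position matches the first clip's last frame, then slerp root rotations
    over a fixed transition window."""
    offset = first_xy[-1] - second_xy[0]
    second_xy = second_xy + offset                    # spatial alignment
    ts = np.linspace(0.0, 1.0, n_trans)
    trans_quat = np.stack([slerp(first_quat[-1], second_quat[0], t) for t in ts])
    trans_xy = np.stack([(1 - t) * first_xy[-1] + t * second_xy[0] for t in ts])
    return trans_xy, trans_quat
```

The transition frames returned here would be concatenated between the two aligned clips before the physics-based refinement pass.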
The resulting motion, refined through physical simulation, is adopted as the final long-horizon motion sequence. Figure 6 shows the length distribution comparison between OpenT2M-long and BABEL [30]. BABEL labels about 43 hours of mocap sequences from AMASS [26] with fine-grained action labels. BABEL exhibits substantial variation in motion length, containing motion sequences from 5s to over 100s. In BABEL, 37.9% of motion sequences last 5s or less, which significantly limits its effectiveness for evaluating the long-horizon motion generation capability of T2M models. In contrast, OpenT2M-long contains only 0.33% of motions within 5s. Furthermore, OpenT2M-long contains 20 times more motion sequences than BABEL. As a result, even intervals with relatively low proportions in OpenT2M-long may contain a larger number of motions than BABEL. For instance, motions lasting from 35s to 40s constitute only 0.76% of OpenT2M-long, yet this amounts to 1,454 motion sequences. Meanwhile, although the same interval accounts for a higher proportion (0.9%) in BABEL, it represents merely 89 motions.

Figure 6: Length distribution comparison between the OpenT2M-long and BABEL datasets. (a) OpenT2M-long length distribution. (b) BABEL length distribution.

A.3 Second-wise Text Annotation

Previous works [3, 38] typically annotate motion sequences by directly feeding the corresponding videos into Vision-Language Models (VLMs) to generate coarse textual descriptions. While this approach offers efficiency, it suffers from a critical limitation: motion sequences extracted from web videos often comprise complex and continuous motion clips. When VLMs are applied in an end-to-end manner to entire video clips, they tend to overlook fine-grained and crucial motion details. Such omissions impact the quality and utility of the annotated texts, particularly for applications requiring high temporal precision or detailed kinematic analysis.
In this work, we design a second-wise annotation scheme, as shown in Figure 7. The annotation task comprises second-wise captioning and a general summary task. The annotation process begins by uniformly extracting video frames every 0.5s. Each second of video frames is first annotated individually with a second-wise description. These second-wise captions are then summarized to form a precise caption for the entire video clip. During annotation, we deliberately exclude any descriptions of backgrounds, facial expressions, clothing, and other attributes that are irrelevant to human motion. We present annotated examples in Figure 8 to illustrate the precise alignment between text and motion. An example annotation:

{
  "second-wise": [
    {"second": 0, "start_frame": 0,  "end_frame": 2,  "text": "The person is standing still and looking to their right."},
    ...
    {"second": 2, "start_frame": 4,  "end_frame": 6,  "text": "The person raises their right hand to their forehead."},
    {"second": 3, "start_frame": 6,  "end_frame": 8,  "text": "The person continues to hold their right hand to their forehead."},
    {"second": 4, "start_frame": 8,  "end_frame": 10, "text": "The person lowers their right hand."},
    ...
    {"second": 6, "start_frame": 12, "end_frame": 14, "text": "The person is shaking hands with their right hand."},
    ...
    {"second": 9, "start_frame": 18, "end_frame": 20, "text": "The person is shaking hands with their right hand."}
  ],
  "summary": "The person performs a salute and then shakes hands with another individual."
}

The prompt template used for annotation is as follows:

You are a video annotation AI that describes and analyzes ONLY the visible physical motion of the person within the given BBOX in the videos.

## Input
You will receive a series of video frames in chronological order, with each frame sampled every 0.5 seconds. To analyze motion over each 1-second segment, group every three consecutive frames as follows:
• Second 1: frames 0, 1, 2 (time 0.0s-1.0s)
• Second 2: frames 2, 3, 4 (time 1.0s-2.0s)
• Second 3: frames 4, 5, 6 (time 2.0s-3.0s)
• And so on. In general, for the i-th second, analyze frames at indices (2i - 2), (2i - 1), and 2i, covering the time interval from (i - 1) to i seconds.

## Task
Your task consists of 2 parts:
1. Second-wise Caption: For each 1-second segment of video, give one sentence to describe the physical motion of the person within the given BBOX. If the person within the given BBOX is not visible during that second, return null.
2. General Summary: After listing the second-wise results, give one sentence that summarizes the overall physical motion of the person within the given BBOX. This summary should:
• Describe the overall physical motion of the person in the BBOX to highlight common types of motion (e.g., walking, playing basketball, pivoting), but ONLY when 100% certain.
• Add some action details about the limbs ONLY if clearly visible (e.g., left hand).

Figure 7: Prompt template for generating second-wise text annotations utilizing Gemini-2.5.

B Additional Details of 2D-PRQ

In this work, we propose 2D-PRQ, a tokenizer that divides the joints of the whole body into 5 parts:

• Left Hand: spine1, spine2, spine3, left collar, left shoulder, left elbow, left wrist
• Right Hand: spine1, spine2, spine3, right collar, right shoulder, right elbow, right wrist
• Left Leg: spine1, spine2, spine3, left hip, left knee, left ankle, left foot
• Right Leg: spine1, spine2, spine3, right hip, right knee, right ankle, right foot
• Torso: spine1, spine2, spine3, neck, left collar, right collar, head

The pelvis, spine1, spine2, and spine3 are shared across all parts, as they remain relatively stable during human motion.
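The five-part decomposition can be expressed as joint-name groups, with a merge step that averages joints shared between parts (the spine chain and collars). The dictionary layout and function below are illustrative, not the released code:

```python
SPINE = ["spine1", "spine2", "spine3"]  # shared across all five parts
PARTS = {
    "left_hand":  SPINE + ["left_collar", "left_shoulder", "left_elbow", "left_wrist"],
    "right_hand": SPINE + ["right_collar", "right_shoulder", "right_elbow", "right_wrist"],
    "left_leg":   SPINE + ["left_hip", "left_knee", "left_ankle", "left_foot"],
    "right_leg":  SPINE + ["right_hip", "right_knee", "right_ankle", "right_foot"],
    "torso":      SPINE + ["neck", "left_collar", "right_collar", "head"],
}

def merge_parts(part_feats):
    """Aggregate per-part joint features back into whole-body features,
    averaging joints that appear in several parts (spine chain, collars).
    `part_feats` maps part name -> list of 7 per-joint features."""
    sums, counts = {}, {}
    for part, joints in PARTS.items():
        for joint, feat in zip(joints, part_feats[part]):
            sums[joint] = sums.get(joint, 0.0) + feat
            counts[joint] = counts.get(joint, 0) + 1
    return {j: sums[j] / counts[j] for j in sums}
```

Each part holds exactly seven joints, matching the lists above; in the real tokenizer the features would be the per-joint rotation and position channels rather than scalars.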
Each joint is represented by relative 6D rotations and redundant 3D positions, resulting in a dimensionality of 63 + 8 per part, where the additional 8 dimensions comprise 4D root-node and 4D foot-contact information. When aggregating part features into motion features, we average the shared joints.

C Evaluation Metrics

Text-to-motion. We adopt R-Precision, MMDist, and FID to evaluate T2M models, following Guo et al. [11]. Each metric is defined as follows:

• R-Precision: This retrieval metric evaluates the semantic consistency between text and generated motion. It is computed as the accuracy with which the ground-truth text description is ranked Top-1 when retrieved by the generated motion from a text pool. Following Guo et al. [11], we set the size of the description pool to 32.
• MMDist: Multimodal Distance is computed as the average Euclidean distance between each motion feature and the corresponding text feature.
• FID: Fréchet Inception Distance measures the similarity between the distributions of generated and ground-truth motions in feature space. It is computed as the Fréchet distance between the two feature distributions.

Motion Reconstruction. We adopt FID and MPJPE to evaluate motion tokenizers on the motion reconstruction task.

• FID: As in T2M evaluation, Fréchet Inception Distance for motion reconstruction is computed as the Fréchet distance between the feature distributions of the reconstructed and ground-truth motions.
• MPJPE: Mean Per-Joint Position Error is computed by averaging the L2 distances between all joints of the reconstructed and ground-truth motions across all frames.

D Visualization Examples

We provide visualization examples of OpenT2M in Figure 8. The examples demonstrate that OpenT2M encompasses a diverse range of motion patterns and exhibits strong text-motion alignment, providing a high-quality data foundation for building large motion models.
Figure 8: Visualization examples of OpenT2M; each example is annotated with precise text. Example annotations include:

• "The person repeatedly lunges forward with their right arm extended and then retracts their arm while stepping back."
• "The person is performing a series of dance moves, involving rotations, leans, and arm extensions."
• "The person is performing push-ups, moving their chest up and down towards and away from the floor."
• "The person performs a series of slow, deliberate movements, characterized by shifting weight between legs, extending and retracting arms in a flowing motion."
• "The person is a softball pitcher who performs a pitching motion including shifting weight, raising their arm, and releasing the ball, followed by recovery."
• "The person is performing a weightlifting movement, starting from a deep squat and lifting the barbell overhead."
• "The person performs ballistic side lunges with dumbbells, alternating lunges to the right and left while maintaining an upright posture."
• "The person starts standing and talking, then bends down to pick up two dumbbells and continues to hold them in a bent-over position."