Paper deep dive
PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors
Chenxi Han, Shilu He, Yi Cheng, Linqi Ye, Houde Liu
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/22/2026, 6:11:21 AM
Summary
PRIOR is a single-stage reinforcement learning framework for humanoid locomotion that integrates a parametric gait generator, a GRU-based state estimator for egocentric depth-based terrain perception, and terrain-adaptive footstep rewards. It eliminates the need for adversarial training or multi-stage distillation, achieving 100% traversal success on complex terrains while providing a 3x training speedup on the Isaac Lab platform.
Entities (5)
Relation Signals (4)
PRIOR → BUILT_ON → Isaac Lab
confidence 100% · We present PRIOR, an efficient and reproducible framework built on Isaac Lab
PRIOR → DEMONSTRATED_ON → ZERITH Z1
confidence 95% · The proposed PRIOR framework was simulated and demonstrated on the ZERITH Z1 model.
PRIOR → OPTIMIZES_POLICY_USING → PPO
confidence 95% · The policy is optimized using Proximal Policy Optimization (PPO).
PRIOR → UTILIZES → GRU
confidence 95% · a GRU-based state estimator that infers terrain geometry
Cypher Suggestions (2)
Find all frameworks and the simulation platforms they are built on. · confidence 90% · unvalidated
MATCH (f:Framework)-[:BUILT_ON]->(s:Platform) RETURN f.name, s.name
Identify the robot models used by specific frameworks. · confidence 90% · unvalidated
MATCH (f:Framework)-[:DEMONSTRATED_ON]->(r:Robot) RETURN f.name, r.name
Abstract
Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty, including stairs, boxes, and gaps, demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.
Tags
Links
- Source: https://arxiv.org/abs/2603.18979v1
- Canonical: https://arxiv.org/abs/2603.18979v1
Full Text
44,207 characters extracted from source content.
PRIOR: Perceptive Learning for Humanoid Locomotion with Reference Gait Priors Chenxi Han1,2,∗, Shilu He2,∗, Yi Cheng2, Linqi Ye3, Houde Liu1,† 1 Tsinghua University 2 ZERITH Robotics 3 Shanghai University ∗ Equal contribution † Corresponding author Abstract Training perceptive humanoid locomotion policies that traverse complex terrains with natural gaits remains an open challenge, typically demanding multi-stage training pipelines, adversarial objectives, or extensive real-world calibration. We present PRIOR, an efficient and reproducible framework built on Isaac Lab that achieves robust terrain traversal with human-like gaits through a simple yet effective design: (i) a parametric gait generator that supplies stable reference trajectories derived from motion capture without adversarial training, (ii) a GRU-based state estimator that infers terrain geometry directly from egocentric depth images via self-supervised heightmap reconstruction, and (iii) terrain-adaptive footstep rewards that guide foot placement toward traversable regions. Through systematic analysis of depth image resolution trade-offs, we identify configurations that maximize terrain fidelity under real-time constraints, substantially reducing perceptual overhead without degrading traversal performance. Comprehensive experiments across terrains of varying difficulty—including stairs, boxes, and gaps—demonstrate that each component yields complementary and essential performance gains, with the full framework achieving a 100% traversal success rate. We will open-source the complete PRIOR framework, including the training pipeline, parametric gait generator, and evaluation benchmarks, to serve as a reproducible foundation for humanoid locomotion research on Isaac Lab.
I INTRODUCTION Deploying humanoid robots in human-centric environments requires locomotion controllers that can negotiate diverse terrain geometries—stairs, boxes, gaps—while preserving the natural bipedal gaits essential for safe and predictable coexistence with people and infrastructure. Reinforcement learning (RL) has emerged as the dominant paradigm for training such controllers, with recent work demonstrating impressive results in either terrain-robust locomotion through perceptive policies [30, 13] or natural gait synthesis through motion priors [16, 17]. Achieving both capabilities simultaneously, however, typically incurs substantial system complexity: multi-stage teacher–student distillation, adversarial discriminators for style enforcement, or extensive sim-to-real calibration. These requirements hinder reproducibility and raise the barrier to entry for locomotion research. In this work, we ask: can a single-stage RL pipeline, without adversarial training or distillation, produce perceptive humanoid locomotion that is both terrain-robust and natural? Figure 1: The proposed PRIOR framework was simulated and demonstrated on the ZERITH Z1 model. (A)–(D) illustrate the robot traversing four representative terrain types. We answer this question with PRIOR, a framework whose design is guided by three observations drawn from recent literature. (i) LiDAR-based elevation mapping, while geometrically precise, relies on odometric integration that accumulates drift during extended locomotion; egocentric depth images, by contrast, provide a self-contained terrain signal that is inherently free of such drift. 
(ii) Adversarial motion priors, the prevailing mechanism for imposing gait style, suffer from well-documented training pathologies—mode collapse, reward ambiguity, and hyperparameter sensitivity—that are exacerbated on challenging terrains where the policy must deviate significantly from reference motions; a parametric gait generator can offer comparable stylistic guidance through deterministic supervision, sidestepping these instabilities entirely. (iii) The ongoing transition from Isaac Gym to Isaac Lab as the community-standard simulation platform calls for training pipelines natively built on the newer stack; we develop PRIOR entirely within Isaac Lab, incorporating systematic optimizations that yield a 3× speedup over the vanilla Isaac Lab baseline. Concretely, PRIOR trains a single locomotion policy end-to-end through three mutually reinforcing mechanisms. A parametric gait generator produces phase-conditioned joint trajectories by dynamically blending motion capture primitives, supplying the policy with velocity-adaptive motion targets that replace adversarial style losses. A GRU-based state estimator fuses proprioceptive history with egocentric depth observations and is trained via self-supervised auxiliary objectives—heightmap reconstruction and linear velocity prediction—to distill local terrain geometry into a compact latent representation, requiring neither external localization nor manual annotation. Terrain-adaptive footstep rewards bias swing-leg placement toward geometrically favorable contact regions, enabling reliable footholds on discontinuous surfaces. Complementing these components, we conduct a depth resolution study that characterizes the Pareto frontier between reconstruction fidelity and computational cost, revealing that perceptual overhead can be substantially reduced with negligible impact on traversal performance. We validate PRIOR on simulated terrains of progressive difficulty—flat ground, boxes, staircases, and gaps.
Controlled ablations confirm that each component is individually necessary and that their combination produces synergistic gains beyond any subset, with the complete system reaching a 100% traversal success rate. To lower the barrier for future work, we will publicly release the full code. Our contributions are summarized as follows: • We present PRIOR, a single-stage RL framework that unifies depth-based terrain perception, parametric gait generation, and terrain-adaptive footstep rewards to achieve robust and natural humanoid locomotion—eliminating the need for adversarial objectives, teacher–student distillation, or multi-stage training. • We provide a depth resolution analysis that maps the Pareto frontier between terrain reconstruction quality and computational throughput, together with Isaac Lab-specific training optimizations that yield a 3× training speedup over the vanilla implementation. • We conduct comprehensive ablations that isolate the necessity and synergy of each component, and commit to open-sourcing the complete framework as a reproducible baseline on Isaac Lab. II RELATED WORK II-A Perception-Driven Robot Locomotion Early research on legged robot locomotion focused primarily on "blind walking" strategies based on proprioception. By constructing closed-loop feedback control using internal sensors such as joint encoders and IMUs, these methods achieved a degree of robust walking on unknown terrains. Studies such as [3, 10, 11, 9, 8] have demonstrated that strong terrain adaptability can be attained relying solely on internal sensing. Building on this, several works have introduced motion prior constraints to generate more agile and holistic locomotion patterns [26, 27]. However, due to the lack of direct environmental observation, these methods struggle with precise, proactive gait planning when encountering significant non-local terrain variations.
As task complexity has increased, recent research has gradually shifted toward end-to-end locomotion learning frameworks. One category of methods introduces two-stage teacher–student distillation frameworks with privileged information [30, 1, 4, 25]. Some studies have incorporated temporal structures and attention mechanisms to enhance the robustness and consistency of terrain representations [13, 6, 23], while other work has explored the integration of gait optimization [24]. Despite the significant progress made by these methods in complex environments, their motion generation often relies heavily on meticulous reward engineering or multi-stage training pipelines. This results in high system complexity and leaves room for improvement in the naturalness of the generated motion. II-B Motion-Prior-Based Robot Locomotion Research on motion priors primarily focuses on enhancing the human-likeness of robot movements and can be broadly categorized into explicit motion generation and stylistic constraints. One prominent category follows the imitation learning paradigm. Some approaches achieve complex skill synthesis and multi-action switching through explicit trajectory constraints [15, 28]. Others utilize latent variables or probabilistic models to model high-dimensional motion distributions in order to generate diverse human-like movements [17, 29]. Recently, diffusion models have emerged as a significant tool for constructing motion priors, demonstrating remarkable advantages [21, 12]. Despite significantly improving gait naturalness, these methods' insufficient modeling of environmental perception and online feedback limits their adaptability and robustness on complex terrain. Another category of methods utilizes reference motions as reward signals or discriminative criteria to guide the policy toward human-like styles in a weakly supervised manner. This approach enhances motion flexibility by relaxing strict trajectory-matching constraints.
Representative works include adversarial imitation for physical character control [18] and its applications in legged robot locomotion control [26, 14, 22]. Furthermore, some studies have utilized latent space modeling or Mixture-of-Experts (MoE) structures to enhance gait diversity while improving the continuity of motion generation [27, 24]. However, due to insufficient integration of environmental perception, learned human-like styles struggle to maintain consistency in real-world or unstructured environments. III METHOD In this section, we provide a detailed description of the implementation of the PRIOR framework. PRIOR utilizes an estimator that fuses temporal depth information with proprioception to estimate both terrain features and robot proprioceptive states. Furthermore, reference gaits derived from processed natural motion data are integrated as inputs, providing soft constraints on robot locomotion via gait-aware rewards. Our discussion is organized into four primary components: the overall perceptive locomotion framework, the generation and learning of human-like motions, the high-throughput training infrastructure, and the specific implementation details of the training process. Figure 2: Overview of the proposed PRIOR framework. The framework comprises three components: (a) Asymmetric actor–critic architecture for reinforcement learning. (b) State Estimator (yellow): fuses multimodal latent features for policy driving and performs self-supervised regression for velocity estimation, terrain reconstruction, and state prediction. (c) Reference Gait Generator (blue): synthesizes physics-consistent reference trajectories via phase normalization and velocity-driven weighted interpolation, while constraining locomotion through gait-aware rewards. III-A Perception-Driven Locomotion Framework III-A1 Asymmetric Actor–Critic Architecture As illustrated in Fig.
2, the proposed PRIOR framework adopts an asymmetric actor–critic architecture and is trained in a single-stage, end-to-end manner that enables the perception module and the control policy to co-evolve [13]. This design avoids the error amplification commonly encountered in two-stage teacher–student distillation pipelines. The policy is optimized using Proximal Policy Optimization (PPO) [20]. Actor network. The actor network takes as input (i) a 45-dimensional current proprioceptive observation o_t, (ii) a 163-dimensional state estimator output e_t, detailed in Section III-A2, and (iii) a 1-dimensional gait phase signal φ_t. The current proprioceptive observation is defined as:

o_t = [ω_t, g_t, c_t, θ_t, θ̇_t, a_{t−1}]^⊤  (1)

where ω_t is the body angular velocity, g_t is the gravity direction vector expressed in the body frame, c_t is the velocity command, θ_t and θ̇_t are the joint positions and velocities, respectively, and a_{t−1} is the action applied at the previous time step. Critic network. The critic network receives the noise-free base linear velocity v_t from the simulation environment, the proprioceptive observation o_t, the height map scan m_t provided by the RayCaster, and the reference gait phase φ_t. The input is defined as:

s_t = [v_t, o_t, m_t]^⊤  (2)

Reward and Action Space. We categorize the reward functions into four groups: task tracking, stability, smoothness, and safety. Specific rewards for foot–terrain interaction are crucial for humanoid balance and obstacle negotiation. These foot-related components, summarized in Table I, ensure gait rhythm, ground clearance, and precise landing on complex environmental features. The actor network outputs a 12-dimensional action vector a_t corresponding to the leg joints of the humanoid robot.
This action serves as a modulation of the default standing joint configuration θ^default, yielding the target joint positions θ_t^target:

θ_t^target = θ^default + a_t  (3)

The desired joint positions are then fed into a low-level PD controller to compute the target joint torques τ_t:

τ_t = K_p (θ_t^target − θ_t) − K_d θ̇_t  (4)

where the stiffness K_p and damping K_d are set to 60.0 and 2.0, respectively, to match the hardware specifications of ZERITH Z1.

TABLE I: LANDING STATE REWARD COMPONENTS
Reward | Description | Weight (w_i)
r_air | Promotes gait rhythm | 1.25
r_slide | Minimizes ground slipping | −0.10
r_dbl-air | Penalizes walking on one leg | −1.00
r_swing | Ensures leg lift height | −20.0
r_stumble | Prevents foot–obstacle tripping | −30.0
r_edge^{L/R} | Encourages safe foot placement | −2.00

III-A2 State Estimator The state estimator fuses proprioceptive sensing and depth visual perception. Compared with estimation methods that rely solely on visual information, this multimodal fusion architecture improves the real-time performance of state feedback and reduces estimation bias caused by noise in a single visual modality. The two input modalities consist of proprioceptive observations o_t^{H1} with a stacking horizon of H1 = 10, and temporally stacked depth images d_t^{H2} with a stacking horizon of H2 = 2, where each depth frame has a cropped resolution of [36, 64]. The proprioceptive observation o_t^{H1} is processed by a multilayer perceptron (MLP) encoder to extract a 128-dimensional proprioceptive state feature, while the stacked depth images d_t^{H2} are processed by a convolutional neural network (CNN) encoder to extract a 128-dimensional depth feature.
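As a minimal sketch of the control law in Eqs. (3)–(4) (not the authors' released code), the gains match the reported ZERITH Z1 values, while the function name and array shapes are our illustrative assumptions:

```python
import numpy as np

# Stiffness and damping reported for ZERITH Z1 (Eq. 4)
KP, KD = 60.0, 2.0

def pd_torques(theta_default, action, theta, theta_dot, kp=KP, kd=KD):
    """Eq. (3): target joints = default standing pose + policy action.
    Eq. (4): PD law mapping the position target to joint torques."""
    theta_target = theta_default + action
    return kp * (theta_target - theta) - kd * theta_dot
```

At 60 N·m/rad stiffness, a 0.1 rad action offset from the default pose commands 6 N·m of torque per joint before damping.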
The encoded features of the two modalities are then concatenated and fed into a single-layer gated recurrent unit (GRU) to generate a memory representation of the proprioceptive state and the terrain state. The output of the memory module is a 163-dimensional vector e_t, which serves as an input to the actor network and is composed of a 3-dimensional base velocity estimate v̂_t, a 32-dimensional latent vector z_t, and a 128-dimensional height map latent vector h_t:

e_t = [v̂_t, z_t, h_t]^⊤  (5)

The height map latent vector h_t is decoded by an MLP to obtain the estimated terrain state m̂_t, while e_t is decoded by another MLP to predict the next-step proprioceptive state ô_{t+1}. Using the privileged information available in the critic observations, self-supervised learning is applied to v̂_t, as well as to the decoded outputs m̂_t and ô_{t+1}, by minimizing the mean squared error (MSE). The overall loss of the state estimator is:

ℒ = MSE(v̂_t, v_t) + MSE(ô_{t+1}, o_{t+1}) + MSE(m̂_t, m_t)  (6)

III-B Reference Gait Priors Inspired by the Parameterized Motion Generator (PMG) framework [5], we present a data-efficient motion prior extraction method designed to achieve humanoid control on complex terrains using a minimal set of high-fidelity motion templates. In contrast to conventional imitation learning approaches that require hours of large-scale motion capture data, our method extracts structural features from core gait cycles to construct a continuous and physics-consistent reference gait space. This not only substantially lowers the overhead of data acquisition and preprocessing but also provides robust kinematic guidance for policy convergence in non-stationary terrain environments.
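The estimator's self-supervised objective in Eq. (6) is just a sum of three MSE terms against privileged targets; a hedged NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between prediction a and target b."""
    return float(np.mean((a - b) ** 2))

def estimator_loss(v_hat, v, o_next_hat, o_next, m_hat, m):
    """Eq. (6): MSE terms for velocity estimation, next-state
    prediction, and heightmap reconstruction, equally weighted."""
    return mse(v_hat, v) + mse(o_next_hat, o_next) + mse(m_hat, m)
```

In training these targets come from the critic's privileged observations, so no manual annotation is needed.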
III-B1 Motion Data Preprocessing We utilize a high-precision motion capture system to collect human movement data, ranging from static postures to various forward velocities, denoted as D^human. This data is then retargeted to our humanoid robot, ZERITH Z1, using optimization-based algorithms [7], [2], resulting in the robot dataset D^robot:

D^robot = {θ, θ̇, v, ω, c}  (7)

where θ denotes the joint angles, θ̇ the joint angular velocities, v the base linear velocity, ω the base angular velocity, and c = {μ, σ} the foot contact information, including the contact center and range. Since the raw retargeted dataset D^robot contains initial transitions, terminal decelerations, and measurement noise inherent in the motion capture process, we designed a preprocessing pipeline to extract high-fidelity, periodic, and smooth reference motion segments. To construct reference trajectories with strict periodicity, we utilize the foot contact information c to segment the motion data at various velocities. We employ a clip_range mechanism to extract a single, stable gait cycle T from the original long sequences. In practice, we typically select the second or third cycle from a sequence to avoid the non-stationary dynamics associated with acceleration or deceleration phases. Furthermore, to eliminate high-frequency noise introduced by the motion capture system and to ensure the smoothness of joint commands, we apply a 1D Gaussian filter to the raw joint sequences. The resulting processed robot dataset is represented as:

D^robot_clip = {θ, θ̇, v, ω, c, T}  (8)

where T denotes the duration of a single gait cycle. III-B2 Reference Gait Generation Given the commanded base velocity v = [v_x, v_y, ω] and the gait phase φ ∈ [0, 1), the reference joint trajectory is synthesized through a unified weighted interpolation framework.
For each velocity channel x ∈ {v_x, v_y, ω}, we determine an interpolation coefficient based on the magnitude of the commanded velocity. Assume the dataset contains a set of motion templates {θ_i(φ), T_i}, where θ_i(φ) denotes the phase-dependent joint trajectory and T_i is the corresponding gait period. Given a commanded velocity u_x, we select its two neighboring nominal velocities u_l and u_u, and define the interpolation factor as

α = clip((|u_x| − u_l) / (u_u − u_l + ε), 0, 1)  (9)

where ε is a small constant for numerical stability. The commanded gait period is obtained via linear interpolation:

T_cmd = (1 − α) T_l + α T_u  (10)

The gait phase is then updated according to the normalized phase progression rule:

φ_{t+1} = (φ_t + Δt / T_cmd) mod 1  (11)

After obtaining the updated phase φ, the reference joint trajectory is synthesized by blending the neighboring motion templates:

θ_d(φ) = (1 − α) θ_l(φ) + α θ_u(φ)  (12)

To handle near-zero velocity commands, we introduce a standing threshold v_th:

θ_d(φ) = θ_stand if ‖v‖ ≤ v_th, and (1 − α) θ_l(φ) + α θ_u(φ) otherwise  (13)

In addition, a velocity-dependent stance ratio ρ(v) is defined to construct phase-based contact indicators r^L(φ) and r^R(φ), which are later used for contact supervision and reward formulation. This velocity-conditioned interpolation strategy ensures smooth transitions across commanded speeds and maintains temporal consistency via unified phase evolution. III-B3 Gait-aware Reward Design To encourage the policy to remain consistent with the reference gait over complex terrains, we construct a set of exponential tracking reward terms based on the target joint trajectories and commanded base velocities generated by the reference gait module.
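The velocity-conditioned interpolation above (Eqs. 9–13) can be sketched in a few lines; the helper name `blend_reference` and the callable-template interface are our assumptions, not the paper's API:

```python
import numpy as np

def blend_reference(phase, dt, u_cmd, u_lo, u_hi, theta_lo, theta_hi,
                    T_lo, T_hi, theta_stand=None, v_th=0.05, eps=1e-6):
    """One velocity channel of the reference gait generator.

    theta_lo / theta_hi are callables mapping phase -> joint vector."""
    # Eq. (9): interpolation factor between neighbouring templates
    alpha = float(np.clip((abs(u_cmd) - u_lo) / (u_hi - u_lo + eps), 0.0, 1.0))
    # Eq. (10): interpolated gait period; Eq. (11): phase progression
    T_cmd = (1.0 - alpha) * T_lo + alpha * T_hi
    phase = (phase + dt / T_cmd) % 1.0
    # Eq. (13): snap to the standing pose near zero commanded velocity
    if theta_stand is not None and abs(u_cmd) <= v_th:
        return phase, theta_stand
    # Eq. (12): blend the neighbouring phase-conditioned trajectories
    return phase, (1.0 - alpha) * theta_lo(phase) + alpha * theta_hi(phase)
```

Commanding a velocity midway between the two nominal templates yields a 50/50 blend of their joint trajectories, with the gait period interpolated accordingly.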
Each reward term follows a unified exponential form:

r_i = exp(−λ_i e_i)  (14)

where e_i denotes the tracking error of the corresponding physical quantity and λ_i is a scaling coefficient. All gait-related reward terms are combined as a weighted sum:

r_gait = Σ_i w_i r_i  (15)

where w_i is the weight of each component. The detailed definitions of each gait reward term are summarized in Table II. These reward components constrain the policy from four complementary aspects: pose consistency, velocity matching, dynamic motion trend tracking, and key support joint stabilization. This design allows the robot to preserve the periodic structure imposed by the reference gait generator while remaining adaptable to complex environments.

TABLE II: GAIT-AWARE REWARD COMPONENTS
Reward | Error term (e_i) | Weight (w_i)
r_pos | ‖θ − θ_d(φ)‖² | 0.10
r_vel | ‖v_b − v‖² | 0.05
r_Δ | ‖Δθ − Δθ_d(φ)‖₁ | 0.05
r_ankle | Σ_{j∈{L,R}} (θ_j − θ_{d,j}(φ))² | 0.05

III-C High-Throughput Training Infrastructure To address the computational overhead and GPU memory (VRAM) bottlenecks caused by high-dimensional depth perception in massively parallel environments, we developed a systematic engineering optimization framework on the NVIDIA Isaac Sim and Isaac Lab platform. Based on our self-developed ZERITH Z1 humanoid robot model, we constructed the training environment and improved overall system throughput from two perspectives: memory management and rendering strategy. III-C1 Heterogeneous Observation Buffer Management VRAM capacity is a primary factor limiting the degree of parallelism (i.e., N_env) in reinforcement learning. To address the high-dimensional tensor storage pressure introduced by depth images, we propose a heterogeneous memory management scheme designed to decouple physics computation from data caching.
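Returning briefly to the gait-aware rewards, Eqs. (14)–(15) amount to a few lines; the λ_i scaling coefficients are not reported in the text, so this sketch (ours) leaves them as parameters:

```python
import math

def gait_reward(errors, lambdas, weights):
    """Eq. (14): r_i = exp(-lambda_i * e_i); Eq. (15): r_gait = sum_i w_i r_i."""
    return sum(w * math.exp(-lam * e)
               for e, lam, w in zip(errors, lambdas, weights))
```

With zero tracking error every exponential term equals 1, so the reward saturates at the sum of the weights; larger errors decay each term toward 0.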
Under this mechanism, VRAM serves only as a transient buffer for rendering outputs, while the generated observation tensors are asynchronously transferred to host memory (RAM) for storage and indexing. Experiments show that this strategy frees substantial GPU computational resources. As shown in Table III, on an RTX 4090 (24 GB) platform, the maximum number of parallel environments for vision-based tasks increases from the baseline of 512 to 1024. In a 48 GB VRAM configuration, the parallel scale further extends to 1536 environments, substantially improving sample efficiency during training.

TABLE III: PARALLEL SCALE UNDER DIFFERENT MEMORY STRATEGIES
GPU | Max N_env | Storage | VRAM
4090 (24 GB) | 512 | GPU | ~24 GB
4090 (24 GB) | 1024 | CPU | ~24 GB
4090 (48 GB) | 1536 | CPU | ~48 GB

III-C2 Render-time Pre-processing Optimization To optimize the vision processing pipeline, we first define the core perceptual requirement: the system must reliably distinguish terrain features with a minimum height of 5 cm. We derive the spatial resolution from the geometric camera model:

r_v = z_0 sin(−δ) / (sin(α) sin(α + δ)) = z_0 sin(β/h) / (sin(α) sin(α − β/h))  (16)

where z_0 is the camera mounting height, α is the pitch angle, β is the vertical field of view (FOV), h is the image height in pixels, and the auxiliary variable δ = −β/h. Under the configuration z_0 = 0.8 m, β = 58°, and α = 45°, an effective vertical resolution of h = 36 px yields a spatial resolution of r_v = 0.0463 m/pixel at a typical measurement distance of 1.13 m. Since 0.0463 m < 0.05 m, this configuration satisfies the critical constraint for sensing 5 cm terrain variations while minimizing the input dimensionality. Based on this analysis, we directly render a low-resolution depth buffer of 45 × 80 instead of high-resolution frames.
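The resolution model in Eq. (16) is easy to check numerically; this small sketch (ours) reproduces the reported 0.0463 m/pixel figure from the stated camera configuration:

```python
import math

def vertical_resolution(z0, alpha_deg, beta_deg, h):
    """Eq. (16): ground-plane extent of one pixel row for a pitched depth camera.
    z0: mount height (m); alpha: pitch (deg); beta: vertical FOV (deg); h: rows."""
    a = math.radians(alpha_deg)
    d = math.radians(beta_deg) / h   # angular extent of one pixel row (= -delta)
    return z0 * math.sin(d) / (math.sin(a) * math.sin(a - d))
```

For z0 = 0.8 m, α = 45°, β = 58°, h = 36, the function returns about 0.0463 m/pixel, just under the 5 cm feature-height requirement.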
A 36 × 64 region is extracted via center cropping, followed by stochastic depth perturbations to improve robustness. This render-time optimization reduces perception overhead and GPU memory bandwidth usage, enabling the policy to process visual observations at a significantly higher frequency, which is crucial for maintaining stable locomotion on highly irregular terrains. The overall pipeline is illustrated in Fig. 3. Figure 3: Depth data flow: images in the top and bottom rows represent samples from different parallel environments. III-D Training Details III-D1 Training Curriculum To prevent policy instability caused by overly challenging terrains during the early stage of training, we adopt an adaptive terrain curriculum based on the traveled distance of the robot, following the approach in [19]. In simulation, curriculum training is conducted simultaneously over four types of terrain: Pyramid Stairs, Inverted Stairs, Boxes, and Plane. All four terrain types are native terrain generators provided by Isaac Lab. Detailed terrain configurations and curriculum settings are illustrated in Fig. 4.

Figure 4: Overview of the terrain curriculum for training. (A)–(D) represent distinct terrain types posing various physical challenges. Table (E) details the parameter ranges and the distribution (weight) for each terrain type:

Terrain Type | Pyramid Stairs | Inverted Stairs | Boxes | Plane
Range (m) | [0.05, 0.23] | [0.05, 0.23] | [0.05, 0.20] | –
Parameter | step height | step height | obstacle height | flat surface
Weight | 0.2 | 0.2 | 0.2 | 0.1
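The render-time depth pipeline (45 × 80 render, 36 × 64 center crop, stochastic perturbations) might be sketched as follows; perturbation magnitudes follow the domain-randomization ranges reported for depth images, and the function is our illustration, not the released code:

```python
import numpy as np

def preprocess_depth(depth, out_hw=(36, 64), bias=0.04, sigma=0.02,
                     hole_p=0.03, rng=None):
    """Center-crop a rendered depth frame and apply stochastic perturbations:
    a per-frame depth bias, per-pixel Gaussian noise, and simulated holes."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = depth.shape
    h, w = out_hw
    y0, x0 = (H - h) // 2, (W - w) // 2
    crop = depth[y0:y0 + h, x0:x0 + w].astype(float).copy()
    crop += rng.uniform(-bias, bias)                  # per-frame depth bias
    crop += rng.normal(0.0, sigma, crop.shape)        # per-pixel sensor noise
    crop[rng.random(crop.shape) < hole_p] = 0.0       # simulated depth holes
    return crop
```

Rendering at 45 × 80 and cropping to 36 × 64 keeps the input dimensionality near the minimum that still resolves 5 cm terrain features.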
III-D2 Domain Randomization To enhance robustness and sim-to-real transfer, we apply domain randomization across physical dynamics, initial conditions, and visual inputs. The corresponding perturbation ranges are listed in Table IV.

TABLE IV: DOMAIN RANDOMIZATION RANGES
Parameter | Randomization range | Unit
Base payload | [−5.0, 5.0] | kg
Link mass factor | [0.8, 1.2] | –
Center of mass shift | [−0.15, 0.15] | m
Friction coefficient | [0.2, 1.5] | –
K_p factor | [0.9, 1.1] | –
K_d factor | [0.9, 1.1] | –
Joint armature | [2 × 10⁻³, 2 × 10⁻²] | kg·m²
Initial base position (x, y) | [−0.5, 0.5] | m
Initial base orientation (yaw) | [−π, π] | rad
Initial base linear velocity | [−0.5, 0.5] | m/s
Initial base angular velocity | [−0.5, 0.5] | rad/s
Initial joint position scale | [0.5, 1.5] | –
Depth image bias | [−0.04, 0.04] | m
Depth image noise (σ) | 0.02 | m
Depth hole probability | 0.03 | –

IV EXPERIMENT

TABLE V: ABLATION STUDY RESULTS OF THE PRIOR FRAMEWORK ON TERRAIN ADAPTABILITY
Method | Mean Level | Pyramid Stairs | Inverted Stairs | Boxes | Plane | Mean Reward
PRIOR (ours) | 5.7533 | 1.0 | 1.0000 | 1.0000 | 1.0 | 26.3462
PRIOR w/o reference gait | 5.7735 | 1.0 | 1.0000 | 1.0000 | 1.0 | 23.7233
PRIOR w/o m̂_t | 5.7672 | 1.0 | 0.7734 | 1.0000 | 1.0 | 13.1775
PRIOR w/o d_t^{H2} | 5.4627 | 1.0 | 0.3750 | 0.9687 | 1.0 | 10.1463
PRIOR with H1 = 6 | 5.7417 | 1.0 | 1.0000 | 1.0000 | 1.0 | 19.3234
PRIOR w/o landing state reward | 5.7403 | 1.0 | 1.0000 | 1.0000 | 1.0 | 22.6262

IV-A Experimental Setting Based on the optimization architecture described in Section III-C, we conduct high-throughput policy training on a single NVIDIA RTX 4090 (24 GB) GPU. The system supports 1024 parallel environments simultaneously performing depth perception and physics simulation, significantly improving data collection efficiency while maintaining stable simulation dynamics. The policy converges after approximately 12,000 training iterations.
The depth images are updated at 30 Hz, while the control policy is executed at 50 Hz, ensuring a balance between perception refresh rate and control stability. The trained policy is exported via ONNX and deployed directly on the onboard computing unit of the ZERITH Z1 humanoid robot. The ZERITH Z1 model has 23 degrees of freedom (DoF), including 6 DoF per leg, 3 DoF at the waist, and 4 DoF per arm. IV-B Simulation Ablation Studies IV-B1 Experimental Configurations To systematically analyze the contribution of each component in the PRIOR framework, we design a staged ablation study to separately evaluate the perception-driven locomotion architecture and the reference gait prior. First, four ablated variants are compared against the configuration without the reference gait prior ("PRIOR w/o reference gait") to investigate the impact of architectural components within the perception–motion framework. Subsequently, the "PRIOR w/o reference gait" model is compared with the full PRIOR framework to quantify the contribution of the reference gait prior to locomotion stability and behavioral quality. The detailed configurations are as follows: • PRIOR (Ours): The complete PRIOR framework. • PRIOR w/o reference gait: The PRIOR framework without the humanoid reference gait constraint. • PRIOR w/o m̂_t: The PRIOR framework without explicit terrain estimation and elevation map supervision. • PRIOR w/o d_t^{H2}: The PRIOR framework with temporally stacked depth observations removed from the estimator input. • PRIOR with shorter H1: The PRIOR framework with a reduced proprioceptive history length (H1 = 6). • PRIOR w/o landing state reward: The PRIOR framework without the designed landing state reward terms. IV-B2 Evaluation Procedure All six policies are trained in the Isaac Lab simulator under the same terrain curriculum setting.
We evaluate the policies using the following quantitative metrics:

• Curriculum capability: the mean terrain level achieved (mean level) and the maximum successfully traversed level across terrain types (Pyramid Stairs, Inverted Stairs, Boxes, and Plane).
• Training performance: total reward, convergence speed, and training stability.

Figure 5: The upper panel illustrates the training curves for mean reward, while the lower panel displays the mean terrain level achieved over training iterations.

IV-B3 Results and Discussion

As shown in Table V and Fig. 5, the ablation results demonstrate the effectiveness of each component.

Reward–stability correlation: Although the "PRIOR w/o reference gait" variant achieves a slightly higher mean curriculum level (5.7735), its average reward (23.7233) is approximately 10% lower than that of the full framework (26.3462). This suggests that traversal capability alone does not guarantee motion quality or efficiency.

Behavioral rationality: The reference gait prior contributes not only to performance but also to motion quality. Empirically, policies without the gait prior tend to exploit unstable or high-frequency oscillatory motions to maximize traversal success. With the humanoid gait constraint, the robot maintains near-perfect terrain success rates (all terrain metrics reaching 1.0) while achieving significantly smoother and more energy-efficient locomotion, as reflected by higher reward values. This property is crucial for real-world deployment.

Explicit terrain estimation and supervision (PRIOR w/o m̂_t): Removing explicit terrain estimation reduces the average reward to 13.1775. Although this variant performs better than the one without temporal depth information, its success rate on complex terrains such as Inverted Stairs (0.7734) remains substantially lower than the full model's. This indicates that explicit terrain representation and elevation supervision significantly enhance the policy's understanding of geometric structures.
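The "approximately 10%" figure quoted for the reference-gait ablation can be checked directly from the Table V rewards:

```python
# Mean rewards from Table V.
FULL_REWARD = 26.3462      # PRIOR (ours)
NO_GAIT_REWARD = 23.7233   # PRIOR w/o reference gait

# Relative reward drop when the reference gait prior is removed.
relative_drop = (FULL_REWARD - NO_GAIT_REWARD) / FULL_REWARD
# ≈ 0.0996, i.e. roughly a 10% reduction in mean reward
```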
Temporal depth information (PRIOR w/o d_t^{H_2}): This variant exhibits the worst performance, with an average reward of 10.1463 and a sharp drop in success rate on Inverted Stairs (0.3750). Without temporal depth information, the robot loses the ability to anticipate terrain variations, leading to unstable foothold planning during dynamic locomotion.

Proprioceptive history length (H_1): Reducing the proprioceptive history length lowers the average reward to 19.3234. A longer history window (H_1 = 10) enables the system to implicitly estimate physical properties such as ground friction and center-of-mass deviation, thereby improving robustness under unknown disturbances.

Landing state reward: As illustrated in Fig. 6, removing the landing state reward yields less stable foot placements and lowers the mean reward to 22.6262. This qualitative comparison reinforces that our fine-grained reward design effectively shapes behavior during the landing transient.

Figure 6: Comparison of foot-placement behavior with and without the landing state reward.

V CONCLUSIONS

This paper presents PRIOR, an efficient and reproducible single-stage reinforcement learning framework developed in the Isaac Lab environment for perception-aware humanoid locomotion. The proposed method addresses the challenge of integrating perception and control for humanoid robots operating over complex terrains. PRIOR combines a GRU-based state estimator with explicit terrain reconstruction and a parameterized gait generator within a unified learning pipeline. This design enables the ZERITH Z1 humanoid model to traverse diverse challenging terrains with high precision and robustness. Experimental results demonstrate that PRIOR achieves high average rewards and maintains a 100% traversal success rate across various complex terrains.
Furthermore, the learned policy consistently handles terrains of relatively high difficulty, highlighting the effectiveness and generalization capability of the proposed framework. Despite these advantages, the current work is limited by the absence of real-world deployment experiments. Future research will focus on developing more generalizable dynamics adaptation algorithms to enable seamless sim-to-real transfer from Isaac Lab to the physical ZERITH Z1 platform.