Paper deep dive
AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control
Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 98%
Last extracted: 3/22/2026, 5:08:27 AM
Summary
AerialVLA is an end-to-end Vision-Language-Action framework for UAV navigation that replaces modular, oracle-dependent systems with a unified perception-action loop. It utilizes a minimalist dual-view perception strategy, fuzzy directional prompting derived from onboard sensors, and numerical action tokenization to achieve robust 3-DoF control and intrinsic landing without external object detectors.
Entities (5)
Relation Signals (3)
AerialVLA → evaluatedon → TravelUAV
confidence 100% · Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance
AerialVLA → integrates → LLaMA 2
confidence 100% · unifies a hybrid visual encoder with a Llama 2 [32] language model
AerialVLA → uses → OpenVLA-7B
confidence 100% · our architecture is built upon the OpenVLA-7B [13] backbone
Cypher Suggestions (2)
Find all benchmarks used to evaluate the AerialVLA framework. · confidence 95% · unvalidated
MATCH (f:Framework {name: 'AerialVLA'})-[:EVALUATED_ON]->(b:Benchmark) RETURN b.name
Identify the model backbone used by AerialVLA. · confidence 95% · unvalidated
MATCH (f:Framework {name: 'AerialVLA'})-[:USES]->(m:Model) RETURN m.name
Abstract
Abstract: Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.
Tags
Links
- Source: https://arxiv.org/abs/2603.14363v1
- Canonical: https://arxiv.org/abs/2603.14363v1
Full Text
48,979 characters extracted from source content.
AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu 1, Zhengnan Deng 1, Jiayan Deng 1, Zonghua Gu 2, and Shaohua Wan 1⋆
1 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China (pxu023@gmail.com, shaohua.wan@uestc.edu.cn)
2 Department of Computer Science, Hofstra University, Hempstead, USA (zonghua.gu@hofstra.edu)

Abstract. Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on dense oracle guidance. Ultimately, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments.
Furthermore, it exhibits superior generalization in unseen scenarios by achieving nearly three times the success rate of leading baselines, validating that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems. Code is available at: https://github.com/XuPeng23/AerialVLA

Keywords: Vision-Language Navigation · Vision-Language-Action · UAV Control · End-to-End Learning

⋆ Corresponding author. arXiv:2603.14363v1 [cs.CV] 15 Mar 2026

Fig. 1: Comparison of UAV navigation paradigms. While existing modular methods (red) rely on oracle guidance and external detectors, AerialVLA (green) achieves agile autonomous navigation and precise landing via a unified end-to-end policy driven by fuzzy onboard hints and intrinsic stopping.

1 Introduction

Vision-Language Navigation (VLN) has seen remarkable progress in ground-based agents. However, extending VLN to Unmanned Aerial Vehicles (UAVs) introduces unique challenges due to the complexity of 3D open-world environments. This transition marks a paradigm shift in aerial robotics, as underscored by recent surveys [11,24,31], evolving from automated flight to agentic UAVs capable of autonomous perception, reasoning, and interaction. The field has rapidly advanced alongside the development of specialized benchmarks, ranging from object goal navigation [36] to high-level spatial reasoning [7,40] and realistic flight simulation [8,34].
While this flourishing ecosystem has established rigorous standards for evaluating the cognitive understanding of a scene, the challenge of mastering continuous 6-Degree-of-Freedom (6-DoF) physical execution remains arduous. Unlike ground robots constrained to a 2D plane, UAVs must navigate in a full 6-DoF state space, managing 3D position and orientation, while interpreting visual cues from actively changing ego-centric viewpoints. This introduces unique challenges in continuous control under gravitational and inertial constraints. The goal of UAV-VLN is to enable drones to navigate to a natural-language-described target autonomously. This capability is critical for applications ranging from search and rescue to remote inspection, where GPS signals may be unreliable or target coordinates are unknown. Despite this potential, existing UAV-VLN approaches often suffer from a reliance on what we term “double crutches”, which limits their autonomy in the wild. A primary issue is the dependency on oracle guidance. State-of-the-art benchmarks such as TravelUAV [34] and recent methods [12,39] typically inject dense, ground-truth-derived directional hints (e.g., “Turn Right”) directly into the input prompts. This practice effectively degrades the navigation agent into a passive instruction follower, bypassing the core challenge of active spatial reasoning and path planning. Furthermore, these modular pipelines frequently necessitate an external object detector, such as Grounding DINO [17], to trigger the landing phase due to a lack of fine-grained visual grounding. This dependency creates a disjointed perception-control loop where the policy learns how to move but relies on an external black box to decide when to stop, thereby reducing system robustness when detectors fail in open-world scenarios. As illustrated in Figure 1, to address the aforementioned double crutches, we present AerialVLA, a minimalist end-to-end Vision-Language-Action framework.
Unlike existing modular approaches (red trajectory) that suffer from rigid paths and semantic gaps due to dense oracle guidance and external detectors, AerialVLA (green trajectory) establishes a unified perception-action loop. By replacing exact oracles with fuzzy onboard directional hints, our agent is driven to perform active visual grounding, enabling agile and robust spatial reasoning. Concurrently, by mapping raw observations directly to continuous physical signals, AerialVLA unlocks an integrated intrinsic landing, seamlessly unifying long-distance cruising and precision stopping within a single policy. In summary, our main contributions are three-fold:

– Minimalist Dual-View Perception: We fuse front and down views into a streamlined visual interface aligned with consumer UAV hardware. This design discards multi-camera redundancies while preserving the essential geometric and semantic cues required for forward navigation and precise target grounding.
– Fuzzy Directional Prompting: We eliminate the reliance on step-by-step oracle guidance by introducing fuzzy directional prompts. Derived solely from onboard IMU estimations, this formulation accommodates real-world localization uncertainty and forces the agent to learn robust, active spatial reasoning rather than passive instruction following.
– High-DoF Control via Numerical Tokenization: We tokenize a continuous 3-DoF action space compatible with standard UAV control APIs, effectively leveraging the pre-trained numerical reasoning of LLMs. This unlocks an end-to-end intrinsic stopping policy, unifying cruising and precision landing without external object detectors.

2 Related Work

2.1 Benchmarks and Datasets for Aerial Agents

High-quality simulators drive aerial autonomy. AerialVLN [18] and AVDN [6] pioneered instruction-following and dialog tasks, while UAV-ON [36] introduced open-world object navigation.
Recent focus expanded to cognitive capabilities: SpatialSky-Bench [40] evaluates spatial reasoning, and UAVBench [7] assesses mission planning. However, these benchmarks primarily evaluate high-level perception or abstract logic. While OpenFly [8] offers diverse trajectories via discrete macro-actions, TravelUAV [34] focuses on 3D continuous navigation precision. We adopt TravelUAV to rigorously evaluate our agent’s continuous maneuvering capabilities.

2.2 Vision-Language Navigation for UAVs

As a critical domain of embodied intelligence, the UAV-VLN task requires the integration of multimodal perception, spatial reasoning, and motion planning in complex 3D environments [37]. Unlike prior surveys that categorize methods by algorithmic foundations, we classify existing approaches based on their control granularity to explicitly highlight end-to-end execution challenges.

High-Level Planning and Waypoint Prediction. Most works, including early baselines like CMA [1] and AVDN [6], formulate navigation as a planning problem, predicting spatial waypoints for low-level controllers (e.g., PID). These include modular systems (CityNavAgent [41], SkyVLN [15], AeroDuo [35]) generating long-horizon or collaborative plans, and mission-generation frameworks (UAV-VLA [25]). Learning-based policies (TravelUAV [34], OpenVLN [16], LongFly [12], NavFoM [39], FlightGPT [3]) regress waypoints, targets, or trajectories via fine-tuned foundation models and spatiotemporal contexts, while ANWM [42] evaluates candidate trajectories using a generative world model. While effective for task decomposition, decoupling perception from control discards fine-grained visual cues critical for aerodynamic stability, introduces severe inference latency, and forces reliance on semantically unaware low-level controllers.

Training-Free and Reasoning-Centric Approaches. A parallel trend exploits frozen foundation models [21,29] to bypass policy training.
For instance, SPF [9] prompts them to predict 2D waypoints for heuristic 3D unprojection, which often violates physical constraints due to absent depth perception. Meanwhile, TypeFly [4] generates code primitives for task planning. Despite their strong zero-shot capabilities, the sequential token generation of these massive LLMs creates inference latency fundamentally incompatible with real-time aerial agility.

End-to-End Continuous Control. This paradigm maps visual observations directly to continuous control signals (e.g., velocity, thrust). RaceVLA [26] and CognitiveDrone [20] pioneer this direction by adapting fine-tuned VLA models for flight. Nevertheless, these works are primarily confined to structured environments like racing tracks, simplifying navigation into discrete gate-selection tasks. While CognitiveDrone adds an auxiliary VLM, this decoupled, lower-frequency reasoning remains separated from the high-frequency control loop. Lacking intrinsic and unified spatial reasoning, these models struggle with the semantic grounding required for unstructured open-world exploration. To overcome the dichotomy between decoupled hierarchical planning and structurally confined continuous control, AerialVLA introduces a minimalist VLA paradigm. By unifying fuzzy directional prompting with an intrinsic landing policy, we achieve robust, open-world reasoning and precise maneuvering without relying on external detectors or heavy reasoning chains.
Fig. 2: The architecture of AerialVLA. The framework processes multimodal inputs to generate continuous control signals end-to-end. (a) Language Input constructs prompts with fuzzy directional hints derived from the IMU, eliminating oracle reliance. (b) Visual Input fuses front and down views via a vertical mosaic. (c) The AerialVLA Model utilizes a Llama-2 backbone with LoRA to autoregressively predict numerical tokens. (d) Output and Action Stage decodes tokens into spatial offsets for velocity control, or triggers the dual-condition landing.

2.3 Vision-Language-Action Models: From Ground to Air

The paradigm of Vision-Language-Action (VLA) models has revolutionized robotic control by unifying perception and action within a single transformer. Generalist models like RT-2 [43] and OpenVLA [13] demonstrate remarkable generalization by fine-tuning VLMs to output robot actions directly as language tokens. While end-to-end learning has been successfully applied to ground navigation [14,27,30] to bypass traditional explicit map construction [2], existing VLA successes remain predominantly focused on quasi-static manipulation tasks or 2D ground rovers. Extending the VLA paradigm to aerial platforms introduces fundamental challenges absent in ground-based settings. First, UAVs operate in a fully dynamic 6-DoF state space subject to continuous gravitational and inertial forces, where pitch and roll dynamics are tightly coupled with translational motion.
Second, unlike reversible manipulation actions, flight control requires managing tight kinematic constraints where failure is often terminal (e.g., collision). AerialVLA bridges this gap by adapting the VLA architecture specifically for aerial embodiment. We demonstrate that a direct token-prediction approach, when grounded in large-scale expert demonstrations, effectively masters the continuous 3-DoF control required for high-stakes, open-world aerial navigation.

3 Method

3.1 Overview

We present AerialVLA, an end-to-end Vision-Language-Action framework designed to endow Unmanned Aerial Vehicles (UAVs) with autonomous navigation capabilities in open-world environments. As illustrated in Figure 2, our architecture is built upon the OpenVLA-7B [13] backbone, which unifies a hybrid visual encoder with a Llama 2 [32] language model using the Transformer architecture [33]. To adapt this generalist model for aerial navigation in dynamic 6-DoF state space, we introduce three key innovations: (1) a streamlined Dual-View Perception interface that balances informational gain with computational efficiency; (2) a Fuzzy Directional Prompting mechanism eliminating reliance on oracle assistance; and (3) a Numerical Action Tokenization strategy that leverages the pre-trained numerical reasoning of LLMs for precise control.

3.2 Minimalist Dual-View Perception

While recent UAV-VLN benchmarks typically simulate multi-camera arrays (e.g., five distinct views), processing such high-dimensional inputs significantly increases computational overhead and inference latency, hindering real-time deployment. Moreover, redundant views often contribute marginal utility for forward-facing tasks. To address this, we propose a minimalist dual-view strategy that aligns with consumer UAV hardware configurations.
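As a rough sketch of this dual-view composition (a vertical stack of the front and down views, resized to 224×224 as detailed below), the following minimal NumPy code illustrates the idea; the function names and the nearest-neighbour resize are our own stand-ins, not the paper's preprocessing pipeline:

```python
import numpy as np

def compose_views(front: np.ndarray, down: np.ndarray) -> np.ndarray:
    """Vertically stack the front and down views (each H x W x 3) into the
    composite I_comp = [I_front; I_down] of shape 2H x W x 3."""
    assert front.shape == down.shape, "both views must share one resolution"
    return np.concatenate([front, down], axis=0)

def resize_nearest(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbour resize to size x size; a crude stand-in for the
    model's real image preprocessing, which is not reproduced here."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row index for each output row
    cols = np.arange(size) * w // size  # source column index for each output column
    return img[rows][:, cols]
```

A 2H×W mosaic resized to a square necessarily compresses the vertical axis; the paper argues the fully fine-tuned projector compensates for this compression.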
We explicitly process only the front and down views, where the front view provides critical cues for obstacle avoidance and target identification, while the down view is essential for precise ground alignment and landing maneuvers. Formally, given the front image I_front ∈ R^(H×W×3) and down image I_down ∈ R^(H×W×3) from the AirSim simulator [28], we perform vertical concatenation to form a composite observation I_comp = [I_front; I_down] ∈ R^(2H×W×3). This composite is resized to the required 224×224 resolution and processed by a hybrid visual encoder combining SigLIP [38] and DINOv2 [22]. Both encoders utilize the Vision Transformer (ViT) [5] architecture, with SigLIP providing language-aligned semantic features via CLIP-style [23] objectives and DINOv2 capturing robust, fine-grained spatial representations. To seamlessly map this dual-view input into the LLM embedding space, we fully fine-tune the visual projector. This adaptation ensures that the vertically compressed inputs effectively preserve crucial navigation cues, such as object categories and terrain textures. Furthermore, the default 90° Field-of-View (FOV) in AirSim ensures the front and down views naturally align at their horizontal boundary, establishing physical and semantic continuity. Significantly, the 224×224 input dimension inherently aligns the stitching seam with the non-overlapping 14×14 patch grids of the ViT encoders. This architectural alignment prevents intra-patch corruption, enabling the self-attention mechanism to cleanly resolve cross-view spatial relationships.

Table 1: Fuzzy directional hint mapping from onboard IMU-derived relative angle.
Angle Range |θ|          Fuzzy Hint (θ > 0 / θ < 0)
0° ≤ |θ| ≤ 15°           “straight ahead”
15° < |θ| ≤ 60°          “forward-right” / “forward-left”
60° < |θ| ≤ 120°         “to your right” / “to your left”
120° < |θ| ≤ 180°        “to your right rear” / “to your left rear”

3.3 Fuzzy Directional Prompting

A major limitation of existing UAV-VLN methods (e.g., TravelUAV [34]) is their reliance on dense, step-by-step oracle guidance derived from pre-recorded optimal trajectories. This compromises autonomy by degrading the agent into a passive instruction follower. AerialVLA removes this dependency by constructing instruction prompts using only a fuzzy directional hint derived from onboard sensors (IMU/GPS). Specifically, we define a mapping function M that discretizes the relative bearing θ of the target into coarse-grained semantic buckets, as detailed in Table 1. By deliberately stripping away these dense oracles, we subject our agent to a fundamentally more challenging and rigorous learning objective. Rather than providing exact step-by-step trajectory alignments, our input offers only coarse directional priors, introducing significant spatial ambiguity relative to the ground-truth path. Importantly, this coarse-grained formulation not only aligns with real-world imperfect localization but also introduces necessary tolerance during training. By providing directional guidance without precise angular resolution, it forces the model to rely primarily on active visual grounding, thereby enhancing policy robustness against sensor noise and environmental ambiguity. As illustrated in Figure 3, the prompt is structured as a natural sequence commencing with the visual placeholder, followed by the directional hint and the detailed target description. This format integrates multimodal context into a cohesive narrative for the LLM. Distinctively, AerialVLA executes a fully reactive policy based strictly on the current observation and immediate hint.
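The Table 1 mapping and the prompt layout of Figure 3 can be sketched as follows; the sign convention (positive θ means the target is to the right) and the helper names are our assumptions, consistent with the table but not spelled out in the text:

```python
def fuzzy_hint(theta_deg: float) -> str:
    """Map the IMU-derived relative target bearing (degrees; positive =
    target to the right, our assumed convention) to a Table 1 bucket."""
    a = abs(theta_deg)
    side = "right" if theta_deg > 0 else "left"
    if a <= 15:
        return "straight ahead"
    if a <= 60:
        return f"forward-{side}"
    if a <= 120:
        return f"to your {side}"
    return f"to your {side} rear"

def build_prompt(theta_deg: float, target_desc: str) -> str:
    # Fig. 3 layout: <image> placeholder, directional hint, target description.
    return f"<image> Fly {fuzzy_hint(theta_deg)} and find the target. {target_desc}"
```

Note that a bearing of exactly 0° falls in the “straight ahead” bucket regardless of sign, matching the symmetric first row of Table 1.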
By intentionally bypassing complex spatiotemporal memory buffers, this minimalist design drastically reduces inference latency and prevents the accumulation of cascading errors from stale states. Ultimately, prioritizing this instantaneous agility establishes a highly robust foundation for reactive visual-motor control in dynamic environments.

AerialVLA Prompt Formulation: <image> Fly straight ahead and find the target. The blue and white car is situated on a rocky terrain, surrounded by expansive, arid landscape with scattered rocks and distant mountains in the background under a clear sky. This rugged area appears barren with no visible vegetation, and the car is aligned among the rocks, partially camouflaged by the terrain. Action: 00 49 49 LAND

Fig. 3: AerialVLA prompt formulation. The structured prompt comprises four components: (i) an <image> token for visual input, (ii) a fuzzy directional hint (red), (iii) a detailed target description, and (iv) the corresponding numerical control actions (blue).

Geometry-Consistent Supervision. Training an autonomous policy on fuzzy prompts requires strictly resolving mathematical ambiguities in expert demonstrations. Modular baselines utilize dense oracle guidance, providing sufficient local context to justify detour maneuvers. Conversely, AerialVLA relies solely on coarse hints. Pairing a lateral target hint with a straight-flight label in an obstacle-free environment introduces causal confusion, which typically stems from delayed human pilot reactions. To resolve this ambiguity while preserving critical collision avoidance skills, we propose a geometry-consistent filtering strategy using lateral depth maps d_lat. Formally, we evaluate frames exhibiting a significant lateral target bearing (|θ| > 60°) coupled with a near-zero expert yaw rate (ω_ψ ≈ 0). For these ambiguous frames, we inspect the corresponding side-view depth: if the lateral space is clear (min(d_lat) > 20 m), we discard the sample.
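This filtering rule, together with the obstacle-retention branch described next, can be sketched as a single predicate; the bearing and clearance thresholds follow the text, while `yaw_eps` (what counts as a "near-zero" yaw rate) is our assumption:

```python
def keep_frame(theta_deg: float, yaw_rate: float, min_lat_depth_m: float,
               bearing_deg: float = 60.0, yaw_eps: float = 0.05,
               clearance_m: float = 20.0) -> bool:
    """Geometry-consistent filtering sketch. Returns False only for the
    ambiguous case: lateral target, straight expert flight, and clear
    lateral space (nothing justified ignoring the target bearing)."""
    ambiguous = abs(theta_deg) > bearing_deg and abs(yaw_rate) < yaw_eps
    if ambiguous and min_lat_depth_m > clearance_m:
        return False  # discard: straight flight past a lateral target with no obstacle
    return True       # keep: unambiguous, or the straight flight was an evasion maneuver
```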
However, if a nearby obstacle is detected (min(d_lat) ≤ 20 m), we retain the straight flight label as a valid, essential evasion maneuver. This geometric curation removes approximately 4% of the total training frames, guaranteeing that AerialVLA learns to navigate structural constraints, such as urban intersections, without being penalized by contradictory supervision.

3.4 High-DoF Control via Numerical Tokenization

Task-Oriented Action Space and Unified Landing. We define a continuous 3-DoF action space A = ⟨∆x, ∆z, ∆ψ⟩ to decouple altitude, forward progression, and heading adjustments. This formulation delegates low-level stabilization (e.g., roll/pitch dynamics) to the flight controller, aligning with high-level motion primitives exposed by commercial UAV APIs (e.g., DJI SDK, PX4) to facilitate real-world deployment. Crucially, unlike modular baselines that rely on external object detectors (e.g., Grounding DINO [17]) to trigger flight termination, AerialVLA learns an intrinsic stopping policy. During training, we explicitly align terminal frames with a zero-displacement label vector ⟨0, 0, 0⟩ and the standard text token LAND. Consequently, landing is executed end-to-end via a robust dual-condition check: either the generation of the LAND token or the prediction of near-zero spatial offsets. This unifies navigation and landing into a single behavior cloning objective, enabling the agent to autonomously trigger flight termination upon visual convergence.

Standard Numerical Tokenization. Previous VLA approaches like RT-2 [43] typically introduce special action tokens (e.g., <act_0> to <act_255>). In data-constrained UAV domains, training these new embeddings from scratch creates a severe cold-start problem, forcing the model to re-learn basic ordinal relationships. Instead, AerialVLA maps continuous action dimensions, discretized into N = 99 bins, directly to existing numerical tokens within the LLM vocabulary. This elegantly leverages the understanding of magnitude and order inherent in the pre-trained model, yielding faster convergence and smoother control trajectories. At each timestep t, the model autoregressively predicts three integer tokens:

⟨ĉ_x, ĉ_z, ĉ_ψ⟩ = LLM(E_vis, E_prompt)    (1)

where ĉ_k ∈ {0, 1, ..., 98}. These categorical indices are deterministically de-quantized into continuous physical commands ⟨∆x̂, ∆ẑ, ∆ψ̂⟩. Specifically, ∆x̂ ∈ [0, 5] and ∆ẑ ∈ [−5, 5] represent the forward and vertical displacements in meters, while ∆ψ̂ ∈ [−π, π] denotes the yaw change in radians. These action boundaries are strictly derived from the statistical distribution of the expert dataset. To ensure robust physical execution, we map these spatial offsets to continuous velocity commands. Rather than relying on default position-control APIs that often induce erratic accelerations, we enforce a constant cruise speed (1.0 m/s) to dynamically calculate the corresponding flight duration. The UAV is then driven via the moveByVelocityAsync interface offered within the AirSim environment. This explicit velocity-duration mapping prevents abrupt kinematic transitions and minimizes camera motion blur, ensuring stable, high-quality visual observations for subsequent autoregressive inference steps.

3.5 Training Objective

We formulate the learning process as a Behavior Cloning (BC) problem. Given a dataset of expert demonstrations D = {(I_t, P, a*_t)}, where I_t represents the visual observation, P the structured language prompt, and a*_t the corresponding ground-truth expert action, we optimize the model to minimize the negative log-likelihood of the expert action tokens. The autoregressive training objective is defined as:

L = −E_{(I_t, P, a*_t)∼D} Σ_{k ∈ {x, z, ψ}} log p(a*_{t,k} | I_t, P, a*_{t,<k})    (2)

where a*_{t,k} denotes the discrete token for the k-th dimension of the expert action.
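A minimal sketch of the de-quantization, the dual-condition landing check, and the constant-cruise velocity mapping described in Sec. 3.4; uniform binning over the stated ranges is our assumption (it is consistent with ⟨00, 49, 49⟩ decoding to ⟨0, 0, 0⟩), the helper names are ours, and the real system hands the resulting command to AirSim's moveByVelocityAsync:

```python
import math

N_BINS = 99  # categorical indices 0..98 reuse existing numerical tokens
RANGES = {"dx": (0.0, 5.0), "dz": (-5.0, 5.0), "dpsi": (-math.pi, math.pi)}

def dequantize(c: int, lo: float, hi: float) -> float:
    """Uniform de-quantization of an index in {0..98} (assumed scheme)."""
    return lo + (c / (N_BINS - 1)) * (hi - lo)

def decode_action(c_x: int, c_z: int, c_psi: int):
    """Token triple -> continuous <dx, dz, dpsi> command."""
    return (dequantize(c_x, *RANGES["dx"]),
            dequantize(c_z, *RANGES["dz"]),
            dequantize(c_psi, *RANGES["dpsi"]))

def should_land(tokens, has_land_token: bool, eps: float = 1e-6) -> bool:
    """Dual-condition landing: a LAND token, or near-zero spatial offsets.
    <00, 49, 49> decodes exactly to <0, 0, 0> under uniform binning."""
    dx, dz, dpsi = decode_action(*tokens)
    return has_land_token or max(abs(dx), abs(dz), abs(dpsi)) < eps

def velocity_command(dx: float, dz: float, cruise: float = 1.0):
    """Constant-cruise-speed mapping: duration = distance / speed, so the
    velocity vector has magnitude `cruise`."""
    dist = math.hypot(dx, dz)
    if dist == 0.0:
        return (0.0, 0.0), 0.0
    duration = dist / cruise
    return (dx / duration, dz / duration), duration
```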
This straightforward frame-level supervision perfectly aligns with our reactive navigation policy.

4 Experiments

4.1 Dataset and Evaluation Metrics

Dataset. We evaluate AerialVLA on the TravelUAV benchmark [34], specifically the UAV-Need-Help task, which contains ∼12k human-piloted trajectories. Departing from the original oracle-assisted setting, we evaluate purely autonomous navigation guided exclusively by fuzzy directional prompts. We strictly adhere to the official data splits, training on 7,922 trajectories and evaluating across three distinct test sets comprising 1,418 trajectories for the Seen split, 629 for Unseen Object, and 958 for Unseen Map.

Evaluation Metrics. Following standard protocols [34], we report Navigation Error (NE) measuring the Euclidean distance to the target, Success Rate (SR) indicating the percentage of episodes ending within 20 meters of the target, Oracle Success Rate (OSR), and Success weighted by Path Length (SPL) to balance success with trajectory efficiency.

4.2 Implementation Details

Model and Training Setup. We instantiate AerialVLA using the OpenVLA-7B backbone [13], training on 420k frames from 7,922 expert trajectories. We apply LoRA [10] (r = 64, α = 128, dropout 0.05) to the language backbone, yielding ∼2.98% trainable parameters. Both visual encoders remain frozen, while the projector is fully fine-tuned. Optimization uses AdamW [19] (weight decay 0.03, gradient clip 1.0) with a cosine scheduler (peak LR 2×10⁻⁴, 5% warmup). Training runs on 4× RTX 4090 (24GB) GPUs with a global batch size of 64 for 5 epochs (∼35 hours) in BF16 precision.

Simulation Configuration. To strictly align with the decoupled action space defined in Sec. 3.4, we configure the AirSim [28] simulator to operate in MaxDegreeOfFreedom mode rather than the standard ForwardOnly mode.
This enables the agent to execute independent yaw rotation (∆ψ) alongside a forward velocity vector, providing the holonomic agility essential for reactive obstacle avoidance.

Computational Efficiency. Evaluated on an RTX 4090, AerialVLA requires 17GB VRAM and 0.38s total latency, outperforming the 20GB and 0.63s of TravelUAV. While our dual-vision VLA model takes 0.35s versus their 0.26s backbone and prediction head, completely eliminating their 0.37s Assist and Grounding DINO modules drastically reduces system latency. With fuzzy directional hints adding merely 0.03s, our framework offers a vastly faster and more compact architecture for real-time navigation.

4.3 Baselines

To evaluate our pure end-to-end paradigm, we compare AerialVLA against three baseline categories. Unlike our single autoregressive stream, these methods inherently rely on specialized decoder heads or external auxiliary modules. All baseline results are directly sourced from their original publications [12,34] using identical data splits.

Heuristic Baselines. Random Action and Fixed Action serve as lower bounds to quantify task difficulty.

Hybrid VLM Approaches. We evaluate CMA [1], a traditional recurrent neural network baseline. Notably, we include TravelUAV [34], which combines a language model feature extractor with hierarchical decoders for trajectory regression, while relying on an external Grounding DINO [17] for target detection.

Recent Generalist Models. NavFoM [39] is a foundation model trained across various embodiments using a specialized trajectory planning head, and LongFly [12] leverages explicit spatiotemporal encoding modules.

Table 2: Comparison on the Test Seen Set. SR, OSR, and SPL are reported in percentage (%). Bold and underline denote the best and second-best model results.
Method             | Full: NE↓ SR↑ OSR↑ SPL↑    | Easy: NE↓ SR↑ OSR↑ SPL↑    | Hard: NE↓ SR↑ OSR↑ SPL↑
Human              | 14.15 94.51 94.51 77.84    | 11.68 95.44 95.44 76.19    | 17.16 93.37 93.37 79.85
Random Action      | 222.20 0.14 0.21 0.07      | 142.07 0.26 0.39 0.13      | 320.12 0.00 0.00 0.00
Fixed Action       | 188.61 2.27 8.16 1.40      | 121.36 3.48 11.48 2.14     | 270.69 0.79 4.09 0.49
CMA [1]            | 135.73 8.37 18.72 7.90     | 84.89 11.48 24.52 10.68    | 197.77 4.57 11.65 4.51
TravelUAV-DA [34]  | 98.66 17.45 48.87 15.76    | 66.40 20.26 51.23 18.10    | 138.04 14.02 45.98 12.90
NavFoM [39]        | 93.05 29.17 49.24 25.03    | 58.98 32.91 53.16 27.87    | 143.83 23.58 43.40 20.80
LongFly [12]       | 60.02 36.39 65.87 31.07    | 38.10 38.52 71.90 31.24    | 85.20 33.94 58.94 30.88
Ours               | 65.88 47.96 57.69 38.54    | 43.76 49.30 61.30 37.14    | 93.16 46.30 53.23 40.26

4.4 Quantitative Results

Tables 2, 3, and 4 summarize our evaluation across the Seen, Unseen Object, and Unseen Map splits, demonstrating the distinct advantages of our fully autonomous paradigm over assistant-reliant baselines.

Performance on Seen Environments. As shown in Table 2, AerialVLA achieves a new state-of-the-art 47.96% SR and 38.54% SPL, surpassing the strongest baseline (LongFly) by substantial margins (+11.57% SR and +7.47% SPL). The performance gap becomes even more pronounced on the Hard split involving long-horizon flights, where our SR advantage widens to +12.36%. While LongFly achieves a higher OSR by incorporating ground-truth directional hints, its precipitous drop from OSR (65.87%) to SR (36.39%) exposes a critical failure in the final termination phase. In contrast, AerialVLA demonstrates superior OSR-to-SR conversion efficiency. By unifying navigation and landing with an intrinsic stopping mechanism, our agent autonomously executes precise terminal maneuvers without relying on external oracle triggers.

Table 3: Comparison on the Test Unseen Object Set. SR, OSR, and SPL are reported in percentage (%). Bold and underline denote the best and second-best results, respectively.
Method              |            Full             |            Easy             |            Hard
                    |  NE↓    SR↑   OSR↑   SPL↑   |  NE↓    SR↑   OSR↑   SPL↑   |  NE↓    SR↑   OSR↑   SPL↑
Random Action       | 260.14   0.16  0.16   0.16  | 174.10   0.48  0.48   0.48  | 302.96   0.00  0.00   0.00
Fixed Action        | 212.84   3.66  9.54   2.16  | 151.66   6.70 13.88   3.72  | 243.29   2.14  7.38   1.38
CMA [1]             | 155.79   9.06 16.06   8.68  | 102.92  14.83 22.49  13.90  | 182.09   6.19 12.86   6.08
TravelUAV [34]      | 118.11  22.42 46.90  20.51  |  86.12  24.40 49.28  22.03  | 134.03  21.43 45.71  19.75
NavFoM [39]         | 108.04  29.83 47.99  27.20  |  70.51  32.54 50.72  29.54  | 133.01  28.03 46.18  25.64
LongFly [12]        |  66.74  43.87 64.56  38.39  |  54.84  38.01 56.84  31.36  |  57.07  50.25 74.16  45.27
Ours                |  61.45  56.60 64.86  46.61  |  45.72  56.94 64.11  43.76  |  69.27  56.43 65.24  48.03

Generalization to Unseen Objects. Table 3 evaluates robustness against novel target categories, where AerialVLA maintains superiority with 56.60% SR overall. While baselines achieve high OSR on the Hard split via oracle guidance, their reliance on explicit object detectors severely limits their ability to recognize and stop at out-of-distribution targets. In contrast, our method directly grounds novel visual concepts to control actions. This confirms that our minimalist framework leverages the open-vocabulary representations inherent in the LLM to identify unseen targets, rather than merely overfitting to training categories.

Adaptability to Unseen Maps. The Unseen Map Test Set (Table 4) provides the strongest evidence of our generalization capability. In entirely novel environments, LongFly suffers a drastic degradation to 11.27% SR due to its heavy reliance on historical context and map-specific priors. In contrast, AerialVLA demonstrates remarkable zero-shot adaptability, achieving 37.58% SR and 28.22% SPL. Both metrics are approximately three times those of the SOTA baseline. This indicates that AerialVLA acquires a fundamental visual servoing capability that seamlessly transfers to novel geometries.
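The metrics reported in these tables follow standard VLN conventions: NE is the final navigation error (distance to the goal), SR is the fraction of episodes that terminate within the success radius, OSR is the fraction that pass within it at any point, and SPL is success weighted by path-length efficiency [1]. As a concrete illustration, here is a minimal SPL computation in Python (a sketch of the standard formula, not the TravelUAV benchmark's evaluation code; the episode tuples are invented for illustration):

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes, in percent.

    Each episode is (success, shortest_path_len, actual_path_len).
    A failed episode contributes zero; a successful one is discounted
    by how much longer its flown path is than the shortest path.
    """
    total = 0.0
    for success, shortest, actual in episodes:
        if success:
            total += shortest / max(shortest, actual)
    return 100.0 * total / len(episodes)

# Two successes out of three; the second took a 2x detour:
print(spl([(True, 100.0, 100.0), (True, 100.0, 200.0), (False, 100.0, 50.0)]))  # → 50.0
```

This also makes the OSR-to-SR gap discussed above easy to interpret: an agent that reaches the goal region but fails to terminate there still scores on OSR, yet contributes nothing to SR or SPL.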
By strictly relying on instantaneous observations rather than accumulated spatial memory, our reactive approach exhibits superior robustness against severe environmental shifts.

Table 4: Comparison on the Test Unseen Map Set. SR, OSR, and SPL are reported in percentage (%). Bold and underline denote the best and second-best results, respectively.

Method              |            Full             |            Easy             |            Hard
                    |  NE↓    SR↑   OSR↑   SPL↑   |  NE↓    SR↑   OSR↑   SPL↑   |  NE↓    SR↑   OSR↑   SPL↑
Random Action       | 202.98   0.00  0.00   0.00  | 158.46   0.00  0.00   0.00  | 265.88   0.00  0.00   0.00
Fixed Action        | 180.47   0.52  2.61   0.39  | 132.89   0.89  4.28   0.67  | 247.72   0.00  0.25   0.00
CMA [1]             | 141.68   2.30 10.02   2.16  | 102.29   3.57 14.26   3.33  | 197.35   0.50  4.03   0.50
TravelUAV [34]      | 138.80   4.18 20.77   3.84  | 102.94   4.63 22.82   4.24  | 189.46   3.53 17.88   3.28
NavFoM [39]         | 125.10   6.30 18.95   5.68  | 102.41   6.77 20.07   6.04  | 170.58   5.36 15.71   4.97
LongFly [12]        | 108.32  11.27 30.27   9.32  |  78.56  12.96 34.31  10.32  | 148.10   9.02 24.88   7.98
Ours                |  67.42  37.58 52.92  28.22  |  44.99  41.89 58.47  29.72  |  99.11  31.49 45.09  26.11

4.5 Qualitative Analysis

Figure 4 visualizes representative trajectories highlighting two distinct navigation behaviors, demonstrating the robust decision-making capabilities of the agent.

Precision Maneuvering in Clutter. In unstructured settings like extensive forests (Figure 4, Top), AerialVLA leverages its decoupled 3-DoF action space for fine-grained control. The agent horizontally aligns with the target while maintaining altitude above vegetation, followed by a precise vertical descent to land safely.

[Figure 4 shows two cases as image sequences over eight key timesteps each. Case 1 instruction: "The brown elephant is situated in a landscape with lush green vegetation, sparse palm trees, large rocks scattered around, and cliff formations nearby, all under a bright sky with the sun visible." Case 2 instruction: "The brown dog is located on a grass-covered area surrounded by a variety of pine trees, with the sky visible above and sunlight casting shadows on the ground."]

Fig. 4: Qualitative visualization of our proposed AerialVLA.
We display the vertical mosaic inputs (Front/Down) at key timesteps. The agent demonstrates precision maneuvering in clutter (Top) and active error correction against distractors (Bottom), validating the robustness of the end-to-end policy.

Active Error Correction against Distractors. Under visual ambiguity (Figure 4, Bottom), the agent approaches a distractor object, briefly hovers for inspection, and, upon recognizing the mismatch, autonomously climbs to resume the search. This active perception loop demonstrates a capacity for self-correction that goes beyond simple trajectory regression.

4.6 Ablation Study

We conduct ablation experiments to validate the individual contributions of our core architectural designs, with quantitative results summarized in Table 5.

Table 5: Ablation on data curation, perception views, and action tokenization. We compare raw noisy data, 5-view input, and custom action token variants against our default formulation. Metrics are in percentage (%).

Training Data / Method   |   Seen        | Unseen Object | Unseen Map
                         | SR↑    SPL↑   | SR↑    SPL↑   | SR↑    SPL↑
Baseline LongFly [12]    | 36.39  31.07  | 43.87  38.39  | 11.27   9.32
Ours - custom tokens     | 39.84  31.73  | 38.79  32.77  | 26.51  19.98
Ours - 5-view input      | 41.54  32.69  | 51.51  42.33  | 21.71  13.46
Ours w/o filtering       | 40.90  32.59  | 51.99  42.67  | 32.15  22.67
Ours                     | 47.96  38.54  | 56.60  46.61  | 37.58  28.22

Robustness to Raw Demonstrations. To verify that our performance gains stem from the architecture rather than dataset curation, we trained a variant on raw, uncleaned demonstrations. While noisy label conflicts predictably cause a slight degradation (e.g., a 5.43% SR drop on unseen maps), this raw variant still achieves 32.15% SR, preserving a nearly threefold advantage over LongFly. This confirms the inherent robustness of the proposed architecture, showing that geometry-consistent filtering primarily serves to resolve mathematical ambiguities rather than artificially boosting baseline performance.

Efficacy of Minimalist Perception. A five-view variant incorporating redundant side and rear cameras severely degrades performance, dropping the unseen-map SR from 37.58% to 21.71%. This substantial decline corroborates our hypothesis that excessive visual inputs distract the agent and induce overfitting to background clutter, empirically validating our streamlined dual-view design.

Advantage of Numerical Tokenization. A variant utilizing custom action tokens instead of standard numerical digits experiences significant SR and SPL drops across all splits. This performance collapse validates our analysis in Sec. 3.4: training novel embeddings from scratch introduces a severe cold-start problem, whereas adopting a pure numerical format leverages the pre-trained magnitude awareness of the LLM to ensure reliable, fine-grained control.

5 Limitations and Future Work

While highly effective, the minimalist design of AerialVLA presents specific avenues for future refinement. First, by prioritizing instantaneous reactive control over explicit historical memory, the agent occasionally faces challenges with global backtracking in highly repetitive urban structures (e.g., the NewYorkCity map). Second, bounded by the nature of behavior cloning, the policy acts conservatively in extreme out-of-distribution scenarios. For instance, when targets are severely occluded by dense canopies in the unseen ModularPark map, the agent defaults to safe hovering rather than executing complex multi-stage exploration. To address these trade-offs, future work will integrate lightweight memory mechanisms for global reasoning and explore reinforcement-learning fine-tuning to enable active exploration beyond static expert demonstrations.

6 Conclusion

This paper presents AerialVLA to fundamentally rethink the prevailing modular methodologies in UAV vision-language navigation.
Our work introduces a novel perspective to the field by establishing a minimalist end-to-end Vision-Language-Action paradigm. By navigating solely via onboard fuzzy hints and unifying cruising with precise termination into a single autonomous policy, we completely decouple the agent from dense oracle guidance and explicit object detectors. Extensive evaluations prove that discarding redundant spatial memory and complex auxiliary modules inherently fosters exceptional zero-shot generalization across unseen targets and novel map geometries. We anticipate that this pure end-to-end philosophy will serve as a robust foundation for future natively intelligent aerial agents operating in unconstrained open-world environments.

References

1. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: CVPR (2018)
2. Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE T-RO 32(6), 1309–1332 (2017)
3. Cai, H., Dong, J., Tan, J., Deng, J., Li, S., Gao, Z., Wang, H., Su, Z., Sumalee, A., Zhong, R.: FlightGPT: Towards generalizable and interpretable UAV vision-and-language navigation with vision-language models. In: EMNLP. p. 6659–6676 (2025)
4. Chen, G., Yu, X., Ling, N., Zhong, L.: TypeFly: Low-latency drone planning with large language models. IEEE TMC 24(09), 9068–9079 (2025)
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
6.
Fan, Y., Chen, W., Jiang, T., Zhou, C., Zhang, Y., Wang, X.: Aerial vision-and-dialog navigation. In: ACL Findings. p. 3043–3061 (2023)
7. Ferrag, M.A., Lakas, A., Debbah, M.: UAVBench: An open benchmark dataset for autonomous and agentic AI UAV systems via LLM-generated flight scenarios. arXiv preprint arXiv:2511.11252 (2025)
8. Gao, Y., Li, C., You, Z., Liu, J., Zhen, L., Chen, P., Chen, Q., Tang, Z., Wang, L., Yang, P., Tang, Y., Tang, Y., Liang, S., Zhu, S., Xiong, Z., Su, Y., Ye, X., Li, J., Ding, Y., Wang, D., Wang, Z., Zhao, B., Li, X.: OpenFly: A comprehensive platform for aerial vision-language navigation. In: ICLR (2026)
9. Hu, C.Y., Lin, Y.S., Lee, Y., Su, C.H., Lee, J.Y., Tsai, S.R., Lin, C.Y., Chen, K.W., Ke, T.W., Liu, Y.L.: See, point, fly: A learning-free VLM framework for universal unmanned aerial navigation. In: CoRL. p. 4697–4708 (2025)
10. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)
11. Javaid, S., Fahim, H., He, B., Saeed, N.: Large language models for UAVs: Current state and pathways to the future. IEEE OJVT 5, 1166–1192 (2024)
12. Jiang, W., Wang, L., Huang, K., Fan, W., Liu, J., Liu, S., Duan, H., Xu, B., Ji, X.: LongFly: Long-horizon UAV vision-and-language navigation with spatiotemporal context integration. arXiv preprint arXiv:2512.22010 (2025)
13. Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: OpenVLA: An open-source vision-language-action model. In: CoRL. p. 2679–2713 (2025)
14. Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In: EMNLP. p. 4392–4412 (2020)
15.
Li, T., Huai, T., Li, Z., Gao, Y., Li, H., Zheng, X.: SkyVLN: Vision-and-language navigation and NMPC control for UAVs in urban environments. In: IROS (2025)
16. Lin, P., Sun, G., Liu, C., Li, F., Ren, W., Cong, Y.: OpenVLN: Open-world aerial vision-language navigation. arXiv preprint arXiv:2511.06182 (2025)
17. Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In: ECCV. p. 38–55 (2024)
18. Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., Wu, Q.: AerialVLN: Vision-and-language navigation for UAVs. In: ICCV. p. 15384–15394 (2023)
19. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
20. Lykov, A., Serpiva, V., Khan, M.H., Sautenkov, O., Myshlyaev, A., Tadevosyan, G., Yaqoot, Y., Tsetserukou, D.: CognitiveDrone: A VLA model and evaluation benchmark for real-time cognitive task solving and reasoning in UAVs. arXiv preprint arXiv:2503.01378 (2025)
21. OpenAI, Achiam, J., Adler, S., Agarwal, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
22. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision. TMLR (2024)
23. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
24. Sapkota, R., Roumeliotis, K.I., Karkee, M.: UAVs meet agentic AI: A multidomain survey of autonomous aerial intelligence and agentic UAVs.
arXiv preprint arXiv:2506.08045 (2025)
25. Sautenkov, O., Yaqoot, Y., Lykov, A., Mustafa, M.A., Tadevosyan, G., Akhmetkazy, A., Cabrera, M.A., Martynov, M., Karaf, S., Tsetserukou, D.: UAV-VLA: Vision-language-action system for large scale aerial mission generation. In: HRI. p. 1588–1592 (2025)
26. Serpiva, V., Lykov, A., Myshlyaev, A., Khan, M.H., Abdulkarim, A.A., Sautenkov, O., Tsetserukou, D.: RaceVLA: VLA-based racing drone navigation with human-like behaviour. arXiv preprint arXiv:2503.02572 (2025)
27. Shah, D., Sridhar, A., Dashora, N., Stachowicz, K., Black, K., Hirose, N., Levine, S.: ViNT: A foundation model for visual navigation. In: CoRL (2023)
28. Shah, S., Dey, D., Lovett, C., Kapoor, A.: AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In: FSR. p. 621–635 (2018)
29. Team, G., Anil, R., Borgeaud, S., Wu, Y., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
30. Tian, X., Gu, J., Li, B., Liu, Y., Wang, Y., Zhao, Z., Zhan, K., Jia, P., Lang, X., Zhao, H.: DriveVLM: The convergence of autonomous driving and large vision-language models. In: CoRL (2024)
31. Tian, Y., Lin, F., Li, Y., Zhang, T., Zhang, Q., Fu, X., Huang, J., Dai, X., Wang, Y., Tian, C., Li, B., Lv, Y., Kovács, L., Wang, F.Y.: UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility. Inf. Fusion 122, 103158 (2025)
32.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.S., Lachaux, M.A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.M., Subramanian, R., Tan, X.E., Tang, B., Taylor, R., Williams, A., Kuan, J.X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., Scialom, T.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
33. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
34. Wang, X., Yang, D., Wang, Z., Kwan, H., Chen, J., Wu, W., Li, H., Liao, Y., Liu, S.: Towards realistic UAV vision-language navigation: Platform, benchmark, and methodology. In: ICLR (2025)
35. Wu, R., Zhang, Y., Chen, J., Huang, L., Zhang, S., Zhou, X., Wang, L., Liu, S.: AeroDuo: Aerial duo for UAV-based vision and language navigation. In: ACM MM (2025)
36. Xiao, J., Sun, Y., Shao, Y., Gan, B., Liu, R., Wu, Y., Guan, W., Deng, X.: UAV-ON: A benchmark for open-world object goal navigation with aerial agents. In: ACM MM. p. 13023–13029 (2025)
37. Yao, F., Liu, Y., Zhang, W., Zhu, Z., Li, C., Liu, N., Hu, P., Yue, Y., Wei, K., He, X., Zhao, X., Wei, Z., Xu, H., Wang, Z., Shao, G., Yang, L., Zhao, D., Yang, Y.: AeroVerse-Review: Comprehensive survey on aerial embodied vision-and-language navigation. Innov. Inform. 1(1), 100015 (2025)
38.
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV. p. 11975–11986 (2023)
39. Zhang, J., Li, A., Qi, Y., Li, M., Liu, J., Wang, S., Liu, H., Zhou, G., Wu, Y., Li, X., Fan, Y., Li, W., Chen, Z., Gao, F., Wu, Q., Zhang, Z., Wang, H.: Embodied navigation foundation model. In: ICLR (2026)
40. Zhang, L., Zhang, Y., Li, H., Fu, H., Tang, Y., Ye, H., Chen, L., Liang, X., Hao, X., Ding, W.: Is your VLM sky-ready? A comprehensive spatial intelligence benchmark for UAV navigation. arXiv preprint arXiv:2511.13269 (2025)
41. Zhang, W., Gao, C., Yu, S., Peng, R., Zhao, B., Zhang, Q., Cui, J., Chen, X., Li, Y.: CityNavAgent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory. In: ACL. p. 31292–31309 (2025)
42. Zhang, W., Tang, P., Zeng, X., Man, F., Yu, S., Dai, Z., Zhao, B., Chen, H., Shang, Y., Wu, W., Gao, C., Chen, X., Wang, X., Li, Y., Zhu, W.: Aerial world model for long-horizon visual generation and navigation in 3D space. arXiv preprint arXiv:2512.21887 (2025)
43. Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.W.E., Leal, I., Kuang, Y., Kalashnikov, D., Julian, R., Joshi, N.J., Irpan, A., Ichter, B., Hsu, J., Herzog, A., Hausman, K., Gopalakrishnan, K., Fu, C., Florence, P., Finn, C., Dubey, K.A., Driess, D., Ding, T., Choromanski, K.M., Chen, X., Chebotar, Y., Carbajal, J., Brown, N., Brohan, A., Arenas, M.G., Han, K.: RT-2: Vision-language-action models transfer web knowledge to robotic control. In: CoRL (2023)