
Paper deep dive

Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato

Year: 2026 · Venue: arXiv preprint · Area: cs.IT · Type: Preprint · Embeddings: 83

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/26/2026, 2:30:46 AM

Summary

The paper proposes a delay-tolerant multi-agent deep reinforcement learning (MADRL) framework, enhanced with a spatio-temporal attention module, to optimize trajectory planning, network formation, and transmission control in UAV-assisted wireless networks with limited communication and intermittent information exchange.

Entities (5)

Base Station · infrastructure · 100%
Ground User · device · 100%
UAV · agent · 100%
STA-MADRL · algorithm · 98%
MADRL · algorithm · 95%

Relation Signals (3)

UAV relays data to Base Station

confidence 95% · UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications.

UAV serves Ground User

confidence 95% · UAVs to assist GUs' data transmissions

STA-MADRL optimizes UAV

confidence 90% · STA-MADRL framework to optimize the UAVs' trajectories, network formation, and transmission control strategies

Cypher Suggestions (2)

Identify algorithms used to optimize UAV behavior. · confidence 95% · unvalidated

MATCH (a:Algorithm)-[:OPTIMIZES]->(u:Agent {type: 'UAV'}) RETURN a.name, u.name

Find all UAVs and the Base Station they relay data to. · confidence 90% · unvalidated

MATCH (u:Agent {type: 'UAV'})-[:RELAYS_DATA_TO]->(b:Infrastructure {name: 'Base Station'}) RETURN u, b

Abstract

In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications. The UAVs' intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs' trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV's awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs' information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs' information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.

Tags

ai-safety (imported, 100%) · csit (suggested, 92%) · preprint (suggested, 88%)


Full Text

82,318 characters extracted from source content.


Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang, Dusit Niyato

Che Chen, Lanhua Li, and Shimin Gong are with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518000, China, and Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), Zhuhai 519082, China (e-mail: chench576@mail2.sysu.edu.cn, lilh65, gongshm5@mail.sysu.edu.cn). Yu Zhao is with the Department of Equipment Management and Unmanned Aerial Vehicle Engineering, Air Force Engineering University, Xi’an (e-mail: zhaoyuair@163.com). Yuming Fang is with the School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, Nanchang 330032, China (e-mail: fa0001ng@e.ntu.edu.sg). Dusit Niyato is with the College of Computing and Data Science, Nanyang Technological University, Singapore (e-mail: dniyato@ntu.edu.sg).

Abstract

In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs’ relay communications. The UAVs’ intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs’ trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV’s awareness of the network state.
These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs’ information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs’ information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.

I Introduction

Recently, the applications of unmanned aerial vehicles (UAVs) have attracted extensive attention in various wireless networks, such as UAV-assisted sensing networks, mobile edge computing, and wireless-powered networks [28, 8, 2]. UAVs offer high mobility, adaptability, and controllability, which improve the efficiency of data transmission in these networks. Specifically, UAVs can dynamically relocate to enhance channel conditions for data transmissions from ground users (GUs) to the base station (BS), improving both network coverage and capacity. The UAV can also serve as a mobile energy supplier for low-power GUs in large-scale wireless-powered networks to prolong the network lifespan [14]. However, each UAV faces limitations in energy supply, computation capacity, and coverage area. These limitations can be practically alleviated by employing multiple UAVs to serve the GUs in large-scale wireless networks [10, 24], by designing efficient information sharing and collaborative control mechanisms for UAVs’ trajectory planning, multi-hop networking, and transmission control strategies.

I-A Challenges and Motivations

The UAVs’ mobility control can be achieved by joint trajectory planning, allowing UAVs to simultaneously serve different GUs to improve the network coverage and transmission efficiency [18].
The UAVs need to adjust their paths to avoid inter-UAV collisions and adapt to environmental uncertainties like obstacles and channel dynamics. The UAVs’ trajectory planning is also related to the GUs’ data transmission requirements, balancing the trade-off among coverage, energy efficiency, and communication performance. The UAVs’ collaboration in data transmissions can be facilitated through the UAV-to-UAV (U2U) connections and transmission control. For example, the UAVs far away from the BS often suffer from poor channel conditions by using only the direct UAV-to-BS (U2B) channels, reducing the transmission efficiency and network capacity. Thus, it can be more efficient to exploit U2U links to relay data transmissions from the distant UAVs to the BS, forming a multi-hop relay network to support energy-efficient data sensing over a large-scale service area [7]. Additionally, the UAVs’ simultaneous data transmissions can lead to severe signal interference. The transmission control is further complicated by the UAVs’ mobility and channel dynamics. Potentially, a higher network throughput can be achieved by dynamically optimizing the UAVs’ collaborative transmission control and U2U relaying strategies along with the UAVs’ trajectory planning strategy. The spatio-temporal coupling among the UAVs’ transmission control, U2U networking, and trajectory planning poses significant challenges for network performance maximization. It firstly relies on the availability of global network information and secondly calls for an efficient control algorithm, especially in large-scale UAV-assisted wireless networks. The absence of information exchange may lead to overlapped service areas among the UAVs, decreasing the network capacity and resource utilization [7]. However, real-time information exchange among all UAVs is normally impractical or unaffordable. 
For example, the U2U and U2B connections can become unavailable due to severely attenuated channel conditions as two UAVs move far away along their trajectories. Such intermittent U2U and U2B connections thus introduce random information delay or loss [5]. Additionally, the UAVs are required to simultaneously optimize the trajectories, transmission control, and resource allocation strategies, which is computationally demanding especially with dynamic channel conditions and information uncertainties. These motivate us to design robust and adaptive control algorithms, capable of handling incomplete network information while ensuring reliable and efficient network performance.

I-B New Designs and Contributions

In this paper, we first improve the UAVs’ information sharing to enable more efficient collaboration. Typically, the high dimensional multiple UAVs’ joint control can be formulated as a Markov decision process (MDP). It can be solved by a multi-agent deep reinforcement learning (MADRL) framework, which relies on full information sharing among all UAV agents. In particular, for each UAV, the decision-making agent should be aware of the complete and real-time network state and the other UAVs’ actions during the centralized training phase [1]. This implies that all UAVs should be fully connected to enable real-time information exchange. On the other hand, the UAVs’ real-time information exchange consumes significant channel resources, sacrificing the network capacity and resource efficiency. Therefore, it becomes a critical design problem to balance the UAVs’ information exchange for efficient learning and data communications. We address this trade-off by designing a delay-tolerant MADRL framework for the UAVs’ decision making in communication-limited scenarios.
Although efficient information exchange via U2U connections is not always feasible, each UAV can update and cache its status information, i.e., service location, traffic demands, and channel conditions, at the BS when it reports the sensing data to the BS via the U2B connection. Meanwhile, each UAV can retrieve the other UAVs’ status information from the BS’s ACK packets without direct U2U connections [5]. To reduce the information delay at the BS, we integrate a delay-penalized reward design into the MADRL framework, which encourages all UAVs to optimize their trajectories and maintain frequent information exchange with the BS, fostering awareness of the network state and more efficient UAVs’ collaboration. As the UAV-assisted wireless network scales up, each UAV increasingly relies on multi-hop U2U links to forward its data to the BS, which leads to larger information delays for decision-making. Such delayed information will misguide UAVs’ trajectory planning and transmission control. As such, we aim to mitigate the impact of delayed information resulting from limited communication by leveraging historical information cached at the BS. Specifically, we integrate a spatio-temporal attention module into the MADRL framework so that each UAV can predict other UAVs’ delayed information, and then adapt its trajectory and transmission control strategies accordingly. The prediction module exploits both the temporal correlations in the UAV’s historical information and the spatial dependencies among neighboring UAVs. By estimating the delayed information, each UAV can better understand the complete network state and support more efficient collaboration without real-time inter-UAV information exchange. We envision that the spatio-temporal attention enhanced MADRL (STA-MADRL) together with the delay-penalized reward design can provide a practical multi-agent decision-making framework for decentralized wireless networks. 
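As a rough illustration of the temporal half of this idea, the following minimal sketch applies scaled dot-product attention over a UAV's cached history of status vectors to estimate a delayed one. This is not the paper's actual module, which additionally attends across neighboring UAVs and uses learned projections; `attention_predict` and its inputs are purely illustrative.

```python
import numpy as np

def attention_predict(history, query):
    """Estimate a delayed status vector from cached history.

    history: (T, d) matrix of past status vectors cached at the BS.
    query:   (d,) most recent context vector for this UAV.
    Returns an attention-weighted combination of the history rows,
    i.e. a minimal stand-in for temporal attention-based prediction.
    """
    d = query.shape[0]
    scores = history @ query / np.sqrt(d)      # scaled dot-product scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ history                   # weighted recombination
```

With identical history rows the prediction simply reproduces that row; with one strongly matching row, the estimate is pulled toward it.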
Different from the existing studies, we focus on communication-limited scenarios and design an incentive mechanism to promote timely information updates. Moreover, leveraging historical interactions for prediction enables effective collaboration even when the channels for information sharing are sporadic and error-prone. Specifically, our main contributions are summarized as follows:

• UAVs’ joint control in communication-limited scenarios: Multiple UAVs are used to collect data from the GUs and forward the data to the remote BS. The UAVs’ trajectory planning is firstly required to enhance network coverage. The UAVs further adapt the multi-hop network formation along the UAVs’ trajectories. Besides, the UAVs optimize transmission control to better serve the GUs simultaneously. Different from the current literature, we focus on network throughput maximization in a communication-limited scenario, in which the UAVs’ information sharing is unreliable and intermittent.

• Delay-penalized reward improving UAVs’ information sharing: Information sharing is required to guide the UAVs’ efficient collaboration, while it also poses extra constraints on the UAVs’ trajectory planning and limits the transmission capability in the communication-limited scenario. As such, we devise a delay-penalized reward in the MADRL framework that guides the UAVs’ trajectory planning to maintain frequent information exchange without a significant loss in transmission capability.

• Spatio-temporal attention predicting UAVs’ information loss: The information delay or loss becomes inevitable when the network scales up. Instead of relying on frequent information exchange, we further propose the STA-MADRL algorithm to exploit the UAVs’ historical observations by using a spatio-temporal attention based prediction approach, which enhances the UAVs’ awareness of the complete network state and facilitates more efficient collaboration in the joint control.
The simulation results verify that STA-MADRL achieves over 50% reduction in information delay and 75% throughput gain compared to the communication-limited MADRL. Some preliminary results of this work have been presented in a conference paper [5], which verifies the effectiveness of the delay-penalized reward to guide the UAVs’ trajectory planning. In this paper, we focus on compensating the UAVs’ excessive information delay as the UAV-assisted wireless network scales up, and integrate a spatio-temporal attention module into the MADRL framework to optimize the UAVs’ trajectories, network formation, and transmission control strategies efficiently. The remainder of this paper is organized as follows. The literature review is provided in Section II. We detail the system model in Section III and propose the delay-tolerant MADRL for throughput maximization in Section IV. Then, considering excessive information delay, we further propose the STA-MADRL framework in Section V. Finally, we present extensive results in Section VI and draw the conclusions in Section VII.

II Related Works

II-A Joint Trajectory Planning and Transmission Control

Efficient multi-UAV deployment enables concurrent service for multiple GUs, improving network coverage, capacity, and transmission efficiency. The UAVs’ fast and dynamic deployment also comes at the price of demanding agile network adaptability, i.e., adaptive transmission control and networking strategies along the UAVs’ trajectories. The authors in [22] explored the UAVs’ deployment and GUs’ scheduling strategies to maximize network throughput while maintaining fairness. The authors in [12] maximized the GUs’ minimum uplink throughput by jointly optimizing the UAVs’ trajectories and resource allocation strategies under energy neutrality and mobility constraints in wireless-powered networks.
The authors in [21] jointly optimized the UAV’s trajectory planning, task offloading, and computing resource assignment to minimize the system energy consumption of a UAV-based air-ground integrated computing network under limited battery capacity. UAVs’ multi-hop relay network has also been considered in [3] to enhance data transmission, task offloading, and on-the-fly computation. The authors in [7] exploited the UAVs’ energy-efficient network formation strategy to dynamically adapt multi-hop relay connections along with their trajectories. The joint optimization of the UAVs’ trajectories and transmission control has also been proved indispensable for enhancing network performance in terms of fairness, energy efficiency, transmission delay, and energy consumption, e.g., [31, 3, 23].

II-B Optimization vs. Learning for UAV-assisted Networks

Network performance maximization in UAV-assisted wireless networks is typically solved either by optimization or machine learning methods. The authors in [25] employed the conventional block coordinate descent (BCD) method to derive a closed-form solution for transmit power that maximizes the minimum throughput in a UAV-assisted hybrid NOMA system. The authors in [29] proposed a generalized Dinkelbach and successive convex approximation algorithm to jointly optimize the transmit beamforming, computation offloading, and the UAV’s trajectory in a joint sensing, communication, and computation framework. However, conventional optimization methods generally rely on real-time and global network information to perform heuristic decomposition, problem-specific simplification, and iterative approximation, leading to high computational complexity and even non-guaranteed performance.
DRL can adapt to the changing network environment and learn optimal policies through continuous interactions with the network environment, making it particularly effective for trajectory planning, network formation, and transmission control in UAV-assisted wireless networks. The authors in [27] proposed the dueling double deep Q-network (D3QN) algorithm for trajectory planning to maximize the collected data from multiple GUs under realistic constraints. The authors in [16] proposed the deep deterministic policy gradient (DDPG) algorithm for trajectory planning of both UAVs and unmanned ground vehicles (UGVs), which can supply energy to the UAVs. The authors in [19] employed a multi-agent DDPG (MADDPG) algorithm to jointly optimize the UAVs’ trajectories and the expected age of information (AoI). Considering the UAVs’ limited computation and energy capacities, the authors in [32] proposed the multi-agent twin-delayed DDPG (MATD3) algorithm to minimize the sum of execution delays and energy consumption in a task offloading system between multiple UAVs and edge nodes. A multi-agent advantage actor-critic (MAA2C) algorithm was used in [17] to optimize the UAVs’ trajectories and enhance GUs’ spectral efficiency for downlink transmissions. Though MADRL has been proven particularly effective in handling complex and dynamic systems, it still faces significant challenges for practical and flexible deployment in UAV-assisted wireless networks. The requirement of complete network information for centralized training becomes difficult to satisfy when multiple UAVs have limited information about each other, leading to instability and slow convergence in the training phase. The intermittent channels among UAVs also make real-time information sharing costly. These challenges call for more efficient MADRL approaches for large-scale UAV-assisted wireless networks with limited communications.
II-C UAVs’ Collaboration with Limited Communications

With limited information sharing, the authors in [11] proposed the attentional communication model (ACM) to selectively share critical information among agents, thus reducing the communication overhead. The authors in [13] proposed a graph neural network (GNN) based framework to compress and share only essential graph-structured information among UAVs. Federated learning can also be employed to enable collaborative model training without raw data sharing [33]. The authors in [9] formulated a multi-agent partially observable MDP to minimize the total energy consumption of a UAV-assisted MEC network by optimizing the UAVs’ mobility, user association, resource allocation, and task offloading decisions. A neural network based interaction mechanism is integrated into the MADRL framework to help UAVs autonomously generate task-oriented messages for energy minimization. Our previous work in [7] proposed a data-driven Bayesian optimization approach to guide MADDPG by estimating each UAV’s best action in every decision epoch. Such a data-driven action estimation only focuses on the reward in the next step, and thus it is short-sighted and may contradict the MADDPG’s action. Instead of relying on information sharing, recent research has also explored prediction-based approaches to estimate the lost or delayed information for UAVs’ collaboration. The authors in [4] proposed a deep recurrent Q-network (DRQN) framework to optimize the UAV’s trajectory by predicting the partially observable system state, enabling spectral-efficient resource allocation and scheduling. The authors in [20] proposed a GNN framework for the UAVs to predict the other UAVs’ trajectories using historical information, suppressing the need for real-time information sharing.
The authors in [6] proposed a graph-attention multi-agent trust region reinforcement learning framework for trajectory planning and resource assignment in a multi-UAV-assisted wireless network. The graph recurrent network is used to process the network topology and extract useful information and patterns from historical observations. Different from existing works, in this paper we exploit the spatio-temporal dependencies from both the UAVs’ trajectories and networking strategies to enhance UAVs’ collaboration.

III System Model

Considering a large set of GUs randomly distributed over the service area, we employ multiple UAVs to assist GUs’ data transmissions to the remote BS by optimizing the UAVs’ trajectory planning, network formation, and transmission control strategies. As illustrated in Fig. 1, the multi-UAV-assisted wireless network consists of one remote BS, $N$ UAVs, and $M$ GUs, similar to those in [7] and [5]. The sets of UAVs and GUs are represented as $\mathcal{N}=\{1,2,\ldots,N\}$ and $\mathcal{M}=\{1,2,\ldots,M\}$, respectively. We assume that the GUs cannot be served directly by the BS due to the long distance or obstacles. The UAVs are equipped with $F$ antennas while the GUs are single-antenna devices. The UAVs’ signal beamforming can be used to enhance the wireless data transmissions and the wireless energy transfer to sustain the low-power GUs. The whole system is operated in a time-slotted framework. Each time frame is divided into multiple time slots with a unit length. The set of time slots is denoted by $\mathcal{T}\triangleq\{1,2,\ldots,T\}$. The UAVs’ operations in each time slot can be further divided into three phases, i.e., wireless energy transfer $t_e$, uplink data collection $t_s$, and data forward transmission $t_r$. Each UAV firstly moves to a fixed location and beamforms energy to the GUs in the first phase $t_e$, and then collects the sensing data from the GUs in the second phase $t_s$.
After that, the UAVs forward the sensing data directly to the BS or relay it to the other UAVs via the U2U links in the third phase $t_r$.

III-A UAV-assisted Wireless Channel and Energy Transfer

Let $\ell_0=(x_0,y_0,h_0)$ denote the BS’s location with the fixed height $h_0$, which is viewed as the UAV-0. The set of all UAVs including the BS is defined as $\tilde{\mathcal{N}}\triangleq\mathcal{N}\cup\{0\}$. The UAV-$n$’s trajectory is defined as a collection of location points in different time slots, i.e., $\mathcal{L}_n=\{\ell_n(1),\ldots,\ell_n(t),\ldots,\ell_n(T)\}$, where $\ell_n(t)\triangleq(x_n(t),y_n(t),H)$ denotes the location with a fixed altitude $H$ in the $t$-th time slot. To ensure safety, all UAVs need to satisfy the distance and speed constraints:

$\|\ell_n(t+1)-\ell_n(t)\|\le v_{\max}\, t_e,\ \forall n\in\mathcal{N}$, (1a)

$\|\ell_n(t)-\ell_{n'}(t)\|\ge d_{\min},\ \forall n,n'\in\mathcal{N},\ n\ne n'$, (1b)

where $v_{\max}$ and $d_{\min}$ denote the maximum flying speed and the minimum safety distance between two UAVs, respectively. The GU-$m$’s location is indicated by $q_m=(x_m,y_m,0)$. Let $h_{n,m}(t)\in\mathbb{C}^{F\times 1}$ denote the channel between the UAV-$n$ and the GU-$m$ at the $t$-th slot. The UAV-to-GU (U2G) channel can be modeled as follows:

$h_{n,m}(t)=\dfrac{\omega_0^{1/2}}{\|\ell_n(t)-q_m\|}\,\tilde{h}_{n,m}(t)$, (2)

where $\omega_0$ is the channel gain at unit distance and the small-scale fading channel $\tilde{h}_{n,m}(t)$ is a combination of the line-of-sight (LoS) and the non-LoS components. It is clear that the U2G channel $h_{n,m}(t)$ depends on the UAVs’ trajectory planning. Each GU has a battery with finite capacity $E_{\max}$ and can harvest RF energy from the UAVs’ signal beamforming during the power transfer period $t_e$. Let $E_m(t)$ denote the GU-$m$’s battery level at the beginning of the $t$-th time slot. Let $w_n^e(t)\in\mathbb{C}^{1\times F}$ denote the UAV-$n$’s beamforming vector in the power transfer phase $t_e$.
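To make the large-scale part of the U2G model in eq. (2) concrete, here is a minimal sketch. The small-scale LoS/NLoS fading term is omitted, and `omega0` is an illustrative unit-distance gain, not a value from the paper.

```python
import math

def u2g_channel_gain(uav_pos, gu_pos, omega0=1e-3):
    """Large-scale amplitude gain of the U2G channel in eq. (2):
    sqrt(omega0) / distance. Small-scale fading is left out here."""
    d = math.dist(uav_pos, gu_pos)  # Euclidean UAV-to-GU distance
    return math.sqrt(omega0) / d

# A UAV hovering at altitude 100 m directly above a ground user:
g = u2g_channel_gain((0.0, 0.0, 100.0), (0.0, 0.0, 0.0))
```

As the distance grows, the gain decays as 1/d, which is why trajectory planning directly shapes the channel conditions.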
Hence, the energy harvested by GU-$m$ can be evaluated as [30]:

$e_m^h(t)=\sum_{n\in\mathcal{N}}\eta_e\int_0^{t_e}|h_{n,m}(t)\,w_n^e(t)|^2\,dt$, (3)

where $\eta_e$ denotes the energy conversion efficiency and $w_n^e(t)$ is limited by the power constraint $\|w_n^e(t)\|^2\le p_n^e$. The GU-$m$’s energy consumption $e_m^c(t)$ for uplink data transmission via the U2G channel is assumed to be a linear function of the transmission time, i.e., $e_m^c(t)=p_m^s t_s$, where the transmit power $p_m^s$ depends on the GU-$m$’s transmission rate and the U2G channel condition. To ensure sustainable operations, the GU-$m$’s energy consumption is constrained as follows:

$e_m^c(t)\le\min\{E_m(t)+e_m^h(t),\,E_{\max}\}$. (4)

Thus, the GU-$m$’s battery level evolves as follows:

$E_m(t+1)=\min\{E_{\max},\,E_m(t)+e_m^h(t)-e_m^c(t)\}$. (5)

Figure 1: Multiple UAVs assist GUs’ data transmissions to the remote BS by joint trajectory planning, network formation, and transmission control.

III-B GUs’ Scheduling and Uplink Data Transmissions

Each GU independently generates sensing data with the random size $q_m(t)$ per time slot, which remains in its data buffer until the GU is scheduled by UAVs for uplink data transmission during the second phase $t_s$. Let the binary variables $\Psi=\{\psi_{m,n}(t)\}_{m\in\mathcal{M},\,n\in\mathcal{N},\,t\in\mathcal{T}}$ denote the GUs’ association strategy, where $\psi_{m,n}(t)=1$ indicates the association between GU-$m$ and UAV-$n$ in the $t$-th time slot, and $\psi_{m,n}(t)=0$ otherwise. We assume that each GU can associate with at most one UAV per time slot [7, 5]:

$\sum_{n\in\mathcal{N}}\psi_{m,n}(t)\le 1,\ \forall m\in\mathcal{M}\text{ and }t\in\mathcal{T}$. (6)

Let $\tilde{\mathcal{M}}(t)\triangleq\{m\,|\,\exists\, n,\ \psi_{m,n}(t)=1\}$ denote the set of all active GUs in the same time slot. These GUs may create interference to each other during their uplink data transmissions.
Given the UAV-$n$’s receive beamforming vector $w_n^s(t)\in\mathbb{C}^{1\times F}$, the GU-$m$’s signal to interference plus noise ratio (SINR) at the UAV-$n$ is defined as:

$\gamma_{m,n}=\dfrac{p_m^s\,|h_{m,n}w_n^s|^2}{\sigma_n^2+\sum_{m'\in\tilde{\mathcal{M}}\setminus\{m\}}p_{m'}^s\,|h_{m',n}w_n^s|^2}$, (7)

where $\sigma_n^2$ denotes the noise power, and the second term in the denominator is the interference to the GU-$m$’s signal reception at the UAV-$n$. For simplicity, we omit the time index in (7). Thus, the uplink data rate from the GU-$m$ to the UAV-$n$ is given by:

$d_{m,n}^s(t)=\psi_{m,n}(t)\,t_s\log\left(1+\gamma_{m,n}(t)\right)$. (8)

Let $Q_m(t)$ denote the GU-$m$’s data buffer size. Thus, the GU-$m$’s buffer dynamics can be described as follows:

$Q_m(t+1)=\max\{Q_m(t)-\sum_{n\in\mathcal{N}}d_{m,n}^s(t),\,0\}+q_m(t)$. (9)

III-C Network Formation via U2U Relay Communications

During the forward transmission phase $t_r$, UAVs can either transmit data directly to the BS or forward it to the next UAVs via the U2U links. When the direct U2B channel is unavailable, e.g., the UAV is far away from the BS, the UAV will establish the U2U connection to a nearby UAV with a more preferable channel condition, instead of carrying the sensing data and flying closer to the BS for direct transmissions. The adaptive formation and evolution of U2U connections along the UAVs’ trajectories can be viewed as the UAVs’ network formation. In particular, let the binary matrix $\Phi=\{\phi_{n,n'}(t)\}_{n,n'\in\tilde{\mathcal{N}},\,t\in\mathcal{T}}$ denote the UAVs’ network formation strategy, similar to that in [7]. We have $\phi_{n,n'}(t)=1$ if the UAV-$n$ transmits data to the UAV-$n'$ via the U2U channel, and $\phi_{n,n'}(t)=0$ otherwise. To mitigate interference, we assume that each UAV can forward data in at most one U2U channel each time:

$\sum_{n'\in\tilde{\mathcal{N}},\,n'\ne n}\phi_{n,n'}(t)\le 1,\ \forall n\in\mathcal{N}\text{ and }t\in\mathcal{T}$.
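A minimal numerical reading of eqs. (7)-(9), under two simplifying assumptions of ours: the effective channel gains after beamforming are treated as scalars, and the rate uses a base-2 logarithm (the paper writes log without a stated base).

```python
import math

def uplink_rate(p_sig, g_sig, interferers, noise, t_s=1.0):
    """Uplink data volume per slot from eqs. (7)-(8).

    p_sig, g_sig: transmit power and effective channel of the scheduled GU.
    interferers:  list of (power, gain) pairs for co-scheduled GUs.
    Returns t_s * log2(1 + SINR)."""
    interference = sum(p * abs(g) ** 2 for p, g in interferers)
    sinr = p_sig * abs(g_sig) ** 2 / (noise + interference)
    return t_s * math.log2(1.0 + sinr)

def gu_buffer_update(Q, sent, arrived):
    """GU buffer dynamics in eq. (9): drain what was sent, add new data."""
    return max(Q - sent, 0.0) + arrived
```

With no interferers, unit power and gain, and unit noise, the SINR is 1 and one slot carries exactly one unit of data.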
(10)

Let $p_n^e$ denote the UAV-$n$’s transmit power and $w_n^r(t)$ be the normalized transmit beamforming vector in the forward transmission phase. Then, the SINR at the receiving UAV-$n'$ from the transmitting UAV-$n$ is given by:

$\gamma_{n,n'}^r=\dfrac{p_n^e\,|h_{n,n'}w_n^r|^2}{\sigma_{n'}^2+\sum_{k\in\mathcal{N}\setminus\{n\}}\phi_{k,n'}\,p_k^e\,|h_{k,n'}w_k^r|^2}$, (11)

where the second term in the denominator denotes the interference from the other UAVs forwarding data simultaneously. We assume that the U2U channel $h_{n,n'}$ follows a similar model to that of the U2G channel. Thus, the size of data forwarded from the UAV-$n$ to the UAV-$n'$ is given by:

$d_{n,n'}^r(t)=\phi_{n,n'}(t)\,t_r\log\left(1+\gamma_{n,n'}(t)\right)$, (12)

and the total data received by the UAV-$n$ is defined as:

$d_n^c(t)=\sum_{m\in\mathcal{M}}d_{m,n}^s(t)+\sum_{n'\in\mathcal{N}\setminus\{n\}}d_{n',n}^r(t)$, (13)

which includes the data collected from the GUs directly and that forwarded by the other UAVs. Via the U2U channels, the UAV-$n$ can also forward a part of its data to the other UAVs, denoted as $d_n^o(t)=\sum_{n'\in\tilde{\mathcal{N}}\setminus\{n\}}d_{n,n'}^r(t)$. Similar to the GUs’ buffer dynamics in (9), the UAV-$n$’s data buffer status $D_n(t)$ in each time slot dynamically evolves as follows:

$D_n(t+1)=\max\{D_n(t)-d_n^o(t),\,0\}+d_n^c(t)$. (14)

Note that both GUs and UAVs may have limited data buffer sizes. Considering the low cost of buffer space, the buffer size can be practically very large compared to the size of the sensing data, and thus we omit the buffer limits in (9) and (14).
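Constraint (10) limits each UAV to at most one outgoing U2U link per slot. As a purely illustrative baseline (in the paper the link choice is learned jointly with the trajectories, not picked greedily), a naive rule consistent with that constraint could look like:

```python
def choose_next_hop(rates):
    """Greedy next-hop selection respecting constraint (10): keep at most
    one outgoing U2U link, here the neighbor with the best achievable rate.

    rates: dict mapping candidate receiver id -> achievable rate d^r_{n,n'}.
    Returns the chosen receiver id, or None if no U2U link is feasible
    (e.g., all candidate channels are unavailable this slot)."""
    if not rates:
        return None
    return max(rates, key=rates.get)

# UAV-n sees a weak direct U2B link and a stronger relay via another UAV:
hop = choose_next_hop({"bs": 2.0, "uav2": 3.5})  # picks "uav2"
```

Such a myopic rule ignores downstream congestion and future positions, which is precisely the gap the learned network formation policy is meant to close.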
IV Delay-Tolerant MADRL for UAVs’ Collaborative Control

We aim to maximize the network capacity by jointly optimizing the UAVs’ trajectory planning $\mathcal{L}\triangleq\{\mathcal{L}_n\}_{n\in\mathcal{N}}$, network formation $\Phi$, and transmission control strategies $(\mathcal{W},\Psi)$, including the UAVs’ beamforming $\mathcal{W}\triangleq\{w_n^e(t),w_n^s(t),w_n^r(t)\}_{n\in\mathcal{N},\,t\in\mathcal{T}}$ and the GUs’ scheduling strategy $\Psi$, which can be formulated as:

$\max_{\mathcal{L},\Phi,\mathcal{W},\Psi}\ \mathbb{E}\Big[\sum_{n\in\mathcal{N}}d_{n,0}^r\Big]\quad\text{s.t.}\ (1)\text{–}(14)$. (15)

The expectation is over different time slots $t\in\mathcal{T}$ and $d_{n,0}^r$ denotes the data forwarded to the remote BS by the UAV-$n$ per time slot. Problem (15) is a mixed-integer nonlinear program and challenging to solve. The UAVs’ network formation and GUs’ scheduling are discrete variables, both scaling exponentially with the numbers of GUs and UAVs. Additionally, different UAVs’ trajectory planning and beamforming strategies are strongly coupled in the time and space domains, which makes one-shot optimization computationally expensive. In the sequel, we first propose the general MADRL framework to solve problem (15) with complete network information. Then, we reveal the practical limitations of MADRL in large-scale UAV-assisted wireless networks due to the unreliable channels for the UAVs’ information exchanges. This motivates us to propose the delay-tolerant MADRL algorithm to guide the UAVs’ trajectory planning, improving both information sharing and network capacity.

IV-A General Multi-agent DRL Framework

The effective application of DRL approaches depends on a proper reformulation of problem (15) into an MDP, which offers a robust mathematical model for sequential decision-making problems. For a multi-UAV-assisted wireless network, the system state at each time slot consists of all UAVs’ observations $o(t)\triangleq\{\ell(t),D(t)\}$, including all UAVs’ locations $\ell(t)\triangleq\{\ell_n(t)\}_{n\in\mathcal{N}}$ and their buffered data $D(t)\triangleq\{D_n(t)\}_{n\in\mathcal{N}}$.
According to the optimization problem in (15), the UAV-$n$'s action in the $t$-th time slot, $\mathbf{a}_n(t)$, includes the next trajectory point $\boldsymbol{\ell}_n(t+1)$, the U2U network formation $\{\phi_{n,n'}(t)\}_{n' \in \tilde{\mathcal{N}}}$, the beamforming vectors $\{\mathbf{w}_n^e(t), \mathbf{w}_n^s(t), \mathbf{w}_n^r(t)\}$, and the GUs' association strategy $\{\psi_{m,n}(t)\}_{m \in \mathcal{M}}$. The UAVs' joint action is given by $\mathbf{a}(t) \triangleq \{\mathbf{a}_n(t)\}_{n \in \mathcal{N}}$. The reward depends on the complete network state and all UAVs' actions. Considering the throughput maximization in (15), we assume that each UAV shares the same reward function $R(t)$, defined as follows:
$$R(t) = \sum_{n=1}^{N} \mu_1 d_{n,0}^r(t) + \mu_2 Z_n(t), \tag{16}$$
where $\mu_1$ and $\mu_2$ are the weighting parameters for the overall throughput and the penalty term $Z_n(t)$, respectively. The transmission reward $d_{n,0}^r(t)$ encourages each UAV-$n$ to forward more information to the BS, while the penalty term $Z_n(t)$ ensures the UAVs' safety during trajectory planning:
$$Z_n(t) = \sum_{n,n' \in \mathcal{N}} \Bigl( \mathbb{I}\bigl(\dot{\boldsymbol{\ell}}_n(t) \le v_{\max}\bigr) + \mathbb{I}\bigl(\ell_{n,n'}(t) \ge d_{\min}\bigr) \Bigr),$$
where $\dot{\boldsymbol{\ell}}_n(t)$ is the UAV-$n$'s speed and $\ell_{n,n'}(t)$ is the distance between two UAVs. Here $\mathbb{I}(\cdot)$ denotes the indicator function. Each agent in the MADRL framework has distributed decision-making capabilities, dynamically adjusting its next action based on its perception of the network state using a pair of deep neural networks (DNNs), namely the actor and the critic networks. The actor network selects actions by approximating the policy function, while the critic network evaluates these actions by estimating the value function. These two networks enable each agent to adapt and evaluate actions during the training phase until all agents' joint policies stabilize at the highest reward value. Let $\theta_t$ and $w_t$ denote the DNN parameters of the actor and critic networks, respectively.
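The structure of the shared reward can be illustrated with a small sketch: a throughput term plus the indicator-based safety term $Z_n(t)$. The weights and the sample speeds/distances below are illustrative assumptions, not values from the paper.

```python
# Sketch of the shared reward (16): a throughput term plus a safety
# term built from indicator functions. mu1, mu2, v_max, d_min and the
# sample values are illustrative assumptions.
def safety_term(speeds, pair_dists, v_max=20.0, d_min=1.0):
    """Z_n(t): indicators for the speed limit and pairwise separation."""
    return sum(v <= v_max for v in speeds) + \
           sum(d >= d_min for d in pair_dists)

def shared_reward(d_to_bs, Z, mu1=1.0, mu2=0.1):
    """R(t) = sum_n (mu1 * d_{n,0}^r(t) + mu2 * Z_n(t))."""
    return sum(mu1 * d + mu2 * z for d, z in zip(d_to_bs, Z))

# UAV 1 flies safely; UAV 2 is too fast and too close to a neighbor.
Z = [safety_term([12.0], [2.5]), safety_term([25.0], [0.4])]
R = shared_reward(d_to_bs=[3.0, 1.0], Z=Z)
```

Because every UAV receives the same $R(t)$, each agent's gradient reflects the team objective rather than its individual throughput alone.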
Given the global network information $\mathbf{o}(t)$, the UAV-$n$'s actor network $\Pi_n^{\theta_t}$ aims to generate the action $\mathbf{a}_n(t) = \Pi_n^{\theta_t}(\mathbf{o}(t))$ to maximize the value function $J_n(\theta_t)$, which is defined as the cumulative discounted reward $J_n(\theta_t) = \mathbb{E}_{\mathbf{a}}\bigl[\sum_{t=0}^{\infty} \rho_r^t R_n(t)\bigr]$, where $\rho_r$ represents the discount factor for the rewards in different time slots. However, the true value function $J_n(\theta_t)$ is unattainable during online learning due to the limited sample size. Thus, the critic network $Q_n^{w_t}$ is further employed to approximate the true value function, namely the Q-value, which is used to evaluate the quality of the state-action pair $(\mathbf{o}(t), \mathbf{a}_n(t))$. A large Q-value implies that the action $\mathbf{a}_n(t)$ is more preferable when the same state is visited in future time steps. To this end, we can update the policy $\Pi_n^{\theta_t}$ by using the deterministic policy gradient method [15]. The critic network $Q_n^{w_t}$ can be updated according to the temporal-difference (TD) error between the online critic network $Q_n^{w_t}$ and its target $y_n(t) = R_n(t) + \rho_r \hat{Q}_n^{\hat{w}_t}$. Here, the DNN parameter $\hat{w}_t$ of the target critic network $\hat{Q}_n^{\hat{w}_t}$ is a delayed copy of $w_t$ to prevent an excessive overestimation of the value function. To minimize the TD error, the gradient descent method can be used to update the critic network $Q_n^{w_t}$. More detailed derivations of the general MADRL framework can be found in [7] and are thus omitted here for conciseness.

IV-B Delay-Tolerant MADRL with Limited Communications

Each UAV-$n$'s actor network in the general MADRL framework relies on its local observation $\mathbf{o}_n(t)$ and the other UAVs' observations $\mathbf{o}_{-n}(t)$ as input to generate the action $\mathbf{a}_n(t) = \Pi_n^{\theta_t}(\mathbf{o}_n(t), \mathbf{o}_{-n}(t))$.
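The TD target used to train the critic can be sketched with scalar stand-ins for the online and target networks; the numeric values and function names are illustrative assumptions.

```python
# Sketch of the critic's TD target y_n(t) = R_n(t) + rho_r * Qhat,
# where Qhat is produced by the delayed target network. Scalars stand
# in for the DNNs; all values are illustrative.
def td_target(reward, q_target_next, rho_r):
    return reward + rho_r * q_target_next

def td_error(q_online, reward, q_target_next, rho_r):
    """The critic is trained by gradient descent on the squared TD error."""
    return td_target(reward, q_target_next, rho_r) - q_online

y = td_target(reward=1.0, q_target_next=2.0, rho_r=0.9)
delta = td_error(q_online=2.5, reward=1.0, q_target_next=2.0, rho_r=0.9)
```

Using the slowly updated target parameters $\hat{w}_t$ inside `td_target` (rather than $w_t$ itself) is what damps the overestimation feedback loop described above.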
This requires that all UAVs can report their real-time local information to the BS and share the complete network state $\mathbf{o}(t) = (\mathbf{o}_n(t), \mathbf{o}_{-n}(t))$ with all UAVs for efficient coordination. We denote this as the Ideal-MADRL algorithm in the following discussions. The Ideal-MADRL's real-time environmental awareness allows different UAVs to plan their optimal trajectories and transmission control decisions to avoid signal interference and inefficiency. However, when the UAVs' real-time information exchange becomes unavailable in a practical system, the learning performance may degrade significantly due to the misaligned information input $(\mathbf{o}_n(t), \tilde{\mathbf{o}}_{-n}(t))$ to the UAV-$n$'s actor network, where $\tilde{\mathbf{o}}_{-n}(t)$ can be viewed as the delayed network information from the other UAVs. For example, some UAVs may be distant from the BS so that the U2B links are unavailable. As such, they cannot send their local information to the BS via the U2B links in every time slot. Two UAVs may also be disconnected as they fly away from each other, making their information exchange via the U2U link difficult.

IV-B1 Delay-penalized reward design

Considering the practical challenges of real-time information exchange in UAV-assisted wireless networks, we are motivated to design a delay-tolerant MADRL algorithm that accounts for the discrepancy between $\tilde{\mathbf{o}}_{-n}(t)$ and $\mathbf{o}_{-n}(t)$. To achieve this, we record the actual time delay between $\tilde{\mathbf{o}}_{-n}(t)$ and $\mathbf{o}_{-n}(t)$ in each time slot, and propose a delay-penalized reward function for each UAV to guide its trajectory planning. In particular, let $\zeta_n(t)$ denote the UAV-$n$'s information delay since its last U2B connection with the BS. When the time delay $\zeta_n(t)$ becomes large, the delayed information $\tilde{\mathbf{o}}_{-n}(t)$ will be very different from the real state $\mathbf{o}_{-n}(t)$.
Hence, we employ the time delay $\zeta_n(t)$ as a penalty term that forces each UAV to fly closer to the BS and report its newest network information via the U2B link, avoiding a continuous increase in the information delay. As such, we can revise the reward in (16) as follows to account for the UAVs' information delay:
$$\tilde{R}(t) = \sum_{n=1}^{N} \mu_1\bigl(\omega_1 d_{n,0}^r(t) - \omega_2 \zeta_n(t)\bigr) + \mu_2 Z_n(t), \tag{17}$$
where $\omega_1$ and $\omega_2$ are non-negative weighting coefficients that balance the network throughput and the information delay, respectively. This new reward function encourages more frequent interactions between the UAVs and the BS. When all UAVs' information delays are small, the complete network information can be shared among all UAVs in a timely manner, leading to a fully coordinated UAV-assisted wireless network that serves the GUs with minimum conflicts and interference.

IV-B2 Evaluating the information delay

We assume that the BS can cache the latest state information reported by the UAVs via the U2B links. When the U2B link exists, each UAV-$n$ can forward its sensing information to the BS and report its local state information $\mathbf{o}_n(t)$ as well, which replaces the obsolete information cached by the BS. The UAV-$n$ can also retrieve all state information $\mathbf{o}_{-n}(t)$ of the other UAVs via the U2B link. When the UAV-$n$ is far from the BS and there is no U2B link, the BS maintains the same state information $\mathbf{o}_n(t)$ for the following time slots. The UAV-$n$ cannot retrieve the latest information from the BS and has to rely on delayed information to make decisions. In the $t$-th time slot, let $c_n(t)$ denote the most recent time slot in which the UAV-$n$ exchanged state information with the BS, reporting its local information and retrieving the global network information. Note that the information exchange depends on the availability of the U2B link. We can update $c_n(t)$ as:
$$c_n(t) = c_n(t-1)\bigl(1 - \phi_{n,0}(t)\bigr) + t\,\phi_{n,0}(t). \tag{18}$$
When the U2B link exists, i.e., $\phi_{n,0}(t) = 1$, the UAV-$n$ reports its information to the BS and thus we have $c_n(t) = t$; otherwise $c_n(t) = c_n(t-1)$. Hence, the information delay $\zeta_n(t)$ can be simply evaluated as $\zeta_n(t) = t - c_n(t)$, which represents the number of time slots since the UAV-$n$'s last successful U2B connection with the BS. In this case, the BS maintains the obsolete state information $\mathbf{o}_n(t - \zeta_n(t))$ to approximate the UAV-$n$'s most recent state information $\mathbf{o}_n(t)$ in the $t$-th time slot. When the UAV-$n$ reports its state information to the BS in the $t$-th time slot, the other UAVs can immediately access its latest state information $\mathbf{o}_n(t)$ with zero delay, i.e., $\zeta_n(t) = 0$. Otherwise, they may make biased decisions based on the obsolete state information $\mathbf{o}_n(t - \zeta_n(t))$. It is clear that the UAV-$n$'s information delay $\zeta_n(t)$ depends on the frequency of its U2B connections with the BS. Infrequent connections force the UAVs to use delayed information for trajectory planning and network formation, potentially leading to interference or service conflicts.

V Spatio-Temporal Attention Enhanced MADRL

The delay-tolerant MADRL records all UAVs' information delays and relies on the delay-penalized reward to guide the UAVs' trajectory planning, ensuring frequent information exchange with the BS. As such, all UAVs' information delays can be maintained at a low level. However, as the network scales up with more UAVs, it becomes more difficult to plan all UAVs' trajectories and schedule their information exchange with the BS while maintaining a low information delay. Note that the U2B connections are available only when the UAVs are close to the BS. The U2B transmissions can be congested, so the information delay inevitably increases as the number of UAVs grows.
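The bookkeeping in (18) amounts to resetting a timestamp whenever a U2B link exists; a minimal sketch, with an illustrative link pattern:

```python
# Sketch of the delay bookkeeping in (18): c_n(t) stores the last slot
# with a U2B link (phi_{n,0}(t) = 1), and the information delay is
# zeta_n(t) = t - c_n(t). The link pattern below is illustrative.
def update_last_contact(c_prev, t, u2b_link):
    """c_n(t) = c_n(t-1) * (1 - phi) + t * phi."""
    return t if u2b_link else c_prev

c, delays = 0, []
for t, link in enumerate([1, 0, 0, 1, 0], start=1):
    c = update_last_contact(c, t, link)
    delays.append(t - c)  # zeta_n(t) grows until the next U2B contact
```

The delay sequence resets to zero at each contact slot and grows by one per slot in between, which is exactly the quantity penalized in (17).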
Besides the delay-penalized reward design, we further explore how UAVs can fly efficiently to counter the inevitable information delay or loss as the network scales up. Our basic idea is to predict and recover the UAVs' missing information, instead of acquiring it from the BS via U2B connections, by exploiting the spatio-temporal dependencies in the UAVs' historical information.

Figure 2: The STA-MADRL framework.

V-A Spatio-Temporal Attention for Information Prediction

An overview of the proposed STA-MADRL framework is shown in Fig. 2. It incorporates the delay-penalized reward design to improve the UAVs' information exchange with the BS, and leverages a spatio-temporal prediction module to recover the missing information, thereby improving the UAVs' awareness of the complete system state for collaborative trajectory planning and transmission control. The spatio-temporal prediction module combines temporal and spatial attention mechanisms to improve state prediction. The temporal attention module predicts the latest state information by analyzing each individual UAV's historical behavior pattern; the temporal dynamics are captured using multi-head attention (MHA) [26]. The spatial attention module exploits the different UAVs' spatial correlations for information prediction using graph attention networks (GAT) [6]. The combination of spatial and temporal features creates more comprehensive feature representations. The corrected state information is then input to the MADRL framework for each UAV to learn its trajectory planning, network formation, and transmission control strategies. As shown in Fig. 2, the BS maintains a table of the UAVs' state information, in which solid green points denote the UAVs' real-time state information, while hollow points represent obsolete or missing information. Due to the UAVs' limited U2B communications with the BS, the state information cached by the BS is incomplete.
Once a UAV has a U2B connection with the BS, its state information can be updated in the BS's information table. By counting each UAV's information delay, the delay-penalized reward can be evaluated to assess the quality of the UAVs' current actions, including the trajectory planning and transmission control strategies. The BS's information table can be further processed by the spatio-temporal attention module, which corrects the delayed information or fills in the missing entries of the information table. The corrected information table then provides the complete system state for all UAVs to make coordinated decisions in the MADRL framework. In the sequel, we present the details of the spatio-temporal attention based prediction module.

V-B Temporal Multi-head Attention

The temporal attention aims to predict each UAV's future state information by exploiting the dependencies in its historical information. For example, when the GUs' traffic demands exhibit a stable spatial distribution over time, the UAVs may also have stable and periodically repeated trajectories to serve all GUs, which can be exploited to predict the UAVs' future trajectories. To proceed, we define a moving window of length $\tau_0$ to predict each UAV's next state information by focusing on its most recent state information in the past $\tau_0$ time slots. This allows the temporal prediction model to capture recent trends efficiently. At the $t$-th time slot, we collect the UAV-$n$'s historical state information from time slot $t - \tau_0$ to $t-1$, denoted as $\mathbf{X}_n(t) = \{\mathbf{o}_n(t-\tau_0), \ldots, \mathbf{o}_n(t-1)\} \in \mathbb{R}^{\tau_0 \times d_o}$, where $d_o$ represents the dimension of the UAV-$n$'s local state information, including its trajectory point and buffer size. This sequence $\mathbf{X}_n(t)$ is then used to predict the current state $\hat{\mathbf{o}}_n(t)$.
We implement the prediction with the MHA mechanism [26], which processes temporal dependencies in parallel and exhibits prominent advantages over sequential algorithms such as long short-term memory (LSTM) and gated recurrent units (GRUs) in capturing nonlinear and long-range patterns. Specifically, MHA is composed of a set of DNNs that transform input sequences into queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$ through linear projections with tunable weight matrices. It computes the attention scores using the scaled dot-product independently in each head, allowing a simultaneous focus on different feature subspaces. The outputs of all heads are concatenated into a vector and then decoded as the final output of MHA. Such a multi-head architecture can capture complex temporal dependencies in a high-dimensional feature space, making it particularly effective for processing sequential time series.

Figure 3: Network structure of the spatio-temporal attention prediction module.

As detailed in Fig. 3, the historical input sequence $\mathbf{X}_n(t)$ is first encoded by $L^c(\cdot)$ and linearly transformed into a high-dimensional form $\mathbf{E}_n(t) = L^c(\mathbf{X}_n) = \mathbf{X}_n(t)(\mathbf{V}_n^c)^T + \mathbf{b}_n^c \in \mathbb{R}^{\tau_0 \times d_e}$, where $\mathbf{V}_n^c \in \mathbb{R}^{d_e \times d_o}$ and $\mathbf{b}_n^c \in \mathbb{R}^{\tau_0 \times d_e}$ are the weight matrix and bias of the encoding layer, respectively. For notational convenience, we omit the time index in the following discussions. After encoding, the dimension of the feature space increases from $d_o$ to $d_e$. The encoding into a high-dimensional space typically offers richer features for extraction and facilitates more efficient training of the subsequent attention modules. We then feed the encoded data $\mathbf{E}_n$ into parallel self-attention modules, using scaled dot-product computations to capture the correlations in the encoded data.
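The encoding step is a plain affine map that lifts the $\tau_0 \times d_o$ history into a $\tau_0 \times d_e$ feature space; a sketch with illustrative dimensions and random initialization (not the paper's trained weights):

```python
import numpy as np

# Sketch of the encoding layer E_n = X_n (V^c)^T + b^c. Dimensions and
# the random initialization are illustrative assumptions.
rng = np.random.default_rng(0)
tau0, d_o, d_e = 4, 3, 8
X = rng.standard_normal((tau0, d_o))    # UAV-n's windowed history
V_c = rng.standard_normal((d_e, d_o))   # encoder weight matrix
b_c = rng.standard_normal((tau0, d_e))  # encoder bias
E = X @ V_c.T + b_c                     # encoded features, tau0 x d_e
```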
Through three fully connected layers, the self-attention module maps $\mathbf{E}_n$ into the query space $\mathbf{Q}_n = L^Q(\mathbf{E}_n) \in \mathbb{R}^{\tau_0 \times d_k}$, the key space $\mathbf{K}_n = L^K(\mathbf{E}_n) \in \mathbb{R}^{\tau_0 \times d_k}$, and the value space $\mathbf{V}_n = L^V(\mathbf{E}_n) \in \mathbb{R}^{\tau_0 \times d_v}$, respectively, where $d_k$ and $d_v$ represent the dimensions of the query-key space and the value space. The attention score of each query-key pair is $\boldsymbol{\lambda}_n = \mathbf{Q}_n \mathbf{K}_n^T$, which indicates the query-key similarity, i.e., a higher value shows a stronger correlation. The scaled scores $\hat{\boldsymbol{\lambda}}_n = \boldsymbol{\lambda}_n / \sqrt{d_k}$ weight the values to focus on the most pertinent parts of the input for accurate predictions. Subsequently, we apply the softmax function to transform the scaled scores $\hat{\boldsymbol{\lambda}}_n$ into the attention matrix:
$$\boldsymbol{\delta}_n = \mathrm{softmax}(\hat{\boldsymbol{\lambda}}_n), \tag{19}$$
which represents the importance of the key $\mathbf{K}_n$ to the query $\mathbf{Q}_n$; a higher weight indicates that the key is more relevant. The final value is $\mathbf{B}_n = \boldsymbol{\delta}_n \mathbf{V}_n$. Furthermore, multiple independent attention operations in different heads are employed to stabilize the self-attention learning process. The outputs of all heads are concatenated into a vector $\hat{\mathbf{B}}_n$, which is then decoded into the predicted information vector as follows:
$$\hat{\mathbf{Y}}_n(t) = L^d(\hat{\mathbf{B}}_n) = \hat{\mathbf{B}}_n(\mathbf{V}_n^d)^T + \mathbf{b}_n^d = \{\hat{\mathbf{o}}_n(t-\tau_0+1), \ldots, \hat{\mathbf{o}}_n(t)\} \in \mathbb{R}^{\tau_0 \times d_o}, \tag{20}$$
where $\mathbf{V}_n^d \in \mathbb{R}^{d_o \times d_v}$ and $\mathbf{b}_n^d \in \mathbb{R}^{\tau_0 \times d_o}$ are the weight matrix and bias of the decoding layer, respectively. At this point, we can extract the UAV-$n$'s information $\hat{\mathbf{o}}_n(t)$ from $\hat{\mathbf{Y}}_n(t)$ at the $t$-th time slot. By collecting all UAVs' predictions $\{\hat{\mathbf{o}}_n(t)\}_{n \in \mathcal{N}}$, the BS can build the complete state information.

V-C Graph-based Spatial Attention

The temporal MHA focuses on the dynamic state transitions and correlations hidden in each individual UAV's historical information, without considering cross-UAV spatial dependencies.
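A single scaled dot-product head, as used inside the MHA above, can be sketched in a few lines of NumPy; the shapes and random inputs are illustrative assumptions, and the decoding layer of (20) is omitted.

```python
import numpy as np

# Sketch of one scaled dot-product attention head: scores
# lambda_hat = Q K^T / sqrt(d_k), row-wise softmax into delta as in
# (19), then B = delta V. Shapes and inputs are illustrative.
def attention_head(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # scaled scores
    delta = np.exp(scores - scores.max(axis=-1, keepdims=True))
    delta /= delta.sum(axis=-1, keepdims=True)     # softmax, Eq. (19)
    return delta @ V, delta                        # B_n = delta V_n

rng = np.random.default_rng(1)
tau0, d_k, d_v = 4, 8, 6
Q = rng.standard_normal((tau0, d_k))
K = rng.standard_normal((tau0, d_k))
V = rng.standard_normal((tau0, d_v))
B, delta = attention_head(Q, K, V)
```

Running several such heads on independent projections and concatenating their outputs yields $\hat{\mathbf{B}}_n$ before the decoding layer.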
In fact, two UAVs may have highly correlated observations when they are close to each other and have similar trajectories. Intuitively, it is feasible to predict one UAV's state based on the states of its neighboring UAVs. Such proximity-based spatial correlations imply that we may evaluate the spatial attention based on the distances among the UAVs or the similarity of their trajectories. Considering the UAVs' mobility, we can model the UAVs' network formation as a dynamic graph structure, based on which we can perform graph computations and extract the spatial correlations among UAVs. We envision that the combination of temporal and spatial correlations can provide more accurate and comprehensive predictions for the UAVs' missing state information. As shown in Fig. 3, we define an attention graph $G(\mathbf{Z}(t), \mathbf{F}(t))$ to represent the UAVs' spatial connections in the $t$-th time slot, where $\mathbf{Z}(t) = \{\mathbf{z}_n(t)\}_{n \in \mathcal{N}}$ denotes the set of nodes and $\mathbf{F}(t) = \{f_{n,n'}(t)\}_{n \ne n',\, n,n' \in \mathcal{N}}$ denotes the weights on the edges connecting the nodes. Each node $\mathbf{z}_n(t) \in \mathbf{Z}(t)$ corresponds to a UAV and contains the encoded observations of that UAV, i.e., $\mathbf{z}_n(t) = L^s(\hat{\mathbf{o}}_n(t)) \in \mathbb{R}^{d_s}$. The encoding function $L^s(\cdot)$ maps the UAV's state information into the $d_s$-dimensional feature space, preserving the node's information for further graph computations. The edge weight $f_{n,n'}(t)$ is the attention score between two UAVs, evaluated based on their distance and U2U connections. Given the graph structure $G(\mathbf{Z}(t), \mathbf{F}(t))$, we further extract spatial features with the GAT module, which includes stacked graph attention layers.
For each edge, the weight $f_{n,n'}(t)$ can be updated by evaluating the similarity between two nodes:
$$\hat{f}_{n,n'}(t) = \mathrm{softmax}\Bigl(\mathrm{LeakyReLU}\bigl(\boldsymbol{\rho}\,[\mathbf{z}_n(t) \,\|\, \mathbf{z}_{n'}(t)]\bigr)\Bigr), \tag{21}$$
where $[\mathbf{z}_n(t) \,\|\, \mathbf{z}_{n'}(t)]$ denotes the concatenation of the two vectors $\mathbf{z}_n(t)$ and $\mathbf{z}_{n'}(t)$, and $\boldsymbol{\rho}$ is the GAT's trainable parameter. After updating all edge weights in $\hat{\mathbf{F}}(t)$, we can further update each node's feature by aggregation as follows:
$$\hat{\mathbf{z}}_n(t) = \mathrm{softmax}\Bigl(\sum_{n' \in \mathcal{N} \setminus n} \hat{f}_{n,n'}(t)\, \mathbf{z}_{n'}(t)\Bigr). \tag{22}$$
After iterative graph computations, the graph attention layer produces the final attention graph $G(\hat{\mathbf{Z}}(t), \hat{\mathbf{F}}(t))$. The integration of the temporal and spatial attention modules is shown in Fig. 3. Initially, the BS replaces the UAVs' obsolete state information $\mathbf{o}_n(t - \zeta_n)$ with the predicted state information $\hat{\mathbf{o}}_n(t)$ using the temporal attention module. The temporal predictions $\hat{\mathbf{o}}(t)$ then drive the spatial attention module to process and extract the dependency information from the graph $G(t)$. The final attention graph $\hat{G}(t)$ and the temporal prediction $\hat{\mathbf{o}}(t)$ are flattened into a one-dimensional vector by the fusion layer to generate the compensated and complete state information. Leveraging the enhanced state representation, MADRL can be employed to solve the throughput maximization problem, achieving preferable trajectories, network formation, and transmission control strategies. In summary, the proposed STA-MADRL framework for the UAVs' collaboration addresses the practical challenges of the conventional MADRL framework from two aspects. Firstly, considering the UAVs' limited communications, we devise the delay-penalized reward to guide the UAVs' trajectory planning, ensuring frequent information exchange with the BS.
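The edge-attention and aggregation steps can be sketched in the standard GAT style (the paper's (22) applies an additional nonlinearity to the aggregate, which is omitted here); $\boldsymbol{\rho}$, the node features, and the dimensions are illustrative random assumptions.

```python
import numpy as np

# Standard-GAT-style sketch of (21)-(22): a LeakyReLU score over the
# concatenated node features, softmax-normalized over each node's
# neighbors, then weighted aggregation. rho and z are illustrative.
def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_weights(z, rho):
    """f_hat[n, n'] over each node's neighbors (all other nodes here)."""
    N = z.shape[0]
    scores = np.full((N, N), -np.inf)       # -inf masks self-edges
    for n in range(N):
        for m in range(N):
            if n != m:
                scores[n, m] = leaky_relu(
                    rho @ np.concatenate([z[n], z[m]]))
    e = np.exp(scores - scores[~np.eye(N, dtype=bool)].max())
    return e / e.sum(axis=1, keepdims=True)  # rows sum to 1

rng = np.random.default_rng(2)
N, d_s = 3, 5
z = rng.standard_normal((N, d_s))            # encoded node features
rho = rng.standard_normal(2 * d_s)           # trainable edge parameter
f_hat = gat_weights(z, rho)
z_new = f_hat @ z                            # neighbor aggregation
```

Because the weights depend only on the current node features, the same $\boldsymbol{\rho}$ generalizes across time slots as the UAVs' graph changes.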
Secondly, we focus on the hidden spatio-temporal dependencies of the UAVs' historical observations and propose a prediction module to recover the complete state information for the UAVs' decision making in the MADRL framework.

VI Numerical Results

In this part, we numerically evaluate the STA-MADRL framework to demonstrate its performance in communication-limited UAV-assisted wireless networks. We consider $M=9$ GUs, $N=3$ UAVs, and one BS within a 2 × 2 km² area. The x-y coordinates are scaled to the range $[-1, 1]$. The BS is located at $\boldsymbol{\ell}_0 = (1, 1, 0)$ while the GUs are randomly distributed. The UAVs can start their services from arbitrary locations, operating at a fixed altitude of $H = 100$ meters with a maximum speed of $v_{\max} = 20$ m/s. The main parameters are similar to those in [7].

VI-A Information Sharing Improves Learning Performance

We implement the delay-tolerant MADRL with the delay-penalized reward and compare it with Ideal-MADRL. Note that Ideal-MADRL assumes complete and real-time information is available to all UAVs, so it serves as a theoretical performance benchmark. We also implement the conventional communication-limited MADRL, which relies on incomplete state information for decision making. Dictated by (17), the delay-tolerant MADRL leverages both the throughput performance and the delay statistics to guide the UAVs' trajectory planning, ensuring frequent information exchange with the BS. As shown in Fig. 4(a), the delay-tolerant MADRL algorithm demonstrates a higher convergence speed and improved reward performance compared with the communication-limited MADRL, verifying the importance of information sharing for the UAVs' efficient collaboration. By enforcing the UAVs' frequent information sharing, we not only achieve better learning efficiency but also improve the overall throughput performance.
However, compared with Ideal-MADRL, the throughput of the delay-tolerant MADRL still shows a significant drop, implying that the value of information sharing can be further exploited by a more sophisticated algorithm design. In Fig. 4(b), we show the average information delay of the delay-tolerant and communication-limited MADRL algorithms. It is obvious that the delay-tolerant MADRL achieves a lower information delay than the communication-limited MADRL. This verifies that the delay-penalized reward encourages the UAVs to have more frequent information exchanges with the BS. As such, the BS can timely update the UAVs' local states in the information table and share the latest system state with the other UAVs via ACK packets. With reduced information delay, all UAVs can make informative control decisions in the MADRL framework to achieve an improved throughput performance, as revealed in Fig. 4(a).

(a) Throughput dynamics. (b) Dynamics of information delay.
Figure 4: STA-MADRL achieves throughput close to the Ideal-MADRL.

In Fig. 4, we also compare the throughput and information delay performance of the STA-MADRL and delay-tolerant MADRL algorithms, along with the two baseline approaches. In our simulation, the STA-MADRL algorithm achieves a 25% throughput gain compared to the delay-tolerant MADRL and over a 75% throughput gain compared to the communication-limited MADRL at convergence. By correcting the delayed information with the spatio-temporal prediction module, the STA-MADRL algorithm offers more accurate awareness of the complete network environment, allowing the UAVs to adapt their trajectories and transmission control strategies more effectively. Compared with Ideal-MADRL, the STA-MADRL algorithm achieves a very close throughput performance without real-time information sharing among all UAVs, making it more practical for deployment in UAV-assisted wireless networks. Fig. 4(b) shows the UAVs' average information delay during the learning process.
By penalizing the information delay, the delay-tolerant MADRL algorithm can significantly reduce the UAVs' average information delay compared with the communication-limited MADRL. The STA-MADRL algorithm further reduces the information delay by making more informative decisions based on the corrected network state. Compared with the communication-limited MADRL, the average information delay can be reduced by over 50%. Besides, we observe that STA-MADRL demonstrates a more stable learning performance and faster convergence, as the dynamics of both throughput and information delay in Fig. 4 show a smaller variance of fluctuations. The learning algorithms converge as the variance of fluctuation stabilizes at a fixed level. By predicting the missing information, the STA-MADRL algorithm may provide smoother gradient updates during learning, leading to a more stable learning performance and a faster convergence speed.

(a) Information delay over time. (b) Variance of information delay.
Figure 5: Delay-penalized reward improves fair information exchange.

(a) U2B connection frequency. (b) Information error.
Figure 6: Frequent U2B information exchange reduces information error.

VI-B Delay-penalized Reward Improves Information Sharing

Figure 5 shows the UAVs' average information delay and delay variance over time along their trajectories. Compared with the communication-limited MADRL algorithm, the delay-penalized reward encourages the UAVs to have more frequent information exchange with the BS, so the UAVs' information delay can be maintained at a relatively low level, as shown in Fig. 5(a). The smaller variance of the UAVs' information delay in Fig. 5(b) implies that all UAVs in the STA-MADRL and delay-tolerant MADRL algorithms have fair opportunities to exchange state information with the BS. This verifies that the delay-penalized reward design can incentivize the UAVs to maintain a well-balanced information delay at a low level.
Without such a delay incentive, the communication-limited MADRL algorithm may allocate one UAV to serve the remote GUs for a long time period, leading to both a higher average delay in Fig. 5(a) and an unbalanced delay among the UAVs in Fig. 5(b).

(a) Ideal-MADRL (b) Communication-limited MADRL (c) Delay-tolerant MADRL (d) STA-MADRL
Figure 7: The UAVs' trajectory planning with different learning algorithms.

Figure 6(a) shows the UAVs' average frequencies of information exchange with the BS under different algorithms, obtained by counting and averaging the number of the UAVs' U2B connections along their trajectories over different time slots. Clearly, relying on the delay-penalized reward, the delay-tolerant MADRL and STA-MADRL algorithms achieve more frequent U2B connections than the communication-limited MADRL. As such, the BS can timely update all UAVs' state information and share the complete state with the other UAVs, enhancing the UAVs' environmental awareness and promoting their multi-agent collaboration. An interesting observation is that the STA-MADRL can tolerate a slightly lower U2B connection frequency than the delay-tolerant MADRL algorithm. This is because the spatio-temporal prediction module in the STA-MADRL helps the UAVs recover the delayed information and build the complete system state without frequent U2B connections. As such, the UAVs in the STA-MADRL algorithm have more opportunities to serve the GUs and thus improve the network capacity, as revealed in Fig. 4(a). The reward performance of the communication-limited MADRL degrades significantly due to the misaligned information input $(\mathbf{o}_n(t), \tilde{\mathbf{o}}_{-n}(t))$ to the UAV-$n$'s actor network, where $\tilde{\mathbf{o}}_{-n}(t)$ is the delayed information from the other UAVs. The STA-MADRL algorithm in fact aims to minimize the information error between $\tilde{\mathbf{o}}_{-n}(t)$ and the real-time state $\mathbf{o}_{-n}(t)$ by using the spatio-temporal attention based prediction module. In Fig.
6(b), we characterize each UAV's information error by $|\tilde{\mathbf{o}}_{-n}(t) - \mathbf{o}_{-n}(t)|$ and compare the UAVs' average information error under different learning algorithms. It is clear that the STA-MADRL algorithm maintains the lowest information error, which provides a more accurate estimate of the complete system state for the UAVs' decision-making in the MADRL framework. The comparison results in Fig. 6(b) corroborate the observations in Fig. 6(a). Taking the communication-limited MADRL as the baseline, the higher U2B connection frequency of the delay-tolerant MADRL algorithm in Fig. 6(a) implies a smaller information error in Fig. 6(b).

VI-C Information Sharing Enhances Network Throughput

The delay-penalized reward design in (17) accounts for both the UAVs' transmission performance and the opportunities for information sharing when planning the UAVs' trajectories. To ensure frequent information exchange with the BS, a distant UAV has to circle back to report its local information to the BS. Intuitively, this may sacrifice the UAVs' network throughput in serving the GUs, i.e., the UAVs' transmission performance may degrade when they fly back and forth between the BS and their service areas. However, our counter-intuitive observation is that the UAVs' throughput performance can also be increased significantly by planning frequent U2B connections along the UAVs' trajectories. As shown in Fig. 4(a), the convergent throughput of STA-MADRL is close to the optimum achievable by the Ideal-MADRL and nearly double that of the communication-limited MADRL. This is because frequent information sharing provides an accurate estimate of the complete system state and supports more efficient collaboration among the UAVs to improve the overall network throughput. In this part, we further evaluate the UAVs' throughput along their trajectories with different algorithms. Fig.
7 shows the UAVs' trajectories in the STA-MADRL and delay-tolerant MADRL algorithms, as well as the two baselines, when serving the GUs under the same setting. The UAVs' trajectories are highlighted with different line styles. In Fig. 7(a), Ideal-MADRL achieves full network coverage with balanced service of all GUs. With complete information, each UAV can be optimally allocated to serve a subset of GUs with a clear boundary and a stable trajectory. In contrast, the communication-limited MADRL algorithm in Fig. 7(b) leads to conflicting trajectories with obvious overlaps in the individual UAVs' service areas. Besides the service overlap, some UAVs are assigned excessively large service areas, e.g., the red and green UAVs. This may lead to excessive information delay when they stay far from the BS for a long time. Without timely information exchange, each UAV tries to cover the whole service area and serve all GUs, as shown in Fig. 7(b). This explains how excessive information delay hinders the UAVs' efficient collaboration. In Fig. 7(c) and Fig. 7(d), though the delay-tolerant MADRL and STA-MADRL produce different trajectories, one common observation is that all UAVs have relatively stable routes and serve a subset of GUs with little overlap, compared to the communication-limited MADRL. The delay-penalized rewards in both algorithms drive the UAVs to periodically circulate above a subset of GUs. Compared with STA-MADRL, the delay-tolerant MADRL still exhibits some overlapping service areas and some uncertainty in the UAVs' trajectories, due to the lack of the real-time complete system state. Compared with Ideal-MADRL, STA-MADRL also performs well, i.e., each UAV's trajectory has a clear boundary with minimal overlap in the service areas.

(a) Accumulated throughput (b) STA-MADRL's max. throughput
Figure 8: Throughput performance with different learning algorithms.

(a) STA-MADRL (b) Commun.-limited MADRL
Figure 9: Clear boundary with less service overlap in STA-MADRL.
VI-D Trade-off between Learning and Throughput Performance

Correspondingly, Fig. 8(a) records the BS’s accumulated throughput as the UAVs move along their trajectories. With complete information, the Ideal-MADRL algorithm achieves the highest throughput in Fig. 8(a) and serves as the upper bound. The service overlaps in the communication-limited MADRL algorithm inevitably decrease the efficiency of the UAVs’ collaboration and result in a significant throughput degradation, which can be regarded as the throughput lower bound for our algorithm design. The accumulated throughput of STA-MADRL increases continuously and remains close to that of Ideal-MADRL, as shown in Fig. 8(a). The above results reveal that the network throughput and the learning performance can be improved significantly and simultaneously by enabling regular, frequent information sharing among UAVs. However, the cost of information exchange cannot be ignored when excessive channel resources are consumed to establish real-time U2B connections. In an extreme case, all UAVs would stay close to the BS and exchange real-time information with it. Such real-time information exchange is useful for awareness of the network environment and multi-agent decision making in the MADRL framework, but it inevitably limits the network coverage and capacity. In this part, we examine how the overall network throughput is affected by the U2B connection frequency. Fig. 8(b) shows the change of throughput with respect to the UAVs’ average U2B connection frequencies. Given the weighting coefficient ω₂ in the delay-penalized reward (17), we effectively fix the UAVs’ sensitivity to information delay and observe the resulting U2B connection frequency in Fig. 6 by planning trajectories with the STA-MADRL algorithm. Hence, the x-axis in Fig.
8(b) is obtained by varying the weighting coefficient ω₂, while the y-axes denote the corresponding throughput and delay performance achieved by the STA-MADRL algorithm, respectively.

Figure 10: STA-MADRL achieves a higher throughput with fewer UAVs.

An interesting observation is that the network throughput of STA-MADRL does not always increase with the UAVs’ frequency of information exchange. As shown in Fig. 8(b), the network throughput first increases with the U2B connection frequency and then declines as the frequency increases further. There is a clear optimal frequency of information exchange that achieves the maximum throughput, which is close to that of Ideal-MADRL. This implies that real-time information exchange is not necessary in a practical UAV-assisted wireless network, considering the trade-off between throughput and learning performance. With low U2B connection frequencies, the UAVs experience severe information delay and estimation error in the complete system state, leading to throughput degradation in the UAVs’ collaborative control. Conversely, when the UAVs become more sensitive to information delay and demand excessive information exchange, substantial channel resources are consumed by the information exchange, which restricts the UAVs’ transmission capability to forward the GUs’ sensing data.

VI-E Robustness of Multi-UAV Collaborative Control

The delay-penalized reward design constantly monitors each UAV’s information delay and enforces frequent information exchange with the BS, while the spatio-temporal attention based prediction module further recovers the missing information. Together, these help each UAV estimate the real-time complete system state and adapt to the network environment. Therefore, we expect STA-MADRL to be more robust against network dynamics in optimizing the UAVs’ trajectories and transmission control strategies. Compared with the 3-UAV trajectory planning in Fig.
7, we add a new UAV into the service area and show the new trajectory planning strategy for the resulting 4-UAV scenario in Fig. 9. By enabling frequent information exchange and prediction, STA-MADRL can detect the presence of the new UAV and optimize its trajectory jointly with the other UAVs. Thus, we observe that STA-MADRL adaptively adjusts each UAV’s service area to ensure a clear boundary with minimal service overlap, as shown in Fig. 9(a). In contrast, under the same network settings, the UAVs’ trajectories in the communication-limited MADRL become quite chaotic, as shown in Fig. 9(b), due to the lack of accurate state information of the network. Such a lack of information sharing evidently limits the efficiency of the UAVs’ collaborative control, and also makes it inflexible and insensitive to changes in the network environment. To evaluate the efficiency of collaboration, we quantitatively compare the throughput accumulated at the BS as the 4 UAVs fly along the trajectories in Fig. 10. STA-MADRL achieves over a 40% throughput gain over the communication-limited MADRL algorithm. Besides, we plot STA-MADRL’s throughput with 3 UAVs serving the same set of GUs in Fig. 10. An interesting observation is that 3 UAVs with STA-MADRL can even achieve a higher throughput than 4 UAVs with the communication-limited MADRL algorithm. This further verifies that STA-MADRL is robust to network dynamics and more efficient for the UAVs’ collaborative control.

Figure 11: The impact of the GUs’ deployment sparsity. (a) Throughput; (b) Information delay.

Figure 11 shows the impact of the GUs’ deployment sparsity on throughput and information delay, where the sparsity level increases from 1 to 5. As depicted in Fig. 11(a), STA-MADRL achieves throughput comparable to Ideal-MADRL and outperforms the communication-limited MADRL across all sparsity levels.
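The learning–throughput trade-off discussed above is governed by the delay-penalized reward in (17). As a toy illustration only: the linear form, argument names, and weights below are assumptions for exposition, not the paper’s exact expression.

```python
# Hypothetical linear sketch of a delay-penalized reward:
# r_n = w1 * throughput_n - w2 * delay_n, where w2 plays the role
# of the coefficient omega_2 that sets a UAV's sensitivity to
# information delay. The actual form of (17) may differ.

def delay_penalized_reward(throughput, info_delay, w1=1.0, w2=0.5):
    """Trade transmission performance against information staleness.

    throughput: data forwarded toward the BS in the current step.
    info_delay: steps since the UAV last synchronized with the BS.
    """
    return w1 * throughput - w2 * info_delay

# A larger w2 penalizes staleness more, pushing the learned policy
# toward trajectories with more frequent U2B connections.
stale = delay_penalized_reward(throughput=10.0, info_delay=8, w2=1.0)
fresh = delay_penalized_reward(throughput=8.0, info_delay=1, w2=1.0)
print(stale, fresh)
```

Under this sketch, a slightly lower-throughput trajectory with fresh information can outscore a higher-throughput but stale one, which is consistent with the optimal intermediate U2B connection frequency observed in Fig. 8(b).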
While throughput decreases for most schemes as the deployment becomes sparser due to deteriorated channel conditions, STA-MADRL maintains relatively stable performance. Fig. 11(b) demonstrates that STA-MADRL effectively reduces the information delay compared to the delay-tolerant MADRL, which validates the effectiveness of the spatio-temporal prediction in compensating for communication delays.

VII Conclusions

In this paper, we have focused on trajectory planning, network formation, and transmission control in a multi-UAV-assisted wireless network with limited communications. We have addressed the performance degradation of the conventional communication-limited MADRL framework through two dedicated designs. The delay-penalized reward first encourages each UAV to plan a proper trajectory that supports frequent information exchange with the BS. The spatio-temporal attention module then exploits the UAVs’ historical information for an enhanced awareness of the complete network state and more efficient collaborative control in the MADRL framework. Numerical results showed that the proposed STA-MADRL achieves more favorable delay and throughput performance, and verified the trade-off between learning and throughput performance: both can be improved significantly and simultaneously with a proper frequency of information exchange.