
Paper deep dive

A prior information informed learning architecture for flying trajectory prediction

Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, Yi Gong

Year: 2026 · Venue: arXiv preprint · Area: cs.CV · Type: Preprint · Embeddings: 51

Abstract

Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks.

Tags

ai-safety (imported, 100%) · cscv (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:28:37 AM

Summary

The paper introduces a Prior Information-Informed Dual-Transformer-Cascaded (PIDTC) architecture for predicting the landing points of flying objects, specifically tennis balls. By integrating environmental priors (such as court boundaries extracted via Hough line detection) with trajectory data captured by a single industrial camera, the model uses a two-level Transformer approach: the first classifies the trajectory (in/out), and the second synthesizes features to predict precise landing coordinates, outperforming traditional data-driven methods.

Entities (5)

PIDTC · model-architecture · 100%
YOLOv10 · detection-algorithm · 100%
Tennis ball · flying-object · 98%
Canny algorithm · edge-detection-algorithm · 95%
Hough line detection · computer-vision-technique · 95%

Relation Signals (3)

PIDTC predicts Landing Point

confidence 98% · a second-level Transformer synthesizes these features to precisely predict the landing point.

YOLOv10 extracts Flight Coordinates

confidence 95% · Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates.

PIDTC integrates Environmental Priors

confidence 95% · This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture.

Cypher Suggestions (2)

Find all models that integrate environmental priors · confidence 90% · unvalidated

MATCH (m:Model)-[:INTEGRATES]->(p:Feature {name: 'Environmental Priors'}) RETURN m.name

Map the pipeline of data processing · confidence 85% · unvalidated

MATCH (a:Algorithm)-[r]->(b:Data) RETURN a.name, type(r), b.name

Full Text

50,247 characters extracted from source content.


A prior information informed learning architecture for flying trajectory prediction

Xianda Huang, Zidong Han, Ruibo Jin, Zhenyu Wang, Wenyu Li, Xiaoyang Li, and Yi Gong

Xianda Huang, Zidong Han, Zhenyu Wang, Xiaoyang Li, and Yi Gong are with the Southern University of Science and Technology, Shenzhen, China. Ruibo Jin and Wenyu Li are with the Chinese University of Hong Kong-Shenzhen, Shenzhen, China. The first two authors contributed equally to this work. Corresponding author: Yi Gong (gongy@sustech.edu.cn).

Abstract

Trajectory prediction for flying objects is critical in domains ranging from sports analytics to aerospace. However, traditional methods struggle with complex physical modeling, computational inefficiencies, and high hardware demands, often neglecting critical trajectory events like landing points. This paper introduces a novel, hardware-efficient trajectory prediction framework that integrates environmental priors with a Dual-Transformer-Cascaded (DTC) architecture. We demonstrate this approach by predicting the landing points of tennis balls in real-world outdoor courts. Using a single industrial camera and YOLO-based detection, we extract high-speed flight coordinates. These coordinates, fused with structural environmental priors (e.g., court boundaries), form a comprehensive dataset fed into our proposed DTC model. A first-level Transformer classifies the trajectory, while a second-level Transformer synthesizes these features to precisely predict the landing point. Extensive ablation and comparative experiments demonstrate that integrating environmental priors within the DTC architecture significantly outperforms existing trajectory prediction frameworks.
I Introduction

The proliferation of intelligent systems across aerospace and sports analytics—specifically those driven by advanced computer vision—has created a critical need for accurate trajectory prediction of aerial targets, drawing substantial interest from both academia and industry [1, 2, 3, 4, 5, 6, 7]. Notable applications include air traffic management in the aviation sector and motion tracking used for officiating competitive sports events. Formally, trajectory prediction aims to estimate a target's dynamic state by modeling its spatiotemporal dynamics to forecast future behavior accurately.

However, predicting the trajectories of flying objects remains a formidable challenge. Their states are governed by high-order, nonlinear dynamics that are highly sensitive to complex environmental variations, making accurate physical modeling exceedingly difficult. Furthermore, achieving high-precision predictions relies on extensive, high-fidelity trajectory datasets, which are both time-consuming and costly to acquire.

Current trajectory prediction methods generally fall into two paradigms: model-based and data-driven. Model-based approaches leverage kinematic models and boundary conditions to project future states [8, 9, 10]. While they provide structured frameworks that are highly effective for short-term prediction, their computational complexity escalates dramatically with system dimensionality, severely limiting their scalability in complex scenarios. Conversely, data-driven approaches—particularly deep learning [11, 12, 13]—excel at extracting nonlinear flying patterns directly from historical datasets. However, existing methods often neglect critical environmental priors and physical constraints, such as obstacle-impacted landing points. Furthermore, they demand massive volumes of high-quality, multi-camera training data, resulting in prohibitive collection and preprocessing costs.
In this paper, we propose a novel trajectory prediction method that integrates flight data with environmental priors to accurately forecast the landing points of tennis balls in real-world outdoor courts. Our pipeline begins with a custom data acquisition system featuring a single high-speed 2D camera (150–250 fps) and a professional ball launch machine. We employ YOLOv10 for precise ball detection, alongside edge and Hough line detection to extract critical court boundaries (e.g., corners and sidelines) as environmental priors. To process this integrated dataset, we introduce a Prior Information-Informed Dual-Transformer-Cascaded (PIDTC) architecture. Within this model, a first-level Transformer classifies trajectories using the environmental priors, while a second-level Transformer synthesizes these features to pinpoint the final landing coordinates. Extensive experiments validate the superior accuracy and effectiveness of our proposed approach.

The main contributions of our work are summarized as follows.

• We propose a novel Transformer-based model for flying object trajectory prediction. This architecture specifically targets the accurate forecasting of critical trajectory moments (e.g., landing points), addressing a major gap in existing data-driven approaches.

• We construct a comprehensive trajectory dataset using a cost-effective, 2D monocular industrial camera setup to capture high-speed grayscale images. This methodology significantly reduces the hardware complexity and financial costs associated with conventional multi-camera acquisition systems.

• We integrate environmental priors (e.g., court corners) with standard trajectory data to enrich the physical characterization of 2D flight paths. Extensive experiments validate that leveraging these enhanced features within our PIDTC architecture substantially outperforms existing baseline methods.

The rest of the paper is organized as follows. Section II reviews the related work.
Section III introduces the construction of the trajectory dataset. Section IV presents the proposed trajectory prediction model, outlining its theoretical foundations. Section V describes the experimental setup, followed by a thorough analysis of the experimental results. Finally, Section VI concludes this paper.

II Related Work

II-A Kinematic Model-based Trajectory Prediction

Model-based trajectory prediction methods utilize kinematic models to project future states based on governing motion laws. Existing literature, predominantly focused on table tennis, typically involves two critical steps: establishing the kinematic model and determining boundary conditions.

The establishment of the table tennis kinematic model primarily relies on high-order polynomial fitting [14], [15]. The pioneering kinematic model for table tennis that incorporated spin effects was proposed in [14] via a quintic polynomial, successfully enabling the prediction of table tennis trajectories. To enhance the accuracy of trajectory predictions, [16] determined key parameters, such as the resistance coefficient and the Magnus force coefficient, through visual measurements, subsequently integrating these coefficients into the kinematic model. Given the inherent complexity of the ball's motion, a single model often proves insufficient. Consequently, some researchers employ multiple models to capture the motion from various perspectives. In [17], two distinct table tennis kinematic models were proposed: a discrete model and a continuous model. The discrete model is employed for state estimation of the flying ball, while the continuous model predicts trajectories based on the ball's current state. A collision model was utilized in [18] to estimate the ball's motion parameters after impact, which were subsequently integrated with a kinematic model to predict the post-collision trajectory.
Beyond the kinematic model, the determination of boundary conditions plays a crucial role in enhancing predictive performance. The boundary conditions primarily involve the initial states of the flying ball, such as position and velocity. A Fourier series method was employed in [19] and [20] to fit the velocity variation of the flying ball and thereby extract the initial states of the table tennis ball. Kalman filtering-based methods were introduced to mitigate environmental noise: [21] employed an extended Kalman filter to measure the spin state of the table tennis ball and performed force analysis to establish its kinematic model, while [22] proposed trajectory prediction methods utilizing the unscented Kalman filter, which addresses the low estimation efficiency of the extended Kalman filter.

Although model-based trajectory prediction methods have demonstrated the ability to deliver high-precision outcomes in short-term forecasting tasks, they face obvious challenges: 1) kinematic models struggle to accurately capture the influence of random factors in long-term forecasting tasks; 2) to accurately represent special trajectories, particularly collision points, the collision model must be re-established [23], which inherently raises the modeling cost for these scenarios.

II-B Data-driven Trajectory Prediction

TABLE I: Summary of data-driven trajectory prediction
Trajectory type | Prediction model
Flying ball trajectory | RNN [13]; LSTM [11, 12], [32]
Vehicle trajectory | Markov [24]; LSTM [28]; Transformer [35]
Ship trajectory | GRU [29], [38]; LSTM [30]

Figure 1: (a) Schematic overview of the trajectory data acquisition system. (b) Schematic diagram of the experimental scene.

Data-driven trajectory prediction methods extract underlying kinematic patterns from historical datasets.
While early statistical approaches like Hidden Markov Models [24] and Bayesian inference [25] suffered from limited accuracy, the advent of deep learning has significantly advanced the field. Initially, neural networks were leveraged to enhance model-based trajectory prediction, including the estimation of a ball's spin state during flight [26], [27]. Later, researchers began to apply recurrent neural networks (RNNs) for direct trajectory prediction [28, 29, 30]. RNNs are a category of neural architectures designed for processing sequential data, including foundational RNNs, gated recurrent units (GRUs), and long short-term memory networks (LSTMs) [31]. For example, a structure of numerous LSTM units was introduced to fit the motion patterns, enabling long-term trajectory predictions for a flying ball [11]. To further refine target detection within recognition systems, a cross-stage partial network was introduced, which enhanced the predictive performance of LSTMs [32]. To predict the landing point of table tennis, an LSTM-based trajectory prediction model incorporating a mixture density network was proposed [12]. A method utilizing two RNNs was suggested to forecast the two trajectories that diverge at the collision point, thereby reducing the impact of the ball's collision [13].

However, the iterative nature inherent to RNNs results in cumulative errors during long-term predictions. To address this limitation, other models were proposed for trajectory prediction. A convolutional neural network (CNN)-based method was proposed to incorporate human pose information, aiding the subsequent LSTM network in predicting the table tennis landing point [33]. A deep conditional generative model was proposed for trajectory prediction [34], which employs an encoder-decoder architecture to effectively mitigate the error accumulation problem commonly encountered with RNNs.
Furthermore, the Transformer model, characterized by its analogous encoder-decoder architecture, has been successfully utilized in predicting other types of trajectories [35, 36, 37]. Incorporating reasonable prior information alongside suitable neural networks has emerged as a promising strategy for enhancing predictive performance [38]. The integration of prior knowledge has been demonstrated to improve both the accuracy and training efficiency of neural networks across various domains, including pose estimation [39], image restoration [40], and 3D reconstruction [41]. In addition, for vehicle trajectory prediction, the lane-changing intention was used as prior information [42]. For vessel trajectory prediction, automatic identification system data was encoded as the input [43], achieving a lower prediction error compared to the standard LSTM. The predicted course was integrated as prior information in [44] for the short-term prediction of vessel trajectories.

Although data-driven methods demonstrate significant potential for predicting flying trajectories, obvious challenges remain with existing approaches: 1) they require large volumes of high-quality data for training, resulting in considerable collection cost; 2) most rely only on the input trajectories, ignoring other useful information, such as prior environmental/contextual information; 3) most neglect the critical points of the trajectory impacted by physical obstacles, such as landing points.

III Flying Trajectory Dataset Construction

Figure 2: Flowchart of dataset construction.

III-A Data Acquisition System

A schematic overview of the data acquisition system is illustrated in Fig. 1. A Jbotsports JW-05 ball launch machine is positioned at the center of the baseline, opposite the camera.
To ensure comprehensive trajectory coverage, a Basler acA1920-155um industrial camera (equipped with a 5 mm wide-angle lens) is mounted on a 5-meter tripod at the court's corner. The camera captures images at 164 fps with a resolution of 1280×650 pixels. To minimize environmental interference, data collection was conducted exclusively under clear, calm weather conditions.

The acquisition protocol proceeds as follows:
1. System Calibration: The camera's field of view and exposure settings are optimized via a host computer to ensure clear identification of both the court lines and the flying ball.
2. Data Acquisition: The camera and ball machine are activated, continuously recording the flight path until the ball impacts the target sand layer.
3. Scene Reset: After each valid recording, the sand layer is smoothed to erase the landing mark, preventing data contamination in subsequent trials.

Due to the inherent mechanical variance of the launch machine, trajectories landing squarely within the designated sand area accounted for less than 20% of the trials. Consequently, we curated a final dataset of 350 highly qualified trajectories from an initial pool of more than 2,000 recordings.

III-B Dataset Construction Method

As illustrated in Fig. 2, the raw video data undergoes a systematic preprocessing pipeline—encompassing data cleaning, YOLOv10-based ball detection [45], and coordinate extraction—to finalize a dataset of 350 valid trajectories. The procedure is structured as follows:
1. YOLOv10 Training: To ensure precise detection, the YOLO model was trained on 5,000 annotated images (split 4:1 for training and validation) over 300 epochs with a batch size of 16. Under our experimental lighting conditions, the model achieved a recognition accuracy exceeding 98%.
2. Trajectory Data Cleaning: Capturing the precise moment of landing is challenging; therefore, the ball's initial bounce serves as the ground-truth landing indicator.
For each sample, we extract the 25 flight frames immediately preceding this bounce, resulting in 25 trajectory points and 1 landing point per sequence.
3. Coordinate Extraction & Verification: The trained YOLOv10 model detects the ball across all frames, saving the 2D spatial coordinates to text files. Finally, all outputs undergo manual verification to eliminate any anomalous detections.

IV Trajectory Prediction Model

Figure 3: Overview of the trajectory prediction model - PIDTC.

This section details the proposed prediction model, illustrated in Fig. 3. The architecture comprises two core components: a prior information extraction module and a dual-Transformer-cascaded prediction network. Within the DTC framework, the first-level Transformer acts as a trajectory classifier—determining whether the landing point will fall “in” or “out” of the court boundaries—while the second-level Transformer leverages this classification to precisely forecast the final landing coordinates.

IV-A Prior Information Extraction Module

As illustrated in Fig. 3, the process of extracting prior information can be delineated into several stages: Gaussian filtering, edge detection, Hough line detection [46], and subsequent intersection calculation. Initially, the module applies Gaussian filtering to the trajectory image, which is represented in grayscale. This step is pivotal for enhancing the image's smoothness and minimizing interference during the gradient calculation essential for edge detection. Let I_o denote the original image and I_p the denoised image, with pixel coordinates within I_o denoted by (x, y). The normalized Gaussian kernel Kernel_n required for Gaussian filtering assigns a value Kernel_n(x, y) to each pixel; computing this value for every pixel yields a complete Gaussian kernel for the image. The denoised image is obtained by convolving the kernel with the original image.
The calculation formulas are:

Kernel(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}},   (1)

Kernel_n(x, y) = \frac{Kernel(x, y)}{\sum_{x=0}^{1279} \sum_{y=0}^{649} Kernel(x, y)},   (2)

I_p(i, j) = \sum_{u=0}^{1279} \sum_{v=0}^{649} I_o(u, v) \, Kernel_n(i-u, j-v),   (3)

where σ is the standard deviation of the Gaussian distribution.

Next, the denoised image is subjected to edge detection using the Canny algorithm [47], which extracts the pixel coordinates of the edges present in the image. To initiate this process, we compute the pixel gradient matrices of I_p by employing the Sobel operators, which consist of two matrices: S_x and S_y. We use S_x to calculate the gradient matrix g_x in the x direction and S_y to calculate the gradient matrix g_y in the y direction:

g_x = S_x * I_p;  g_y = S_y * I_p,   (4)

G(x, y) = \sqrt{g_x^2(x, y) + g_y^2(x, y)},   (5)

θ(x, y) = \arctan\left(\frac{g_y(x, y)}{g_x(x, y)}\right),   (6)

where G denotes the amplitude of the image gradient, θ denotes the direction of the image gradient, and the symbol * indicates the cross-correlation operation.

Once these matrices are obtained, the Canny algorithm performs non-maximum suppression. This process retains only the local maxima of G along the direction θ, narrowing the edges to a precise 1-pixel width. Then, we set two gradient magnitude thresholds, one high and one low. An edge is categorized as a strong edge if its gradient magnitude exceeds the high threshold, whereas it is identified as a weak edge if its gradient magnitude falls between the two thresholds. Thereafter, we retain only the strong edges and the weak edges connected to strong edges to establish definitive edges. Next, we determine the line equation of each edge using Hough line detection.
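As a concrete sketch of Eqs. (1) and (2), the following pure-Python snippet builds a small normalized Gaussian kernel; the kernel size and σ here are illustrative choices, not values from the paper:

```python
import math

def gaussian_kernel(size, sigma):
    """Build a (size x size) Gaussian kernel per Eq. (1), then
    normalize it to sum to 1 per Eq. (2)."""
    half = size // 2
    raw = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)
            for x in range(-half, half + 1)]
           for y in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

k = gaussian_kernel(5, 1.0)  # weights peak at the center and sum to 1
```

Convolving this kernel with the image as in Eq. (3) then yields the denoised I_p.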
In Hough line detection, a pixel A(m_0, n_0) in the image space corresponds to a line P_A in the parameter space. The line equation of P_A is:

P_A: b = -m_0 a + n_0,   (7)

where a represents the abscissa of the parameter space (corresponding to the slope of a line passing through A in image space) and b represents the ordinate (corresponding to the intercept of that line in image space). Consequently, the Hough transform converts edge pixels into lines within the parameter space. The intersections of these lines yield both the slope and intercept of the edges represented in image space, from which we formulate the line equations for the edges identified in the preceding step.

Finally, we determine the corner points of the sideline based on the detected edges. Edge detection yields sets of parallel edges, so it is essential to merge parallel and proximate edges into a single edge. We then select two corner points as the prior information.

IV-B Trajectory Classification Module

Figure 4: (a) Structure of the trajectory classification module. (b) Structure of the landing point prediction module.

The trajectory classification module determines whether a flight path will land “in” or “out” of bounds by integrating sequential trajectory data with environmental priors. This fusion is achieved via a cross-attention mechanism. The module outputs discrete classification labels—an advantageous format that explicitly guides the subsequent prediction network, enabling it to learn and incorporate prior spatial contexts more effectively. As illustrated in Fig. 4(a), the Transformer-based classification module consists of an encoder that extracts dynamic trajectory features and a decoder that processes static prior information to predict the final classification.
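The parameter-space voting behind Eq. (7) can be illustrated with a toy accumulator: every edge pixel (m_0, n_0) votes for each line it could lie on, and the accumulator peak recovers the slope and intercept shared by all collinear pixels. The point set and slope grid below are hand-picked for illustration, not the paper's implementation:

```python
from collections import Counter

# pixels lying on the image-space line n = 2m + 3
points = [(m, 2 * m + 3) for m in range(5)]
# discretized slope axis (a) of the parameter space
slopes = [a / 2 for a in range(-8, 9)]

acc = Counter()
for m0, n0 in points:
    for a in slopes:
        b = -m0 * a + n0  # Eq. (7): the line P_A traced in (a, b) space
        acc[(a, b)] += 1

(a_best, b_best), votes = acc.most_common(1)[0]
# every point votes for (a, b) = (2.0, 3.0), so that cell accumulates 5 votes
```

Production code would typically use the polar (ρ, θ) parameterization instead, which avoids unbounded slopes for vertical court lines.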
The module takes 25 trajectory points (T_ball) and 2 prior information points (B_prior) as input, generating a 1D vector (Label) as output. The initial input is formulated as:

Input = Concat(T_ball, B_prior),   (8)

where Concat() represents the matrix concatenation operation. This direct concatenation facilitates subsequent sequence segmentation within the Feature Encoding Network (FEN), preventing premature fusion and preserving the temporal dynamics of the trajectory. Following dimensional transformation, the input is split into distinct trajectory and prior information sequences. After token embedding and positional encoding, these sequences are processed independently via Multi-Head Attention (MHA) before being fused through a cross-attention mechanism. Finally, the network computes the loss and performs backpropagation.

Since the trajectory classification module is based on a binary classification framework, the Binary Cross-Entropy (BCE) loss function is selected for training:

BCE = -\frac{1}{N} \sum_{i=1}^{N} \left( q_i \log(p_i) + (1 - q_i) \log(1 - p_i) \right),   (9)

where q is the out-of-bounds label (0 or 1), p is the predicted value from the trajectory classification module, and N is the total number of trajectories. The output Label is concatenated with the trajectory data to preserve the temporal dynamics of the trajectory, similar to the input structure:

Output = Concat(T_ball, Label).   (10)

IV-C Landing Point Prediction Module

As illustrated in Fig. 4(b), the landing point prediction module receives an input of 25 trajectory points alongside the classification Label, outputting the final predicted landing coordinates in 2D space.
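The BCE objective of Eq. (9) above reduces to a few lines of plain Python; the eps guard is an implementation detail added here to avoid log(0), not something stated in the paper:

```python
import math

def bce(labels, preds, eps=1e-12):
    """Binary cross-entropy per Eq. (9): q_i are 0/1 labels, p_i in (0, 1)."""
    n = len(labels)
    return -sum(q * math.log(p + eps) + (1 - q) * math.log(1 - p + eps)
                for q, p in zip(labels, preds)) / n

# confident correct predictions drive the loss toward zero
loss = bce([1, 0, 1], [0.99, 0.01, 0.99])
```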
The input is initially flattened from 2D to 1D using a FEN comprising two linear layers activated by a Rectified Linear Unit (ReLU) function [48]. The flattened data is then explicitly separated back into the trajectory sequence and the classification label. This division ensures that the subsequent positional encoding accurately annotates the trajectory order, thereby preserving its temporal dynamics. Following separation, token embedding projects both data streams into vectors of dimension d_model. A positional encoding function infuses the model with the necessary sequential context [49]:

PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}}),   (11)

PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}}),   (12)

where pos is the position of the element within the trajectory sequence and i is the dimension of the positional code. The alternating sine and cosine functions mark the relative positions of different elements of the trajectory. We divide the d_model dimensions into d_model/2 groups, with the two data sequences in each group represented by PE(pos, 2i) and PE(pos, 2i+1), respectively.

Following positional encoding, the trajectory data D enters the encoder's multi-head attention mechanism [49]. Here, the self-attention layer maps the input sequence into three distinct feature spaces—Queries, Keys, and Values—to compute the attention weights. These spaces are respectively defined by the query matrix Q, key matrix K, and value matrix V:

Q = D W_Q;  K = D W_K;  V = D W_V,   (13)

Attention(Q, K, V) = softmax\left(\frac{Q K^T}{\sqrt{d_k}}\right) V,   (14)

where W denotes the corresponding weight matrix and d_k denotes the length of the key vector.
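The sinusoidal encoding of Eqs. (11) and (12) can be sketched directly; this toy version interleaves the sine and cosine channels as described, with the sequence length and dimension chosen only for illustration:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding per Eqs. (11)-(12):
    dimension 2i uses sin, dimension 2i+1 uses cos."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        pe.append(row)
    return pe

pe = positional_encoding(25, 64)  # e.g. 25 trajectory points, toy d_model = 64
```

The encoding is added to the token embeddings so that attention, which is otherwise order-agnostic, can distinguish earlier from later trajectory points.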
The self-attention results are combined into the multi-head attention value, a concatenation of multiple self-attention heads:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   (15)

MH(D) = Concat(head_1, head_2, ..., head_h) W^O,   (16)

where MH(D) is the multi-head attention value of the trajectory and W^O denotes the output weight matrix. The multi-head attention value is then passed through the feedforward module.

In the decoder, we also perform attention calculations on the classification label data L, yielding the multi-head attention value MH(L). Subsequently, we implement a cross-attention mechanism utilizing MH(L) alongside the encoder outputs E. Within the cross-attention framework, MH(L) is used to calculate Q_c, while E is used to calculate K_c and V_c; the multi-head attention value is then computed from Q_c, K_c, and V_c. Q_c provides the trajectory state (“in” or “out”) to influence the prediction area of the landing point, while K_c and V_c supply the temporal features of the trajectory that determine the precise position of the predicted point. The resulting cross-attention output is processed through a feedforward module to generate the decoder output. Finally, the 2D predicted landing point is obtained from Feature Decoding Network-2 (FDN-2). It is worth noting that both FDN-1 and FDN-2 consist of two linear layers; however, FDN-1 requires normalization while FDN-2 does not, and the activation functions for FDN-1 and FDN-2 are Sigmoid and ReLU, respectively.
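A minimal pure-Python sketch of the scaled dot-product attention of Eq. (14); multi-head attention (Eqs. (15)-(16)) simply runs this once per head and concatenates the results. The tiny 2-D matrices below are illustrative only, not the model's actual tensors:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [v / s for v in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Eq. (14): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]
    scores = [[v / math.sqrt(d_k) for v in row] for row in matmul(Q, Kt)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

# with identical keys the attention weights are uniform, so the
# output is the mean of the value rows
out = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

In the cross-attention step described above, Q would come from MH(L) while K and V come from the encoder output E.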
We choose the Mean Squared Error (MSE) as the loss function:

MSE = \frac{1}{N} \sum_{i=1}^{N} (truth_i - prediction_i)^2,   (17)

where truth_i is the landing point coordinate, prediction_i is the predicted value of the landing point prediction module, and N is the total number of trajectories.

V Experimental Results

V-A Implementation Details

TABLE II: Parameters for training
Parameter | Classification | Prediction
Epoch | 500 | 1000
Batch size | 10 | 10
Learning rate | 0.0001 | 0.0001
Embedding number | 128 | 500
d_model | 64 | 512
Dropout rate | 0.1 | 0.1
Encoder/Decoder layers | 1 | 1
Attention heads | 2 | 2
Feedforward neuron number | 256 | 2048

All experiments were implemented in Python using the PyTorch framework on a workstation equipped with an Nvidia GeForce RTX 3080 GPU. The dataset was partitioned into training and testing sets at a 4:1 ratio. We trained the proposed model using a batch size of 10 and the Adam optimizer, retaining the checkpoint that achieved the lowest validation loss for final testing. Specific training parameters for the different modules are detailed in Table II. In total, the finalized model contains 5.53M parameters, comprising 0.15M in the classification module and 5.38M in the prediction module.

V-B Evaluation Metrics

For the classification module, we select BCE, Accuracy, Precision, and Recall for performance evaluation. For the prediction module, we select MSE, RMSE, and Bias. The BCE loss function evaluates the quality of the predictions of a binary classification model; its mathematical formulation is given in Equation (9).
Accuracy represents the proportion of samples that are correctly predicted:

Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \times 100\%,   (18)

where TP is the number of correctly predicted positive samples, TN is the number of correctly predicted negative samples, FP is the number of samples incorrectly predicted as positive, and FN is the number of samples incorrectly predicted as negative.

Precision denotes the proportion of accurately identified positive samples relative to the total number of samples classified as positive:

Precision = \frac{TP}{TP + FP} \times 100\%.   (19)

Recall denotes the proportion of accurately identified positive samples relative to the total number of genuine positive samples:

Recall = \frac{TP}{TP + FN} \times 100\%.   (20)

MSE is the mean squared error, which quantifies the discrepancy between the predicted landing point and the true landing point. RMSE is the root mean squared error between the predicted and true landing points:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (truth_i - prediction_i)^2}.   (21)

Bias represents the average bias between the predicted landing point and the actual landing point. This metric is critical as it quantifies the systematic error in predictions, enabling a more comprehensive understanding of model accuracy and reliability in trajectory assessments:

Bias = \frac{1}{N} \sum_{i=1}^{N} (truth_i - prediction_i).   (22)

To provide a mapping from pixel error to physical error, we need to estimate the physical coordinates of the landing point (x_phy, y_phy, z_phy). Since all predicted landing points lie on the court surface (i.e., z_phy = 0), the mapping simplifies to a 2D-2D conversion.
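The metrics of Eqs. (17)-(22) above reduce to a few lines of Python; the confusion counts and coordinates below are made-up examples for illustration, not results from the paper:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall per Eqs. (18)-(20), in percent."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    return accuracy, precision, recall

def regression_metrics(truth, pred):
    """MSE (Eq. (17)), RMSE (Eq. (21)) and Bias (Eq. (22)) over scalar coordinates."""
    n = len(truth)
    errors = [t - p for t, p in zip(truth, pred)]
    mse = sum(e * e for e in errors) / n
    return mse, math.sqrt(mse), sum(errors) / n

# hypothetical confusion matrix and landing-point coordinates (pixels)
acc, prec, rec = classification_metrics(tp=8, fp=2, tn=7, fn=3)
mse, rmse, bias = regression_metrics([10.0, 20.0], [12.0, 18.0])
# the +/-2 px errors give MSE = 4 and RMSE = 2, while the signed Bias cancels to 0
```

Note that Bias is signed, so symmetric over- and under-shooting cancel; RMSE keeps the magnitude.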
Therefore, the pixel coordinates $(x_{img}, y_{img})$ and the physical coordinates $(x_{phy}, y_{phy})$ can be converted through the homography matrix $H$:

$\begin{bmatrix} x_{img} \\ y_{img} \\ 1 \end{bmatrix} = H \begin{bmatrix} x_{phy} \\ y_{phy} \\ 1 \end{bmatrix}$. (23)

$H$ can be estimated from at least 4 points with known physical coordinates (e.g., the intersection points of the court sidelines); we use 10 points to obtain a more accurate estimate. The physical bias between the predicted coordinates $(x_i, y_i)$ and the actual ones $(\hat{x}_i, \hat{y}_i)$ is then

$\mathrm{PhyBias}=\frac{1}{N}\sum_{i=1}^{N}\sqrt{(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2}$. (24)

V-C Dataset Preprocessing

To rigorously assess the efficacy of the proposed prediction method, we developed a comprehensive trajectory dataset, detailed in Section I. Using this dataset, we trained the prediction model described in Section IV. To train the trajectory classification module, dataset preprocessing involves two steps: (1) for each trajectory, we use the prior information extraction module to obtain two corner points as the prior information; (2) we compute the sideline from the prior points and determine the true classification label from the positional relationship between the landing point and the sideline. The label is set to "0" when the classification outcome is "out" and to "1" when it is "in". For the landing point prediction module, we incorporated these classification labels into the original trajectory dataset. This preprocessing approach enhances training efficiency by providing contextual information that aids the learning process.

V-D Test Set Experiment

Figure 5: The loss during the training and testing processes. (a) The BCE loss of the classification module. (b) The MSE loss of the prediction module.
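Returning to the court-plane mapping of Eqs. (23) and (24): the homography can be fitted from point correspondences with the standard direct linear transform (DLT). The sketch below is our own minimal NumPy illustration, not the authors' calibration code, and all names in it are ours:

```python
import numpy as np

def fit_homography(phy_pts, img_pts):
    """DLT estimate of H mapping (x_phy, y_phy, 1) -> (x_img, y_img, 1), Eq. (23).

    Needs at least 4 correspondences; the paper uses 10 for a more accurate H.
    """
    rows = []
    for (x, y), (u, v) in zip(phy_pts, img_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The solution is the null vector of A, i.e. the last right-singular vector.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Project 2D points through H, including homogeneous normalization."""
    pts = np.hstack([np.asarray(pts, float), np.ones((len(pts), 1))])
    proj = pts @ H.T
    return proj[:, :2] / proj[:, 2:3]

def phy_bias(truth_xy, pred_xy):
    """Mean Euclidean distance on the court plane, Eq. (24)."""
    d = np.asarray(truth_xy, float) - np.asarray(pred_xy, float)
    return float(np.mean(np.linalg.norm(d, axis=1)))

# Demo with a synthetic camera homography and the doubles-court corner layout
# (23.77 m x 10.97 m); H_demo is made up for illustration.
pts_phy = [[0.0, 0.0], [23.77, 0.0], [0.0, 10.97], [23.77, 10.97]]
H_demo = np.array([[30.0, 2.0, 100.0], [1.0, -28.0, 600.0], [0.0, 0.001, 1.0]])
img = apply_homography(H_demo, pts_phy)
H_fit = fit_homography(pts_phy, img)
print(np.allclose(apply_homography(H_fit, pts_phy), img))  # True
```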
To rigorously assess the convergence of the training and testing loss, the BCE and MSE curves are illustrated in Fig. 5. Figure 5(a) shows the variation of the BCE loss as the number of epochs increases; the classification module achieves effective convergence on the test set. Figure 5(b) shows the variation of the MSE loss; the prediction module likewise converges effectively on the test set. Comparative analysis shows that the prediction module converges more slowly than the classification module, primarily due to its significantly larger number of parameters.

V-E Ablation Experiment

Figure 6: The loss of the ablation models versus training epochs. (a) The BCE loss. (b) The MSE loss.

TABLE I: Classification performance comparison of ablation models

Metric | CMN | CMP
Accuracy | 52.86% | 85.71%
Precision | 52.85% | 81.40%
Recall | 100% | 94.59%

TABLE IV: Prediction performance comparison of ablation models

Metric | PMN | PMP | PMC
MSE | 1183.39 | 690.16 | 372.39
RMSE | 34.40 | 26.27 | 19.30
Bias (pixel) | 23.06 | 18.02 | 13.35
PhyBias (cm) | 29.58 | 23.16 | 17.07

To rigorously assess the effectiveness of the proposed prediction method, we conduct ablation experiments examining the influence of different types of prior information on classification and prediction performance. The training losses of the ablation models are illustrated in Fig. 6, and their performance is compared in Tables I and IV. For the classification ablation models, we denote the variants fed null values and prior points as "classification model with null value" (CMN) and "classification model with prior information" (CMP), respectively. Figure 6(a) illustrates the variation of the training BCE loss as training progresses.
The results demonstrate that CMP achieves effective convergence during training, in stark contrast to CMN, which fails to converge. Additionally, Table I provides a comprehensive overview of the classification performance of the ablation models on the test set; only CMP exhibits effective classification capability. These findings highlight the importance of prior information points as essential features for trajectory classification. The prediction ablation models incorporate different forms of prior information, namely null values, prior points, and classification labels, referred to as "prediction model with null value" (PMN), "prediction model with prior information points" (PMP), and "prediction model with classification labels" (PMC), respectively. Figure 6(b) illustrates the trend of the training MSE loss as training progresses. Notably, PMC converges more rapidly than the other ablation models. As detailed in Table IV, PMC achieves the lowest losses across the evaluation criteria. Specifically, compared to PMN, PMC reduces MSE, RMSE, and Bias by 68.53%, 43.90%, and 42.11%, respectively. These results suggest that integrating prior information significantly improves the model's predictive capability. Furthermore, compared with PMP, PMC consistently attains lower values across all three criteria, underscoring that a single classification label provides a more effective input than relying solely on prior information points.

V-F Comparative Experiments Across Different Learning Models

Figure 7: The training loss across different models versus training epochs: (a) MSE, (b) RMSE, (c) Bias.
TABLE V: Prediction performance comparison across different models

Model | MSE | RMSE | Bias (pixel) | PhyBias (cm)
RNN [13] | 1064.99 | 32.63 | 26.71 | 34.16
GRU [38] | 3417.77 | 58.46 | 49.24 | 63.98
LSTM [11] | 866.72 | 29.44 | 23.96 | 30.55
Transformer [35] | 1170.42 | 34.21 | 22.48 | 27.74
PIDTC | 372.39 | 19.30 | 13.35 | 17.07

To illustrate the predictive capability of the proposed model, a comprehensive set of comparative experiments was carried out against established models from previous research: RNN [13], GRU [38], LSTM [11], and the original Transformer [35]. The results for the three loss metrics are illustrated in Fig. 7. Compared to the other predictive models, our PIDTC model exhibits both a faster convergence rate and a lower training loss at convergence. Table V further reports the prediction performance across models. Our proposed model achieves the lowest loss on all three evaluation criteria, highlighting its effectiveness. While the basic Transformer achieves a lower bias than the other standard models, its MSE loss remains higher than that of both the RNN and LSTM models. These results imply that the structural attributes of the Transformer may enhance regression capability, though this improvement may come at the cost of precision on certain specialized samples. The proposed PIDTC, with its dual-Transformer-cascade structure in which one Transformer performs trajectory classification using prior information and the other predicts the landing point, effectively addresses the limitations of the standard Transformer.
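As a quick arithmetic check on Table V (our own calculation; these percentages are not reported in the paper), the relative MSE reduction of PIDTC over each baseline follows from $(1 - \mathrm{MSE}_{PIDTC}/\mathrm{MSE}_{baseline}) \times 100\%$:

```python
# MSE values copied from Table V; the variable names are our own.
baselines = {"RNN": 1064.99, "GRU": 3417.77, "LSTM": 866.72, "Transformer": 1170.42}
pidtc_mse = 372.39

reductions = {name: (1.0 - pidtc_mse / mse) * 100.0 for name, mse in baselines.items()}
for name, r in reductions.items():
    print(f"PIDTC vs {name}: {r:.1f}% lower MSE")
# e.g. PIDTC vs LSTM: 57.0% lower MSE
```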
V-G The Impact of Different Training Set Sizes

TABLE VI: Prediction performance comparison for different training set sizes

Metric | 20% $N_t$ | 40% $N_t$ | 60% $N_t$ | 80% $N_t$
MSE | 499.41 | 547.52 | 542.15 | 372.39
RMSE | 22.35 | 23.40 | 23.28 | 19.30
Bias (pixel) | 15.65 | 17.10 | 16.22 | 13.35
PhyBias (cm) | 19.98 | 21.57 | 20.65 | 17.07

In addition, comparative experiments are conducted to evaluate the impact of the training set size, based on the same dataset of $N_t = 350$ samples. The proposed PIDTC model is trained on four training set sizes: 20%, 40%, 60%, and 80% of $N_t$. We then assess the performance on the same test set and present the results in Table VI. The results reveal that the loss of the proposed model generally decreases as the training set grows.

VI Conclusion

In this paper, we presented a novel learning architecture for flying trajectory prediction based on prior environmental information and a cascaded Transformer structure. To develop the flying trajectory dataset, we constructed an effective trajectory data acquisition platform comprising a single 2D industrial camera and a ball launch machine. The proposed model consists of three sub-modules: a prior information extraction module that identifies critical prior points, a trajectory classification module that classifies trajectories as "in" or "out" with respect to the court boundary, and a landing point prediction module that predicts the landing point coordinates. We calculated the sideline using Hough line detection and selected two corner points as the relevant prior information. The trajectory classification module was then employed to generate classification labels. In the final step, trajectories and their corresponding classification labels were fed into the landing point prediction module. Our ablation experiments validated the efficacy of the proposed approach.
Comparative experiments reveal that our approach outperforms existing trajectory prediction frameworks, achieving a substantial reduction in MSE/RMSE loss and Bias. Building on this work, future research will focus on integrating additional prior environmental information and adopting a physics-informed learning methodology.

References

[1] Y. Huang, J. Du, Z. Yang, Z. Zhou, L. Zhang, and H. Chen, "A Survey on Trajectory-Prediction Methods for Autonomous Driving," IEEE Trans. Intell. Veh., vol. 7, no. 3, pp. 652–674, Sep. 2022.
[2] Z. Wang, J. Zhang, and W. Wei, "Deep Learning Based Missile Trajectory Prediction," Proc. Int. Conf. Unmanned Syst. (ICUS), Harbin, China, Nov. 2020, pp. 474–478.
[3] T. Xue and Y. Liu, "Trajectory Prediction of a Flying Object based on Hybrid Mapping between Robot and Camera Space," IEEE Int. Conf. Real-Time Comput. Robot. (RCAR), Kandima, Maldives, Aug. 2018, pp. 567–572.
[4] X. Qin, Z. Li, K. Zhang, F. Mao, and X. Jin, "Vehicle Trajectory Prediction via Urban Network Modeling," J. Sens., vol. 23, no. 10, May 2023.
[5] Y. Yang, D. Kim, and D. Choi, "Ball Tracking and Trajectory Prediction System for Tennis Robots," J. Comput. Des. Eng., vol. 10, no. 3, pp. 1176–1184, Apr. 2023.
[6] Y. Wu, H. Yu, J. Du, B. Liu, and W. Yu, "An Aircraft Trajectory Prediction Method Based on Trajectory Clustering and a Spatiotemporal Feature Network," J. Electron., vol. 11, no. 21, Apr. 2023.
[7] W. Zhao, "Hawk-Eye Deblurring and Pose Recognition in Tennis Matches Based on Improved GAN and HRNet Algorithms," Int. J. Adv. Comput. Sc., vol. 16, no. 1, pp. 107–118, Jan. 2025.
[8] A. Nakashima, Y. Ogawa, C. Liu, and Y. Hayakawa, "Robotic table tennis based on physical models of aerodynamics and rebounds," IEEE Int. Conf. Rob. Biomimetics (ROBIO), Phuket, Thailand, Dec. 2011, pp. 2348–2354.
[9] H. Chiang, B. Tseng, J. Chen, and H. Hsieh, "Trajectory Analysis in UKF: Predicting Table Tennis Ball Flight Parameters," IT Prof., vol. 26, no. 3, pp. 65–72, May 2024.
[10] C. Liu, "Kalman Tracking Algorithm of Ping-Pong Robot based on Fuzzy Real-Time Image," J. Intell. Fuzzy Syst., vol. 38, no. 4, pp. 3585–3594, 2020.
[11] J. Wu, "Localization and Trajectory Prediction of Spinning Flying Ping-Pong Ball based on Learning," M.S. thesis, Control Eng. Dept., Zhejiang Univ., Hangzhou, China, 2018.
[12] H. Li, S. Ali, J. Zhang et al., "Video-based Table Tennis Tracking and Trajectory Prediction using Convolutional Neural Networks," Fractals, vol. 30, no. 5, Aug. 2022.
[13] H. Lin, Z. Yu, and Y. Huang, "Ball Tracking and Trajectory Prediction for Table-Tennis Robots," J. Sens., vol. 20, no. 2, Jan. 2020.
[14] R. L. Andersson, "Aggressive Trajectory Generator for a Robot Ping-Pong Player," IEEE Control Syst. Mag., vol. 9, no. 2, pp. 15–21, Feb. 1989.
[15] Y. Huang, D. Xu, M. Tan, and H. Su, "Trajectory Prediction of Spinning Ball for Ping-Pong Player Robot," IEEE Int. Conf. Intell. Rob. Syst. (IROS), San Francisco, CA, USA, Sep. 2011, pp. 3434–3439.
[16] J. Nonomura, A. Nakashima, and Y. Hayakawa, "Analysis of effects of rebounds and aerodynamics for trajectory of table tennis ball," Proc. SICE Annu. Conf. (SICE AC), 2010, pp. 1567–1572.
[17] Y. Zhang, "State Estimation and Trajectory Prediction of Fast Flying Object," Ph.D. thesis, Control Sci. Eng. Dept., Zhejiang Univ., Hangzhou, China, 2015.
[18] J. Zeng, "Research and Design of Table Tennis Robot System with Seven Degrees of Freedom," M.S. thesis, Control Sci. Eng. Dept., DongHua Univ., Shanghai, China, 2020.
[19] Y. Zhao, Y. Zhang, R. Xiong, and J. Wang, "Optimal State Estimation of Spinning Ping-Pong Ball Using Continuous Motion Model," IEEE Trans. Instrum. Meas., vol. 64, no. 8, pp. 2208–2216, Aug. 2015.
[20] Y. Zhao, R. Xiong, and Y. Zhang, "Model Based Motion State Estimation and Trajectory Prediction of Spinning Ball for Ping-Pong Robots using Expectation-Maximization Algorithm," J. Intell. Rob. Syst., vol. 87, no. 3, pp. 407–423, Sep. 2017.
[21] K. Mülling, J. Kober, and J. Peters, "A Biomimetic Approach to Robot Table Tennis," IEEE Int. Conf. Intell. Rob. Syst. (IROS), 2010, pp. 1921–1926.
[22] S. Luo, J. Niu, P. Zheng, and Z. Jingid, "Application of Minimum Error Entropy Unscented Kalman Filter in Table Tennis Trajectory Prediction," PLoS One, vol. 17, no. 9, Sep. 2022.
[23] X. Chen, Y. Tian, Q. Huang, W. Zhang, and Z. Yu, "Dynamic Model based Ball Trajectory Prediction for a Robot Ping-Pong Player," IEEE Int. Conf. Rob. Biomimetics (ROBIO), 2010, pp. 603–608.
[24] D. Vasquez, T. Fraichard, and C. Laugier, "Growing Hidden Markov Models: An Incremental Tool for Learning and Predicting Human and Vehicle Motion," Int. J. Rob. Res., vol. 28, no. 11, pp. 1486–1506, Nov. 2009.
[25] T. Zeh, J. Rosenow, and H. Fricke, "Bayesian Inference of Aircraft Operating Speeds for Stochastic Medium-Term Trajectory Prediction," AIAA/IEEE Dig. Avionics Syst. Conf. (DASC), Barcelona, Spain, Oct. 2023.
[26] Y. Ren, Z. Fang, D. Xu, and M. Tan, "Spinning Pattern Classification of Table Tennis Ball's Flying Trajectory based on Fuzzy Neural Network," J. Control Decis., vol. 29, no. 2, pp. 263–269, Feb. 2014.
[27] Q. Wang and Z. Sun, "Trajectory Identification of Spinning Ball using Improved Extreme Learning Machine in Table Tennis Robot System," IEEE Int. Conf. Cyber Technol. Autom., Control, Intell. Syst. (CYBER), Shenyang, China, Jun. 2015, pp. 551–554.
[28] K. Messaoud, N. Deo, M. Trivedi, and F. Nashashibi, "Trajectory prediction for autonomous driving based on multi-head attention with joint agent-map representation," Proc. IEEE Intell. Veh. Symp. (IV), Nagoya, Japan, Jul. 2021, pp. 165–170.
[29] Y. Suo, W. Chen, C. Claramunt, and S. Yang, "A Ship Trajectory Prediction Framework Based on a Recurrent Neural Network," J. Sens., vol. 20, no. 18, Sep. 2020.
[30] L. Qian, Y. Zheng, L. Li, Y. Ma, C. Zhou, and D. Zhang, "A New Method of Inland Water Ship Trajectory Prediction Based on Long Short-Term Memory Network Optimized by Genetic Algorithm," Appl. Sci., vol. 12, no. 8, Apr. 2022.
[31] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A Critical Review of Recurrent Neural Networks for Sequence Learning," 2015, arXiv:1506.00019.
[32] W. Li, "Table Tennis Target Detection and Rotating Ball Trajectory Prediction based on Deep Learning," M.S. thesis, Control Eng. Dept., Donghua Univ., Shanghai, China, 2021.
[33] E. Wu and H. Koike, "FuturePong: Real-time Table Tennis Trajectory Forecasting using Pose Prediction Network," Conf. Hum. Fact. Comput. Syst. Proc. (CHI), Honolulu, HI, USA, Apr. 2020.
[34] S. Gomez-Gonzalez, S. Prokudin, B. Schoelkopf, and J. Peters, "Real Time Trajectory Prediction using Deep Conditional Generative Models," IEEE Rob. Autom. Lett., vol. 5, no. 2, pp. 970–976, Apr. 2020.
[35] W. Zhang, Q. Chai, Q. Zhang, and C. Wu, "Obstacle-Transformer: A Trajectory Prediction Network based on Surrounding Trajectories," IET Cyber-Syst. Robot., vol. 5, no. 1, Mar. 2023.
[36] S. Yoon and K. Lee, "Aircraft Trajectory Prediction With Inverted Transformer," IEEE Access, vol. 13, pp. 26318–26330, 2025.
[37] H. Damirchi, M. Greenspan, and A. Etemad, "Context-Aware Pedestrian Trajectory Prediction with Multimodal Transformer," Proc. Int. Conf. Image Process. (ICIP), Kuala Lumpur, Malaysia, Oct. 2023, pp. 2535–2539.
[38] Y. Xiao, Y. Hu, J. Liu, Y. Xiao, and Q. Liu, "An Adaptive Multimodal Data Vessel Trajectory Prediction Model Based on a Satellite Automatic Identification System and Environmental Data," J. Mar. Sci. Eng., vol. 12, no. 3, Mar. 2024.
[39] Y. Lu, Y. Zhao, H. Wang et al., "Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras," IEEE Rob. Autom. Lett., vol. 10, no. 1, pp. 764–771, Jan. 2025.
[40] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep Image Prior," Int. J. Comput. Vision, vol. 128, no. 7, pp. 1867–1888, Jul. 2020.
[41] Z. Zhang, Z. Yang, and Y. Yang, "SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction," Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2024, pp. 9936–9947.
[42] L. Wang, J. Zhao, M. Xiao, and J. Liu, "Predicting Lane Change and Vehicle Trajectory With Driving Micro-Data and Deep Learning," IEEE Access, vol. 12, pp. 106432–106446, 2024.
[43] L. You, S. Xiao, Q. Peng et al., "ST-Seq2Seq: A Spatio-Temporal Feature-Optimized Seq2Seq Model for Short-Term Vessel Trajectory Prediction," IEEE Access, vol. 8, pp. 218565–218574, 2020.
[44] X. Zhang, X. Fu, Z. Xiao, H. Xu, W. Zhang, J. Koh, and Z. Qin, "A Dynamic Context-Aware Approach for Vessel Trajectory Prediction Based on Multi-Stage Deep Learning," IEEE Trans. Intell. Veh., vol. 9, no. 11, pp. 7193–7207, 2024.
[45] A. Wang, H. Chen, L. Liu et al., "YOLOv10: Real-Time End-to-End Object Detection," 2024, arXiv:2405.14458.
[46] P. Mukhopadhyay and B. Chaudhuri, "A Survey of Hough Transform," Pattern Recognit., vol. 48, no. 3, pp. 993–1010, Mar. 2015.
[47] J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[48] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, Jun. 2017.
[49] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention Is All You Need," 2017, arXiv:1706.03762.