AI-Enhanced Spatial Cellular Traffic Demand Prediction with Contextual Clustering and Error Correction for 5G/6G Planning

Mohamad Alkadamani, Colin Brown, Halim Yanikomeroglu

Year: 2026 | Venue: arXiv preprint | Area: cs.LG | Type: Preprint

Abstract

Accurate spatial prediction of cellular traffic demand is essential for 5G NR capacity planning, network densification, and data-driven 6G planning. Although machine learning can fuse heterogeneous geospatial and socio-economic layers to estimate fine-grained demand maps, spatial autocorrelation can cause neighborhood leakage under naive train/test splits, inflating accuracy and weakening planning reliability. This paper presents an AI-driven framework that reduces leakage and improves spatial generalization via a context-aware two-stage splitting strategy with residual spatial error correction. Experiments using crowdsourced usage indicators across five major Canadian cities show consistent mean absolute error (MAE) reductions relative to location-only clustering, supporting more reliable bandwidth provisioning and evidence-based spectrum planning and sharing assessments.

M. Alkadamani is with Innovation, Science and Economic Development Canada (ISED) and Carleton University, Ottawa, Canada (e-mail: mohamad.alkadamani@ised-isde.gc.ca). C. Brown is with the Communications Research Centre (CRC), Ottawa, Ontario, Canada. H. Yanikomeroglu is with Carleton University, Ottawa, Ontario, Canada.

I Introduction

A key capability for AI-enabled wireless networks and cognitive spectrum management is characterizing how cellular traffic demand varies across space, particularly in dense urban regions. For 5G and beyond, spatial demand hotspots influence carrier bandwidth selection, small-cell deployment, capacity expansion, and the feasibility of spectrum access mechanisms. Spatial demand prediction can also help regulators and planners identify regions at risk of under- or over-provisioning and screen spectrum-sharing feasibility in longer-horizon planning.

Estimating cellular traffic demand, especially for commercial mobile applications, is complex and shaped by inter-related factors such as technological evolution, regulations, usage trends, and socio-economic conditions. Traditionally, planning studies relied on expert assessments in industry whitepapers [5] or simplified models [7]. Increasing system complexity and data availability have accelerated interest in data-driven planning and resource-allocation methods [1, 2, 3]. These approaches, including ML and advanced analytics, can infer spatial demand distributions and support planning decisions via predictive traffic-demand maps [9].

Data-driven spatial traffic-demand prediction faces challenges specific to geospatial wireless data, including heterogeneous feature resolutions, high dimensionality, and the need to handle spatial autocorrelation to avoid biased evaluation. A central concern is spatial information leakage: nearby samples are statistically dependent, so naive train/test splitting can yield over-optimistic accuracy and misleading generalization claims.

Prior work on traffic-demand proxy modeling provides a starting point. In [8], diverse geospatial inputs were studied alongside a demand proxy, and location-based clustering was used to improve train/test independence; [9] advanced the proxy and interpretability aspects. Spatio-temporal traffic prediction has also been studied using recurrent neural networks with geo-referenced traffic measurements to capture temporal dynamics and geographic variability [11]. From the leakage-mitigation perspective, spatial cross-validation techniques are surveyed in [4, 12], and related fields propose cluster-based and hybrid splitting strategies [6, 13].
However, many clustering-based splits remain context-blind, relying mainly on location (or generic clustering) without explicitly enforcing representativeness in land-use and functional context; consequently, dependence can still leak across folds and evaluation may remain optimistic.

This paper proposes a context-aware framework for spatial cellular traffic-demand prediction that improves spatial generalization under spatial autocorrelation and supports reliable 5G/6G planning. The contributions are (i) a context-aware two-stage splitting strategy that combines spatial clustering with land-use/context clustering to form representative folds and reduce leakage relative to location-only clustering, and (ii) a five-city Canadian evaluation with planning-oriented mappings that translate prediction error into bandwidth-dimensioning error and congestion-risk screening for carrier bandwidth and spectrum-provisioning assessments.

The remainder of the paper is organized as follows. Section II describes the data model and feature mapping. Section III characterizes spatial dependency using Moran’s I. Section IV presents the proposed two-stage splitting strategy. Section V reports performance results, Section VI links errors to 5G/6G planning metrics, and Section VII concludes.

II Data-driven Spatial Traffic Demand Modeling

Spatial traffic-demand prediction is formulated as supervised learning over heterogeneous geospatial features by mapping both feature layers and a traffic-demand proxy to a common geographic unit.

II-A Grid-cell representation and study areas

The study region is partitioned into uniform square grid cells of approximately $1.5\,\mathrm{km} \times 1.5\,\mathrm{km}$. The grid cell is an analysis unit used to align datasets to a common spatial resolution. Five major Canadian cities are considered: Montreal, Vancouver, Greater Toronto, Ottawa, and Calgary. Let $S=\{s_1,\dots,s_n\}$ denote the set of $n$ grid cells.
Each $s_i$ has a feature vector $\mathbf{x}_i \in \mathbb{R}^m$ and a target $y_i$ representing a traffic-demand proxy, with learning objective $\hat{y}_i = f_\theta(\mathbf{x}_i)$.

II-B Traffic demand proxy from crowdsourced measurements

Direct traffic measurements are typically restricted to mobile network operators and are unavailable for public studies. A traffic-demand proxy $y_i$ is therefore derived from crowdsourced mobile usage indicators collected via application-embedded SDKs and aggregated to the grid-cell level. The dataset comprises approximately 15 million measurements across the five cities over one month in 2023. In the 4G/5G transition regime, the indicators reflect combined activity and represent total traffic intensity rather than generation-specific load. The proxy increases with the number and persistence of observed user connections within a grid cell and, although bytes transferred are not explicitly encoded, captures the dominant busy-hour spatial structure (dense connections imply higher load, while sparse activity implies lower demand). After filtering and aggregation, each grid cell $s_i$ is assigned $y_i$ paired with $\mathbf{x}_i$ for supervised learning.

II-C Feature layers and grid-cell mapping

Input features originate from heterogeneous sources and are standardized to the grid-cell representation. Feature layers may be defined on administrative units (e.g., census subdivisions and dissemination areas), as points (e.g., points of interest and businesses), or as polygons (e.g., land-use/land-cover). Geometry-aware mapping assigns each grid cell $s_i$ a consistent feature vector $\mathbf{x}_i$. Feature layers are temporally aligned to overlap with the same one-month window of the crowdsourced dataset; some layers have coarser temporal granularity than others but still overlap the analysis period. For areal datasets, grid-cell values are obtained by spatial overlay using area-weighted allocation from source polygons to intersecting grid cells.
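As a rough illustration of area-weighted allocation, the sketch below redistributes an extensive quantity (for example, a population count) from source polygons to grid cells in proportion to overlap area. It assumes axis-aligned rectangles given as (xmin, ymin, xmax, ymax) so that overlap can be computed without a GIS library. The function names `overlap_area` and `allocate_to_cells` and the numbers are hypothetical; the paper does not describe its overlay implementation.

```python
def overlap_area(a, b):
    """Area of intersection of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def allocate_to_cells(sources, cells):
    """Area-weighted allocation of an extensive value (e.g. a count).

    sources: list of (box, value) pairs for the source polygons.
    cells:   list of grid-cell boxes.
    Each cell receives value * (overlap area / source area) from every source.
    """
    totals = [0.0] * len(cells)
    for box, value in sources:
        source_area = (box[2] - box[0]) * (box[3] - box[1])
        for idx, cell in enumerate(cells):
            totals[idx] += value * overlap_area(box, cell) / source_area
    return totals


# Illustrative data only: one 3 km x 1.5 km census unit with 900 residents,
# offset by 0.5 km relative to a row of three 1.5 km grid cells.
sources = [((0.5, 0.0, 3.5, 1.5), 900.0)]
cells = [(0.0, 0.0, 1.5, 1.5), (1.5, 0.0, 3.0, 1.5), (3.0, 0.0, 4.5, 1.5)]
cell_values = allocate_to_cells(sources, cells)
print(cell_values)
```

The allocation conserves the source total when the grid fully covers the source polygon. For intensive variables such as densities, an area-weighted mean over the overlapping sources would be the analogous choice.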
For categorical polygon layers (e.g., land-use), the dominant class within each grid cell is used; point-based layers are mapped via counts or density estimates (normalized by cell area). Mapped features include socio-economic variables (population density and related census indicators), urban infrastructure (counts/densities of businesses, roads, and points of interest), land-use type (land-use/land-cover attributes), and network infrastructure (an indicator of cellular infrastructure presence). All layers are aligned to the grid and combined into $\mathbf{x}_i$, yielding $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. Figure 1 summarizes the pipeline.

Figure 1: Modeling pipeline.

III Spatial Dependency Characterization

Urban traffic-demand maps exhibit spatial dependence: neighboring areas often share land use, socio-economic conditions, and mobility-driven usage, producing correlated demand values. Quantifying spatial dependence (i) motivates leakage-aware splitting and (ii) informs the spatial scale required to separate neighborhoods and obtain realistic generalization estimates.

III-A Global Moran’s I for spatial autocorrelation

Global Moran’s I quantifies city-wide spatial autocorrelation in $y_i$ [10]. For $N$ grid cells,

$$I = \frac{N}{W}\,\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\,(y_i-\bar{y})(y_j-\bar{y})}{\sum_{i=1}^{N}(y_i-\bar{y})^2}, \tag{1}$$

where $\bar{y}$ is the mean demand proxy, $w_{ij}$ is a spatial weight, and $W=\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}$. A distance-threshold neighborhood is used with $w_{ij}=1$ if the centroid distance satisfies $d_{ij}\le d_{\mathrm{threshold}}$ and $w_{ij}=0$ otherwise; varying $d_{\mathrm{threshold}}$ reveals the correlation range. Figure 2 plots $I$ versus distance (in grid cells) to identify the dominant correlation range that split boundaries should exceed to reduce leakage; inter-city differences further indicate that a single fixed split radius can be suboptimal.

Figure 2: Moran’s I versus distance (grid cells).
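A minimal sketch of Eq. (1) with binary distance-threshold weights follows, sweeping the threshold in the manner of the Moran’s I versus distance curve. It makes one assumption not stated in the text: self-pairs ($i=j$) receive zero weight, which is the usual convention. The function name `global_morans_i` and the four-cell toy transect are illustrative only.

```python
import math


def global_morans_i(coords, values, d_threshold):
    """Global Moran's I, Eq. (1), with binary distance-threshold weights.

    coords: list of (x, y) cell centroids, in grid-cell units.
    values: demand proxy per cell.
    Assumption: self-pairs (i == j) get zero weight.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum_i sum_j w_ij (y_i - mean)(y_j - mean)
    w_sum = 0.0   # W = sum_i sum_j w_ij
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= d_threshold:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    return (n / w_sum) * (cross / sum(d * d for d in dev))


# Toy transect of four adjacent cells.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
clustered = global_morans_i(coords, [1, 1, 5, 5], d_threshold=1.0)
alternating = global_morans_i(coords, [1, 5, 1, 5], d_threshold=1.0)

# Sweep the threshold to trace I against distance.
curve = [global_morans_i(coords, [1, 1, 5, 5], d) for d in (1.0, 2.0, 3.0)]
print(clustered, alternating, curve)
```

Spatially clustered values give a positive $I$ at short range, while the alternating pattern gives a negative one; the double loop is quadratic in the number of cells, so a spatial index would be preferable for city-scale grids.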

III-B Local Moran’s I for spatial clusters and outliers

Local Moran’s I localizes spatial clusters and outliers. For grid cell $s_i$,

$$I_i = (y_i-\bar{y})\sum_{j=1}^{N} w_{ij}\,(y_j-\bar{y}), \tag{2}$$

where $w_{ij}$ defines the neighborhood. Local Moran’s I yields four spatial association types: High-High (HH), high $y_i$ with high neighbors; Low-Low (LL), low $y_i$ with low neighbors; High-Low (HL), high $y_i$ with low neighbors; and Low-High (LH), low $y_i$ with high neighbors. Figure 3 visualizes these associations, with dark red indicating HH clusters (persistent hotspots) and light red indicating LL clusters (persistent low-demand regions). Transitional areas often exhibit HL/LH behavior and can be challenging for generalization when neighbors are split across folds.

Figure 3: Local Moran’s I clusters across five Canadian cities.

IV Two-Stage Data Splitting and Spatial Error Correction

Spatial autocorrelation makes random splitting unreliable because adjacent grid cells can share near-duplicate context, inflating validation accuracy when neighbors fall in different folds. Location-only clustering reduces leakage but can produce context-imbalanced folds (e.g., commercial versus residential); the proposed two-stage procedure enforces spatial separation and context diversity and is followed by spatial error correction, as shown in Fig. 4.

Figure 4: Two-stage clustering and spatial error correction framework.

IV-A Stage 1: spatial clustering for leakage reduction

Stage 1 partitions the study area into spatially cohesive blocks by applying k-Means to grid-cell centroids:

$$\min_{\{C_i\}_{i=1}^{k}} \sum_{i=1}^{k}\sum_{s_j\in C_i} \lVert \mathbf{p}_j-\boldsymbol{\mu}_i \rVert^2, \tag{3}$$

where $\mathbf{p}_j$ is the coordinate vector of $s_j$, $C_i$ is the $i$th spatial cluster, and $\boldsymbol{\mu}_i$ is its centroid. The parameter $k$ controls cluster diameter and the extent to which correlated neighbors are separated across folds; Fig. 2 guides selection by indicating the dominant correlation range (up to $r$ grid cells) that most train/validation boundaries should exceed.

IV-B Stage 2: context-aware refinement within spatial clusters

Stage 2 refines each spatial cluster using context features (e.g., land-use/land-cover and related attributes) to improve representativeness in feature space and avoid folds dominated by a single context type. A normalized dissimilarity between $s_a$ and $s_b$ within the same spatial cluster is

$$d(s_a,s_b) = \sum_{\ell=1}^{m} \frac{\lvert x_{a\ell}-x_{b\ell} \rvert}{\sigma_\ell}, \tag{4}$$

where $\sigma_\ell$ is the standard deviation of feature $\ell$. Categorical land-use can be represented via one-hot encoding so that dissimilarity reflects context changes. Stage 2 produces sub-clusters that remain spatially contained within Stage-1 blocks and are more homogeneous in context. Sub-clusters are assigned to folds to preserve Stage-1 spatial separation while ensuring each fold contains a mixture of contexts.

Figure 5 illustrates the effect in Montreal. As shown, the location-only k-Means partition uses five folds, whereas the two-stage procedure yields three folds because refinement and fold construction merge sub-clusters into the smallest set of folds that remains spatially separated and sufficiently distinct in land-use context.

Figure 5: Comparison of clustering techniques for Montreal.

IV-C Learning model and spatial error correction

XGBoost is used as the base predictor due to strong performance on structured tabular features and nonlinear interactions. Let $\hat{y}_i$ denote the prediction and $e_i = y_i-\hat{y}_i$ the residual. Even under leakage-reduced splitting, residuals can remain spatially correlated due to unmodeled neighborhood effects and latent variables, creating geographically coherent bias.
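Looking back at Stage 2, the normalized dissimilarity of Eq. (4) can be sketched as below, with land use one-hot encoded as the text suggests. Several details are assumptions for illustration rather than the authors’ choices: the feature names, the three land-use classes, the use of the population standard deviation over the cells of one spatial cluster, and skipping features with zero spread.

```python
from statistics import pstdev

# Hypothetical land-use classes; the paper does not list its categories.
LAND_USE_CLASSES = ("residential", "commercial", "industrial")


def encode(cell):
    """Context vector: numeric features followed by a one-hot land-use code."""
    one_hot = [1.0 if cell["land_use"] == c else 0.0 for c in LAND_USE_CLASSES]
    return [cell["pop_density"], cell["poi_count"]] + one_hot


def dissimilarity(xa, xb, sigma):
    """Eq. (4): sum over features of |x_a - x_b| / sigma.

    Assumption: features with zero spread are skipped to avoid division by zero.
    """
    return sum(abs(a - b) / s for a, b, s in zip(xa, xb, sigma) if s > 0)


# Toy cells belonging to one Stage-1 spatial cluster (values are made up).
cells = [
    {"pop_density": 4000.0, "poi_count": 12.0, "land_use": "residential"},
    {"pop_density": 4200.0, "poi_count": 10.0, "land_use": "residential"},
    {"pop_density": 900.0, "poi_count": 85.0, "land_use": "commercial"},
]
vectors = [encode(c) for c in cells]
sigma = [pstdev(column) for column in zip(*vectors)]

d_similar = dissimilarity(vectors[0], vectors[1], sigma)
d_different = dissimilarity(vectors[0], vectors[2], sigma)
print(d_similar, d_different)
```

The two residential cells come out much closer to each other than either is to the commercial cell, which is the behavior a context-aware sub-clustering step would rely on. Any clustering routine that accepts a precomputed distance could consume this measure.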

A Spatial Error Model (SEM) is applied to the residual process:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} = \lambda \mathbf{W}\boldsymbol{\epsilon} + \mathbf{u}, \tag{5}$$

where $\mathbf{W}$ is a spatial weights matrix (distance threshold or k-nearest neighbors), $\lambda$ captures residual spatial dependence, and $\mathbf{u}$ is i.i.d. noise. The SEM represents residuals as a spatially filtered process. SEM refinement is implemented as post-processing: XGBoost is trained on two-stage folds, residuals are computed on training data, $\lambda$ is estimated given $\mathbf{W}$, and the spatial filter is applied to correct predictions. A regularized objective that penalizes spatially structured residuals is

$$\mathcal{L}(\theta,\lambda) = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2 + \alpha\lVert\theta\rVert_2^2 + \beta\,\lVert(\mathbf{I}-\lambda \mathbf{W})^{-1}\boldsymbol{\epsilon}\rVert_2^2, \tag{6}$$

where $\theta$ denotes XGBoost parameters. The third term discourages residual autocorrelation, improving robustness in unseen geographic regions.

V Performance Evaluation

This section evaluates predictive performance and spatial generalization under the proposed splitting strategy using MAE and $R^2$, with learning curves assessing leakage-driven overfitting under spatially structured validation.

V-A Cross-city evaluation

Table I reports two evaluations. A leave-one-city-out configuration assesses transferability by holding each city out as an unseen test set while training on the remaining cities. An “All Cities” configuration performs splits across the pooled dataset to reflect within-distribution performance. Across both configurations, two-stage splitting consistently reduces MAE relative to location-only clustering, indicating that context-aware folds better reflect deployment heterogeneity. SEM provides additional MAE reductions, consistent with residual spatial structure not fully captured by the base predictor. Although $R^2$ gains are smaller, the systematic MAE reduction improves absolute demand estimates, which is the critical quantity for downstream planning.
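The leave-one-city-out configuration can be pictured as a simple index generator, sketched below together with the MAE metric. The names `leave_one_city_out` and `mean_absolute_error` are hypothetical helpers, and the labels are toy data with one entry per grid cell; the base predictor and SEM step are deliberately left out.

```python
def leave_one_city_out(city_labels):
    """Yield (held_out_city, train_idx, test_idx), holding out one city per round."""
    for held_out in sorted(set(city_labels)):
        train_idx = [i for i, c in enumerate(city_labels) if c != held_out]
        test_idx = [i for i, c in enumerate(city_labels) if c == held_out]
        yield held_out, train_idx, test_idx


def mean_absolute_error(y_true, y_pred):
    """MAE: mean of |y_i - y_hat_i| over the evaluation cells."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels only: each entry is the city of one grid cell.
city_labels = ["Toronto", "Toronto", "Montreal", "Ottawa", "Ottawa", "Calgary"]
splits = list(leave_one_city_out(city_labels))
for city, train_idx, test_idx in splits:
    print(city, "train:", train_idx, "test:", test_idx)
```

Because every cell of the held-out city is excluded from training, no within-city neighbor can leak into the test set in this configuration.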

TABLE I: Performance Results Across Five Canadian Cities

City          MAE (k-Means)   MAE (Two-Stage)   MAE (Two-Stage + SEM)   R^2 Gain (%)
Toronto       1532.8          1012.3            845.2                   3.85
Montreal      1621.5          1123.6            825.0                   2.87
Ottawa        1475.4          987.2             808.3                   3.86
Vancouver     1398.6          953.5             783.4                   4.88
Calgary       1450.7          1001.9            795.7                   3.86
All Cities    1432.7          989.9             806.7                   3.66

V-B Learning curves

Figure 6 compares learning curves for location-only clustering, two-stage spatial+context splitting, and two-stage splitting with SEM refinement. The training–validation gap indicates overfitting under the chosen split, and shaded bands reflect fold variability. Location-only clustering shows a large training–validation gap, indicating limited transfer to spatially distinct regions. Two-stage splitting reduces this gap and improves validation MAE, while SEM refinement further lowers validation error by correcting residual spatial dependence.

Figure 6: Learning curves comparison for different clustering strategies.

VI Wireless Planning Link for 5G/6G

This section presents a 5G NR mid-band (3.5 GHz) case study mapping the prediction errors in Section V to bandwidth-dimensioning and congestion-risk metrics.

VI-A Bandwidth dimensioning

Offered downlink traffic demand (bps) in grid cell $s_i$ is mapped from the proxy $y_i$ as

$$D_i = \kappa\, y_i, \tag{7}$$

where $\kappa$ is the busy-hour traffic per proxy unit; $\kappa = 50$ kbps per proxy unit is used across all methods. Let $\gamma_i$ denote the downlink SINR random variable over the geographic area represented by $s_i$. A planning-level spectral-efficiency abstraction is $\eta(\gamma) = (1-\rho_{\mathrm{oh}})\log_2(1+\gamma)$, where $\eta(\cdot)$ is in bps/Hz and $\rho_{\mathrm{oh}} \in [0,1)$ captures fractional overhead loss. An outage-constrained effective spectral efficiency is defined as $\eta_i^{(\delta)} = Q_{1-\delta}(\eta(\gamma_i))$, where $Q_{1-\delta}(\cdot)$ is the $(1-\delta)$-quantile and $\delta$ is the allowable outage probability. The bandwidth required to serve $s_i$ is approximated by

$$B_{i,\mathrm{req}}^{(\delta)} = \frac{D_i}{\eta_i^{(\delta)}}. \tag{8}$$

For a predicted proxy $\hat{y}_i$, the induced per-cell bandwidth-error magnitude follows $\lvert \hat{B}_{i,\mathrm{req}}^{(\delta)} - B_{i,\mathrm{req}}^{(\delta)} \rvert = (\kappa/\eta_i^{(\delta)})\,\lvert \hat{y}_i - y_i \rvert$. Under a constant planning assumption $\eta_i^{(\delta)} = \eta^{(\delta)}$, the mean absolute bandwidth dimensioning error (BDE) is proportional to MAE:

$$\mathrm{BDE}^{(\delta)} = \frac{\kappa}{\eta^{(\delta)}}\cdot \mathrm{MAE}. \tag{9}$$

Table II reports the resulting “All Cities” BDE sensitivity for $\eta^{(\delta)} \in \{2, 3, 3.5\}$ bps/Hz using the MAE in Table I, and Fig. 7 illustrates the city-wise BDE mapping for the baseline setting $\eta^{(\delta)} = 2$ bps/Hz.

TABLE II: Sensitivity Table: “All Cities” BDE (MHz) for 3.5 GHz

eta^(delta) (bps/Hz)   k-Means   Two-Stage   Two-Stage + SEM
2.0                    35.8      24.7        20.2
3.0                    23.9      16.5        13.4
3.5                    20.5      14.1        11.5

VI-B Congestion risk versus candidate carrier bandwidth

Feasibility screening of candidate carrier bandwidths $B$ (e.g., 40–100 MHz at 3.5 GHz) is captured by the congested fraction

$$P_{\mathrm{cong}}(B) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left(D_i > B\,\eta_i^{(\delta)}\right), \tag{10}$$

where $P_{\mathrm{cong}}(B)$ is the share of grid cells whose offered demand exceeds supported capacity $B\,\eta_i^{(\delta)}$ and $N$ is the number of grid cells in the evaluation set. Figure 8 compares $P_{\mathrm{cong}}(B)$ under the reference surface $D_i$ (“Observed demand”) and predicted surfaces $\hat{D}_i$, where deviations from the reference quantify planning risk (overestimation/underestimation of congestion). Leakage-reduced splitting and SEM refinement improve spatial generalization and shift the inferred congestion curve toward the observed curve across $B$. For illustration without operator traffic traces, Fig. 8 uses an “All Cities” case study combining a heavy-tailed spatial demand distribution (log-normal proxy field) with prediction-error models calibrated to the MAE values in Table I, highlighting how moderate MAE differences translate into meaningful shifts in estimated congested area.

Figure 7: Case study: bandwidth dimensioning error (BDE).
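The arithmetic behind Eqs. (9) and (10) is simple enough to check directly. The sketch below recomputes the “All Cities” BDE sensitivity values from the reported MAE figures and $\kappa = 50$ kbps, then evaluates the congested fraction for a handful of made-up cells under a constant spectral efficiency. The function names and the four demand values are hypothetical; only $\kappa$, the MAE values, and the $\eta^{(\delta)}$ settings come from the text.

```python
KAPPA_BPS = 50e3  # kappa = 50 kbps per proxy unit (Section VI-A)

# "All Cities" MAE values from the performance results table (proxy units).
MAE_ALL_CITIES = {"k-Means": 1432.7, "Two-Stage": 989.9, "Two-Stage + SEM": 806.7}


def bde_mhz(mae, eta, kappa=KAPPA_BPS):
    """Eq. (9): BDE = kappa / eta * MAE, converted from Hz to MHz."""
    return kappa * mae / eta / 1e6


def congested_fraction(demand_bps, bandwidth_hz, eta):
    """Eq. (10) under a constant eta: share of cells with D_i > B * eta."""
    capacity_bps = bandwidth_hz * eta
    return sum(1 for d in demand_bps if d > capacity_bps) / len(demand_bps)


# BDE sensitivity for the three spectral-efficiency settings, in MHz.
for eta in (2.0, 3.0, 3.5):
    row = {name: round(bde_mhz(mae, eta), 1) for name, mae in MAE_ALL_CITIES.items()}
    print(eta, row)

# Hypothetical proxy values for four grid cells (not from the paper),
# mapped to offered demand with Eq. (7): D_i = kappa * y_i.
demand = [KAPPA_BPS * y for y in (1000.0, 2500.0, 4000.0, 6000.0)]
p_cong_80 = congested_fraction(demand, bandwidth_hz=80e6, eta=2.0)
print(p_cong_80)
```

Rounded to one decimal, the loop matches the tabulated BDE sensitivity values, for example $50\,\text{kbps} \times 1432.7 / 2 \approx 35.8$ MHz for location-only k-Means. In the toy congestion example, an 80 MHz carrier at 2 bps/Hz supports 160 Mbps per cell, so the two cells with 200 and 300 Mbps of offered demand count as congested.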

Figure 8: All Cities: $P_{\mathrm{cong}}(B)$ versus $B$ under observed and predicted demand.

VII Conclusion

This paper presented an AI-driven framework for spatial cellular traffic-demand prediction to support data-driven 5G/6G capacity and spectrum planning. The framework addresses spatial autocorrelation, which can inflate evaluation when training and validation sets are not properly separated, via a context-aware two-stage splitting strategy that reduces leakage while preserving functional representativeness across folds, and SEM refinement that mitigates residual spatial errors. Evaluation across five major Canadian cities shows consistent MAE reductions relative to location-only clustering, with additional gains from SEM correction. A planning-oriented 5G NR mid-band case study demonstrates how the predictive gains translate into more reliable carrier bandwidth selection and provisioning assessments while controlling leakage and geographically coherent bias.

References

[1] C. Brown, H. Rutagemwa, and M. Alkadamani (2024) Geospatial Insights in Spectrum Management: An Adaptive Data-Driven Licensing Approach. In 2024 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 2113–2118.

[2] K. Doke, A. Abedi, M. Hollingsworth, M. Zheleva, A. Sahai, D. Grunwald, and K. Gremban (2024) Towards data-driven policies in spectrum management. In 2024 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), pp. 163–168.

[3] Federal Communications Commission (FCC - USA) (2023) Advancing Understanding of Non-Federal Spectrum Usage. [Online]. Available: https://docs.fcc.gov/public/attachments/FCC-23-63A1.pdf

[4] S. Gao, Y. Hu, and W. Li (2023) Geospatial artificial intelligence. 1st edition, CRC Press.

[5] GSMA and Coleago Consulting (2021) Estimating Mid-band Spectrum Needs in the 2025-2030 Time Frame: Global Outlook. [Online]. Available: https://www.gsma.com/connectivity-for-good/spectrum/wp-content/uploads/2021/07/Estimating-Mid-Band-Spectrum-Needs.pdf

[6] H. Feng, Y. Wang, Z. Li, N. Zhang, Y. Zhang, and Y. Gao (2023) Information leakage in deep learning-based hyperspectral image classification: a survey. Remote Sensing 15 (15), 3793.

[7] International Telecommunication Union (2013) Methodology for Calculation of Spectrum Requirements for the Terrestrial Component of IMT - R-REC-M.1768-1. [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/m/R-REC-M.1768-1-201304-I!!PDF-E.pdf

[8] J. Parekh, A. Ghasemi, and H. Yanikomeroglu (2023) Data-Driven Modelling of Mobile Network Demand for Efficient Spectrum Management. In 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), pp. 1–6.

[9] J. Parekh, E. Yackoboski, A. Ghasemi, and H. Yanikomeroglu (2023) Modeling Local Demand for Mobile Spectrum Using Large Crowdsourced Datasets. In 2023 IEEE Future Networks World Forum (FNWF), pp. 1–5.

[10] H. Li, C. Calder, and N. Cressie (2007-10) Beyond Moran’s I: testing for spatial dependence based on the spatial autoregressive model. Geographical Analysis 39, pp. 357–375.

[11] C. Qiu, Y. Zhang, Z. Feng, P. Zhang, and S. Cui (2018-08) Spatio-temporal wireless traffic prediction with recurrent neural network. IEEE Wireless Communications Letters 7 (4), pp. 554–557.

[12] J. Salazar, L. Garland, J. Ochoa, and M. Pyrcz (2021-11) Fair train-test split in machine learning: mitigating spatial autocorrelation for improved prediction accuracy. Journal of Petroleum Science and Engineering 209, p. 109885.

[13] W. Yanwen, K. Mahdi, and Z-M. Raúl (2023) Spatial+: a new cross-validation method to evaluate geospatial machine learning models. International Journal of Applied Earth Observation and Geoinformation 121.