
Paper deep dive

SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting

Nikolas Stavrou, Siamak Mehrkanoon

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 30

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/26/2026, 2:35:25 AM

Summary

SmaAT-QMix-UNet is a parameter-efficient deep learning model for precipitation nowcasting that enhances the SmaAT-UNet architecture by integrating a vector quantization (VQ) bottleneck and mixed kernel depth-wise convolutions (MixConv). The model achieves a 37.5% reduction in parameters while maintaining or improving predictive accuracy and providing enhanced interpretability through Grad-CAM and UMAP visualizations.

Entities (7)

SmaAT-QMix-UNet · model · 100%
SmaAT-UNet · model · 100%
Grad-CAM · method · 95%
MixConv · technique · 95%
UMAP · method · 95%
Vector Quantization · technique · 95%
KNMI precipitation dataset · dataset · 90%

Relation Signals (4)

SmaAT-QMix-UNet is variant of SmaAT-UNet

confidence 100% · This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet

SmaAT-QMix-UNet uses technique Vector Quantization

confidence 100% · introduces two key innovations: a vector quantization (VQ) bottleneck

SmaAT-QMix-UNet uses technique MixConv

confidence 100% · mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks

SmaAT-QMix-UNet evaluated on KNMI precipitation dataset

confidence 95% · We use the KNMI precipitation dataset from [25]

Cypher Suggestions (2)

Find all models that are variants of SmaAT-UNet · confidence 90% · unvalidated

MATCH (m:Model)-[:IS_VARIANT_OF]->(b:Model {name: 'SmaAT-UNet'}) RETURN m.name

List all techniques used by SmaAT-QMix-UNet · confidence 90% · unvalidated

MATCH (m:Model {name: 'SmaAT-QMix-UNet'})-[:USES_TECHNIQUE]->(t:Technique) RETURN t.name

Abstract

Abstract: Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder-decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model's size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016-2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub (https://github.com/nstavr04/MasterThesisSnellius).

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

30,054 characters extracted from source content.


SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting

Nikolas Stavrou, Siamak Mehrkanoon (*corresponding author)

Abstract

Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder–decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model's size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016–2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub (https://github.com/nstavr04/MasterThesisSnellius).

I Introduction

Numerical Weather Prediction (NWP) models are computationally expensive, requiring large-scale simulations that are impractical for edge deployment or rapid ensemble forecasting. As a result, many researchers have turned to data-driven approaches, with deep learning models increasingly explored and adopted for weather forecasting over the past decade [12, 1, 28, 24, 22, 2].
Early data-driven nowcasting mainly relied on recurrent sequence models such as RNNs, LSTMs, and GRUs to model temporal dynamics, but these approaches largely neglected the spatial structure of radar and satellite imagery. ConvLSTM mitigates this by embedding convolutions within LSTM gates, enabling joint learning of motion and morphology and outperforming optical-flow baselines for short-range forecasts [19], while the PredRNN family extends this with spatio-temporal memory to better preserve storm structure at longer lead times [28]. More recent approaches adopt attention-based architectures. SmaAT-UNet integrates attention and depthwise-separable convolutions within a U-Net backbone to capture multi-scale features efficiently [25]. It achieves competitive nowcasting performance while using only a quarter of the parameters of a standard U-Net variant. Other related models such as WF-UNet and STC-ViT further demonstrate the effectiveness of lightweight attention and transformer-style mechanisms for nowcasting [9, 16].

Interpretability is essential for deploying deep models. Explainable-AI methods like Grad-CAM [18] and UMAP [11] provide human-readable insights into both individual predictions and overall model behavior. Recent XAI surveys in meteorology argue that combining such local and global views is essential for building practitioner trust and for debugging models before operational rollout [10].

This paper introduces SmaAT-QMix-UNet, a compact evolution of SmaAT-UNet that achieves marginally improved accuracy and precision while reducing trainable parameters by 37.5%. The improvement comes from two modifications: a discrete vector-quantization (VQ) bottleneck and MixConv blocks. A VQ module is inserted at the encoder–decoder bridge, replacing the latent feature map with nearest codeword indices to produce a compressed, noise-robust representation that also supports cluster-level interpretation.
In addition, selected depthwise-separable convolutions in the encoder and decoder are replaced with MixConv layers that blend multiple receptive field sizes within a single block [23], preserving multi-scale sensitivity while reducing redundancy. Interpretability is provided through Grad-CAM saliency maps and UMAP projections of the learned VQ codewords. Together, these changes improve efficiency, accuracy, and interpretability while remaining suitable for edge deployment.

II Related Work

Several studies have built upon the classic UNet architecture [15] to tackle precipitation nowcasting. For instance, A-TransUNet [30] uses a transformer and a UNet with attention modules and depthwise-separable convolutions, demonstrating improved performance on precipitation nowcasting. Similarly, Broad-UNet [5] refines the UNet backbone with multi-scale feature extraction via asymmetric parallel convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module, yielding more accurate predictions with fewer parameters. Variants extending the SmaAt-UNet [25] approach include GA-SmaAt-GNet [14], which leverages a generative adversarial framework and integrates precipitation masks to boost performance in extreme events, and SAR-UNet [13], which introduces residual connections alongside depthwise-separable convolutions to enhance both accuracy and interpretability through visual explanations. Lastly, WF-UNet [9] took a different approach by fusing additional meteorological inputs via a 3D-UNet architecture.

Other deep learning approaches have also been proposed for precipitation nowcasting beyond the UNet family. Early work using ConvLSTM has evolved into models like TrajGRU [20], which learn location-variant recurrent connections to better capture natural motion. Moreover, models like MetNet [21] and Nowcasting-Nets [3] combine self-attention and recurrent structures to extend forecast horizons and deliver probabilistic predictions that capture uncertainty.
The authors in [27] formulated precipitation nowcasting as a spatiotemporal graph sequence problem.

Various studies have demonstrated the effectiveness of VQ techniques across a range of applications, highlighting their potential to enhance model robustness, efficiency, and representational power. In the medical imaging domain, work has shown that integrating a quantization block into a UNet architecture leads to improved segmentation accuracy and robustness against noise, domain shifts, and other perturbations by learning a discrete, low-dimensional representation [17]. Similarly, VQ-UNet [8] applies a multi-scale hierarchical VQ approach to defend deep neural networks against adversarial attacks, effectively reducing unwanted noise and reconstructing data with high fidelity. The widely known VQ-VAE paper further establishes that replacing continuous latent codes with a learned discrete codebook can avoid issues like posterior collapse, ultimately enabling high-quality generation in images, videos, and speech [29]. Finally, the GPTCast model [6], which adapts the VQ-GAN framework [4], illustrates that vector quantization can be effectively employed in precipitation nowcasting, in their case with a variational autoencoder, to generate accurate, high-resolution forecasts.

Beyond the depthwise-separable convolutions employed in SmaAT-UNet, efficient convolutions continue to diversify. MixConv partitions channels and applies several kernel sizes within one depth-wise layer, improving the accuracy-to-FLOPs ratio on mobile-scale models [23], while GhostConv (from GhostNet) generates "ghost" feature maps through inexpensive linear operations, slashing parameter count without sacrificing representational power [7].

III Method

Figure 1: (a) SmaAT-QMix-UNet architecture: Rectangles represent feature maps, with height indicating spatial resolution and width the channel dimension.
MixConv blocks are used in the last two encoder levels and the first decoder stage, while a VQ layer discretizes the B×18×18×512 bottleneck tensor. (b) Vector-quantization module: Latent features are flattened, each 512-D vector is assigned to its nearest codebook entry (K=32), and reshaped into a quantized feature map. Training optimizes the combined codebook and β-weighted commitment losses.

III-A Proposed SmaAT-QMix-UNet model

III-A1 Architecture

Fig. 1 sketches the proposed SmaAT-QMix-UNet, which follows the encoder–bottleneck–decoder template of SmaAT-UNet [25] but introduces two key modifications. The encoder comprises five hierarchical levels (blue and cyan arrows). In each level, two 3×3 depth-wise separable convolutions, BatchNorm, and ReLU are followed by a CBAM attention module. The CBAM output is forwarded via a skip connection (grey arrows) to the matching decoder stage, while a 2×2 max pool (red arrow) halves the spatial resolution and feeds the next level. Levels 1–3 use the original depthwise-separable convolutions, while Levels 4 and 5 use a MixConv block (cyan arrows) that processes channel groups with 3×3 and 5×5 kernels in parallel [23]. At the bottleneck between encoder and decoder, the last encoder tensor is routed through a VQ module (purple arrow). Each latent vector is replaced by its nearest codeword entry in a learned codebook and then passed to the decoder. The decoder mirrors the encoder with four stages. Each begins with bilinear up-sampling (green arrows) that doubles spatial dimensions and concatenates the result with the corresponding skip connection. The first decoder stage reuses the MixConv block, while the remaining stages revert to depth-wise separable convolutions. A final 1×1 convolution (purple arrow) produces the single-channel precipitation nowcast at 30 min lead time.

III-A2 Vector-Quantization (VQ) Module

At the bottleneck of our model architecture, we use a VQ module following the lines of VQ-VAE [29].
The continuous encoder output $z_e \in \mathbb{R}^{B\times H\times W\times C}$ is mapped to a discrete latent space defined by a learnable codebook $\mathcal{E} = \{e_k\}_{k=1}^{K}$, with codewords $e_k \in \mathbb{R}^{D}$. We choose the codeword dimensionality $D$ to match the number of channels $C$, i.e., $D=C$. Therefore, in the rest of the paper, we use $D$ to also indicate the number of channels. As in [26], the codebook $\mathcal{E}$ is a set of learnable parameters, optimized jointly with the rest of the model. Next we flatten $z_e$ to a matrix in $\mathbb{R}^{N\times D}$, where $N = B\times H\times W$. We denote every row of this matrix as $v_n \in \mathbb{R}^{D}$. For each vector $v_n$, we compute the squared $\ell_2$ distance to every codeword and select the index of the nearest codeword as follows:

$$k_n = \arg\min_{k\in\{1,\dots,K\}} \|v_n - e_k\|_2^2, \quad \text{for } n = 1,\dots,N. \quad (1)$$

Next, each $v_n$ is replaced with its corresponding codeword $e_{k_n}$, and the result is reshaped to match the original layout, producing the quantized tensor as follows:

$$z_q = \mathrm{reshape}\big(\{e_{k_n}\}_{n=1}^{N}\big), \qquad z_q \in \mathbb{R}^{B\times H\times W\times D}. \quad (2)$$

Incorporating the VQ module introduces two additional loss terms, the codebook loss and the commitment loss, which are computed during training. The commitment loss penalizes encoder vectors for drifting away from their selected code embeddings, encouraging them to "commit" stably to a discrete code; the codebook loss pulls each codeword in $\mathcal{E}$ toward its detached encoder output to keep the dictionary representative. Training uses the straight-through estimator together with the two-term loss:

$$\mathcal{L}_{\mathrm{VQ}} = \underbrace{\frac{1}{N}\sum_{n=1}^{N} \big\|\mathrm{sg}[v_n] - e_{k_n}\big\|_2^2}_{\text{codebook}} \;+\; \underbrace{\frac{\beta}{N}\sum_{n=1}^{N} \big\|v_n - \mathrm{sg}[e_{k_n}]\big\|_2^2}_{\text{commitment}}, \quad (3)$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator and $\beta$ is the commitment cost controlling how strongly the encoder is encouraged to commit to its selected codes.
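The quantization step and loss of Eqs. (1)–(3) can be sketched numerically. The following is a minimal NumPy illustration with toy tensor sizes, not the paper's implementation (which would use a deep learning framework so that stop-gradients and the straight-through estimator take effect during training); only β = 0.75 is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W, D, K = 2, 4, 4, 8, 32   # toy sizes; the paper's bottleneck uses D=512, K=32
beta = 0.75                      # commitment cost, as tuned in the paper

z_e = rng.normal(size=(B, H, W, D))    # continuous encoder output
codebook = rng.normal(size=(K, D))     # codebook E = {e_1, ..., e_K}

v = z_e.reshape(-1, D)                 # flatten to N x D, with N = B*H*W

# Eq. (1): squared L2 distance of every latent vector to every codeword,
# then the index of the nearest codeword.
d2 = ((v[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # N x K
k_n = d2.argmin(axis=1)

# Eq. (2): replace each vector with its codeword and restore the layout.
z_q = codebook[k_n].reshape(B, H, W, D)

# Eq. (3): the two loss terms. Their forward values coincide up to the
# factor beta; they differ only in which side the stop-gradient detaches
# during training, which NumPy cannot express.
codebook_loss = ((v - codebook[k_n]) ** 2).sum(axis=-1).mean()
commitment_loss = beta * ((v - codebook[k_n]) ** 2).sum(axis=-1).mean()
loss = codebook_loss + commitment_loss
```

At inference time only the first two steps run: each encoder vector is deterministically snapped to its nearest codeword.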
All terms are computed as the mean squared error over all latent elements, making the loss magnitude insensitive to batch size or spatial resolution. At inference time the module is deterministic, mapping each encoder activation to the nearest codeword entry in $\mathcal{E}$.

III-A3 Mixed Convolution Block

In the original SmaAT-UNet, each DoubleDSC unit stacks two depth-wise separable convolutions with fixed 3×3 kernels. Since the deepest encoder layers have the widest channel dimensions and require the largest receptive fields, we modify only these layers and the first decoder stage after the VQ bottleneck. Specifically, we replace the DoubleDSC units in the two deepest encoder levels and the first decoder stage with a double MixConv block. Each MixConv splits the feature map into two equal channel groups, applies 3×3 and 5×5 depth-wise convolutions, concatenates the results, and projects them through a shared 1×1 point-wise convolution with BN and ReLU (Fig. 2). This sequence is repeated twice to mirror the original DoubleDSC structure. The resulting design captures both local and broader spatial context while reducing parameter count, and is applied only at these three locations to preserve the baseline behaviour elsewhere.

Figure 2: Mixed depthwise convolution (MixConv). The input tensor is split into two disjoint groups. The first group is processed by a 3×3 depthwise convolution and the second group by a 5×5 depthwise convolution. The two outputs are then concatenated along the channel dimension.

III-A4 Model Variations

To showcase the effects of VQ and MixConv we evaluate four progressively modified networks. The starting point is the unaltered SmaAT-UNet, which acts as the reference baseline. Adding MixConv alone, restricted to the last two encoder levels and the first decoder stage, yields SmaAT-Mix-UNet, a purely convolutional variant that probes the impact of kernel diversity on accuracy and size.
Additionally, we train SmaAT-Q-UNet, which inserts the VQ bottleneck but leaves all depth-wise separable convolution layers unchanged, isolating the contribution of discretized latents. Finally, we combine both modifications in SmaAT-QMix-UNet. This model includes the VQ layer and the same three MixConv replacements used in SmaAT-Mix-UNet, providing the configuration that jointly targets compactness and accuracy. All four networks share identical training, optimizer, and early-stopping settings, ensuring that any performance differences can be attributed to the architectural changes alone.

III-B Training

For our training, we follow the same steps as [25]. Our model ingests twelve consecutive radar maps per step and is trained for at most 100 epochs on a single NVIDIA H100 GPU with the Adam optimizer, an initial learning rate of 0.001, and a learning-rate patience of 4, which reduces the learning rate if the validation loss shows no improvement for 4 consecutive epochs. If the validation loss shows no improvement for 15 straight epochs, training stops early. Through hyperparameter tuning, we set a codebook length of 32 and a commitment cost of 0.75, which give the best model performance.

III-C Evaluation

Primary performance is reported as mean squared error (MSE) over the test split. To assess event detection quality, each output is thresholded into rain/no-rain masks; counts of true positives, false positives, true negatives, and false negatives then yield precision, recall, accuracy, and F1-score. Results for every SmaAT-QMix-UNet variant are benchmarked against the unaltered SmaAT-UNet and a Persistence baseline that simply repeats the last input frame.

III-D Explainability

We pair a local and a global tool to scrutinise model behavior. Gradient-weighted Class Activation Mapping (Grad-CAM) is applied to every encoder and decoder level to highlight the pixels that most influence each nowcast horizon.
Complementing these saliency maps, Uniform Manifold Approximation and Projection (UMAP) embeds the full set of vector-quantized code indices into two dimensions, revealing how discrete codes cluster into recurring weather regimes and allowing their associated Grad-CAM maps to be inspected side by side. Together, Grad-CAM and UMAP provide a coherent view of how the network combines spatial cues and latent code patterns to produce its predictions.

TABLE I: Performance comparison at 30-min lead time on the NL-50 test set, including persistence, SmaAT-UNet, and proposed models, with model size and inference time reported. Best values are in bold.

Model             Parameters   Inference Time (ms)   MSE (px)   Precision   Recall   Accuracy   F1 Score
Persistence       –            –                     0.0248     0.678       0.643    0.756      0.660
SmaAT-UNet        4 M          45                    0.0122     0.730       0.850    0.829      0.786
SmaAT-Q-UNet      4 M          41                    0.0119     0.748       0.820    0.832      0.782
SmaAT-Mix-UNet    2.5 M        42                    0.0129     0.670       0.866    0.794      0.756
SmaAT-QMix-UNet   2.5 M        39                    0.0120     0.763       0.812    0.838      0.787

IV Experiments

We use the KNMI precipitation dataset from [25], comprising approximately 420,000 radar composites recorded every five minutes between 2016 and 2019 by the Dutch C-band radars at De Bilt and Den Helder. Images are normalised, cropped to the common radar coverage, and centre-cropped to 288×288 pixels. Following [25], only samples with at least 50% rainy pixels in the target frame are retained, forming the NL-50 subset used for all experiments. Each sample consists of twelve consecutive rain maps (60 min history), with the model predicting precipitation 30 min ahead using mean squared error as the loss. Training follows the baseline setup with Adam, batch size 8, and identical learning-rate scheduling and early stopping. The only additional hyperparameters are the VQ codebook size K and commitment cost β, tuned on the validation set; the best configuration uses K=32 and β=0.75. The model with the lowest validation loss is evaluated on the NL-50 test set.
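The rain/no-rain scoring described in the Evaluation section can be sketched as follows. This is a minimal NumPy illustration with made-up toy maps; the threshold value is an assumption for illustration and is not specified here by the paper.

```python
import numpy as np

def nowcast_scores(pred, target, threshold=0.5):
    """Threshold predicted and observed precipitation maps into rain/no-rain
    masks, then derive precision, recall, accuracy, and F1 from the counts
    of true/false positives and negatives."""
    p = pred >= threshold
    t = target >= threshold
    tp = np.sum(p & t)    # rain predicted and observed
    fp = np.sum(p & ~t)   # rain predicted, none observed
    tn = np.sum(~p & ~t)  # no rain predicted or observed
    fn = np.sum(~p & t)   # rain observed but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Toy 2x2 maps: the model detects both rainy pixels but adds one false alarm.
target = np.array([[0.8, 0.0], [0.6, 0.1]])
pred   = np.array([[0.7, 0.6], [0.9, 0.0]])
scores = nowcast_scores(pred, target)
```

A production version would guard the divisions against empty masks (e.g. no predicted rain); the sketch omits that for brevity.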
Figure 3: Comparison of predictions generated by different models. The SmaAT-QMix-UNet model shows better alignment with the ground truth.

Figure 4: (a) UMAP visualization of encoder feature vectors before and after vector quantization in SmaAT-QMix-UNet, where grey points denote pre-VQ representations and colored points indicate their assigned codewords. (b) Hyperparameter tuning results for the VQ module, showing validation performance across 16 combinations of codebook size and commitment cost, with K=32 and β=0.75 achieving the best performance.

Figure 5: Heatmaps generated with Grad-CAM for SmaAT-QMix-UNet, showing activation regions across the five encoder and four decoder levels, including responses from the convolutional blocks (DoubleDSC or MixConv) and the CBAM modules in the encoder.

V Results and Discussion

V-A Precipitation nowcasting

V-A1 Tuning

We tune the two VQ-specific hyperparameters, the codebook size K and the commitment cost β, using a coarse grid search with K ∈ {8, 16, 32, 64} and β ∈ {0.25, 0.50, 0.75, 1.00} on the NL-50 validation set. Fig. 4 shows the resulting pixel-wise MSE heatmap. The best performance is obtained at K=32 and β=0.75, with K=16, β=0.25 yielding comparable results. Overall, moderate codebook capacity combined with a relatively high commitment cost provides the best balance between representation diversity and quantization stability. We therefore fix K=32 and β=0.75 for all subsequent SmaAT-QMix-UNet experiments.

V-A2 Evaluation

The experimental results on the Dutch precipitation dataset demonstrate the advantages of SmaAT-QMix-UNet in both predictive performance and model compactness. Table I reports 30-minute lead-time results on the NL-50 test set, including accuracy metrics, model size, and runtime. In terms of MSE, SmaAT-Q-UNet and SmaAT-QMix-UNet slightly outperform the SmaAT-UNet baseline (0.0119 and 0.0120 vs.
0.0122), whereas SmaAT-Mix-UNet alone underperforms (0.0129), indicating that MixConv without discretization is insufficient. All learned models outperform the persistence baseline by a large margin. Across secondary metrics, SmaAT-QMix-UNet matches or exceeds the baseline on all scores except recall (0.812 vs. 0.850), likely due to VQ regularization suppressing weak precipitation cells. This is offset by higher precision (+0.033) and improved overall accuracy. Crucially, SmaAT-QMix-UNet achieves these results with only 2.5M parameters, 37.5% fewer than the baseline, and reduces inference time by approximately 6 ms per batch. Overall, the results confirm that selective MixConv drives parameter efficiency, while the VQ bottleneck preserves or improves skill, yielding a compact and efficient nowcasting model. Figure 3 provides a qualitative comparison of 30-minute forecasts, showing that SmaAT-QMix-UNet produces predictions closest to the ground truth, particularly in regions of higher precipitation intensity.

V-B UMAP visualization

Fig. 4 uses Uniform Manifold Approximation and Projection (UMAP) to visualize the effect of vector quantization on the encoder latent space. The left panel shows a two-dimensional embedding of the 512-D encoder features, colored by their assigned codeword. After quantization (right), features collapse onto a small set of discrete codewords, with colored points indicating codeword locations and grey points showing the original feature positions. The tight clustering demonstrates that the codebook efficiently compresses similar patterns while preserving the overall latent-space structure.

V-C Grad-CAM visualization

Fig. 5 presents Grad-CAM saliency maps for all encoder and decoder levels of SmaAT-QMix-UNet, showing heatmaps for the convolutional blocks and the corresponding CBAM units. In the shallow encoder (Levels 1–3), DoubleDSC blocks already outline the main precipitation regions, while CBAM distributes attention more broadly.
By Level 3, both modules converge on the central rain areas. In deeper encoder layers (Levels 4–5), MixConv maintains focus on precipitation structures, with CBAM highlighting complementary regions as representations become more abstract. During decoding, saliency initially remains concentrated on high-intensity precipitation, then expands through up-sampling and skip connections before progressively refocusing on the heaviest rainfall. Overall, the saliency maps indicate a hierarchical representation in which SmaAT-QMix-UNet captures global rainfall geometry in early layers, emphasizes intense precipitation in deeper layers, and refines this information during decoding, while MixConv preserves spatial localization despite larger receptive fields.

VI Conclusion

In this work, we introduced SmaAT-QMix-UNet, which combines a vector-quantization bottleneck with MixConv to preserve the multi-scale design of SmaAT-UNet while substantially reducing model size. By discretizing the latent space and replacing the deepest convolutional blocks with MixConv, the model achieves marginal improvements in nowcasting skill with significantly fewer parameters, making it well suited for resource-constrained and edge deployments. Interpretability is enhanced through a two-level analysis: Grad-CAM highlights spatial regions driving the 30-minute forecast, while UMAP projections of the VQ codewords reveal coherent latent-space clusters. Together, these properties reduce inference and training costs and support efficient, interpretable precipitation nowcasting.

References

[1] I. A. Abdellaoui and S. Mehrkanoon (2021) Symbolic regression for scientific discovery: an application to wind speed forecasting. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8.
[2] D. Aykas and S. Mehrkanoon (2021) Multistream graph attention networks for wind speed forecasting. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8.
[3] M. R. Ehsani, A. Zarei, H. V. Gupta, K. Barnard, and A. Behrangi (2021) Nowcasting-Nets: deep neural network structures for precipitation nowcasting using IMERG. arXiv preprint arXiv:2108.06868.
[4] P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883.
[5] J. G. Fernández and S. Mehrkanoon (2021) Broad-UNet: multi-scale feature learning for nowcasting tasks. Neural Networks 144, pp. 419–427.
[6] G. Franch, E. Tomasi, R. Wanjari, V. Poli, C. Cardinali, P. P. Alberoni, and M. Cristoforetti (2024) GPTCast: a weather language model for precipitation nowcasting. arXiv preprint arXiv:2407.02089.
[7] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu (2020) GhostNet: more features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1580–1589.
[8] Z. He and M. Singhal (2024) VQUNet: vector quantization U-Net for defending adversarial attacks by regularizing unwanted noise. In Proceedings of the 2024 7th International Conference on Machine Vision and Applications, pp. 69–76.
[9] C. Kaparakis and S. Mehrkanoon (2023) WF-UNet: weather data fusion using 3D-UNet for precipitation nowcasting. Procedia Computer Science 222, pp. 223–232.
[10] A. Mamalakis, I. Ebert-Uphoff, and E. A. Barnes (2020) Explainable artificial intelligence in meteorology and climate science: model fine-tuning, calibrating trust and learning new science. In International Workshop on Extending Explainable AI Beyond Deep Models and Classifiers, pp. 315–339.
[11] L. McInnes, J. Healy, and J. Melville (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
[12] S. Mehrkanoon (2019) Deep shared representation learning for weather elements forecasting. Knowledge-Based Systems 179, pp. 120–128.
[13] M. Renault and S. Mehrkanoon (2023) SAR-UNet: small attention residual UNet for explainable nowcasting tasks. In IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[14] E. Reulen, J. Shi, and S. Mehrkanoon (2024) GA-SmaAt-GNet: generative adversarial small attention GNet for extreme precipitation nowcasting. Knowledge-Based Systems 305, p. 112612.
[15] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Part I, pp. 234–241.
[16] H. Saleem, F. Salim, and C. Purcell (2024) STC-ViT: spatio-temporal continuous vision transformer for weather forecasting. arXiv preprint arXiv:2402.17966.
[17] A. Santhirasekaram, A. Kori, M. Winkler, A. Rockall, and B. Glocker (2022) Vector quantisation for robust segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 663–672.
[18] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
[19] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems 28.
[20] X. Shi, Z. Gao, L. Lausen, H. Wang, D. Yeung, W. Wong, and W. Woo (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. Advances in Neural Information Processing Systems 30.
[21] C. K. Sønderby, L. Espeholt, J. Heek, M. Dehghani, A. Oliver, T. Salimans, S. Agrawal, J. Hickey, and N. Kalchbrenner (2020) MetNet: a neural weather model for precipitation forecasting. arXiv preprint arXiv:2003.12140.
[22] T. Stańczyk and S. Mehrkanoon (2021) Deep graph convolutional networks for wind speed prediction. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), pp. 147–152.
[23] M. Tan and Q. V. Le (2019) MixConv: mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595.
[24] K. Trebing and S. Mehrkanoon (2020) Wind speed prediction using multidimensional convolutional neural networks. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 713–720.
[25] K. Trebing, T. Stanczyk, and S. Mehrkanoon (2021) SmaAt-UNet: precipitation nowcasting using a small attention-UNet architecture. Pattern Recognition Letters 145, pp. 178–186.
[26] A. Van Den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. Advances in Neural Information Processing Systems 30.
[27] L. Vatamány and S. Mehrkanoon (2025) Graph dual-stream convolutional attention fusion for precipitation nowcasting. Engineering Applications of Artificial Intelligence 141, p. 109788.
[28] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu (2018) PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In International Conference on Machine Learning, pp. 5123–5132.
[29] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas (2021) VideoGPT: video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157.
[30] Y. Yang and S. Mehrkanoon (2022) A-TransUNet: attention augmented TransUNet for nowcasting tasks. In IEEE International Joint Conference on Neural Networks (IJCNN), pp. 1–8.