Paper deep dive
AGCD: Agent-Guided Cross-Modal Decoding for Weather Forecasting
Jing Wu, Yang Liu, Lin Zhang, Junbo Zeng, Jiabin Wang, Zi Ye, Guowen Li, Shilei Cao, Jiashun Cheng, Fang Wang, Meng Jin, Yerong Feng, Hong Cheng, Yutong Lu, Haohuan Fu, Juepeng Zheng
Abstract
Abstract: Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively. To effectively apply the priors, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625° and 1.40625°) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.
Links
- Source: https://arxiv.org/abs/2603.15260v1
- Canonical: https://arxiv.org/abs/2603.15260v1
Full Text
AGCD: Agent-Guided Cross-modal Decoding for Weather Forecasting

Jing Wu¹*, Yang Liu²*, Lin Zhang³*, Junbo Zeng¹, Jiabin Wang¹, Zi Ye¹, Guowen Li¹, Shilei Cao¹, Jiashun Cheng², Fang Wang⁴, Meng Jin⁵, YeRong Feng⁶, Hong Cheng², Yutong Lu¹,⁷, Haohuan Fu⁸,⁷, and Juepeng Zheng¹,⁷†

¹ Sun Yat-sen University, Zhuhai, China
² The Chinese University of Hong Kong, Hong Kong, China
³ Jiangxi Science and Technology Normal University, Nanchang, China
⁴ China Meteorological Administration, Beijing, China
⁵ Huawei Technologies Co., Ltd, China
⁶ Guangdong-Hong Kong-Macao Greater Bay Area Weather Research Center for Monitoring Warning and Forecasting, China
⁷ National Supercomputing Center in Shenzhen, Shenzhen, China
⁸ Tsinghua University, Shenzhen, China
* Equal contribution † Corresponding author

Abstract. Accurate weather forecasting is more than grid-wise regression: it must preserve coherent synoptic structures and physical consistency of meteorological fields, especially under autoregressive rollouts where small one-step errors can amplify into structural bias. Existing physics-priors approaches typically impose global, once-for-all constraints via architectures, regularization, or NWP coupling, offering limited state-adaptive and sample-specific controllability at deployment. To bridge this gap, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play decoding-time prior-injection paradigm that derives state-conditioned physics-priors from the current multivariate atmosphere and injects them into forecasters in a controllable and reusable way. Specifically, we design a multi-agent meteorological narration pipeline to generate state-conditioned physics-priors, utilizing MLLMs to extract various meteorological elements effectively.
To effectively apply the priors, AGCD further introduces cross-modal region interaction decoding that performs region-aware multi-scale tokenization and efficient physics-priors injection to refine visual features without changing the backbone interface. Experiments on WeatherBench demonstrate consistent gains for 6-hour forecasting across two resolutions (5.625° and 1.40625°) and diverse backbones (generic and weather-specialized), including strictly causal 48-hour autoregressive rollouts that reduce early-stage error accumulation and improve long-horizon stability.

Keywords: Weather forecasting · Multi-agent generation · Physics-priors injection

arXiv:2603.15260v1 [cs.AI] 16 Mar 2026

1 Introduction

Short-range weather forecasting is a cornerstone of operational prediction, underpinning public safety and high-stakes decision-making. High-impact phenomena can develop within hours, requiring accurate forecasts of evolving multi-variable atmospheric states that preserve cross-variable physical consistency. In this regime, small single-step errors, seemingly minor under grid-wise metrics, can accumulate and amplify into structural biases during autoregressive deployment. Traditionally, Numerical Weather Prediction (NWP) maintains consistency by solving dynamical equations, but it incurs prohibitive computational cost under high resolution and frequent update cycles [1,2]. In contrast, data-driven forecasters trained on large reanalysis datasets enable substantially faster inference while achieving competitive short-to-medium-range accuracy [5,28]. Despite their efficiency, purely data-driven forecasters do not explicitly enforce physical consistency across variables and space, where small short-range errors can be amplified under autoregressive deployment and evolve into physically implausible states.
In contrast, operational forecasting routinely performs state-aware diagnosis and targeted corrections to maintain coherent synoptic structures. Recognizing that purely data-driven, grid-wise regression is insufficient to preserve meteorologically meaningful structures and constraints under complex atmospheric dynamics, prior work has revisited a central principle of NWP: constraining evolution with physical knowledge. Accordingly, researchers have attempted to inject meteorological physical priors into learning-based forecasters in various forms to guide physics-aware representation learning and improve predictive performance. Existing attempts to incorporate physical knowledge into data-driven forecasters mainly differ in where the prior is imposed: (1) model-level biases baked into architectures (e.g., spectral or operator designs [27,37,48], variable embeddings [15,35,46], spherical/mesh representations [22], and tailored objectives [26]); (2) training-time constraints added as regularization or physics-informed objectives [3,31,42,61]; and (3) hybrid schemes [40] that couple with NWP to enhance physical consistency. While effective, these priors are usually imposed in a global, once-for-all manner, limiting sample-specific controllability and state-adaptive guidance during multi-step deployment. Fig. 1 summarizes this gap and motivates an alternative: deriving state-conditioned, physically consistent guidance from the current atmosphere and applying it in a controllable and reusable way.

Recently, Multimodal Large Language Models (MLLMs) and agent workflows have achieved strong results across computer vision [13,14,16,29,44,49,53,59] and natural language processing [7,10,25,54,60,62], and are increasingly trained with an emphasis on physical consistency and guidance [19,43,51,52,55].
Their ability to produce consistent visual descriptions suggests a route to summarize the current multi-variable atmosphere into an explicit, controllable prior that highlights synoptic structures and enforces cross-variable consistency. Unlike static physics injection that is baked into architectures or losses and is hard to steer per sample, state-conditioned summaries provide situation-aware guidance with per-variable evidence and checkable consistency constraints.

Fig. 1: Global static physics-priors vs. state-conditioned physics-priors: the proposed AGCD injects cached state-conditioned physics-priors at decoding time.

However, naively applying generic captioners or single-round MLLMs to meteorological fields is hindered by two bottlenecks: reliability (coverage gaps and cross-variable inconsistencies in strongly coupled, high-dimensional states) and efficiency (online multi-agent reasoning is costly for training and deployment). Therefore, we seek a reliable and efficient mechanism that generates causally valid, state-conditioned priors and injects them into forecasters with less runtime cost.

Motivated by this perspective, we propose Agent-Guided Cross-modal Decoding (AGCD), a plug-and-play prior-injection paradigm designed for Transformer-based neural forecasters. Concretely, AGCD employs an offline Multi-agent Meteorological Narration Pipeline (MMNP) to generate concise, state-conditioned physics-priors from multi-variable heatmaps, and injects them as decoding-time guidance into Transformer-based forecasters. To realize this, we further introduce Cross-modal Region Interaction Decoding (CRID), a plug-and-play cross-modal decoder that efficiently fuses the cached priors with visual tokens for region-adaptive refinement, improving structural fidelity without modifying the backbone I/O interface.
We evaluate AGCD on WeatherBench [39] at two resolutions; for long-horizon assessment, we perform strictly causal 6-hour-step autoregressive rollouts up to 48 hours, where the narrative is refreshed from the current rollout state without introducing future information. Across settings, AGCD consistently improves accuracy and reduces error accumulation, leading to more stable long-horizon behavior.

Contributions. Our main contributions are three-fold:
- We introduce a new perspective for physics-priors injection in weather forecasting: leveraging MLLMs to convert multi-variable atmospheric states into state-conditioned physics-priors that are explicit, controllable, and reusable.
- We propose AGCD, a plug-and-play decoding-time prior-injection framework that couples an offline multi-agent narration pipeline (MMNP) with a lightweight cross-modal decoder (CRID) to enable region-adaptive refinement without modifying backbone interfaces.
- We demonstrate consistent gains for 6-hour forecasting on WeatherBench across two resolutions and diverse backbones (generic and weather-specialized), including 48-hour autoregressive rollouts that reduce early-stage error accumulation.

2 Related Work

2.1 Data-driven Weather Forecasting

Data-driven weather forecasting has advanced rapidly with deep models that learn spatiotemporal dynamics from large reanalysis datasets.
Beyond early convolutional and recurrent approaches, recent progress mainly follows three directions: (i) neural operators that approximate the evolution operator in function space and enable efficient global mixing for autoregressive rollouts [6,32,37]; (ii) Transformer-style forecasters that scale sequence modeling on latitude-longitude grids, often incorporating weather-aware designs such as variable-level embeddings, pressure-level structure, and latitude-weighted objectives [4,5,36]; and (iii) graph-based forecasters that perform message passing on spherical meshes for long-range transport and multi-scale interactions beyond regular grids [28,34,63]. These methods have achieved strong single-step accuracy and practical inference efficiency, making them promising alternatives or complements to traditional NWP in short-range settings.

Existing physics-priors injection is largely static, which is often hard to control at training time and lacks a mechanism for state-aware revision over dynamically sensitive regions. This static design becomes fragile under autoregressive deployment: small structural misplacements and weak cross-variable coherence at early steps can be recursively amplified, yielding systematic bias and unstable long-horizon trajectories. These limitations motivate an explicit, controllable, and plug-and-play guidance mechanism that injects state-conditioned priors at decoding time, without redesigning strong backbones, thereby improving the stability of early-stage autoregressive rollouts.

2.2 MLLMs and Agentic Workflows for Structured Guidance

Recent multimodal large language models (MLLMs) and agentic workflows [11] have become practical mechanisms for structured guidance, showing strong capabilities in grounded description [20,24,44,53], region-centric reasoning [14,49], multi-step decomposition [12,21,23,41,50,57], and tool-augmented verification [7,47,62] across vision and language tasks.
In particular, verification-oriented designs are used to suppress omissions, contradictions, and overconfident statements [17,30,56]. This paradigm suggests a route to convert high-dimensional visual observations [9] into compact semantic summaries that can act as controllable signals for downstream models [8,33,58].

However, transferring generic captioners or online multi-agent reasoning to meteorological fields is challenging due to two constraints: reliability and efficiency. Weather states are high-dimensional and strongly cross-variable coupled, making one-shot generation prone to incomplete coverage and inconsistent semantics, which is undesirable as a stable training-time prior. Meanwhile, online multi-agent execution is costly and often difficult to reproduce within large-scale forecasting pipelines. These limitations motivate guidance mechanisms that produce deterministic, evidence-grounded state summaries with explicit consistency control, and that enable offline caching to avoid online multi-agent iterations during training and one-step inference while supporting strictly causal rollouts via a lightweight single-step editor.

Fig. 2: The overview of the proposed AGCD.

3 Methodology

3.1 Overall Framework

As illustrated in Fig. 2, our framework couples structured language guidance with visual spatiotemporal representation learning for meteorological forecasting. It consists of a language pathway that provides semantic cues and a visual pathway that produces spatiotemporal tokens for prediction.

Language pathway. For each meteorological variable field X_i ∈ R^{H×W}, we render it into an RGB heatmap I_i ∈ R^{H×W×3} using a fixed colormap and a fixed normalization scheme to ensure a deterministic value-to-color mapping. Given the multivariate heatmaps {I_i}_{i=1}^{N}, the proposed Multi-agent Meteorological Narration Pipeline (MMNP) (Sec. 3.2) generates a coherent meteorological narrative S_final summarizing salient atmospheric states and potential inter-variable interactions. To avoid running multi-agent iterations, S_final is precomputed offline for each sample and cached for training and inference.

We then encode S_final using a pretrained Large Language Model (LLM) and extract the last-layer hidden states as token embeddings:

T = E_LLM(S_final) ∈ R^{N_t × d_t}.  (1)

Importantly, the LLM is kept frozen throughout training and inference.

Visual pathway and cross-modal coupling. In parallel, the raw fields are fed into a Transformer-based forecasting backbone (such as Pangu [4], ClimaX [36], etc.), producing patch tokens P ∈ R^{N×d} (with N = H·W) and a global class token C ∈ R^{1×d}. We then perform a cross-modal guidance preprocessing step in our Cross-modal Region Interaction Decoding (CRID) (Sec. 3.3). Specifically, the class token C generates token-wise and channel-wise gates to refine the frozen text embeddings T, yielding visual-guided text features T̃ aligned to the visual feature space. CRID then injects T̃ into region-aware decoding through token distillation and cross-attention modulation, producing improved forecasts that leverage both local atmospheric patterns and global semantic context.

3.2 Multi-agent Meteorological Narration Pipeline (MMNP)

Generating a coherent and meteorologically plausible narrative from multivariate atmospheric inputs requires (i) capturing salient spatial patterns within each variable and (ii) integrating cross-variable cues without introducing contradictions or temporally-confounded causal claims. Therefore, we propose MMNP, a collaborative multi-agent pipeline that produces an offline narrative prior S_final from deterministically rendered RGB heatmaps {I_i}_{i=1}^{N} (Sec. 3.1). To keep computation bounded and reproducible, MMNP operates under fixed prompt templates and a fixed refinement budget.
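The deterministic value-to-color rendering described in the language pathway can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the blue-white-red colormap anchors and the normalization bounds below are assumptions (the paper only requires that both be fixed).

```python
import numpy as np

def render_heatmap(field, vmin, vmax, anchors=None):
    """Render a 2-D variable field (H, W) into an RGB heatmap (H, W, 3).

    Fixed normalization bounds (vmin, vmax) and fixed colormap anchors give a
    deterministic value-to-color mapping, as the language pathway requires.
    The blue->white->red anchors below are illustrative, not the paper's choice.
    """
    if anchors is None:
        anchors = np.array([[0.0, 0.0, 1.0],   # low  -> blue
                            [1.0, 1.0, 1.0],   # mid  -> white
                            [1.0, 0.0, 0.0]])  # high -> red
    # Deterministic normalization to [0, 1] with fixed, dataset-wide bounds.
    x = np.clip((field - vmin) / (vmax - vmin), 0.0, 1.0)
    # Piecewise-linear interpolation between the colormap anchors.
    pos = np.linspace(0.0, 1.0, len(anchors))
    rgb = np.stack([np.interp(x, pos, anchors[:, c]) for c in range(3)], axis=-1)
    return rgb

# Example: a synthetic temperature-like field on a small grid.
field = np.linspace(250.0, 310.0, 32 * 64).reshape(32, 64)
img = render_heatmap(field, vmin=250.0, vmax=310.0)
```

Because vmin/vmax and the anchors are fixed across the dataset, the same value always maps to the same color, which is what lets the cached narratives stay consistent with the rendered inputs.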
Agents and roles. MMNP consists of three agents with complementary responsibilities:

(1) Variable-specific description agents A_{V_i}. For each variable V_i, agent A_{V_i} takes the corresponding heatmap I_i and extracts salient spatial structures with coarse localization cues in a concise textual form:

d_i = A_{V_i}(I_i), i = 1, ..., N.  (2)

Each d_i follows a lightweight, template-constrained style (short clauses with approximate regions and intensity trends) to facilitate downstream integration and verification.

(2) Sequential integration agent A_I. The integration agent A_I merges {d_i}_{i=1}^{N} into a unified narrative by iteratively updating a running state S_{i-1} → S_i under a fixed variable order:

S_i = A_I(S_{i-1}, d_i), S_0 = ∅.  (3)

To prevent uncontrolled verbosity and to ensure consistent phrasing across samples, A_I writes S_i in a template-constrained format (short sentences or bullets) and explicitly separates: (a) observations grounded in the current heatmap patterns, and (b) hypothesized interactions across variables, phrased as tentative rather than factual or future-dependent claims.

(3) Evidence-grounded evaluator E. Given the variable-wise descriptions {d_i}_{i=1}^{N} and the integrated narrative S_final, the evaluator E performs a structured consistency check and returns either PASS or a feedback package. Concretely, E assesses three aspects:
- Per-variable coverage: whether salient structures described in each d_i are reflected in S_final (mitigating coverage gaps);
- Consistency with described evidence: whether statements in S_final preserve the coarse localization and intensity trends stated in d_i, without distortion or unwarranted specificity;
- Coherence: whether the narrative is concise, well-structured, and non-redundant.

The evaluator reports localized issues with different types (such as missing, distorted, contradictory, and overstated-causality) to enable targeted refinement.
Forward generation and evaluation. All variable-specific agents are executed in parallel to produce {d_i}_{i=1}^{N}, followed by chained integration to obtain S_final. The evaluator then verifies S_final against the variable-wise descriptions:

flag = E({d_i}_{i=1}^{N}, S_final).  (4)

If flag is PASS, we output S_final as the final narrative prior for the subsequent frozen LLM. If flag is FAIL, E returns a feedback package that specifies the issue type and the implicated variable, together with the current integrated narrative:

Feedback = (type, i, d_i, S_final),  (5)

where type ∈ {missing, distorted, contradictory, overstated-causality} and i indexes the variable whose description is implicated. Conditioned on Feedback, the integration agent A_I revises S_final by (i) adding missing but supported content from d_i, (ii) correcting distorted localization and intensity phrasing, (iii) resolving contradictions by rephrasing or narrowing claims, and (iv) weakening causal language into hypothesis form, while preserving unaffected content.

To ensure bounded and reproducible computation, we run at most R refinement rounds (fixed across the dataset). If the narrative still fails after R rounds, we fall back to the best-scoring version selected by E. The finalized S_final is cached offline and reused during training and inference, avoiding online multi-agent iterations during optimization.

3.3 Cross-modal Region Interaction Decoding (CRID)

As discussed in Sec. 3.1, the generated meteorological narrative is not merely for post-hoc explanation; instead, it should serve as a decoding-time explicit physics-prior that guides the forecaster toward dynamically sensitive regions and cross-variable-consistent structures. To this end, we propose CRID, a plug-and-play decoder that injects state-conditioned physics-priors without changing the backbone interface.
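The MMNP control flow of Eqs. (2)-(5) can be sketched as plain Python. This is a structural sketch only: `describe`, `integrate`, `evaluate`, and `revise` are hypothetical stubs standing in for the MLLM agents A_{V_i}, A_I, and E; the real scoring and feedback logic lives in the prompts.

```python
# Minimal sketch of the MMNP control flow (Eqs. 2-5): parallel description,
# chained integration, evaluator check, and a bounded refinement loop with a
# best-scoring fallback. The stub agents below stand in for MLLM calls.
from dataclasses import dataclass

@dataclass
class Feedback:
    issue_type: str   # 'missing' | 'distorted' | 'contradictory' | 'overstated-causality'
    var_index: int
    description: str
    narrative: str

def describe(i, heatmap):                      # A_Vi, Eq. (2): stub MLLM call
    return f"var{i}: salient structure in {heatmap}"

def integrate(state, d):                       # A_I, Eq. (3): running-state update
    return (state + " " + d).strip()

def evaluate(descriptions, narrative):         # E, Eq. (4): PASS or feedback + score
    missing = [i for i, d in enumerate(descriptions) if d not in narrative]
    if not missing:
        return "PASS", None, 1.0
    i = missing[0]
    score = 1.0 - len(missing) / len(descriptions)
    return "FAIL", Feedback("missing", i, descriptions[i], narrative), score

def revise(narrative, fb):                     # A_I conditioned on Feedback, Eq. (5)
    return integrate(narrative, fb.description)

def mmnp(heatmaps, budget_R=3):
    descriptions = [describe(i, h) for i, h in enumerate(heatmaps)]  # parallel in practice
    narrative = ""
    for d in descriptions:                     # chained integration, fixed order
        narrative = integrate(narrative, d)
    best, best_score = narrative, -1.0
    for _ in range(budget_R):                  # bounded, reproducible refinement
        flag, fb, score = evaluate(descriptions, narrative)
        if flag == "PASS":
            return narrative
        if score > best_score:
            best, best_score = narrative, score
        narrative = revise(narrative, fb)
    return best                                # fallback: best-scoring version

s_final = mmnp(["I_t2m", "I_z500"])
```

The fixed refinement budget `budget_R` corresponds to the paper's R rounds; the fallback branch mirrors the "best-scoring version selected by E" rule.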
CRID consists of two components: Cross-Modal Guidance (CMG), which produces visual-conditioned text features, and Cross-Modal Interaction (CMI), which performs region-aware multi-source interaction to modulate patch tokens for forecasting (see Fig. 3).

Fig. 3: Structure of Cross-Modal Interaction.

Inputs. Given an input state at time t, the forecasting backbone produces a set of patch-wise visual tokens P ∈ R^{N×d} (with N = H·W) and a global summary token C ∈ R^{1×d}. In parallel, a frozen text encoder embeds the narrative prior (Sec. 3.1) into token features T ∈ R^{N_t × d_t}. CRID takes (P, C, T) as inputs and performs decoding-time revision.

Cross-Modal Guidance (CMG). CMG converts frozen text features into visual-conditioned semantics. The core idea is to use the class token C as a compact summary of the current atmospheric state and let it gate the narrative tokens T, thereby selectively emphasizing state-relevant semantic cues. We first align text features to the visual channel dimension as Eq. (6), where g(·) is a learnable linear projection if d_t ≠ d. We then map C through a lightweight MLP f(·) and split the output into two queries as Eq. (7):

U = g(T) ∈ R^{N_t × d},  (6)
[q_tok, q_ch] = f(C), q_tok ∈ R^{1×N_t}, q_ch ∈ R^{1×d}.  (7)

Token-wise gating reweights narrative tokens by their compatibility with the global state as Eq. (8), and channel-wise gating further refines the semantic channels to match the state-dependent emphasis as Eq. (9):

α = softmax(q_ch U^⊤) ∈ R^{1×N_t}, U^(1) = α ⊙ U,  (8)
β = softmax(q_tok U^(1)) ∈ R^{1×d}, T̃ = β ⊙ U^(1).  (9)

The resulting T̃ ∈ R^{N_t × d} serves as a state-conditioned physics-prior and will be injected into CMI for region-aware interaction.

Cross-Modal Interaction (CMI). CMI injects the guided semantics T̃ into patch tokens via region-aware tokenization and memory-based modulation. Given patch tokens P ∈ R^{N×d}, we construct multi-scale region tokens by pooling on the token grid.
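The CRID shapes and data flow can be checked with a minimal numpy sketch under simplifying assumptions: random matrices stand in for the learnable projections g(·), f(·), and W_Q/W_K/W_V; softmax attention pooling stands in for Hopfield pooling; single-head attention replaces MHA; the final MLP head is omitted. None of this reproduces trained behavior, it only makes the tensor shapes of CMG and CMI concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Shapes: N visual patch tokens on an H x W grid, N_t text tokens, channel dim d.
H, W, d, N_t, d_t, M = 8, 16, 32, 10, 48, 6
N = H * W
P = rng.normal(size=(N, d))      # patch tokens
C = rng.normal(size=(1, d))      # global class token
T = rng.normal(size=(N_t, d_t))  # frozen narrative embeddings

# --- CMG, Eqs. (6)-(9): class-token-conditioned gating of text features ---
g = rng.normal(size=(d_t, d))                      # learnable projection g(.)
U = T @ g                                          # Eq. (6): (N_t, d)
f = rng.normal(size=(d, N_t + d))                  # lightweight MLP f(.) (linear stand-in)
q = C @ f
q_tok, q_ch = q[:, :N_t], q[:, N_t:]               # Eq. (7): (1, N_t) and (1, d)
alpha = softmax(q_ch @ U.T)                        # Eq. (8): (1, N_t) token weights
U1 = alpha.T * U                                   #          reweight narrative tokens
beta = softmax(q_tok @ U1)                         # Eq. (9): (1, d) channel weights
T_tilde = beta * U1                                #          guided text features

# --- CMI, Eqs. (10)-(13): region tokens, memory distillation, modulation ---
def region_tokens(P, H, W, scales=(2, 4)):         # Eq. (10): avg-pool on token grid
    grid = P.reshape(H, W, -1)
    R = []
    for s in scales:
        pooled = grid[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s, -1)
        R.append(pooled.mean(axis=(1, 3)).reshape(-1, grid.shape[-1]))
    return np.concatenate(R, axis=0)

R = region_tokens(P, H, W)
X = np.concatenate([P, R, T_tilde], axis=0)        # Eq. (11): unified context, (L, d)

Q_h = rng.normal(size=(M, d))                      # learnable pooling queries
Z = softmax(Q_h @ X.T / np.sqrt(d)) @ X            # Eq. (12): attention pooling
                                                   # (stand-in for Hopfield pooling)

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A = softmax((P @ Wq) @ (Z @ Wk).T / np.sqrt(d))    # Eq. (13): single-head stand-in
P_out = A @ (Z @ Wv) + P                           # residual connection; MLP omitted
```

Note the asymmetry from the equations is preserved: q_ch (channel-shaped) produces the N_t token weights α via U^⊤, while q_tok (token-shaped) produces the d channel weights β via U^(1).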
Let P_grid ∈ R^{H×W×d} be the reshaped tokens; for scales s ∈ S we compute

R^(s) = Flatten(AvgPool_{s×s}(P_grid)), R = [R^(s)]_{s∈S} ∈ R^{N_r × d}.  (10)

We first construct a unified decoding context by concatenating patch tokens, multi-scale region tokens, and guided semantic tokens:

X = Concat(P, R, T̃) = [P; R; T̃] ∈ R^{L×d}, L = N + N_r + N_t.  (11)

Since directly operating on X is computationally expensive and may dilute salient cross-modal cues, we further distill X into a compact set of M memory tokens (M ≪ L) via Hopfield pooling [38], yielding representative prototypes for efficient decoding-time modulation:

Z = HopfieldPool(Q_h, X) ∈ R^{M×d},  (12)

where Q_h ∈ R^{M×d} denotes learnable pooling queries. We apply multi-head attention (MHA) with P as queries and the memory Z as keys and values:

P̂ = MHA(P W_Q, Z W_K, Z W_V) + P, P_out = MLP(P̂),  (13)

where W_Q, W_K, W_V are learnable linear projections for queries, keys, and values, respectively. The proposed CMI acts as a plug-in decoder that replaces the original decoding head and directly outputs the final forecasts, without modifying the backbone encoder.

4 Experiments

4.1 Setup

Dataset. We evaluate on WeatherBench at 5.625° and 1.40625° for 6-hour forecasting: given the state at time t, predict t+6h. We further assess long-horizon behavior via autoregressive rollouts up to 48 hours by iteratively feeding predictions back as inputs. Inputs include surface variables wind10m, t2m and upper-air variables z, r, q, wind, t over 13 pressure levels; we report canonical WeatherBench scores on Z500, T850, T2m, and 10m wind. We use a strict temporal split: train (1979-01-01 to 2016-12-31) and test (2017-01-01 to 2017-12-31).

Metrics. All methods are trained under an identical supervised setup to predict t+6h from t, using the same variable configuration and optimization schedule; evaluation uses latitude-weighted RMSE and ACC computed on climatology-based anomalies.
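The latitude-weighted RMSE and anomaly-correlation (ACC) metrics used above follow the standard WeatherBench definitions, which weight each grid row by the cosine of its latitude. A minimal sketch, assuming a 5.625° grid (32 × 64) and a given climatology field:

```python
import numpy as np

def lat_weights(lats_deg):
    """Latitude weights: cos(lat), normalized to mean 1 over the grid rows."""
    w = np.cos(np.deg2rad(lats_deg))
    return w / w.mean()

def lat_weighted_rmse(pred, target, lats_deg):
    """Latitude-weighted RMSE over (H, W) fields; lats_deg has length H."""
    w = lat_weights(lats_deg)[:, None]            # broadcast over longitude
    return float(np.sqrt(np.mean(w * (pred - target) ** 2)))

def lat_weighted_acc(pred, target, clim, lats_deg):
    """Anomaly correlation coefficient on climatology-based anomalies."""
    w = lat_weights(lats_deg)[:, None]
    pa, ta = pred - clim, target - clim           # anomalies w.r.t. climatology
    num = np.sum(w * pa * ta)
    den = np.sqrt(np.sum(w * pa ** 2) * np.sum(w * ta ** 2))
    return float(num / den)

# Example on a synthetic 5.625-degree grid (32 latitudes x 64 longitudes);
# the latitude centers below are an assumption for illustration.
lats = np.linspace(-87.1875, 87.1875, 32)
rng = np.random.default_rng(0)
target = rng.normal(size=(32, 64))
pred = target + 0.1 * rng.normal(size=(32, 64))
clim = np.zeros_like(target)
rmse = lat_weighted_rmse(pred, target, lats)
acc = lat_weighted_acc(pred, target, clim, lats)
```

A perfect forecast gives RMSE 0 and ACC 1; the weighting prevents the dense polar rows of a latitude-longitude grid from dominating the score.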
Full details are provided in the supplementary material (Sec. S2).

Table 1: 6-hour forecasting results on WeatherBench at two resolutions. AGCD consistently improves RMSE and ACC across backbones. Each cell reports RMSE ↓ / ACC ↑.

At 5.625°:

| Method | T2m [K] | 10m Wind [m/s] | Z500 [m²/s²] | T850 [K] |
|---|---|---|---|---|
| ViT | 1.6859 / 0.9554 | 0.5490 / 0.9723 | 131.01 / 0.9929 | 1.21 / 0.9736 |
| ViT+AGCD | 1.2601 / 0.9768 | 0.4781 / 0.9788 | 113.07 / 0.9951 | 0.98 / 0.9830 |
| CaiT | 1.8747 / 0.9317 | 0.6051 / 0.9674 | 149.95 / 0.9917 | 1.51 / 0.9516 |
| CaiT+AGCD | 1.8703 / 0.9450 | 0.5993 / 0.9678 | 132.80 / 0.9938 | 1.41 / 0.9635 |
| ClimaX | 1.2308 / 0.9759 | 0.4970 / 0.9776 | 88.98 / 0.9972 | 0.92 / 0.9857 |
| ClimaX+AGCD | 0.8843 / 0.9880 | 0.4513 / 0.9812 | 70.22 / 0.9979 | 0.78 / 0.9905 |
| Pangu | 0.4965 / 0.9961 | 0.5963 / 0.9666 | 80.91 / 0.9970 | 0.52 / 0.9951 |
| Pangu+AGCD | 0.4916 / 0.9962 | 0.5507 / 0.9716 | 68.92 / 0.9978 | 0.50 / 0.9954 |

At 1.40625°:

| Method | T2m [K] | 10m Wind [m/s] | Z500 [m²/s²] | T850 [K] |
|---|---|---|---|---|
| ViT | 1.3570 / 0.9710 | 0.5946 / 0.9668 | 80.80 / 0.9970 | 0.91 / 0.9852 |
| ViT+AGCD | 1.2450 / 0.9754 | 0.5600 / 0.9695 | 75.90 / 0.9976 | 0.86 / 0.9871 |
| CaiT | 1.5200 / 0.9658 | 0.6420 / 0.9622 | 104.60 / 0.9953 | 1.06 / 0.9829 |
| CaiT+AGCD | 1.4700 / 0.9682 | 0.6210 / 0.9638 | 96.80 / 0.9960 | 0.99 / 0.9846 |
| ClimaX | 0.7799 / 0.9904 | 0.3443 / 0.9889 | 32.84 / 0.9995 | 0.49 / 0.9957 |
| ClimaX+AGCD | 0.7420 / 0.9912 | 0.3320 / 0.9896 | 31.10 / 0.9996 | 0.46 / 0.9962 |
| Pangu | 0.5147 / 0.9958 | 0.5321 / 0.9733 | 74.36 / 0.9974 | 0.67 / 0.9920 |
| Pangu+AGCD | 0.4551 / 0.9967 | 0.4451 / 0.9814 | 58.73 / 0.9984 | 0.63 / 0.9929 |

Baselines. We evaluate AGCD as a plug-and-play module on both generic vision backbones and weather-specialized forecasters. ViT [18] is a pure Transformer that models an image as a sequence of patch tokens, serving as a strong and scalable generic backbone for grid-like inputs. CaiT [45] extends ViT by introducing class-attention mechanisms to enable deeper image Transformers with improved optimization and representation.
ClimaX [36] is a foundation model for weather and climate that is designed to be flexible over heterogeneous datasets (different variables and spatiotemporal coverage), and can be pretrained and then finetuned for downstream forecasting tasks. Pangu-Weather [4] is a high-resolution global weather forecasting model that performs fast deterministic forecasts with a 3D architecture tailored to atmospheric fields.

Implementation details. MMNP uses fixed prompt templates with a bounded refinement budget R to produce deterministic physics-priors from multi-variable heatmaps. Full MMNP details and all hyperparameters are provided in the supplementary material (Sec. S1-S2; Table S1).

4.2 6-hour Forecasting

For each framework, we report both the vanilla model and its AGCD counterpart obtained by plugging our semantic guidance (MMNP+CRID) into the decoding stage. Table 1 summarizes the 6-hour forecasting performance at 5.625° and 1.40625°. Our proposed plug-and-play AGCD consistently improves all tested backbones, reducing RMSE and increasing ACC on the canonical variables. We provide qualitative comparisons of representative 6-hour forecasts for Z500, T850, T2m, and 10m wind at 1.40625° (Pangu, Fig. 4) and 5.625° (ClimaX), respectively, which show that our proposed method yields results that closely match the ground truth with smaller bias. The 5.625° visualization is deferred to the supplementary material (Sec. S3).

Fig. 4: Qualitative comparison of 6-hour weather forecasting with Pangu and Pangu+AGCD on 1.40625° data across multiple variables. (a) Initial fields at time t. (b) Ground-truth targets at t+6h. (c) Predictions from the vanilla Pangu. (d) Error maps from the vanilla Pangu. (e) Predictions from Pangu with our AGCD. (f) Error maps from Pangu with our AGCD. Error maps visualize Pred − GT.

4.3 Autoregressive Forecasting

Text update rule for rollouts. While our base task is 6-hour forecasting and the narrative prior is intentionally concise, regenerating the full MMNP narrative at every rollout step is unnecessary and inefficient. Therefore, we adopt a lightweight rollout update: we keep the variable-specific describers and evaluator off during rollouts, and reuse only the sequential integration agent as a single-step editor. In all autoregressive experiments, we instantiate this editor with InternVL3.5. Concretely, at step k the editor takes (i) the current predicted meteorological heatmap stack {I_i^(k)} and (ii) the previous-step narrative S^(k-1), then outputs an updated narrative S^(k) by making minimal, evidence-grounded edits:

S^(k) = A_I(S^(k-1), {I_i^(k)}_{i=1}^{N}).  (14)

This yields a causally valid, step-adaptive physics-prior with negligible overhead, while avoiding repeated multi-agent refinement. The updated S^(k) is then encoded by the frozen LLM and injected by CRID for the next rollout step.

We evaluate AGCD via strictly causal autoregressive rollouts with a 6-hour step: starting from the initial state at time t, the model iteratively feeds its own prediction back as input to forecast t+6h, ..., t+48h. Fig. 5 reports the lead-time curves of latitude-weighted RMSE/ACC, and the detailed RMSE results at 12-hour intervals across backbones are deferred to the supplementary material (Sec. S3). Across variables, our AGCD consistently reduces error accumulation and yields more stable trajectories under rollout.

Fig. 5: Autoregressive rollout comparison between Pangu and Pangu+AGCD up to 48 hours (6-hour steps).

5 Discussion

5.1 How Crucial Is Semantic Alignment for Improvement?

We keep the backbone and CRID identical and only change the text: Matched (sample-aligned), Shuffled (mismatched), and Empty (null).

Table 2: Semantic relevance controls. All settings keep the visual backbone (ViT) and CRID identical; only the text input is modified. Each cell reports RMSE ↓ / ACC ↑.

| Text setting | Z500 | T850 | T2m | 10m Wind |
|---|---|---|---|---|
| Vision-only (no text) | 131.01 / 0.9929 | 1.21 / 0.9736 | 1.6859 / 0.9554 | 0.5490 / 0.9723 |
| Matched (ours) | 113.07 / 0.9951 | 0.98 / 0.9830 | 1.2601 / 0.9768 | 0.4781 / 0.9788 |
| Shuffled (mismatch) | 136.40 / 0.9922 | 1.24 / 0.9730 | 1.7120 / 0.9550 | 0.5650 / 0.9718 |
| Empty (null prompt) | 134.80 / 0.9924 | 1.23 / 0.9732 | 1.7050 / 0.9552 | 0.5600 / 0.9720 |

Table 2 shows that improvements appear only with Matched text, while Shuffled/Empty largely remove the benefit and can even underperform the vision-only baseline, confirming that semantic alignment is necessary. Fig. 6 provides a concrete example showing that matched narratives offer localized, state-consistent priors rather than generic text cues. For T850, the narrative explicitly highlights the dynamically active regions over northern Eurasia and Africa, which coincide with the boxed areas where the baseline exhibits structured warm/cold displacement errors. For Z500, the prior emphasizes the Siberian ridge, aligning with the synoptic-scale height pattern and guiding corrections on the corresponding ridge-related error patches. For T2m, the narrative points to a temperature-gradient band around 60°S, matching the sharp frontal-like transitions where the baseline tends to blur gradients and incur coherent bias. For 10m wind, the prior focuses on the North Pacific, consistent with the prominent wind structures and the concentrated error clusters in that region. Across variables, these region-specific priors translate into targeted error reductions in the zoomed-in boxes, supporting that the gain comes from sample-aligned semantic guidance rather than extra text capacity.

Fig. 6: Relevance case study: state-consistent priors yield targeted error reductions.

Table 3: Ablation on MMNP generation strategies (same CRID and backbone (ViT)). Each cell reports RMSE ↓ / ACC ↑.
Text generator                             Z500            T850            T2m             10m wind
                                           RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑
Single-agent (A_I only)                    123.80  0.9937  1.05    0.9802  1.3900  0.9688  0.5070  0.9756
Multi-agent w/o evaluator (A_i^V + A_I)    118.20  0.9944  1.01    0.9819  1.3200  0.9742  0.4920  0.9773
Full MMNP (ours) (A_i^V + A_I + E)         113.07  0.9951  0.98    0.9830  1.2601  0.9768  0.4781  0.9788

5.2 Can multi-agent decomposition enhance narrative reliability?

We compare three narrative generation strategies under the same forecasting backbone and CRID: (1) Single-agent uses only the integration agent A_I to produce a single-pass narrative by taking the full multi-variable heatmap stack \{I_i\}_{i=1}^{N} as input, without variable-wise decomposition or post-hoc verification. (2) Multi-agent w/o evaluator decomposes the input into variable-specific descriptions d_i and integrates them with A_I, but removes the evidence-grounded evaluator E. (3) Full MMNP further adds E to detect and revise omissions and cross-variable inconsistencies. Representative narrative comparisons are provided in the supplementary material (Sec. S3).

Table 3 shows a monotonic improvement as the generation pipeline becomes more reliable: variable-wise decomposition already improves over Single-agent, and adding the evaluator yields further gains. This supports that the benefit comes from better evidence coverage and consistency control, rather than from increasing text length.

5.3 What drives per-variable fidelity?

Finally, fixing the integration agent A_I and evaluator E, we study per-variable fidelity from two angles: (i) swapping the variable-specific agents A_i^V while keeping the rest of the pipeline unchanged (Fig. 7); and (ii) incrementally enabling a subset of A_i^V (Table 4). Fig. 7 shows that stronger variable-specific describers yield consistently better RMSE/ACC across all four canonical variables, confirming that MMNP benefits from higher-quality per-variable evidence rather than simply increasing text capacity.

Fig. 7: Ablation on variable-specific description agents in MMNP (ViT backbone; A_I and E fixed). We report latitude-weighted RMSE (bars; lower is better) and ACC (line; higher is better) on Z500, T850, T2m, and 10m wind.

Table 4: Incremental ablation on enabling variable-specific description agents in MMNP (ViT backbone; A_I and E fixed). RMSE↓ / ACC↑ for 6-hour forecasts.

Enabled A_·              Z500            T850            T2m             10m wind
                         RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑
No MMNP                  131.01  0.9929  1.21    0.9736  1.6859  0.9554  0.5490  0.9723
A_{t2m}                  130.62  0.9929  1.16    0.9752  1.5438  0.9620  0.5352  0.9735
A_{t2m,10m}              130.12  0.9930  1.13    0.9768  1.4820  0.9650  0.5249  0.9745
A_{t2m,10m,t850}         129.66  0.9930  1.10    0.9782  1.4187  0.9678  0.5146  0.9755
A_{t2m,10m,t850,z500}    113.07  0.9951  0.98    0.9830  1.2601  0.9768  0.4781  0.9788

5.4 Which CRID components matter most?

We next validate the design choices in CRID by ablating its key components while keeping the backbone, training protocol, and text generator fixed. We consider removing (i) the region-aware multi-scale tokens, (ii) the Hopfield-based distillation, instead directly performing attention over the concatenated tokens, and (iii) the CMG gating that produces visually aligned text features. Results in Table 5 indicate that each component contributes to the final performance.

Table 5: Ablation on CRID components.

Setting                      Z500            T850            T2m             10m wind
                             RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑    RMSE↓   ACC↑
Vision-only (no CRID)        131.01  0.9929  1.21    0.9736  1.6859  0.9554  0.5490  0.9723
+ Text, w/o Region tokens    124.70  0.9936  1.09    0.9786  1.4200  0.9681  0.5150  0.9750
+ Text, w/o HopfieldPool     115.60  0.9948  1.00    0.9824  1.3000  0.9758  0.4860  0.9780
+ Text, w/o CMG gating       118.90  0.9943  1.02    0.9818  1.3400  0.9748  0.4930  0.9775
Full CRID (ours)             113.07  0.9951  0.98    0.9830  1.2601  0.9768  0.4781  0.9788
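All of the ablations above score forecasts with latitude-weighted RMSE and ACC. As a minimal sketch of these standard WeatherBench-style metrics (our own illustration, not the authors' code; `clim` stands for a climatology field used to form anomalies):

```python
import numpy as np

def lat_weights(lats_deg):
    """cos-latitude weights, normalized so they average to 1."""
    w = np.cos(np.deg2rad(np.asarray(lats_deg, dtype=float)))
    return w / w.mean()

def weighted_rmse(pred, target, lats_deg):
    """Latitude-weighted RMSE over a (lat, lon) field."""
    w = lat_weights(lats_deg)[:, None]           # broadcast across longitudes
    return float(np.sqrt((w * (pred - target) ** 2).mean()))

def weighted_acc(pred, target, clim, lats_deg):
    """Latitude-weighted anomaly correlation coefficient (ACC)."""
    w = lat_weights(lats_deg)[:, None]
    pa, ta = pred - clim, target - clim          # anomalies w.r.t. climatology
    num = (w * pa * ta).sum()
    den = np.sqrt((w * pa ** 2).sum() * (w * ta ** 2).sum())
    return float(num / den)
```

An ACC near 1 means the predicted anomaly pattern matches the observed one; the cos-latitude weighting prevents densely packed polar grid points from dominating either score.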
Notably, region-aware tokens consistently improve variables characterized by sharp gradients or coherent synoptic structures, while Hopfield distillation provides a favorable accuracy–efficiency trade-off by retaining informative cross-modal cues with reduced token complexity. Removing CMG gating degrades performance, suggesting that coarse global visual context (the class token) is beneficial for reweighting and aligning frozen text embeddings before fusion.

6 Conclusions

In this work, we propose AGCD, an explicit and plug-and-play decoding-time prior-injection paradigm for neural weather forecasting. AGCD is motivated by a key gap in existing forecasters: grid-wise regression alone lacks state-aware physics priors, so structural errors and cross-variable inconsistencies can be amplified under autoregressive rollouts. To bridge this gap, we introduce MMNP to produce state-conditioned physics priors with evidence-grounded consistency control, and a lightweight CRID decoder to inject these priors for region-adaptive refinement without changing the backbone interface. Extensive experiments on WeatherBench at 5.625° and 1.40625° demonstrate consistent gains in latitude-weighted RMSE and ACC across both generic vision backbones and weather-specialized forecasters, and improved stability under strictly causal 48-hour autoregressive rollouts. Further analyses and ablations validate that the improvements stem from matched and reliable narratives with sufficient per-variable coverage, rather than merely from adding extra text tokens. In the future, we plan to extend AGCD to broader variable sets and higher-resolution forecasting, explore more efficient state-update and caching strategies for long-horizon deployment, and integrate stronger physically grounded constraints to further enhance robustness in operational settings.
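To make the rollout protocol concrete, here is a minimal sketch of the strictly causal loop with the single-step narrative editor of Eq. (14). All names (`forecaster`, `crid`, `encode_text`, `edit_narrative`) are hypothetical stand-ins for the backbone, the CRID decoder, the frozen LLM encoder, and the InternVL3.5-based editor, not the authors' actual API:

```python
def agcd_rollout(x0, init_narrative, forecaster, crid,
                 encode_text, edit_narrative, steps=8):
    """Strictly causal rollout: each 6 h prediction is fed back as input.

    The full multi-agent MMNP runs only once (producing `init_narrative`);
    afterwards a single editor agent updates the narrative per step (Eq. 14).
    """
    narrative, state, outputs = init_narrative, x0, []
    for _ in range(steps):                            # t+6h, ..., t+48h for steps=8
        prior = encode_text(narrative)                # frozen-LLM physics prior
        state = crid(forecaster(state), prior)        # decoding-time injection
        outputs.append(state)
        narrative = edit_narrative(narrative, state)  # minimal, evidence-grounded edit
    return outputs
```

Because the describers and evaluator stay off during rollout, the per-step overhead reduces to one editor call plus one text encoding, which is what keeps the 48-hour trajectory causally valid yet cheap.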