← Back to papers

Paper deep dive

Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates

Linxiao Yang, Xue Jiang, Gezheng Xu, Tian Zhou, Min Yang, Zhaoyang Zhu, Linyuan Geng, Zhipeng Zeng, Qiming Chen, Xinyue Gu, Rong Jin, Liang Sun

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 90

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/22/2026, 5:53:16 AM

Summary

Baguan-TS is a sequence-native in-context learning (ICL) framework for time series forecasting that eliminates the need for hand-crafted features. It utilizes a 3D Transformer architecture to attend across temporal, variable, and context axes, supported by a target-space retrieval-based local calibration (Y-space RBfcst) and a context-overfitting strategy to mitigate output oversmoothing.

Entities (5)

Baguan-TS · model · 100%
In-context learning · learning-paradigm · 99%
3D Transformer · architecture · 95%
Y-space RBfcst · calibration-module · 95%
Context-overfitting strategy · training-strategy · 92%

Relation Signals (4)

Baguan-TS uses 3D Transformer

confidence 100% · instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes.

Baguan-TS performs In-context learning

confidence 98% · Baguan-TS, which integrates the raw-sequence representation learning with ICL

Baguan-TS employs Context-overfitting strategy

confidence 95% · To counter output oversmoothing, we introduce a mitigation strategy that integrates reliability-aware weighting of support examples

Baguan-TS implements Y-space RBfcst

confidence 95% · To improve calibration and stability, we propose a target-space retrieval-based forecast (Y-space RBfcst) local calibration module.

Cypher Suggestions (2)

Identify the learning paradigm used by the model. · confidence 95% · unvalidated

MATCH (m:Model {name: 'Baguan-TS'})-[:PERFORMS]->(p:LearningParadigm) RETURN p.name

Find all components and strategies associated with the Baguan-TS model. · confidence 90% · unvalidated

MATCH (m:Model {name: 'Baguan-TS'})-[:USES|IMPLEMENTS|EMPLOYS]->(component) RETURN m.name, component.name, labels(component)

Abstract

Abstract: Transformers enable in-context learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via a context-overfitting strategy. On a public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links

Open PDF directly →

Full Text

89,965 characters extracted from source content.


Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates

Linxiao Yang, Xue Jiang, Gezheng Xu, Tian Zhou, Min Yang, Zhaoyang Zhu, Linyuan Geng, Zhipeng Zeng, Qiming Chen, Xinyue Gu, Rong Jin, Liang Sun

Abstract

Transformers enable in-context learning (ICL) for rapid, gradient-free adaptation in time series forecasting, yet most ICL-style approaches rely on tabularized, hand-crafted features, while end-to-end sequence models lack inference-time adaptation. We bridge this gap with a unified framework, Baguan-TS, which integrates raw-sequence representation learning with ICL, instantiated by a 3D Transformer that attends jointly over temporal, variable, and context axes. To make this high-capacity model practical, we tackle two key hurdles: (i) calibration and training stability, improved with a feature-agnostic, target-space retrieval-based local calibration; and (ii) output oversmoothing, mitigated via a context-overfitting strategy. On a public benchmark with covariates, Baguan-TS consistently outperforms established baselines, achieving the highest win rate and significant reductions in both point and probabilistic forecasting metrics. Further evaluations across diverse real-world energy datasets demonstrate its robustness, yielding substantial improvements.

1. Introduction

Time series forecasting increasingly demands models that adapt swiftly to new tasks, remain robust under distribution shift, and operate efficiently in data-limited regimes. While Transformers have revealed the promise of in-context learning (ICL), conditioning predictions on a small support set at inference time without gradient updates, most ICL-style approaches in forecasting rely on tabularization and hand-crafted features (Hoo et al., 2025), limiting their ability to exploit the structure of raw sequences.
On the other hand, end-to-end sequence models excel at learning representations directly from raw time series but typically lack ICL-style, gradient-free adaptation (Ansari et al., 2024; Das et al., 2024). (Authors' affiliation: DAMO Academy, Alibaba Group, Hangzhou, China; arXiv:2603.17439v1 [cs.LG], 18 Mar 2026.) This disconnect motivates a framework that unifies end-to-end representation learning with ICL for time series data, so that a single model can both extract features from raw sequences and adapt on the fly.

Figure 1. Three paradigms for time series forecasting: (a) End-to-end sequence models learn from raw histories but lack in-context adaptation at inference. (b) Tabular ICL approaches (e.g., TabPFN) perform ICL over feature-engineered representations. (c) Our unified approach (Baguan-TS) enables sequence-native ICL on raw multivariate inputs, attending over temporal, variable, and context axes for gradient-free adaptation.

We propose Baguan-TS, a general-purpose architecture that enables sequence-native ICL on raw multivariate time series without hand-crafted features. It builds on a unified, episodic framework that learns to forecast from raw sequences given a support set, thereby removing dependence on feature engineering while bringing ICL's rapid adaptation into a sequence-native architecture. An overview of this framework is shown in Fig. 1. To instantiate the framework, we introduce a 3D Transformer that treats the temporal, variable, and context dimensions as first-class axes of attention. Analogous to video-style modeling, the model builds multiscale representations over the temporal×variable plane and aligns support and query along the context axis via cross-attention. This design lifts the performance ceiling by discovering task-relevant structure directly from raw time series inputs and specializing predictions per episode through the provided context.
In addition, the model admits a 2D inference mode by collapsing the temporal axis to length one, which functions as a strong complementary component in ensembles. In this way, it offers a robust fallback and a diversity boost when integrating multiple predictors.

However, adopting a high-capacity 3D architecture introduces two main challenges we target:

Challenge 1: Locality calibration. Large-capacity models are prone to miscalibration and optimization instability, especially a large 3D Transformer. A lightweight, feature-agnostic mechanism is needed to provide local, episode-specific correction without manual features.

Challenge 2: Output oversmoothing in ICL. In the presence of noisy, heterogeneous support examples, the model can smooth over spurious signals rather than extract stable periodic rules. Effective ICL under these conditions requires a careful balance between denoising (resisting noise and shift) and selection (focusing on a compact, highly relevant subset of samples).

To improve calibration and stability, we propose a target-space retrieval-based forecast (Y-space RBfcst) local calibration module. By referencing nearest support targets within each episode, this feature-agnostic procedure provides a local, distribution-aware adjustment that complements the learned predictor. It improves calibration, stabilizes training and inference under limited per-task data and distribution shift, and integrates naturally with episodic ICL, offering a simple, retrieval-based bias/variance correction that scales with model capacity without relying on hand-crafted features.

To counter output oversmoothing, we introduce a mitigation strategy that integrates reliability-aware weighting of support examples with deliberately overfitting to an exact context sample in the support set. Counterintuitively, this concentrates attention on the few examples that matter for each query.
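The duplicate-context idea can be pictured with a small NumPy sketch (this is our own illustrative simplification, not the paper's training code; the function name, the uniform segment placement, and the single-source copy are all assumptions):

```python
import numpy as np

def duplicate_context_augment(contexts, query, seg_len, rng=None):
    """Copy a short segment from one randomly chosen context lookback
    into the query lookback, returning the augmented query and the
    source index, which an auxiliary self-retrieval loss can supervise."""
    rng = rng or np.random.default_rng(0)
    C, T = contexts.shape
    src = int(rng.integers(C))                  # context slice to overfit to
    start = int(rng.integers(T - seg_len + 1))  # where to paste the motif
    out = query.copy()
    out[start:start + seg_len] = contexts[src, start:start + seg_len]
    return out, src

contexts = np.arange(12, dtype=float).reshape(3, 4)  # C = 3 contexts, T = 4
aug, label = duplicate_context_augment(contexts, np.zeros(4), seg_len=2)
assert aug.shape == (4,) and 0 <= label < 3
```

During training, the model would then be asked to point back at context `label` and reconstruct its target values whenever the pasted motif appears in the query.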
This balance between denoising and selection reduces spurious correlations and improves few-shot robustness when training a large 3D architecture.

Together, these components form a unified framework that marries representation learning with ICL on raw sequences; a 3D Transformer instantiation that raises the performance ceiling (and offers a practical 2D inference mode for ensemble complementarity); and two complementary mechanisms, context-overfitting mitigation and Y-space RBfcst calibration, that make such a high-capacity, ICL-enabled model robust and trainable in practice. The result is a practical route to bringing ICL's strengths to raw time series forecasting under realistic data and shift conditions.

Figure 2. Probabilistic forecasting results on fev-bench-cov (30 covariate-aware tasks): Baguan-TS leads all baselines, achieving the highest average win rate and lowest average SQL.

We summarize our contributions as follows:

• We propose Baguan-TS, a unified end-to-end framework that performs ICL directly on raw multivariate time series, without relying on feature engineering. It is instantiated as a 3D Transformer attending jointly over temporal, variable, and context axes, and also supports a 2D inference mode. On fev-bench-cov (30 covariate-aware tasks), Baguan-TS achieves the best average scaled quantile loss (SQL) and MASE with the highest win rate, reducing SQL versus TabPFN-TS by 4.8% (Fig. 2).

• We develop a Y-space RBfcst local calibration module, feature-agnostic and episode-specific, that improves calibration, data efficiency, and scalability when training larger 3D Transformers.
In our experiments, this module consistently improves overall forecasting accuracy and robustness under injected noise compared to training without retrieval.

• We introduce a context-overfitting strategy that explicitly balances sample denoising and sample selection, stabilizing in-context learning in high-capacity models. The strategy consistently lowers training loss and restores periodic spike reconstruction, mitigating oversmoothing without harming trend accuracy.

2. Related Work

We situate Baguan-TS within three lines of prior work: end-to-end time series forecasting, large pretrained time series models, and in-context modeling. We summarize the key capability differences between Baguan-TS and these approaches in Table 1.

Table 1. Capability comparison of representative time series forecasting approaches. Columns: raw time sequence / ICL (gradient-free) / no hand-crafted features / cross-variable interaction / local retrieval-based calibration.

• End-to-end models (PatchTST, FedFormer): ✓ · × · ✓ · ×
• Large pretrained models (Chronos, TimesFM, TiRex): ✓ · × · ✓ · ×
• Tabular ICL-style models (TabPFN-TS): × · ✓ · × · ✓ · ×
• Ours (Baguan-TS): ✓

Time Series Forecasting. Deep neural networks dominate modern time series forecasting, with univariate models focusing on single sequences (Rangapuram et al., 2018; Salinas et al., 2020; Oreshkin et al., 2020) and multivariate models jointly modeling many correlated series using Transformers (Wu et al., 2021; Zhou et al., 2022b; Nie et al., 2023; Liu et al., 2024a; Wang et al., 2024b; Chen et al., 2024; Zhou et al., 2023) and other architectures (Sen et al., 2019; Zhou et al., 2022a; Jin et al., 2022; Wang et al., 2024a; Hu et al., 2024; Qi et al., 2024). While highly effective on their training domains, these models typically require task-specific retraining, limiting their adaptability and motivating more general-purpose approaches.
Large Pretrained Time Series Models. Recent work pretrains large sequence models for time series (Woo et al., 2024; Goswami et al., 2024; Ansari et al., 2024; Das et al., 2024; Rasul et al., 2023; Liu et al., 2024b; Shi et al., 2025), often with Transformer-based architectures and diverse time series data. However, their zero- and few-shot performance, especially for multivariate forecasting with complex cross-channel dependencies, often still trails specialized models.

In-Context Modeling. A complementary line of work treats forecasting as conditional generation and relies on in-context (few-shot) adaptation (Hoo et al., 2025; Feuer et al., 2024; Zhu et al., 2023; Dooley et al., 2023; Hegselmann et al., 2023). While competitive with strong tabular learners, these methods typically rely on hand-crafted features (Chen & Guestrin, 2016; Ke et al., 2017), limiting their practical usage in diverse applications. In contrast, end-to-end sequence models are preferred for directly capturing temporal structures and cross-channel interactions.

3. Baguan-TS

3.1. Problem Formulation

In this paper, we consider univariate time series forecasting with known covariates. Let {y_t}_{t=1}^{T} be the observed series and {x_t}_{t=1}^{T+H}, with x_t ∈ R^M, the associated M-dimensional covariates, available for both the history and the forecast horizon H (e.g., calendar effects or known exogenous inputs such as weather forecasts). The task is to predict the future values {y_t}_{t=T+1}^{T+H}.

For a test instance, let y^b = y_{1:T} ∈ R^T denote the lookback window, y^f = y_{T+1:T+H} ∈ R^H the future values, and X ∈ R^{(T+H)×M} the covariates over t = 1, ..., T+H. We assume an underlying mapping y^f = f*(y^b, X). In the in-context learning setting, we are given a context set D_c = {(X_j, y^b_j, y^f_j)}_{j=1}^{C} of C input–output examples.
At test time, we adapt to a target instance by leveraging an inference function g to approximate f*, where the approximation is conditioned on D_c to minimize the prediction error l(y^f, g(X, y^b, D_c)). Our objective is to learn a universal function g that can adaptively fit a wide range of different underlying mappings f* across diverse tasks.

For convenience, we concatenate covariates and the full series into Y_i = [X_i, y_i] ∈ R^{(T+H)×(M+1)}, where y_i ∈ R^{T+H} stacks (y^b_i, y^f_i). Stacking all context and target samples yields a tensor Y ∈ R^{(C+1)×(T+H)×(M+1)}, whose first C slices correspond to D_c and whose last slice is the target instance. For the target slice, y^f is unknown and stored as a mask, while it is observed for the context slices.

3.2. Architecture

Fig. 3 shows the overall architecture of Baguan-TS, with its components detailed in the following subsections.

3.2.1. Patching and Tokenization

Given the tensor Y ∈ R^{(C+1)×(T+H)×(M+1)}, we apply an encoder based on temporal patching and random Fourier features (Rahimi & Recht, 2007). Each variable in each slice is split along the temporal axis into S non-overlapping patches of length P (zero-padding the tail if needed), reducing sequence length while preserving local structure.

For each covariate patch q ∈ R^P, we map it to e = [cos(φ); sin(φ)] ∈ R^D, with φ = Wq + b, where W and b are shared learnable parameters.

Figure 3. Overall architecture of Baguan-TS. The input tensor Y ∈ R^{(C+1)×(T+H)×(M+1)} is encoded into patch tokens, then iteratively processed by stacked 3D Transformer blocks performing variable, temporal, and context attention, and finally mapped by a prediction head to produce the forecasting outputs y^f ∈ R^H.

For the target series, we zero-fill the unknown future and use a mask m ∈ {0, 1}^P to indicate observed positions.
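A minimal NumPy sketch of this patch-and-Fourier-feature tokenizer (shapes follow the text; treating W and b as given matrices is a simplification, since the paper learns them end to end, and the mask indicator term is omitted):

```python
import numpy as np

def rff_patch_embed(series, P, W, b):
    """Split a 1-D series into S non-overlapping length-P patches
    (zero-padding the tail), then map each patch q to
    e = [cos(W q + b); sin(W q + b)], a D = 2 * len(b) embedding."""
    T = len(series)
    S = -(-T // P)                        # ceil(T / P) patches
    padded = np.zeros(S * P)
    padded[:T] = series
    patches = padded.reshape(S, P)        # (S, P)
    phi = patches @ W.T + b               # (S, D / 2)
    return np.concatenate([np.cos(phi), np.sin(phi)], axis=-1)  # (S, D)

rng = np.random.default_rng(0)
P, half_D = 8, 16
W, b = rng.normal(size=(half_D, P)), rng.normal(size=half_D)
tokens = rff_patch_embed(rng.normal(size=20), P, W, b)
assert tokens.shape == (3, 2 * half_D)    # ceil(20 / 8) = 3 patches
```

Because the embedding is a fixed trigonometric map of a learned affine projection, every token is bounded, which helps keep the downstream attention numerically stable.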
Its embedding e^f adds an indicator term Vm to the Fourier features to distinguish the history from the forecast horizon. The resulting token tensor has shape (C+1) × S × (M+1) × D.

3.2.2. 3D Transformer Block

The core of our architecture is the 3D Transformer block, an encoder-only module that jointly models correlations across the variable, temporal, and context dimensions. Given the encoded tokens, it applies a sequence of specialized attention layers over these axes, followed by a feed-forward network (FFN). Residual connections and layer normalization follow each attention layer and the FFN.

Compared to conventional 1D or 2D Transformer blocks (Hoo et al., 2025; Ansari et al., 2024) that operate on flattened or reduced-dimensional embeddings, our block preserves a structured three-dimensional layout. This factorized design provides a stronger inductive bias while retaining the flexibility of the standard Transformer.

Formally, let Z ∈ R^{(C+1)×S×(M+1)×D} denote the token representation (contexts × temporal patches × variables × embedding dimension). All attention layers use standard multi-head self-attention (MHA) (Vaswani et al., 2017) along different axes of this representation. We describe these stages in detail below.

Temporal Attention. This module learns temporal dependencies within each context to capture how patterns evolve over time. For each fixed context c and variable m, we extract the slice T_{c,m} = Z_{c,:,m,:} ∈ R^{S×D} as the representation along the temporal axis. To model these dependencies, we apply MHA integrated with Rotary Position Embeddings (RoPE). By encoding relative phase information into the query and key vectors, RoPE allows the mechanism to naturally capture relative temporal distances and periodic patterns, which are crucial for time series forecasting.

Variable Attention. Instead of temporal sequences, this branch operates on the variable dimension V_{c,s} = Z_{c,s,:,:} ∈ R^{(M+1)×D} for each time step s.
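All three branches share one pattern: pick an axis of the 4-D token tensor as the attention sequence and batch over the rest. A toy single-head NumPy sketch of that pattern (plain softmax attention; RoPE, the variable embeddings, and multi-head splitting are deliberately omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(Z, axis, Wq, Wk, Wv):
    """Single-head self-attention along one axis of Z with shape
    (C+1, S, M+1, D); the remaining axes act as batch dimensions."""
    Zm = np.moveaxis(Z, axis, -2)                 # (..., L, D)
    q, k, v = Zm @ Wq, Zm @ Wk, Zm @ Wv
    att = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1]))
    return np.moveaxis(att @ v, -2, axis)         # restore original layout

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 6, 3, 8))                 # (C+1, S, M+1, D)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
for ax in (0, 1, 2):                              # context, temporal, variable
    assert axis_attention(Z, ax, Wq, Wk, Wv).shape == Z.shape
```

Factorizing this way keeps each attention matrix small (length of one axis, not their product), which is what makes the 3D layout tractable.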
To account for distinct variable semantics, we augment the representation with learnable variable-wise embeddings before performing MHA to capture cross-variable correlations.

Context Attention. This branch models relationships across different instances by taking the slice C_{s,m} = Z_{:,s,m,:} ∈ R^{(C+1)×D}. Unlike the temporal and variable branches, context attention omits positional encodings, as the context dimension typically lacks an inherent sequential ordering and instead focuses on global information sharing.

3.2.3. Prediction Head

After a stack of 3D Transformer blocks, we obtain latent representations for future patches. A lightweight MLP head maps these features to per-horizon predictive distributions. Because inputs are z-normalized on the lookback (approximately zero mean and unit variance), most forecast values lie within a bounded range; we fix [−10, 10] and uniformly discretize it into K = 5000 bins. For each horizon step, the head outputs logits over these bins; applying softmax yields a probability vector p ∈ R^K. This provides a full probabilistic forecast with both point estimates (via the expected value over bin centers) and quantiles (via the empirical CDF).

3.3. Training

Our 3D Transformer operates on context–target episode sets. During training, we (i) build these contexts via retrieval-based forecasting; and (ii) stabilize in-context learning with a context-overfitting strategy. Implementation details on training data, loss functions, and inference settings are in Appendix A.

3.3.1. Retrieval-Based Forecasting

To construct informative and scalable contexts, we adopt retrieval-based forecasting (RBfcst) to select informative contexts at scale. For a series of length N and a context window T+H, there are N−T−H+1 candidate subsequences of length T+H, so full enumeration is memory-intensive.
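For concreteness, the discretized head of Sec. 3.2.3 can be sketched as follows (a simplification under stated assumptions: K is shrunk from the paper's 5000 for readability, and the zero logits stand in for the MLP output):

```python
import numpy as np

K = 100                                    # paper: K = 5000 bins on [-10, 10]
edges = np.linspace(-10.0, 10.0, K + 1)
centers = 0.5 * (edges[:-1] + edges[1:])

def head_outputs(logits, q=0.5):
    """Turn one horizon step's bin logits into a point forecast
    (expectation over bin centers) and a quantile (empirical CDF)."""
    p = np.exp(logits - logits.max())      # numerically stable softmax
    p /= p.sum()
    point = float(p @ centers)
    quantile = float(centers[np.searchsorted(np.cumsum(p), q)])
    return point, quantile

point, median = head_outputs(np.zeros(K))  # uniform bins: symmetric forecast
assert abs(point) < 1e-6 and abs(median) <= 0.2
```

Any quantile comes from the same probability vector, so point and probabilistic metrics (e.g., MASE and SQL) are served by a single head.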
Instead, given the target lookback y^b ∈ R^T as the query, we retrieve its K_ctx nearest T-length subsequences from a sliding-window index over the historical region (to avoid leakage). Each retrieved T-length window, together with the subsequent H points, forms one context of length T+H (with covariates aligned). Distances are computed on z-normalized windows (e.g., Euclidean or cosine), and K_ctx is chosen to balance relevance and memory.

To align training with this strategy and increase diversity, we first select a reference sequence of length T+H, sample a lookback of length T, and retrieve the 2K_ctx nearest T-length subsequences using the same procedure. This leads to 2K_ctx + 1 candidates (including the selected reference), which are stacked into a tensor of shape (2K_ctx + 1) × S × (M+1). We then randomly choose K_ctx slices to form a tensor of shape K_ctx × S × (M+1) for that step. A subset of these K_ctx slices is designated as the target (with its future masked), and the remainder serve as context. This randomized subsampling from a larger pool mimics imperfect retrieval, encouraging context diversity and robustness to retrieval noise.

Related tabular foundation models (Thomas et al., 2024; Xu et al., 2025; Gorishniy et al., 2024) typically retrieve in covariate space (X-space).
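A brute-force toy version of this Y-space retrieval step (our own sketch: a production index would be approximate, and the pure-Euclidean distance and leakage cutoff here simplify the cosine/L2 mix and sliding-window index described above):

```python
import numpy as np

def znorm(w):
    return (w - w.mean()) / (w.std() + 1e-8)

def retrieve_contexts(series, T, H, k_ctx):
    """Use the last T points as the query lookback and return start
    indices of the k_ctx nearest T-length historical windows, keeping
    only windows whose T+H span ends before the query (no leakage)."""
    query = znorm(series[-T:])
    last_start = len(series) - 2 * T - H     # each window needs H future points
    starts = np.arange(last_start + 1)
    dists = np.array([np.linalg.norm(znorm(series[s:s + T]) - query)
                      for s in starts])
    return starts[np.argsort(dists)[:k_ctx]] # context j spans [s_j, s_j + T + H)

t = np.arange(192)
series = np.sin(2 * np.pi * t / 24)          # toy series with period 24
idx = retrieve_contexts(series, T=24, H=12, k_ctx=3)
assert len(idx) == 3
assert all(s % 24 == 0 for s in idx)         # retrieved windows are phase-aligned
```

On this periodic toy series the nearest windows land exactly one period apart from the query, which is the behavior the t-SNE comparison below is probing.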
In contrast, we retrieve by similarity of the target history (Y-space), i.e., nearest neighbors of the lookback segment. For time series, the trajectory often summarizes the combined effects of all drivers, making Y-space retrieval more informative and feature-agnostic, without hand-crafted feature design.

Figure 4. Context organization strategies and t-SNE visualization. (a) Three context organization methods: (i) uniform splits (green); (ii) covariate-based retrieval in X-space (purple, relies on feature engineering); (iii) target-based retrieval in Y-space (ours, yellow), which focuses on historical patterns and is feature-agnostic. Higher similarity scores indicate stronger contextual relevance. (b) t-SNE plots of prediction horizons on epf (top) and entsoe (bottom) for three RBfcst variants. The ground truth (red star) lies closest to the Y-space RBfcst cluster (shaded), indicating it best captures the true pattern.

To illustrate robustness, we compare three context-organization strategies in Fig. 4: (i) uniform splitting (green), where windows are taken at equal intervals; (ii) X-space retrieval (purple), guided by covariate similarity and feature engineering; and (iii) Y-space retrieval (yellow, ours), which focuses on historical patterns most similar to the target. Higher similarity scores indicate stronger contextual relevance. t-SNE plots on epf (top) and entsoe (bottom) show that horizons retrieved in Y-space cluster near the ground truth (red star), suggesting this strategy best captures the true pattern.

3.3.2. Context-Overfitting Strategy

The direct application of in-context learning to time series forecasting reveals a critical limitation: during early training, models tend to predict smooth, low-frequency trends while systematically suppressing periodic spike signals. This happens even when identical spike patterns are clearly present in the context window, indicating that the model misinterprets such abrupt fluctuations as noise rather than valid signal components. Although this tendency may enhance robustness against irregular outliers in certain datasets, it severely undermines performance in time series domains where periodic spikes constitute essential features (e.g., physiological monitoring or demand surges). To mitigate this oversmoothing issue, where the model fails to leverage contextual spike templates for query inference, we introduce a context-overfitting strategy. Concretely, we enrich the query with a short segment copied verbatim from one context slice and add an auxiliary self-retrieval objective: the model is required to identify and align the matching context segment and reconstruct the corresponding target values (see Fig. 5, where we include a synthetic test sample x_2 whose lookback contains a motif duplicated from the context, and train the model to retrieve that context). This explicitly encourages template matching: the model must identify and retrieve target
Illustration of the context-overfitting strategy. (a) Origi- nal design, where the model forecasts query targets from retrieved context episodes. (b) Duplicate-context design: a short segment from one context slice is copied into a query, and the model is trained to retrieve the matching context and reconstruct its targets. 5 Baguan-TS: A Sequence-Native In-Context Learning Model for Time Series Forecasting with Covariates Figure 6. Effect of the context-overfitting strategy. Main: train- ing loss curves for the baseline model (green) and our context- overfitting strategy (red). Insets: example forecasts compared with ground truth (blue). The baseline oversmooths outputs and misses high-frequency spikes (right); our strategy keeps a low loss while recovering spike patterns by matching contextual templates (left). values from the context when encountering identical patterns in the query. As demonstrated by the training dynamics and qualitative results in Fig. 6, the baseline model underutilizes contextual templates and produces oversmoothed predic- tions, while our strategy enables accurate reconstruction of periodic spikes without compromising trend accuracy. 3.4. Adaptive Inference A key property of the 3D Transformer is its structural flexi- bility: it can operate either as a full sequence-native model or as a 2D tabular-style predictor. Setting the context length S = 1collapses the temporal axis and restrict attention to feature and context dimensions. We find this 2D mode useful when temporal dependence is mostly captured by covariates or is weak (e.g., unreliable history under distri- bution shift). Thus, we ensemble the 2D and 3D modes in inference, which consistently improves SQL and reduces quantile calibration errors compared to using either mode alone. 4. Experiments 4.1. 
Experiment Settings

We compare Baguan-TS with recent time series foundation models: Sundial-Base (Liu et al., 2025b), TabPFN-TS (Hoo et al., 2025), TimesFM-2.0 (Das et al., 2024), TiRex (Auer et al., 2025), Toto-1.0 (Cohen et al., 2025), Chronos-Bolt (Ansari et al., 2024), and Moirai-2.0 (Liu et al., 2025a). Among them, only TabPFN-TS natively supports covariates; models without covariate support are evaluated on target-only inputs.

We report mean absolute scaled error (MASE), weighted absolute percentage error (WAPE), scaled quantile loss (SQL), and weighted quantile loss (WQL), along with the corresponding average win rates for each metric (Shchur et al., 2025), averaged over horizons and macro-averaged across datasets. Further details of the datasets, experimental setup, and additional results are provided in Appendices B and C.

4.2. Zero-Shot Forecasting

4.2.1. Time Series Forecasting with Covariates

Evaluation on public datasets. We first evaluate Baguan-TS on the fev-bench benchmark (Shchur et al., 2025). Focusing on time series tasks with covariate information, we select 30 representative tasks from the 100 available datasets that provide at least one known dynamic covariate, and refer to this subset as fev-bench-cov. We evaluate both point and probabilistic forecasting performance, comparing Baguan-TS against leading time series foundation models. The results are reported in Fig. 2 and Figs. 18–20 (Appendix C.1). As shown in Table 2, Baguan-TS consistently outperforms both univariate baselines, including TiRex and Toto-1.0, and models specifically designed for covariate-informed forecasting, such as TabPFN-TS. This demonstrates that Baguan-TS can effectively leverage both historical target series and future covariates for accurate predictions.

Table 2. Average results on fev-bench-cov. The best results are highlighted in bold, and the second-best results are underlined.
| Model | SQL | MASE | WAPE | WQL |
|---|---|---|---|---|
| Chronos-Bolt | 0.9190 | 1.1233 | 0.2702 | 0.2181 |
| Moirai-2.0 | 0.9202 | 1.1233 | 0.2750 | 0.2226 |
| Sundial-Base | 1.0172 | 1.1449 | 0.2813 | 0.2498 |
| TimesFM-2.5 | 0.8276 | 1.0122 | 0.2592 | 0.2095 |
| TabPFN-TS | 0.8404 | 1.0430 | 0.2015 | 0.1643 |
| TiRex | 0.8927 | 1.1081 | 0.2753 | 0.2233 |
| Toto-1.0 | 0.9600 | 1.1755 | 0.2737 | 0.2248 |
| Ours | 0.7997 | 0.9857 | 0.2173 | 0.1724 |

Evaluation on real-world applications. We further evaluate Baguan-TS on 27 real-world datasets with historical and future covariates (e.g., weather, calendar). Table 3 reports averages over all datasets. Among the baselines, only TabPFN-TS and Baguan-TS use covariates; the others operate on the target series only. Baguan-TS achieves the best overall SQL, MASE, WAPE, and WQL, improving over TabPFN-TS and all non-covariate baselines, indicating more effective use of contextual signals. Detailed dataset descriptions and experimental results are provided in Appendix C.1.

Table 3. Average results on real-world application datasets. The best results are highlighted in bold, and the second-best results are underlined.

| Model | SQL | MASE | WAPE | WQL |
|---|---|---|---|---|
| Chronos-Bolt | 0.5101 | 0.6350 | 0.0976 | 0.0783 |
| Moirai-2.0 | 0.7139 | 0.8875 | 0.1031 | 0.0860 |
| Sundial-Base | 0.7486 | 0.9042 | 0.1133 | 0.0981 |
| TabPFN-TS | 0.4663 | 0.6115 | 0.0688 | 0.0537 |
| TimesFM-2.0 | 0.8418 | 0.9278 | 0.1093 | 0.0954 |
| TiRex | 0.5710 | 0.7094 | 0.0975 | 0.0784 |
| Toto-1.0 | 0.8414 | 1.0566 | 0.1269 | 0.1034 |
| Ours | 0.3974 | 0.4956 | 0.0597 | 0.0490 |

4.2.2. Univariate Time Series Forecasting

We next consider univariate time series forecasting without covariates, where Baguan-TS adapts by incorporating null or time-only feature columns. Specifically, we evaluated 10 tasks from the fev-bench-mini dataset (Shchur et al., 2025), where multivariate datasets are flattened into standard univariate forecasting problems; we refer to this subset as fev-bench-uni. As shown in Fig. 7, Baguan-TS achieves competitive macro-averaged SQL across tasks even without covariates. Notably, it outperforms TabPFN-TS, which also natively supports covariates, on more than 50% of these datasets; it also achieves SOTA performance in terms of macro-averaged WAPE and WQL (see Fig. 21, Appendix C.2). These results indicate strong macro-level stability and robustness on aggregate-sensitive metrics, highlighting system-level accuracy.

Figure 7. Evaluation on fev-bench-uni (10 univariate/multivariate tasks). Baguan-TS achieves competitive probabilistic forecasting performance on tasks without covariates.

4.3. Ablation Study

4.3.1. Effect of RBfcst

We first ablate the proposed RBfcst module on an energy task (entsoe1h) and a retail task (hermes). The baseline (w/o RBfcst) uses uniformly split contexts. We compare three retrieval variants: X-space (covariates), XY-space (covariates + targets), and our Y-space RBfcst (targets only). All variants use the average of cosine and L2 distances for segment similarity.

As shown in Table 4, Y-space RBfcst consistently outperforms both X-space and XY-space variants, all of which significantly surpass the w/o RBfcst baseline. This validates the critical role of RBfcst in ICL-based time series forecasting and confirms that historical target trajectories generally provide more informative context than covariates, without requiring sophisticated feature selection or aggregation mechanisms.
The performance advantage of Y-space RBfcst is particularly large on entsoe1h, whereas the gap is smaller on hermes, which has shorter lookback windows and sparser covariates.

Table 4. Probabilistic and point forecasting evaluation for different RBfcst variants on the entsoe1H and hermes datasets.

Method       |          entsoe1H               |           hermes
             | SQL     MASE    WAPE    WQL     | SQL     MASE    WAPE    WQL
w/o RBfcst   | 0.4284  0.5387  0.0328  0.0262  | 0.6607  0.8445  0.0032  0.0025
X-space      | 0.4185  0.5178  0.0313  0.0256  | 0.6186  0.7973  0.0030  0.0023
XY-space     | 0.4005  0.4972  0.0297  0.0241  | 0.6186  0.7972  0.0030  0.0023
Y-space      | 0.3859  0.4819  0.0287  0.0231  | 0.6185  0.7971  0.0030  0.0023

Figure 8. Robustness evaluation on epffr under Gaussian white noise, random walk noise, and periodic noise, with κ ∈ {0, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}. Curves compare Ours (X-space, XY-space, and Y-space RBfcst) against TabPFN-TS.

To further evaluate the robustness of our local calibration module, we conduct noise-injection experiments on epffr using three representative noise types: (i) Gaussian white noise (stationary); (ii) random walk noise (non-stationary with trend); and (iii) periodic noise (seasonal interference). The noise intensity is controlled by a scaling factor κ relative to the series' empirical standard deviation (see Appendix B.2 for details). As shown in Fig. 8, although TabPFN-TS achieves the best SQL in the clean setting (κ = 0), its performance degrades sharply under stationary Gaussian noise as κ increases (especially when κ ≥ 0.1). In contrast, Baguan-TS maintains stable performance across all noise types and intensities. Notably, the Y-space RBfcst variant exhibits the highest robustness, with minimal performance drop even at high noise levels. This shows that our feature-agnostic locality calibration effectively mitigates the impact of diverse external perturbations.

4.3.2.
EFFECT OF CONTEXT-OVERFITTING

We ablate the proposed context-overfitting strategy by retraining two models (a full model with context-overfitting and a baseline without it) from scratch for 220K steps on a dataset with clear daily and weekly seasonality and periodic spikes. Fig. 6 shows training losses and qualitative forecasts. With context-overfitting, the training loss is consistently lower, which is unsurprising: the auxiliary self-retrieval task (recovering targets for a query containing a duplicated context segment) is easier than forecasting unseen targets. More importantly, without context-overfitting the forecasts are overly smooth and miss most spikes, while the full model captures much more high-frequency structure and accurately predicts the periodic spikes without distorting the trend.

Figure 9. Ablation of the context-overfitting strategy. (a) RMSE vs. spike sharpness: the full model (red) consistently outperforms the baseline (green) on synthetic periodic spike data for all k, where larger k corresponds to sharper spikes. (b) Attention weight map for a representative peak step (orange dotted line).

To further assess robustness and clarify the mechanism, we compare these two models on synthetic periodic toy data. As shown in Fig. 9a, the full model substantially reduces RMSE and more consistently recovers periodic spikes from the lookback window. To examine attention behavior, Fig.
9b visualizes a local attention map: with context-overfitting, the peak step attends strongly to historical peaks, whereas the baseline exhibits a pronounced dip at these locations, supporting our explanation of over-smoothing in vanilla attention. Details of the data generation and additional visualizations are provided in Appendix B.3.

4.3.3. DIFFERENT INFERENCE MODES

In this section, we compare the 2D, 3D, and ensemble inference modes. Averaged results on fev-bench-cov are presented in Table 5. While the standalone 3D and 2D modes already deliver strong performance, their ensemble consistently improves all metrics for both point and probabilistic forecasting. To understand these gains, we compute the Pearson correlation between 2D and 3D residuals pooled across all tasks (Fig. 10a), obtaining 0.75. This confirms that both modes are accurate and successfully capture the dominant temporal dynamics, while the correlation being well below 1.0 indicates diverse error structures. To further visualize this diversity, we plot the joint residual distribution for the entsoe1H task in Fig. 10b, where points in the second and fourth quadrants highlight opposite-signed residuals, suggesting complementary errors. We also evaluate

Table 5. Average results on fev-bench-cov under different inference modes of the Baguan-TS model.

Model            SQL      MASE     WAPE     WQL
Ours (2D mode)   0.8769   1.0705   0.2333   0.1863
Ours (3D mode)   0.8333   1.0220   0.2285   0.1830
Ours (Ensemble)  0.7997   0.9857   0.2173   0.1724

Figure 10. Residual correlation analysis of 2D and 3D inference modes. (a) Aggregated Pearson correlation across 30 tasks in fev-bench-cov. (b) Joint residual distribution for entsoe1H, where darker blue bins indicate higher data density. Points along the red dashed identity line (y = x) represent cases in which the errors in both modes are equal.
Deviations from this diagonal, particularly in the second and fourth quadrants, highlight the complementary error structures between the two modes.

the probabilistic calibration using calibration histograms (Fig. 11) (Gneiting et al., 2007). The x-axis shows ground-truth quantile levels under predicted CDFs; perfectly calibrated models yield a uniform histogram. We observe opposite systematic biases in the base modes: the 2D mode shows a peaked histogram indicating under-confident predictions, whereas the 3D mode exhibits a U-shaped distribution with over-confidence. The ensemble mode balances these errors, yielding a near-uniform histogram for better calibration.

Figure 11. Calibration histograms of ground-truth quantile levels in predicted CDFs for different inference modes on entsoe1H. A well-calibrated model yields a uniform histogram (red dashed line). The 2D mode is under-confident (peaked), the 3D mode is over-confident (U-shaped), whereas the ensemble effectively cancels these biases and approaches the ideal uniform distribution.

5. Conclusion

We presented Baguan-TS, a unified framework that brings in-context learning to raw time series by combining a 3D Transformer over temporal, variable, and context axes with a practical 2D+3D inference scheme. A target-space retrieval-based calibration module and a context-overfitting strategy make this high-capacity architecture more stable, better calibrated, and less prone to oversmoothing. Across fev-bench-cov, fev-bench-uni, and 27 in-house datasets, Baguan-TS consistently exhibits strong performance, showing that raw-sequence ICL can be both effective and robust in realistic forecasting settings.
References

Aksu, T., Woo, G., Liu, J., Liu, X., Liu, C., Savarese, S., Xiong, C., and Sahoo, D. GIFT-Eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024.

Ansari, A. F., Stella, L., Turkmen, C., Zhang, X., Mercado, P., Shen, H., Shchur, O., Rangapuram, S. S., Arango, S. P., Kapoor, S., et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815, 2024.

Auer, A., Podest, P., Klotz, D., Böck, S., Klambauer, G., and Hochreiter, S. TiRex: Zero-shot forecasting across long and short horizons with enhanced in-context learning. arXiv preprint arXiv:2505.23719, 2025.

Chang, K.-M. Arrhythmia ECG noise reduction by ensemble empirical mode decomposition. Sensors, 10(6):6063–6080, 2010.

Chen, P., Zhang, Y., Cheng, Y., Shu, Y., Wang, Y., Wen, Q., Yang, B., and Guo, C. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. In Proceedings of the International Conference on Learning Representations, 2024.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

Cohen, B., Khwaja, E., Doubli, Y., Lemaachi, S., Lettieri, C., Masson, C., Miccinilli, H., Ramé, E., Ren, Q., Rostamizadeh, A., et al. This time is different: An observability perspective on time series foundation models. arXiv preprint arXiv:2505.14766, 2025.

Das, A., Kong, W., Sen, R., and Zhou, Y. A decoder-only foundation model for time-series forecasting. In Proceedings of the 41st International Conference on Machine Learning, 2024.

Dooley, S., Khurana, G. S., Mohapatra, C., Naidu, S. V., and White, C. ForecastPFN: Synthetically-trained zero-shot forecasting. Advances in Neural Information Processing Systems, 36:2403–2426, 2023.

Feuer, B., Schirrmeister, R.
T., Cherepanova, V., Hegde, C., Hutter, F., Goldblum, M., Cohen, N., and White, C. TuneTables: Context optimization for scalable prior-data fitted networks. Advances in Neural Information Processing Systems, 37:83430–83464, 2024.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):243–268, 2007.

Gorishniy, Y., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., and Babenko, A. TabR: Tabular deep learning meets nearest neighbors. In Proceedings of the 12th International Conference on Learning Representations, 2024.

Goswami, M., Szafer, K., Choudhry, A., Cai, Y., Li, S., and Dubrawski, A. MOMENT: A family of open time-series foundation models. In Proceedings of the 41st International Conference on Machine Learning, 2024.

Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. TabLLM: Few-shot classification of tabular data with large language models. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pp. 5549–5581, 2023.

Hoo, S. B., Müller, S., Salinas, D., and Hutter, F. The tabular foundation model TabPFN outperforms specialized time series forecasting models based on simple features. arXiv preprint arXiv:2501.02945, 2025.

Hu, J., Hu, Y., Chen, W., Jin, M., Pan, S., Wen, Q., and Liang, Y. Attractor memory for long-term time series forecasting: A chaos perspective. Advances in Neural Information Processing Systems, 37:20786–20818, 2024.

Jin, M., Zheng, Y., Li, Y.-F., Chen, S., Yang, B., and Pan, S. Multivariate time series forecasting with dynamic graph neural ODEs. IEEE Transactions on Knowledge and Data Engineering, 35(9):9168–9180, 2022.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 2017.

Kim, G. I.
and Chung, K. Extraction of features for time series classification using noise injection. Sensors, 24(19):6402, 2024.

Kim, T., Kim, J., Tae, Y., Park, C., Choi, J.-H., and Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, 2022.

Liu, C., Aksu, T., Liu, J., Liu, X., Yan, H., Pham, Q., Sahoo, D., Xiong, C., Savarese, S., and Li, J. Moirai 2.0: When less is more for time series forecasting. arXiv preprint arXiv:2511.11698, 2025a.

Liu, Y., Hu, T., Zhang, H., Wu, H., Wang, S., Ma, L., and Long, M. iTransformer: Inverted transformers are effective for time series forecasting. In Proceedings of the 12th International Conference on Learning Representations, 2024a.

Liu, Y., Zhang, H., Li, C., Huang, X., Wang, J., and Long, M. Timer: Generative pre-trained transformers are large time series models. In Proceedings of the 41st International Conference on Machine Learning, 2024b.

Liu, Y., Qin, G., Shi, Z., Chen, Z., Yang, C., Huang, X., Wang, J., and Long, M. Sundial: A family of highly capable time series foundation models. arXiv preprint arXiv:2502.00816, 2025b.

Nie, Y., Nguyen, N. H., Sinthong, P., and Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In Proceedings of the 11th International Conference on Learning Representations, 2023.

Oreshkin, B. N., Carpov, D., Chapados, N., and Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the International Conference on Learning Representations, 2020.

Qi, S., Xu, Z., Li, Y., Wen, L., Wen, Q., Wang, Q., and Qi, Y. PDETime: Rethinking long-term multivariate time series forecasting from the perspective of partial differential equations. arXiv preprint arXiv:2402.16913, 2024.

Rahimi, A.
and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.

Rangapuram, S. S., Seeger, M. W., Gasthaus, J., Stella, L., Wang, Y., and Januschowski, T. Deep state space models for time series forecasting. Advances in Neural Information Processing Systems, 31, 2018.

Rasul, K., Ashok, A., Williams, A. R., Khorasani, A., Adamopoulos, G., Bhagwatkar, R., Biloš, M., Ghonia, H., Hassen, N. V., Schneider, A., Garg, S., Drouin, A., Chapados, N., Nevmyvaka, Y., and Rish, I. Lag-Llama: Towards foundation models for time series forecasting. arXiv preprint arXiv:2310.08278, 2023.

Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.

Sen, R., Yu, H.-F., and Dhillon, I. S. Think globally, act locally: A deep neural network approach to high-dimensional time series forecasting. Advances in Neural Information Processing Systems, 32:4837–4846, 2019.

Shchur, O., Ansari, A. F., Turkmen, C., Stella, L., Erickson, N., Guerron, P., Bohlke-Schneider, M., and Wang, Y. Fev-bench: A realistic benchmark for time series forecasting. arXiv preprint arXiv:2509.26468, 2025.

Shi, X., Wang, S., Nie, Y., Li, D., Ye, Z., Wen, Q., and Jin, M. Time-MoE: Billion-scale time series foundation models with mixture of experts. In Proceedings of the International Conference on Learning Representations, 2025.

Thomas, V., Ma, J., Hosseinzadeh, R., Golestan, K., Yu, G., Volkovs, M., and Caterini, A. Retrieval & fine-tuning for in-context tabular models. Advances in Neural Information Processing Systems, 37:108439–108467, 2024.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, S., Li, C., and Lim, A.
A model for non-stationary time series and its applications in filtering and anomaly detection. IEEE Transactions on Instrumentation and Measurement, 70:1–11, 2021.

Wang, S., Wu, H., Shi, X., Hu, T., Luo, H., Ma, L., Zhang, J. Y., and Zhou, J. TimeMixer: Decomposable multiscale mixing for time series forecasting. In Proceedings of the International Conference on Learning Representations, 2024a.

Wang, X., Zhou, T., Wen, Q., Gao, J., Ding, B., and Jin, R. CARD: Channel aligned robust blend transformer for time series forecasting. In Proceedings of the International Conference on Learning Representations, 2024b.

Woo, G., Liu, C., Kumar, A., Xiong, C., Savarese, S., and Sahoo, D. Unified training of universal time series forecasting transformers. In Proceedings of the International Conference on Machine Learning, 2024.

Wu, H., Xu, J., Wang, J., and Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.

Xu, D., Cirit, O., Asadi, R., Sun, Y., and Wang, W. Mixture of in-context prompters for tabular PFNs. In Proceedings of the International Conference on Learning Representations, 2025.

Zhou, T., Ma, Z., Wen, Q., Sun, L., Yao, T., Yin, W., Jin, R., et al. FiLM: Frequency improved Legendre memory model for long-term time series forecasting. Advances in Neural Information Processing Systems, 35:12677–12690, 2022a.

Zhou, T., Ma, Z., Wen, Q., Wang, X., Sun, L., and Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, 2022b.

Zhou, T., Niu, P., Sun, L., Jin, R., et al. One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems, 36:43322–43355, 2023.
Zhu, B., Shi, X., Erickson, N., Li, M., Karypis, G., and Shoaran, M. XTab: Cross-table pretraining for tabular transformers. In Proceedings of the International Conference on Machine Learning, 2023.

A. Implementation Details

A.1. Model Details

Baguan-TS uses a 3D Transformer-based architecture with hyperparameters summarized in Table 6. The model has 22.4 million parameters.

Table 6. Baguan-TS model architecture hyperparameters.

Parameter                              Value
Temporal patch size (P)                8
Number of 3D Transformer blocks (L)    12
Number of heads                        6
Embedding dimension (D)                192
Feed-forward dimension                 768
Output dimension (K)                   5000

During training, C, T, H, and M are randomly sampled for each batch, up to fixed maximum values. We set Cmax = 50 for the contexts, Tmax = 2048 for the lookback length, Hmax = 192 for the prediction horizon, and Mmax = 80 for the number of covariates.

A.2. Training Data

Training data is critical to the performance of a foundation model. Baguan-TS is trained on a mixture of synthetic data and real-world benchmark data. The real-world data provide grounding in practical distributions and noise characteristics, while the synthetic data are designed to span a broad family of dynamics, covariate roles, and latent-factor structures that are hard to obtain exhaustively from any single corpus.

Synthetic Data. The diverse and task-specific semantics of covariates in different forecasting tasks make it difficult to handcraft a single, general-purpose synthetic data simulator. Our key idea is that many covariate-aware forecasting problems can be viewed as regression with partially observed, time-correlated latent factors. Observed covariates are noisy or incomplete proxies for these latent drivers, while the target series combines autoregressive structure with nonlinear effects of both observed and latent inputs.
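This latent-factor view can be made concrete with a minimal sketch. The choices below (a single Gaussian-process RBF kernel for latent trajectories, a one-hidden-layer random MLP, fixed noise scales) are illustrative simplifications; the actual generator draws kernels from a dictionary and applies a dynamic structural causal mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_latents(T: int, n_latent: int, length_scale: float = 20.0) -> np.ndarray:
    """Draw smooth, time-correlated latent trajectories from a
    Gaussian process with an RBF kernel (jitter keeps Cholesky stable)."""
    t = np.arange(T, dtype=float)[:, None]
    K = np.exp(-0.5 * ((t - t.T) / length_scale) ** 2) + 1e-4 * np.eye(T)
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((T, n_latent))

def synthesize_task(T: int = 256, n_latent: int = 4, n_observed: int = 2):
    """One synthetic task: latents -> random-MLP nonlinearity -> target.
    Only a subset of latents is exposed as observed covariates; the rest
    act as unobserved drivers."""
    Z = rbf_latents(T, n_latent)             # (T, n_latent) latent drivers
    W1 = rng.standard_normal((n_latent, 8))  # random MLP weights, fixed per task
    W2 = rng.standard_normal(8)
    y = np.tanh(Z @ W1) @ W2                 # nonlinear target mechanism
    y += 0.05 * rng.standard_normal(T)       # measurement noise
    X = Z[:, :n_observed]                    # exposed covariates only
    return X, y
```

Within a task, W1/W2 would stay fixed while each context or target instance draws a fresh latent realization Z, mirroring the fixed-mechanism / independent-realization split described above.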
Accordingly, we build a generic generator in which latent trajectories are drawn from a kernel dictionary and transformed through a dynamic structural causal mechanism with a random MLP to induce nonlinear interactions. Only a subset of nodes is exposed as observed covariates, and the remainder act as unobserved drivers; process and measurement noise, as well as optional regime shifts, are added to simulate different data characteristics. Within each task, the mechanism (graph, MLP weights, kernels, noise scales, exposure set) is fixed, whereas context and target instances draw independent latent realizations.

Real-World Data. We include the GIFT-Eval pretraining corpus (Aksu et al., 2024) in our training set and augment each time step with a time index feature. To simulate distribution shifts and expose the model to regime changes, we concatenate multiple normalized series into longer sequences and randomly sample contiguous subsequences as training examples.

A.3. Training Loss

We use the continuous ranked probability score (CRPS) as the training loss. For a predictive CDF F and observation y,

CRPS(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbb{I}(z \ge y) \right)^2 \, dz.

When the predictive distribution is represented by K bins with centers h_i, we collect the probabilities into a vector p = (p_1, ..., p_K)^\top \in \mathbb{R}^K with \sum_i p_i = 1, and the corresponding discrete form at each time step is

CRPS = \sum_{i=1}^{K} p_i |h_i - y| - \frac{1}{2} \sum_{i=1}^{K} \sum_{j=1}^{K} p_i p_j |h_i - h_j|,   (1)

where the first term is the expected absolute error under the forecast and the second term equals half the expected pairwise absolute distance between two independent forecast draws, acting as a sharpness term that discourages over-dispersion while preserving propriety. We compute CRPS per horizon step and average across the forecast window.

A.4.
Inference Details

For the inference process, we implement a stochastic ensemble approach to enhance prediction robustness through input perturbation and multiple forward passes. Specifically, covariate column positions are randomly shuffled, and 20% of historical target values are masked during each inference iteration, with 2–4 independent forward passes performed. The output is the average of these forward passes. In our experimental framework, a validation set is constructed for each time series by rolling the historical window backward by either 2 or 5 steps to simulate realistic forecasting conditions. The final predictions represent an ensemble of 2–9 distinct configurations selected based on their performance on this validation set. These configurations vary across four critical dimensions:

• Inference mode (2D/3D/2D+3D ensemble);
• Whether to include time series order, a blank column, or calendar-specific temporal features (e.g., year, month, day);
• Whether to apply reversible instance normalization (RevIN) (Kim et al., 2022) to each organized context;
• Context window length (T), defined as an integer multiple of the prediction horizon, taking values from 2 to 14.

This ensemble strategy enhances architectural diversity, automatically addresses covariate-agnostic scenarios by prioritizing temporal pattern extraction, and mitigates the uncertainty caused by the changeable context length and the RBfcst module.

B. Details of Ablation Study

B.1. Effect of 3D and 2D Ensemble

Residual Analysis for Point Prediction. We analyze the residual correlation between the 2D and 3D inference modes across all 30 covariate-aware tasks in the fev-bench dataset. The per-task Pearson correlation ranges from 0.42 to 0.95, with an aggregated value of 0.75 (Fig. 10a). The aggregated joint residual distribution is shown in Fig. 12c, revealing diverse error patterns that further support the effectiveness of combining 2D and 3D inference modes.
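The residual-correlation diagnostic above reduces to a few lines: compute each mode's residuals against the same ground truth and take their Pearson correlation. The function name below is our own; a correlation well below 1.0 is what motivates simply averaging the two modes' predictions.

```python
import numpy as np

def residual_correlation(y_true, pred_2d, pred_3d) -> float:
    """Pearson correlation between the residuals of two inference modes.

    Values near 1.0 mean both modes make the same errors (no ensemble
    gain); values well below 1.0 indicate diverse error structures
    that averaging the two forecasts can partially cancel.
    """
    y = np.asarray(y_true, dtype=float)
    r2d = y - np.asarray(pred_2d, dtype=float)
    r3d = y - np.asarray(pred_3d, dtype=float)
    return float(np.corrcoef(r2d, r3d)[0, 1])
```

Perfectly anti-correlated residuals (second and fourth quadrants in the joint distribution) would cancel exactly under a simple 50/50 average of the two forecasts.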
Figure 12. Joint residual distributions for the energy task proenfogfc14, the retail task rossmann1W, and the aggregated results across all 30 tasks. Darker blue bins indicate higher data density. Points lying along the red dashed identity line (y = x) correspond to instances where prediction errors are equal in both modes. Deviations from this diagonal, particularly in the second and fourth quadrants, reveal complementary error patterns between the two modes.

Calibration Analysis for Probabilistic Prediction. To assess the effectiveness and complementarity of the 2D and 3D inference modes, we analyze their probabilistic calibration. We compute histograms of the ground-truth quantile levels, also known as Probability Integral Transform (PIT) histograms. Specifically, for each observation y_t and predicted CDF \hat{F}_t, the quantile level is given by u_t = \hat{F}_t(y_t). The x-axis represents these quantile levels ranging from 0 to 1, while the y-axis (labeled "Frequency") indicates the relative frequency of ground truths falling into each quantile bin. A perfectly calibrated model should yield a uniform distribution, visualized as a flat line at Frequency = 0.1. As shown in Figs. 13–15, the 2D and 3D modes exhibit diverse error structures across different tasks. On the epfde task (Fig. 13), the 2D mode shows a distinct decreasing trend, indicating a systematic positive bias (over-estimation), as ground truths frequently fall into lower quantiles. Conversely, the 3D mode displays an increasing trend, indicating a negative bias. These opposing biases effectively cancel out in the ensemble mode, resulting in a well-calibrated histogram. On the hermes task (Fig.
14), the 2D mode shows a U-shaped distribution, indicating over-confident predictions. In contrast, the 3D mode demonstrates superior calibration performance. Aggregated across all tasks (Fig. 15), the 2D mode consistently exhibits a U-shaped pattern, reflecting overconfident predictions with under-dispersed distributions, whereas the 3D mode maintains more stable calibration performance with slight under-confidence. The ensemble strategy effectively mitigates their weaknesses, resulting in robust probabilistic performance.

Figure 13. Calibration histograms of ground-truth quantile levels in predicted CDFs for different inference modes on epfde (energy task).

Figure 14. Calibration histograms of ground-truth quantile levels in predicted CDFs for different inference modes on hermes (retail task).

Figure 15. Calibration histograms of ground-truth quantile levels in predicted CDFs for different inference modes on the fev-bench-cov dataset (across all 30 tasks).

B.2. Robustness to Injected Noise

To more rigorously evaluate the robustness and stability of the proposed local calibration module, we conduct a noise injection experiment on the epffr dataset. Specifically, we corrupt the original time series with three types of synthetic stochastic noise: Gaussian white noise, random walk noise, and periodic noise (Chang, 2010; Kim & Chung, 2024; Wang et al., 2021). For each noise type, the intensity is controlled by a scaling factor κ ∈ {0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0}. The actual scale parameter used is defined as σ · κ, where σ denotes the empirical standard deviation of the original time series.
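The three corruptions, each scaled by κ times the series' empirical standard deviation, can be sketched as below. Function and argument names are illustrative, and the periodic variant follows the random period/phase recipe detailed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(series: np.ndarray, kind: str, kappa: float) -> np.ndarray:
    """Corrupt `series` with one noise type at intensity kappa.

    The noise scale is kappa * sigma, where sigma is the series'
    empirical standard deviation.
    """
    T = len(series)
    scale = kappa * series.std()
    if kind == "gaussian":          # stationary, uncorrelated white noise
        noise = rng.normal(0.0, scale, T)
    elif kind == "random_walk":     # cumulative sum of i.i.d. increments:
        noise = np.cumsum(rng.normal(0.0, scale, T))  # non-stationary drift
    elif kind == "periodic":        # seasonal interference, random period/phase
        period = rng.uniform(12, 60)
        phase = rng.uniform(0.0, 2 * np.pi)
        noise = scale * np.sin(2 * np.pi * np.arange(T) / period + phase)
        noise += rng.normal(0.0, 0.1 * scale, T)  # small Gaussian background
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return series + noise
```

Setting kappa = 0 recovers the clean series, which corresponds to the κ = 0 setting reported in Fig. 8.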
For Gaussian white noise, we sample independent and identically distributed (i.i.d.) values from a normal distribution N(0, σκ) to obtain uncorrelated perturbations that mimic stationary noise.

For random walk noise, inspired by (Wang et al., 2021), we aim to mimic non-stationary noise with a directional trend. We first generate i.i.d. Gaussian increments ε_t ~ N(0, σκ) and then compute their cumulative sum to form a random walk process: w_t = \sum_{i=1}^{t} ε_i. This introduces temporally correlated perturbations and can simulate accumulated uncertainty over time.

Finally, we introduce periodic noise (Chang, 2010) to mimic seasonal interference using a sinusoidal signal with random frequency and phase: p_t^{noise} = (σκ) · sin(2πt/T + φ), where the period T is uniformly sampled from the interval [12, 60] to emulate diverse seasonal patterns (e.g., daily or weekly cycles), and the phase offset φ is drawn uniformly from [0, 2π). To avoid overly idealized waveforms, we further add a small amount of Gaussian background noise with standard deviation 0.1 · σκ.

We report results based on a single forward pass without any ensemble strategy for our method, and use n_estimators = 2 for TabPFN-TS, with a context length of 50,000 and the full set of clean running index and calendar features. The results are shown in Fig. 8. Our approaches, especially Y-space RBfcst, maintain stronger robustness than TabPFN-TS across all noise types and intensities.

B.3. Context-Overfitting on Synthetic Data

To simulate controllable periodic spikes, we formulate a synthetic time series using f(t) = exp(k · (sin(2πt/50) − 1)) (see Fig. 16). The spike sharpness is controlled by the parameter k, with larger values of k producing sharper spikes.

Figure 16. Toy periodic spike data generated using f(t) = exp(k · (sin(2πt/50) − 1)).
A larger k value results in a sharper periodic spike signal.

As shown in Fig. 17, distinct attention patterns emerge between the context-overfitting model and the baseline. At the peak prediction step (Figs. 17a and 17c), the attention weights of the context-overfitting model precisely align with the historical peak points within the lookback window. In contrast, the baseline model exhibits a diffuse attention distribution with a noticeable magnitude drop at these critical peak positions.

For the trough prediction in the flat region (Figs. 17b and 17d), the context-overfitting model shows more precise attention behavior. In the penultimate layer, the context-overfitting model effectively attends to the target flat region, aligning closely with the ground-truth dynamics. By the final layer, it manifests a dual-focus strategy, aggregating information from both historical peaks and target flat regions, which likely combines global periodicity and local context to refine the final prediction. Conversely, the baseline model shows a more diffuse and scattered attention pattern across these two layers. It generates more uniform attention over the flat parts and displays a symmetric pattern around the peaks, lacking the precise localization ability found in the context-overfitting model. Overall, this comparison suggests that the context-overfitting approach learns a sharper, structure-aware representation of time series.
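The toy spike series used throughout this ablation follows directly from its closed form f(t) = exp(k · (sin(2πt/50) − 1)); a minimal generator (function name is our own) is:

```python
import numpy as np

def spike_series(n_steps: int, k: float) -> np.ndarray:
    """Toy periodic spike data: f(t) = exp(k * (sin(2*pi*t/50) - 1)).

    The period is 50 steps. The signal is bounded in (0, 1]: it
    approaches 1 where sin(2*pi*t/50) = 1 and decays toward exp(-2k)
    elsewhere, so larger k yields sharper, narrower spikes.
    """
    t = np.arange(n_steps)
    return np.exp(k * (np.sin(2 * np.pi * t / 50) - 1.0))
```

Sweeping k over values like {1, 2, 5, 10, 50, 100} reproduces the family of increasingly sharp spike signals shown in Fig. 16.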
Figure 17. Attention weight maps on the synthetic spike series for the baseline model (w/o overfitting) and the context-overfitting model (with overfitting): (a) the penultimate layer, peak-step prediction; (b) the penultimate layer, trough-step prediction; (c) the final layer, peak-step prediction; (d) the final layer, trough-step prediction. Each subplot overlays the ground-truth signal, model predictions, and the corresponding attention distribution within the context window at the query step (orange dash-dot line).

C. Detailed Experimental Results

C.1. Results on Covariate-Aware Tasks

We provide detailed descriptions in Table 7 of the 30 covariate-aware forecasting tasks selected from fev-bench, which we denote as fev-bench-cov. The average skill scores across all evaluation metrics (SQL, MASE, WAPE, and WQL) are shown in Figs. 18–20.
Our model attains either the best or second-best skill score on all four metrics, consistently outperforming other covariate-aware time series foundation models such as TabPFN-TS. The detailed experimental results for each task are presented in Tables 8–16. Results for our method employ the ensemble strategy described in Appendix A.4; baseline results are sourced from the official fev-bench benchmark. To further assess performance in covariate-rich settings, we also include 27 proprietary real-world datasets from production: 18 city-level electricity load datasets and 9 city-level distributed photovoltaic (PV) generation forecasting tasks, all collected from different cities in China. To mirror real deployments, these 27 datasets include the full set of production covariates, notably numerical weather prediction (NWP) features and calendar effects. Table 17 reports the average performance across these 27 datasets. Baguan-TS achieves the best overall results on all four evaluation metrics, demonstrating strong performance in practical, covariate-rich forecasting settings.

Figure 18. Point forecasting evaluation on fev-bench-cov: (a) MASE (average win rate and average MASE); (b) WAPE (average win rate and average WAPE).
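For reference, the point metrics and the skill score used throughout these tables can be sketched as follows. This is a minimal sketch; fev-bench's exact implementations, aggregation across items, and the skill-score sign convention are assumptions here and may differ.

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted absolute percentage error: total absolute error
    normalized by total absolute actuals."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error: forecast MAE divided by the
    in-sample MAE of a seasonal-naive forecast on the history."""
    y_true, y_pred, y_train = (np.asarray(a, float)
                               for a in (y_true, y_pred, y_train))
    scale = np.abs(y_train[season:] - y_train[:-season]).mean()
    return np.abs(y_true - y_pred).mean() / scale

def skill_score(model_err, baseline_err):
    """Relative improvement over a baseline error (e.g. Seasonal Naive):
    0 means parity, 1 means a perfect forecast. Convention assumed."""
    return 1.0 - model_err / baseline_err
```

For example, `wape([10, 10], [9, 11])` gives 0.1 (an absolute error of 2 against a total actual of 20), and halving the baseline's error yields a skill score of 0.5.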
Figure 19. Probabilistic forecasting evaluation on fev-bench-cov using WQL (average win rate and average WQL).

Figure 20. Skill score comparison on fev-bench-cov with respect to SQL, MASE, WAPE, and WQL, using Seasonal Naive as the baseline model.

C.2. Results on Univariate Tasks

We provide detailed information in Table 18 for the 10 univariate and multivariate forecasting tasks selected from fev-bench-mini, which we denote as fev-bench-uni. The corresponding ranking results, evaluated by macro-averaged SQL, MASE, WAPE, and WQL across these 10 datasets, are presented in Table 19. While adapting our 3D model

Table 7. Detailed overview of the 30 covariate-aware tasks from the fev-bench benchmark, collectively referred to as fev-bench-cov.
Columns: Task, Domain, Frequency, # items, median length, # obs, # known dynamic cols, HW, # seasonality, # targets.

entsoe15Tenergy15 Min6175,2926,310,51239620961
entsoe1Henergy1 H643,8221,577,59239620481
entsoe30Tenergy30 Min687,6453,155,220316820241
epfbeenergy1 H152,416157,24822420241
epfdeenergy1 H152,416157,24822420241
epffrenergy1 H152,416157,24822420241
epfnpenergy1 H152,416157,24822420241
epf pjmenergy1 H152,416157,24822420241
proenfogfc12energy1 H1139,414867,108116810241
proenfogfc14energy1 H117,52035,040116820241
proenfo gfc17energy1 H817,544280,704116820241
solarwithweather15Tenergy15 Min1198,6001,986,00079620961
solarwithweather1Henergy1 H149,648496,48072420241
uciairquality1DnatureDaily13895,0573281174
uciairquality1Hnature1 H19,357121,641316820244
m51DretailDaily30,4901,810428,849,460828171
m51MretailMonthly30,4905813,805,6858121121
m51WretailWeekly30,49025760,857,703813111
rohlikorders1DretailDaily71,197115,650461571
rohlikorders1WretailWeekly717015,31648511
rohlik sales1DretailDaily5,3901,04674,413,9351314171
rohliksales1WretailWeekly5,24315010,516,770138111
rossmann 1DretailDaily1,1159427,352,3105481071
rossmann1WretailWeekly1,115133889,770413811
walmartretailWeekly2,9361434,609,1431039111
hermesretailWeekly10,0002615,220,000152111
favorita stores1DretailDaily1,5791,68810,661,4082281071
favoritastores1MretailMonthly1,57954255,7981122121
favoritastores1WretailWeekly1,5792401,136,8801131011
favoritatransactions1DretailDaily511,688258,2641281071

Table 8. Performance Comparison on proenfogfc12, proenfogfc14, and proenfogfc17. The best results are highlighted in bold, and the second-best results are underlined.
Model | gfc12: SQL MASE WAPE WQL | gfc14: SQL MASE WAPE WQL | gfc17: SQL MASE WAPE WQL
AutoARIMA | 1.1408 1.3848 0.1193 0.0987 | 0.9471 1.1681 0.0571 0.0464 | 1.1150 1.3821 0.0943 0.0762
AutoETS | 2.4309 2.8114 0.2331 0.2073 | 1.1104 1.3186 0.0647 0.0545 | 2.1346 2.4128 0.1663 0.1466
AutoTheta | 1.4149 1.5237 0.1360 0.1251 | 1.0555 1.1857 0.0580 0.0519 | 1.1470 1.3687 0.0931 0.0770
Chronos-Bolt | 0.9172 1.0781 0.0910 0.0773 | 0.7674 0.9285 0.0455 0.0376 | 0.9004 1.0977 0.0745 0.0608
Moirai-2.0 | 0.7928 0.9538 0.0803 0.0668 | 0.6766 0.8458 0.0413 0.0330 | 0.7740 0.9628 0.0647 0.0519
Naive | 2.3796 2.6291 0.2326 0.2075 | 3.2096 3.7073 0.1831 0.1588 | 2.5750 2.9066 0.1911 0.1691
Seasonal Naive | 1.1997 1.4282 0.1285 0.1088 | 1.0751 1.1989 0.0587 0.0529 | 1.3160 1.5845 0.1078 0.0897
Stat. Ensemble | 1.3049 1.5463 0.1407 0.1182 | 0.9059 1.1044 0.0540 0.0444 | 1.1417 1.4215 0.0960 0.0770
Sundial-Base | 0.9004 0.9936 0.0814 0.0733 | 0.4208 0.4636 0.0226 0.0205 | 0.5086 0.5526 0.0343 0.0316
TabPFN-TS | 0.8345 1.0338 0.0878 0.0702 | 0.5148 0.6406 0.0314 0.0252 | 0.6717 0.8553 0.0563 0.0441
TimesFM-2.5 | 0.1876 0.2168 0.0163 0.0143 | 0.1403 0.1688 0.0083 0.0069 | 0.1601 0.1898 0.0120 0.0101
TiRex | 0.9081 1.1242 0.0954 0.0771 | 0.7206 0.9119 0.0447 0.0354 | 0.8894 1.1381 0.0764 0.0597
Toto-1.0 | 0.9344 1.1371 0.0972 0.0795 | 0.7346 0.9215 0.0452 0.0360 | 0.9388 1.1961 0.0799 0.0626
Ours | 0.6646 0.8282 0.0691 0.0554 | 0.4628 0.5772 0.0284 0.0227 | 0.5239 0.6642 0.0432 0.0341

Table 9. Performance Comparison on rohliksales1D, rohlikorders1D, rohliksales1W, and rohlikorders1W. The best results are highlighted in bold, and the second-best results are underlined.
Model | sales1D: SQL MASE WAPE WQL | orders1D: SQL MASE WAPE WQL | sales1W: SQL MASE WAPE WQL | orders1W: SQL MASE WAPE WQL
AutoARIMA | 1.2758 1.5092 0.4095 0.3459 | 1.2662 1.5624 0.0741 0.0604 | 1.8025 2.0379 0.3070 0.2670 | 1.4147 1.7324 0.0584 0.0471
AutoETS | 1.2662 1.4986 0.4185 0.3543 | 1.4470 1.4870 0.0657 0.0637 | 14.4532 1.8900 0.3003 4.0491 | 1.4190 1.7583 0.0582 0.0469
AutoTheta | 1.2818 1.4939 0.4131 0.3563 | 1.3973 1.5185 0.0664 0.0607 | 1.6547 1.8364 0.2946 0.2656 | 1.3963 1.7122 0.0573 0.0465
Chronos-Bolt | 1.1471 1.3871 0.3783 0.3140 | 1.0508 1.3016 0.0614 0.0495 | 1.5216 1.8467 0.2846 0.2337 | 1.4282 1.7219 0.0570 0.0471
Moirai-2.0 | 1.1696 1.4021 0.3821 0.3202 | 0.9700 1.1761 0.0552 0.0455 | 1.5157 1.8215 0.2871 0.2365 | 1.5315 1.8740 0.0612 0.0501
Naive | 1.5460 1.7485 0.4717 0.4099 | 2.9130 2.4354 0.1083 0.1384 | 1.9282 1.9155 0.3119 0.3323 | 1.4844 1.7312 0.0584 0.0499
Seasonal Naive | 1.3750 1.6150 0.4346 0.3728 | 1.5544 1.7783 0.0827 0.0730 | 1.9282 1.9155 0.3119 0.3323 | 1.4844 1.7312 0.0584 0.0499
Stat. Ensemble | 1.2483 1.4760 0.4057 0.3425 | 1.2114 1.3812 0.0628 0.0552 | 1.6455 1.8250 0.2906 0.2596 | 1.3984 1.7128 0.0573 0.0465
Sundial-Base | 1.2027 1.3443 0.3645 0.3250 | 1.1963 1.3993 0.0667 0.0571 | 1.6933 1.8996 0.2929 0.2594 | 1.8923 2.0952 0.0675 0.0609
TabPFN-TS | — — — — | 1.3411 1.5479 0.0656 0.0585 | 1.2205 1.5205 0.2152 0.1698 | 1.5240 1.9887 0.0652 0.0500
TimesFM-2.5 | 1.0958 1.3238 0.3586 0.2984 | 1.0057 1.2504 0.0591 0.0473 | 1.4010 1.6891 0.2667 0.2197 | 1.3278 1.6592 0.0542 0.0430
TiRex | 1.1481 1.3845 0.3784 0.3154 | 0.9858 1.2034 0.0572 0.0468 | 1.4252 1.7354 0.2736 0.2252 | 1.3004 1.5829 0.0521 0.0426
Toto-1.0 | 1.2181 1.4539 0.3975 0.3352 | 1.1351 1.3776 0.0646 0.0532 | 1.5046 1.8031 0.2812 0.2348 | 1.4934 1.7980 0.0591 0.0489
Ours | 0.8725 1.0930 0.2663 0.2114 | 1.1684 1.4377 0.0669 0.0546 | 1.1827 1.4646 0.2045 0.1627 | 1.5405 1.8940 0.0608 0.0493

Table 10. Performance Comparison on entsoe15T, entsoe30T, and entsoe1H. The best results are highlighted in bold, and the second-best results are underlined.
Model | 15T: SQL MASE WAPE WQL | 30T: SQL MASE WAPE WQL | 1H: SQL MASE WAPE WQL
AutoARIMA | — | 0.9807 1.2299 0.0909 0.0732 | 0.8725 1.1176 0.0854 0.0667
AutoETS | 3.0289 4.0070 0.4081 0.3140 | 2.4928 3.2738 0.2412 0.1816 | 1.9050 2.0475 0.1570 0.1447
AutoTheta | 0.5794 0.7298 0.0540 0.0430 | 0.7997 1.0040 0.0721 0.0572 | 0.9723 1.1626 0.0836 0.0694
Chronos-Bolt | 0.5062 0.6062 0.0428 0.0363 | 0.5294 0.6326 0.0378 0.0321 | 0.4574 0.5564 0.0341 0.0282
Moirai-2.0 | 0.4783 0.5933 0.0424 0.0343 | 0.4884 0.5974 0.0383 0.0318 | 0.4871 0.5924 0.0396 0.0333
Naive | 1.5177 1.9046 0.1434 0.1155 | 1.6092 2.0733 0.1463 0.1142 | 1.9154 2.0553 0.1571 0.1448
Seasonal Naive | 0.7807 0.9315 0.0677 0.0566 | 1.0103 1.2162 0.0871 0.0726 | 1.0561 1.1015 0.0792 0.0798
Stat. Ensemble | — | 0.8465 1.0578 0.0767 0.0612 | 0.8920 1.1145 0.0833 0.0660
Sundial-Base | 0.6669 0.7578 0.0565 0.0500 | 0.7216 0.7796 0.0520 0.0486 | 0.7441 0.8118 0.0555 0.0510
TabPFN-TS | 0.4837 0.6094 0.0426 0.0335 | 0.5117 0.6350 0.0391 0.0322 | 0.4419 0.5377 0.0333 0.0273
TimesFM-2.5 | 0.4709 0.5903 0.0416 0.0335 | 0.5658 0.6958 0.0473 0.0388 | 0.4681 0.5849 0.0359 0.0290
TiRex | 0.4693 0.5986 0.0420 0.0337 | 0.5230 0.6646 0.0400 0.0314 | 0.4701 0.5854 0.0363 0.0294
Toto-1.0 | 0.5909 0.7503 0.0523 0.0414 | 0.4958 0.6254 0.0379 0.0302 | 0.4796 0.5908 0.0366 0.0302
Ours | 0.5261 0.6586 0.0458 0.0366 | 0.4441 0.5538 0.0365 0.0295 | 0.3859 0.4819 0.0287 0.0231

Table 11. Performance Comparison on epfbe, epfde, epffr, epfnp, and epfpjm. The best results are highlighted in bold, and the second-best results are underlined.
Model | be: SQL MASE WAPE WQL | de: SQL MASE WAPE WQL | fr: SQL MASE WAPE WQL | np: SQL MASE WAPE WQL | pjm: SQL MASE WAPE WQL
AutoARIMA | 1.0561 1.0948 0.1986 0.1868 | 1.2777 1.6231 0.6819 0.5433 | 1.1581 0.9888 0.1532 0.1715 | 1.3932 1.7096 0.0786 0.0637 | 0.4819 0.5573 0.0985 0.0855
AutoETS | 1.5343 1.3929 0.2488 0.2659 | 1.4013 1.7417 0.6107 0.5288 | 0.8989 0.9665 0.1493 0.1387 | 1.9332 2.3244 0.1041 0.0864 | 0.9139 0.9847 0.1717 0.1607
AutoTheta | 1.4843 1.0194 0.1882 0.2582 | 1.4944 1.4072 0.6460 0.6490 | 1.5912 0.8447 0.1294 0.2296 | 1.2812 1.5140 0.0688 0.0579 | 0.6032 0.6832 0.1199 0.1068
Chronos-Bolt | 0.5731 0.7273 0.1357 0.1069 | 1.0208 1.2655 0.5607 0.4644 | 0.4389 0.5638 0.0877 0.0685 | 0.9711 1.2409 0.0567 0.0445 | 0.4217 0.5357 0.0938 0.0739
Moirai-2.0 | 0.5281 0.6709 0.1228 0.0968 | 1.0164 1.2282 0.5365 0.4586 | 0.4092 0.5035 0.0791 0.0646 | 0.9253 1.1998 0.0543 0.0420 | 0.4405 0.5631 0.0976 0.0766
Naive | 3.0845 1.3609 0.2376 0.5357 | 1.4012 1.7417 0.6107 0.5287 | 3.8189 1.1603 0.1703 0.5501 | 1.9404 2.3155 0.1037 0.0869 | 0.9298 0.9849 0.1717 0.1642
Seasonal Naive | 1.1503 1.0271 0.1840 0.2013 | 1.3877 1.7086 0.7490 0.5919 | 1.2455 0.8501 0.1298 0.1818 | 1.5298 1.7613 0.0796 0.0692 | 0.5153 0.5812 0.1016 0.0908
Stat. Ensemble | 1.2135 0.9814 0.1754 0.2109 | 1.1666 1.3821 0.6023 0.4882 | 1.1456 0.7410 0.1133 0.1677 | 1.2844 1.5147 0.0683 0.0574 | 0.4871 0.5295 0.0934 0.0863
Sundial-Base | 0.6465 0.7222 0.1303 0.1170 | 1.1831 1.3227 0.5722 0.5103 | 0.4611 0.5244 0.0813 0.0720 | 0.9451 1.0947 0.0496 0.0428 | 0.4679 0.5214 0.0934 0.0841
TabPFN-TS | 0.5324 0.6702 0.1180 0.0933 | 0.4403 0.5675 0.3069 0.2426 | 0.3307 0.4182 0.0618 0.0491 | 0.6593 0.8709 0.0387 0.0293 | 0.4270 0.5340 0.0925 0.0740
TimesFM-2.5 | 0.4937 0.6102 0.1126 0.0905 | 1.0300 1.2799 0.5914 0.4775 | 0.4092 0.4907 0.0771 0.0640 | 1.1706 1.4402 0.0639 0.0519 | 0.4263 0.5311 0.0942 0.0757
TiRex | 0.5270 0.6744 0.1230 0.0962 | 1.0322 1.2971 0.5787 0.4685 | 0.4014 0.5040 0.0798 0.0639 | 0.9662 1.2226 0.0554 0.0438 | 0.4042 0.5058 0.0894 0.0714
Toto-1.0 | 0.5648 0.7065 0.1323 0.1058 | 1.1058 1.3370 0.6772 0.5515 | 0.4257 0.5253 0.0837 0.0683 | 1.0369 1.3589 0.0616 0.0471 | 0.4519 0.5822 0.1031 0.0800
Ours | 0.5162 0.6550 0.1159 0.0908 | 0.4597 0.5882 0.3424 0.2525 | 0.3518 0.4517 0.0675 0.0529 | 0.7174 0.9241 0.0415 0.0321 | 0.3716 0.4603 0.0798 0.0646

Table 12. Performance Comparison on rossmann1D, rossmann1W, hermes, and walmart. The best results are highlighted in bold, and the second-best results are underlined.
Model | rossmann1D: SQL MASE WAPE WQL | rossmann1W: SQL MASE WAPE WQL | hermes: SQL MASE WAPE WQL | walmart: SQL MASE WAPE WQL
AutoARIMA | 0.5619 0.6544 0.2224 0.1908 | 0.5212 0.6689 0.1836 0.1410 | 1.2129 1.5317 0.0064 0.0050 | 1.0205 1.2474 0.1596 0.1320
AutoETS | 0.5937 0.6913 0.2361 0.2028 | 0.5175 0.6685 0.1836 0.1402 | 1.6730 1.9852 0.0088 0.0074 | — 1.7227 0.2951 —
AutoTheta | 0.8315 0.7448 0.2558 0.2910 | 0.5160 0.6644 0.1813 0.1385 | 1.5539 1.8478 0.0081 0.0068 | 1.4675 1.4179 0.1852 0.1997
Chronos-Bolt | 0.5246 0.6371 0.2176 0.1788 | 0.4871 0.6452 0.1760 0.1317 | 0.6752 0.8579 0.0032 0.0025 | 0.7740 0.9671 0.1173 0.0950
Moirai-2.0 | 0.5274 0.6480 0.2215 0.1799 | 0.4969 0.6440 0.1759 0.1342 | 0.7038 0.8854 0.0034 0.0027 | 0.8447 1.0559 0.1315 0.1058
Naive | 2.7621 1.6195 0.5481 0.9389 | 0.8988 0.7960 0.2192 0.2509 | 2.1461 1.9945 0.0087 0.0087 | 2.0341 1.5241 0.1967 0.3075
Seasonal Naive | 0.9137 0.7886 0.2692 0.3140 | 0.8988 0.7960 0.2192 0.2509 | 2.1461 1.9945 0.0087 0.0087 | 2.0341 1.5241 0.1967 0.3075
Stat. Ensemble | 0.5781 0.6678 0.2278 0.1969 | 0.5014 0.6513 0.1783 0.1351 | 1.4162 1.8121 0.0079 0.0061 | 1.2166 1.3564 0.1773 0.1644
Sundial-Base | 0.5306 0.6155 0.2092 0.1802 | 0.5781 0.6790 0.1855 0.1572 | 0.8243 0.9595 0.0037 0.0031 | 0.8422 0.9842 0.1211 0.1036
TabPFN-TS | 0.2321 0.2945 0.0979 0.0772 | 0.2539 0.3046 0.0792 0.0660 | 0.7049 0.9123 0.0034 0.0026 | 0.6619 0.8318 0.0943 0.0752
TimesFM-2.5 | 0.5016 0.6106 0.2086 0.1711 | 0.4952 0.6543 0.1803 0.1351 | 0.6184 0.7872 0.0029 0.0023 | 0.6794 0.8615 0.1016 0.0803
TiRex | 0.5391 0.6601 0.2251 0.1837 | 0.4816 0.6218 0.1697 0.1304 | 0.6510 0.8310 0.0031 0.0024 | 0.7075 0.8862 0.1054 0.0850
Toto-1.0 | 0.5677 0.6814 0.2321 0.1932 | 0.4944 0.6319 0.1733 0.1345 | 0.9853 1.2023 0.0044 0.0036 | 0.9072 1.1258 0.1385 0.1135
Ours | 0.2720 0.3348 0.1120 0.0910 | 0.2787 0.3371 0.0854 0.0705 | 0.6185 0.7971 0.0030 0.0023 | 0.7621 1.0047 0.1305 0.0925

Table 13. Performance Comparison on m51D, m51W, and m51M. The best results are highlighted in bold, and the second-best results are underlined.
Model | 1D: SQL MASE WAPE WQL | 1W: SQL MASE WAPE WQL | 1M: SQL MASE WAPE WQL
AutoARIMA | 0.8517 1.0631 0.7717 0.6115 | 0.9367 1.1568 0.4388 0.3580 | 1.0455 1.2131 0.4601 0.3801
AutoETS | 0.8528 1.0633 0.7714 0.6156 | 0.9531 1.1672 0.4468 0.3728 | 1.1082 1.2139 0.4605 0.4321
AutoTheta | 0.8721 1.0806 0.7779 0.6265 | 0.9634 1.1698 0.4475 0.3767 | 1.0987 1.2590 0.4910 0.4195
Chronos-Bolt | 0.7293 0.8852 0.7110 0.5620 | 0.9165 1.1647 0.4334 0.3410 | 1.0001 1.1852 0.4455 0.3566
Moirai-2.0 | 0.7096 0.8691 0.6984 0.5508 | 0.9069 1.1501 0.4286 0.3361 | 0.9959 1.1773 0.4363 0.3457
Seasonal Naive | 1.2545 1.2236 0.9159 0.8589 | 1.3558 1.3382 0.5042 0.5199 | 1.1399 1.3266 0.5108 0.4286
Sundial-Base | 0.8516 0.9678 0.7357 0.6370 | 0.9748 1.1331 0.4258 0.3623 | 1.0815 1.1996 0.4509 0.3886
TabPFN-TS | — — — — | 0.9282 1.1605 0.4358 0.3437 | 1.0017 1.1871 0.4365 0.3467
TimesFM-2.5 | 0.7189 0.8721 0.6968 0.5507 | 0.8890 1.1222 0.4201 0.3312 | 0.9798 1.1576 0.4249 0.3370
TiRex | 0.7144 0.8753 0.7025 0.5523 | 0.9026 1.1477 0.4281 0.3359 | 0.9740 1.1629 0.4306 0.3447
Toto-1.0 | 0.7076 0.8683 0.6986 0.5495 | 0.9050 1.1448 0.4278 0.3369 | 1.0440 1.2422 0.4597 0.3648
Ours | 0.7235 1.0231 0.7461 0.5524 | 0.9011 1.1221 0.4214 0.3319 | 1.0080 1.1581 0.4240 0.3462

Table 14. Performance Comparison on favoritastores (1D/1W/1M) and favoritatransactions1D. The best results are highlighted in bold, and the second-best results are underlined.
Model | stores1D: SQL MASE WAPE WQL | stores1W: SQL MASE WAPE WQL | stores1M: SQL MASE WAPE WQL | transactions1D: SQL MASE WAPE WQL
AutoARIMA | 1.2216 1.4188 0.1934 0.1647 | 2.2969 2.5801 0.1501 0.1331 | 2.0382 2.2075 0.1385 0.1314 | 1.5622 1.7409 0.1205 0.1055
AutoETS | 1.2379 1.3799 0.1872 0.1672 | 2.3568 2.5165 0.1484 0.1430 | 1.9424 2.1221 0.1568 0.1329 | 1.1813 1.2765 0.1010 0.0945
AutoTheta | 1.2732 1.3759 0.1835 0.1779 | 2.3160 2.4768 0.1444 0.1408 | 1.9416 2.1335 0.1535 0.1245 | 1.2463 1.2511 0.0961 0.0991
Chronos-Bolt | 1.0322 1.2689 0.1743 0.1419 | 2.1011 2.4749 0.1580 0.1268 | 2.0865 2.4178 0.2637 0.1852 | 0.9750 1.1515 0.0873 0.0736
Moirai-2.0 | 0.9798 1.2047 0.1568 0.1279 | 2.1965 2.5773 0.1555 0.1243 | 2.0913 2.3236 0.2355 0.1726 | 1.1211 1.3295 0.0837 0.0705
Naive | 2.6357 1.8830 0.3056 0.4079 | 2.4938 2.5171 0.1528 0.1608 | 2.0578 1.9974 0.1233 0.1701 | 2.7304 1.8618 0.1561 0.2372
Seasonal Naive | 1.6902 1.7590 0.2520 0.2376 | 2.4938 2.5171 0.1528 0.1608 | 2.0967 2.2823 0.1900 0.1715 | 1.7338 1.8200 0.1461 0.1403
Stat. Ensemble | 1.1971 1.3400 0.1789 0.1606 | 2.2196 2.4340 0.1436 0.1336 | 1.9426 2.1181 0.1467 0.1265 | 1.1848 1.2658 0.1005 0.0949
Sundial-Base | 1.0613 1.2200 0.1562 0.1354 | 2.3082 2.5612 0.1497 0.1312 | 2.2543 2.4060 0.2561 0.2222 | 1.1982 1.3247 0.0799 0.0711
TabPFN-TS | 0.9698 1.1943 0.1486 0.1204 | 2.1227 2.5268 0.1260 0.1011 | 1.9336 2.1758 0.1159 0.0943 | 1.2252 1.6727 0.0773 0.0603
TimesFM-2.5 | 0.9494 1.1676 0.1452 0.1184 | 1.9684 2.2901 0.1308 0.1095 | 1.9983 2.2338 0.2103 0.1509 | 0.8736 1.0842 0.0655 0.0539
TiRex | 0.9682 1.1934 0.1518 0.1243 | 2.0462 2.4235 0.1346 0.1134 | 1.8559 2.2147 0.1894 0.1422 | 1.0314 1.3493 0.0820 0.0677
Toto-1.0 | 1.0364 1.2804 0.1784 0.1439 | 2.1277 2.5080 0.1511 0.1223 | 2.0094 2.2865 0.1978 0.1392 | 1.1139 1.4307 0.0947 0.0770
Ours | 0.9596 1.2168 0.1550 0.1247 | 1.9960 2.3642 0.1259 0.0982 | 1.9776 2.1875 0.1571 0.1215 | 1.0562 1.2099 0.0801 0.0676

Table 15. Performance Comparison on solarwithweather15T and solarwithweather1H. The best results are highlighted in bold, and the second-best results are underlined.
Model | 15T: SQL MASE WAPE WQL | 1H: SQL MASE WAPE WQL
AutoARIMA | — | 1.1314 1.1133 1.2902 1.2703
AutoETS | 2.5289 2.4162 1.4215 1.9572 | 2.1818 2.2682 1.1465 1.3928
AutoTheta | 3.5851 1.2729 1.5023 3.2873 | 3.3576 1.5354 1.5474 2.6715
Chronos-Bolt | 0.8094 0.9765 1.2911 1.1301 | 0.8157 1.0720 1.3260 1.0190
Moirai-2.0 | 0.8387 1.0764 1.3761 1.1205 | 0.9071 1.1358 1.4828 1.2285
Naive | 2.2780 1.9895 1.0180 1.6964 | 2.1807 2.2682 1.1464 1.3916
Seasonal Naive | 1.1938 1.0501 1.3115 1.4074 | 1.2120 1.0619 1.2339 1.3171
Stat. Ensemble | — | 1.4584 1.3148 1.2682 1.1488
Sundial-Base | 0.9626 1.1182 1.5059 1.3508 | 1.1815 1.3032 1.4481 1.3598
TabPFN-TS | 0.7471 0.9445 1.1662 0.9712 | 0.7006 0.8629 0.9096 0.7960
TimesFM-2.5 | 0.9063 1.1153 1.4700 1.2405 | 0.8154 1.0513 1.2011 0.9696
TiRex | 0.8457 1.0487 1.4236 1.2068 | 0.9000 1.1583 1.5261 1.2499
Toto-1.0 | 0.7839 0.9403 1.1480 0.9998 | 0.8760 1.0585 1.4068 1.2191
Ours | 0.7402 0.8951 1.0609 0.8912 | 0.6751 0.8774 1.0128 0.8174

Table 16. Performance Comparison on uciairquality1H and uciairquality1D. The best results are highlighted in bold, and the second-best results are underlined.

Model | 1H: SQL MASE WAPE WQL | 1D: SQL MASE WAPE WQL
AutoARIMA | 1.1924 1.3698 0.4518 0.3924 | 1.2403 1.5585 0.3229 0.2530
AutoETS | 4.10×10^5 10.6979 4.0642 2.05×10^5 | 1.1813 1.3629 0.2791 0.2425
AutoTheta | 1.9560 1.3458 0.4168 0.6724 | 1.2334 1.4428 0.2964 0.2579
Chronos-Bolt | 0.8990 1.1162 0.3758 0.3020 | 1.0920 1.3887 0.2843 0.2229
Moirai-2.0 | 0.9454 1.1965 0.3904 0.3063 | 1.1380 1.4417 0.2916 0.2299
Naive | 2.4250 1.6816 0.5251 0.8361 | 1.8739 2.1091 0.4315 0.3990
Seasonal Naive | 1.3840 1.4425 0.4563 0.4609 | 1.4013 1.7337 0.3624 0.2922
Stat. Ensemble | 1.5607 1.2992 0.4116 0.5269 | 1.1225 1.3837 0.2860 0.2304
Sundial-Base | 1.0009 1.1924 0.4007 0.3369 | 1.2145 1.3991 0.2893 0.2497
TabPFN-TS | 0.9312 1.1751 0.3858 0.3067 | 1.1863 1.5308 0.3106 0.2398
TimesFM-2.5 | 0.8769 1.1226 0.3715 0.2904 | 1.2051 1.5147 0.3069 0.2425
TiRex | 0.8650 1.1062 0.3698 0.2890 | 1.1280 1.4312 0.2942 0.2322
Toto-1.0 | 0.8700 1.1112 0.3658 0.2863 | 1.2602 1.5881 0.3233 0.2553
Ours | 0.8732 1.1224 0.3685 0.2794 | 1.1122 1.3667 0.2881 0.2317

Table 17. Average results on load, photovoltaic, and all real application datasets. The best results are highlighted in bold, and the second-best results are underlined.
Model | Load datasets (18): SQL MASE WAPE WQL | Photovoltaic datasets (9): SQL MASE WAPE WQL | All real-world application datasets (27): SQL MASE WAPE WQL
AutoARIMA | 2.5730 0.8526 0.0395 0.1289 | 3.5350 0.7858 0.2366 1.0570 | 2.8936 0.8303 0.1052 0.4383
AutoETS | 7.4514 6.4652 0.2931 0.3218 | 22.4679 5.7401 1.7096 6.7013 | 12.4569 6.2235 0.7653 2.4483
AutoTheta | 0.8977 1.2644 0.0632 0.0385 | 38.8877 6.6861 1.9921 11.5991 | 13.5611 3.0717 0.7062 3.8920
Chronos-Bolt | 0.4415 0.5491 0.0253 0.0203 | 0.6471 0.8066 0.2424 0.1945 | 0.5101 0.6350 0.0976 0.0783
Moirai-2.0 | 0.7621 0.9701 0.0460 0.0360 | 0.6175 0.7222 0.2175 0.1859 | 0.7139 0.8875 0.1031 0.0860
Naive | 2.4675 3.0599 0.1479 0.1182 | 4.9932 5.7400 1.7095 1.4883 | 3.3094 3.9533 0.6684 0.5749
Seasonal Naive | 1.0799 1.2065 0.0596 0.0499 | 6.2536 0.7382 0.2221 1.8697 | 2.8045 1.0504 0.1138 0.6565
Stat. Ensemble | 1.6205 2.2320 0.1051 0.0760 | 3.2080 3.2719 0.9756 0.9570 | 2.1497 2.5786 0.3952 0.3696
Sundial-Base | 0.7536 0.9411 0.0450 0.0360 | 0.7387 0.8304 0.2500 0.2225 | 0.7486 0.9042 0.1133 0.0981
TabPFN-TS | 0.5120 0.6817 0.0325 0.0242 | 0.3750 0.4710 0.1415 0.1127 | 0.4663 0.6115 0.0688 0.0537
TimesFM-2.0 | 0.9410 1.0132 0.0506 0.0469 | 0.6434 0.7572 0.2266 0.1925 | 0.8418 0.9278 0.1093 0.0954
TiRex | 0.5544 0.6896 0.0336 0.0270 | 0.6042 0.7490 0.2253 0.1813 | 0.5710 0.7094 0.0975 0.0784
Toto-1.0 | 0.8888 1.1359 0.0557 0.0430 | 0.7465 0.8981 0.2694 0.2241 | 0.8414 1.0566 0.1269 0.1034
Ours | 0.4089 0.5182 0.0219 0.0174 | 0.3743 0.4504 0.1352 0.1121 | 0.3974 0.4956 0.0597 0.0490

Table 18. Summary of the 10 representative univariate forecasting tasks selected from fev-bench-mini, collectively referred to as fev-bench-uni. Datasets with multiple targets (# targets > 1) are decomposed into independent univariate series for evaluation.
Columns: Task, Domain, Frequency, # items, median length, # obs, # known dynamic cols, HW, # seasonality, # targets.

ETT15Tenergy15 Min269,680975,52009620967
ETT1Henergy1 H217,420243,880016820247
bizitobs l2c5Tcloud5 Min131,968223,7760288202887
boomlet619cloud1 Min116,384851,96806020144052
boomlet1282cloud1 Min116,384573,44006020144035
boomlet 1676cloud30 Min110,4631,046,3000962048100
hospitaladmissions1DhealthcareDaily81,73113,8460282071
hospitaladmissions1WhealthcareWeekly82461,9680131611
jena weather1Hnature1 H18,784184,464024202421
MDENSE1DmobilityDaily3073021,9000281071

to univariate time series forecasting, we formulate the input with a dummy covariate (e.g., a column of zeros) alongside intrinsic temporal features such as year, month, day, and hour to preserve compatibility with the model's covariate-aware architecture. Despite the absence of meaningful external covariates, Baguan-TS exhibits highly competitive performance on univariate forecasting tasks. Notably, it achieves the best results in both WAPE and WQL (see Table 19 and Fig. 21). This strong performance underscores the model's robustness in capturing distributional characteristics and its reliability in high-stakes or extreme-value scenarios, where WAPE weights errors proportionally to actual magnitudes and WQL explicitly captures quantile-based risk. Detailed experimental results are presented in Tables 20–23.

Table 19. Average results on fev-bench-uni. The best results are highlighted in bold.

Model | SQL MASE WAPE WQL
Chronos-Bolt | 0.6083 0.7369 0.3974 0.3438
Moirai-2.0 | 0.5405 0.6728 0.3254 0.2735
Sundial-Base | 0.5814 0.6833 0.3338 0.2902
TabPFN-TS | 0.5920 0.7364 0.3384 0.2685
TimesFM-2.5 | 0.5428 0.6831 0.3341 0.2726
TiRex | 0.5684 0.7104 0.3858 0.3215
Toto-1.0 | 0.5689 0.7147 0.3635 0.2992
AutoARIMA | 0.6913 0.8301 0.5843 0.5548
Stat. Ensemble | 0.7575 0.8371 0.5132 0.5844
AutoETS | 0.9083 0.9519 0.4877 0.6018
AutoTheta | 0.9409 0.8532 0.4972 0.6327
Seasonal Naive | 1.0381 1.0373 0.6649 0.7433
Naive | 1.3600 1.0902 0.5254 0.7872
Ours | 0.5664 0.7126 0.3105 0.2497

Figure 21. Probabilistic forecasting evaluation on fev-bench-uni using WAPE and WQL.

Table 20. Performance Comparison on ETT15T and ETT1H. The best results are highlighted in bold, and the second-best results are underlined.

Model | 15T: SQL MASE WAPE WQL | 1H: SQL MASE WAPE WQL
AutoARIMA | — | 1.0524 1.2616 0.2730 0.2275
AutoETS | 1.2626 1.4293 0.2090 0.1890 | 1.7645 1.6021 0.3240 0.3554
AutoTheta | 1.0985 0.8022 0.1328 0.1702 | 2.0613 1.2847 0.2708 0.3879
Chronos-Bolt | 0.5737 0.7033 0.1146 0.0933 | 0.9436 1.1267 0.2465 0.2059
Moirai-2.0 | 0.5634 0.7119 0.1160 0.0915 | 0.8993 1.1240 0.2470 0.1952
Naive | 1.3269 1.3671 0.2028 0.2051 | 2.3137 1.7184 0.3389 0.4903
Seasonal Naive | 0.7625 0.9169 0.1473 0.1224 | 1.2069 1.3227 0.2864 0.2597
Stat. Ensemble | — | 1.2717 1.2519 0.2621 0.2620
Sundial-Base | 0.5971 0.7139 0.1157 0.0970 | 0.9634 1.1439 0.2548 0.2147
TabPFN-TS | 0.6024 0.7625 0.1222 0.0966 | 0.9332 1.1774 0.2553 0.2034
TimesFM-2.5 | 0.5772 0.7295 0.1167 0.0925 | 0.8823 1.1239 0.2452 0.1916
TiRex | 0.5683 0.7188 0.1158 0.0915 | 0.8736 1.1178 0.2477 0.1924
Toto-1.0 | 0.5930 0.7578 0.1216 0.0952 | 0.8727 1.1129 0.2432 0.1900
Ours | 0.6121 0.7810 0.1253 0.0980 | 0.9036 1.1549 0.2566 0.2022

Table 21. Performance Comparison on hospitaladmissions1D and hospitaladmissions1W. The best results are highlighted in bold, and the second-best results are underlined.
Model | 1D: SQL MASE WAPE WQL | 1W: SQL MASE WAPE WQL
AutoARIMA | 0.5556 0.7209 0.5348 0.4122 | 0.5793 0.7541 0.2123 0.1631
AutoETS | 0.5558 0.7211 0.5350 0.4123 | 0.5783 0.7541 0.2123 0.1628
AutoTheta | 0.5748 0.7429 0.5510 0.4264 | 0.5977 0.7779 0.2191 0.1683
Chronos-Bolt | 0.5562 0.7195 0.5337 0.4125 | 0.5868 0.7623 0.2146 0.1653
Moirai-2.0 | 0.5556 0.7188 0.5332 0.4121 | 0.5862 0.7643 0.2153 0.1652
Naive | 1.3251 0.9747 0.7243 0.9837 | 1.0492 1.0436 0.2934 0.2969
Seasonal Naive | 0.8572 1.0268 0.7622 0.6361 | 1.0492 1.0436 0.2934 0.2969
Stat. Ensemble | 0.5570 0.7214 0.5352 0.4131 | 0.5789 0.7552 0.2126 0.1630
Sundial-Base | 0.6103 0.7225 0.5359 0.4526 | 0.6411 0.7599 0.2139 0.1802
TabPFN-TS | 0.5623 0.7245 0.5375 0.4171 | 0.5814 0.7534 0.2123 0.1638
TimesFM-2.5 | 0.5561 0.7193 0.5336 0.4125 | 0.5795 0.7545 0.2124 0.1632
TiRex | 0.5551 0.7180 0.5326 0.4118 | 0.5851 0.7610 0.2142 0.1648
Toto-1.0 | 0.5555 0.7173 0.5321 0.4120 | 0.5976 0.7781 0.2188 0.1681
Ours | 0.5546 0.7167 0.5317 0.4114 | 0.6488 0.8405 0.2368 0.1827

Table 22. Performance Comparison on boomlet619, boomlet1676, and boomlet1282. The best results are highlighted in bold, and the second-best results are underlined.

Model | 619: SQL MASE WAPE WQL | 1676: SQL MASE WAPE WQL | 1282: SQL MASE WAPE WQL
AutoARIMA | 0.5545 0.7170 0.5880 0.4527 | — | 0.5565 0.5925 0.4050 0.3750
AutoETS | 0.8944 1.0938 0.9529 0.7636 | 0.7564 0.7610 0.3540 0.3561 | 0.9136 0.6890 0.4705 0.6261
AutoTheta | 0.8352 1.0200 0.8821 0.6993 | 0.7835 0.7772 0.3606 0.3690 | 0.9717 0.6910 0.4718 0.6673
Chronos-Bolt | 0.4709 0.5941 0.4683 0.3764 | 0.6077 0.7192 0.3289 0.2758 | 0.4618 0.5647 0.3918 0.3174
Moirai-2.0 | 0.3294 0.4306 0.3062 0.2356 | 0.5727 0.7128 0.3274 0.2580 | 0.4269 0.5230 0.3639 0.2944
Naive | 1.2752 1.1236 0.9459 0.9679 | 1.3196 0.8199 0.3738 0.5376 | 1.5394 0.8299 0.5674 1.0473
Seasonal Naive | 1.2752 1.1236 0.9459 0.9679 | 0.8504 0.9529 0.4490 0.3899 | 1.5394 0.8299 0.5674 1.0473
Stat. Ensemble | 0.7772 1.0047 0.8714 0.6586 | — | 0.7391 0.6398 0.4330 0.4977
Sundial-Base | 0.3705 0.4356 0.3110 0.2649 | 0.6146 0.7138 0.3250 0.2744 | 0.4522 0.5169 0.3604 0.3126
TabPFN-TS | 0.3305 0.4305 0.3036 0.2349 | 0.8311 0.9587 0.3439 0.2853 | 0.4253 0.5067 0.3518 0.2921
TimesFM-2.5 | 0.3398 0.4364 0.3117 0.2458 | 0.5626 0.6982 0.3195 0.2520 | 0.4034 0.4938 0.3418 0.2770
TiRex | 0.3411 0.4418 0.3174 0.2471 | 0.5712 0.7093 0.3220 0.2548 | 0.4089 0.4966 0.3447 0.2814
Toto-1.0 | 0.3099 0.4032 0.2789 0.2163 | 0.5544 0.6907 0.3174 0.2495 | 0.4069 0.4983 0.3444 0.2786
Ours | 0.3399 0.4391 0.3128 0.2445 | 0.6195 0.7691 0.3482 0.2760 | 0.4169 0.5077 0.3529 0.2877

Table 23. Performance Comparison on bizitobs l2c5T, jena weather1H, and MDENSE1D. The best results are highlighted in bold, and the second-best results are underlined.

Model | bizitobs l2c5T: SQL MASE WAPE WQL | jena weather1H: SQL MASE WAPE WQL | MDENSE1D: SQL MASE WAPE WQL
AutoARIMA | 0.8247 0.9163 1.5980 1.8772 | 0.4370 0.5092 0.9375 0.8257 | 0.9702 1.1690 0.1261 0.1051
AutoETS | 0.7311 0.8137 1.3367 1.6114 | 0.5529 0.5678 0.3522 1.4085 | 1.0735 1.0869 0.1301 0.1324
AutoTheta | 0.7496 0.8740 1.6561 1.6069 | 0.5933 0.4846 0.2955 1.6859 | 1.1435 1.0772 0.1322 0.1459
Chronos-Bolt | 0.7570 0.8002 1.3254 1.2750 | 0.3668 0.4543 0.2519 0.2341 | 0.7590 0.9249 0.0985 0.0817
Moirai-2.0 | 0.3675 0.4035 0.7997 0.7682 | 0.3652 0.4505 0.2487 0.2342 | 0.7390 0.8884 0.0962 0.0807
Naive | 0.6779 0.7578 1.2819 1.4508 | 0.5789 0.5375 0.3286 1.6602 | 2.1937 1.7292 0.1970 0.2323
Seasonal Naive | 0.9226 1.0667 2.2838 2.2878 | 0.6550 0.7399 0.7712 1.2878 | 1.2627 1.3500 0.1420 0.1368
Stat. Ensemble | 0.7197 0.8069 1.3437 1.5129 | 0.4516 0.4664 0.3279 1.0549 | 0.9648 1.0509 0.1193 0.1131
Sundial-Base | 0.4004 0.4808 0.6893 0.5858 | 0.3803 0.4459 0.4352 0.4351 | 0.7845 0.9000 0.0968 0.0844
TabPFN-TS | 0.4855 0.6193 0.8823 0.6797 | 0.4128 0.5111 0.2825 0.2343 | 0.7560 0.9195 0.0927 0.0776
TimesFM-2.5 | 0.4611 0.5787 0.9057 0.7905 | 0.3589 0.4392 0.2604 0.2233 | 0.7075 0.8576 0.0935 0.0774
TiRex | 0.6789 0.7959 1.4136 1.2725 | 0.3557 0.4379 0.2507 0.2168 | 0.7460 0.9066 0.0993 0.0822
Toto-1.0 | 0.5954 0.7136 1.2221 1.0847 | 0.3616 0.4476 0.2468 0.2069 | 0.8418 1.0279 0.1093 0.0907
Ours | 0.4360 0.5379 0.5545 0.4372 | 0.3949 0.4827 0.2883 0.2758 | 0.7374 0.8962 0.0976 0.0812
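The dummy-covariate formulation used above to adapt the covariate-aware model to fev-bench-uni can be sketched as follows. This is a minimal sketch: the column names and the exact calendar features are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
import pandas as pd

def univariate_to_covariate_input(series: pd.Series) -> pd.DataFrame:
    """Wrap a univariate series for a covariate-aware model: add a
    dummy covariate (a column of zeros) plus intrinsic temporal
    features derived from the timestamp index."""
    idx = series.index  # assumed to be a DatetimeIndex
    return pd.DataFrame({
        "target": series.values,
        "dummy": np.zeros(len(series)),  # placeholder external covariate
        "year": idx.year,
        "month": idx.month,
        "day": idx.day,
        "hour": idx.hour,
    }, index=idx)

ts = pd.Series(np.arange(4.0),
               index=pd.date_range("2024-01-01", periods=4, freq="h"))
X = univariate_to_covariate_input(ts)
assert list(X.columns) == ["target", "dummy", "year", "month", "day", "hour"]
assert (X["dummy"] == 0).all()
```

The all-zero column carries no signal; its only role is to keep the input shape compatible with the covariate-aware architecture, while the calendar features supply the intrinsic temporal information mentioned in the text.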