
Paper deep dive

Permutation-Equivariant 2D State Space Models: Theory and Canonical Architecture for Multivariate Time Series

Seungwoo Jeong, Heung-Il Suk

Year: 2026 · Venue: arXiv preprint · Area: stat.ML · Type: Preprint · Embeddings: 98

Abstract

Multivariate time series (MTS) modeling often implicitly imposes an artificial ordering over variables, violating the inherent exchangeability found in many real-world systems where no canonical variable axis exists. We formalize this limitation as a violation of the permutation symmetry principle and require state-space dynamics to be permutation-equivariant along the variable axis. In this work, we theoretically characterize the complete canonical form of linear variable coupling under this symmetry constraint. We prove that any permutation-equivariant linear 2D state-space system naturally decomposes into local self-dynamics and a global pooled interaction, rendering ordered recurrence not only unnecessary but structurally suboptimal. Motivated by this theoretical foundation, we introduce the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM), which realizes the canonical equivariant form via permutation-invariant aggregation. This formulation eliminates sequential dependency chains along the variable axis, reducing the dependency depth from $\mathcal{O}(C)$ to $\mathcal{O}(1)$ and simplifying stability analysis to two scalar modes. Furthermore, we propose VI 2D Mamba, a unified architecture integrating multi-scale temporal dynamics and spectral representations. Extensive experiments on forecasting, classification, and anomaly detection benchmarks demonstrate that our model achieves state-of-the-art performance with superior structural scalability, validating the theoretical necessity of symmetry-preserving 2D modeling.

Tags

ai-safety (imported, 100%) · preprint (suggested, 88%) · statml (suggested, 92%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:30:01 AM

Summary

The paper introduces the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM) and VI 2D Mamba, which address the artificial ordering bias in multivariate time series modeling. By formalizing permutation equivariance along the variable axis, the authors prove that linear 2D state-space systems decompose into local self-dynamics and global pooled interactions. This approach reduces dependency depth from O(C) to O(1), enables full parallelization, and simplifies stability analysis to two scalar modes.

Entities (5)

Multivariate Time Series · data-domain · 100%
VI 2D Mamba · model-architecture · 100%
VI 2D SSM · model-architecture · 100%
Permutation Equivariance · mathematical-principle · 95%
Roesser Model · theoretical-framework · 90%

Relation Signals (3)

VI 2D SSM implements Canonical Equivariant Form

confidence 95% · we introduce the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM), which realizes the canonical equivariant form via permutation-invariant aggregation.

VI 2D Mamba integrates VI 2D SSM

confidence 95% · we propose VI 2D Mamba, a unified architecture that integrates the proposed VI 2D SSM with multi-scale temporal pathways

Permutation Equivariance reduces Dependency Depth

confidence 95% · By eliminating ordered recurrence, we reduce the dependency depth along the variable axis from O(C) to O(1)

Cypher Suggestions (2)

Find all models that implement the canonical equivariant form · confidence 90% · unvalidated

MATCH (m:Model)-[:IMPLEMENTS]->(t:Theory {name: 'Canonical Equivariant Form'}) RETURN m.name

Map the relationship between architectures and their theoretical foundations · confidence 85% · unvalidated

MATCH (a:Architecture)-[r:BASED_ON]->(t:Theory) RETURN a.name, r.type, t.name

Full Text

97,321 characters extracted from source content.


Permutation-Equivariant 2D State Space Models: Theory and Canonical Architecture for Multivariate Time Series

Seungwoo Jeong and Heung-Il Suk. S. Jeong is with the Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea (e-mail: sw_jeong@korea.ac.kr). H.-I. Suk is with the Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea, and is the corresponding author (e-mail: hisuk@korea.ac.kr).

Abstract. Multivariate time series (MTS) modeling often implicitly imposes an artificial ordering over variables, violating the inherent exchangeability found in many real-world systems where no canonical variable axis exists. We formalize this limitation as a violation of the permutation symmetry principle and require state-space dynamics to be permutation-equivariant along the variable axis. In this work, we theoretically characterize the complete canonical form of linear variable coupling under this symmetry constraint. We prove that any permutation-equivariant linear 2D state-space system naturally decomposes into local self-dynamics and a global pooled interaction, rendering ordered recurrence not only unnecessary but structurally suboptimal. Motivated by this theoretical foundation, we introduce the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM), which realizes the canonical equivariant form via permutation-invariant aggregation. This formulation eliminates sequential dependency chains along the variable axis, reducing the dependency depth from $\mathcal{O}(C)$ to $\mathcal{O}(1)$ and simplifying stability analysis to two scalar modes. Furthermore, we propose VI 2D Mamba, a unified architecture integrating multi-scale temporal dynamics and spectral representations.
Extensive experiments on forecasting, classification, and anomaly detection benchmarks demonstrate that our model achieves state-of-the-art performance with superior structural scalability, validating the theoretical necessity of symmetry-preserving 2D modeling.

I Introduction

Time series data are ubiquitous across scientific and industrial domains, spanning climate science [34], finance [40], and biomedical signals [15]. These data streams are characterized by complex temporal dynamics, seasonal regularities, and, crucially, inter-variable dependencies among multiple variables. In the multivariate setting, effective modeling requires capturing three distinct structural properties: (i) the coexistence of long-term trends and short-term fluctuations, (ii) dynamic and evolving correlations among variables, and (iii) distinct spectral manifestations of these patterns in the frequency domain [38]. Developing model architectures that simultaneously address these multi-scale temporal, inter-variable, and spectral patterns, while remaining computationally tractable over long horizons, remains a fundamental challenge in forecasting [56], classification [16], and anomaly detection [30].

To address these challenges, deep learning architectures have evolved significantly. While Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) effectively capture local and autoregressive patterns, they often struggle to model long-range dependencies [39, 9]. Transformers have mitigated this issue via global self-attention [44], yet their quadratic complexity with respect to sequence length limits scalability [42]. Recently, State Space Models (SSMs) have emerged as a compelling alternative, offering structured recurrences with linear time complexity [11]. Notably, Mamba [10] has demonstrated state-of-the-art performance by introducing input-selective state transitions.
However, conventional SSMs are inherently one-dimensional; they evolve solely along the temporal axis, lacking an explicit mechanism to model the dynamic interactions between variables. This limitation has motivated the development of Two-Dimensional SSMs (2D SSMs) [53, 3], which extend recurrence to both temporal and variable axes. Despite their improved expressivity, existing 2D SSMs typically rely on sequential scanning along the variable axis (Fig. 1). This approach implicitly treats variable indices as ordered coordinates, imposing a Markov chain structure where none may exist. While such an ordering is natural in spatial domains like image processing, multivariate time series generally satisfy exchangeability: variable indices act as identifiers rather than geometric coordinates. Consequently, sequential scanning introduces an artificial spatial inductive bias, rendering the model sensitive to variable permutation and inducing sequential dependency chains that inhibit parallel computation.

Figure 1: Left: 1D SSM models only along the temporal axis, overlooking dependencies across variables. Middle: Conventional 2D SSM captures inter-variable correlations through sequential scans, but still imposes artificial ordering and struggles with distant relationships. Right: Our method performs global aggregation over variables, enabling simultaneous and order-free modeling of inter-variable relationships over time.

In this work, we address this fundamental mismatch by formalizing the permutation symmetry principle for multivariate time series. We posit that a well-specified multivariate dynamical model must be permutation-equivariant along the variable axis. Under this symmetry constraint, we theoretically characterize the complete class of admissible linear inter-variable couplings. Specifically, we prove that any permutation-equivariant linear 2D state-space system necessarily decomposes into local self-dynamics and a global pooled interaction term.
This characterization reveals that the ordered recurrence employed by prior 2D SSMs is not merely unnecessary but structurally incompatible with the inherent symmetry of the data. Guided by this theoretical insight, we introduce the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM). Instead of performing sequential updates across variables, our formulation employs a permutation-invariant aggregation mechanism to construct a global descriptor that mediates inter-variable interactions. We demonstrate that this architecture is a direct realization of the canonical equivariant form derived from our theoretical characterization.

Beyond theoretical consistency, permutation symmetry yields significant computational advantages. By eliminating ordered recurrence, we reduce the dependency depth along the variable axis from $\mathcal{O}(C)$ to $\mathcal{O}(1)$, enabling fully parallelized computation across variables. Furthermore, the symmetry-constrained coupling simplifies stability analysis, allowing us to decompose the dynamics into mean and difference modes governed by just two scalar constraints.

Building on this foundation, we propose VI 2D Mamba, a unified architecture that integrates the proposed VI 2D SSM with multi-scale temporal pathways (long- and short-term) and a frequency-domain branch via adaptive gating. This design ensures comprehensive coverage of heterogeneous temporal and spectral structures while strictly preserving permutation invariance. Extensive experiments on forecasting, classification, and anomaly detection benchmarks demonstrate that the proposed method achieves competitive or superior performance compared to state-of-the-art baselines. These results validate the theoretical necessity of symmetry-preserving modeling and highlight its practical benefits in terms of scalability and robustness.

Our contributions are summarized as follows:

• Permutation symmetry formalization for multivariate 2D SSMs.
We formalize multivariate time-series modeling under the principle of variable-axis exchangeability. We establish permutation equivariance as a fundamental constraint for valid 2D state-space dynamics in non-spatial domains.

• Canonical characterization of equivariant coupling. We provide a theoretical proof that any permutation-equivariant linear inter-variable state coupling admits a unique canonical decomposition into self-dynamics and global pooled interaction. This result fully characterizes the admissible class of symmetry-compatible linear 2D dynamics.

• Symmetry-preserving 2D state-space realization. We introduce VI 2D SSM, which implements the canonical equivariant form via permutation-invariant aggregation. We show that this formulation eliminates artificial dependency chains and simplifies stability analysis to verifiable scalar conditions.

• Structural scalability and empirical validation. We demonstrate that our symmetry-constrained approach reduces variable-axis dependency depth from $\mathcal{O}(C)$ to $\mathcal{O}(1)$, enabling full parallelization. Empirical evaluations across diverse benchmarks confirm our proposed model's superior efficiency and performance.

II Related Works

II-A Neural Models for Time Series Modeling

Research on deep learning for multivariate time series has evolved from modeling purely temporal dynamics to explicitly capturing cross-variable dependencies [38]. Early CNN- and RNN-based approaches extracted local patterns and autoregressive structures but were limited in modeling long-range dependencies [22, 39, 9, 12]. Transformer-based architectures advanced global context modeling via self-attention, yet their quadratic complexity with sequence length poses scalability challenges despite sparsification and kernelization efforts [18]. To improve efficiency, convolutional alternatives [30] and SSMs [11, 10, 37] have been proposed, offering linear or near-linear complexity.
However, most existing formulations evolve solely along the temporal axis and do not explicitly model time-varying cross-variable interactions. To address this, recent extensions of Mamba, including TimePro, CMamba [52], Simba [37], and GrootVT [49], incorporate variable-level priors, often through heuristic coupling mechanisms. This has further motivated 2D SSMs, inspired by classical formulations such as Roesser's model [21]. Recent variants [3, 53, 4] couple temporal and variable dynamics to better capture multivariate correlations. Among them, Chimera [4] improves cross-variable modeling but relies on sequential scanning along the variable axis, thereby introducing ordering dependence and limiting parallelism. These limitations motivate the need for symmetry-consistent 2D state-space formulations, which we address through permutation-invariant variable coupling.

II-B Permutation-Invariant and Equivariant Modeling

Permutation symmetry has been extensively studied in machine learning under the framework of invariant and equivariant neural networks. Deep Sets [50] established that any permutation-invariant function over a set can be decomposed into a sum-aggregation followed by a transformation, providing a canonical representation for set functions. Subsequent works extended this principle to permutation-equivariant mappings [5], as well as attention-based architectures such as Set Transformer [23], which model interactions over unordered elements. Permutation equivariance has also been studied more broadly in group-equivariant neural networks, where symmetry constraints are enforced through structured operators and representation theory [7, 20]. Unlike prior works that focus on static function approximation over sets, our work studies permutation symmetry in the context of dynamical state-space systems for multivariate time series.
Rather than constructing invariant feature mappings, we characterize the complete class of linear permutation-equivariant inter-variable dynamics within 2D SSMs.

III Preliminaries

We consider a multivariate time series $X \in \mathbb{R}^{C \times T}$ consisting of $C$ variates and $T$ time steps. We index the temporal dimension by $t \in \{1, \dots, T\}$ and the variable dimension by $c \in \{1, \dots, C\}$, where the observation at position $(t, c)$ is denoted by $x(t, c)$. Our goal is to learn a representation mapping $f_\theta$ that captures both temporal dynamics and inter-variable dependencies. Motivated by the success of structured SSMs in sequence modeling, we adopt a 2D state-space formulation. This architecture maintains two coupled latent states: a horizontal state $h_h(t, c) \in \mathbb{R}^{d_h}$ describing evolution along the temporal axis, and a vertical state $h_v(t, c) \in \mathbb{R}^{d_v}$ describing evolution along the variable axis.

III-A Continuous-Time 2D State Space Models

The theoretical foundation of 2D SSMs lies in the Roesser model [21], which generalizes standard linear systems to two dimensions. In the continuous domain, the joint evolution of the horizontal and vertical states is governed by the following coupled partial differential equations (PDEs):

$\frac{\partial h_h(t,c)}{\partial t} = A_h h_h(t,c) + A_{hv} h_v(t,c) + B_h x(t,c)$, (1)

$\frac{\partial h_v(t,c)}{\partial c} = A_v h_v(t,c) + A_{vh} h_h(t,c) + B_v x(t,c)$, (2)

$y(t,c) = C_h h_h(t,c) + C_v h_v(t,c)$. (3)

Here, $A_\bullet$ and $B_\bullet$ are state transition and input projection matrices, respectively. The cross-coupling terms $A_{hv}$ and $A_{vh}$ allow information to flow between the temporal and variable axes, enabling the model to capture complex spatiotemporal correlations.

III-B Discretization and the Ordering Problem

To apply this continuous framework to digital signals, the system must be discretized.
Recent selective-scan models, such as Mamba [10] and Chimera [4], employ the Zero-Order Hold (ZOH) method. For the temporal axis, given a sampling step $\Delta_t$, the continuous parameters $(A_h, B_h)$ are transformed into discrete parameters $(\bar{A}_h, \bar{B}_h)$ via $\bar{A}_h = \exp(\Delta_t A_h)$ and $\bar{B}_h = (\Delta_t A_h)^{-1}(\exp(\Delta_t A_h) - I)\,\Delta_t B_h$. This discretization naturally yields a recurrence relation. Existing 2D SSMs extend this recurrence to the variable axis as well:

$h_h[t+1, c] = \bar{A}_h h_h[t, c] + \bar{B}_h x[t+1, c]$, (4)

$h_v[t, c+1] = \bar{A}_v h_v[t, c] + \bar{B}_v x[t, c+1]$, (5)

$y[t, c] = C_h h_h[t, c] + C_v h_v[t, c]$. (6)

While this sequential scanning is intuitive for spatial data (e.g., images), where $c$ and $c+1$ are geometric neighbors, it imposes a fundamentally flawed inductive bias on multivariate time series.

• Artificial Ordering: Unlike the temporal axis, which possesses a strict causal order ($t \to t+1$), the variable axis in most multivariate datasets is exchangeable. The index $c$ serves merely as an identifier, not a coordinate. Imposing an update rule $h_v[c] \to h_v[c+1]$ creates an artificial dependency chain, making the model sensitive to the permutation of input variables.

• Sequential Bottleneck: The recurrence along $c$ forces sequential computation, preventing parallelization across variables and limiting scalability to high-dimensional systems.

These limitations motivate our proposed formulation, which replaces the ordered vertical recurrence with a permutation-invariant global coupling mechanism.

IV Symmetry-Constrained 2D State Space Model

In this section, we formalize the symmetry principle underlying variable-invariant modeling and characterize the canonical form of permutation-equivariant inter-variable coupling.
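The ZOH transformation above can be reproduced directly with a matrix exponential. The following is a minimal numerical sketch (not the authors' implementation); it assumes $A$ is invertible, which holds for the Hurwitz transition matrices considered later:

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, dt):
    """Zero-Order Hold discretization:
    A_bar = exp(dt A),  B_bar = (dt A)^{-1} (exp(dt A) - I) dt B."""
    n = A.shape[0]
    A_bar = expm(dt * A)
    # solve (dt A) Z = (A_bar - I) rather than forming an explicit inverse
    B_bar = np.linalg.solve(dt * A, A_bar - np.eye(n)) @ (dt * B)
    return A_bar, B_bar
```

For a Hurwitz $A$ (all eigenvalues with negative real part), the resulting $\bar{A}$ has spectral radius below 1 for any step $\Delta > 0$, consistent with the stability discussion later in the paper.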
Building upon this theoretical foundation, we then construct the Variable-Invariant Two-Dimensional State Space Model (VI 2D SSM) as a structural realization of the derived canonical form. Finally, we present the VI 2D Mamba architecture, which extends this symmetry-preserving mechanism into a multi-scale spectral-temporal deep learning framework.

IV-A Symmetry Principle and Canonical Coupling

We first establish the symmetry principle governing permutation-equivariant modeling and derive the canonical form of admissible inter-variable coupling. We then examine the resulting computational structure and stability properties, laying the theoretical groundwork for the Variable-Invariant 2D SSM introduced in the subsequent section.

Exchangeable Dynamical Systems. We formalize the modeling assumption under which permutation symmetry is desirable. We then characterize the class of admissible inter-variable dynamics consistent with permutation equivariance.

Definition 1 (Variable-Axis Exchangeability). We say that the data-generating distribution of the multivariate time series $X$ is exchangeable along the variable axis if $X \overset{d}{=} X^{\pi}$ for all $\pi \in S_C$. That is, permuting the variable indices does not change the joint distribution. This reflects the absence of a canonical ordering among variables.

In many multivariate domains, variable indices serve as identifiers rather than ordered spatial coordinates. Thus, modeling assumptions that impose sequential structure along the variable axis introduce artificial inductive bias. We therefore adopt the following modeling principle:

Assumption 1 (Symmetry Principle). Under exchangeability, a well-specified model should be permutation-equivariant along the variable axis.

Conventional 2D state-space models impose ordered recurrence across variables, thereby introducing index-sensitive structure. In contrast, the proposed formulation satisfies permutation equivariance by construction.
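As a concrete illustration of the symmetry principle, a pooled summary of variable-shared projections is unchanged by any reordering of the variable axis. This is a toy sketch; the mean aggregator and the projection `W_v` are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 8, 4
z = rng.normal(size=(C, d))    # per-variable features at one time step
W_v = rng.normal(size=(d, d))  # variable-shared projection (same for every c)

psi = (z @ W_v.T).mean(axis=0)            # pooled global summary
perm = rng.permutation(C)                 # arbitrary relabeling of variables
psi_perm = (z[perm] @ W_v.T).mean(axis=0)

assert np.allclose(psi, psi_perm)  # summary is invariant to variable order
```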
Expressive Power over Exchangeable Linear 2D Systems. We now characterize the class of linear variable-coupled state updates that satisfy permutation equivariance along the variable axis. Our goal is to determine the canonical form of linear dynamics that are compatible with the symmetry principle. We consider linear vertical-state updates of the form

$h_v(t+1, c) = \sum_{j=1}^{C} M_{c,j}\, h_v(t, j) + \sum_{j=1}^{C} N_{c,j}\, x(t, j)$, (7)

where $M, N \in \mathbb{R}^{C \times C}$ govern inter-variable state/input coupling. This represents the general linear update across the variable axis at a fixed time step. Applying a permutation to both sides of the update yields the algebraic condition:

$M P_\pi = P_\pi M, \quad N P_\pi = P_\pi N, \quad \forall \pi \in S_C$. (8)

Thus, both $M$ and $N$ must commute with every permutation matrix. We now characterize the structure of permutation-equivariant linear state coupling along the variable axis. For clarity, we analyze the state coupling matrix $M$ and omit the input coupling term.

Theorem 1 (Characterization of Permutation-Commuting Matrices). Let $M \in \mathbb{R}^{C \times C}$. Then

$M P_\pi = P_\pi M \quad \forall \pi \in S_C$ (9)

if and only if

$M = \alpha I_C + \beta \mathbf{1}\mathbf{1}^{\top}$ (10)

for some scalars $\alpha, \beta \in \mathbb{R}$, where $I_C$ is the identity matrix and $\mathbf{1}$ is the column vector of all ones. In particular, any linear permutation-equivariant state coupling must take the form

$h_v(t+1, c) = \alpha\, h_v(t, c) + \beta \sum_{j=1}^{C} h_v(t, j) + N x(t, c)$. (11)

The proof is provided in Appendix B-C. This result characterizes the canonical form of linear inter-variable coupling under permutation equivariance. In the next section, we construct a state-space realization that adheres to this structure. Permutation symmetry uniquely constrains admissible linear inter-variable dynamics to the canonical form $\alpha I_C + \beta \mathbf{1}\mathbf{1}^{\top}$.

Computational Structure. Beyond expressivity, permutation symmetry fundamentally alters the computational structure along the variable axis.
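Theorem 1 is easy to check numerically: matrices of the form $\alpha I_C + \beta \mathbf{1}\mathbf{1}^{\top}$ commute with every permutation matrix, a generic matrix does not, and the spectrum consists of exactly the two modes used later in the stability analysis ($\alpha$ with multiplicity $C-1$, and $\alpha + C\beta$). A small sanity-check sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
C, alpha, beta = 6, 0.5, 0.1
M = alpha * np.eye(C) + beta * np.ones((C, C))  # canonical equivariant form

# M commutes with random permutation matrices
for _ in range(5):
    P = np.eye(C)[rng.permutation(C)]
    assert np.allclose(M @ P, P @ M)

# a generic coupling matrix fails the same test
M_bad = rng.normal(size=(C, C))
P_shift = np.eye(C)[np.roll(np.arange(C), 1)]
assert not np.allclose(M_bad @ P_shift, P_shift @ M_bad)

# spectrum: difference mode alpha (C-1 times) and mean mode alpha + C*beta
eigs = np.sort(np.linalg.eigvalsh(M))
assert np.allclose(eigs, [alpha] * (C - 1) + [alpha + C * beta])
```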
Theorem 2 (Reduction of Variable-Axis Dependency Depth). Consider a 2D state-space system with $C$ variables.

1. If inter-variable coupling is implemented via ordered recurrence (e.g., $h_v(t, c+1)$ depends on $h_v(t, c)$), then the update at each fixed time $t$ induces a dependency chain of length $C$ along the variable axis.

2. In contrast, any formulation that mediates inter-variable interaction through a permutation-invariant aggregated descriptor $\psi(t)$ admits $\mathcal{O}(1)$ variable-axis dependency depth, since no sequential dependency is introduced once $\psi(t)$ is computed.

This structural reduction directly enables parallel updates across variables, a property that will be leveraged by the proposed VI 2D SSM and validated empirically in Section VI-D.

Stability of Discretized Dynamics. We analyze stability of the discretized state updates under permutation-equivariant coupling. Consider the continuous-time vertical state dynamics

$\dot{h}_v(t, c) = A_v h_v(t, c)$, (12)

where $A_v \in \mathbb{R}^{d \times d}$. The discretized update with step size $\Delta > 0$ under ZOH is

$h_v[t+1, c] = \bar{A}_v h_v[t, c], \quad \bar{A}_v = e^{\Delta A_v}$. (13)

Theorem 3 (Discrete Stability under ZOH). If the continuous-time matrix $A_v$ is Hurwitz, i.e., $\Re(\lambda_i(A_v)) < 0$ for all $i$, then for any $\Delta > 0$ the discretized matrix $\bar{A}_v = e^{\Delta A_v}$ satisfies $\rho(\bar{A}_v) < 1$, where $\rho(\cdot)$ denotes the spectral radius.

Under the canonical permutation-equivariant coupling

$M = \alpha I_C + \beta \mathbf{1}\mathbf{1}^{\top}$, (14)

the full state update matrix decomposes into two invariant subspaces:

1. The zero-sum subspace (orthogonal to $\mathbf{1}$),

2. The span of $\mathbf{1}$.

The eigenvalues along these subspaces are

$\lambda_{\mathrm{diff}} = \alpha, \quad \lambda_{\mathrm{mean}} = \alpha + C\beta$. (15)

Hence stability requires

$|\alpha| < 1, \quad |\alpha + C\beta| < 1$. (16)

Thus, permutation symmetry reduces the stability analysis to two scalar constraints corresponding to the mean and difference modes. Importantly, symmetry does not guarantee stability per se.
Rather, it constrains the dynamics to evolve within a two-mode invariant decomposition, reducing structural complexity and simplifying both stability and optimization analysis. Taken together, the preceding analyses establish that permutation symmetry uniquely constrains admissible linear inter-variable dynamics to the canonical form $\alpha I_C + \beta \mathbf{1}\mathbf{1}^{\top}$. This characterization reveals that equivariant coupling necessarily decomposes into local self-dynamics and a global interaction term. In the next section, we construct a state-space realization that adheres to this canonical structure, yielding a symmetry-consistent and computationally efficient formulation of inter-variable dynamics.

Figure 2: Overview of the proposed Variable-Invariant 2D Mamba. Left: Variable-Invariant 2D SSM with global coupling. A permutation-invariant set aggregator enforces invariance to variable ordering, after which the 2D state space model captures jointly coupled dynamics along temporal and variable dimensions. The globally coupled state transitions enable structured long-range dependency modeling while preserving symmetry across variables. Right: Variable-Invariant 2D Mamba architecture. The framework decomposes representation learning into complementary long-term, short-term, and spectral components. These components are fused to form a unified representation that simultaneously captures global temporal dynamics, local variations, and frequency-domain structure under a variable-invariant formulation.

IV-B Variable-Invariant 2D SSM

IV-B1 Global Interaction Field Formulation

Let $z(t,:) = [z(t,1), \dots, z(t,C)] \in \mathbb{R}^{C \times d_z}$ denote the ensemble of a variable-indexed representation at time $t$.
To eliminate the artificial ordering imposed by sequential scanning ($c \to c+1$), we introduce a global interaction field $\psi(t)$, derived via a permutation-invariant aggregation:

$\psi(t) := \phi(\{W_v z(t,c)\}_{c=1}^{C})$, (17)

where $W_v$ is a learnable projection matrix and $\phi(\cdot)$ is a permutation-invariant set function (e.g., mean, sum, or attention-based pooling) such that $\phi(\{z_c\}) = \phi(\{z_{\pi(c)}\})$ for any permutation $\pi \in S_C$.

Proposition 1 (Permutation invariance of the pooled summary). Let $\pi$ be any permutation of $\{1, \dots, C\}$ and define $\psi_\pi(t) := \phi(\{W_v z(t, \pi(c))\}_{c=1}^{C})$. Assume (i) $W_v$ is variable-shared (independent of $c$), and (ii) $\phi$ is permutation-invariant on multisets. Then $\psi_\pi(t) = \psi(t)$ for all $t$.

This field $\psi(t)$ serves as a global descriptor that mediates information exchange across variables without imposing any sequential ordering. Once $\psi(t)$ is computed via a permutation-invariant aggregation, the subsequent state updates for different variables can be executed in parallel across the vertical axis. The full proof is provided in Appendix B-A.

IV-B2 Coupled Continuous-Time Dynamics

We redefine the 2D state-space dynamics by coupling the temporal (horizontal) and variable (vertical) states through this global field. Unlike the Roesser model, which defines evolution along both axes ($t$ and $c$), our model evolves strictly along the temporal axis $t$, while the variable axis is governed by instantaneous global coupling:

$\frac{\partial h_h(t,c)}{\partial t} = A_h h_h(t,c) + A_{h\psi} \psi(t) + B_h x(t,c)$, (18)

$\frac{\partial h_v(t,c)}{\partial t} = A_v h_v(t,c) + A_{v\psi} \psi(t) + A_{vh} h_h(t,c) + B_v x(t,c)$. (19)

Here, $A_{h\psi}$ and $A_{v\psi}$ are coupling matrices that inject global context into local dynamics. Crucially, the derivative $\partial / \partial c$ is removed.
Each variable $c$ evolves according to its own local history ($h_h, h_v$) and the global context $\psi(t)$, ensuring that the update rule is identical and exchangeable for all $c$.

Proposition 2. Assume $A_h, A_v, A_{h\psi}, A_{v\psi}, A_{vh}, B_h, B_v$ are variable-shared (independent of $c$), and let $\psi(t)$ be defined as above with a permutation-invariant $\phi$ (Prop. 1). For any permutation $\pi$ of $\{1, \dots, C\}$, consider the permuted inputs $x^\pi(t,c) := x(t, \pi(c))$ and permuted initial states $h^\pi(0,c) := h(0, \pi(c))$. Then the unique solutions to Eqs. 18-19 satisfy $h_h^\pi(t,c) = h_h(t, \pi(c))$ and $h_v^\pi(t,c) = h_v(t, \pi(c))$ for all $t, c$, i.e., the dynamics are permutation-equivariant with respect to the variable axis. With variable-shared parameters, the permuted system is a mere relabeling of indices; hence, the solutions are correspondingly permuted. The full proof is provided in Appendix B-B.

IV-B3 Discrete Realization via Parallel Scanning

Discretizing the system using the ZOH method with sampling step $\Delta > 0$, we obtain the following recurrence:

$h_h[t,c] = \bar{A}_h h_h[t-1,c] + \bar{B}_h^{\psi} \psi[t] + \bar{B}_h^{x} x[t,c]$, (20)

$h_v[t,c] = \bar{A}_v h_v[t-1,c] + \bar{B}_v^{\psi} \psi[t] + \bar{A}_{vh} h_h[t-1,c] + \bar{B}_v^{x} x[t,c]$. (21)

This formulation fundamentally alters the computational graph. Detailed derivation and convolution formulation can be found in Appendix A.

Data-Dependent Parameters. Recent SSM advances (e.g., Mamba) emphasize input-conditioned dynamics, where transition/measurement vectors and step sizes adapt to the current input. Furthermore, models need to adaptively learn the importance of temporal variations based on the data. We follow this principle and make the discrete SSM parameters data-dependent:

$B[t,c] = f_\theta^{B}(x[t,c]), \quad C[t,c] = f_\theta^{C}(x[t,c]), \quad \Delta[t] = f_\theta^{\delta}(x[t,:])$. (22)

Here, $f_\theta$ is a pointwise map applied per variable and time; the outputs have the same shape as the corresponding SSM vectors.

2D Selective Scan. Although our model is a 2D SSM, it exploits the variable-invariant construction to avoid an explicit vertical scan. At each time $t$, we compute the permutation-invariant summary $\psi[t]$ and then perform a hybrid scan:

• a 1D selective scan along the time axis (causal, efficient, input-conditioned via $\Delta[t]$, $B[t,c]$, $C[t,c]$),

• coupled with global pooling on the variable axis to transmit cross-variable information.

Note that while conventional 2D SSMs require a nested loop of complexity $\mathcal{O}(T \times C)$ due to sequential dependencies, our model decouples the variable axis. The global term $\psi[t]$ is computed via parallel reduction, allowing the temporal scan to proceed independently for each variable. This reduces the effective dependency depth along the variable axis to $\mathcal{O}(1)$.

Algorithm 1: Variable-Invariant 2D SSM (Forward)
Input: $X \in \mathbb{R}^{C \times T}$ ($C$ variables over $T$ steps); $\phi$ (permutation-invariant aggregator)
Output: $Y$
Initialize $h_h$, $h_v$
for $t \leftarrow 1$ to $T$ do  // single temporal scan
  $\psi[t] \leftarrow \phi(\{W_v\, z[t-1,c]\}_{c=1}^{C})$
  for $c \leftarrow 1$ to $C$ do
    $h_h[t,c] \leftarrow \bar{A}_h h_h[t-1,c] + \bar{B}_h^{x} X[t,c] + \bar{B}_h^{\psi} \psi[t]$
    $h_v[t,c] \leftarrow \bar{A}_v h_v[t-1,c] + \bar{A}_{vh} h_h[t,c] + \bar{B}_v^{x} X[t,c] + \bar{B}_v^{\psi} \psi[t]$
  end for
end for
$Y \leftarrow C\,[h_h, h_v]$
return $Y$

IV-C Variable-Invariant 2D Mamba Architecture

Building on the theoretical foundation of the VI 2D SSM, we propose the Variable-Invariant 2D Mamba architecture. To capture the multi-scale nature of time series, we integrate three complementary pathways: multi-scale temporal pathways (long- and short-term) and a spectral-domain pathway.

IV-C1 Multi-Scale Temporal Pathways

Time-series data inherently exhibit structure at multiple scales.
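Algorithm 1 can be sketched in a few lines of NumPy. This is a dense, non-selective toy version (the aggregator is a mean over the projected vertical state, and all parameter names and shapes are illustrative assumptions rather than the released implementation). Because every parameter is variable-shared, the inner loop over $c$ collapses into one vectorized update per time step, and permuting the input variables simply permutes the output rows:

```python
import numpy as np

def vi_2d_ssm_forward(X, A_h, A_v, A_vh, Bx_h, Bx_v, Bp_h, Bp_v, W_v, C_out):
    """Dense sketch of Algorithm 1 (non-selective, mean pooling).

    X: (C, T) input; states h_h, h_v: (C, d). All parameters are
    variable-shared, so the per-variable loop becomes one vectorized
    update per step: O(1) dependency depth along the variable axis."""
    C, T = X.shape
    d = A_h.shape[0]
    h_h = np.zeros((C, d))
    h_v = np.zeros((C, d))
    Y = np.zeros((C, T, C_out.shape[0]))
    for t in range(T):  # single temporal scan
        # permutation-invariant global field from the previous vertical state
        psi = (h_v @ W_v.T).mean(axis=0)                          # (d,)
        h_h = h_h @ A_h.T + np.outer(X[:, t], Bx_h) + Bp_h @ psi
        h_v = h_v @ A_v.T + h_h @ A_vh.T + np.outer(X[:, t], Bx_v) + Bp_v @ psi
        Y[:, t] = np.concatenate([h_h, h_v], axis=1) @ C_out.T
    return Y

# permuting the variables permutes the outputs (equivariance)
rng = np.random.default_rng(2)
d, C, T = 3, 5, 7
params = [0.3 * rng.normal(size=s) for s in
          [(d, d), (d, d), (d, d), (d,), (d,), (d, d), (d, d), (d, d), (2, 2 * d)]]
X = rng.normal(size=(C, T))
perm = rng.permutation(C)
Y = vi_2d_ssm_forward(X, *params)
Y_perm = vi_2d_ssm_forward(X[perm], *params)
assert np.allclose(Y_perm, Y[perm])
```

Note there is no scan over $c$: the only sequential loop is over $t$, matching the dependency-depth claim above.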
Depending on the task and domain, informative signals may range from local, fine-grained dependencies between adjacent time points to long-term, global trends spanning extended horizons. To capture such patterns, recent research has emphasized multi-scale temporal modeling as a central design principle [4, 17]. In our framework, multi-scale dynamics are realized by adjusting the discretization parameter Δ during SSM discretization. Following prior work [4, 17], Δ can be interpreted as the effective time resolution or sampling rate in continuous-time formulations. We instantiate two parallel VI 2D SSM blocks with distinct discretization step sizes:
• Long-term Branch ($\Delta_l$): a coarse-grained branch (large Δ) that filters out high-frequency noise to capture global trends and seasonalities.
• Short-term Branch ($\Delta_s$): a fine-grained branch (small Δ) focused on capturing rapid fluctuations and transient events.
This design mimics a continuous wavelet transform, resolving features at different temporal resolutions. Larger Δ values correspond to slower state updates, effectively capturing long-range trends, whereas smaller Δ values yield faster dynamics suited for modeling rapid fluctuations.

IV-C2 Spectral-Domain Pathway
For many time-series modalities, the frequency domain plays a central role in characterizing the underlying signal [38]. Frequency-domain features complement time-domain representations: low-frequency components often capture global, slowly varying dynamics, whereas high-frequency components reveal localized, transient variations. To leverage this complementary structure, we introduce a frequency-domain pathway into our architecture. Concretely, the input sequence is transformed into the frequency domain using the discrete Fourier transform along the temporal axis. For real-valued signals, only the non-redundant spectral components are retained, and the real and imaginary parts are concatenated to form the frequency representation.
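The construction of this frequency representation can be sketched as follows (a sketch assuming a (C, T) input; the exact way the one-sided spectrum is trimmed or padded to match the original length is an illustrative assumption, not the paper's stated procedure):

```python
import numpy as np

def spectral_features(X):
    """Build the frequency-domain representation of a real (C, T) signal.

    DFT along the temporal axis, keep the non-redundant one-sided
    spectrum, concatenate real and imaginary parts, then trim to the
    original length T (the trimming is an illustrative choice so the
    result can be fed to the 2D SSM in place of the temporal input).
    """
    C, T = X.shape
    F = np.fft.rfft(X, axis=1)                        # (C, T//2 + 1), complex
    feats = np.concatenate([F.real, F.imag], axis=1)  # (C, 2*(T//2 + 1))
    return feats[:, :T]
```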
This yields a tensor with the same dimensionality as the original input, which is then processed by the proposed 2D SSM operating along the frequency axis instead of the temporal axis.

Interpretation of the SSM in the Frequency Domain. In this setting, the semantics of the SSM differ from the temporal case. Rather than modeling evolution over time, the states evolve across frequency bands, capturing dependencies between variables across spectral ranges. Importantly, the update operates as a continuous-time SSM applied to the frequency axis, where the discretization step size $\Delta_f$ governs how densely the underlying spectral ODE is sampled. Unlike time, frequency does not exhibit causality; it spans from low to high values, with most signal energy concentrated at low frequencies and increasingly sparse but informative oscillatory patterns at higher frequencies. This spectral imbalance imposes a numerical challenge. If $\Delta_f$ is large, the discretization becomes too coarse; the exponential term in the SSM update inaccurately scales higher-frequency modes, causing instability, aliasing, and over-attenuation of small but critical spectral components. Setting $\Delta_f$ sufficiently small stabilizes the SSM dynamics, suppresses aliasing, and increases the effective resolution in the high-frequency range, ensuring that transient or oscillatory structure is preserved despite low energy levels. (In practice, we find that values of $\Delta_f$ within [0.001, 0.01] offer a robust trade-off between spectral fidelity and numerical stability.)

IV-C3 Adaptive Gating and Fusion
Our final architecture integrates information from three complementary pathways: the long-term temporal branch ($\Delta_l$), the short-term temporal branch ($\Delta_s$), and the frequency-domain branch ($\Delta_f$).
$h_{\mathrm{fused}} = \mathrm{Gate}(h_{\mathrm{long}},\, h_{\mathrm{short}},\, h_{\mathrm{spec}}).$

This learnable gating mechanism allows the model to dynamically weight the importance of trend, transient, and spectral features depending on the input instance, providing a robust representation for downstream tasks.

TABLE I: Experimental results of long-term time-series forecasting. Each cell reports MSE / MAE. Best results are highlighted in bold, and second-best results are underlined.

Dataset | Ours | Chimera | TimePro | Simba | TCN | iTransformer | RLinear | PatchTST | Crossformer | TiDE | TimesNet
ETTm1 | 0.345/0.378 | 0.355/0.381 | 0.391/0.400 | 0.383/0.396 | 0.351/0.381 | 0.407/0.410 | 0.414/0.407 | 0.387/0.400 | 0.513/0.496 | 0.419/0.419 | 0.400/0.406
ETTm2 | 0.258/0.317 | 0.252/0.317 | 0.281/0.326 | 0.271/0.327 | 0.253/0.314 | 0.288/0.332 | 0.286/0.327 | 0.281/0.326 | 0.757/0.610 | 0.358/0.404 | 0.291/0.333
ETTh1 | 0.397/0.419 | 0.408/0.425 | 0.438/0.438 | 0.441/0.432 | 0.404/0.420 | 0.454/0.447 | 0.446/0.434 | 0.469/0.454 | 0.529/0.522 | 0.541/0.507 | 0.458/0.450
ETTh2 | 0.313/0.372 | 0.321/0.377 | 0.377/0.403 | 0.361/0.391 | 0.322/0.379 | 0.383/0.407 | 0.374/0.398 | 0.387/0.407 | 0.942/0.684 | 0.611/0.550 | 0.414/0.427
ECL | 0.158/0.254 | 0.154/0.249 | 0.169/0.262 | 0.185/0.274 | 0.156/0.253 | 0.178/0.270 | 0.219/0.298 | 0.205/0.290 | 0.244/0.334 | 0.251/0.344 | 0.192/0.295
Exchange | 0.311/0.375 | 0.311/0.358 | 0.352/0.399 | - | 0.302/0.366 | 0.360/0.403 | 0.378/0.417 | 0.367/0.404 | 0.940/0.707 | 0.370/0.413 | 0.416/0.443
Traffic | 0.398/0.276 | 0.403/0.286 | - | 0.493/0.291 | 0.398/0.270 | 0.428/0.282 | 0.626/0.378 | 0.481/0.304 | 0.550/0.304 | 0.760/0.473 | 0.620/0.336
Weather | 0.227/0.264 | 0.219/0.258 | 0.251/0.276 | 0.255/0.280 | 0.224/0.264 | 0.258/0.278 | 0.272/0.291 | 0.259/0.281 | 0.259/0.315 | 0.271/0.320 | 0.259/0.287
1st Count | 4/3 | 3/3 | 0/0 | 0/0 | 2/2 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0 | 0/0

V Experiments
To evaluate the validity and generality of the proposed method, we conducted experiments across four representative tasks: long-term forecasting, short-term forecasting, classification, and anomaly detection.
The proposed method was extensively compared against state-of-the-art baselines, including Transformer-based approaches [28, 35, 55] and Mamba-based methods [37, 4]. Detailed experimental settings and additional results are provided in Appendix C and on GitHub (https://github.com/ku-milab/Variable-Invariant-2D-SSM).

V-A Long-term Forecasting
Settings. We conducted experiments on eight benchmark datasets commonly used in time-series forecasting, including the Weather, Traffic, Electricity, and Exchange datasets and the four ETT datasets [56]. Baseline results were obtained from the corresponding literature to ensure consistency of comparison, and all models were evaluated using mean squared error (MSE) and mean absolute error (MAE). For brevity, we report the average results in the main text, while detailed results and extended baseline comparisons are provided in Table X.

Results. Table I highlights the effectiveness of the proposed method. It achieves the lowest MSE on four of the eight datasets and the lowest MAE on three, yielding the best overall performance among all baselines. Compared to Transformer-based methods, our approach consistently outperforms them across all datasets, and it also surpasses 1D SSM variants [37, 31]. Against Chimera [4], a 2D SSM model, the proposed method achieves superior results on most datasets and remains competitive on the others, further demonstrating its robustness and effectiveness.

TABLE II: Experimental results of short-term time-series forecasting on the M4 dataset. Best results are highlighted in bold, and second-best results are underlined.

Weighted Avg. | Ours | Chimera | ModernTCN | PatchTST | TimesNet | N-HiTS | N-BEATS | ETSformer | LightTS | DLinear | FEDformer
SMAPE | 11.686 | 11.618 | 11.698 | 11.807 | 11.829 | 11.927 | 11.851 | 14.718 | 13.525 | 13.639 | 12.840
MASE | 1.549 | 1.528 | 1.556 | 1.590 | 1.585 | 1.613 | 1.599 | 2.408 | 2.111 | 2.095 | 1.701
OWA | 0.832 | 0.827 | 0.838 | 0.851 | 0.851 | 0.861 | 0.855 | 1.172 | 1.051 | 1.051 | 0.918

V-B Short-term Forecasting
Settings.
Short-term forecasting was evaluated on the widely adopted M4 dataset [32]. Performance was assessed using three standard metrics: Symmetric Mean Absolute Percentage Error (SMAPE), Mean Absolute Scaled Error (MASE), and Overall Weighted Average (OWA). Baseline results were drawn directly from the original literature to ensure comparability. For clarity, we report only the weighted average results in the main text, while the complete set of results is deferred to Table XI.

Results. Table II reports the results on the M4 dataset. While the proposed method did not obtain the best score, it achieved the second-best performance, indicating its ability to effectively capture short-term patterns. Its slightly lower performance relative to Chimera is attributable to the dataset's single-channel setting, where the advantages of variable-invariant modeling are less pronounced compared to Chimera's inherent 2D formulation. Nonetheless, our method surpasses the other state-of-the-art baselines, confirming its competitiveness in short-term forecasting.

Figure 3: Results of classification (accuracy) and anomaly detection (F1 score).

V-C Classification & Anomaly Detection
Settings. For classification and anomaly detection, we benchmarked the proposed method on ten datasets from the UEA archive [2] and five widely adopted anomaly detection datasets: SMD [41], SWaT [33], PSM [1], MSL, and SMAP [14]. Classification performance was assessed using overall accuracy, while anomaly detection was evaluated with precision, recall, and F1 score. Figure 3 presents the aggregated results, reporting average accuracy (for classification) and average F1 score (for anomaly detection). Comprehensive results with detailed per-dataset comparisons are provided in Tables XII and XIII.

Results. Figure 3 summarizes the overall results.
Our proposed method achieves the best performance in anomaly detection, and while it falls slightly short of Chimera in classification, it remains superior to the other baselines. This difference can be attributed to the nature of the tasks: anomaly detection strongly benefits from permutation-invariant modeling, as rare deviations often manifest across variable interactions without being tied to a fixed ordering. In contrast, classification tasks in the UEA datasets often involve structured temporal features with relatively limited variable dimensionality, where Chimera's explicit sequential modeling of the variable axis may still provide an advantage. Although its classification accuracy is marginally lower than Chimera's, our method delivers competitive results at substantially lower computational cost, making it an attractive and efficient alternative, particularly given its clear gains in anomaly detection.

Figure 4: Efficiency analysis with respect to the number of variables. The left y-axis reports the training time in sec/epoch, while the right y-axis reports throughput in transactions per second.

VI Analysis
To evaluate the superiority and efficiency of the proposed method, we conducted three complementary analyses. Specifically, we analyzed (i) how efficient the proposed 2D SSM is compared to existing 2D SSMs, (ii) how well our proposed method performs in case studies where inter-variable relationships are important, and (iii) how each module of the proposed Mamba contributes to the overall performance.

VI-A Efficiency of the Proposed 2D SSM
To assess the computational efficiency of our formulation, we compared it against a 1D SSM and the 2D SSM of [4]. For a fair comparison, all models shared identical configurations, differing only in the SSM component. We measured training efficiency using sec/epoch (lower is better) and throughput (higher is better), while varying the number of variables at a fixed sequence length of 256.
As shown in Figure 4, our method achieves efficiency close to that of the 1D SSM, with only a slight overhead attributable to global variable aggregation. By contrast, the conventional 2D SSM exhibits a sharp decline in efficiency as the number of variables increases, highlighting the scalability bottleneck introduced by sequential scans. These results confirm that the proposed formulation retains the efficiency properties discussed in Section IV-B while eliminating the inefficiencies of conventional 2D SSMs. To clarify computational efficiency, we provide a quantitative comparison of FLOPs and peak GPU memory across representative transformer- and SSM-based baselines using their official implementations. Our model requires only a single temporal SSM scan plus a parallel pooling step, yielding substantially lower FLOPs than methods that recursively evolve states along both axes. A full derivation and empirical results are reported in Section VI-D.

TABLE III: Controlled simulation for variable-ordering robustness.

Setting | sec/epoch | MAE (std) | MAPE (std)
Ours | 6.1s | 0.093 (0.000) | 2.007 (0.000)
+ Permutation | - | 0.093 (0.000) | 2.063 (0.050)
2D-SSM | 23.4s | 0.093 (0.000) | 1.954 (0.000)
+ Permutation | - | 0.094 (0.001) | 2.341 (0.130)

TABLE IV: Controlled simulation for C-scaling.

Method | #C | sec/epoch | MAE | MAPE
Ours | 16 | 6.2s | 0.106 | 2.465
 | 32 | 6.1s | 0.098 | 2.146
 | 64 | 6.1s | 0.093 | 2.007
 | 128 | 6.4s | 0.096 | 2.313
 | 256 | 6.3s | 0.097 | 2.663
2D-SSM | 16 | 13.4s | 0.105 | 2.603
 | 32 | 19.4s | 0.098 | 2.101
 | 64 | 23.4s | 0.093 | 1.954
 | 128 | 51.0s | 0.096 | 2.489
 | 256 | 88.8s | 0.097 | 2.779

VI-B Controlled Simulation
To evaluate accuracy in settings where inter-variable relationships are critical, we constructed a controlled simulation using a VAR(1) process defined on a Watts–Strogatz small-world graph [43]. The dataset comprised 64 variables over 1000 time steps, and forecasting performance was assessed using MAE and MAPE (Table III).
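A data-generating process of this kind can be sketched as follows (a sketch only: the graph parameters, coupling strength, and noise scale are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

def watts_strogatz_adj(C, k=4, p=0.1, rng=None):
    """Adjacency of a Watts-Strogatz small-world graph: a ring lattice
    with k nearest neighbours whose edges are rewired with probability p."""
    rng = np.random.default_rng(0) if rng is None else rng
    A = np.zeros((C, C))
    for i in range(C):
        for j in range(1, k // 2 + 1):
            t = (i + j) % C
            if rng.random() < p:              # rewire this edge at random
                t = int(rng.integers(C))
            if t != i:
                A[i, t] = A[t, i] = 1.0
    return A

def simulate_var1(C=64, T=1000, coupling=0.1, noise=0.1, seed=0):
    """VAR(1) process x[t] = W x[t-1] + eps on a small-world graph,
    with W rescaled to spectral radius 0.95 so the process is stable."""
    rng = np.random.default_rng(seed)
    W = 0.5 * np.eye(C) + coupling * watts_strogatz_adj(C, rng=rng)
    W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))
    X = np.zeros((T, C))
    for t in range(1, T):
        X[t] = W @ X[t - 1] + noise * rng.normal(size=C)
    return X
```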
To further investigate scalability with respect to the number of variables, we performed a controlled C-scaling experiment by varying C ∈ {16, 32, 64, 128, 256} and measuring both performance and computational efficiency (Table IV). Under the baseline setting (no permutation), the proposed method and a conventional 2D-SSM achieved comparable MAE/MAPE, indicating that both formulations can capture the underlying dynamics when a favorable variable ordering is provided. Nevertheless, our method was approximately 3.8× faster per epoch, reflecting the efficiency gain from removing the variable-axis scan. To assess robustness to variable ordering, we repeated training and evaluation over 10 random permutations of the variable indices. The proposed method exhibited negligible performance drift across permutations (low variance), consistent with its permutation-invariant design via the global summary ψ. In contrast, the conventional 2D-SSM showed significantly larger performance variance, revealing sensitivity to the imposed ordering. This suggests that while the 2D-SSM can achieve competitive accuracy under favorable indexings, its performance degrades and its variability increases under re-orderings. Beyond permutation robustness, the proposed formulation also exhibits favorable scaling behavior as the number of variables increases. While conventional 2D-SSMs require a recursive pass along the variable axis, our model replaces this sequence of updates with a single parallel pooling step, whose cost scales linearly in theory but executes as a fully batched GPU operation in practice. Both methods achieved comparable MAE/MAPE at all dimensionalities, confirming that eliminating variable-axis recurrence does not compromise expressiveness. However, while the runtime of the standard 2D-SSM grew almost linearly with C, our method maintained nearly constant training time, demonstrating that the benefit of variable-invariant pooling intensifies as dimensionality grows.
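The ordering sensitivity examined here can be reproduced in miniature (a toy sketch with scalar states and illustrative coefficients, contrasting a pooled update with an ordered variable-axis scan; neither is the full model):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 20))                  # C=6 variables, T=20 steps

def pooled_step(h, x, a=0.9, b=1.0, g=0.3):
    """Variable-invariant update: local dynamics plus a global mean summary."""
    return a * h + b * x + g * np.mean(h)

def scan_step(h, x, a=0.9, b=1.0, g=0.3):
    """Ordered variable-axis recurrence: variable c depends on variable c-1."""
    out = np.empty_like(h)
    prev = 0.0
    for c in range(len(h)):
        out[c] = a * h[c] + b * x[c] + g * prev
        prev = out[c]
    return out

def run(step, X):
    """Roll either update over the time axis, returning the final state."""
    h = np.zeros(X.shape[0])
    for t in range(X.shape[1]):
        h = step(h, X[:, t])
    return h

perm = np.array([3, 0, 1, 5, 4, 2])
# the pooled update commutes with variable permutation ...
assert np.allclose(run(pooled_step, X[perm]), run(pooled_step, X)[perm])
# ... while the ordered scan generally does not
assert not np.allclose(run(scan_step, X[perm]), run(scan_step, X)[perm])
```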
This highlights that the proposed formulation is particularly advantageous in high-dimensional multivariate systems where cross-variable interactions coexist with scalability demands.

TABLE V: Ablation study on ETTh1 and ETTm1. Case I: branch removal; Case II: different $\Delta_f$ ranges in the frequency domain; Case III: aggregation functions for ψ.

Setting | ETTh1 MSE | ETTh1 MAE | ETTm1 MSE | ETTm1 MAE
I: w/o short | 0.433 | 0.440 | 0.349 | 0.381
I: w/o long | 0.431 | 0.439 | 0.350 | 0.381
I: w/o freq. | 0.425 | 0.437 | 0.348 | 0.381
II ($\Delta_f$): [0.1, 0.5] | 0.406 | 0.426 | 0.349 | 0.380
II ($\Delta_f$): [0.01, 0.05] | 0.404 | 0.423 | 0.349 | 0.380
III (ψ): Mean | 0.397 | 0.419 | 0.345 | 0.378
III (ψ): Attention | 0.409 | 0.426 | 0.347 | 0.381
Ours | 0.397 | 0.419 | 0.345 | 0.378

VI-C Ablation Study
In Table V, we systematically evaluate the contribution of each component of the proposed Mamba model through three ablation cases: (Case I) analysis of the individual branches, (Case II) assessment of frequency-domain resolution under varying $\Delta_f$ settings, and (Case III) comparison of permutation-invariant aggregation functions for ψ. In Case I, removing any branch led to performance degradation, confirming that all modules contribute meaningfully. Among them, the short-term and long-term pathways have the most significant impact, while the frequency branch plays a complementary role. Case II further highlights this complementarity: although the best results are obtained when all branches are combined, performance gradually improves as $\Delta_f$ decreases, indicating that finer resolution in the high-frequency region enhances modeling capacity. In Case III, mean pooling consistently outperforms attention-based pooling across both datasets. This supports our design choice: forecasting tasks benefit more from stable, low-variance, parameter-free aggregation with minimal computational overhead. This ablation confirms that the default choice of mean pooling for forecasting tasks is both empirically justified and computationally optimal.
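The two aggregator choices for ψ compared in this ablation can be sketched as follows (the attention variant is a minimal dot-product form with a single query vector w, an illustrative stand-in rather than the paper's exact attention pooling):

```python
import numpy as np

def mean_pool(Z):
    """Parameter-free aggregator: psi is the mean over the variable axis."""
    return Z.mean(axis=0)

def attention_pool(Z, w):
    """Minimal dot-product attention pooling with a single query w:
    softmax over per-variable scores, then a weighted sum of rows."""
    scores = Z @ w                             # (C,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ Z                           # (d,)

Z = np.random.default_rng(0).normal(size=(8, 4))   # C=8 variables, d=4
perm = np.random.default_rng(1).permutation(8)
w = np.ones(4)
# both choices are permutation-invariant over the variable axis, but
# mean pooling needs no parameters and no extra matrix products
assert np.allclose(mean_pool(Z[perm]), mean_pool(Z))
assert np.allclose(attention_pool(Z[perm], w), attention_pool(Z, w))
```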
A detailed discussion of task-dependent aggregation choices and their computational properties is provided in Appendix C-B.

TABLE VI: Measured peak GPU memory and forward FLOPs.

Method | Memory (MB) | FLOPs (G)
Ours | 546.62 | 11.99
TimePro | 164.35 | 35.04
S-Mamba | 270.76 | 96.13
PatchTST | 675.26 | 85.64
iTransformer | 1155.14 | 73.20

VI-D Computational Efficiency Analysis
We provide a quantitative comparison of computational cost across representative transformer- and SSM-based multivariate forecasting models in Table VI. All methods were evaluated under the same conditions: ECL dataset, input window length L = 96, prediction window H = 720, and batch size 16. We report measurements only for models with official code enabling reproducible FLOPs and peak memory profiling.

A conventional 2D state space model processes dynamics along both the temporal and variable axes. Let L be the sequence length, C the number of variables, and d the hidden dimension. The standard cost is:

$\underbrace{\mathcal{O}(LCd)}_{\text{temporal SSM scan}} + \underbrace{\mathcal{O}(LCd)}_{\text{variable-axis recursion}} = \mathcal{O}(2LCd).$ (23)

The second term is sequential in C, restricting GPU parallelism. We replace the recursive variable update with a single permutation-invariant pooled descriptor:

$\underbrace{\mathcal{O}(LCd)}_{\text{temporal SSM scan}} + \underbrace{\mathcal{O}(Cd)}_{\text{global pooling}} = \mathcal{O}(LCd) + \mathcal{O}(Cd).$ (24)

Crucially, the additional $\mathcal{O}(Cd)$ term is computed in one fully parallel GPU kernel rather than a sequential pass. This reduces the practical computation to a single directional SSM scan, eliminating any dependence on variable ordering or recursive updates. Because our model does not evolve states along the variable axis, it avoids the second $\mathcal{O}(LCd)$ recurrence found in SSM baselines (TimePro, S-Mamba). The measured FLOPs are therefore markedly lower, despite the additional frequency-domain modeling.
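The operation counts in Eqs. (23) and (24) can be checked with a small sketch (the sizes are illustrative; the practical advantage comes from the $\mathcal{O}(Cd)$ term being a single parallel reduction rather than a length-C sequential pass):

```python
def ops_conventional_2d(L, C, d):
    """Eq. (23): temporal SSM scan plus sequential variable-axis recursion."""
    return L * C * d + L * C * d

def ops_variable_invariant(L, C, d):
    """Eq. (24): temporal SSM scan plus one global pooled reduction."""
    return L * C * d + C * d

# illustrative sizes: a long window and a few hundred variables
L, C, d = 96, 321, 128
ratio = ops_conventional_2d(L, C, d) / ops_variable_invariant(L, C, d)
# the raw counts differ by a factor approaching 2 for large L; the
# dependency depth, however, drops from O(C) to O(1) per time step
```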
VII Conclusion
In this work, we introduced a variable-invariant two-dimensional state space model for multivariate time series. By replacing ordered recurrence along the variable axis with a global permutation-invariant aggregation, the proposed VI 2D SSM eliminates artificial ordering dependence while preserving global inter-variable coupling and enabling full parallelism. Beyond architectural design, we provided a theoretical analysis of symmetry-constrained inter-variable dynamics. We showed that permutation equivariance uniquely determines the canonical form of admissible linear coupling and that the proposed formulation provides a parameterization consistent with this class through local self-dynamics and global pooled interaction. This symmetry-induced structure reduces computational dependency depth and simplifies stability analysis via a two-mode invariant decomposition. Extensive experiments on forecasting, classification, and anomaly detection benchmarks demonstrate consistent improvements over state-of-the-art baselines in both accuracy and efficiency. These results highlight the proposed framework as a scalable, symmetry-grounded, and theoretically justified approach for modeling complex temporal–spectral dynamics in multivariate time series.

Acknowledgments
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2019-I190079, Artificial Intelligence Graduate School Program (Korea University), No. RS-2024-00457882, National AI Research Lab Project, and No. RS-2022-I220959 ((Part 2) Few-Shot Learning of Causal Inference in Vision and Language for Decision Making)).

References
[1] A. Abdulaal, Z. Liu, and T. Lancewicki (2021) Practical approach to asynchronous multivariate time series anomaly detection and localization. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, p. 2485–2494. Cited by: §C-A, §V-C. [2] A.
Bagnall, H. A. Dau, J. Lines, M. Flynn, J. Large, A. Bostrom, P. Southam, and E. Keogh (2018) The uea multivariate time series classification archive, 2018. arXiv preprint arXiv:1811.00075. Cited by: §C-A, §V-C. [3] E. Baron, I. Zimerman, and L. Wolf (2024) A 2-dimensional state space layer for spatial inductive bias. In The Twelfth International Conference on Learning Representations, Cited by: §I, §I-A. [4] A. Behrouz, M. Santacatterina, and R. Zabih (2024) Chimera: effectively modeling multivariate time series with 2-dimensional state space models. Advances in Neural Information Processing Systems 37, p. 119886–119918. Cited by: Appendix C, §I-A, §I-B, §IV-C1, §IV-C1, §V-A, §V, §VI-A. [5] B. Bloem-Reddy and Y. W. Teh (2020) Probabilistic symmetries and invariant neural networks. Journal of Machine Learning Research 21 (90), p. 1–61. Cited by: §I-B. [6] C. Challu, K. G. Olivares, B. N. Oreshkin, F. G. Ramirez, M. M. Canseco, and A. Dubrawski (2023) Nhits: neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37, p. 6989–6997. Cited by: Appendix C. [7] T. Cohen and M. Welling (2016) Group equivariant convolutional networks. In International conference on machine learning, p. 2990–2999. Cited by: §I-B. [8] A. Das, W. Kong, A. Leach, S. Mathur, R. Sen, and R. Yu (2023) Long-term forecasting with tide: time-series dense encoder. arXiv preprint arXiv:2304.08424. Cited by: Appendix C. [9] J. Franceschi, A. Dieuleveut, and M. Jaggi (2019) Unsupervised scalable representation learning for multivariate time series. Advances in neural information processing systems 32. Cited by: §I, §I-A. [10] A. Gu and T. Dao (2023) Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752. Cited by: §I, §I-A, §I-B. [11] A. Gu, K. Goel, and C. Ré (2021) Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396. 
Cited by: Appendix C, §I, §I-A. [12] Y. He and J. Zhao (2019) Temporal convolutional networks for anomaly detection in time series. In Journal of Physics: Conference Series, Vol. 1213, p. 042050. Cited by: §I-A. [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), p. 1735–1780. Cited by: Appendix C. [14] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, and T. Soderstrom (2018) Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, p. 387–395. Cited by: §C-A, §V-C. [15] S. Jeong, W. Jung, J. Sohn, and H. Suk (2024) Deep geometric learning with monotonicity constraints for alzheimer’s disease progression. IEEE Transactions on Neural Networks and Learning Systems 36 (4), p. 7090–7102. Cited by: §I. [16] S. Jeong, W. Ko, A. W. Mulyadi, and H. Suk (2023) Deep efficient continuous manifold learning for time series modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (1), p. 171–184. Cited by: §I. [17] Y. M. Karadag, S. Kalkan, and I. G. Dino (2025) MS-Mamba: multi-scale mamba for time-series forecasting. arXiv preprint arXiv:2504.07654. Cited by: §IV-C1, §IV-C1. [18] F. D. Keles, P. M. Wijewardena, and C. Hegde (2023) On the computational complexity of self-attention. In International conference on algorithmic learning theory, p. 597–619. Cited by: §I-A. [19] N. Kitaev, Ł. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451. Cited by: Appendix C. [20] R. Kondor and S. Trivedi (2018) On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International conference on machine learning, p. 2747–2755. Cited by: §I-B. [21] S. Kung, B. C. Levy, M. Morf, and T. 
Kailath (1977) New results in 2-D systems theory, part i: 2-d state-space models—realization and the notions of controllability, observability, and minimality. Proceedings of the IEEE 65 (6), p. 945–961. Cited by: §I-A, §I-A. [22] G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, p. 95–104. Cited by: Appendix C, §I-A. [23] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh (2019) Set transformer: a framework for attention-based permutation-invariant neural networks. In International conference on machine learning, p. 3744–3753. Cited by: §I-B. [24] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y. Wang, and X. Yan (2019) Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems 32. Cited by: Appendix C. [25] Z. Li, S. Qi, Y. Li, and Z. Xu (2023) Revisiting long-term time series forecasting: an investigation on linear mapping. arXiv preprint arXiv:2305.10721. Cited by: Appendix C. [26] M. Liu, A. Zeng, M. Chen, Z. Xu, Q. Lai, L. Ma, and Q. Xu (2022) Scinet: time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems 35, p. 5816–5828. Cited by: Appendix C. [27] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dustdar Pyraformer: low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations, Cited by: Appendix C. [28] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long (2023) iTransformer: inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625. Cited by: Appendix C, §V. [29] Y. Liu, H. Wu, J. Wang, and M. Long (2022) Non-stationary transformers: exploring the stationarity in time series forecasting. 
Advances in neural information processing systems 35, p. 9881–9893. Cited by: Appendix C. [30] D. Luo and X. Wang (2024) ModernTCN: a modern pure convolution structure for general time series analysis. In The twelfth international conference on learning representations, p. 1–43. Cited by: Appendix C, §I, §I-A. [31] X. Ma, Z. Ni, S. Xiao, and X. Chen (2025) TimePro: efficient multivariate long-term time series forecasting with variable-and time-aware hyper-state. In Forty-second International Conference on Machine Learning, Cited by: Appendix C, §V-A. [32] S. Makridakis, E. Spiliotis, and V. Assimakopoulos (2018) The M4 competition: results, findings, conclusion and way forward. International Journal of forecasting 34 (4), p. 802–808. Cited by: §C-A, §V-B. [33] A. P. Mathur and N. O. Tippenhauer (2016) SWaT: a water treatment testbed for research and training on ics security. In 2016 international workshop on cyber-physical systems for smart water networks (CySWater), p. 31–36. Cited by: §C-A, §V-C. [34] M. Mudelsee (2010) Climate time series analysis. Atmospheric and 397. Cited by: §I. [35] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam (2022) A time series is worth 64 words: long-term forecasting with transformers. arXiv preprint arXiv:2211.14730. Cited by: Appendix C, §V. [36] B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio (2019) N-beats: neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437. Cited by: Appendix C. [37] B. N. Patro and V. S. Agneeswaran (2024) Simba: simplified mamba-based architecture for vision and multivariate time series. arXiv preprint arXiv:2403.15360. Cited by: Appendix C, §I-A, §I-A, §V-A, §V. [38] X. Qiu, H. Cheng, X. Wu, J. Hu, C. Guo, and B. Yang (2025) A comprehensive survey of deep learning for multivariate time series forecasting: a channel strategy perspective. arXiv preprint arXiv:2502.10721. Cited by: §I, §I-A, §IV-C2. [39] L. Shen, Z. Li, and J. 
Kwok (2020) Timeseries anomaly detection using temporal hierarchical one-class network. Advances in neural information processing systems 33, p. 13016–13026. Cited by: §I, §I-A. [40] Z. Shi (2024) MambaStock: selective state space model for stock prediction. arXiv preprint arXiv:2402.18959. Cited by: §I. [41] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, and D. Pei (2019) Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, p. 2828–2837. Cited by: §C-A, §V-C. [42] Z. Wang, F. Kong, S. Feng, M. Wang, X. Yang, H. Zhao, D. Wang, and Y. Zhang (2025) Is mamba effective for time series forecasting?. Neurocomputing 619, p. 129178. Cited by: §I. [43] D. J. Watts and S. H. Strogatz (1998) Collective dynamics of 'small-world' networks. Nature 393 (June), p. 440–442. Cited by: §VI-B. [44] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun (2023) Transformers in time series: a survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, p. 6778–6786. Cited by: §I. [45] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. Hoi (2022) Etsformer: exponential smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381. Cited by: Appendix C. [46] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long (2022) Timesnet: temporal 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186. Cited by: Appendix C. [47] H. Wu, J. Wu, J. Xu, J. Wang, and M. Long (2022) Flowformer: linearizing transformers with conservation flows. arXiv preprint arXiv:2202.06258. Cited by: Appendix C. [48] H. Wu, J. Xu, J. Wang, and M. Long (2021) Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems 34, p. 22419–22430. Cited by: Appendix C. [49] Y. Xiao, L. Song, S. Huang, J. Wang, S. Song, Y. Ge, X. Li, and Y.
Shan (2024) GrootVL: tree topology is all you need in state space model. arXiv preprint arXiv:2406.02395. Cited by: §I-A. [50] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola (2017) Deep sets. Advances in neural information processing systems 30. Cited by: §I-B. [51] A. Zeng, M. Chen, L. Zhang, and Q. Xu (2023) Are transformers effective for time series forecasting?. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37, p. 11121–11128. Cited by: Appendix C. [52] C. Zeng, Z. Liu, G. Zheng, and L. Kong (2024) CMamba: channel correlation enhanced state space models for multivariate time series forecasting. arXiv preprint arXiv:2406.05316. Cited by: §I-A. [53] J. Zhang, A. T. Nguyen, X. Han, V. Q. Trinh, H. Qin, D. Samaras, and M. S. Hosseini (2025) 2DMamba: efficient state space model for image representation with applications on giga-pixel whole slide image classification. In Proceedings of the Computer Vision and Pattern Recognition Conference, p. 3583–3592. Cited by: §I, §I-A. [54] T. Zhang, Y. Zhang, W. Cao, J. Bian, X. Yi, S. Zheng, and J. Li Less is more: fast multivariate time series forecasting with light sampling-oriented mlp structures. arxiv 2022. arXiv preprint arXiv:2207.01186. Cited by: Appendix C. [55] Y. Zhang and J. Yan (2023) Crossformer: transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The eleventh international conference on learning representations, Cited by: Appendix C, §V. [56] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang (2021) Informer: beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35, p. 11106–11115. Cited by: §C-A, Appendix C, §I, §V-A. [57] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin (2022) Fedformer: frequency enhanced decomposed transformer for long-term series forecasting. 
In International conference on machine learning, p. 27268–27286. Cited by: Appendix C.

Seungwoo Jeong received the B.S. degree in Mathematics and Statistics from Hankuk University of Foreign Studies, Yongin, South Korea, in 2019. He is currently pursuing the Ph.D. degree with the Department of Artificial Intelligence, Korea University, Seoul, South Korea. His current research interests include time-series analysis, brain-computer interfaces, and geometric learning.

Heung-Il Suk is currently a Professor at the Department of Artificial Intelligence at Korea University. He was a Visiting Professor at the Department of Radiology at Duke University between 2022 and 2023. He was awarded a Kakao Faculty Fellowship from Kakao and a Young Researcher Award from the Korean Society for Human Brain Mapping (KHBM) in 2018 and 2019, respectively. His research interests include causal machine/deep learning, explainable AI, biomedical data analysis, and brain-computer interfaces. Dr. Suk serves as an Editorial Board Member for Clinical and Molecular Hepatology (Artificial Intelligence Sector), Electronics, Frontiers in Neuroscience, Frontiers in Radiology (Artificial Intelligence in Radiology), and the International Journal of Imaging Systems and Technology (IJIST), and as a Program Committee member or reviewer for NeurIPS, ICML, ICLR, AAAI, IJCAI, CVPR, MICCAI, AISTATS, etc.

Appendix A Variable-Invariant 2D SSM

A-A Discretization of the 2D SSM

The proposed continuous 2D SSM is given by:

$$\frac{d}{dt}\begin{bmatrix} h_h(t,c) \\ h_v(t,c) \end{bmatrix} = \begin{bmatrix} A_h & 0 \\ A_{vh} & A_v \end{bmatrix}\begin{bmatrix} h_h(t,c) \\ h_v(t,c) \end{bmatrix} + \begin{bmatrix} A_{h\psi} \\ A_{v\psi} \end{bmatrix}\psi(t) + \begin{bmatrix} B_h \\ B_v \end{bmatrix} x(t,c).$$

In matrix form:

$$\mathcal{A} = \begin{bmatrix} A_h & 0 \\ A_{vh} & A_v \end{bmatrix}, \qquad \mathcal{B} = \begin{bmatrix} A_{h\psi} & B_h \\ A_{v\psi} & B_v \end{bmatrix}.$$
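A minimal numerical sketch of this system and its zero-order-hold discretization, using NumPy only; the scalar-mode values for $A_h$, $A_v$, $A_{vh}$, $A_{h\psi}$, $A_{v\psi}$, $B_h$, $B_v$, and $\Delta$ below are purely illustrative, not trained parameters:

```python
import numpy as np

def expm(M, terms=30):
    # Matrix exponential via truncated Taylor series; adequate for small ||M||.
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

# Hypothetical scalar modes: A_h, A_v < 0 makes the coupling matrix Hurwitz.
A_h, A_v, A_vh = -0.5, -1.0, 0.3
A_hpsi, A_vpsi, B_h, B_v = 0.2, 0.1, 1.0, 0.8

A = np.array([[A_h, 0.0], [A_vh, A_v]])       # block lower-triangular coupling
B = np.array([[A_hpsi, B_h], [A_vpsi, B_v]])  # columns: pooled signal, input
dt = 0.1                                      # step size Delta

# ZOH discretization: A_bar = e^{A dt}; because A is invertible here,
# the integral of e^{A tau} over [0, dt] equals A^{-1} (e^{A dt} - I).
A_bar = expm(A * dt)
B_bar = np.linalg.solve(A, A_bar - np.eye(2)) @ B

# Hurwitz continuous modes give a stable discrete system: rho(A_bar) < 1.
assert max(abs(np.linalg.eigvals(A_bar))) < 1.0
```

Since $\mathcal{A}$ is block lower-triangular, its eigenvalues are exactly those of the two scalar modes $A_h$ and $A_v$, which is what reduces the stability analysis to two scalar conditions (cf. Theorem 3).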
The general solution of the continuous state equation is:

$$\mathbf{h}(t) = e^{\mathcal{A}(t-t_0)}\,\mathbf{h}(t_0) + \int_{t_0}^{t} e^{\mathcal{A}(t-\tau)}\,\mathcal{B}\,\mathbf{u}(\tau)\,d\tau,$$

where $\mathbf{h}(t) = \begin{bmatrix} h_h(t,c) \\ h_v(t,c) \end{bmatrix}$ and $\mathbf{u}(t) = \begin{bmatrix} \psi(t) \\ x(t,c) \end{bmatrix}$. Under the zero-order hold (ZOH) assumption, with $t_0 = k\Delta$ and $t = (k+1)\Delta$:

$$\mathbf{h}((k+1)\Delta) = e^{\mathcal{A}\Delta}\,\mathbf{h}(k\Delta) + \int_{k\Delta}^{(k+1)\Delta} e^{\mathcal{A}((k+1)\Delta-\tau)}\,\mathcal{B}\,\mathbf{u}[k]\,d\tau.$$

Since $\mathbf{u}(\tau) = \mathbf{u}[k]$ is constant over the interval due to ZOH:

$$\begin{aligned}
\mathbf{h}[k+1] &= e^{\mathcal{A}\Delta}\,\mathbf{h}[k] + \left(\int_{k\Delta}^{(k+1)\Delta} e^{\mathcal{A}((k+1)\Delta-\tau)}\,\mathcal{B}\,d\tau\right)\mathbf{u}[k] \\
&= e^{\mathcal{A}\Delta}\,\mathbf{h}[k] + \left(\int_{0}^{\Delta} e^{\mathcal{A}(\Delta-\lambda)}\,\mathcal{B}\,d\lambda\right)\mathbf{u}[k] \\
&= e^{\mathcal{A}\Delta}\,\mathbf{h}[k] + \left(\int_{0}^{\Delta} e^{\mathcal{A}\nu}\,\mathcal{B}\,d\nu\right)\mathbf{u}[k].
\end{aligned}$$

Therefore, the discretized system is:

$$\mathbf{h}[k+1,c] = \bar{\mathcal{A}}\,\mathbf{h}[k,c] + \bar{\mathcal{B}}\begin{bmatrix}\psi[k]\\ x[k,c]\end{bmatrix},$$

where:

$$\bar{\mathcal{A}} = e^{\mathcal{A}\Delta} = \exp\!\left(\begin{bmatrix} A_h & 0 \\ A_{vh} & A_v \end{bmatrix}\Delta\right), \qquad \bar{\mathcal{B}} = \left(\int_{0}^{\Delta} e^{\mathcal{A}\tau}\,d\tau\right)\mathcal{B}.$$

A-B Convolution Formulation of Variable-Invariant 2D SSM

Applying the recurrent rules in Eq. 11-13, we can write the output as:

$$y[t,c] = \sum_{\tau=0}^{t-1} K_\psi[t-1-\tau]\,\psi[\tau] + \sum_{\tau=0}^{t-1} K_x[t-1-\tau]\,x[\tau,c],$$

where the global-information kernel is $K_\psi[k] = [C_h, C_v]\,\bar{\mathcal{A}}^{k}\,\bar{\mathcal{B}}[:,0]$ and the input kernel is $K_x[k] = [C_h, C_v]\,\bar{\mathcal{A}}^{k}\,\bar{\mathcal{B}}[:,1]$.

Appendix B Theoretical Proofs

B-A Proposition 1

Let $\pi$ be any permutation of $\{1,\dots,C\}$ and define $\psi_\pi(t) := \phi\big(\{W_v z(t,\pi(c))\}_{c=1}^{C}\big)$. Assume (i) $W_v$ is variable-shared (independent of $c$), and (ii) $\phi$ is permutation-invariant on multisets. Then $\psi_\pi(t) = \psi(t)$ for all $t$.

Proof. Define the variable-wise multiset $S_t = \{W_v z(t,c)\}_{c=1}^{C}$. By definition, $\psi(t) = \phi(S_t)$. Under a permutation $\pi \in S_C$, since $z$ is permutation-equivariant along the variable axis, we have $z^\pi(t,c) = z(t,\pi(c))$.
Hence, the permuted collection becomes $S_t^\pi = \{W_v z^\pi(t,c)\}_{c=1}^{C} = \{W_v z(t,\pi(c))\}_{c=1}^{C}$. Because $\pi$ is a bijection on $\{1,\dots,C\}$, $S_t^\pi$ coincides with $S_t$ as a multiset. Since $\phi$ is permutation-invariant, $\phi(S_t^\pi) = \phi(S_t)$. Therefore, $\psi_\pi(t) = \psi(t)$ for all $t$. ∎

B-B Proposition 2

Assume $A_h, A_v, A_{h\psi}, A_{v\psi}, A_{vh}, B_h, B_v$ are variable-shared (independent of $c$), and let $\psi(t)$ be defined as above with a permutation-invariant $\phi$ (Prop. 1). For any permutation $\pi$ of $\{1,\dots,C\}$, consider the permuted inputs $x^\pi(t,c) := x(t,\pi(c))$ and permuted initial states $h^\pi(0,c) := h(0,\pi(c))$. Then the unique solutions to Eq. 8-9 satisfy $h_h^\pi(t,c) = h_h(t,\pi(c))$ and $h_v^\pi(t,c) = h_v(t,\pi(c))$ for all $t, c$; i.e., the dynamics are permutation-equivariant with respect to the variable axis.

Proof. Let $(h_h, h_v)$ be a solution of the continuous dynamics

$$\frac{\partial h_h(t,c)}{\partial t} = A_h h_h(t,c) + A_{h\psi}\,\psi(t) + B_h x(t,c),$$

$$\frac{\partial h_v(t,c)}{\partial t} = A_v h_v(t,c) + A_{v\psi}\,\psi(t) + A_{vh} h_h(t,c) + B_v x(t,c),$$

with initial conditions $h_h(0,c), h_v(0,c)$. Here, the matrices $A_h, A_v, A_{h\psi}, A_{v\psi}, A_{vh}, B_h, B_v$ are variable-shared (do not depend on $c$), and $\psi(t) = \phi\big(\{W_v h_v(t,c)\}_{c=1}^{C}\big)$ with $\phi$ permutation-invariant (Prop. 1). Fix any permutation $\pi$ of $\{1,\dots,C\}$, and let the permuted inputs be $x^\pi(t,c) := x(t,\pi(c))$ and the permuted initial states $h_h^\pi(0,c) := h_h(0,\pi(c))$, $h_v^\pi(0,c) := h_v(0,\pi(c))$. Consider the candidate solutions $\tilde{h}_h(t,c) := h_h(t,\pi(c))$ and $\tilde{h}_v(t,c) := h_v(t,\pi(c))$. We show that $(\tilde{h}_h, \tilde{h}_v)$ satisfies the permuted system driven by $x^\pi$ with initial conditions $h_h^\pi(0,\cdot), h_v^\pi(0,\cdot)$.
$\psi$ is invariant under variable permutations. By Prop. 1, for any $t$, $\psi^\pi(t) = \psi(t)$. Hence, the pooled descriptor used in the dynamics is unchanged by the permutation.

$\tilde{h}_h, \tilde{h}_v$ satisfy the permuted ODEs. Differentiating $\tilde{h}_h$ and using Eq. 8,

$$\begin{aligned}
\partial_t \tilde{h}_h(t,c) &= \partial_t h_h(t,\pi(c)) \\
&= A_h h_h(t,\pi(c)) + A_{h\psi}\,\psi(t) + B_h x(t,\pi(c)) \\
&= A_h \tilde{h}_h(t,c) + A_{h\psi}\,\psi(t) + B_h x^\pi(t,c).
\end{aligned}$$

Likewise, using Eq. 9,

$$\partial_t \tilde{h}_v(t,c) = A_v \tilde{h}_v(t,c) + A_{v\psi}\,\psi(t) + A_{vh}\,\tilde{h}_h(t,c) + B_v x^\pi(t,c).$$

By construction, $\tilde{h}_h(0,c) = h_h^\pi(0,c)$ and $\tilde{h}_v(0,c) = h_v^\pi(0,c)$. Thus, $(\tilde{h}_h, \tilde{h}_v)$ is a solution of the permuted system with input $x^\pi$ and the permuted initial states.

Uniqueness. The system is linear time-varying (through $\psi$) but globally Lipschitz in the states; hence its solutions are unique. Therefore, the unique solution $(h_h^\pi, h_v^\pi)$ of the permuted system coincides with $(\tilde{h}_h, \tilde{h}_v)$, i.e.,

$$\big(h_h^\pi(t,c),\, h_v^\pi(t,c)\big) = \big(\tilde{h}_h(t,c),\, \tilde{h}_v(t,c)\big) = \big(h_h(t,\pi(c)),\, h_v(t,\pi(c))\big), \qquad \forall\, t, c.$$

This proves permutation equivariance of the dynamics with respect to the variable axis. ∎

B-C Theorem 1

Let $M \in \mathbb{R}^{C\times C}$. Then $M P_\pi = P_\pi M$ for all $\pi \in S_C$ if and only if $M = \alpha I_C + \beta\,\mathbf{1}\mathbf{1}^\top$ for some scalars $\alpha, \beta \in \mathbb{R}$. In particular, any linear permutation-equivariant state coupling must take the form

$$h_v(t+1,c) = \alpha\, h_v(t,c) + \beta \sum_{j=1}^{C} h_v(t,j) + N x(t,c).$$

Proof. Sufficiency. Let $M = \alpha I_C + \beta\,\mathbf{1}\mathbf{1}^\top$. Since permutation matrices satisfy $P_\pi I_C = I_C P_\pi$, $P_\pi \mathbf{1} = \mathbf{1}$, and $\mathbf{1}^\top P_\pi = \mathbf{1}^\top$, it follows that

$$P_\pi(\mathbf{1}\mathbf{1}^\top) = (P_\pi\mathbf{1})\mathbf{1}^\top = \mathbf{1}\mathbf{1}^\top = \mathbf{1}(\mathbf{1}^\top P_\pi) = (\mathbf{1}\mathbf{1}^\top)P_\pi.$$

Hence both $I_C$ and $\mathbf{1}\mathbf{1}^\top$ commute with every permutation matrix, and therefore $M P_\pi = P_\pi M$ for all $\pi \in S_C$.
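The sufficiency direction admits a direct numerical check: a matrix of the canonical form $\alpha I_C + \beta\,\mathbf{1}\mathbf{1}^\top$ commutes with every permutation matrix, so applying it to permuted states is the same as permuting its output. A minimal sketch (the values of $\alpha$, $\beta$, and $C$ below are arbitrary):

```python
import numpy as np
from itertools import permutations

C = 4
alpha, beta = 0.7, -0.2
# Canonical equivariant coupling: local self-dynamics plus global pooled term.
M = alpha * np.eye(C) + beta * np.ones((C, C))

def perm_matrix(p):
    P = np.zeros((C, C))
    P[np.arange(C), p] = 1.0
    return P

# (i) M commutes with every permutation matrix in S_C.
assert all(np.allclose(M @ perm_matrix(p), perm_matrix(p) @ M)
           for p in permutations(range(C)))

# (ii) Equivalently, M(Ph) = P(Mh): permuting the variables permutes the update.
h = np.random.default_rng(0).normal(size=C)
for p in permutations(range(C)):
    P = perm_matrix(p)
    assert np.allclose(M @ (P @ h), P @ (M @ h))

# (iii) A generic non-canonical matrix fails the commutation test.
M_bad = np.arange(C * C, dtype=float).reshape(C, C)
assert not all(np.allclose(M_bad @ perm_matrix(p), perm_matrix(p) @ M_bad)
               for p in permutations(range(C)))
```

Acting on a state vector, $(Mh)_c = \alpha h_c + \beta \sum_j h_j$, which is exactly the canonical coupling in the theorem statement (with the input term $N x(t,c)$ applied channel-wise).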
Necessity. Assume $M P_\pi = P_\pi M$ for all $\pi \in S_C$. Since the symmetric group is generated by transpositions, it suffices to consider permutation matrices corresponding to swaps of two indices $i \neq j$. Let $P_{ij}$ denote the matrix that exchanges indices $i$ and $j$. Right-multiplication by $P_{ij}$ swaps columns $i$ and $j$, while left-multiplication swaps rows $i$ and $j$. Thus, the commutation relation $M P_{ij} = P_{ij} M$ implies that exchanging columns $i$ and $j$ yields the same matrix as exchanging rows $i$ and $j$. From entrywise comparison, we obtain:

1. All diagonal entries are equal: $M_{ii} = M_{jj}$ for all $i, j$.
2. All off-diagonal entries are equal: $M_{ik} = M_{jk}$ and $M_{ki} = M_{kj}$ for all $i \neq j$, $k \notin \{i,j\}$.

Hence there exist scalars $d$ and $o$ such that $M_{ii} = d$ and $M_{ij} = o$ ($i \neq j$). Therefore,

$$M = o\,\mathbf{1}\mathbf{1}^\top + (d-o)\,I_C = \alpha I_C + \beta\,\mathbf{1}\mathbf{1}^\top,$$

where $\alpha = d - o$ and $\beta = o$. ∎

B-D Theorem 2

Consider a 2D SSM with $C$ variables.

1. If inter-variable coupling is implemented via ordered recurrence (e.g., $h_v(t,c+1)$ depends on $h_v(t,c)$), then the update at each fixed time $t$ induces a dependency chain of length $C$ along the variable axis.
2. In contrast, the proposed permutation-invariant aggregation $\psi(t)$ reduces the variable-axis dependency depth to $\mathcal{O}(1)$, since all variable-wise updates are conditionally independent given $\psi(t)$.

Proof. We measure the variable-axis dependency depth as the length of the longest directed path in the computational graph at a fixed time step $t$.

1. In an ordered recurrence, $h_v(t,c+1)$ explicitly depends on $h_v(t,c)$. Hence the computation graph forms a directed chain $h_v(t,1) \to h_v(t,2) \to \cdots \to h_v(t,C)$, whose length is $C$.
2. In the proposed formulation, the global descriptor $\psi(t)$ is computed via a permutation-invariant aggregation over $\{z(t,c)\}_{c=1}^{C}$.
This aggregation can be implemented as a parallel reduction, whose depth is independent of $C$ (e.g., $\mathcal{O}(1)$ under ideal parallelization, or $\mathcal{O}(\log C)$ with a tree reduction). Once $\psi(t)$ is computed, each $h_v(t+1,c)$ depends only on local quantities and $\psi(t)$, so all updates can be performed in parallel. Hence, the longest dependency path along the variable axis is constant (up to reduction depth), establishing $\mathcal{O}(1)$ variable-axis dependency depth. ∎

B-E Theorem 3

If the continuous-time matrix $A_v$ is Hurwitz, i.e., $\Re(\lambda_i(A_v)) < 0$ for all $i$, then for any $\Delta > 0$ the discretized matrix $\bar{A}_v = e^{\Delta A_v}$ satisfies $\rho(\bar{A}_v) < 1$, where $\rho(\cdot)$ denotes the spectral radius.

Proof. Let $\lambda_i$ be an eigenvalue of $A_v$. By the spectral mapping theorem, the eigenvalues of $\bar{A}_v = e^{\Delta A_v}$ are $e^{\Delta \lambda_i}$. Since $\Re(\lambda_i) < 0$ for all $i$ and $\Delta > 0$,

$$\big|e^{\Delta \lambda_i}\big| = e^{\Delta\,\Re(\lambda_i)} < 1.$$

Therefore, all eigenvalues of $\bar{A}_v$ lie strictly inside the unit circle, and hence $\rho(\bar{A}_v) < 1$. ∎

Appendix C Experiment Details

To assess the effectiveness of the proposed method, we conducted extensive comparisons against a broad range of state-of-the-art models, including recent Mamba-based approaches. Baseline results were taken from the corresponding literature to ensure fairness and consistency. The comparison models include Chimera [4], TimePro [31], Simba [37], TCN [30], iTransformer [28], RLinear [25], PatchTST [35], Crossformer [55], TiDE [8], TimesNet [46], DLinear [51], SCINet [26], FEDformer [57], Stationary [29], N-HiTS [6], N-BEATS [36], ETSformer [45], LightTS [54], Autoformer [48], Pyraformer [27], Informer [56], Reformer [19], LSTM [13], LSTNet [22], LSSL [11], Flowformer [47], and LogTrans [24].

TABLE VII: Data description for time series forecasting.
Dataset | Type | Samples (train/validation/test) | Variable dimension | Prediction length
ETTh1, ETTh2 | Long-term | 8545/2881/2881 | 7 | 96, 192, 336, 720
ETTm1, ETTm2 | Long-term | 34465/11521/11521 | 7 | 96, 192, 336, 720
Electricity | Long-term | 18317/2633/5261 | 321 | 96, 192, 336, 720
Traffic | Long-term | 12185/1757/3509 | 862 | 96, 192, 336, 720
Weather | Long-term | 36792/5271/10540 | 21 | 96, 192, 336, 720
Exchange | Long-term | 5120/665/1422 | 8 | 96, 192, 336, 720
M4-Yearly | Short-term | 23000/0/23000 | 1 | 6
M4-Quarterly | Short-term | 24000/0/24000 | 1 | 8
M4-Monthly | Short-term | 48000/0/48000 | 1 | 18
M4-Weekly | Short-term | 359/0/359 | 1 | 13
M4-Daily | Short-term | 4227/0/4227 | 1 | 14
M4-Hourly | Short-term | 414/0/414 | 1 | 48

C-A Datasets

Forecasting. We used eight benchmark datasets for long-term forecasting (Weather, Traffic, Electricity, Exchange, and the four ETT datasets [56]) and the M4 dataset for short-term forecasting [32]. The long-term datasets consist of single continuous sequences from which samples are generated by sliding windows, whereas M4 contains 100,000 heterogeneous series from diverse domains. Dataset descriptions are provided in Table VII.

TABLE VIII: Data description for time series classification and anomaly detection. The last column gives the series length for classification datasets and the sliding-window length for anomaly detection datasets.

Dataset | Type | Samples (train/validation/test) | Variable dimension | Series / window length
EthanolConcentration | Classification | 261/-/263 | 3 | 1751
FaceDetection | Classification | 5890/-/3524 | 144 | 62
Handwriting | Classification | 150/-/850 | 3 | 152
Heartbeat | Classification | 204/-/205 | 61 | 405
JapaneseVowels | Classification | 270/-/370 | 12 | 29
PEMS-SF | Classification | 267/-/173 | 963 | 144
SelfRegulationSCP1 | Classification | 268/-/293 | 6 | 896
SelfRegulationSCP2 | Classification | 200/-/180 | 7 | 1152
SpokenArabicDigits | Classification | 6599/-/2199 | 13 | 93
UWaveGestureLibrary | Classification | 120/-/320 | 3 | 315
SMD | Anomaly detection | 566724/141681/708420 | 38 | 100
MSL | Anomaly detection | 44653/11664/73729 | 55 | 100
SMAP | Anomaly detection | 108146/27037/427617 | 25 | 100
SWaT | Anomaly detection | 396000/99000/449919 | 51 | 100
PSM | Anomaly detection | 105984/26497/87841 | 25 | 100

Classification & Anomaly Detection.
For classification, we used ten benchmark datasets from the UEA archive [2], spanning diverse domains such as vision, audio, industrial monitoring, and medical diagnosis, with most tasks involving around ten classes. For anomaly detection, we adopted five widely used benchmarks: SMD [41], SWaT [33], PSM [1], MSL, and SMAP [14], covering domains such as server machines, spacecraft, and critical infrastructure. Dataset descriptions are summarized in Table VIII.

TABLE IX: Computational analysis of permutation-invariant aggregation functions. The trade-off between representational expressiveness and memory/compute overhead motivates task-adaptive selection of ψ.

Aggregation ψ | Work (per timestep) | Parallel span | Peak memory overhead | Characteristics
Mean | O(Cd) | O(log C) | O(d) | Stable, low-variance descriptor
Sum | O(Cd) | O(log C) | O(d) | Same cost as mean, without normalization
Attention | O(Cd^2) | O(log C) | O(Cd) | Expressive; selectively emphasizes informative variables

C-B Aggregation Function and Computational Behavior

Choice of the Aggregation Function ψ. The proposed formulation admits any permutation-invariant aggregation operator ψ, such as mean/sum pooling or attention-based pooling. In practice, the default choice of ψ is selected according to the inductive bias and computational structure of each task:

• Forecasting. Multi-step forecasting tasks (e.g., ETTh1/ETTm1) involve strong shared seasonal components across variables and are highly sensitive to variance amplification. Mean pooling produces a low-variance global descriptor that suppresses channel-wise noise and stabilizes long-horizon prediction, leading to more robust extrapolation than attention-based pooling.

• Classification & Anomaly Detection. In contrast, these tasks rely on discriminative, potentially sparse cross-variable interactions, where only a subset of channels may indicate class identity or abnormal behavior.
Mean pooling may dilute these informative variables, whereas attention pooling selectively emphasizes salient channels, improving representation quality.

GPU Throughput and Memory Usage. To clarify the computational implications of different choices of ψ, Table IX summarizes the work complexity, parallel span, and peak memory usage, where C denotes the number of variables and d the hidden dimension. Mean/sum pooling require only a single accumulator vector and do not store intermediate key/value tensors, resulting in minimal peak memory. Attention pooling, in contrast, stores per-variable projections and attention scores, so its memory grows linearly with C. This explains why forecasting, which benefits from stability and scalability, is paired with mean pooling, whereas the more expressive tasks (classification and anomaly detection) are paired with attention-based pooling.

C-C Full Experimental Results

The overall results for long- and short-term forecasting, classification, and anomaly detection are reported in Tables X-XIII.

TABLE X: Full results of long-term time-series forecasting.
Ours Chimera TimePro Simba TCN iTransformer RLinear PatchTST Crossformer TiDE TimesNet DLinear SCINet FEDformer MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE ETTm1 96 0.287 0.342 0.318 0.354 0.326 0.364 0.324 0.360 0.292 0.346 0.334 0.368 0.355 0.376 0.329 0.367 0.404 0.426 0.364 0.387 0.338 0.375 0.345 0.372 0.418 0.438 0.379 0.419 192 0.326 0.367 0.331 0.369 0.367 0.383 0.363 0.382 0.332 0.368 0.377 0.391 0.391 0.392 0.367 0.385 0.450 0.451 0.398 0.404 0.374 0.387 0.380 0.389 0.439 0.450 0.426 0.441 336 0.359 0.387 0.363 0.389 0.402 0.409 0.395 0.405 0.365 0.391 0.426 0.420 0.424 0.415 0.399 0.410 0.532 0.515 0.428 0.425 0.410 0.411 0.413 0.413 0.490 0.485 0.445 0.459 720 0.409 0.416 0.409 0.415 0.469 0.446 0.451 0.437 0.416 0.417 0.491 0.459 0.487 0.450 0.454 0.439 0.666 0.589 0.487 0.461 0.478 0.450 0.474 0.453 0.595 0.550 0.543 0.490 Avg 0.345 0.378 0.355 0.381 0.391 0.400 0.383 0.396 0.351 0.381 0.407 0.410 0.414 0.407 0.387 0.400 0.513 0.496 0.419 0.419 0.400 0.406 0.403 0.407 0.485 0.481 0.448 0.452 ETTm2 96 0.169 0.259 0.169 0.265 0.178 0.260 0.177 0.263 0.166 0.256 0.180 0.264 0.182 0.265 0.175 0.259 0.287 0.366 0.207 0.305 0.187 0.267 0.193 0.292 0.286 0.377 0.203 0.287 192 0.223 0.295 0.221 0.290 0.242 0.303 0.245 0.306 0.222 0.293 0.250 0.309 0.246 0.304 0.241 0.302 0.414 0.492 0.290 0.364 0.249 0.309 0.284 0.362 0.399 0.445 0.269 0.328 336 0.279 0.332 0.279 0.339 0.303 0.342 0.304 0.343 0.272 0.324 0.311 0.348 0.307 0.342 0.305 0.343 0.597 0.542 0.377 0.422 0.321 0.351 0.369 0.427 0.637 0.591 0.325 0.366 720 0.362 0.385 0.342 0.376 0.400 0.399 0.400 0.399 0.351 0.381 0.412 0.407 0.407 0.398 0.402 0.400 1.730 1.042 0.558 0.524 0.408 0.403 0.554 0.522 0.960 0.735 0.421 0.415 Avg 0.258 0.318 0.252 0.317 0.281 0.326 0.271 0.327 0.253 0.314 0.288 0.332 0.286 0.327 0.281 0.326 0.757 0.610 0.358 0.404 0.291 0.333 0.350 0.401 0.571 0.537 0.305 0.349 ETTh1 96 0.365 0.394 0.366 0.392 0.375 0.398 
0.379 0.395 0.368 0.394 0.386 0.405 0.386 0.395 0.414 0.419 0.423 0.448 0.479 0.464 0.384 0.402 0.386 0.400 0.654 0.599 0.376 0.419 192 0.407 0.417 0.402 0.414 0.427 0.429 0.432 0.424 0.405 0.413 0.441 0.436 0.437 0.424 0.460 0.445 0.471 0.474 0.525 0.492 0.436 0.429 0.437 0.432 0.719 0.631 0.420 0.448 336 0.385 0.416 0.406 0.419 0.472 0.450 0.473 0.443 0.391 0.412 0.487 0.458 0.479 0.446 0.501 0.466 0.570 0.546 0.565 0.515 0.491 0.469 0.481 0.459 0.778 0.659 0.459 0.465 720 0.434 0.452 0.458 0.477 0.476 0.474 0.483 0.469 0.450 0.461 0.503 0.491 0.481 0.470 0.500 0.488 0.653 0.621 0.594 0.558 0.521 0.500 0.519 0.516 0.836 0.699 0.506 0.507 Avg 0.397 0.419 0.408 0.425 0.438 0.438 0.441 0.432 0.404 0.420 0.454 0.447 0.446 0.434 0.469 0.454 0.529 0.522 0.541 0.507 0.458 0.450 0.456 0.452 0.747 0.647 0.440 0.460 ETTh2 96 0.257 0.329 0.262 0.327 0.293 0.345 0.290 0.339 0.263 0.332 0.297 0.349 0.288 0.338 0.302 0.348 0.745 0.584 0.400 0.440 0.340 0.374 0.333 0.387 0.707 0.621 0.358 0.397 192 0.312 0.367 0.320 0.372 0.367 0.394 0.373 0.390 0.320 0.374 0.380 0.400 0.374 0.390 0.388 0.400 0.877 0.656 0.528 0.509 0.402 0.414 0.477 0.476 0.860 0.689 0.429 0.439 336 0.303 0.367 0.316 0.381 0.419 0.431 0.376 0.406 0.313 0.376 0.428 0.432 0.415 0.426 0.426 0.433 1.043 0.731 0.643 0.571 0.452 0.452 0.594 0.541 1.000 0.744 0.496 0.487 720 0.381 0.425 0.389 0.430 0.427 0.445 0.407 0.431 0.392 0.433 0.427 0.445 0.420 0.440 0.431 0.446 1.104 0.763 0.874 0.679 0.462 0.468 0.831 0.657 1.249 0.838 0.463 0.474 Avg 0.313 0.372 0.321 0.377 0.377 0.403 0.361 0.391 0.322 0.379 0.383 0.407 0.374 0.398 0.387 0.407 0.942 0.684 0.611 0.550 0.414 0.427 0.559 0.515 0.954 0.723 0.437 0.449 ECL 96 0.127 0.223 0.132 0.234 0.139 0.234 0.165 0.253 0.129 0.226 0.148 0.240 0.201 0.281 0.181 0.270 0.219 0.314 0.237 0.329 0.168 0.272 0.197 0.282 0.247 0.345 0.193 0.308 192 0.147 0.234 0.144 0.223 0.156 0.249 0.173 0.262 0.143 0.239 0.162 0.253 0.201 0.283 0.188 0.274 0.231 0.322 0.236 0.330 0.184 0.289 
0.196 0.285 0.257 0.355 0.201 0.315 336 0.165 0.266 0.156 0.259 0.172 0.267 0.188 0.277 0.161 0.259 0.178 0.269 0.215 0.298 0.204 0.293 0.246 0.337 0.249 0.344 0.198 0.300 0.209 0.301 0.269 0.369 0.214 0.329 720 0.196 0.295 0.184 0.280 0.209 0.299 0.214 0.305 0.191 0.286 0.225 0.317 0.257 0.331 0.246 0.324 0.280 0.363 0.284 0.373 0.220 0.320 0.245 0.333 0.299 0.390 0.246 0.355 Avg 0.158 0.254 0.154 0.249 0.169 0.262 0.185 0.274 0.156 0.253 0.178 0.270 0.219 0.298 0.205 0.290 0.244 0.334 0.251 0.344 0.192 0.295 0.212 0.300 0.268 0.365 0.214 0.327 Exchange 96 0.085 0.204 0.077 0.198 0.085 0.204 - - 0.080 0.196 0.086 0.206 0.093 0.217 0.088 0.205 0.256 0.367 0.094 0.218 0.107 0.234 0.088 0.218 0.267 0.396 0.148 0.278 192 0.179 0.304 0.159 0.270 0.178 0.299 - - 0.166 0.288 0.177 0.299 0.184 0.307 0.176 0.299 0.470 0.509 0.184 0.307 0.226 0.344 0.176 0.315 0.351 0.459 0.271 0.315 336 0.330 0.415 0.311 0.344 0.328 0.414 - - 0.307 0.398 0.331 0.417 0.351 0.432 0.301 0.397 1.268 0.883 0.349 0.431 0.367 0.448 0.313 0.427 1.324 0.853 0.460 0.427 720 0.653 0.580 0.697 0.623 0.817 0.679 - - 0.656 0.582 0.847 0.691 0.886 0.714 0.901 0.714 1.767 1.068 0.852 0.698 0.964 0.746 0.839 0.695 1.058 0.797 1.195 0.695 Avg 0.311 0.375 0.311 0.358 0.352 0.399 - - 0.302 0.366 0.360 0.403 0.378 0.417 0.367 0.404 0.940 0.707 0.370 0.413 0.416 0.443 0.354 0.414 0.750 0.626 0.519 0.429 Traffic 96 0.369 0.254 0.366 0.248 - - 0.468 0.268 0.368 0.253 0.395 0.268 0.649 0.389 0.462 0.295 0.522 0.290 0.805 0.493 0.593 0.321 0.650 0.396 0.788 0.499 0.587 0.366 192 0.381 0.278 0.394 0.292 - - 0.413 0.317 0.379 0.261 0.417 0.276 0.601 0.366 0.466 0.296 0.530 0.293 0.756 0.474 0.617 0.336 0.598 0.370 0.789 0.505 0.604 0.373 336 0.401 0.278 0.409 0.311 - - 0.529 0.284 0.397 0.270 0.433 0.283 0.609 0.369 0.482 0.304 0.558 0.305 0.762 0.477 0.629 0.336 0.605 0.373 0.797 0.508 0.621 0.383 720 0.444 0.295 0.443 0.294 - - 0.564 0.297 0.440 0.296 0.467 0.302 0.647 0.387 0.514 0.322 0.589 0.328 0.719 0.449 
0.640 0.350 0.645 0.394 0.841 0.523 0.626 0.382 Avg 0.398 0.276 0.403 0.286 - - 0.493 0.291 0.398 0.270 0.428 0.282 0.626 0.378 0.481 0.304 0.550 0.304 0.760 0.473 0.620 0.336 0.625 0.383 0.804 0.509 0.610 0.376 Weather 96 0.150 0.201 0.146 0.206 0.166 0.207 0.176 0.219 0.149 0.200 0.174 0.214 0.192 0.232 0.177 0.218 0.158 0.230 0.202 0.261 0.172 0.220 0.196 0.255 0.221 0.306 0.217 0.296 192 0.198 0.245 0.189 0.239 0.216 0.254 0.222 0.260 0.196 0.245 0.221 0.254 0.240 0.271 0.225 0.259 0.206 0.277 0.242 0.298 0.219 0.261 0.237 0.296 0.261 0.340 0.276 0.336 336 0.249 0.285 0.244 0.281 0.273 0.296 0.275 0.297 0.238 0.277 0.278 0.296 0.292 0.307 0.278 0.297 0.272 0.335 0.287 0.335 0.280 0.306 0.283 0.335 0.309 0.378 0.339 0.380 720 0.312 0.326 0.297 0.309 0.351 0.346 0.350 0.349 0.314 0.334 0.358 0.347 0.364 0.353 0.354 0.348 0.398 0.418 0.351 0.386 0.365 0.359 0.345 0.381 0.377 0.427 0.403 0.428 Avg 0.227 0.264 0.219 0.258 0.251 0.276 0.255 0.280 0.224 0.264 0.258 0.278 0.272 0.291 0.259 0.281 0.259 0.315 0.271 0.320 0.259 0.287 0.265 0.317 0.292 0.363 0.309 0.360 TABLE XI: Full results of short-term forecasting in the M4 dataset. 
Ours Chimera ModernTCN PatchTST TimesNet N-HiTS N-BEATS ETSformer LightTS DLinear FEDformer Stationary Autoformer Pyraformer Informer Reformer Yearly SMAPE 13.150 13.107 13.226 13.258 13.387 13.418 13.436 18.009 14.247 16.965 13.728 13.717 13.974 15.530 14.727 16.169 MASE 2.950 2.902 2.957 2.985 2.996 3.045 3.043 4.487 3.109 4.283 3.048 3.078 3.134 3.711 3.418 3.800 OWA 0.773 0.767 0.777 0.781 0.786 0.793 0.794 1.115 0.827 1.058 0.803 0.807 0.822 0.942 0.881 0.973 Quarterly SMAPE 9.949 9.892 9.971 10.179 10.100 10.202 10.124 13.376 11.364 12.145 10.792 10.958 11.338 15.449 11.360 13.313 MASE 1.160 1.105 1.167 0.803 1.182 1.194 1.169 1.906 1.328 1.520 1.283 1.325 1.365 2.350 1.401 1.775 OWA 0.875 0.853 0.878 0.803 0.890 0.899 0.886 1.302 1.000 1.106 0.958 0.981 1.012 1.558 1.027 1.252 Monthly SMAPE 12.563 12.549 12.556 12.641 12.670 12.791 12.677 14.588 14.014 13.514 14.260 13.917 13.958 17.642 14.062 20.128 MASE 0.922 0.914 0.917 0.930 0.933 0.969 0.937 1.368 1.053 1.037 1.102 1.097 1.103 1.913 1.141 2.614 OWA 0.869 0.864 0.866 0.876 0.878 0.899 0.880 1.149 0.981 0.956 1.012 0.998 1.002 1.511 1.024 1.927 Others SMAPE 4.708 4.685 4.715 4.946 4.891 5.061 4.925 7.267 15.880 6.709 4.954 6.302 5.485 24.786 24.460 32.491 MASE 3.165 3.007 3.107 2.985 3.302 3.216 3.391 5.240 11.434 4.953 3.264 4.064 3.865 18.581 20.960 33.355 OWA 0.994 0.983 0.986 1.044 1.035 1.040 1.053 1.591 3.474 1.487 1.036 1.304 1.187 5.538 5.013 8.679 Weighted Avg SMAPE 11.686 11.618 11.698 11.807 11.829 11.927 11.851 14.718 13.525 13.639 12.840 12.780 12.909 16.987 14.086 18.200 MASE 1.549 1.528 1.556 1.590 1.585 1.613 1.599 2.408 2.111 2.095 1.701 1.756 1.771 3.265 2.718 4.223 OWA 0.832 0.827 0.838 0.851 0.851 0.861 0.855 1.172 1.051 1.051 0.918 0.930 0.939 1.480 1.230 1.775 TABLE XII: Full results of the classification task (accuracy %) Datasets LSTM LSTNet LSSL Reformer Informer Pyraformer Autoformer Stationary FEDformer ETSformer Flowformer DLinear LightTS TimesNet PatchTST MTCN Chimera Ours 
EthanolConcentration 32.3 39.9 31.1 31.9 31.6 30.8 31.6 32.7 31.2 28.1 33.8 32.6 29.7 35.7 32.8 36.3 39.8 35.7 FaceDetection 57.7 65.7 66.7 68.6 67.0 65.7 68.4 68.0 66.0 66.3 67.6 68.0 67.5 68.6 68.3 70.8 70.4 69.6 Handwriting 15.2 25.8 24.6 27.4 32.8 29.4 36.7 31.6 28.0 32.5 33.8 27.0 26.1 32.1 29.6 30.6 32.9 35.3 Heartbeat 72.2 77.1 72.7 77.1 80.5 75.6 74.6 73.7 73.7 71.2 77.6 75.1 75.1 78.0 74.9 77.2 81.3 78.0 JapaneseVowels 79.7 98.1 98.4 97.8 98.9 98.4 96.2 99.2 98.4 95.9 98.9 96.2 96.2 98.4 97.5 98.8 99.1 98.4 PEMS-SF 39.9 86.7 86.1 82.7 81.5 83.2 82.7 87.3 80.9 86.0 83.8 75.1 88.4 89.6 89.3 89.1 89.5 88.1 SelfRegulationSCP1 68.9 84.0 90.8 90.4 90.1 88.1 84.0 89.4 88.7 89.6 92.5 87.3 89.8 91.8 90.7 93.4 93.7 92.2 SelfRegulationSCP2 46.6 52.8 52.2 56.7 53.3 53.3 50.6 57.2 54.4 55.0 56.1 50.5 51.1 57.2 57.8 60.3 59.9 57.8 SpokenArabicDigits 31.9 100.0 100.0 97.0 100.0 99.6 100.0 100.0 100.0 100.0 98.8 81.4 100.0 99.0 98.3 98.7 100.0 99.0 UWaveGestureLibrary 41.2 87.8 85.9 85.6 85.6 83.4 85.9 87.5 85.3 85.0 86.6 82.1 80.3 85.3 85.8 86.7 86.7 90.0 Average Accuracy 48.6 71.8 70.9 71.5 72.1 70.8 71.1 72.7 70.7 71.0 73.0 67.5 70.4 73.6 72.5 74.2 75.3 74.4 TABLE XIII: Full results of the anomaly detection task. Metrics include Precision (P), Recall (R), and F1-score. 
SMD MSL SMAP SWaT PSM Avg F1 P R F1 P R F1 P R F1 P R F1 P R F1 LSTM 78.52 65.47 71.41 78.04 86.22 81.93 91.06 57.49 70.48 78.06 91.72 84.34 69.24 99.53 81.67 77.97 Transformer 83.58 76.13 79.56 71.57 87.37 78.68 89.37 57.12 69.70 68.84 96.53 80.37 62.75 96.56 76.07 76.88 LogTrans 83.46 70.13 76.21 73.05 87.37 79.57 89.15 57.59 69.97 68.67 97.32 80.52 63.06 98.00 76.74 76.60 TCN 84.06 79.07 81.49 75.11 82.44 78.60 86.90 59.23 70.45 76.59 95.71 85.09 54.59 99.77 70.57 77.24 Reformer 82.58 69.24 75.32 85.51 83.31 84.40 90.91 57.44 70.40 72.50 96.53 82.80 59.93 95.38 73.61 77.31 Informer 86.60 77.23 81.65 81.77 86.48 84.06 90.11 57.13 69.92 70.29 96.75 81.43 64.27 96.33 77.10 78.83 Anomaly* 88.91 82.23 85.49 79.61 87.37 83.31 91.85 58.11 71.18 72.51 97.32 83.10 68.35 94.72 79.40 80.50 Pyraformer 85.61 80.61 83.04 83.81 85.93 84.86 92.54 57.71 71.09 87.92 96.00 91.78 71.67 96.02 82.08 82.57 Autoformer 88.06 82.35 85.11 77.27 80.92 79.05 90.40 58.62 71.12 89.85 95.81 92.74 99.08 88.15 93.29 84.26 LSSL 78.51 65.32 71.31 77.55 88.18 82.53 89.43 53.43 66.90 79.05 93.72 85.76 66.02 92.93 77.20 76.74 Stationary 88.33 81.21 84.62 68.55 89.14 77.50 89.37 59.02 71.09 68.03 96.75 79.88 97.82 96.76 97.29 82.08 DLinear 83.62 71.52 77.10 84.34 85.42 84.88 92.32 55.41 69.26 80.91 95.30 87.52 98.28 89.26 93.55 82.46 ETSformer 87.44 79.23 83.13 85.13 84.93 85.03 92.25 55.75 69.50 90.02 80.36 84.91 99.31 85.28 91.76 82.87 LightTS 87.10 78.42 82.53 82.40 75.78 78.95 92.58 55.27 69.21 91.98 94.72 93.33 98.37 95.97 97.15 84.23 FEDformer 87.95 82.39 85.08 77.14 80.07 78.57 90.47 58.10 70.76 90.17 96.42 93.19 97.31 97.16 97.23 84.97 TimesNet (I) 87.76 82.63 85.12 82.97 85.42 84.18 91.50 57.80 70.85 88.31 96.24 92.10 98.22 92.21 95.21 85.49 TimesNet (R) 88.66 83.14 85.81 83.92 86.42 85.15 92.52 58.29 71.52 86.76 97.32 91.74 98.19 96.76 97.47 86.34 CrossFormer 83.60 76.61 79.70 84.68 83.71 84.19 92.04 55.37 69.14 88.49 93.48 90.92 97.16 89.73 93.30 83.45 PatchTST 87.42 81.65 84.44 84.07 86.23 
85.14 92.43 57.51 70.91 80.70 94.93 87.24 98.87 93.99 96.37 84.82 ModernTCN 87.86 83.85 85.81 83.94 85.93 84.92 93.17 57.69 71.26 91.83 95.98 93.86 98.09 96.38 97.23 86.62 Chimera 87.74 83.29 85.46 84.01 86.83 85.39 93.05 58.12 71.55 92.18 95.93 94.01 97.30 96.19 96.74 86.69 Ours 86.31 81.33 83.74 85.66 84.31 84.97 92.24 64.86 76.17 91.76 95.29 93.94 98.20 96.67 97.42 87.24
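The F1 scores in Table XIII are the harmonic mean of precision and recall, and the final "Avg F1" column averages F1 over the five benchmarks. A quick arithmetic check against the "Ours" row (a sketch using the rounded two-decimal table values, so small residuals are expected):

```python
def f1(p, r):
    # F1 is the harmonic mean of precision and recall (values in percent).
    return 2 * p * r / (p + r)

# Precision / recall / reported F1 for the "Ours" row of Table XIII.
ours = {
    "SMD":  (86.31, 81.33, 83.74),
    "MSL":  (85.66, 84.31, 84.97),
    "SMAP": (92.24, 64.86, 76.17),
    "SWaT": (91.76, 95.29, 93.94),
    "PSM":  (98.20, 96.67, 97.42),
}

# The reported average F1 (87.24) is the mean of the per-dataset F1 column.
avg_f1 = sum(v[2] for v in ours.values()) / len(ours)

# Rounding leaves small residuals, e.g. for SMD:
# f1(86.31, 81.33) is about 83.75 vs. the reported 83.74.
```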