Paper deep dive
Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control
Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Abstract
Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is expensive and quickly becomes stale as market dynamics evolve. This paper introduces an adaptive prediction framework that identifies deviations from normal market conditions and routes data through specialized prediction pathways. The architecture consists of three components: (1) an autoencoder trained on normal market conditions that identifies anomalous regimes through reconstruction error, (2) dual node transformer networks specialized for stable and event-driven market conditions respectively, and (3) a Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending weights based on prediction performance feedback. The reinforcement learning component enables the system to learn adaptive regime boundaries, defining anomalies as market states where standard prediction approaches fail. Experiments on 20 S&P 500 stocks spanning 1982 to 2025 demonstrate that the proposed framework achieves 0.68% MAPE for one-day predictions without the reinforcement controller and 0.59% MAPE with the full adaptive system, compared to 0.80% for the baseline integrated node transformer. Directional accuracy reaches 72% with the complete framework. The system maintains robust performance during high-volatility periods, with MAPE below 0.85% when baseline models exceed 1.5%. Ablation studies confirm that each component contributes meaningfully: autoencoder routing accounts for a 36% relative MAPE degradation upon removal, followed by the SAC controller at 15% and the dual-path architecture at 7%.
Links
- Source: https://arxiv.org/abs/2603.19136v1
- Canonical: https://arxiv.org/abs/2603.19136v1
Full Text
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Financial markets operate across distinct regimes characterized by different statistical properties, volatility levels, and correlation structures [19]. During stable periods, price movements follow relatively predictable patterns driven by fundamental factors and gradual information incorporation. Crisis periods, earnings announcements, and macroeconomic shocks induce abrupt shifts in market behavior where historical patterns provide limited guidance. Models trained on aggregate historical data often perform well on average but degrade under volatile or event-driven conditions, where robust prediction is especially important.

Prior work on stock prediction has often treated market conditions as homogeneous. Graph neural networks capture cross-sectional dependencies [9], transformers model temporal dynamics [37], and sentiment analysis incorporates qualitative signals [12]. Our previous work demonstrated that combining node transformer architectures with BERT (Bidirectional Encoder Representations from Transformers) sentiment analysis achieves 0.80% mean absolute percentage error (MAPE) and 65% directional accuracy (DA) on S&P 500 stocks [2]. Yet this integrated model applies the same processing regardless of market conditions, leaving potential gains from regime-aware specialization unexploited.

The challenge of regime detection compounds prediction difficulties. Traditional approaches rely on hidden Markov models [19] or threshold rules on volatility indicators [27], both requiring manual specification of regime definitions. Supervised classifiers demand labeled training data identifying which historical periods constitute crises or anomalies.
Such labels are subjective, backward-looking, and fail to generalize as market structure evolves. A system that automatically discovers regime boundaries from prediction performance itself would avoid these limitations.

This paper introduces an adaptive framework addressing both challenges. An autoencoder trained on normal market data learns to reconstruct typical price patterns; high reconstruction error indicates departure from learned normality. This weakly supervised anomaly score gates data flow through dual node transformer pathways: one optimized for stable conditions, another incorporating event-specific features for turbulent periods. A Soft Actor-Critic (SAC) reinforcement learning controller observes prediction outcomes and adjusts the autoencoder threshold and pathway blending weights to maximize forecasting accuracy. The SAC component adapts the anomaly-routing threshold by discovering which threshold settings improve downstream predictions.

The contributions of this work are:

1. An autoencoder-based regime detection mechanism that identifies market state shifts using weakly supervised anomaly detection trained on historically stable market periods. The autoencoder learns a compressed representation of normal market dynamics; deviations from this representation trigger event-aware processing.

2. A dual node transformer architecture with specialized pathways for stable and volatile market conditions. The event pathway incorporates additional features including volatility regime indicators, sentiment spikes, and event characterization signals.

3. A Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending based on realized prediction performance. This enables the system to learn adaptive regime definitions from prediction outcomes rather than relying on fully hand-labeled regime annotations.

4.
Experimental validation demonstrating a 26% MAPE reduction over the baseline integrated node transformer (0.59% vs 0.80%) and a 7 percentage point improvement in directional accuracy (72% vs 65%).

Section II reviews related work on regime detection, adaptive prediction, and reinforcement learning for financial applications. Section III presents the proposed architecture. Section IV reports experimental results, and Section V discusses findings, limitations, and implications.

II Literature Review

II-A Regime Detection in Financial Markets

Market regime identification has a long history in econometrics and quantitative finance. Hamilton [19] introduced Markov-switching models that probabilistically transition between states with distinct statistical properties. These models estimate regime-specific parameters (means, variances, transition probabilities) via maximum likelihood, enabling classification of historical periods into regimes. Extensions incorporate time-varying transition probabilities [13] and multivariate dependencies.

Threshold models offer an alternative where regime switches occur when observable variables cross specified boundaries. The Self-Exciting Threshold Autoregressive (SETAR) model [35] switches dynamics based on lagged values of the series itself. In finance, volatility indices such as the VIX (CBOE Volatility Index) commonly serve as regime indicators, with thresholds separating low, medium, and high volatility states.

Machine learning approaches to regime detection include clustering methods that partition historical periods based on feature similarity [31], hidden Markov models with neural network emission distributions, and change-point detection algorithms [3]. These methods generally require either explicit labels or assumptions about the number and nature of regimes.
The framework proposed in this paper does not entirely avoid such assumptions, as the primary routing is binary and the event pathway conditions on three VIX-based volatility levels. This design nonetheless requires fewer structural commitments than methods that must specify the number, boundaries, and statistical properties of multiple regime states. The autoencoder learns to distinguish normal from anomalous market conditions through reconstruction error without requiring explicit regime definitions, and the SAC controller continuously adapts the routing threshold based on prediction feedback rather than relying on fixed, manually chosen boundaries. The regime structure is therefore partially discovered from data rather than imposed entirely by the modeler.

II-B Autoencoders for Anomaly Detection

Autoencoders learn compressed representations by reconstructing inputs through an information bottleneck [20]. The encoder maps inputs to a lower-dimensional latent space, and the decoder reconstructs the original input from this representation. When trained on normal data, autoencoders reconstruct typical patterns with low error; anomalous inputs that deviate from the training distribution yield higher reconstruction error, providing an anomaly score.

Variational autoencoders (VAEs) extend this framework by imposing distributional constraints on the latent space [23]. The VAE objective combines reconstruction loss with a regularization term encouraging the latent distribution to match a prior (typically a standard Gaussian). This probabilistic formulation enables principled uncertainty quantification and generation of novel samples.

In financial applications, autoencoders have been applied to fraud detection [32], anomaly identification in trading patterns [1], and feature extraction for downstream prediction tasks [5]. Liu et al.
[25] employed autoencoder-based feature extraction combined with bidirectional LSTM (Long Short-Term Memory) for stock price prediction, reporting improved performance from the learned representations.

II-C Graph Neural Networks for Stock Prediction

Graph neural networks (GNNs) model relational structure among entities through message passing over graph topology [41]. In stock prediction, nodes represent individual securities while edges capture relationships including sectoral affiliation, supply chain connections, or return correlations. Chen et al. [9] proposed a graph convolutional feature-based CNN combining graph convolutions with dual convolutional networks for market-level and stock-level features. Wang et al. [38] introduced multi-graph architectures defining both static (sector) and dynamic (correlation) graphs, achieving a 5.11% error reduction over LSTM baselines on Chinese market indices.

The node transformer architecture [40] extends transformers to graph-structured data through attention mechanisms that respect graph topology. Unlike standard graph neural networks with fixed message-passing schemes, node transformers learn contextualized representations through adaptive attention over graph neighborhoods.

II-D Reinforcement Learning in Finance

Reinforcement learning (RL) optimizes sequential decision-making through interaction with an environment, learning policies that maximize cumulative reward [34]. Financial applications include portfolio optimization [22], order execution [30], and trading strategy development [11].

Deep RL algorithms combine neural network function approximation with RL principles. Deep Q-Networks (DQN) learn action-value functions for discrete action spaces [28], while policy gradient methods directly optimize policies for continuous action spaces. Actor-critic algorithms unify both approaches by combining value estimation (critic) with policy optimization (actor) for improved stability and sample efficiency.
Soft Actor-Critic (SAC) [17] incorporates entropy regularization into the actor-critic framework, encouraging exploration while maintaining policy stability. By adding policy entropy to the reward, the maximum entropy objective prevents premature convergence to deterministic policies, and SAC performs well across continuous control tasks with delayed, noisy reward signals [17, 18]. Here, SAC serves not as a trading agent but as a meta-controller that learns to configure the prediction system itself. The controller adjusts the autoencoder threshold and pathway blending weights based on observed prediction performance, effectively learning what regime definitions optimize downstream forecasting accuracy.

II-E Research Positioning

Prior work has addressed regime detection and stock prediction as separate problems. Regime-switching models identify market states but do not adapt prediction methods accordingly [19, 4, 16], while stock prediction models treat all conditions uniformly or rely on hand-crafted regime indicators [15, 9]. Our framework integrates these elements by pairing unsupervised regime detection (autoencoder) with specialized prediction pathways (dual node transformers) and a SAC controller that learns optimal regime definitions from prediction outcomes. This closed-loop architecture enables the system to discover useful regime boundaries rather than imposing them a priori.

III Methodology

III-A System Overview

Figure 1 presents the complete system architecture. Raw market data flows through feature engineering to produce technical indicators and normalized price features. The autoencoder processes these features, producing a reconstruction error score that quantifies deviation from normal market patterns and serves as the basis for regime classification. Based on this score and a learned threshold, the router directs data to either the normal or event node transformer pathway depending on the detected regime.
Both pathways produce predictions that are blended according to learned weights. The final prediction is evaluated, and the SAC controller uses this feedback to adjust the autoencoder threshold and blending parameters for subsequent iterations.

Figure 1: System architecture overview. Market features x_t enter the autoencoder, which produces reconstruction error e_t. The router directs data to the normal or event node transformer pathway based on whether e_t exceeds the learned threshold τ. Each pathway produces a prediction (ŷ^N_{t+h}, ŷ^E_{t+h}), and adaptive blending combines them into the final forecast ŷ_{t+h}. The SAC controller observes evaluation metrics (RMSE, DA) and adjusts both τ and α to optimize forecasting accuracy.

The term regime in this framework operates at two distinct levels. At the primary level, the autoencoder performs a binary classification of each trading day as either normal or anomalous based on whether its reconstruction error exceeds a learned threshold τ. This binary decision determines routing: days classified as normal are processed by the normal node transformer, while anomalous days are directed to the event node transformer. A binary primary classification is chosen rather than a multi-class scheme for both practical and theoretical reasons. The autoencoder's reconstruction error is a scalar anomaly score that naturally lends itself to thresholding rather than clustering into multiple categories, and the fundamental distinction in anomaly detection is between in-distribution and out-of-distribution inputs.
Attempting to subdivide anomalous states at the routing stage would require assumptions about the number and nature of anomaly categories that the unsupervised autoencoder is not designed to make; instead, that finer-grained characterization is deferred to the event pathway itself.

Within the event pathway, a secondary level of regime characterization captures the heterogeneity of anomalous periods. The event context vector c_t provides the event node transformer with descriptive features including a VIX-based volatility classification into three levels (low, medium, or high, determined by training-period terciles), sentiment spike indicators, earnings event proximity, and cross-asset stress measures. This secondary characterization does not constitute a separate routing mechanism; rather, it conditions the event transformer's internal representations by supplying information about the nature of the detected anomaly. An earnings-driven disruption during a period of otherwise low market volatility produces different price dynamics than a systemic sell-off during an already elevated volatility regime, and the context vector enables the transformer to learn these distinctions from the data. The normal transformer does not receive this additional context because it processes in-distribution samples where market dynamics follow the stable patterns learned during the autoencoder's training phase, making regime-specific conditioning unnecessary.

III-B Feature Engineering

Input features follow established methodologies for financial time series. For each stock i at time t, the raw feature vector comprises:

x_{i,t}^{raw} = [O_t, H_t, L_t, C_t, V_t]   (1)

where O_t, H_t, L_t, C_t denote open, high, low, and closing prices, and V_t is trading volume (collectively referred to as OHLCV data).
Technical indicators include simple moving averages (SMA) at 5, 10, and 20-day windows, exponential moving averages (EMA) at matching windows, 14-day Relative Strength Index (RSI), Moving Average Convergence Divergence (MACD) with standard (12, 26, 9) parameters, daily returns, log returns, and 20-day rolling volatility. Figure 2 illustrates this pipeline.

Figure 2: Feature engineering pipeline. Raw OHLCV data is processed through technical indicator computations (SMA, EMA, RSI, MACD, volatility). All features undergo expanding-window z-score normalization to prevent look-ahead bias, producing prediction features x_{i,t} ∈ R^17 and router-specific features x_{i,t}^{router} ∈ R^6.

In addition to prediction features, router-specific features capture regime-relevant signals:

x_{i,t}^{router} = [σ_t^{(5)}, σ_t^{(20)}, ΔVIX_t, Δρ_t, |S_t|, ν_t^{post}]   (2)

where σ_t^{(k)} is the k-day rolling volatility, ΔVIX_t is the VIX percentage change, Δρ_t is the change in average pairwise correlation among stocks, |S_t| is the absolute sentiment magnitude, and ν_t^{post} is post velocity (the count of X posts mentioning crisis-related keywords within the trading day).

Sentiment enters the router as an absolute value because the router's function is anomaly detection rather than directional prediction. For the purpose of identifying unusual market conditions, the magnitude of sentiment deviation is the relevant signal: both strongly negative sentiment (indicating panic) and strongly positive sentiment (indicating euphoria or speculative excess) represent departures from typical market behavior.
The signed sentiment score S_t is retained in the full prediction feature vector x_{i,t} that reaches the node transformers, so directional information contributes to the price forecasts themselves even though the routing decision depends only on sentiment intensity. These six router features are chosen to capture both gradual shifts (rolling volatility, correlation changes) and abrupt events (VIX spikes, sentiment surges, social media clustering).

Missing values in price data (due to trading halts or data gaps) are handled through temporally-aware imputation. For training data, short gaps (1-2 trading days) use linear interpolation between surrounding known values. For validation and test data, only forward-filling from the most recent observed value is applied to ensure no future information leaks into predictions. Technical indicators (SMA, EMA, RSI, MACD) are computed only after imputation, using the forward-filled values.

Normalization uses expanding-window z-scores to prevent look-ahead bias. During training, each feature is standardized using the mean and standard deviation computed over all available data from the start of the training period up to time t:

x̃_{i,t} = (x_{i,t} − μ_{1:t}) / σ_{1:t}   (3)

where μ_{1:t} and σ_{1:t} are the cumulative mean and standard deviation from the first training observation through time t. This expanding window ensures that normalization at each time step uses only past information. During validation and testing, normalization statistics are fixed at the full training-period values (μ_{1:T_train} and σ_{1:T_train}), ensuring that no information from the evaluation period influences standardization.

III-C Autoencoder for Regime Detection

The autoencoder learns a compressed representation of normal market dynamics. It is trained exclusively on data from stable market periods, defined during training as days where VIX remains below the 75th percentile of its training-period distribution.
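The expanding-window z-score of Eq. (3) can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the `eps` guard against zero variance in the earliest window is an added assumption.

```python
import numpy as np

def expanding_zscore(x, eps=1e-8):
    """Normalize each observation with the mean/std of all values up to and
    including it (Eq. (3)), so no future information leaks into the score."""
    x = np.asarray(x, dtype=float)
    n = np.arange(1, len(x) + 1)
    cum_mean = np.cumsum(x) / n
    # Cumulative (population) variance via E[x^2] - (E[x])^2.
    cum_var = np.cumsum(x ** 2) / n - cum_mean ** 2
    cum_std = np.sqrt(np.maximum(cum_var, 0.0))
    return (x - cum_mean) / (cum_std + eps)

prices = np.array([100.0, 101.0, 99.0, 105.0, 103.0])
z = expanding_zscore(prices)
```

The first value normalizes to zero because its window contains only itself; at evaluation time the paper instead freezes the statistics at their training-period values.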
Figure 3 presents the detailed architecture.

Figure 3: Autoencoder architecture for regime detection. The encoder compresses the input feature vector through hidden layers of 64 and 32 units to a latent representation z_t of dimension d_z = 32. The decoder reconstructs the input through symmetric layers. Reconstruction error e_t serves as the anomaly score for regime classification.

III-C1 Architecture and Training

The encoder maps the concatenated feature vector to a latent representation through two hidden layers:

z_t = f_enc(x_t) = ReLU(W_2 · ReLU(W_1 x_t + b_1) + b_2)   (4)

where x_t ∈ R^{d_in} is the input feature vector, z_t ∈ R^{d_z} is the latent representation with d_z = 32, and W_1 ∈ R^{64×d_in}, W_2 ∈ R^{32×64} are weight matrices. The decoder reconstructs the input through a symmetric architecture:

x̂_t = f_dec(z_t) = W_4 · ReLU(W_3 z_t + b_3) + b_4   (5)

where W_3 ∈ R^{64×32} and W_4 ∈ R^{d_in×64}. The autoencoder is trained to minimize reconstruction loss over the stable-period data:

L_AE = (1/T) Σ_{t=1}^{T} ||x_t − x̂_t||_2^2   (6)

Training uses the Adam optimizer with learning rate 10^{-3} and batch size 64, for a maximum of 20 epochs with early stopping based on validation reconstruction loss.

III-C2 Anomaly Score and Routing

At inference time, the reconstruction error serves as an anomaly score: e_t = ||x_t − x̂_t||_2^2. Data points with e_t exceeding threshold τ are classified as anomalous and routed to the event pathway, while those below τ proceed through the normal pathway. The threshold τ is initialized at the 95th percentile of training-set reconstruction errors and subsequently adjusted by the SAC controller.

III-D Dual Node Transformer Architecture

Two node transformer networks process data depending on regime classification.
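The autoencoder's reconstruction-error scoring and threshold routing described above (Eqs. (4)-(6)) can be sketched as follows. This is a minimal numpy forward pass, not the trained model: randomly initialized weights stand in for learned parameters, and the Gaussian "training" features are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_z = 17, 64, 32  # feature/hidden/latent dims from the paper

# Randomly initialized weights stand in for trained parameters here.
W1, b1 = rng.normal(0, 0.1, (d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(0, 0.1, (d_z, d_hidden)), np.zeros(d_z)
W3, b3 = rng.normal(0, 0.1, (d_hidden, d_z)), np.zeros(d_hidden)
W4, b4 = rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_in)

relu = lambda a: np.maximum(a, 0.0)

def reconstruction_error(x):
    """Eqs. (4)-(5): encode, decode, and score by squared L2 error."""
    z = relu(W2 @ relu(W1 @ x + b1) + b2)   # encoder f_enc
    x_hat = W4 @ relu(W3 @ z + b3) + b4     # decoder f_dec
    return float(np.sum((x - x_hat) ** 2))  # e_t = ||x_t - x_hat_t||^2

# Threshold initialized at the 95th percentile of training-set errors.
train_errors = [reconstruction_error(rng.normal(size=d_in)) for _ in range(200)]
tau = float(np.percentile(train_errors, 95))

x_t = rng.normal(size=d_in)
route = "event" if reconstruction_error(x_t) >= tau else "normal"
```

In the full system the SAC controller subsequently nudges `tau` away from this percentile initialization based on prediction feedback.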
Both follow the same base architectural design (6 layers, 8 attention heads, 512 model dimension) but maintain independent weights trained on different data subsets, and the event pathway accepts a larger input due to additional context features. Figure 4 illustrates the dual pathway structure.

Figure 4: Dual node transformer architecture. The router directs data based on reconstruction error: the normal path handles e_t < τ and the event path handles e_t ≥ τ. The normal pathway processes typical market conditions with base features x_{i,t}. The event pathway augments inputs with event context features c_t (regime embedding, sentiment spike, days to earnings, cross-asset stress). Both pathways follow the same architectural design (layer count, attention heads, model dimension) but maintain independently trained weights and differ in input dimensionality, as the event pathway accepts additional context features. Outputs are blended with adaptive weight α: ŷ = α ŷ^{normal} + (1 − α) ŷ^{event}.

III-D1 Normal Node Transformer

The normal pathway processes typical market conditions using the node transformer architecture [40], which extends standard transformers to graph-structured data by incorporating relational inductive biases into the attention mechanism. The stock market is represented as a graph G = (V, E) with N = 20 stock nodes and a fully-connected edge set.
While graph neural networks are often applied to larger graphs, the N = 20 design balances cross-sectional breadth against temporal depth (252-day sequences per stock), and ablation results confirm that the graph structure contributes a 7% MAPE improvement (Table VII), indicating that cross-sectional dependencies carry predictive value even at this scale.

Each stock i receives a learned embedding s_i ∈ R^{d_s} that captures persistent stock-specific characteristics such as sector behavior and volatility profile. The input representation for stock i at time t concatenates the normalized feature vector with temporal encoding and the stock embedding:

h_{i,t}^{(0)} = [x_{i,t} ‖ TE(t) ‖ s_i] ∈ R^{d_in}   (7)

Temporal encoding follows Vaswani et al. [37], using sinusoidal positional encodings where TE(t, 2k) = sin(t / 10000^{2k/d}) and TE(t, 2k+1) = cos(t / 10000^{2k/d}) for dimension index k and model dimension d = 512. This encoding allows the model to distinguish trading days and capture periodic patterns at multiple frequencies.

Edge weights in the graph are initialized from sector relationships and return correlations computed strictly on training data (1982-2010):

e_{ij}^{(0)} = 0.5 · δ_sector(i, j) + 0.5 · max(0, ρ_{ij}^{train})   (8)

where δ_sector(i, j) = 1 if stocks i and j share the same sector classification and 0 otherwise, and ρ_{ij}^{train} is the Pearson correlation of daily returns computed over the training period only, preventing any leakage from validation or test data. During training, edge weights are refined through a learnable function e_{ij}^{(ℓ+1)} = σ(w_e^T [h_i^{(ℓ)} ‖ h_j^{(ℓ)}] + b_e), where σ is the sigmoid function and h_i^{(ℓ)} is the node representation at layer ℓ. This allows the model to discover relationship patterns not captured by initial sector and correlation priors.
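Eq. (8)'s initialization combines a sector indicator with a clipped training-period correlation; a toy sketch, with hypothetical correlation values chosen for illustration:

```python
def init_edge_weight(same_sector, train_corr):
    """Eq. (8): half structural prior (shared sector classification), half
    data-driven prior (positive training-period return correlation)."""
    return 0.5 * float(same_sector) + 0.5 * max(0.0, train_corr)

# Same-sector pair with correlated training returns (values hypothetical).
w_same = init_edge_weight(same_sector=True, train_corr=0.56)
# Cross-sector pair: a negative correlation is clipped to zero.
w_cross = init_edge_weight(same_sector=False, train_corr=-0.10)
```

Clipping negative correlations to zero keeps initial edge weights non-negative, so the learnable refinement starts from a prior that only encodes affinity, not repulsion.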
Figure 5 illustrates the resulting graph structure with representative edge weights.

Figure 5: Graph representation of stock relationships (representative subset of 11 stocks shown for clarity; full graph contains all 20 stocks). Nodes represent individual stocks (AAPL, MSFT, CRM, JPM, JNJ, UNH, PFE, XOM, CVX, KO, PG), colored by sector (Technology, Financial, Healthcare, Energy, Consumer). Solid edges indicate same-sector connections with higher learned weights (annotated values such as 0.78, 0.65, and 0.82 show correlation-based initialization from training data). Dashed edges represent weaker cross-sector correlations that are learned during training.

At each layer, the node transformer applies multi-head self-attention with causal masking to jointly process all stocks across the temporal dimension. The input representations are projected into queries, keys, and values through learned linear transformations Q = XW^Q, K = XW^K, V = XW^V, and the attention output is computed as:

A = softmax(QK^T / √d_k + M + E) V   (9)

where d_k = 64 is the key dimension, M is the causal mask with M_ab = −∞ if a < b and M_ab = 0 otherwise, and E ∈ R^{N×N} is the learned edge weight matrix. The additive graph bias allows content-based attention (via QK^T) and structural priors (via E) to jointly determine how information flows between stocks at each layer, while the causal mask ensures that predictions at time t use only information from times up to and including t.

The architecture uses H = 8 attention heads, each operating in 64 dimensions, yielding a total model dimension of d_model = 512. Each transformer layer follows the standard pre-norm residual pattern. The multi-head attention output is added to the input through a residual connection, followed by layer normalization. The normalized output then passes through a position-wise feed-forward network consisting of two linear transformations with a ReLU activation, expanding the representation to d_ff = 2048 dimensions before projecting back to 512.
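The masked, graph-biased attention of Eq. (9) can be sketched in a single head. This illustration collapses the joint stock-time structure into one sequence axis; the tiny shapes, the random projections, and the neutral (all-zero) edge bias are simplifying assumptions, not the paper's configuration.

```python
import numpy as np

def causal_graph_attention(X, Wq, Wk, Wv, E):
    """Single-head sketch of Eq. (9): scaled dot-product attention with an
    additive causal mask M and an additive edge-bias matrix E."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + E
    # Causal mask M: position a may not attend to any later position b > a.
    n = X.shape[0]
    scores = scores + np.triu(np.full((n, n), -np.inf), k=1)
    # Row-wise softmax (max-subtracted for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
n, d = 5, 8
X = rng.normal(size=(n, d))
E = np.zeros((n, n))  # neutral edge bias, purely for illustration
out, attn = causal_graph_attention(
    X, rng.normal(size=(d, d)), rng.normal(size=(d, d)),
    rng.normal(size=(d, d)), E)
```

Because the mask is additive, masked entries become exp(−∞) = 0 after the softmax, so the first position attends only to itself and every row still sums to one.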
A second residual connection and layer normalization follow the feed-forward block. Dropout at rate 0.1 is applied after both the attention and feed-forward sublayers. The architecture stacks 6 such layers, with the output of the final layer fed into a prediction head consisting of a linear projection from the model dimension to a single scalar price prediction per stock. Figure 6 illustrates the detailed layer structure.

Figure 6: Single transformer layer architecture. Input X^{(ℓ)} passes through multi-head self-attention (8 heads), a residual connection with layer normalization, a feed-forward network (512 → 2048 → 512), and another residual connection with normalization, with dropout 0.1 after each sublayer. The architecture stacks 6 such layers.

III-D2 Event Node Transformer

The event pathway augments the base architecture with additional inputs capturing regime-specific information. The input to the event transformer concatenates the standard feature vector with an event context vector, x_{i,t}^{event} = [x_{i,t} ‖ c_t], where c_t ∈ R^{d_c} with d_c = 12. This vector comprises four groups of features. A learned regime embedding r_t ∈ R^4 maps the current VIX regime (low, medium, or high, determined by training-period VIX terciles) through a trainable embedding layer. A sentiment spike component s_t ∈ R^2 encodes a binary flag and scaled magnitude when daily sentiment exceeds two standard deviations of training-period sentiment. An event characterization component a_t ∈ R^4 captures proximity to scheduled earnings announcements (days-to-announcement, normalized), historical earnings surprise magnitude for the stock, a binary earnings-window indicator, and sector-average surprise.
Finally, a cross-asset stress vector ē_t^cross ∈ ℝ^2 encodes the mean and standard deviation of reconstruction error across all 20 stocks at time t, distinguishing systemic anomalies (high mean error) from idiosyncratic ones (high variance). The full context vector is the concatenation c_t = [r_t ‖ s_t ‖ a_t ‖ ē_t^cross].

Architecturally, the event transformer differs from the normal pathway in two respects beyond its independently trained weights. First, the input projection layer is wider: while the normal transformer's first linear layer maps from d_in dimensions (the concatenation of market features, temporal encoding, and stock embedding), the event transformer maps from d_in + d_c dimensions to accommodate the appended context vector. This wider projection maps back to the shared model dimension of d_model = 512 before entering the first transformer layer, so all subsequent layers (the 6 transformer blocks, feed-forward networks, and prediction head) operate at the same dimensionality as the normal pathway. Second, the event pathway includes a trainable embedding layer that maps the discrete VIX regime label (one of three categories) to the continuous regime embedding r_t ∈ ℝ^4. This embedding layer is an additional learnable component with no counterpart in the normal pathway, adding 3 × 4 = 12 trainable parameters for the three regime categories. The remaining context features (s_t, a_t, ē_t^cross) are continuous values that enter the context vector directly without additional learned transformations.

I-D3 Pathway Blending

Rather than hard routing, predictions from both pathways are blended with an adaptive weight:

ŷ_{i,t+h} = α_t · ŷ_{i,t+h}^normal + (1 − α_t) · ŷ_{i,t+h}^event   (10)

The blending coefficient α_t ∈ [0, 1] is determined by the SAC controller based on the current market state.
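Eq. (10) is a simple convex combination of the two pathway outputs; a minimal sketch (the function name is illustrative, not from the paper):

```python
def blend_predictions(y_normal, y_event, alpha):
    """Soft pathway blending per Eq. (10): alpha = 1 uses only the normal
    pathway, alpha = 0 only the event pathway. In the paper, alpha_t is
    supplied each day by the SAC controller."""
    assert 0.0 <= alpha <= 1.0
    return alpha * y_normal + (1.0 - alpha) * y_event
```

Because the combination is convex, the blended prediction always lies between the two pathway outputs, which is what hedges against regime misclassification.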
During high-confidence normal periods, α_t approaches 1; during clear anomalies, it approaches 0. Intermediate values enable smooth transitions and hedge against misclassification.

I-E Soft Actor-Critic Controller

The SAC controller learns to configure the prediction system by adjusting the autoencoder threshold τ and blending weight α based on observed prediction performance. Although these are only two scalar parameters, the optimization landscape is non-trivial: the reward signal is delayed (prediction errors are observed only after the threshold decision), noisy (financial returns are inherently stochastic), and non-stationary (optimal thresholds shift as market regimes evolve). SAC is well suited to this setting because its entropy regularization prevents premature convergence to fixed threshold values, and its off-policy learning with experience replay enables sample-efficient adaptation from sparse, delayed feedback. Simpler alternatives such as grid search or bandit methods typically assume stationary reward distributions [34] and cannot adapt continuously to shifting regime dynamics. Figure 7 presents the actor-critic network architecture.

Figure 7: Soft Actor-Critic network architecture. The actor network maps the state s_t = [e_t, ē, σ_t, RMSE_{t−1}, DA_{t−1}, α_{t−1}, τ_{t−1}] through two FC(256) + ReLU layers to a Gaussian policy (μ_φ, log σ_φ) over actions a_t = [Δτ, Δα]. Twin critic networks Q_{θ1}(s, a) and Q_{θ2}(s, a), each with two FC(256) + ReLU layers, estimate Q-values; the minimum min(Q_{θ1}, Q_{θ2}) is used to prevent overestimation.

I-E1 Markov Decision Process Formulation

The control problem is formulated as a Markov Decision Process (MDP).
The state s_t comprises:

s_t = [e_t, ē_{t−k:t}, σ_t, RMSE_{t−1}, DA_{t−1}, α_{t−1}, τ_{t−1}]   (11)

including the current reconstruction error, recent error history over the past k = 5 trading days (one week), volatility, previous prediction metrics, and current parameter settings. The action space consists of continuous adjustments a_t = [Δτ, Δα] ∈ [−0.1, 0.1]^2 to the threshold and blending weight, clipped to maintain τ ∈ [e_min, e_max] and α ∈ [0, 1]. The reward signal combines prediction accuracy and stability:

r_t = −RMSE_t − λ_dir · (1 − DA_t) − λ_stable · |Δτ|   (12)

where λ_dir = 0.5 weights directional accuracy and λ_stable = 0.1 penalizes threshold instability to prevent oscillation.

I-E2 SAC Algorithm

SAC maximizes the entropy-regularized objective:

J(π) = Σ_{t=0}^{T} E[ r_t + α_ent ℋ(π(·|s_t)) ]   (13)

where ℋ is the policy entropy and α_ent is the temperature parameter controlling exploration. The actor network π_φ(a|s) outputs a Gaussian distribution over actions, π_φ(a|s) = N(μ_φ(s), σ_φ(s)^2), while two critic networks Q_{θ1}, Q_{θ2} estimate action values. To prevent overestimation, the minimum of both critics is used:

Q(s, a) = min( Q_{θ1}(s, a), Q_{θ2}(s, a) )   (14)

All networks are feed-forward with two hidden layers of 256 units each. Training uses the Adam optimizer with learning rate 3 × 10^−4, soft target updates with τ_soft = 0.005, and a replay buffer of size 10^5.

I-E3 Training Protocol

The SAC controller is trained after the autoencoder and node transformers are pre-trained. Training begins by initializing τ at the 95th percentile of training reconstruction errors and α = 0.5.
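The reward of Eq. (12) and the clipped parameter update can be sketched as follows; function names and the list-style action are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sac_reward(rmse_t, da_t, delta_tau, lam_dir=0.5, lam_stable=0.1):
    """Eq. (12): accuracy term minus directional and stability penalties.
    da_t is directional accuracy in [0, 1]."""
    return -rmse_t - lam_dir * (1.0 - da_t) - lam_stable * abs(delta_tau)

def apply_action(tau, alpha, action, e_min, e_max):
    """Clip the raw action to [-0.1, 0.1]^2, then keep tau in [e_min, e_max]
    and alpha in [0, 1], as described in the text."""
    d_tau, d_alpha = np.clip(action, -0.1, 0.1)
    tau = float(np.clip(tau + d_tau, e_min, e_max))
    alpha = float(np.clip(alpha + d_alpha, 0.0, 1.0))
    return tau, alpha
```

The stability penalty λ_stable·|Δτ| makes large threshold moves costly even when they would marginally improve accuracy, which is what suppresses oscillation.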
At each step, the controller computes predictions using the current τ and α, evaluates them against actual outcomes to obtain the reward signal, updates the SAC networks from collected transitions, and applies the learned action adjustments to both parameters. This loop runs for 50 epochs with 1000 steps per epoch. The temperature α_ent is automatically tuned toward a target entropy of −dim(a), following Haarnoja et al. [17].

I-F Training Pipeline

Figure 8 illustrates the complete multi-stage training pipeline.

Figure 8: Multi-stage training pipeline. Stage 1 (20 epochs) trains the autoencoder on stable market data. Stage 2 (60 epochs) trains both node transformers on their respective data subsets. Stage 3 (50 epochs) trains the SAC controller, learning τ and α, with frozen prediction components. Stage 4 (20 epochs) performs end-to-end fine-tuning with all weights unfrozen.

The complete training pipeline proceeds in four stages. In Stage 1 (20 epochs), the autoencoder is trained on stable-period data where VIX falls below the 75th percentile of its training-period distribution. Stage 2 (60 epochs) trains both node transformers: the normal pathway on data with low reconstruction error (below the 95th percentile), and the event pathway on high-error data augmented with context features. In Stage 3 (50 epochs), autoencoder and node transformer weights are frozen while the SAC controller learns to optimize the threshold and blending parameters. Finally, Stage 4 (20 epochs) unfreezes all components for end-to-end fine-tuning with reduced learning rates. The architecture is modular by design: Stages 1 and 2 produce a fully functional prediction system in which the routing threshold and blending weight remain at their initialization values (τ at the 95th percentile of training-set reconstruction errors, α = 0.5).
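The static initialization used before SAC adaptation (τ at the 95th percentile of training reconstruction errors, hard regime labels from that threshold) can be sketched as follows; the helper names are hypothetical:

```python
import numpy as np

def init_threshold(train_errors, pct=95):
    """Initialize tau at the 95th percentile of training reconstruction
    errors, so the top 5% of errors are treated as anomalous."""
    return float(np.percentile(train_errors, pct))

def route(e_t, tau):
    """Hard regime label of the kind used to split Stage 2 training data:
    normal pathway at or below tau, event pathway above it."""
    return "event" if e_t > tau else "normal"
```

This static rule is exactly what the SAC controller later replaces with a state-dependent, adaptive threshold.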
Stage 3 adds adaptive control on top of this static configuration, allowing the experimental evaluation to quantify the marginal contribution of the SAC controller by comparing the system with and without it. At inference time, all weights, including the SAC policy network, are frozen. The policy produces state-dependent routing decisions through its fixed learned mapping, with no gradient updates or reward computation during the test period.

I-G Loss Functions

The prediction networks minimize a composite loss:

ℒ = λ_1 ℒ_MSE + λ_2 ℒ_DIR + λ_3 ℒ_REG   (15)

where ℒ_MSE = (1/N) Σ_{i,t,h} (y_{i,t+h} − ŷ_{i,t+h})^2 is the mean squared error between predicted and actual prices. The directional loss ℒ_DIR is a binary cross-entropy term that explicitly rewards correct prediction of price movement direction, since minimizing magnitude error alone does not guarantee directional accuracy:

ℒ_DIR = −(1/N) Σ_{i,t,h} [ d_{i,t,h} log p_{i,t,h} + (1 − d_{i,t,h}) log(1 − p_{i,t,h}) ]   (16)

where d_{i,t,h} = 𝕀(y_{i,t+h} > y_{i,t}) is the true direction indicator and p_{i,t,h} is the predicted probability of a price increase. The regularization term ℒ_REG = ‖θ‖_2^2 applies L2 weight decay over all trainable parameters θ, penalizing large weight magnitudes to prevent overfitting. Loss weights are λ_1 = 1.0, λ_2 = 0.5, λ_3 = 10^−4. Table I summarizes all model hyperparameters across the three components.
TABLE I: Model Hyperparameters

Component | Parameter | Value
Autoencoder | Hidden layers | [64, 32]
 | Latent dimension | 32
 | Learning rate | 10^−3
 | Training epochs | 20
Node Transformer | Layers | 6
 | Attention heads | 8
 | Model dimension | 512
 | FFN dimension | 2048
 | Dropout | 0.1
 | Learning rate | 10^−4
 | Input sequence length | 252 days
SAC Controller | Hidden layers | [256, 256]
 | Learning rate | 3 × 10^−4
 | Soft update τ | 0.005
 | Replay buffer | 10^5
 | Training epochs | 50
Stage 4 Fine-tuning | AE learning rate | 10^−4
 | Node Transformer learning rate | 10^−5
 | SAC learning rate | 3 × 10^−5

IV Experiments and Results

IV-A Dataset and Experimental Setup

The dataset comprises two complementary data streams for 20 S&P 500 stocks spanning January 1982 to March 2025. The Financial Market Data (FMD) stream consists of daily OHLCV (open, high, low, close, volume) price data sourced from Yahoo Finance, providing adjusted close prices that account for stock splits and dividends. Each trading day produces a five-dimensional price vector per stock alongside the trading volume, from which 11 additional technical indicators are derived (SMA, EMA, RSI, MACD, returns, log returns, and rolling volatility) as described in Section I. The sentiment stream draws on two datasets. The first is the Market Sentiment Evaluation (MSE) dataset [10], a publicly available corpus of finance-related social media messages annotated by financial experts with sentiment scores in [−1, +1], which serves as ground truth for fine-tuning the BERT sentiment classifier. The second is the Comprehensive Stock Sentiment (CSS) dataset, which was introduced in [2] and was constructed using the X (formerly Twitter) API through systematic searches for posts mentioning the 20 stock tickers, yielding approximately 4.2 million posts covering January 2007 to March 2025.
The fine-tuned BERT model is applied to the CSS corpus to generate daily sentiment scores for each stock, which are then aggregated and fed into the prediction framework as additional input features. Table II lists the complete stock universe. Stocks were selected to span nine distinct sectors, ensuring that the graph structure captures both intra-sector and cross-sector dependencies. The selection also prioritizes variation in market capitalization, trading volume, and volatility characteristics to evaluate robustness across different stock profiles. For companies with IPO dates after 1982 (e.g., Salesforce incorporated 1999, Netflix 2002, Visa 2008), data begins at their first available trading date, and these stocks are included in training only from their listing date onward.

TABLE II: Stock Universe: 20 S&P 500 Constituents Across 9 Sectors

Sector | Stock (Ticker) | Data Start
Technology | Apple (AAPL) | 1982
 | Microsoft (MSFT) | 1986
 | Salesforce (CRM) | 1999
Financial Services | JPMorgan Chase (JPM) | 1982
 | Visa (V) | 2008
Healthcare | Johnson & Johnson (JNJ) | 1982
 | UnitedHealth Group (UNH) | 1984
 | Pfizer (PFE) | 1982
Retail | Walmart (WMT) | 1982
 | Home Depot (HD) | 1982
Energy | ExxonMobil (XOM) | 1982
 | Chevron (CVX) | 1982
Consumer Goods | Procter & Gamble (PG) | 1982
 | Coca-Cola (KO) | 1982
 | Nike (NKE) | 1982
 | McDonald's (MCD) | 1982
Entertainment | Netflix (NFLX) | 2002
Telecommunications | Verizon (VZ) | 1982
Industrials | Boeing (BA) | 1982
 | Caterpillar (CAT) | 1982

Temporal splits maintain strict chronological separation to prevent any leakage of future information into training. The training set spans January 1982 to December 2010 (approximately 70% of the temporal range), encompassing multiple market cycles including the 1987 crash, the dot-com bubble and its collapse, and the 2008 financial crisis. The validation set covers January 2011 to December 2016 (approximately 15%), a period of relatively steady recovery used for hyperparameter tuning and early stopping.
The test set spans January 2017 to March 2025 (approximately 15%), including the 2018 correction, the 2020 COVID crash, and the 2022 market decline, which provide rigorous evaluation under diverse volatility conditions. Since the X platform (formerly Twitter) was founded in 2006, sentiment data covers January 2007 to March 2025. For the 1982-2006 portion of training, sentiment features are set to zero, meaning the model learns to operate with and without sentiment depending on the data period. During the validation and test periods, full sentiment coverage is available.

IV-B Evaluation Metrics

Model performance is assessed using five complementary metrics. The primary metric is Mean Absolute Percentage Error (MAPE), defined as:

MAPE = (100/N) Σ_{i=1}^{N} | (y_i − ŷ_i) / y_i |   (17)

MAPE provides intuitive interpretation as percentage deviation from actual prices. As a complementary measure, Root Mean Squared Error (RMSE) penalizes large errors more heavily due to the squaring operation and is computed in normalized price units, where each stock's prices are z-scored individually to enable fair cross-stock aggregation:

RMSE = √( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2 )   (18)

Directional Accuracy (DA) measures the proportion of correctly predicted price movement directions, which is particularly relevant for trading applications where the sign of the predicted move often matters more than its magnitude:

DA = (100/N) Σ_{i=1}^{N} 𝕀( sign(ŷ_{i,t+h} − y_{i,t}) = sign(y_{i,t+h} − y_{i,t}) )   (19)

Theil's U statistic provides a scale-independent benchmark by comparing forecast error to that of a naive random walk that predicts tomorrow's price as today's price:

U = √( Σ_t (y_{t+1} − ŷ_{t+1})^2 / Σ_t (y_{t+1} − y_t)^2 )   (20)

Values of U < 1 indicate that the model outperforms the naive baseline, making this metric particularly informative for long time series where absolute price level changes can affect percentage-based measures.
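Minimal NumPy sketches of Eqs. (17)-(20), assuming aligned 1-D arrays of actual and predicted prices (function names are illustrative, and the paper applies these per stock and horizon before aggregating):

```python
import numpy as np

def mape(y, y_hat):
    """Eq. (17): mean absolute percentage error, in percent."""
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

def rmse(y, y_hat):
    """Eq. (18): root mean squared error (the paper computes it on
    per-stock z-scored prices)."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def directional_accuracy(y_now, y_future, y_hat_future):
    """Eq. (19): percent of correctly predicted movement directions."""
    return float(100.0 * np.mean(
        np.sign(y_hat_future - y_now) == np.sign(y_future - y_now)))

def theils_u(y, y_hat):
    """Eq. (20): forecast error relative to a naive random walk that
    predicts y_{t+1} = y_t; U < 1 beats the naive baseline."""
    num = np.sum((y[1:] - y_hat[1:]) ** 2)
    den = np.sum((y[1:] - y[:-1]) ** 2)
    return float(np.sqrt(num / den))
```

Note that Theil's U uses the same series for both numerator and denominator, so it is invariant to the overall price level, unlike MAPE.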
Finally, the Confidence Tracking Rate (CTR) captures the proportion of predictions where model confidence (measured as inverse prediction variance across the dual pathway outputs) agrees with actual accuracy:

CTR = (1/(NT)) Σ_{i,t} 𝕀( (conf_{i,t} > c̄) = (|ŷ_{i,t+h} − y_{i,t+h}| < ε̄) )   (21)

where conf_{i,t} is the inverse prediction variance from the two pathways, c̄ is the median confidence, and ε̄ is the median absolute error across all predictions. CTR indicates whether the model "knows when it knows," a property valuable for risk-sensitive downstream applications.

IV-C Baseline Models

Baselines span statistical methods (ARIMA, VAR, MS-VAR [24]), classical machine learning (Random Forest, SVR, XGBoost), deep learning (LSTM, Simple Transformer), multimodal and regime-switching approaches (BERT Sentiment + LSTM, HMM-LSTM), recent time-series transformers (TimesNet [39], PatchTST [29], iTransformer [26]), and the Integrated NodeFormer-BERT model from prior work [2]. To ensure fair comparison, all baselines share the same experimental conditions wherever the model class permits. Every model uses identical temporal splits (training: 1982–2010, validation: 2011–2016, test: 2017–2025), the same expanding-window z-score normalization described in Section I, and the same missing-data imputation strategy. Models capable of multivariate input (Random Forest, SVR, XGBoost, LSTM, Simple Transformer, BERT Sentiment + LSTM, TimesNet, PatchTST, iTransformer, HMM-LSTM, Integrated NodeFormer-BERT) receive the same 17-dimensional feature vector comprising OHLCV data and 11 derived technical indicators. Daily sentiment scores produced by the fine-tuned BERT classifier are appended to the feature set for all multivariate models, so that any advantage from sentiment information is available to baselines as well as to the proposed framework.
ARIMA operates on the univariate closing price series for each stock independently, VAR jointly models the closing prices of all 20 stocks, and MS-VAR extends the VAR specification with Markov-switching regime dynamics over the same joint price series, since these statistical methods are not designed to incorporate arbitrary exogenous feature vectors. The target variable for all models is identical: the closing price at horizon h ∈ {1, 5, 20} trading days ahead. The Simple Transformer baseline uses the same encoder architecture as the proposed node transformer (6 layers, 8 attention heads, 512-dimensional representations) but processes each stock's time series independently without graph structure or inter-stock attention, isolating the contribution of graph-based relational modeling. The BERT Sentiment + LSTM baseline combines the same BERT-derived sentiment scores with a two-layer LSTM through concatenation-based fusion, testing whether the attention-based integration in the proposed architecture provides meaningful improvement over straightforward feature combination. The Integrated NodeFormer-BERT model reproduces our prior work [2] with its published hyperparameters, serving as the primary single-pathway baseline against which architectural additions are measured. PatchTST [29] segments each stock's multivariate input time series into overlapping patches and applies a transformer encoder with self-attention over the patch sequence, capturing local temporal patterns within patches and long-range dependencies across them. Its channel-independent design processes each feature dimension separately before aggregating predictions, which limits its ability to model cross-feature interactions.
iTransformer [26] inverts the conventional transformer architecture by applying self-attention across the variate (feature) dimension rather than the temporal dimension, enabling it to capture dependencies among price, volume, technical indicators, and sentiment features at each time step. TimesNet [39] extends temporal modeling by transforming one-dimensional time series into two-dimensional tensors based on learned multi-periodicity structure, applying inception-based convolution blocks to capture both intra-period and inter-period variation. All three recent time-series transformers process each stock's feature set independently without graph structure or cross-stock attention, isolating the contribution of relational modeling in the proposed framework. The Markov-Switching VAR (MS-VAR) [24] extends the VAR baseline with K = 3 latent regime states governed by a first-order Markov chain, allowing the intercepts and error covariance to vary across regimes while the autoregressive coefficients remain regime-invariant (MSIH specification). Regime transitions are inferred through maximum likelihood estimation of the full joint model, in contrast to our autoencoder-based approach which detects anomalies from reconstruction error without specifying the number or parametric structure of regimes a priori. The HMM-LSTM baseline combines a Hidden Markov Model with K = 3 states for regime detection with three regime-specific two-layer LSTMs, each trained on data assigned to its corresponding regime by the Viterbi decoder. At inference, the HMM identifies the most likely current regime and routes the input to the corresponding LSTM, producing a regime-conditional forecast. This architecture provides the most direct comparison to our framework: it replaces the autoencoder with an HMM for regime detection, the node transformers with LSTMs for prediction, and omits adaptive control entirely, using fixed routing with no blending across pathways.
Each baseline underwent hyperparameter tuning via grid search on validation data, with the search ranges and selected values reported in Table X (Appendix).

IV-D Main Results

Results are reported for two variants of the proposed framework. The full model (AE-NodeFormer + SAC) includes all components and completes all four training stages. The ablated variant (AE-NodeFormer, no SAC) retains the autoencoder and dual node transformers but removes the reinforcement learning controller entirely, skipping Stage 3 of the training pipeline. In this variant, the routing threshold is fixed at τ = e_95, the 95th percentile of training-set reconstruction errors, which is the same initialization used by the full model before SAC adaptation begins. This percentile is a standard choice in anomaly detection, classifying the top 5% of reconstruction errors as anomalous. The blending weight is held constant at α = 0.5, assigning equal contribution to both pathways regardless of market conditions. Comparing the two variants isolates the contribution of adaptive parameter tuning from the architectural benefits of autoencoder routing and dual-pathway specialization. Table III presents 1-day ahead closing price prediction results across all baselines and proposed variants.

TABLE III: 1-Day Ahead Closing Price Prediction Results. Best Results in Bold.

Model | MAPE | RMSE | DA | Theil's U | CTR
ARIMA [6] | 1.20% | 1.35 | 55% | 0.98 | 51%
VAR [33] | 1.10% | 1.30 | 56% | 0.95 | 52%
MS-VAR [24] | 1.02% | 1.22 | 57% | 0.90 | 53%
Random Forest [7] | 1.10% | 1.25 | 57% | 0.92 | 53%
SVR [36] | 1.20% | 1.40 | 54% | 1.02 | 50%
XGBoost [8] | 1.00% | 1.15 | 59% | 0.85 | 55%
LSTM [21] | 1.00% | 1.20 | 58% | 0.88 | 54%
Simple Transformer [37] | 0.90% | 1.10 | 61% | 0.80 | 57%
BERT Sent. + LSTM [12] | 0.90% | 1.05 | 62% | 0.78 | 58%
HMM-LSTM [19] | 0.87% | 1.02 | 64% | 0.76 | 60%
TimesNet [39] | 0.85% | 1.00 | 63% | 0.75 | 59%
PatchTST [29] | 0.83% | 0.98 | 64% | 0.74 | 59%
iTransformer [26] | 0.82% | 0.97 | 64% | 0.73 | 61%
Integrated NF-BERT [2] | 0.80% | 0.95 | 65% | 0.72 | 62%
AE-NodeFormer (no SAC) | 0.68% | 0.88 | 69% | 0.68 | 64%
AE-NodeFormer + SAC | 0.59% | 0.82 | 72% | 0.64 | 67%

The proposed AE-NodeFormer + SAC achieves 0.59% MAPE, representing a 26% relative improvement over the Integrated NodeFormer-BERT baseline (0.80%) and a 28% improvement over iTransformer (0.82%), the strongest recent time-series transformer. Directional accuracy reaches 72%, a 7 percentage point gain over the graph-based baseline. Among regime-switching approaches, HMM-LSTM achieves 0.87% MAPE, outperforming the basic LSTM (1.00%) by 13% through regime-specific specialization, yet still trailing the proposed model by 32%, indicating that the combination of autoencoder-based anomaly detection, graph-aware dual pathways, and adaptive control provides substantially greater benefit than parametric regime detection with independent LSTMs. The recent time-series transformers (TimesNet 0.85%, PatchTST 0.83%, iTransformer 0.82%) cluster near the Integrated NodeFormer-BERT (0.80%), confirming that the prior single-pathway architecture was already competitive with current state-of-the-art forecasting models and that the improvements in the present work stem from the regime-aware architectural innovations rather than from a weak baseline. All pairwise improvements of the proposed model over iTransformer, PatchTST, and HMM-LSTM are statistically significant (Diebold-Mariano test [14], p < 0.001 in each case). To contextualize the directional accuracy, we computed a naive long-only baseline: predicting "up" for every day. Over the 2017-2025 test period, this naive strategy achieves 54% DA on average across the 20 stocks, reflecting the slight upward drift in equity markets.
The 72% DA of the proposed model thus represents an 18 percentage point improvement over this trivial baseline, confirming that the model captures predictive signal beyond simple market drift. To assess generalization across forecasting horizons, Table IV and Table V present 5-day and 20-day ahead closing price results.

TABLE IV: 5-Day Ahead Closing Price Prediction Results

Model | MAPE | RMSE | DA | Theil's U | CTR
ARIMA [6] | 2.05% | 2.30 | 51% | 1.05 | 47%
VAR [33] | 1.88% | 2.10 | 52% | 1.00 | 48%
MS-VAR [24] | 1.70% | 1.90 | 53% | 0.95 | 49%
Random Forest [7] | 1.92% | 2.15 | 52% | 1.02 | 48%
SVR [36] | 2.10% | 2.35 | 50% | 1.08 | 46%
XGBoost [8] | 1.68% | 1.88 | 54% | 0.92 | 50%
LSTM [21] | 1.65% | 1.85 | 54% | 0.93 | 50%
Simple Transformer [37] | 1.50% | 1.68 | 56% | 0.85 | 53%
BERT Sent. + LSTM [12] | 1.48% | 1.65 | 57% | 0.83 | 54%
HMM-LSTM [19] | 1.45% | 1.60 | 59% | 0.82 | 55%
TimesNet [39] | 1.40% | 1.55 | 58% | 0.81 | 56%
PatchTST [29] | 1.38% | 1.52 | 59% | 0.79 | 55%
iTransformer [26] | 1.35% | 1.50 | 59% | 0.80 | 57%
Integrated NF-BERT [2] | 1.30% | 1.45 | 61% | 0.78 | 58%
AE-NodeFormer (no SAC) | 1.15% | 1.32 | 64% | 0.74 | 60%
AE-NodeFormer + SAC | 1.05% | 1.25 | 67% | 0.70 | 63%

TABLE V: 20-Day Ahead Closing Price Prediction Results

Model | MAPE | RMSE | DA | Theil's U | CTR
ARIMA [6] | 3.10% | 3.45 | 48% | 1.12 | 44%
VAR [33] | 2.85% | 3.20 | 49% | 1.05 | 45%
MS-VAR [24] | 2.55% | 2.85 | 50% | 0.98 | 46%
Random Forest [7] | 2.90% | 3.25 | 49% | 1.08 | 45%
SVR [36] | 3.20% | 3.55 | 47% | 1.15 | 43%
XGBoost [8] | 2.60% | 2.90 | 51% | 0.98 | 47%
LSTM [21] | 2.50% | 2.80 | 51% | 0.96 | 47%
Simple Transformer [37] | 2.25% | 2.52 | 53% | 0.90 | 50%
BERT Sent. + LSTM [12] | 2.20% | 2.45 | 54% | 0.88 | 51%
HMM-LSTM [19] | 2.12% | 2.35 | 55% | 0.87 | 52%
TimesNet [39] | 2.08% | 2.30 | 56% | 0.85 | 52%
PatchTST [29] | 2.05% | 2.28 | 55% | 0.84 | 53%
iTransformer [26] | 2.00% | 2.22 | 56% | 0.86 | 53%
Integrated NF-BERT [2] | 1.90% | 2.15 | 57% | 0.85 | 54%
AE-NodeFormer (no SAC) | 1.70% | 2.00 | 60% | 0.82 | 56%
AE-NodeFormer + SAC | 1.55% | 1.85 | 63% | 0.78 | 59%

Performance improvements persist across all prediction horizons.
At the 5-day horizon, the proposed model achieves 1.05% MAPE compared to 1.30% for the Integrated NodeFormer-BERT and 1.35% for iTransformer, maintaining a 19% and 22% relative advantage respectively. At 20 days, these gaps widen further: the proposed model reaches 1.55% MAPE versus 1.90% for the graph-based baseline and 2.00% for iTransformer, reflecting the increasing value of regime-aware routing as the prediction horizon extends and structural regime shifts become more consequential. Several statistical and classical ML baselines (ARIMA, VAR, Random Forest, SVR) produce Theil’s U values exceeding 1.0 at the 20-day horizon, indicating that they underperform the naive random walk at longer horizons—a well-known limitation of models without explicit temporal or regime-adaptive structure. In contrast, all transformer-based and regime-switching models maintain Theil’s U below 1.0 across all horizons. Directional accuracy for the proposed model declines from 72% at 1-day to 63% at 20-day, a more gradual degradation than iTransformer (64% to 56%) or HMM-LSTM (64% to 55%), suggesting that the combination of autoencoder routing and SAC adaptation captures structural signals that remain informative beyond short-term momentum. IV-E Per-Stock Results To examine cross-sectional variation, Table VI presents 1-day ahead closing price results for all 20 stocks in the universe, grouped by sector. 
TABLE VI: Per-Stock 1-Day Ahead Closing Price Results (AE-NodeFormer + SAC)

Sector | Stock | MAPE | RMSE | DA | Theil's U
Technology | AAPL | 0.62% | 0.88 | 69% | 0.68
 | MSFT | 0.50% | 0.72 | 74% | 0.60
 | CRM | 0.63% | 0.89 | 70% | 0.67
Financial Services | JPM | 0.70% | 1.02 | 67% | 0.72
 | V | 0.48% | 0.69 | 76% | 0.58
Healthcare | JNJ | 0.44% | 0.64 | 77% | 0.55
 | UNH | 0.52% | 0.74 | 73% | 0.61
 | PFE | 0.58% | 0.80 | 72% | 0.64
Retail | WMT | 0.42% | 0.62 | 78% | 0.54
 | HD | 0.48% | 0.68 | 75% | 0.59
Energy | XOM | 1.10% | 1.38 | 63% | 0.82
 | CVX | 0.95% | 1.22 | 64% | 0.79
Consumer Goods | PG | 0.43% | 0.62 | 77% | 0.55
 | KO | 0.44% | 0.65 | 76% | 0.56
 | NKE | 0.56% | 0.73 | 72% | 0.63
 | MCD | 0.45% | 0.65 | 77% | 0.56
Entertainment | NFLX | 0.75% | 1.05 | 66% | 0.75
Telecommunications | VZ | 0.47% | 0.67 | 75% | 0.58
Industrials | BA | 0.68% | 0.94 | 68% | 0.71
 | CAT | 0.60% | 0.81 | 71% | 0.66
Mean | | 0.59% | 0.82 | 72% | 0.64

Individual stock performance spans from 0.42% MAPE (WMT) to 1.10% (XOM), with 16 of 20 stocks falling below 0.70%. The error distribution aligns with established differences in equity predictability. Defensive stocks with stable revenue profiles, namely WMT (0.42%), PG (0.43%), JNJ (0.44%), KO (0.44%), and MCD (0.45%), cluster at the low end regardless of sector classification, suggesting that the predictability advantage stems from fundamental business stability rather than broad sectoral factors. Energy stocks occupy the highest error positions, with both XOM (1.10%) and CVX (0.95%) exhibiting MAPE values roughly double the universe mean, consistent with the dominant influence of exogenous commodity price movements that the autoencoder's feature-based reconstruction cannot fully anticipate. Within sectors, meaningful variation persists: in healthcare, JNJ (0.44%) substantially outperforms PFE (0.58%), plausibly reflecting Pfizer's heightened pipeline-driven volatility during the test period; in technology, MSFT (0.50%) outperforms both AAPL (0.62%) and CRM (0.63%), consistent with differences in revenue diversification and product-cycle exposure.
Financial services exhibit a similar spread, where Visa's stable payment-processing model yields considerably lower error (0.48%) than JPMorgan's sensitivity to interest rate and credit dynamics (0.70%). Although the sample of two to four stocks per sector does not support formal statistical claims about sectoral predictability, the consistency of observed patterns, with both energy stocks at the top of the error distribution and five defensive consumer and healthcare names clustered near the bottom, suggests that stock-level characteristics such as earnings stability, commodity exposure, and idiosyncratic volatility interact with the regime detection mechanism in interpretable ways. Theil's U remains below 1.0 for all 20 stocks without exception, confirming that the model outperforms the naive random-walk baseline across the full predictability spectrum. Directional accuracy ranges from 63% (XOM) to 78% (WMT), with every stock exceeding the 54% naive long-only baseline reported in the main results, indicating that the regime-aware routing mechanism provides meaningful predictive signal even for the most volatile equities in the universe.

IV-F Ablation Study

To quantify the contribution of each architectural component, Table VII reports results from systematically removing one component at a time, with all other components held constant or adapted to the reduced architecture as described below. The No SAC configuration removes the reinforcement learning controller entirely, skipping Stage 3 of the training pipeline. The autoencoder and dual node transformers retain their Stage 1 and Stage 2 trained weights. The routing threshold is fixed at τ = e_95, the 95th percentile of training-set reconstruction errors, and the blending weight is held at α = 0.5, assigning equal contribution to both pathways regardless of market conditions.
This variant isolates the benefit of adaptive parameter tuning from the architectural contributions of regime-aware routing and pathway specialization. The No Dual Paths configuration replaces the two specialized node transformers with a single node transformer that processes all data regardless of regime classification. The single pathway retains the same architectural hyperparameters as each individual pathway in the full model (6 layers, 8 attention heads, 512 model dimension), so that any performance difference reflects the architectural choice of pathway specialization rather than a difference in model capacity. The autoencoder is retained, and its reconstruction error e_t together with a binary regime indicator (determined by threshold τ) are concatenated to the single pathway's input feature vector, providing regime context without architectural separation. The SAC controller remains active but operates over a reduced action space: it adjusts only τ to optimize the anomaly detection threshold, since the blending weight α is undefined when a single pathway produces the output. This variant quantifies the value of allocating independent representational capacity to normal and anomalous conditions, as opposed to conditioning a shared pathway on a regime signal. The No AE configuration removes the autoencoder, which cascades into removing dual-pathway routing and the SAC controller, since both depend on the reconstruction error that the autoencoder produces. The resulting system is a single node transformer processing the standard feature set augmented with BERT sentiment scores, architecturally equivalent to the Integrated NodeFormer-BERT baseline from prior work [2]. This configuration serves as the reference point from which the incremental contributions of regime-aware routing, pathway specialization, and adaptive control are jointly measured.
TABLE VII: Ablation Study: 1-Day MAPE

Configuration                            MAPE     Δ vs Full
Full Model (AE + Dual NF + SAC)          0.59%    –
No SAC (AE + Dual NF)                    0.68%    +15.3%
No Dual Paths (AE + Single NF + SAC)     0.63%    +6.8%
No AE (Single NF + BERT, baseline)       0.80%    +35.6%

Autoencoder routing contributes most substantially, with its removal increasing MAPE by 35.6% in relative terms. This large degradation reflects the fact that the autoencoder provides the foundational regime signal upon which both pathway routing and adaptive control depend; its removal eliminates the entire regime-aware processing chain. The SAC controller contributes the next largest improvement at 15.3%, confirming that adaptive tuning of τ and α based on prediction feedback materially outperforms static initialization, particularly during regime transitions where the optimal threshold and blending weight shift over time. The dual-pathway architecture contributes a smaller but meaningful 6.8% improvement, indicating that allocating independent weights to normal and anomalous conditions yields better representations than conditioning a single pathway on a binary regime indicator, even when both configurations receive the same regime information from the autoencoder. All components provide statistically significant gains (p < 0.01 via paired t-tests across stock-day predictions).

IV-G Volatility Regime Analysis

To evaluate regime-specific performance, Table VIII disaggregates MAPE by VIX regime.

TABLE VIII: 1-Day MAPE by Volatility Regime

Model                          Low VIX   Medium VIX   High VIX
iTransformer                   0.72%     0.92%        1.42%
HMM-LSTM                       0.78%     0.95%        1.35%
Integrated NodeFormer-BERT     0.70%     0.90%        1.50%
AE-NodeFormer (no SAC)         0.60%     0.75%        1.10%
AE-NodeFormer + SAC            0.52%     0.65%        0.85%

The regime-specific results reveal an instructive pattern.
During low-VIX periods, iTransformer (0.72%) slightly outperforms HMM-LSTM (0.78%), as the superior representational capacity of the transformer architecture dominates when market dynamics are stable and regime detection adds limited value. During high-VIX periods, this relationship reverses: HMM-LSTM (1.35%) outperforms iTransformer (1.42%) because its regime-specific LSTMs adapt to volatile conditions even though its overall architecture is less expressive. The proposed model outperforms both across all VIX levels, maintaining MAPE at 0.85% during high-volatility periods where iTransformer reaches 1.42% and the Integrated NodeFormer-BERT baseline reaches 1.50%. The 40% relative improvement over iTransformer in high-VIX conditions, compared to 28% overall, confirms that regime-aware processing is most valuable in precisely the conditions where accurate forecasts matter most for risk management.

IV-H SAC Controller Behavior

Because the threshold τ and blending weight α vary across the test period (Figure 9), it is important to specify precisely what occurs at inference time and to address the resulting implications for baseline comparability. After Stage 4 training completes, all model weights are frozen, including the SAC actor and critic networks. During testing, the actor network operates as a fixed deterministic function: given the current state s_t, it outputs [Δτ, Δα] through a single forward pass with no gradient computation, no reward evaluation, and no parameter update. The threshold and blending weight evolve across the test period because the inputs to this fixed function change—reconstruction errors rise during volatile markets, recent prediction metrics shift—not because the policy itself is modified. In this respect, the SAC policy at inference time is functionally equivalent to any other feedforward neural network applied to streaming data: its parameters are static, but its outputs depend on input features that vary over time.
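A minimal sketch of this inference-time loop, assuming the actor is a deterministic mapping from state to [Δτ, Δα] and that both parameters are clipped to valid ranges (the bound values and names here are illustrative, not taken from the paper):

```python
def run_frozen_policy(actor, states, tau0=0.7, alpha0=0.5,
                      tau_bounds=(0.1, 0.95), alpha_bounds=(0.0, 1.0)):
    """Apply a FROZEN policy to a stream of states (illustrative sketch).

    `actor` is a fixed deterministic function state -> (d_tau, d_alpha).
    No rewards are evaluated and no weights change, yet tau and alpha
    drift over the test period because the state features (reconstruction
    errors, recent prediction metrics) vary day to day.
    """
    tau, alpha = tau0, alpha0
    trajectory = []
    for s in states:
        d_tau, d_alpha = actor(s)                       # one forward pass
        tau = min(max(tau + d_tau, tau_bounds[0]), tau_bounds[1])
        alpha = min(max(alpha + d_alpha, alpha_bounds[0]), alpha_bounds[1])
        trajectory.append((tau, alpha))
    return trajectory
```

Feeding the same frozen `actor` a calm state sequence versus a volatile one produces different threshold trajectories, which is the sense in which the output adapts while the policy does not.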
This state-dependent inference behavior is not unique to the proposed framework. The HMM-LSTM baseline performs analogous adaptive routing at test time: at each step, the HMM’s forward algorithm computes regime posterior probabilities using fixed transition and emission parameters learned during training, and these probabilities determine which regime-specific LSTM produces the forecast. The MS-VAR baseline similarly infers time-varying regime probabilities through its fixed Markov-switching parameters, adjusting intercepts and error covariance accordingly. In all three cases—the proposed SAC policy, the HMM, and the Markov-switching model—a frozen statistical or neural model produces time-varying routing decisions from fixed parameters applied to changing inputs. The proposed framework thus does not enjoy an online learning advantage relative to the regime-switching baselines; rather, the three approaches represent alternative designs for the same underlying capability of state-dependent inference-time adaptation. A separate question is whether state-dependent routing confers an advantage over the purely static baselines (ARIMA, Random Forest, LSTM, Simple Transformer, and others) that apply fixed parameters uniformly across all market conditions. It does, and this advantage is by design: the central thesis of this work is that regime-aware processing improves prediction quality. Crucially, the ablation study (Table VII) demonstrates that this advantage does not depend on the SAC controller. The AE-NodeFormer variant without SAC uses entirely static routing parameters (τ = e_95, α = 0.5) and achieves 0.68% MAPE, which already outperforms every baseline including iTransformer (0.82%) and the Integrated NodeFormer-BERT (0.80%).
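For comparison, one step of the HMM forward (filtering) recursion underlying the HMM-LSTM baseline can be sketched as follows; this is textbook filtering with fixed parameters, not the authors' implementation:

```python
def hmm_forward_step(prev_posterior, trans, emission_lik):
    """One step of the HMM forward recursion with FIXED parameters.

    The regime posterior changes each day, but the transition matrix and
    emission model are frozen after training — the same fixed-parameters,
    time-varying-output pattern as the frozen SAC policy.

    prev_posterior[i]: P(regime i | observations up to t-1)
    trans[i][j]:       P(regime j at t | regime i at t-1)
    emission_lik[j]:   likelihood of today's observation under regime j
    """
    k = len(prev_posterior)
    # predict step: propagate yesterday's posterior through the transitions
    predicted = [sum(prev_posterior[i] * trans[i][j] for i in range(k))
                 for j in range(k)]
    # update step: weight by today's emission likelihoods and normalize
    unnorm = [predicted[j] * emission_lik[j] for j in range(k)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

A volatile observation (high likelihood under the turbulent regime) pulls the posterior toward that regime even when yesterday's posterior strongly favored the calm one, which is what routes the forecast to the corresponding regime-specific LSTM.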
The architectural contributions of autoencoder-based regime detection and dual-pathway specialization account for the majority of the improvement, with the SAC controller providing an additional 15.3% relative gain through its state-dependent refinement of routing decisions. The comparison between the static No SAC variant and the baselines is therefore on equal footing with respect to inference-time adaptation. Regarding the state features themselves, the previous-day prediction metrics (RMSE_{t−1}, DA_{t−1}) included in the SAC state vector are computed by comparing the model’s day-(t−1) forecast with the realized closing price, which is publicly available at market open on day t. This introduces no information leakage: any practitioner would know whether yesterday’s prediction was accurate before making today’s forecast. The remaining state features—the current reconstruction error e_t, the recent error history ē_{t−k:t}, and market volatility σ_t—are computed entirely from the model’s own outputs and observable market data, with no access to future prices. Figure 9 illustrates the threshold trajectory produced by the frozen policy. During stable periods, the policy maps low reconstruction errors to higher thresholds, routing most data through the normal pathway. When volatility increases and reconstruction errors rise, the same fixed policy maps these elevated states to lower thresholds, directing more data to the event pathway. The blending weight α follows a complementary pattern, reducing normal pathway contribution during detected anomalies.

Figure 9: Threshold τ produced by the frozen SAC policy over the test period (2017–2025).
Dips correspond to volatile periods (COVID crash around index 35, 2022 market decline around index 75) where elevated reconstruction errors cause the fixed policy to output lower threshold values, routing more data through the event pathway. The dashed red line shows the static threshold (τ = e_95) used in the No SAC ablation variant. All policy weights are frozen after training; the trajectory reflects state-dependent outputs from a fixed function, not online learning. The threshold trajectory reveals interpretable behavior. During the pre-COVID period (2017-2019), the frozen policy maps the prevailing low reconstruction errors to a relatively high threshold (τ ≈ 0.70–0.75), routing the majority of data through the normal pathway. As the COVID crash unfolds in early 2020, the spike in reconstruction errors causes the policy to output large negative Δτ adjustments, dropping the threshold sharply to approximately 0.35 and activating the event pathway for most stocks. Recovery is gradual: as reconstruction errors slowly normalize, the policy produces small positive adjustments that return the threshold to pre-crisis levels over several months rather than snapping back, reflecting the state-to-action mapping learned during training between persistent elevated errors and cautious threshold recovery. A similar but less severe pattern occurs during the 2022 market decline. Importantly, the policy’s mapping was learned entirely from training-period data; no crisis labels, test-period supervision, or weight updates inform the threshold trajectory shown in the figure. The frozen policy generalizes its learned associations between reconstruction error patterns and routing decisions to market events it has never encountered.

IV-I Statistical Significance

To confirm that the observed improvements are not attributable to chance, Table IX presents paired t-test results comparing daily squared errors.
TABLE IX: Statistical Significance (n = 1,580 Test Days)

Comparison                          t-statistic   p-value    Cohen’s d
AE-NF+SAC vs Integrated NF-BERT     −5.82         <0.0001    0.46
AE-NF+SAC vs AE-NF (no SAC)         −3.45         0.0006     0.27
AE-NF+SAC vs LSTM                   −7.21         <0.0001    0.57

All comparisons achieve p < 0.001, with effect sizes (Cohen’s d) ranging from 0.27 to 0.57, indicating small-to-medium practical significance. The largest effect size (0.57) appears in the comparison against LSTM, which lacks both graph structure and regime awareness. The smallest effect size (0.27) is between the full model and the non-adaptive AE-NodeFormer variant, consistent with the SAC controller’s contribution being a refinement over an already strong base architecture rather than a wholesale improvement.

V Discussion

V-A Interpretation of Results

Regime-aware prediction with adaptive control outperforms homogeneous approaches across all metrics. The 26% MAPE improvement over the baseline integrated model (0.59% vs. 0.80%) reflects gains from three sources: regime detection (autoencoder), specialized processing (dual node transformers), and adaptive parameter tuning (SAC controller). Without requiring explicit per-sample anomaly labels, the autoencoder identifies market states that deviate from normal patterns. High reconstruction errors coincide with periods of elevated volatility, earnings announcements, and macroeconomic shocks, allowing the system to detect anomalies as deviations from its learned representation of typical behavior. The SAC controller learns to adjust the detection threshold based on prediction outcomes, maintaining a higher threshold during stable periods to prevent unnecessary routing to the event pathway and lowering it during genuine regime shifts to engage event-aware processing. This adaptive behavior emerges from the reward signal rather than hand-crafted rules.
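The paired statistics in Table IX can be reproduced from per-day error series with a short helper; this is a standard computation, with variable names assumed. The p-values then follow from the t-distribution with n−1 degrees of freedom, which the Python standard library does not provide, so the sketch returns only the t-statistic and Cohen's d:

```python
import math
import statistics

def paired_t_and_cohens_d(errors_a, errors_b):
    """Paired comparison of two models' daily errors (sketch).

    Computes the per-day differences d_t = errors_a[t] - errors_b[t],
    then the paired t-statistic mean(d) / (sd(d) / sqrt(n)) and Cohen's
    d = mean(d) / sd(d). A negative t means model A has lower error.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)          # sample std dev of differences
    t_stat = mean_d / (sd / math.sqrt(n))
    cohens_d = mean_d / sd
    return t_stat, cohens_d
```

Table IX reports effect sizes as magnitudes; the sign convention here simply encodes which model is better.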
The dual-path architecture enables specialization: the normal pathway develops representations optimized for stable conditions where fundamental factors dominate, while the event pathway incorporates additional context (sentiment spikes, volatility regimes, event characterization) that proves informative during turbulent periods but might add noise during normal conditions.

V-B Economic Interpretation

The performance improvements carry practical implications, though they should be interpreted cautiously. A 7 percentage point improvement in directional accuracy (72% vs. 65%) is material for trading decisions, and the 43% relative improvement during high-volatility periods is particularly valuable, since these are precisely the conditions where accurate forecasts matter most for risk management. Transaction costs, market impact, and execution constraints would, however, reduce realized gains from any trading strategy based on these predictions. The economic significance analysis in prior work [2] demonstrated that even the baseline model’s predictions, when combined with simple trading rules, generate returns exceeding buy-and-hold benchmarks before costs. The improvements documented here should amplify these returns proportionally, though the gap would narrow after incorporating realistic trading frictions.

V-C Limitations

The most consequential limitation concerns data selection. The 20 stocks used in this study are drawn from the current S&P 500 universe, which introduces survivorship bias: companies that failed, were acquired, or delisted between 1982 and 2025 are absent from the dataset. Because the selected stocks are disproportionately successful over the evaluation period, predictability estimates may be inflated relative to what a real-time investor would experience when choosing from the full market.
Performance on a point-in-time universe constructed from historical index constituents could differ, and the reported results should therefore be interpreted as evidence of architectural capability rather than guaranteed trading performance. Several data-related concerns compound this issue. Edge weights in the graph are initialized from correlations computed over the training period (1982-2010), but market correlations are non-stationary, and relationships that held during this window may weaken or reverse by the test period (2017-2025). The learnable edge refinement mechanism partially compensates by adapting weights during training, though fundamental correlation regime shifts could still affect generalization. Sentiment data from X (formerly Twitter) is available only from 2007 onward; for the 1982-2006 portion of training, sentiment features are set to zero. This means the sentiment modality effectively “turns on” partway through training, potentially limiting the model’s ability to learn robust sentiment-price relationships from the earlier decades. On the modeling side, the reinforcement learning controller requires careful tuning of reward weights, temperature, and network architecture, and suboptimal configurations could lead to unstable threshold behavior or poor convergence. The system also learns regime boundaries from its own prediction performance, which risks feedback loops where poor initial predictions lead to suboptimal threshold learning; the staged training protocol mitigates this by pre-training each component before SAC optimization, but the circularity is not fully eliminated. The three-component architecture increases computational requirements, with training time approximately 33% longer than the baseline model, which may constrain real-time deployment. 
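A minimal sketch of the correlation-based edge initialization described above, assuming edges are weighted by the Pearson correlation of training-window return series (the paper's exact edge construction and any thresholding may differ):

```python
import math

def pearson_corr(x, y):
    """Plain Pearson correlation between two equal-length return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def edge_weights(returns_by_stock):
    """Initial edge weight for each stock pair = correlation of their
    training-window returns (illustrative). These weights are the
    starting point that the learnable edge refinement then adapts,
    which partially offsets correlation non-stationarity."""
    names = list(returns_by_stock)
    return {(a, b): pearson_corr(returns_by_stock[a], returns_by_stock[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

Because these initial weights are fixed functions of 1982-2010 data, any pair whose relationship reversed by 2017-2025 starts from a misleading prior, which is exactly the non-stationarity concern raised above.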
Although the SAC policy weights are frozen at inference time, the state-dependent routing decisions produced by the fixed policy constitute a form of adaptive processing that purely static baselines (ARIMA, Random Forest, LSTM) do not possess. The regime-switching baselines (HMM-LSTM, MS-VAR) share this property, and the ablation study confirms that the static No SAC variant already outperforms all baselines, but the additional gain from state-dependent threshold tuning should be attributed to this architectural capability rather than counted as a like-for-like accuracy improvement over the static models. From an economic standpoint, the analysis excludes transaction costs, market impact, and execution constraints. For strategies involving frequent rebalancing, these frictions would reduce realized returns. Backtesting is performed on the same 20-stock universe used for model development, leaving external validity on unseen stocks untested. Finally, sentiment data was collected via the X (formerly Twitter) API, which has undergone significant access policy changes, making exact replication of the sentiment component challenging under current API limitations.

V-D Future Directions

Several extensions could address the limitations identified above. Expanding the stock universe to include historical index constituents would mitigate survivorship bias. Automated SAC hyperparameter tuning via meta-learning could reduce configuration sensitivity. More efficient architectures would enable real-time deployment, and strictly online regime detection without training-period VIX percentiles would eliminate any residual look-ahead. Beyond addressing limitations, incorporating multiple autoencoders for different anomaly types (e.g., separating liquidity crises from earnings shocks) could refine regime classification. Extending the framework to portfolio optimization, where regime-aware allocation could improve risk-adjusted returns, represents a natural application.
Transfer learning to adapt the system to new markets or asset classes without full retraining would broaden practical applicability.

VI Conclusion

This paper introduced an adaptive framework for stock price prediction that automatically detects market regimes and adjusts processing accordingly. The architecture combines an autoencoder for regime detection, dual node transformer networks specialized for stable and volatile conditions, and a Soft Actor-Critic reinforcement learning controller that learns adaptive regime thresholds from prediction performance. Experiments on 20 S&P 500 stocks spanning 1982-2025 demonstrate substantial improvements over prior approaches: the complete system achieves 0.59% MAPE for one-day predictions compared to 0.80% for the baseline integrated node transformer, while directional accuracy reaches 72%, a 7 percentage point improvement. These gains persist across prediction horizons and are most pronounced during volatile periods where the baseline struggles. The key conceptual contribution is the adaptive learning of regime boundaries. Rather than relying solely on fixed hand-crafted anomaly definitions, the system discovers useful boundaries by optimizing downstream prediction accuracy. This approach avoids the staleness problem of hand-crafted regime rules and the labeling burden of supervised regime classification. Future work should validate the framework on broader universes, develop more efficient implementations for real-time deployment, and explore extensions to portfolio optimization and risk management applications.

Appendix A Hyperparameter Search Ranges

Table X reports the hyperparameter search space explored for each baseline model. All searches were conducted via grid search on the validation set (2011–2016), with the configuration yielding the lowest validation MAPE selected for test evaluation.
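The selection procedure described here amounts to an exhaustive grid search scored by validation MAPE; a generic sketch with illustrative names (the paper's tooling is not published):

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive validation-set grid search (sketch).

    `param_grid` maps each hyperparameter name to its candidate values;
    `evaluate` should train with a given configuration and return its
    validation MAPE. Every combination is tried and the lowest-MAPE
    configuration is kept for test evaluation.
    """
    keys = list(param_grid)
    best_cfg, best_mape = None, float("inf")
    for values in product(*(param_grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        mape = evaluate(cfg)
        if mape < best_mape:
            best_cfg, best_mape = cfg, mape
    return best_cfg, best_mape
```

For the larger grids in Table X (e.g. XGBoost with eight tuned hyperparameters), the combination count grows multiplicatively, which is why early stopping is used to cap the effective number of estimators.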
TABLE X: Hyperparameter Search Ranges and Selected Values for Baseline Models

ARIMA
  AR order (p): {0, 1, 2, 3, 4, 5} → per stock (AIC)
  Differencing (d): {0, 1, 2} → per stock (AIC)
  MA order (q): {0, 1, 2, 3, 4, 5} → per stock (AIC)
  Selection criterion: {AIC, BIC} → AIC
VAR
  Lag order: {1, 2, 3, …, 10} → 3
  Trend: {none, constant, both} → constant
  Selection criterion: {AIC, BIC} → BIC
Random Forest
  Number of estimators: {100, 200, 500} → 200
  Maximum depth: {5, 10, 15, 20, None} → 15
  Min samples split: {2, 5, 10} → 5
  Min samples leaf: {1, 2, 4} → 2
  Max features: {√d, log₂(d), 0.5} → √d
SVR
  Kernel: RBF (fixed) → RBF
  Regularization (C): {0.1, 1, 10, 100} → 10
  Kernel width (γ): {scale, 0.01, 0.1} → scale
  Epsilon (ε): {0.01, 0.05, 0.1} → 0.05
XGBoost
  Number of estimators: {100, 500, 1000} → 500
  Maximum depth: {3, 5, 7, 10} → 7
  Learning rate: {0.01, 0.05, 0.1} → 0.05
  Subsample ratio: {0.7, 0.8, 0.9, 1.0} → 0.8
  Column sample ratio: {0.7, 0.8, 0.9, 1.0} → 0.8
  Min child weight: {1, 3, 5} → 3
  L1 regularization (α): {0, 0.01, 0.1} → 0.01
  L2 regularization (λ): {1, 1.5, 2} → 1.5
LSTM
  Hidden dimension: {64, 128, 256, 512} → 256
  Number of layers: {1, 2, 3} → 2
  Dropout: {0.1, 0.2, 0.3} → 0.2
  Learning rate: {1e-4, 5e-4, 1e-3} → 5e-4
  Batch size: {32, 64} → 64
  Sequence length: 252 (fixed)
Simple Transformer
  Layers: {4, 6, 8} → 6
  Attention heads: {4, 8} → 8
  Model dimension: {256, 512} → 512
  FFN dimension: {1024, 2048} → 2048
  Dropout: {0.1, 0.2} → 0.1
  Learning rate: {1e-4, 5e-4, 1e-3} → 1e-4
BERT Sentiment + LSTM
  LSTM hidden dimension: {128, 256, 512} → 256
  LSTM layers: {1, 2} → 2
  Sentiment fusion: {concatenation, gated} → concatenation
  Dropout: {0.1, 0.2, 0.3} → 0.2
  Learning rate: {1e-4, 5e-4} → 1e-4
MS-VAR
  Number of regimes (K): {2, 3, 4} → 3
  Lag order: {1, 2, 3, …, 10} → 3
  Switching specification: {MSI, MSM, MSIH} → MSIH
  EM convergence tolerance: {1e-6, 1e-8} → 1e-8
HMM-LSTM
  HMM states (K): {2, 3, 4} → 3
  HMM covariance: {diagonal, full} → diagonal
  LSTM hidden dimension: {128, 256, 512} → 256
  LSTM layers: {1, 2, 3} → 2
  Dropout: {0.1, 0.2, 0.3} → 0.2
  Learning rate: {1e-4, 5e-4, 1e-3} → 5e-4
  Sequence length: 252 (fixed)
TimesNet
  Layers: {2, 3, 4} → 3
  Model dimension: {32, 64, 128} → 64
  Top-k periods: {3, 5, 7} → 5
  FFN dimension: {64, 128, 256} → 128
  Dropout: {0.1, 0.2, 0.3} → 0.2
  Learning rate: {1e-4, 5e-4, 1e-3} → 1e-4
  Sequence length: 252 (fixed)
PatchTST
  Patch length: {12, 16, 24} → 16
  Stride: {8, 12, 16} → 8
  Layers: {3, 4, 6} → 4
  Attention heads: {4, 8} → 8
  Model dimension: {128, 256, 512} → 256
  Dropout: {0.1, 0.2, 0.3} → 0.2
  Learning rate: {1e-4, 5e-4, 1e-3} → 1e-4
iTransformer
  Layers: {3, 4, 6} → 4
  Attention heads: {4, 8} → 8
  Model dimension: {128, 256, 512} → 256
  FFN dimension: {256, 512, 1024} → 512
  Dropout: {0.1, 0.2} → 0.1
  Learning rate: {1e-4, 5e-4, 1e-3} → 1e-4
Integrated NF-BERT
  Architecture and hyperparameters fixed per [2].

For ARIMA, optimal orders vary across stocks because each stock exhibits different autocorrelation and partial autocorrelation structure; the most common selections were (p, d, q) = (2, 1, 2) and (1, 1, 1). XGBoost uses early stopping with a patience of 50 rounds on validation RMSE, so the effective number of estimators is often lower than the specified maximum. The LSTM and Simple Transformer sequence lengths are fixed at 252 trading days (one calendar year) to match the proposed model’s input window, ensuring that differences in performance reflect architectural capacity rather than information asymmetry.
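The per-stock ARIMA order selection by AIC can be sketched as follows; `loglik` is a hypothetical stand-in for a fitted model's log-likelihood (not a real library call), and the parameter count here is a simplified assumption:

```python
import math

def select_arima_order(candidates, loglik):
    """Per-stock ARIMA order selection by AIC (illustrative sketch).

    Each (p, d, q) candidate is fit and scored with AIC = 2k - 2*ln(L),
    where k counts the estimated coefficients (simplified here as
    p + q plus a constant); the lowest-AIC order wins. `loglik` is a
    hypothetical placeholder for the fitted model's log-likelihood.
    """
    def aic(order):
        p, _d, q = order
        k = p + q + 1
        return 2 * k - 2 * loglik(order)
    return min(candidates, key=aic)
```

Because AIC penalizes parameter count, richer orders are selected only when their likelihood gain outweighs 2 per extra coefficient, which is why different stocks settle on different (p, d, q).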
The MS-VAR uses the MSIH specification (Markov-switching intercept and heteroscedasticity), which allows both the intercept and the error variance to switch across regimes while keeping the autoregressive coefficients regime-invariant; this provides sufficient flexibility to capture volatility regime changes without overfitting the transition dynamics. For the HMM-LSTM, regime assignments are determined by the Viterbi path on the training set, and each regime-specific LSTM is trained only on data segments assigned to its corresponding state. PatchTST and iTransformer use the channel-independent and inverted-attention configurations recommended in their respective original publications, with sequence lengths fixed at 252 to match other deep learning baselines.

References

[1] M. Ahmed, A. N. Mahmood, and J. Hu (2016) A survey of network anomaly detection techniques. Journal of Network and Computer Applications 60, pp. 19–31.
[2] M. A. Al Ridhawi, M. Haj Ali, and H. Al Osman (2026) Stock market prediction using node transformer architecture integrated with BERT sentiment analysis. Submitted to IEEE Access. Under review. arXiv:2603.05917.
[3] S. Aminikhanghahi and D. J. Cook (2017) A survey of methods for time series change point detection. Knowledge and Information Systems 51 (2), pp. 339–367.
[4] A. Ang and G. Bekaert (2002) International asset allocation with regime shifts. The Review of Financial Studies 15 (4), pp. 1137–1187.
[5] W. Bao, J. Yue, and Y. Rao (2017) A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLOS ONE 12 (7), p. e0180944.
[6] G. E. P. Box and G. M. Jenkins (1976) Time series analysis: forecasting and control. Holden-Day, San Francisco.
[7] L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
[8] T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
[9] W. Chen, M. Jiang, W. Zhang, and Z. Chen (2021) A novel graph convolutional feature based convolutional neural network for stock trend prediction. Information Sciences 556, pp. 67–94.
[10] K. Cortis, A. Freitas, T. Daudert, M. Huerlimann, M. Zarrouk, S. Handschuh, and B. Davis (2017) SemEval-2017 task 5: fine-grained sentiment analysis on financial microblogs and news. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 519–535.
[11] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28 (3), pp. 653–664.
[12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186.
[13] F. X. Diebold, J. Lee, and G. C. Weinbach (1994) Regime switching with time-varying transition probabilities. Business Cycles: Durations, Dynamics, and Forecasting, pp. 144–165.
[14] F. X. Diebold and R. S. Mariano (1995) Comparing predictive accuracy. Journal of Business & Economic Statistics 13 (3), pp. 253–263.
[15] T. Fischer and C. Krauss (2018) Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270 (2), pp. 654–669.
[16] M. Guidolin and A. Timmermann (2007) Asset allocation under multivariate regime switching. Journal of Economic Dynamics and Control 31 (11), pp. 3503–3544.
[17] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870.
[18] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2019) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
[19] J. D. Hamilton (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica 57 (2), pp. 357–384.
[20] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313 (5786), pp. 504–507.
[21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[22] Z. Jiang, D. Xu, and J. Liang (2017) A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059.
[23] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
[24] H. Krolzig (1997) Markov-switching vector autoregressions: modelling, statistical inference, and application to business cycle analysis. Springer-Verlag, Berlin.
[25] M. Liu, H. Sheng, N. Zhang, et al. (2022) A new deep network model for stock price prediction. In International Conference on Machine Learning for Cyber Security, pp. 413–426.
[26] Y. Liu, T. Hu, H. Zhang, H. Wu, S. Wang, L. Ma, and M. Long (2024) iTransformer: inverted transformers are effective for time series forecasting. In Proceedings of the 12th International Conference on Learning Representations (ICLR).
[27] Z. Liu, J. Liu, Q. Zeng, and L. Wu (2022) VIX and stock market volatility predictability: a new approach. Finance Research Letters 48, p. 102887.
[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
[29] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam (2023) A time series is worth 64 words: long-term forecasting with transformers. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[30] Z. Ning, P. Dong, X. Wang, X. Hu, L. Guo, B. Hu, R. Y. K. Kwok, and V. C. M. Leung (2021) A double deep Q-learning model for energy-efficient edge scheduling. IEEE Transactions on Services Computing 14 (5), pp. 1555–1566.
[31] P. Nystrup, H. Madsen, and E. Lindström (2017) Regime-based versus static asset allocation: letting the data speak. The Journal of Portfolio Management 44 (1), pp. 103–115.
[32] A. Pumsirirat and L. Yan (2018) Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine. International Journal of Advanced Computer Science and Applications 9 (1), pp. 18–25.
[33] C. A. Sims (1980) Macroeconomics and reality. Econometrica 48 (1), pp. 1–48.
[34] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. 2nd edition, MIT Press.
[35] H. Tong (1990) Non-linear time series: a dynamical system approach. Oxford University Press.
[36] V. N. Vapnik (1995) The nature of statistical learning theory. Springer, New York.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30, pp. 5998–6008.
[38] C. Wang, H. Liang, B. Wang, X. Cui, and Y. Xu (2022) MG-Conv: a spatiotemporal multi-graph convolutional neural network for stock market index trend prediction. Computers and Electrical Engineering 103, p. 108285.
[39] H. Wu, T. Hu, Y. Liu, H. Zhou, J. Wang, and M. Long (2023) TimesNet: temporal 2d-variation modeling for general time series analysis. In Proceedings of the 11th International Conference on Learning Representations (ICLR).
[40] Q. Wu, W. Zhao, Z. Li, D. P. Wipf, and J. Yan (2022) NodeFormer: a scalable graph structure learning transformer for node classification. In Advances in Neural Information Processing Systems, Vol. 35, pp. 27387–27401.
[41] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 4–24.

Mohammad Al Ridhawi received the B.A.Sc. degree in computer engineering and the M.Sc. degree in digital transformation and innovation (machine learning) from the University of Ottawa, Ottawa, Canada, in 2019 and 2021, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Ottawa, where he also serves as a Part-Time Engineering Professor. He has industry experience as a Senior Data Scientist and Senior Machine Learning Engineer, building production ML systems in financial and environmental domains. His research interests include deep learning, graph neural networks, natural language processing, financial time series analysis, and reinforcement learning.
Mahtab Haj Ali received the M.Sc. degree in digital transformation and innovation from the University of Ottawa, Ottawa, Canada, in 2021. She is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Ottawa, with a research focus on time series forecasting and deep learning models. She works as an AI Research Engineer at the National Research Council of Canada, where she builds and evaluates large language models (LLMs) and develops AI-driven solutions for real-world industrial applications. Her work includes large-scale time series analysis, advanced feature engineering, and the application of LLMs in production environments. Her research interests include deep learning for time series analysis, deep neural networks, and applied artificial intelligence. Hussein Al Osman received the B.A.Sc., M.A.Sc., and Ph.D. degrees from the University of Ottawa, Ottawa, Canada. He is an Associate Professor and Associate Director in the School of Electrical Engineering and Computer Science at the University of Ottawa, where he leads the Multimedia Processing and Interaction Group. His research focuses on affective computing, multimodal affect estimation, human–computer interaction, serious gaming, and multimedia systems. He has produced over 50 peer-reviewed research articles, two patents, and several technology transfers to industry.