Paper deep dive
ARROW: Augmented Replay for RObust World models
Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo
Abstract
Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
Links
- Source: https://arxiv.org/abs/2603.11395v1
- Canonical: https://arxiv.org/abs/2603.11395v1
Full Text
ARROW: Augmented Replay for RObust World models

Authors:
- Abdulaziz Alyahya (abalyahya@imamu.edu.sa), Department of Information Systems, Imam Mohammad Ibn Saud Islamic University (IMSIU)
- Abdallah Al Siyabi (abdallah.alsyiabi@monash.edu), Department of Data Science & AI, Monash University
- Markus R. Ernst (m.ernst@unsw.edu.au), School of Computer Science and Engineering, University of New South Wales, Sydney
- Luke Yang (luke.yang@monash.edu), Department of Data Science & AI, Monash University
- Levin Kuhlmann (levin.kuhlmann@monash.edu), Department of Data Science & AI, Monash University
- Gideon Kowadlo (gideon@cerenaut.ai), Cerenaut – https://cerenaut.ai

Abstract

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer.
Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

arXiv:2603.11395v1 [cs.LG] 12 Mar 2026

1 Introduction

The ability to continually acquire new skills while retaining old ones is central to intelligence. However, many AI systems suffer from catastrophic forgetting, where learning new tasks abruptly degrades earlier capabilities (McCloskey & Cohen, 1989; French, 1999). In order to deploy reinforcement learning (RL) agents in open-ended, sequentially changing environments, overcoming this limitation becomes essential.

1.1 The challenge of continual reinforcement learning

Classical RL optimizes expected return in a single, stationary environment (Sutton & Barto, 1998), enabling major successes in Atari, Go, and beyond (Mnih et al., 2015; Silver et al., 2017; Vinyals et al., 2019; Levine et al., 2016). Many real-world settings, however, require adaptation across sequential tasks. In continual reinforcement learning (CRL), an agent encounters a curriculum T = (τ_1, τ_2, ..., τ_T), often without task boundaries or identifiers (Khetarpal et al., 2022). While continual learning (CL) has long been studied in supervised learning (Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017), supervised pipelines can often mitigate non-stationarity by reshuffling and replaying a fixed dataset (French, 1999). In RL, a stationary data distribution may be unattainable: data are streamed, tasks may not reliably repeat, and the environment itself can be non-stationary. Following Parisi et al. (2019), CL requires balancing stability, plasticity, and transfer. Agents must retain prior performance (stability) while learning new tasks efficiently (plasticity), and ideally reuse earlier knowledge to accelerate future learning (forward transfer) or improve earlier tasks (backward transfer), especially when tasks share dynamics or visual structure.
These goals are inherently in tension (Parisi et al., 2019), forming the stability-plasticity dilemma (Mermillod et al., 2013).

1.2 Related work and limitations

Continual learning methods are commonly grouped into parameter regularization, architectural modularity, and rehearsal/replay. Parameter regularization constrains parameter updates to preserve weights important for prior tasks, as in Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017). Modularity dedicates task-specific components or routes computation through task-specific pathways, exemplified by PathNet (Fernando et al., 2017) and Progress & Compress (P&C) (Schwarz et al., 2018). Replay methods interleave stored experiences with new data to rehearse previous tasks (Rolnick et al., 2019; Riemer et al., 2019; Chaudhry et al., 2021) and are among the most effective approaches, but naive replay scales poorly because retaining complete experience histories demands large memory (OpenAI et al., 2019). State-of-the-art model-free CRL methods such as CLEAR (Rolnick et al., 2019) combine large replay buffers with V-trace off-policy correction and behavior cloning for stability, while replay-buffer augmentations inspired by neuroscience (e.g., selective replay based on surprise or reward) indicate that matching the global training distribution can mitigate catastrophic forgetting without storing all experiences (Isele & Cosgun, 2018).

1.3 The neuroscience-inspired alternative

Complementary Learning Systems (CLS) theory (Hassabis et al., 2017; Khetarpal et al., 2022) posits two interacting memory systems: a fast system that captures recent episodes and a slow system that builds structured knowledge. In this view, the hippocampus replays recent experiences to the neocortex, a slow statistical learner, reducing forgetting; the neocortex can be interpreted as forming a predictive World Model (Mathis, 2023).
Yet in RL, replay is typically used to improve model-free policies rather than to train a World Model directly. World models are central to model-based RL, predicting action consequences (Ha & Schmidhuber, 2018; Hafner et al., 2019), and underpin DreamerV1-V3 (Hafner et al., 2020; 2021; 2025). They have also been applied to continual RL (Nagabandi et al., 2019; Huang et al., 2021; Kessler et al., 2023; Rahimi-Kalahroudi et al., 2023). Notably, Kessler et al. (2022) showed that DreamerV2 (Hafner et al., 2021) with a persistent FIFO buffer can reduce forgetting. However, World Model CL approaches often rely on replay buffers with millions of high-dimensional samples, creating substantial memory and scalability constraints. World models are a natural fit for replay because they support off-policy learning, consistent with the neuroscientific motivation, but remain underexplored in memory-efficient continual settings. Thus, a key open question is: can strategic, memory-efficient replay to World Models enable robust continual learning while retaining the sample efficiency of existing approaches?

1.4 Our contribution

Building on the work by Yang et al. (2024), we introduce ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 (Hafner et al., 2025) with a memory-efficient, strategically managed replay mechanism (Isele & Cosgun, 2018) designed for continual learning, balancing recent experience with long-term knowledge. We evaluate ARROW in two continual learning regimes: (i) tasks without shared structure, where each task corresponds to a distinct environment and reward function, and (ii) tasks with shared structure, where common dynamics or visual features enable transfer (Khetarpal et al., 2022) (e.g., shifting game conditions (Riemer et al., 2019)). This second regime reflects many practical applications, such as a household robot learning related skills.
Accordingly, our analysis considers not only forgetting but also forward and backward transfer, while targeting scalability in memory and computation (Khetarpal et al., 2022; Chen & Liu, 2018). We benchmark ARROW against model-based (DreamerV3) and model-free (SAC) baselines matched in the memory footprint of their respective buffers. By showing that bio-inspired replay strategies can improve continual learning with modest memory, this work advances lifelong agents that can continuously acquire and refine skills in open-ended environments.

2 Background

We consider continual learning (CL) in finite, discrete-time, partially observable reinforcement learning (RL) environments modeled as Partially Observable Markov Decision Processes (POMDPs) (Kaelbling et al., 1998), a generalization of fully observable MDPs (Puterman, 1990). A POMDP is defined by M = (S, A, p, r, Ω, O, γ), where s_t ∈ S evolves via s_{t+1} ∼ p(s_t, a_t) under actions a_t ∈ A, yielding reward r(s_t, a_t, s_{t+1}) ∈ ℝ. The agent observes ω_t ∈ Ω generated by ω_t ∼ O(s_t), and optimizes discounted returns with γ ∈ (0, 1). Actions are sampled from a stochastic policy a_t ∼ π(ω_t), typically parameterized by a neural network π_θ. Under a finite horizon T, the return is

    R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i, s_{i+1}),

and the objective is to maximize E_π[R_0 | s_0]. RL methods may be model-free or model-based: model-free learning directly optimizes π_θ, while model-based approaches learn a simulator (a World Model) of p and r from past rollouts and can use it to simulate trajectories for planning and action selection, or to train the policy from imagined rollouts. RL algorithms are also on- or off-policy; on-policy methods require fresh data from the latest π_θ, whereas off-policy methods learn from previously collected samples despite policy mismatch (Espeholt et al., 2018).
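The finite-horizon discounted return above can be computed with a single backward pass over a trajectory, reusing the next step's return at each step. A minimal sketch (the function name and toy rewards are illustrative, not from the paper):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every t,
    iterating backwards so each step reuses the next return."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three steps with reward 1 each and gamma = 0.5:
# R_2 = 1, R_1 = 1 + 0.5*1 = 1.5, R_0 = 1 + 0.5*1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward recursion avoids the quadratic cost of evaluating the sum independently at every t.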
Off-policy replay buffers can be used to train World Models and improve sample efficiency, often yielding substantially better data efficiency than purely model-free optimization.

3 Augmented Replay for RObust World models (ARROW)

ARROW extends DreamerV3, which achieved state-of-the-art performance on several single-GPU RL benchmarks, though it was not explicitly tested on continual RL. It comprises three components: a World Model of the environment, an actor-critic controller for decision making, and an augmented replay buffer that stores experience. The replay buffer is used to train the World Model, which then generates imagined ("dreamed") trajectories used to train the controller, enabling off-policy learning and data augmentation, useful in continual learning when environment interaction is limited. ARROW does not require explicit task identifiers, allowing for more flexible adaptation to changing environments. The source code is available online [1].

[1] https://anonymous.4open.science/r/ARROW-B6F2/

3.1 World model

As with DreamerV3, ARROW uses a Recurrent State-Space Model (RSSM) (Hafner et al., 2019) to predict dynamics, see Fig. 1A. It maintains a deterministic hidden state h_t and a stochastic latent state z_t conditioned on observations x_{1:t} and actions a_{1:t}.

Figure 1: World Model Learning. (A) Images drawn from the replay buffer are encoded to and reconstructed from a latent space using a recurrent state space model. (B) Learning the policy is achieved with Actor (A) and Critic (C) networks applied to latent states "dreamt-up" by the model.

A GRU models the dynamics by predicting the deterministic state h_{t+1} = GRU(h_t, z_t, a_t) and the stochastic state ẑ_{t+1} = f(h_{t+1}). The latent z_t is inferred via a variational encoder z_t ∼ q_θ(z_t | h_t, x_t), while the prior ẑ_t ∼ p_θ(ẑ_t | h_t) is used for open-loop dreaming when posteriors are unavailable. We use a standard GRU with tanh activation, and set z_t to 32 discrete units with 32 categorical classes. The World Model state at time t concatenates h_t and z_t, yielding a Markovian representation. Training reconstructs images and rewards, using KL balancing (Hafner et al., 2021) to stabilize transitions.

3.2 Actor-critic controller

The actor and critic are MLPs that map World Model states to actions and value estimates, see Fig. 1B. They are trained entirely on imagined trajectories generated by the World Model ("dreaming"), using on-policy REINFORCE (Williams, 1992). Imagined trajectories are inexpensive and avoid additional environment interaction.

3.3 Augmented replay buffer

The augmented replay buffer, Fig. 2A, consists of a short-term FIFO buffer D_1 (as in DreamerV3) and a long-term global distribution matching buffer D_2, used in parallel and sampled uniformly for each minibatch. Each buffer stores 2^18 ≈ 262,000 observations, yielding a smaller total capacity than DreamerV3's single 1M-observation buffer (in our experiments, all methods are matched at 2^19 observations), without noticeable performance loss on our benchmarks. To further reduce storage pressure, we store spliced rollouts instead of entire episodes.

Short-term FIFO buffer: The FIFO buffer stores the most recent 2^18 samples, ensuring the World Model trains on all incoming experience with a recency bias that improves convergence on the current task.

Long-term global distribution matching (LTDM) buffer: Matching the global training distribution under limited capacity can reduce catastrophic forgetting (Isele & Cosgun, 2018).
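A distribution-matching buffer of this kind can be maintained with reservoir sampling: each incoming rollout chunk receives a uniform random key, and the buffer keeps only the items with the largest keys, which at any time is a uniform random subset of everything seen so far. A minimal sketch (class and variable names are illustrative, not ARROW's actual code):

```python
import heapq
import random

class ReservoirBuffer:
    """Keep a uniform random subset of all items ever added, using a
    size-bounded min-heap of (random key, item) pairs: the smallest
    key is always the next candidate for eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []  # min-heap ordered by the random keys

    def add(self, item):
        key = random.random()  # uniform key decides survival
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (key, item))
        elif key > self.heap[0][0]:
            # New key beats the smallest retained key: replace it.
            heapq.heapreplace(self.heap, (key, item))

    def items(self):
        return [item for _, item in self.heap]

buf = ReservoirBuffer(capacity=8)
for chunk_id in range(1000):
    buf.add(chunk_id)
print(len(buf.items()))  # 8 retained chunks, a uniform sample of the 1000
```

Each `add` is O(log capacity), so the cost of keeping the long-term buffer balanced is negligible next to World Model training.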
Our LTDM buffer also has capacity 2^18, and stores a uniform random subset of 512 spliced rollouts. We use reservoir sampling by assigning each rollout chunk a random key and maintaining a size-limited priority queue that retains the highest keys.

Figure 2: Experiment setup. (A) Augmented buffer used in ARROW. (B) Continual learning tasks with and without shared structure. NB: no background, RT: restricted themes, GA: generated assets, MA: monochrome assets, CA: centered agent.

Spliced rollouts: With small buffers, storing full episodes can yield too few unique trajectories, biasing training data and reducing World Model accuracy, especially in D_2, which should retain coverage across tasks. We therefore splice rollouts into chunks of length 512, ensuring a controlled sampling granularity. The remaining rollouts with fewer than 512 states are concatenated with subsequent episodes, using a reset flag to mark boundaries. For efficiency, episodes are also truncated after a fixed number of steps so each iteration collects an identical number of environment steps. In practice, splicing provided a simple way to control granularity without harming performance.

3.4 Task-agnostic exploration

In tasks without shared structure, environments can differ sharply in dynamics, visuals, and reward scales, making exploration difficult without task IDs; policies trained on earlier tasks may be insufficiently stochastic on new tasks.
We apply fixed-entropy regularization (as in DreamerV3) and use predetermined per-environment reward scales (from single-task baselines) during actor-critic training. This mitigates exploration issues without adding an explicit exploration system such as Plan2Explore (Sekar et al., 2020).

4 Experiments

4.1 Environments

Atari (without shared structure): For environments without shared structure, we selected six diverse Atari games from the ALE's v5 configurations spanning different visual modalities and gameplay dynamics (Bellemare et al., 2013). We chose Ms. Pac-Man, Boxing, Crazy Climber, Frostbite, Seaquest, and Enduro, see Fig. 2B. We presented the tasks to the agent in this specific but randomly chosen order (henceforth referred to as the default task order) and followed Machado et al. (2018) in using sticky actions.

CoinRun (shared structure): To construct a suite of continual learning tasks with shared structure, we used the CoinRun environment from OpenAI Procgen as the base (Cobbe et al., 2020). We then introduced six progressive visual and behavioral perturbations, as shown in Fig. 2B and described in Appendix Tab. A.1.

4.2 Training configurations and baselines

To evaluate continual learning behavior across both Atari and CoinRun, we adopted three training configurations. Training and evaluation details are summarized in Appendix Tab. A.2.

Default task order: The agent is trained on all tasks once, following the default order shown in Fig. 2B for each of the two benchmarks.

Reversed task order: The agent is trained on all tasks once but in reverse order. For Atari: Enduro → Seaquest → Frostbite → Crazy Climber → Boxing → Ms. Pac-Man. For CoinRun: CA → MA → GA → RT → NB → CoinRun.

Two-cycle training: This configuration uses the default task order but splits the total training budget into two cycles, with identical total environment steps to the two previous configurations.
This allows us to examine relearning, retention, and cross-cycle adaptation, and to quantify performance recovery when a task is revisited.

To measure the efficacy of the augmented replay buffer, we compared ARROW to DreamerV3 and model-free SAC, each with equal memory allowance for their respective replay buffers. For SAC, we adopted the Target Entropy Scheduled SAC (TES-SAC) variant, an extension proposed by Xu et al. (2021) that dynamically schedules the entropy target for improved stability and exploration. We also ran single-task baselines that were used for evaluation metrics and normalization.

4.3 Replay memory allowance

All methods are compared under an equal replay memory budget. Replay buffers store sequences (spliced rollouts) of length T = 512; a capacity of 512 sequences corresponds to 512 × 512 = 262,144 = 2^18 observations. For ARROW, we combine a short-term FIFO buffer and a long-term global distribution-matching (LTDM) buffer, each with capacity 512 sequences of length T = 512. Consequently, N_ARROW = 2 × 512 = 1024 sequences and T × N_ARROW = 512 × 1024 = 524,288 = 2^19 observations. DreamerV3 and TES-SAC use a single FIFO replay buffer with capacity 1024 sequences of length T = 512: N_DV3/TES-SAC = 1024, so T × N_DV3/TES-SAC = 512 × 1024 = 524,288 = 2^19.

4.4 Evaluation metrics

To assess continual learning performance, we evaluated both task-level reward and a set of established stability-plasticity metrics. Following Kessler et al. (2023), we report forgetting and forward transfer. Following Lange et al. (2023), we include ACC, min-ACC, and WC-ACC. For our two-cycle setting, we further report Recovery and introduce a new cross-cycle metric, maximum forgetting (Max-F). We evaluated performance by normalizing episodic reward to two single-task baselines: ARROW and a random agent.

4.4.1 Normalized rewards

We define an ordered suite of tasks T = (τ_1, τ_2, ..., τ_T), where τ_i denotes the i-th task in the curriculum and i ∈ {1, ..., T}.
Performance in task τ ∈ T after n steps is given by p^ST_τ(n) in single-task experiments and by p_τ(n) in CL. Agents were trained on each task for n = N environment steps in single-task and CL experiments. For each task τ ∈ T, we calculated the normalized reward using p^ST_τ(0) and p^ST_τ(n):

    q_τ(n) = (p_τ(n) − p^ST_τ(0)) / (p^ST_τ(n) − p^ST_τ(0))    (1)

A normalized score of 0 corresponds to random performance and a score of 1 corresponds to the performance when trained on only that task.

4.4.2 Forgetting (backward transfer)

Average forgetting for each task is the difference between performance after training on a given task and performance at the end of all tasks. Average forgetting over all tasks is defined as:

    F = (1/T) Σ_{i=1}^{T} [ q_τi(i·N) − q_τi(T·N) ]    (2)

A lower value indicates improved stability and a better continual learning method. A negative value implies that the agent has managed to gain performance on earlier tasks, thus exhibiting backward transfer.

4.4.3 Forward transfer

The forward transfer for a task is the normalized difference between performance in the CL and single-task experiments. The average over all tasks is defined as:

    FT = (1/T) Σ_{i=1}^{T} (S_τi − S^ST_τi) / S^ST_τi    (3)

where

    S_τi = (1/N) Σ_{n=1}^{N} q_τi((i−1)·N + n)    (4)

    S^ST_τi = (1/N) Σ_{n=1}^{N} q^ST_τi(n)    (5)

The larger the forward transfer, the better the continual learning method. A positive value implies effective use of learned knowledge from previous environments and, as a result, accelerated learning in the current environment. When the tasks are unrelated to one another, no positive forward transfer is expected; in this case, a forward transfer of 0 represents optimal plasticity, and negative values indicate that previous tasks form a barrier to learning newer ones.

4.4.4 Stability-plasticity metrics

To characterize the stability-plasticity trade-off in continual learning, we follow Lange et al. (2023) and analyze three complementary metrics: average accuracy (ACC), average minimum accuracy (min-ACC), and worst-case accuracy (WC-ACC). These assess how well the agent learns the current task (plasticity) while retaining knowledge of previous tasks (stability).

Average accuracy (ACC): After completion of task τ_k, ACC measures performance on all tasks encountered up to that point:

    ACC_τk = (1/k) Σ_{i=1}^{k} q_τi(t_k)    (6)

Here, q_τi(t_k) is the normalized performance on task τ_i evaluated at the end of task τ_k.

Average minimum accuracy (min-ACC): min-ACC tracks the worst performance each previously learned task attains after it has been learned (the average of each task's own minimum):

    min-ACC_τk = (1/(k−1)) Σ_{i=1}^{k−1} min_{t_i < n ≤ t_k} q_τi(n)    (7)

For each earlier task τ_i, we take the minimum normalized performance over evaluation steps n after t_i and up to t_k, then average these minima over the k−1 previous tasks.

Worst-case accuracy (WC-ACC): WC-ACC can be evaluated at every training iteration. At iteration n within task τ_k:

    WC-ACC_n = (1/k) q_τk(n) + (1 − 1/k) min-ACC_τk    (8)

The first term, (1/k) q_τk(n), reflects performance on the current task and therefore measures plasticity. The second term, weighted by (1 − 1/k), incorporates through min-ACC the minimum performance that each previously learned task ever achieved. Unlike ACC, which is only defined at task boundaries, WC-ACC provides a continuous view of how the agent trades off learning the present task while maintaining stability on earlier ones throughout training.

4.4.5 Sample efficiency

Another important factor for practical applications, especially where agents operate in the real world (as opposed to simulation), is sample efficiency. Therefore, we analyzed how quickly each method reaches performance thresholds.
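To make the definitions concrete, the forgetting and stability metrics of Eqs. (2), (7), and (8) can be computed from a log of normalized scores. The sketch below uses a toy layout and a one-evaluation-per-step convention as illustrative assumptions; it is not the paper's evaluation code:

```python
def forgetting(scores, N, T):
    """Eq. (2): average of q_i(i*N) - q_i(T*N) over tasks i = 1..T.
    scores[i][n] is task i's normalized score after n global steps
    (assuming one evaluation per step, for simplicity)."""
    return sum(scores[i][i * N] - scores[i][T * N] for i in range(1, T + 1)) / T

def min_acc(scores, N, k):
    """Eq. (7): average over tasks i < k of the minimum score task i
    reaches after its own training ends (steps i*N < n <= k*N)."""
    minima = [min(scores[i][n] for n in range(i * N + 1, k * N + 1))
              for i in range(1, k)]
    return sum(minima) / (k - 1)

def wc_acc(scores, N, k, n):
    """Eq. (8): current-task term (plasticity) plus min-ACC term (stability)."""
    return scores[k][n] / k + (1 - 1 / k) * min_acc(scores, N, k)

# Toy run: two tasks, N = 1 step each, scores indexed by global step 0..2.
scores = {
    1: [0.0, 1.0, 0.4],  # task 1: learned fully, then degrades to 0.4
    2: [0.0, 0.0, 1.0],  # task 2: learned during the second step
}
print(forgetting(scores, N=1, T=2))   # ((1.0-0.4) + (1.0-1.0)) / 2 = 0.3
print(wc_acc(scores, N=1, k=2, n=2))  # 1.0/2 + (1/2)*0.4 = 0.7
```

The toy numbers illustrate the intended behavior: task 1's drop from 1.0 to 0.4 drives both the forgetting score and the stability term of WC-ACC.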
Sample efficiency is measured as the median number of environment frames required to reach 85% of the maximum median performance across all methods. Formally, let P^(m)_t denote the median normalized performance of method m at frame t, and let P* = max_{m,t} P^(m)_t be the maximum performance achieved by any method during training. The sample efficiency for method m is defined as:

    SE_m = min { t : P^(m)_t ≥ 0.85 · P* }    (9)

4.4.6 Two-cycle metrics

We extended our evaluation to a two-cycle setting that uses the same overall training budget as the one-cycle curriculum. The total number of environment steps allocated to each task is divided into two equal exposures: each task τ_i is first encountered during cycle 1 and then revisited for the remaining budget during cycle 2.

Maximum forgetting (Max-F): To quantify the worst degradation that occurs between the first and second exposures of a task, we compute the maximum forgetting experienced by each task τ_i across the two-cycle curriculum:

    t^(1)_i = i·N,  t^(2)_i = T·N + (i−1)·N,  t^(3)_i = T·N + i·N    (10)

    Max-F_τi = q_τi(t^(1)_i) − q_τi((t^(2)_i)^−)    (11)

where q_τi(t^(1)_i) denotes the normalized performance at the end of the first exposure to task τ_i in cycle 1, and q_τi((t^(2)_i)^−) denotes the last evaluation immediately before the second exposure of the same task during cycle 2. We write x^− to denote the final evaluation step strictly before step x; in particular, (t^(2)_i)^− is the last evaluation of task τ_i immediately before its second exposure begins at step t^(2)_i. The difference measures how much performance on task τ_i has deteriorated while the agent is learning other tasks. Using Max-F compares forgetting at an equivalent point for each task: after the agent has completed training on all other tasks in cycle 1 but before relearning begins in cycle 2. In this interval, the agent is continuously trained on tasks τ_{i+1}, ..., τ_T.
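The two-cycle timepoints and the Max-F metric of Eqs. (10)-(11), together with the Recovery ratio of Eq. (12), reduce to simple index arithmetic. The sketch below assumes, purely for illustration, one evaluation per environment step, so that (t^(2)_i)^− is simply t^(2)_i − 1; the toy scores are not results from the paper:

```python
def two_cycle_timepoints(i, N, T):
    """Eq. (10): end of first exposure, start of second exposure, and
    end of second exposure for task i, with T tasks of N steps each."""
    t1 = i * N
    t2 = T * N + (i - 1) * N  # second exposure to task i begins here
    t3 = T * N + i * N
    return t1, t2, t3

def max_forgetting(q_i, t1, t2):
    """Eq. (11): drop from the end of the first exposure to the last
    evaluation strictly before the second exposure (here t2 - 1)."""
    return q_i[t1] - q_i[t2 - 1]

def recovery(q_i, t1, t3):
    """Eq. (12): performance after relearning relative to the level
    reached at the end of the first exposure."""
    return q_i[t3] / q_i[t1]

# Toy curriculum: T = 2 tasks, N = 2 steps per exposure, task i = 1.
t1, t2, t3 = two_cycle_timepoints(i=1, N=2, T=2)  # (2, 4, 6)
q1 = [0.0, 0.5, 1.0, 0.6, 0.3, 0.8, 1.2]  # task 1 scores at steps 0..6
print(max_forgetting(q1, t1, t2))  # 1.0 at t1 minus 0.6 just before t2
print(recovery(q1, t1, t3))        # 1.2 / 1.0, i.e. the task is surpassed
```

A Recovery value above 1, as in the toy numbers, corresponds to the cross-cycle improvement the paper reports for revisited tasks.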
As a result, Max-F captures how resilient a task's knowledge remains when the agent has been exposed to the maximum possible amount of interference from all other tasks in the curriculum.

Recovery: The recovery metric quantifies how much performance on task τ_i is regained after the agent relearns the task during cycle 2, where t^(3)_i denotes the end of the second exposure to task τ_i:

    Rec_τi = q_τi(t^(3)_i) / q_τi(t^(1)_i)    (12)

Figure 3: Atari median normalized performance (Eq. 1). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.

5 Results

5.1 Tasks without shared structure: Atari

Median normalized performance is shown in Fig. 3, while other metrics are visualized in Fig. 4. Detailed numerical results (median [IQR]) are provided in Appendix Tab. A.3. ARROW nearly eliminates catastrophic forgetting, regardless of task order, while DreamerV3 suffers severe forgetting whenever a new task is introduced. TES-SAC also exhibits low forgetting scores, but this is misleading: TES-SAC largely fails to learn the Atari tasks in the first place, so there is little performance to lose.

Default task order. ARROW reduces forgetting by more than six-fold compared to DreamerV3 (0.197 vs. 1.217; Fig. 4A). DreamerV3 retains a slight edge in forward transfer, indicating that its unconstrained buffer allows faster initial learning on new tasks, whereas ARROW's augmented buffer introduces a modest plasticity cost.
Crucially, ARROW achieves the best overall stability-plasticity trade-off, as reflected by a WC-ACC of 0.615, well above the negative values recorded for both baselines.

Reversed task order. The same pattern holds when the curriculum is reversed (Fig. 4B). ARROW's forgetting drops to 0.039, closely matching TES-SAC, while DreamerV3 again forgets catastrophically (1.348).

Figure 4: Atari metrics shown as median with (0.25 - 0.75) quartile confidence intervals, across 5 seeds, and calculated using normalized scores (Eq. 1). (A) Default task order (one-cycle). (B) Reversed task order (one-cycle). (C) Default task order (two-cycle).

ARROW maintains the highest WC-ACC (0.618), confirming that its stability advantage is robust to task ordering.

Two-cycle training. The two-cycle setting reveals ARROW's most distinctive strength: the ability to recover and even surpass prior performance when tasks are revisited (Fig. 4C). This is reflected in ARROW's maximum forgetting (Max-F), which is essentially zero (0.012), indicating that the worst degradation any task experiences between its first and second exposure is negligible. DreamerV3, by contrast, suffers a Max-F of 0.735, and TES-SAC reaches 0.089.
Most notably, ARROW again leads in WC-ACC (0.388 vs. negative values for both baselines).

5.2 Tasks with shared structure: Procgen CoinRun

Median normalized performance is shown in Fig. 5, while other metrics are visualized in Fig. 6. Detailed numerical results (median [IQR]) are provided in Appendix Tab. A.4. When tasks share visual and structural features, all methods forget less than in Atari; ARROW in particular achieves near-zero forgetting in the reversed order and negative maximum forgetting in the two-cycle setting. Because forgetting is generally lower across the board, the distinguishing factor becomes the stability-plasticity balance: ARROW attains the highest WC-ACC in every CoinRun configuration, pairing strong forward transfer with reliable retention of earlier tasks.

Figure 5: CoinRun median normalized performance (Eq. 1). Shaded area depicts 0.25 and 0.75 quartiles of 5 seeds. Bold line segments indicate training of task. (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). The dotted vertical line marks the end of cycle 1 and the beginning of cycle 2.

Default task order. Both model-based methods exhibit strong forward transfer on CoinRun variants (Fig. 6A), with DreamerV3 slightly ahead (0.787 vs. 0.507 for ARROW). However, ARROW provides a more reliable trade-off, achieving the highest WC-ACC (0.635 vs. 0.328 for DreamerV3). TES-SAC shows minimal forgetting (−0.012) but also near-zero forward transfer, reflecting limited benefit from shared structure.

Reversed task order.
When the curriculum is reversed, ARROW's forgetting drops to effectively zero (0.000), while its forward transfer rises to 0.715, nearly matching DreamerV3 (0.786). Under this setting, ARROW achieves the strongest overall profile, with WC-ACC exceeding 1.0.

Two-cycle training. In this setting (Fig. 6C), ARROW's maximum forgetting is negative (Max-F = −0.089), while DreamerV3 reaches 0.233 and TES-SAC is near zero (−0.026). All three methods achieve recovery above 1.0, confirming that revisiting CoinRun tasks is universally beneficial. However, ARROW stands out in two ways: it attains the lowest Max-F and by far the best WC-ACC (0.912 vs. 0.071 for DreamerV3 and 0.600 for TES-SAC).

Figure 6: CoinRun metrics shown as median with (0.25 – 0.75) quartile confidence intervals across 5 seeds, calculated using normalized scores (Eq. 1). (A) Default order of tasks (one-cycle). (B) Reversed order of tasks (one-cycle). (C) Default order of tasks (two-cycle). Panels show forgetting, forward transfer, the stability–plasticity trade-off, and sample efficiency (with thresholds 1.02, 1.14, and 1.13, respectively).

5.2.1 Continual learning sample efficiency

The last columns of Fig. 4 and Fig. 6 illustrate how quickly each method reaches a performance threshold of 85%. The non-shared tasks (Atari) show dramatic task-order sensitivity, as the peak performances vary widely (2.02 for default, 0.82 for reversed, and 1.17 for two-cycle).
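The threshold measurement above can be sketched as follows. This is an illustrative reconstruction, not the authors' evaluation code: the `first_crossing` helper and the synthetic curve are hypothetical, and we assume the threshold is 85% of the peak normalized performance reported per setting.

```python
def first_crossing(frames, norm_perf, peak, fraction=0.85):
    """Return the first frame at which normalized performance reaches
    `fraction` of `peak`, or None if the threshold is never reached."""
    threshold = fraction * peak
    for frame, perf in zip(frames, norm_perf):
        if perf >= threshold:
            return frame
    return None

# Synthetic example: checkpoints every 0.5M frames, peak performance 2.0,
# so the 85% threshold is 1.7, first crossed at the eighth checkpoint.
frames = [int(0.5e6 * i) for i in range(1, 11)]
perf = [0.1, 0.4, 0.8, 1.1, 1.3, 1.5, 1.6, 1.75, 1.9, 2.0]
print(first_crossing(frames, perf, peak=2.0))  # → 4000000
```

A run that never crosses the threshold (e.g. ARROW under the default Atari order) simply returns None, matching the "Never reached threshold" entries in Tab. A.5.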
In the default task order, DreamerV3 reaches the threshold in 5.652M frames across 4/5 seeds. ARROW does not reach the 2.02 threshold in any seed, and TES-SAC also fails to reach it. In the reversed task order, ARROW reaches the 0.82 threshold in 5.079M frames across 4/5 seeds, unmatched by both DreamerV3 and TES-SAC. For the two-cycle training, both ARROW and DreamerV3 reach the 85% threshold, with DreamerV3 (3.113M) slightly more sample efficient than ARROW (3.441M). For detailed statistics including interquartile ranges, see Appendix Tab. A.5.

As for the tasks with shared structure, we observe greater consistency between the model-based approaches: for each task configuration, both ARROW and DreamerV3 consistently reach the threshold in all 5 seeds, whereas TES-SAC fails to reach the threshold in any CoinRun task configuration. We also observe that the sample efficiency of ARROW is very sensitive to task order. In the default order, DreamerV3 reaches the threshold in a third of the time (0.492M) compared to ARROW (1.638M). In reversed order, both model-based approaches are of the same order of magnitude (ARROW: 3.768M, DreamerV3: 3.604M). This pattern can also be observed in the two-cycle training, where tasks are presented in the default order again. For detailed statistics including interquartile ranges, see Tab. A.6.

6 Discussion

The results demonstrate that ARROW tends to be more stable and trades off stability and plasticity better than the model-free and model-based baselines. Our results confirm that augmented replay to a World Model in a model-based approach provides a strong foundation for continual RL. For tasks without shared structure, ARROW almost completely eliminates forgetting. This suggests that the distribution-matching principle that ARROW is built upon preserves World Model accuracy across previous and current tasks.
Moreover, this stands in stark contrast to the results obtained for DreamerV3, which exhibited catastrophic forgetting on every novel Atari task. TES-SAC also showed favorable forgetting statistics, but since TES-SAC fails to adequately learn any of the Atari tasks, this does not indicate retention; rather, it must be interpreted in the context of its low overall performance. On tasks with shared structure, ARROW showed decreased forgetting. Notably, the magnitude of forgetting is tightly linked to the order of the presented tasks. In fact, the very last task of the default order, "+CA", causes the camera to no longer be centered on the agent. This appears to be a significant departure from the other shared-structure environments and causes a drop in normalized performance. When the order is reversed, the near-zero forgetting of ARROW reappears. When we applied DreamerV3 to the shared-structure tasks (CoinRun), we observed very high variance, resulting in low minimum average accuracy (min-ACC). ARROW mitigates that variance by making effective use of the augmented replay buffer, where past experiences stabilize the training. In addition, ARROW was able to learn multiple tasks with highly varying reward magnitudes, without task identifiers, which is an important and difficult challenge in its own right. Interestingly, in our two-cycle training, ARROW consistently exhibited exceptional recovery. We hypothesize that this recovery indicates that ARROW learns superior representations through implicit multi-task learning. Alternatively, the training time in the two-cycle approach may be too limited to fully learn some tasks in the first cycle. Regarding sample efficiency, ARROW clearly surpasses DreamerV3 and TES-SAC in almost all single tasks. For tasks with shared structure, we see that ARROW trades its superior stability and forgetting characteristics for slightly worse sample efficiency compared to DreamerV3.
Our method can be used in conjunction with prior state-of-the-art approaches to combating catastrophic forgetting, such as EWC and P&C (Kirkpatrick et al., 2017; Schwarz et al., 2018), which operate on network parameters, and CLEAR (Rolnick et al., 2019), which uses replay but typically operates on model-free approaches and uses behavior cloning of the policy from environment input to action output.

Generalization. Forward and backward transfer were significantly better for tasks with shared structure than for those without, where the World Model's ability to generalize across tasks is very beneficial.

Reward scaling without task IDs. While ARROW was sufficiently robust to learn multiple tasks without forgetting, even when reward scales were somewhat different, we found that this property did not hold when reward scales differed significantly (e.g., by a factor of 10^2), where only the task with the higher reward scale would be learned. In that case, we found that approximate reward scaling allowed the agent to learn multiple tasks. We hypothesized that the differing rewards and subsequent returns cause poorly scaled advantages when training the actor, resulting in the actor only learning tasks with the highest returns. Experiments with automatic scaling of advantages through non-linear squashing transformations proved to hurt learning on individual tasks, so a static, linear reward transformation was used.

Memory capacity. As RL algorithms often consume substantial compute, we emphasize the benefit of lower computational and memory costs. ARROW does not scale up the buffer compared to DreamerV3, but instead splits the available memory and intelligently uses past experience. Despite the benefits and improved memory capacity of ARROW, a key limitation of any buffer-based method is finite capacity.
As more tasks are explored and previous tasks are not revisited, an increasing number of samples from previous tasks will inevitably be lost, leading to forgetting.

Limitations and future work. ARROW currently allocates a fixed 50/50 capacity split to the short-term and long-term buffers. A natural extension would be to test different splits or to dynamically allocate memory based on task characteristics. Extending ARROW to continuous control or robotics domains like MuJoCo (Todorov et al., 2012) could further validate the generalization capabilities of our approach. As the specific task ordering can result in significant differences in CL performance (Appendix G in Rahimi-Kalahroudi et al., 2023), we implemented two randomly chosen task sequences. This approach allowed us to quantify ordering effects while limiting computational cost. A follow-up study could dedicate more time to different permutations to better understand the relationship between individual environments and to shed light on the unusually high performance of Frostbite under DV3. Additionally, the experiments could be expanded by bringing ARROW to other model-based RL algorithms, e.g., TD-MPC (Hansen et al., 2022; 2024), or by combining it with existing techniques such as the behavior cloning used in CLEAR (Rolnick et al., 2019).

7 Conclusion

We extended the DreamerV3 World Model architecture with an augmented replay buffer (ARROW) and studied continual RL in two scenarios: tasks with and without shared structure. ARROW, DreamerV3, and TES-SAC were compared using the same memory budget (same-sized buffers). We evaluated forgetting, forward and backward transfer, and stability–plasticity metrics; ARROW's augmented replay buffer yielded substantial improvements on tasks without shared structure and a minor benefit on tasks with shared structure.
Overall, these results support model-based RL with a World Model and a memory-efficient replay buffer as an effective and practical approach to continual RL, motivating future work.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. ISSN 1076-9757. doi: 10.1613/jair.3912.

Arslan Chaudhry, Albert Gordo, Puneet Dokania, Philip Torr, and David Lopez-Paz. Using hindsight to anchor past knowledge in continual learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6993–7001, May 2021. doi: 10.1609/aaai.v35i8.16861. URL https://ojs.aaai.org/index.php/AAAI/article/view/16861.

Zhiyuan Chen and Bing Liu. Continual learning and catastrophic forgetting. In Lifelong Machine Learning, pp. 55–75. Springer International Publishing, Cham, 2018. ISBN 978-3-031-01581-6. doi: 10.1007/978-3-031-01581-6_4.

Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2048–2056. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/cobbe20a.html.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1407–1416. PMLR, 10–15 Jul 2018.
URL https://proceedings.mlr.press/v80/espeholt18a.html.

Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. PathNet: Evolution channels gradient descent in super neural networks, 2017. URL https://arxiv.org/abs/1701.08734.

Robert M. French. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999. ISSN 1364-6613. doi: 10.1016/S1364-6613(99)01294-2. URL https://www.sciencedirect.com/science/article/pii/S1364661399012942.

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://papers.neurips.cc/paper_files/paper/2018/hash/2de5d16682c3c35007e4e92982f1a2ba-Abstract.html.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2555–2565. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/hafner19a.html.

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS.

Danijar Hafner, Timothy P Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=0oabwyZbOu.

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models. Nature, 640(8059):647–653, 2025.
doi: 10.1038/s41586-025-08744-2.

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Oxh5CstDJU.

Nicklas A Hansen, Hao Su, and Xiaolong Wang. Temporal difference learning for model predictive control. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 8387–8406. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/hansen22a.html.

Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017. ISSN 0896-6273. doi: 10.1016/j.neuron.2017.06.011. URL https://www.sciencedirect.com/science/article/pii/S0896627317305093.

Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, 2016. URL http://arxiv.org/abs/1606.08415.

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, and Florian Shkurti. Continual model-based reinforcement learning with hypernetworks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 799–805, 2021. doi: 10.1109/ICRA48506.2021.9560793.

David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018. doi: 10.1609/aaai.v32i1.11595. URL https://ojs.aaai.org/index.php/AAAI/article/view/11595.

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998. ISSN 0004-3702. doi: 10.1016/S0004-3702(98)00023-X. URL https://www.sciencedirect.com/science/article/pii/S000437029800023X.
Samuel Kessler, Piotr Milos, Jack Parker-Holder, and Stephen J. Roberts. The surprising effectiveness of latent world models for continual reinforcement learning. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022. URL https://openreview.net/forum?id=-lHOOgHuWwu.

Samuel Kessler, Mateusz Ostaszewski, Michał Paweł Bortkiewicz, Mateusz Żarski, Maciej Wolczyk, Jack Parker-Holder, Stephen J. Roberts, and Piotr Milos. The effectiveness of world models for continual reinforcement learning. In Sarath Chandar, Razvan Pascanu, Hanie Sedghi, and Doina Precup (eds.), Proceedings of The 2nd Conference on Lifelong Learning Agents, volume 232 of Proceedings of Machine Learning Research, pp. 184–204. PMLR, 22–25 Aug 2023. URL https://proceedings.mlr.press/v232/kessler23a.html.

Khimya Khetarpal, Matthew Riemer, Irina Rish, and Doina Precup. Towards continual reinforcement learning: A review and perspectives. Journal of Artificial Intelligence Research, 75:1401–1476, December 2022. ISSN 1076-9757. doi: 10.1613/jair.1.13673.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. doi: 10.1073/pnas.1611835114. URL https://www.pnas.org/doi/abs/10.1073/pnas.1611835114.

Matthias De Lange, Gido van de Ven, and Tinne Tuytelaars. Continual evaluation for lifelong learning: Identifying the stability gap, 2023. URL https://arxiv.org/abs/2205.13452.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. doi: 10.1162/neco.1989.1.4.541.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel.
End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016. URL http://jmlr.org/papers/v17/15-522.html.

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/f87522788a2be2d171666752f97ddebb-Paper.pdf.

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, March 2018. ISSN 1076-9757. doi: 10.1613/jair.5699.

Mackenzie Weygandt Mathis. The neocortical column as a universal template for perception and world-model learning. Nature Reviews Neuroscience, 24(1):3–3, 2023. doi: 10.1038/s41583-022-00658-6.

Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Gordon H. Bower (ed.), Psychology of Learning and Motivation, volume 24, pp. 109–165. Academic Press, 1989. doi: 10.1016/S0079-7421(08)60536-8. URL https://www.sciencedirect.com/science/article/pii/S0079742108605368.

Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4, 2013. ISSN 1664-1078. doi: 10.3389/fpsyg.2013.00504. URL https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2013.00504.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K.
Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HyxAfnA5tm.

OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning, 2019. URL https://arxiv.org/abs/1912.06680.

German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019. doi: 10.1016/j.neunet.2019.01.012.

Martin L. Puterman. Chapter 8: Markov decision processes. In Stochastic Models, volume 2 of Handbooks in Operations Research and Management Science, pp. 331–434. Elsevier, 1990. doi: 10.1016/S0927-0507(05)80172-0. URL https://www.sciencedirect.com/science/article/pii/S0927050705801720.

Ali Rahimi-Kalahroudi, Janarthanan Rajendran, Ida Momennejad, Harm van Seijen, and Sarath Chandar. Replay buffer with local forgetting for adapting to local environment changes in deep model-based reinforcement learning.
In Sarath Chandar, Razvan Pascanu, Hanie Sedghi, and Doina Precup (eds.), Proceedings of The 2nd Conference on Lifelong Learning Agents, volume 232 of Proceedings of Machine Learning Research, pp. 21–42. PMLR, 22–25 Aug 2023. URL https://proceedings.mlr.press/v232/rahimi-kalahroudi23a.html.

Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1gTShAct7.

David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/fa7cdfad1a5aaf8370ebeda47a1f1c3-Paper.pdf.

Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4528–4537. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/schwarz18a.html.

Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8583–8592. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/sekar20a.html.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012. doi: 10.1109/IROS.2012.6386109.

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. doi: 10.1038/s41586-019-1724-z.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992. doi: 10.1007/BF00992696.

Yaosheng Xu, Dailin Hu, Litian Liang, Stephen Marcus McAleer, Pieter Abbeel, and Roy Fox. Target entropy annealing for discrete soft actor-critic. In Deep RL Workshop NeurIPS 2021, 2021. URL https://openreview.net/forum?id=jJKzGBBQiZu.

Luke Yang, Levin Kuhlmann, and Gideon Kowadlo. Augmenting replay in world models for continual reinforcement learning, 2024. URL https://arxiv.org/abs/2401.16650.
A Tabular data & additional results

Variant | Procgen flag | Description
CoinRun | — | regularly rendered game
+NB | use_backgrounds = False | removes decorative backgrounds
+RT | restrict_themes = True | restricts the set of level themes
+GA | use_generated_assets = True | enables procedurally generated assets
+MA | use_monochrome_assets = True | enables monochrome assets
+CA | center_agent = False | camera does not remain centered on the agent

Table A.1: CoinRun task variations (Procgen configuration flags).

Item | Specification
Training budget | 8.84 million environment frames over 540 epochs
Frame definition | CoinRun: 1 env. step per frame; Atari: 4 env. steps per frame
Default / Reversed schedule | 90 epochs per task, then shift to the next task (single pass over 540 epochs)
Two-cycle schedule | Task shift every 45 epochs; cycle 1 ends at 270 epochs; cycle 2 uses the remaining 270 epochs
Evaluation frequency | Every 10 epochs (55 checkpoints over 540 epochs, including epoch 0)
Evaluation scope | All tasks in the current sequence at each checkpoint
Policy for evaluation | Stochastic policy (random policy at epoch 0, trained policy thereafter)
Rollouts per task | CoinRun: 256; Atari: 16
Return computation | Identify episode boundaries from reset and continuation flags; sum rewards within each episode
Reported statistics | Mean and standard deviation of episode returns per task

Table A.2: Training schedule and evaluation protocol.

A.1 Single task sample efficiency

Tab. A.7 presents sample efficiency for individual Atari games learned in isolation. The results reveal heterogeneous task difficulty. In all comparisons where both model-based baselines reach the task threshold, ARROW proves to be more sample efficient. These single-task results establish important context: both ARROW and DreamerV3 struggle with certain games even in isolation, indicating that continual learning performance differences reflect both catastrophic forgetting and inherent task difficulty. The complementary failure modes (ARROW on Ms.
Pac-Man/Seaquest, DreamerV3 on Frostbite) suggest different biases in how each method interacts with the game characteristics. The TES-SAC baseline fails to reach the 85% threshold in every single Atari task.

Tab. A.8 presents single-task results for all six CoinRun variants. Both ARROW and DreamerV3 successfully reach the 85% threshold on all variants. However, certain variants seem to be more difficult than others, with steps ranging from about 500k (basic CoinRun) to 1.1M (+NB+RT+GA). ARROW outperforms DreamerV3 in all but two variants (+NB+RT+GA and +NB+RT+GA+MA+CA). Interestingly, for the particular variants and the infrequent cases where the TES-SAC algorithm successfully reaches the 85% threshold, it can be considered competitive with regard to sample efficiency.

A.2 Validation of DreamerV3 implementation

To validate that our implementation faithfully represents DreamerV3, we compared it with the authors' open-source implementation (https://github.com/danijar/dreamerv3). Both implementations were evaluated on four tasks without shared structure. Fig. A.1 shows a side-by-side comparison, with our implementation on the left and the authors' implementation on the right.

Figure A.1: Performance of DreamerV3, with bold line segments denoting the periods in which certain tasks are being trained. Scores are normalized using min-max normalization. The line is the median and the shaded area spans the 0.25 and 0.75 quantiles of 5 seeds.

A.3 Single-task runs

The parameters used for single-task runs are shown in Tab. A.9 for CoinRun (shared structure) and in Tab. A.10 for Atari (without shared structure). The single-task results and the reward scales for Atari (without shared structure) are shown in Tab. A.11. The single-task results for CoinRun (shared structure) are shown in Tab. A.12.

B World Models

B.1 Training algorithm

Algorithm 1: ARROW training algorithm
Hyperparameters: World Model training iterations K.
Input: World Model M, augmented replay buffer D, sequence of tasks T_{1:T} = (τ_1, τ_2, ..., τ_T).
for τ = τ_1, τ_2, ..., τ_T do
    for i = 1, 2, ..., K do
        Train World Model M on D.
        Train actor π using M.
        Use π in τ and append episodes to D.
    end for
end for

B.2 Network architecture

The actor-critic architecture is shown in Fig. B.1. We adhered to most of the parameters and architectural choices of DreamerV3. Changes were primarily made to reduce wall time, as running continual learning experiments is computationally expensive. See Tab. C.1.

CNN encoder and decoder. Following DreamerV3, the convolutional feature extractor's input is a 64×64 RGB image, a resized environment frame. The encoder convolutional neural network (CNN) (LeCun et al., 1989) consists of stride-2 convolutions of doubling depth with "same" padding until the image reaches a resolution of 4×4, where it is flattened. We elected to use the "small" configuration of the hyperparameters controlling network architecture from DreamerV3 to appropriately manage experiment wall time. Hence, 4 convolutional layers were used with depths of 32, 64, 128, and 256, respectively. As with DreamerV3, we used channel-wise layer normalization (Ba et al., 2016) and SiLU (Hendrycks & Gimpel, 2016) activation for the CNN. The CNN decoder performs a linear transformation of the model state to a 4×4×256 = 4096 vector before reshaping to a 4×4 image and inverting the encoder architecture to reconstruct the original environment frame.

Figure B.1: Actor-critic definition: features are flattened and concatenated, then fed to the actor MLP (with action sampling) and the critic MLP (producing the value estimate).

MLP. All multi-layer perceptrons (MLPs) within the RSSM, actor, and critic have 2 layers and 512 hidden units, in accordance with the "small" configuration of DreamerV3.

B.3 Augmented replay buffer

Algorithm 2: Sampling from combined buffers D_1 and D_2
Combined (augmented) buffer D = (D_1, D_2).
Uniformly sample i ∈ {1, 2}.
return Sampled minibatch from D_i.
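Algorithm 2 and the 50/50 buffer split can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the short-term buffer is FIFO, and the long-term buffer here uses reservoir sampling as a stand-in for ARROW's distribution-matching selection, which is not fully specified in this excerpt; the class and its `rng` parameter are hypothetical.

```python
import random
from collections import deque

class AugmentedReplayBuffer:
    """Two fixed-capacity buffers sharing one memory budget (50/50 split).

    D1: short-term FIFO buffer of recent experience.
    D2: long-term buffer; reservoir sampling is an illustrative stand-in
        for ARROW's distribution-matching retention of task diversity.
    """

    def __init__(self, capacity, rng=None):
        half = capacity // 2
        self.short = deque(maxlen=half)   # D1: recent episodes, FIFO eviction
        self.long = []                    # D2: preserved past episodes
        self.long_cap = half
        self.seen = 0                     # total episodes ever appended
        self.rng = rng or random.Random()

    def append(self, episode):
        self.short.append(episode)
        # Reservoir sampling: every episode ever seen has an equal chance
        # of residing in D2, regardless of recency.
        self.seen += 1
        if len(self.long) < self.long_cap:
            self.long.append(episode)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.long_cap:
                self.long[j] = episode

    def sample(self, batch_size):
        """Algorithm 2: pick D1 or D2 uniformly, then draw a minibatch."""
        source = self.short if self.rng.random() < 0.5 else self.long
        source = source if source else self.short  # fall back if D2 is empty
        return self.rng.sample(list(source), min(batch_size, len(source)))

# Usage: a budget of 8 episodes is split 4/4; after 20 episodes the
# short-term buffer holds only the most recent four.
buf = AugmentedReplayBuffer(capacity=8, rng=random.Random(0))
for ep in range(20):
    buf.append(ep)
print(len(buf.short), len(buf.long))  # → 4 4
```

Uniformly choosing between the two buffers means old and recent experience each account for half of the World Model's training data in expectation, independent of how many tasks have elapsed.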
C Experimental Details

C.1 Experimental parameters and execution time

General training and experimental parameters are provided in Tab. C.2. TES-SAC hyperparameters are provided in Tab. C.3. The experimental run breakdown and wall-clock times are detailed in Tab. C.4 and Tab. C.5, respectively.

Default task order
Method | Forgetting ↓ | FT ↑ | ACC ↑ | min-ACC ↑ | WC-ACC ↑
ARROW | 0.197 [-0.017 – 0.430] | -0.157 [-0.589 – 0.179] | 0.659 [0.652 – 0.788] | 0.712 [0.570 – 0.723] | 0.615 [0.493 – 0.618]
DV3 | 1.217 [1.057 – 1.745] | 0.046 [-0.323 – 0.206] | -0.017 [-0.048 – 0.131] | -0.324 [-0.361 – -0.259] | -0.169 [-0.182 – -0.157]
TES-SAC | 0.194 [0.033 – 0.365] | -0.901 [-0.990 – -0.594] | -0.008 [-0.034 – 0.033] | -0.183 [-0.191 – -0.156] | -0.143 [-0.152 – -0.119]

Reversed task order
Method | Forgetting ↓ | FT ↑ | ACC ↑ | min-ACC ↑ | WC-ACC ↑
ARROW | 0.039 [-0.050 – 0.423] | -0.077 [-0.449 – 0.161] | 0.962 [0.691 – 1.309] | 0.647 [0.619 – 0.674] | 0.618 [0.583 – 0.643]
DV3 | 1.348 [1.083 – 1.479] | 0.057 [-0.150 – 0.256] | -0.020 [-0.031 – 0.114] | -0.291 [-0.311 – -0.264] | -0.076 [-0.130 – -0.069]
TES-SAC | 0.039 [0.007 – 0.558] | -0.954 [-1.063 – -0.461] | -0.014 [-0.052 – 0.012] | -0.223 [-0.232 – -0.154] | -0.148 [-0.158 – -0.123]

Two-cycle training
Method | C1-F ↓ | C2-F ↓ | Max-F ↓
ARROW | -0.036 [-0.260 – 0.101] | 0.030 [-0.047 – 0.202] | 0.012 [-0.139 – 0.155]
DV3 | 0.722 [0.415 – 1.256] | 0.378 [0.218 – 0.979] | 0.735 [0.403 – 1.387]
TES-SAC | 0.194 [0.113 – 0.305] | 0.112 [0.017 – 0.232] | 0.089 [0.004 – 0.263]
Method | C1-FT ↑ | C2-FT ↑ | Recovery ↑
ARROW | -0.554 [-0.854 – -0.228] | 0.309 [-0.490 – 0.793] | 1.418 [1.031 – 1.989]
DV3 | -0.514 [-0.773 – -0.026] | -0.750 [-0.918 – -0.028] | 0.610 [0.298 – 1.161]
TES-SAC | -0.898 [-1.017 – -0.660] | -0.882 [-1.045 – -0.699] | 0.767 [0.142 – 1.234]
Method | ACC ↑ | min-ACC ↑ | WC-ACC ↑
ARROW | 0.796 [0.777 – 0.845] | 0.442 [0.389 – 0.568] | 0.388 [0.378 – 0.521]
DV3 | 0.009 [0.001 – 0.100] | -0.393 [-0.499 – -0.304] | -0.299 [-0.391 – -0.218]
TES-SAC | 0.044 [0.012 – 0.097] | -0.203 [-0.288 – -0.201] | -0.168 [-0.235 – -0.165]

Table A.3: Atari metrics using median [IQR]
across seeds for (A,B and C). Best performance is written in bold. 22 Default task order Method Forgetting ↓ FT ↑ ACC ↑ min-ACC ↑ WC-ACC ↑ ARROW 0.407 [0.194 – 0.569] 0.507 [0.154 – 0.815] 0.792 [0.656 – 1.003] 0.612 [0.599 – 0.762] 0.635 [0.617 – 0.731] DV3 0.560 [0.362 – 0.743] 0.787 [0.351 – 0.970] 0.979 [0.872 – 0.992] 0.125 [-0.656 – 0.299] 0.328 [-0.353 – 0.435] TES-SAC -0.012 [-0.122 – 0.147] -0.069 [-0.114 – 0.131] 0.684 [0.677 – 0.871] 0.510 [0.503 – 0.515] 0.596 [0.576 – 0.610] Reversed task order Method Forgetting ↓ FT ↑ ACC ↑ min-ACC ↑ WC-ACC ↑ ARROW 0.000 [-0.165 – 0.043] 0.715 [0.227 – 1.032] 1.287 [1.285 – 1.333] 0.940 [0.875 – 1.055] 1.026 [0.925 – 1.115] DV3 0.355 [0.211 – 0.636] 0.786 [0.236 – 1.186] 1.027 [0.795 – 1.244] 0.712 [0.590 – 0.830] 0.797 [0.719 – 0.928] TES-SAC -0.029 [-0.058 – 0.054] 0.006 [-0.143 – 0.155] 0.766 [0.752 – 0.802] 0.548 [0.526 – 0.552] 0.584 [0.580 – 0.608] Two-cycle training Method C1-F ↓ C2-F ↓ Max-F ↓ ARROW -0.111 [-0.196 – 0.010] 0.099 [-0.086 – 0.155] -0.089 [-0.254 – 0.187] DV3 0.521 [0.127 – 1.127] 0.528 [0.306 – 0.740] 0.233 [-0.155 – 0.507] TES-SAC 0.022 [-0.020 – 0.108] -0.039 [-0.127 – 0.039] -0.026 [-0.200 – 0.037] C1-FT ↑ C2-FT ↑ Recovery ↑ ARROW 0.401 [0.048 – 0.531] 0.717 [0.585 – 0.931] 1.184 [1.069 – 1.298] DV3 0.401 [-0.021 – 0.834] 0.638 [0.453 – 1.005] 1.159 [1.070 – 1.288] TES-SAC -0.027 [-0.148 – 0.217] 0.150 [-0.022 – 0.381] 1.090 [0.925 – 1.278] ACC ↑ min-ACC ↑ WC-ACC ↑ ARROW 1.331 [1.221 – 1.344] 0.933 [0.919 – 0.977] 0.912 [0.863 – 0.922] DV3 0.633 [0.561 – 0.669] -0.117 [-0.177 – 0.144] 0.071 [-0.020 – 0.264] TES-SAC 0.883 [0.790 – 0.945] 0.482 [0.337 – 0.575] 0.600 [0.423 – 0.686] Table A.4: CoinRun Metrics using median [IQR] across seeds for (A,B and C). Best performance is written in bold. 23 85% Threshold Frame at Middle Method Max Perf.Env. 
Frames (median [q25–q75])Runs ≥85% Default Task Order 2.024,423,680 ARROW0.88Never reached threshold0/5 DV32.385,652,480 [5,447,680 – 5,775,360]4/5 TES-SAC0.16Never reached threshold0/5 Reversed Task Order 0.824,423,680 ARROW0.965,079,040 [3,563,520 – 7,045,120]4/5 DV30.31Never reached threshold0/5 TES-SAC0.15Never reached threshold0/5 Two-Cycle Training 1.174,423,680 ARROW1.043,440,640 [3,194,880 – 3,522,560]3/5 DV31.383,112,960 [2,867,200 – 3,112,960]3/5 TES-SAC0.18Never reached threshold0/5 Table A.5: Atari (Normalized) continual learning sample-efficiency. Best performance is written in bold. 85% Threshold Frame at Middle Method Max Perf.Env. Frames (median [q25–q75])Runs ≥85% Default Task Order 1.024,423,680 ARROW1.191,638,400 [655,360 – 2,293,760]5/5 DV31.19491,520 [327,680 – 819,200]5/5 TES-SAC0.86Never reached threshold0/5 Reversed Task Order 1.144,423,680 ARROW1.343,768,320 [3,604,480 – 4,096,000]5/5 DV31.283,604,480 [3,276,800 – 4,751,360]5/5 TES-SAC0.85Never reached threshold0/5 Two-Cycle Training 1.134,423,680 ARROW1.333,768,320 [2,949,120 – 3,768,320]5/5 DV31.231,802,240 [1,474,560 – 1,802,240]5/5 TES-SAC0.90Never reached threshold0/5 Table A.6: CoinRun (Normalized) continual learning sample-efficiency. Best performance is written in bold. TaskTask-Specific Threshold (85%)MethodMax Perf.Env. Frames (median [q25–q75])Runs≥85% Ms. 
Pac-Man131.43 ARROW96.26Never reached 85% of peak0/5 DV3154.621,310,720 [1,310,720 – 1,310,720]1/5 TES-SAC46.00Never reached 85% of peak0/5 Boxing84.05 ARROW98.88819,200 [655,360 – 983,040]5/5 DV398.71983,040 [819,200 – 983,040]5/5 TES-SAC3.12Never reached 85% of peak0/5 Crazy Climber109.27 ARROW128.56983,040 [983,040 – 983,040]1/5 DV3118.881,228,800 [1,105,920 – 1,351,680]4/5 TES-SAC12.31Never reached 85% of peak0/5 Frostbite68.96 ARROW81.121,474,560 [1,474,560 – 1,474,560]1/5 DV355.42Never reached 85% of peak0/5 TES-SAC42.12Never reached 85% of peak0/5 Seaquest533.18 ARROW401.05Never reached 85% of peak0/5 DV3627.271,310,720 [1,310,720 – 1,310,720]1/5 TES-SAC258.75Never reached 85% of peak0/5 Enduro339.06 ARROW398.891,146,880 [1,146,880 – 1,474,560]5/5 DV3374.101,228,800 [1,146,880 – 1,351,680]4/5 TES-SAC15.50Never reached 85% of peak0/5 Table A.7: Atari single-task sample-efficiency (raw rewards). Best performance is written in bold. 24 TaskTask-Specific Threshold (85%) Method Max Perf. Env. Frames (median [q25–q75]) Runs≥85% CoinRun6.08 ARROW6.99491,520 [491,520 – 819,200]5/5 DV37.15491,520 [491,520 – 819,200]5/5 TES-SAC6.56655,360 [655,360 – 655,360]2/5 +NB6.84 ARROW7.46983,040 [901,120 – 983,040]4/5 DV38.051,146,880 [983,040 – 1,310,720]5/5 TES-SAC6.45Never reached threshold0/5 +NB+RT6.24 ARROW7.34819,200 [655,360 – 983,040]5/5 DV37.30983,040 [983,040 – 1,146,880]5/5 TES-SAC6.52163,840 [163,840 – 163,840]1/5 +NB+RT+GA7.04 ARROW8.281,310,720 [1,146,880 – 1,392,640]3/5 DV38.051,146,880 [983,040 – 1,146,880]5/5 TES-SAC6.64Never reached threshold0/5 +NB+RT+GA+MA6.97 ARROW8.20491,520 [491,520 – 983,040]5/5 DV37.89983,040 [819,200 – 983,040]5/5 TES-SAC6.02Never reached threshold0/5 +NB+RT+GA+MA+CA5.45 ARROW6.25491,520 [491,520 – 655,360]5/5 DV36.41491,520 [327,680 – 655,360]5/5 TES-SAC6.33573,440 [532,480 – 614,400]2/5 Table A.8: CoinRun single-task sample-efficiency (raw rewards). Best performance is written in bold. Env. frames Env. 
| Method | Env. frames | Env. steps | Replay buffer capacity |
|---|---|---|---|
| ARROW | 1.47M | 1.47M | 2 × 512 sequences × T = 512 (2^19 obs.) |
| DreamerV3 | 1.47M | 1.47M | 1024 sequences × T = 512 (2^19 obs.) |
| TES-SAC | 1.47M | 1.47M | 1024 sequences × T = 512 (2^19 obs.) |

Table A.9: CoinRun single-task training parameters.

| Method | Env. frames | Env. steps | Replay buffer capacity |
|---|---|---|---|
| ARROW | 1.47M | 5.89M | 2 × 512 sequences × T = 512 (2^19 obs.) |
| DreamerV3 | 1.47M | 5.89M | 1024 sequences × T = 512 (2^19 obs.) |
| TES-SAC | 1.47M | 5.89M | 1024 sequences × T = 512 (2^19 obs.) |

Table A.10: Atari single-task training parameters.

| Task | Reward scale | Random | ARROW |
|---|---|---|---|
| Ms. Pac-Man | 0.05 | 12.40 | 1540.30 |
| Boxing | 1 | 0.51 | 90.27 |
| Crazy Climber | 0.001 | 7.49 | 109245.16 |
| Frostbite | 0.2 | 14.48 | 297.83 |
| Seaquest | 0.5 | 38.47 | 439.62 |
| Enduro | 0.5 | 0.01 | 707.47 |

Table A.11: Atari single-task experimental results, median across 5 random seeds. Scores are unnormalized at the end of training.

| Task | Random | ARROW |
|---|---|---|
| CoinRun | 2.78 | 6.09 |
| +NB | 2.45 | 7.14 |
| +NB+RT | 2.70 | 6.89 |
| +NB+RT+GA | 2.62 | 6.85 |
| +NB+RT+GA+MA | 2.50 | 7.89 |
| +NB+RT+GA+MA+CA | 2.69 | 5.78 |

Table A.12: Procgen CoinRun single-task experimental results, median across 5 random seeds. Scores are unnormalized at the end of training.

| Name | ARROW | DreamerV3 | TES-SAC |
|---|---|---|---|
| Replay capacity (FIFO) | 0.26M | 0.52M | 0.52M |
| Replay capacity (long-term) | 0.26M | 0 | 0 |
| Batch size | 16 | 16 | 128 |
| Learning rate | 1 × 10⁻⁴ | 1 × 10⁻⁴ | 5 × 10⁻⁴ |
| Activation (MLP) | LayerNorm+SiLU | LayerNorm+SiLU | ReLU |
| Activation (GRU) | tanh | tanh | – |
| GRU units | 512 | 512 | – |
| MLP features | 512 | 512 | 512 |
| MLP layers | 2 | 2 | 2 |
| CNN depth | 32 | 32 | 32 |

Table C.1: Hyperparameters.

| Parameter | Atari | CoinRun |
|---|---|---|
| **Data Collection** | | |
| Parallel environments | 4 | 4 |
| Sequence length | 4096 | 4096 |
| Environment repeat | 4 | 1 |
| Data sequences per batch | 32 | 32 |
| Max sequences in buffer | 512 | 512 |
| Sequence time steps | 512 | 512 |
| **Minibatch Configuration** | | |
| Minibatch time size | 32 | 32 |
| Minibatch number size | 16 | 16 |
| **Environment-Specific** | | |
| Action space size | 18 | 15 |
| Image size | 64 × 64 | 64 × 64 |

Table C.2: General training and experimental parameters.
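The replay capacities above (Tables A.9, A.10, and C.1) split ARROW's 2^19-observation budget evenly between a short-term FIFO buffer (0.26M observations) and a long-term buffer (0.26M) that preserves task diversity. As a rough illustration of that dual-buffer layout, the sketch below uses plain reservoir sampling as a stand-in for ARROW's distribution-matching selection rule, which is not reproduced here; the class and method names are illustrative, not from the paper.

```python
import random
from collections import deque


class DualReplayBuffer:
    """Illustrative short-term FIFO plus long-term reservoir buffer.

    Reservoir sampling is an assumption standing in for ARROW's
    distribution-matching rule: it keeps an unbiased sample of every
    sequence seen so far, so old tasks stay represented.
    """

    def __init__(self, fifo_capacity, longterm_capacity, seed=0):
        self.fifo = deque(maxlen=fifo_capacity)  # recent experience only
        self.longterm = []                       # preserved diversity
        self.longterm_capacity = longterm_capacity
        self.seen = 0                            # sequences observed so far
        self.rng = random.Random(seed)

    def add(self, sequence):
        self.fifo.append(sequence)               # deque evicts oldest itself
        self.seen += 1
        if len(self.longterm) < self.longterm_capacity:
            self.longterm.append(sequence)
        else:
            # Reservoir step: each new sequence replaces a random slot
            # with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.longterm_capacity:
                self.longterm[j] = sequence

    def sample(self, batch_size):
        # Draw roughly half the batch from each buffer, mirroring the
        # 50/50 capacity split in Table C.1 (the mixing ratio used at
        # sampling time is an assumption).
        half = batch_size // 2
        batch = self.rng.sample(list(self.fifo), min(half, len(self.fifo)))
        batch += self.rng.sample(
            self.longterm, min(batch_size - len(batch), len(self.longterm)))
        return batch
```

With the paper's capacities this would be instantiated as `DualReplayBuffer(512, 512)` at the sequence level; the FIFO half then behaves like a standard DreamerV3 buffer at half size, while the long-term half retains sequences from tasks that have long since left the FIFO window.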
| Parameter | Value |
|---|---|
| **Learning Rates** | |
| Policy learning rate | 5 × 10⁻⁴ |
| Q-network learning rate | 5 × 10⁻⁴ |
| Entropy coefficient (alpha) learning rate | 5 × 10⁻⁴ |
| **Core SAC Hyperparameters** | |
| Batch size | 128 |
| Discount factor (gamma) | 0.99 |
| Soft target update (tau) | 0.005 |
| Initial entropy coefficient (alpha) | 0.2 |
| Target entropy | 0.8 × log(\|A\|) |
| Gradient clipping | 5.0 |
| **Target Entropy Scheduling (TES-SAC)** | |
| TES lambda | 0.999 |
| Average threshold | 0.01 |
| Standard deviation threshold | 0.05 |
| Discount factor k | 0.98 |
| TES period T | 1000 |

Table C.3: TES-SAC hyperparameters.

| Benchmark | Task | Runs | Total |
|---|---|---|---|
| Atari | Single-task | 6 tasks × 5 seeds | 30 |
| | Continual learning | 3 settings (A, B, C) × 5 seeds | 15 |
| CoinRun | Single-task | 6 tasks × 5 seeds | 30 |
| | Continual learning | 3 settings (A, B, C) × 5 seeds | 15 |
| Per method, per benchmark | | 30 + 15 | 45 |
| Per benchmark, all methods (ARROW, DV3, TES-SAC) | | 45 runs × 3 methods | 135 |
| Total (Atari + CoinRun, all methods) | | 135 runs × 2 benchmarks | 270 |

Table C.4: Experimental run breakdown across benchmarks and methods.

All experiments were executed on a single NVIDIA A40 or A100 GPU (depending on availability).

| Method | Single-task wall time | Continual learning wall time |
|---|---|---|
| ARROW / DV3 | 6 hours | 1–2 days (CoinRun CL: ∼30 hours; Atari CL: ∼50 hours) |
| TES-SAC | 3 hours | 17 hours |

Table C.5: Wall-clock time per method and setting.
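All of the A-series tables report results as median [q25 – q75] across the five seeds. That aggregation can be sketched with only the standard library; the paper does not state its quantile interpolation convention, so the default "exclusive" method of `statistics.quantiles` is an assumption here, and the input scores are hypothetical, not values from the paper.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Collapse per-seed scores into the median [q25 - q75] form used in
    Tables A.3-A.8. quantiles(..., n=4) returns the three quartile cut
    points; the outer two are q25 and q75."""
    q25, _, q75 = quantiles(scores, n=4)
    return median(scores), q25, q75


# Hypothetical max-performance scores from 5 seeds.
m, lo, hi = median_iqr([0.88, 0.96, 1.04, 0.92, 1.01])
print(f"{m:.3f} [{lo:.3f} - {hi:.3f}]")  # prints 0.960 [0.900 - 1.025]
```

Note that different interpolation conventions (e.g. `method="inclusive"`, or NumPy's default linear percentile) give slightly different quartiles on only five samples, so small discrepancies against the tables would not be meaningful.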