Paper deep dive
A Reference Architecture of Reinforcement Learning Frameworks
Xiaoran Liu, Istvan David
Abstract
The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patterns of these frameworks are inconsistent across implementations and there exists no reference architecture (RA) to form a common basis of comparison, evaluation, and integration. To address this gap, we propose an RA of RL frameworks. Through a grounded theory approach, we analyze 18 state-of-the-practice RL frameworks and, by that, we identify recurring architectural components and their relationships, and codify them in an RA. To demonstrate our RA, we reconstruct characteristic RL patterns. Finally, we identify architectural trends, e.g., commonly used components, and outline paths to improving RL frameworks.
Tags
Links
- Source: https://arxiv.org/abs/2603.06413v1
- Canonical: https://arxiv.org/abs/2603.06413v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:19:21 AM
Summary
This paper proposes a Reference Architecture (RA) for Reinforcement Learning (RL) frameworks, developed through a grounded theory analysis of 18 state-of-the-practice RL frameworks. The RA identifies recurring architectural components—categorized into Framework, Framework Core, Environment, and Utilities—to address terminological inconsistencies and provide a common basis for comparison, evaluation, and integration of RL systems.
Entities (7)
Relation Signals (3)
Grounded Theory → used to analyze → Reinforcement Learning Frameworks
confidence 100% · Through a grounded theory approach, we analyze 18 state-of-the-practice RL frameworks
Experiment Orchestrator → contains → Experiment Manager
confidence 95% · The Experiment Orchestrator consists of three components (Fig. 2, Tab. I)... Experiment Manager
Reference Architecture → comprises → Framework
confidence 90% · The high-level overview of the RA is shown in Fig. 1. It contains six top-level components organized into four component groups: the Framework...
Cypher Suggestions (2)
Find all components belonging to the Framework group · confidence 90% · unvalidated
MATCH (c:Component)-[:BELONGS_TO]->(g:ComponentGroup {name: 'Framework'}) RETURN c.name
List all RL frameworks analyzed in the study · confidence 85% · unvalidated
MATCH (f:Framework) WHERE f.analyzed = true RETURN f.name
Full Text
80,298 characters extracted from source content.
A Reference Architecture of Reinforcement Learning Frameworks
Xiaoran Liu∗, Istvan David∗†
∗McMaster University, Hamilton, Canada
†McMaster Centre for Software Certification, Hamilton, Canada

Abstract—The surge in reinforcement learning (RL) applications gave rise to diverse supporting technology, such as RL frameworks. However, the architectural patterns of these frameworks are inconsistent across implementations and there exists no reference architecture (RA) to form a common basis of comparison, evaluation, and integration. To address this gap, we propose an RA of RL frameworks. Through a grounded theory approach, we analyze 18 state-of-the-practice RL frameworks and, by that, we identify recurring architectural components and their relationships, and codify them in an RA. To demonstrate our RA, we reconstruct characteristic RL patterns. Finally, we identify architectural trends, e.g., commonly used components, and outline paths to improving RL frameworks.

Index Terms—AI, architecture, grounded theory, machine learning, reinforcement learning, simulation

I. INTRODUCTION

Reinforcement learning (RL) [1] has become one of the most widely used machine learning (ML) techniques [2], being adopted in an expanding array of fields from digital twins [3] to model-driven engineering [4], outside of its traditional application area of robotics [5]. In RL, learning is achieved through a process of trial-and-error, in which an agent takes actions and assesses the utility of those actions, to reinforce the beneficial ones. Virtual training frameworks [6] offer a safe and cost-efficient way to train RL agents, by modeling the real world and allowing the agents to interact with this model during training, instead of interacting with the real world [7]. In response to the surge in RL applications [8], [9], numerous RL frameworks have appeared [10].
However, due to the lack of common design guidelines and standards, RL frameworks exhibit diverse architectures and organization of components. This diversity gives rise to inconsistent abstractions among RL frameworks, hinders the reuse of solutions across frameworks, and poses a challenging learning curve for adopters. Although partial architectural solutions exist, e.g., abstractions for distributed RL [F12], support for modularizing RL algorithms [F13], and decoupling simulators in specific RL frameworks [11], the need for a comprehensive architectural understanding of RL remains unaddressed. The closest to addressing this need is the work by Ntentos et al. [12] who propose an Architectural Design Decisions model for developing RL architectures.

To understand the architectures of actual RL frameworks, a reference architecture (RA) of existing implementations is needed. Reflecting on the current practices in RL frameworks allows for identifying key architectural tendencies, limitations, and opportunities for improvement. Therefore, in this work, we analyze 18 frequently used open-source RL frameworks, derive an RA from our analysis, and demonstrate its utility by reconstructing the architectures of characteristic RL patterns.

The importance of comprehensive RAs in RL cannot be overstated. Software engineers and ML developers rely on RL frameworks when they develop and integrate RL functionality into production software systems. Without proper architectural understanding, the quality assessment [13], dependency management [14], certification [15], and delivery [16] of these systems becomes a formidable challenge.

Contributions. We make the following contributions.
• We develop a reference architecture for RL environments;
• we clarify RL architectural concepts that are often used interchangeably and incorrectly;
• we reconstruct RL patterns to demonstrate the RA;
• we identify architectural tendencies in RL frameworks.
Our contributions benefit (i) RL framework developers by clearly delineating architectural components and providing a blueprint for mapping RL processes onto these components; (ii) adopters by providing a basis for comparing RL frameworks; and (iii) ML engineers by aiding modularity and reusability of their RL pipelines.

Open science. To enable the independent verification and reuse of our results, we publish a data package as an Open Research Object on Zenodo: https://zenodo.org/records/18637532.

Author pre-print. Publication accepted for ICSA 2026. arXiv:2603.06413v1 [cs.SE] 6 Mar 2026

II. BACKGROUND AND RELATED WORK

A. Reinforcement learning and supporting infrastructure

Reinforcement learning (RL) [1] is a machine learning (ML) paradigm formalized by Markov decision processes [17], in which the agent interacts with the environment to learn the optimal strategies for making sequential decisions. The agent observes the current state and selects an action according to its decision function—often referred to as a policy—which maps its experienced history of observations to the action. The environment transitions to a new state and produces a reward based on the agent's action. The RL agent refines its decision function through an iterative learning process that requires large volumes of interaction with the environment. This interaction can be operationalized in physical or virtual environments. Agents can be trained directly on physical systems, e.g., learning manipulation skills on actual robotic hardware [18], or refining control policies in autonomous vehicles [19]. However, real-world training is challenged by safety risks, high costs, and privacy constraints [20]. Virtual environments overcome these barriers by situating the training process in the virtual space, i.e., in silico. By that, virtual environments enable safe and cost-efficient agent learning.

B. Terminology blurring in RL infrastructures

RL development depends on more than the environment alone. It requires additional components, such as training configurations, learning algorithms, and supporting utilities for data collection, logging, and result evaluation. These broader systems—commonly referred to as frameworks—integrate the environment with the necessary training-related components. The architectural boundaries that distinguish what constitutes an environment, a framework, or a system are loosely defined in RL practice. First, the distinction between the environment and the simulator is often blurred. For example, CARLA is an open-source autonomous driving simulator featuring custom-designed digital assets (e.g., vehicles, buildings, road layouts) that reflect real-world scales and properties [21]. However, it is frequently referred to as an environment [22], [23]. Simulators are sometimes quoted as ideal environments for learning [24], demonstrating heavy terminological blurring. Second, RL algorithms and frameworks are often not properly separated. For example, Dopamine is a research framework for fast prototyping of RL algorithms [F17], yet is referenced as an algorithm library in related literature [22]. The lack of clarity creates a barrier to engineering RL frameworks, and underscores the need for an RA that clarifies how architectural primitives are organized into coarser-grained entities, e.g., environment, framework core, framework. In this paper, unless ambiguous, we use the term RL framework inclusive of RL environments and utilities.

C. Related work

Recent work has begun to address terminological and architectural ambiguities in RL frameworks from different perspectives. However, there is no general reference architecture that captures common patterns across RL frameworks. Schuderer et al.
[11] propose Sim-Env, a workflow and tool for decoupling OpenAI Gym [25] environments from simulators and models to allow for swapping RL environments while preserving the underlying simulator. However, their approach is focused on simulation concerns in a simple single-agent environment, and ignores various other flavors of RL, e.g., multi-agent reinforcement learning (MARL) [26].

Ntentos et al. [12] propose an Architectural Design Decisions (ADDs) model for RL architectures, identifying decision options, relations, and decision drivers for training strategies, such as single versus multi-agent configurations, and checkpoint usage. Their work provides valuable guidance on architectural choices based on academic and gray literature; our work further investigates the RL architecture by deconstructing existing implementations.

Balhara et al. [27] conduct a systematic literature review (SLR) to analyze different deep reinforcement learning (DRL) algorithms and their architectures. However, they focus on DRL algorithm architectures and neural network (NN) structures, rather than proposing a general architecture.

Some works propose architectures for specific RL toolkits. For example, Hu et al. [F14] present MARLlib's MARL architecture, and Hoffman et al. [F13] present Acme's distributed learning architecture. However, these proposals are implementation-specific and do not generalize across different RL frameworks. To address these limitations, in this work, we propose a general RA for RL by analyzing existing RL frameworks.

III. METHODOLOGY

In this section, we design a study for recovering architectures of RL frameworks. We use grounded theory (GT), the method of inductive generation of a theory (here, a general architecture of RL frameworks) from data [28]. GT involves simultaneous data collection and analysis through iterative interpretation, aiming to construct a theory rooted in the collected data [12].
The appeal of GT lies in its general applicability over different types of data, including qualitative, quantitative, semi-structured, interviews, etc. [29]; and in this work, specifically, source code and design documentation encountered in open-source repositories. GT has been proven to be of high utility in recovering architectures [30] and architectural decision points [12] previously.

A. Iterative coding and data collection

We employ the Strauss-Corbin flavor of GT [31], in which three coding phases of open, axial, and selective coding are used, to produce a detailed, explanatory theory of RL frameworks' architectures. In the open coding phase, we review source code, configuration files, and documentation of RL frameworks, categorizing implementation details into conceptual labels, such as Algorithm, Optimizer, and Learner. In the axial coding phase, we cluster related labels into architectural components and identify component interactions. For example, we group Algorithm, Optimizer, and Learner under the Learner component due to their strongly related functionality, and identify their relationships with the Buffer and Function Approximator components to form the Agent component. In the selective coding phase, we refine and integrate components into the theory of RL frameworks' architecture, encompassing environment design, simulator integration, agent-environment interaction, and training orchestration.

Following the principle of immediate and continuous data analysis of GT [32], we conduct these coding steps iteratively, i.e., after a coding phase, we constantly compare data, memos, labels, and categories across sources. We implement memos through detailed notes maintained in the analysis spreadsheet, capturing our thought processes, interpretations, and reasoning to ensure traceability of labels and categories to their origins.
To mitigate threats to validity stemming from the researchers' biases, we facilitate constant cross-verification steps among the researchers, and discussions when agreement is not immediate.

B. Sampling

We sample RL environments and frameworks. At this stage of the study, we could only rely on the usual imprecise classification of RL systems (see Sec. II-B) and sampled both environments and frameworks, and reach saturation in both classes of systems—i.e., the point where new data no longer yields new insights to the theory [33]. Saturation in environments is reached after analyzing five of them ([F1], [F2], [F3], [F4], [F5]); subsequent environments ([F6], [F7], [F8], [F9]) confirm the same categories and relationships without adding new architectural elements. Saturation in frameworks is reached after analyzing six of them ([F12], [F13], [F14], [F10], [F11], [F15]); subsequent training sources ([F17], [F16], [F18]) confirm existing categories without yielding new insights.

We drive our sampling through our domain understanding as a heuristic and aim to cover a wide range of intents and usage patterns early on. We start with Gymnasium [F1], a single-agent RL environment. Subsequently, we open up our analysis to multi-agent environments by sampling PettingZoo [F2]. This allows us to investigate inter-agent coordination mechanisms. Then, we open up our investigation to frameworks beyond RL environments and investigate more complex RL frameworks, such as RLLib [F12]. This allows us to investigate how multi-agent coordination extends beyond training, specifically into policy management and execution. Eventually, we sample and analyze 18 RL frameworks. This sample features the most widely used RL systems in both research and practice, evidenced by a mix of peer-reviewed literature [6], [22], [34] and community-curated collections.¹

¹ https://github.com/awesomelistsio/awesome-reinforcement-learning

IV. REFERENCE ARCHITECTURE OF RL FRAMEWORKS

We now present the reference architecture (RA) of RL frameworks we developed through our empirical inquiry. The high-level overview of the RA is shown in Fig. 1. It contains six top-level components organized into four component groups: the Framework (Sec. IV-A), Framework Core (Sec. IV-B), Environment (Sec. IV-C), and Utilities (Sec. IV-D). In the following, we elaborate on each of these components. For each component, we provide a detailed internal architectural overview, and report which components can be found in various RL frameworks as separate entities. Often, components are implemented under different names, or in an aggregation or amalgamation with other components. Such details are provided in the replication package.

The analyzed RL systems often substantially differ in scope. Some provide only a Framework Core—these are typically the ones that are colloquially referred to as "environments," e.g., Gymnasium [F1]. Others implement additional services that typically fall into the Utilities (Sec. IV-D) category of components and delegate the responsibility of defining agents and environments to the former class of RL systems—these are typically the ones that are colloquially referred to as "frameworks," e.g., Stable Baselines3 [F10].

Fig. 1: High-level architectural view of RL frameworks. (Component groups are underlined and correspond to the subsections of Sec. IV.)

As RL is formally
defined as a Markov decision process of an Agent in an Environment [1], it is these two components that are truly essential in RL systems, and one can encounter simplistic RL experiments with only these two components. Nonetheless, reasonably complex RL experiments necessitate additional services, e.g., visualization in 3D simulated environments, or data persistence for long-running experiments.

A. Framework

The Framework comprises the user-facing Experiment Orchestrator component, the Framework Core, and the Utilities. Its primary responsibility is to enable users to configure and execute experiments. An experiment is a collection of training or evaluation executions, with particular hyperparameters and configurations [35]. The execution of individual runs within an experiment is handled by the Framework Core.

Fig. 2: Experiment Orchestrator (component diagram not reproduced in text extraction)

TABLE I: Experiment Orchestrator Components

| Component | Frameworks |
|---|---|
| Benchmark Manager | [F15] |
| Experiment Manager | [F3] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18] |
| Hyperparameter Tuner | [F11] [F12] [F14] |

1) Experiment Orchestrator: Provides the primary interface to users for defining and running experiments, e.g., training agents, tuning hyperparameters, or benchmarking algorithms. It orchestrates the high-level experiment process with the optional hyperparameter tuning and benchmarking steps before the sequences of experiments are executed. The Experiment Orchestrator consists of three components (Fig. 2, Tab. I).

a) Experiment Manager: The Experiment Manager sets up the experiment for execution, prepares the Data Persistence and Monitoring & Visualization components, and delegates the experiment execution to the Framework Orchestrator. It relies on two Utilities components to save and load experiment state (via the Data Persistence component), and to monitor and visualize the results (via the Monitoring & Visualization component).

b) Hyperparameter Tuner: Automates the search for the optimal hyperparameter configurations by sampling the hyperparameter space. It generates candidate hyperparameters (e.g., via grid search [36], random search [37], or Bayesian optimization [38]), and passes them to the Experiment Manager to execute experiments for evaluation. Upon executing the experiment, the Hyperparameter Tuner analyzes the results to guide subsequent sampling iterations. This process continues until stopping criteria are met, e.g., convergence to user-defined objectives or exhaustion of the search budget. The implementation is often provided by specialized third-party libraries.
For example, RL-Zoo3 [F11] uses Optuna [39], and RLlib [F12] uses Ray Tune [40].

c) Benchmark Manager: Enables the evaluation of different RL algorithms under consistent experimental settings. Executes experiments (via the Experiment Manager) that share a base configuration but vary in algorithms or policies. An example of this component is given by the Benchmark module in the BenchMARL framework [F15].²

B. Framework Core

The Framework Core coordinates the learning process. It encapsulates the Environment component group, as well as the Framework Orchestrator, and the Agent components.

1) Framework Orchestrator: Receives experiment requests from the Experiment Manager and orchestrates the framework to execute the training or evaluation runs. It loads configurations, controls the training and evaluation lifecycle, allocates resources for distributed execution if needed, and coordinates multi-agent execution when applicable. The Framework Orchestrator consists of four components (Fig. 3, Tab. II).

a) Lifecycle Manager: Controls the execution of the training and evaluation loops. It initializes required components (via the Configuration Manager), handles episode termination and truncation from the environment, monitors global stopping criteria, e.g., maximum timesteps or convergence thresholds, and triggers lifecycle events. Notably, it coordinates the interaction between the Agent and the Environment. It saves and loads the training state via the Data Persistence component, and tracks per-step metrics and activates rendering via the Monitoring & Visualization component. The Lifecycle Manager has two patterns to control the agent-environment interaction. In the first pattern, the Lifecycle Manager queries actions from the Agent, forwards them to the Environment, and sends resulting data back to the Agent (e.g., Acme [F13]).
Alternatively, the Lifecycle Manager actuates the Agent to interact with the Environment and schedules the execution of actions and learning updates (e.g., RLlib [F12]).

² Detailed pointers to modules and source code in the analyzed frameworks are available in the data package.

Fig. 3: Framework Orchestrator (component diagram not reproduced in text extraction)

TABLE II: Framework Orchestrator Components

| Component | Frameworks |
|---|---|
| Configuration Manager | [F5] [F12] [F14] [F15] [F16] [F17] [F18] |
| Distributed Exec. Coord. | [F12] [F13] [F14] [F16] |
| Lifecycle Manager | [F3] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18] |
| Multi-Agent Coord. | [F2] [F3] [F5] [F12] [F14] [F15] [F18] |
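The first, manager-mediated pattern can be sketched minimally in Python. This is an illustrative toy, not code from Acme or RLlib; the class names (`LifecycleManager`, `ConstantAgent`, `CountdownEnv`) and the integer observation are assumptions made for the example.

```python
class CountdownEnv:
    """Toy environment: episode terminates after a fixed number of steps."""
    def __init__(self, horizon=3):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated  # observation, reward, terminated


class ConstantAgent:
    """Toy agent: always selects action 0 and records its experience."""
    def __init__(self):
        self.experience = []

    def select_action(self, obs):
        return 0

    def observe(self, obs, action, reward, next_obs, terminated):
        self.experience.append((obs, action, reward, next_obs, terminated))


class LifecycleManager:
    """Pattern 1: the manager mediates every agent-environment exchange."""
    def __init__(self, agent, env, max_steps=100):
        self.agent, self.env, self.max_steps = agent, env, max_steps

    def run_episode(self):
        obs = self.env.reset()
        for _ in range(self.max_steps):                       # global stopping criterion
            action = self.agent.select_action(obs)            # query action from the Agent
            next_obs, reward, terminated = self.env.step(action)  # forward it to the Environment
            self.agent.observe(obs, action, reward,
                               next_obs, terminated)          # send resulting data back
            obs = next_obs
            if terminated:                                    # termination from the environment
                break
        return len(self.agent.experience)


agent, env = ConstantAgent(), CountdownEnv(horizon=3)
steps = LifecycleManager(agent, env).run_episode()  # collects 3 transitions
```

In the second pattern, `run_episode` would instead hand the environment handle to the agent and only schedule when acting and learning occur.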
Implementations of the Lifecycle Manager differ across frameworks. Some implement the manager in a single module, while others distribute lifecycle logic into multiple modules. For example, Acme [F13] uses the EnvironmentLoop module to coordinate the interaction between the Environment and the Agent. Isaac Lab [F5] implements lifecycle management through two managers: the EventManager orchestrates operations based on different simulation events across the lifecycle, and the TerminationManager computes termination signals.

b) Configuration Manager: Loads and validates the configurations for the learning process from various sources (e.g., YAML files, JSON files, command-line arguments). Configurations specify which algorithm to use, which environment to instantiate, the execution mode (e.g., distributed vs. non-distributed), and resource allocation preferences. About half of the sampled RL frameworks use custom configuration managers, and half integrate third-party libraries. Hydra [41] is the most common third-party library, adopted by, e.g., Isaac Lab [F5], BenchMARL [F15], and Mava [F16].

c) Multi-Agent Coordinator: Manages interactions among agents in multi-agent RL, and coordinates them, including the management of policy assignment, i.e., determining which policy controls each agent and whether policies are shared across agents. It also constructs joint actions from individual agent policies, and coordinates agent-to-agent communication when required.

d) Distributed Execution Coordinator: Allocates and deploys components across multiple processes, devices, or machines when distributed execution is configured.
It determines the distributed topology by mapping logical components, typically of the Agent component group (e.g., function approximators, learners, buffers) onto physical resources (e.g., CPUs and GPUs across different machines), deploys these components to their assigned resources, and maintains the metadata (e.g., IP addresses) needed for inter-component communication. The implementation is typically provided by third-party libraries, e.g., Ray Core [42] (used by RLlib [F12] and MARLlib [F14]) and Launchpad [43] (used by Acme [F13]).

Fig. 4: Agent (component diagram not reproduced in text extraction)

TABLE III: Agent Components

| Component | Frameworks |
|---|---|
| Buffer | [F3] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18] |
| Func. Approx. | [F3] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18] |
| Learner | [F3] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18] |

2) Agent: The Agent implements the RL algorithm. It interacts with the Environment to learn. The schedule of the learning cycle is controlled by the Lifecycle Manager, which determines, e.g., when actions are selected, when experience is collected, and when learning updates occur. Some frameworks provide only the Agent component and delegate the definition of an Environment to external libraries through standardized interfaces. For example, Acme provides out-of-the-box integration with the DeepMind Control Suite [F6] and DeepMind Lab [F7]; other environments must implement the dm_env interface [44] to integrate with Acme. The Agent consists of three components, as shown in Fig. 4 and Tab. III.

a) Function Approximator: Encodes the agent's decision-making mechanism, which maps states to action selections or value estimates.
It selects actions during the Agent's interactions with the Environment, and produces update signals, e.g., experience data in actor-type approximators, or advantage information in critic-type approximators. The details of the Function Approximator vary by algorithm type. Policy-based methods (e.g., proximal policy optimization (PPO) [45]) implement the Function Approximator as an explicit policy, with a probability-based sampling serving for action selection. Value-based methods (e.g., Q-learning [46]) implement the Function Approximator as the value function that estimates the expected return for each action in a given state, from which actions are derived (e.g., via an ε-greedy strategy). Complex agent architectures may use multiple instances of the component. For example, actor–critic methods [47] instantiate the Function Approximator twice: as a policy-based actor and as a value-based critic. The internal representation depends on the data type and state space. For small, discrete spaces, tabular representations are often feasible (e.g., a Q-table [48]). Deep RL frameworks necessitate using a neural network [49].

b) Learner: Updates the Function Approximator from learning signals. It receives these signals either directly from the Function Approximator (e.g., in the SARSA algorithm [50]) or samples them from the Buffer (e.g., in the soft actor-critic (SAC) method [47]).

Fig. 5: Environment (component diagram not reproduced in text extraction)

TABLE IV: Environment Components

| Component | Frameworks |
|---|---|
| Environment Core | [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9] |
| Simulator | [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9] |
| Simulator Adapter | [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] |
Subsequently, it computes algorithm-specific parameter updates (e.g., temporal-difference (TD) errors defined by Bellman equations [51], or policy gradients [49]), and applies them to the Function Approximator.

c) Buffer: Stores update signals and provides the Learner with sampled batches. There are two common Buffer types. A rollout buffer—often used in on-policy RL, e.g., PPO [45]—stores short trajectories that are consumed immediately and then discarded. A replay buffer—often used in off-policy RL, e.g., SAC [47]—stores a large pool of transitions and allows the Learner to request samples during training. Most RL frameworks implement buffers directly rather than depending on external libraries. Some of the exceptions include Acme [F13], which integrates Reverb [52] for distributed settings; and Mava [F16], which uses Flashbax [53].

C. Environment

The Environment is a group of components with which the Agent interacts. It chiefly encapsulates the simulation infrastructure that provides a virtual world for interactions. The Environment consists of three components (Fig. 5, Tab. IV).

1) Environment Core: Exposes the control interface to the Framework Orchestrator. It initializes and resets the environment, receives agent actions, updates the environment state, computes observations and rewards, and produces rendered frames for visualization.
Fig. 6: Data Persistence (component diagram: Data Persistence with Checkpoint Manager and Environment Parameter Manager, and their persistence and retrieval interfaces)

TABLE V: Data Persistence Components
Checkpoint Mgr: [F3] [F9] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18]
Env. Parameter Mgr: [F3] [F4] [F5] [F8]

The Environment Core may be implemented as a vectorized environment, i.e., multiple parallel environment instances, to improve data collection efficiency [F1]. The Environment Core includes an Action Manager to process and apply actions, an Observation Manager to collect observations from the simulator infrastructure, and a Reward Manager to compute reward signals [F5].
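The division of labor among the three managers inside the Environment Core can be illustrated with a minimal Python sketch. The class, method names, and the trivial one-dimensional state are hypothetical illustrations of the pattern, not the API of any surveyed framework:

```python
# Illustrative sketch of an Environment Core that delegates to an
# Action Manager, Observation Manager, and Reward Manager (modeled
# here as private methods). All names are hypothetical.

class ToyEnvironmentCore:
    def __init__(self):
        self.state = 0  # trivial 1-D "simulator" state

    def reset(self):
        self.state = 0
        return self._observe()

    def step(self, action):
        self._apply_action(action)   # Action Manager: validate and apply
        obs = self._observe()        # Observation Manager: derive observation
        reward = self._reward()      # Reward Manager: compute reward signal
        done = self.state >= 3
        return obs, reward, done

    def _apply_action(self, action):
        if action not in (-1, +1):
            raise ValueError("invalid action")
        self.state += action

    def _observe(self):
        return {"position": self.state}

    def _reward(self):
        return 1.0 if self.state >= 3 else 0.0
```

A vectorized Environment Core would hold a list of such instances and apply `step` to a batch of actions; the single-instance version suffices to show the manager split.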
2) Simulator: The program that encodes the probabilistic mechanism representing the real phenomenon and executes this mechanism over a sufficiently long period of time to produce simulation traces characterizing the actual system [54]. At the core of the Simulator, the physical asset is represented by a Model. This Model captures the essential properties of the simulated asset in appropriate detail for the results of the simulation to be considered representative [55]. RL environments rely on a range of simulators. For example, Gymnasium [F1] supports Box2D [56] for 2D physics, Stella [57] for Atari games, and MuJoCo [58] for control tasks. DeepMind Lab [F7] uses the ioquake3 3D engine [59]. 3) Simulator Adapter: Connects the Environment Core with the underlying Simulator. It translates the Agent's actions to simulation steps, and the Lifecycle Manager's instructions for resetting and pausing the simulation. After executing the simulation steps that correspond to the Agent's action, the Simulator Adapter translates the simulation traces to observations, subsequently available to the Observation Manager.

D. Utilities

The Utilities provide services to the rest of the components. They comprise two components: Data Persistence for state management and experiment resumption, and Monitoring & Visualization for tracking training progress and visualizing information. 1) Data Persistence: Manages the storage and retrieval of the experiment state. It has two components (Fig. 6, Tab. V). a) Checkpoint Manager: Saves and restores the experiment state. It stores the algorithms' parameters (e.g., policy weights, replay buffer contents), the environment state, and metadata needed for experiment resumption. The Lifecycle Manager invokes the checkpointing logic at pre-configured intervals (e.g., every N steps); and the Experiment Manager uses it to resume interrupted experiments. Implementation may be native or provided by a third-party library.
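A native implementation can be sketched with the standard library alone. The class name, on-disk layout, and pickle format below are assumptions for illustration, not drawn from any surveyed framework:

```python
# Minimal native Checkpoint Manager sketch: saves and restores an
# experiment-state dict (e.g., policy weights, buffer contents).
# Names and the on-disk format are illustrative only.
import pickle
from pathlib import Path

class CheckpointManager:
    def __init__(self, directory, every_n_steps=1000):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.every_n_steps = every_n_steps

    def maybe_save(self, step, state):
        # Invoked by the Lifecycle Manager at pre-configured intervals.
        if step % self.every_n_steps == 0:
            path = self.directory / f"checkpoint_{step}.pkl"
            with path.open("wb") as f:
                pickle.dump({"step": step, "state": state}, f)
            return path
        return None

    def restore_latest(self):
        # Used by the Experiment Manager to resume interrupted experiments.
        checkpoints = sorted(self.directory.glob("checkpoint_*.pkl"),
                             key=lambda p: int(p.stem.split("_")[1]))
        if not checkpoints:
            return None
        with checkpoints[-1].open("rb") as f:
            return pickle.load(f)
```

Real frameworks add concerns this sketch omits, e.g., atomic writes, retention policies, and serialization of framework-specific objects.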
For example, Acme [F13] implements the Checkpoint Manager in the Checkpointer module. Mava [F16] uses an external checkpoint management library, Orbax [60].

Fig. 7: Monitoring and Visualization (component diagram: Monitoring & Visualization with Logger, Renderer, Recorder, and Reporter)

TABLE VI: Monitoring & Visualization Components
Logger: [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9] [F10] [F11] [F12] [F13] [F14] [F15] [F16] [F17] [F18]
Recorder: [F1] [F2] [F4] [F5] [F7] [F8] [F10] [F11] [F13] [F15] [F17]
Renderer: [F1] [F2] [F3] [F4] [F5] [F6] [F7] [F8] [F9]
Reporter: [F1] [F2] [F3] [F4] [F10] [F11] [F12] [F14] [F15] [F16] [F17] [F18]

b) Environment Parameter Manager: Handles environment parameters that change over time, e.g., difficulty levels in curriculum learning [61] and parameters in domain randomization methods [62]. It exposes a retrieval interface that allows the Lifecycle Manager to query parameters based on the current learning progress. The Lifecycle Manager uses these parameters when initializing or resetting the environment (e.g., in new training episodes), and for method-specific tasks, e.g., randomizing parameters in the environment [63]. For example, ML-Agents [F3] implements the component in the EnvironmentParameters module. 2) Monitoring & Visualization: Tracks training metrics, generates diagnostic outputs or result summaries, and produces visual representations of agent behavior. The frameworks in our sample organize these responsibilities into one larger component. The Logger produces raw data on which other components strongly depend to track and visualize the training progress. It consists of four components (Fig. 7, Tab. VI). a) Renderer: Generates visual frames of the environment based on data obtained from the Environment's rendering information interface (Fig. 5). Common modes include rgb_array to return image arrays for programmatic use, and human to display interactive viewports for humans [F3]. b) Recorder: Captures frames via the Renderer and assembles them into videos or image sequences. For example, Gymnasium implements this component by wrapping environments with the RecordVideo wrapper to capture episodic videos during interaction [F1].
Isaac Lab's RecorderManager records per-step and per-episode frames and exports them through dataset file handlers [F5]. c) Logger: Records raw experiment data, such as episode returns, losses, and diagnostic information, and persists them for later consumption by the Reporter. Most RL frameworks implement a custom Logger, but some integrate external experiment tracking tools. For example, Isaac Lab [F5] and Mava [F16] use Neptune Logger to log metrics, model parameters, and gradients during execution. d) Reporter: Transforms logged data into human-readable outputs. It is typically used for generating training summaries, performance tables, and learning curves.

TABLE VII: Summary of Components and their Responsibilities
Component | Container | Role
Experiment Orchestrator | Framework | Translates user-defined experiment specifications into the concrete execution process.
Experiment Manager | Experiment Orchestrator | Sets up the Data Persistence and Monitoring & Visualization components and delegates the experiment execution to the Framework Orchestrator.
Hyperparameter Tuner | Experiment Orchestrator | Generates candidate hyperparameter configurations.
Benchmark Manager | Experiment Orchestrator | Enables algorithm comparison under consistent experimental settings.
Framework Orchestrator | Framework Core | Initializes required framework components and coordinates their operation during training or evaluation.
Lifecycle Manager | Framework Orchestrator | Controls the execution of the agent–environment interaction loop.
Configuration Manager | Framework Orchestrator | Loads and validates the configurations.
Multi-Agent Coordinator | Framework Orchestrator | Manages how multiple agents interact and learn.
Distributed Execution Coordinator | Framework Orchestrator | Allocates and deploys components across resources.
Agent | Framework Core | Implements the RL algorithm and learns through agent–environment interaction.
Function Approximator | Agent | Encodes the agent's decision-making mechanism.
Buffer | Agent | Stores the collected experience.
Learner | Agent | Updates the Function Approximator using collected experience.
Environment | Framework Core | Encapsulates the simulation infrastructure with which the Agent interacts.
Environment Core | Environment | Initializes the environment, applies agent actions, updates the environment state, and computes rewards.
Action Manager | Environment Core | Processes and applies actions.
Observation Manager | Environment Core | Collects observations from the simulator infrastructure.
Reward Manager | Environment Core | Computes reward signals.
Simulator | Environment | Executes the probabilistic mechanism representing the real-world phenomenon, producing simulation traces.
Simulator Adapter | Environment | Connects the Environment Core with the underlying Simulator.
Data Persistence | Utilities | Manages the storage and retrieval of the experiment data.
Checkpoint Manager | Data Persistence | Saves and restores the experiment state.
Env. Parameter Manager | Data Persistence | Stores environment parameters that change over time.
Monitoring & Visualization | Utilities | Tracks the learning process and generates result summaries.
Renderer | Monitoring & Visualization | Generates visual frames of the environment state.
Recorder | Monitoring & Visualization | Stores captured frames as videos or image sequences.
Logger | Monitoring & Visualization | Records raw experiment data.
Reporter | Monitoring & Visualization | Transforms logged data into human-readable outputs.

V. RECONSTRUCTING RL PATTERNS

To demonstrate the RA, we reconstruct typical RL patterns.

A. Reconstructing Discrete Policy Gradient

Discrete policy gradient RL methods directly learn a probability distribution over possible actions for any given state [49]. Fig. 8 shows how such methods are instantiated from the RA. In discrete policy gradient methods, the Function Approximator is implemented by a stochastic, parameterized Policy over a finite action set.
Given the current state of the Agent, the policy samples from a probability distribution to determine the next action, and the resulting experience is stored in a Rollout Buffer, which holds short trajectories that are consumed immediately and then discarded. The Policy-Based Learner samples experience from the Buffer and updates the Policy.

Fig. 8: Reconstruction of Discrete Policy Gradient (Policy : Function Approximator, Policy-Based Learner : Learner, Rollout Buffer : Buffer)

Fig. 9: Reconstruction of Q-learning (Value Function : Function Approximator, Value-Based Learner : Learner, Replay Buffer : Buffer)

B. Reconstructing Q-learning

Q-learning is an example of value-based methods, which—as opposed to policy-based ones—learn the value (the expected future reward) of states or actions to guide decisions [46]. Fig. 9 shows how Q-learning is instantiated from the RA. In Q-learning, the Function Approximator is implemented by a Value Function (here: a Q-function [46]) and selects actions based on the Q-values through an appropriate action selection strategy, e.g., ε-greedy [64]. The resulting experience is stored in a Replay Buffer, which holds a large pool of transitions. The Value-Based Learner samples mini-batches of experience from the Buffer, computes temporal difference (TD) errors between current and target Q-values, and updates the Value Function so that TD errors are minimized and, by that, Q-value estimates improve.
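The flow just described—select actions ε-greedily, store transitions in a Replay Buffer, and minimize TD errors on sampled mini-batches—can be condensed into a self-contained Python sketch. The 5-state chain environment, hyperparameters, and all identifiers are illustrative assumptions, not taken from any surveyed framework:

```python
# Runnable sketch of the Q-learning reconstruction: a tabular Value
# Function (Q-table), epsilon-greedy action selection, a Replay Buffer,
# and a Value-Based Learner applying TD updates. The 5-state chain
# environment (start at state 0, goal at state 4) is illustrative.
import random
from collections import defaultdict, deque

random.seed(0)

ACTIONS = (+1, -1)                      # move right / move left
q = defaultdict(float)                  # Function Approximator: Q(s, a)
replay = deque(maxlen=10_000)           # Replay Buffer of transitions

def select_action(state, epsilon=0.1):
    """Epsilon-greedy action selection over current Q-values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])

def td_update(batch, alpha=0.5, gamma=0.9):
    """Value-Based Learner: shrink the TD error between current and target Q."""
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

for _ in range(200):                    # training episodes
    s = 0
    for _ in range(20):                 # agent-environment interaction loop
        a = select_action(s)
        s_next = min(max(s + a, 0), 4)
        done = s_next == 4
        r = 1.0 if done else 0.0
        replay.append((s, a, r, s_next, done))
        td_update(random.sample(list(replay), min(len(replay), 32)))
        if done:
            break
        s = s_next
```

Because transitions next to the goal are replayed many times, Q(3, +1) converges toward the immediate reward of 1, while values further down the chain are discounted by γ per step.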
Fig. 10: Reconstruction of Advantage Actor-Critic (A2C) (Actor : Function Approximator, Critic : Function Approximator, Policy Optimizer : Learner, Rollout Buffer : Buffer)

C. Reconstructing Actor-Critic

Actor-critic methods combine a policy-based Actor that prioritizes short-term learning with a value-based Critic that prioritizes long-term learning. Fig. 10 shows how advantage actor-critic (A2C), a specific flavor of actor-critic methods, is instantiated from the RA. In actor-critic methods, the Function Approximator is instantiated twice: as the Actor and the Critic. The Actor implements policy-based learning. It conducts short rollouts and stores this experience in the Buffer.
The Critic, based on the Actor's experience, estimates the state values, i.e., the V-function. Both the Actor's experience and the Critic's V-function are used by the Policy Optimizer to calculate the advantage of rollouts and to update the Actor's policy by ascending the policy-gradient objective, which is a function of the advantage.

D. Reconstructing Multi-Agent Learning

In decentralized multi-agent RL (MARL) [65], agents learn independently based on local information. Fig. 11 shows how the centralized flavor of MARL is instantiated from the RA.

Fig. 11: Reconstruction of MARL (with centralized learning) (Agents 1..N, Centralized Learner, Multiagent Coordinator, Distributed Execution Coordinator)

In MARL, Agents 1..N interact with the Environment individually, and produce update signals either to learn individually (decentralized MARL) or to be used by a Centralized Learner component (centralized MARL, shown in Fig. 11). Agents are architected as shown in the previous examples, each containing their own Function Approximator and Buffer. A Centralized Learner typically samples from all Buffers, computes joint learning signals (e.g., shared returns, team rewards), and updates all Agents' Function Approximators. The Multi-Agent Coordinator takes care of assembling joint actions for environment execution, distributing experience back to each agent, handling agent ordering, and maintaining policy–agent mappings for shared or individual policies. In distributed MARL, the Distributed Execution Coordinator takes care of deploying Agents on different hardware components (e.g., different threads, GPUs, or clusters).

VI. RESULTS EVALUATION AND QUALITY ASSESSMENT

Following best practices in GT [32], [66], we evaluate our results and the quality of our study by the following criteria.

Credibility, i.e., is there sufficient data to merit claims? In our study, we analyze 18 RL systems that support an array of RL techniques and use-cases. The resulting theory reached theoretical saturation relatively fast, after about 65% of the corpus. Based on these observations, we conjecture that we analyzed a sufficient number of RL systems to render this study credible.

Originality, i.e., do the categories offer new insights? Our study offers a comprehensive empirical view on how modern RL frameworks are architected. In addition, our study clarifies coarser-grained component groups (see Fig. 1 and the subsections of Sec. IV) to depart from informal colloquialisms of "environments" and "frameworks."

Resonance, i.e., does the theory make sense to experts?
To assess resonance, we gathered structured reflections from five experts (researchers, practitioners) knowledgeable in RL. We recruited the practitioners via convenience sampling and asked them the following questions. (1) When you read this reference architecture (RA), does anything stand out as inaccurate in your experience? (2) Which parts of this RA match your daily reality, i.e., when working with RL? (3) Are there components or relationships in this RA that do not fit your experience? (4) Did anything in the RA help you see your experience with RL frameworks differently? Experts unanimously responded that nothing stood out as inaccurate. Some experts mentioned components that do not fit their experience as they often implement custom components, but they could identify those components in the RA. About the benefits of the RA, one expert mentioned that they "would use the RA to help me organize my code components that I have to implement myself" and that they "would also use this RA when learning to work with a new framework, to map concepts onto the RA to help me understand how it works." Regarding modularity, one expert mentioned that "the explicit identification of a Lifecycle Manager helped clarify the architectural role of the training loop," and that the RA highlighted how "simulator integration can be treated as a reusable architectural pattern rather than a framework-specific detail." Experts recognized opportunities the RA may bring, too, e.g., when ensuring convergence of RL algorithms: "When convergence was difficult, it was often challenging to pinpoint whether the issue was architectural, environmental, or algorithmic. This reference architecture highlights how stronger modular separation could make debugging and experimentation clearer." We judge that the RA resonates well with practitioners as they articulate no misalignment with their practices and recognize value in the RA tied to their daily practices.
It is important to note that this exercise is a mere resonance check rather than a thorough validation. As such, it does not allow for generalization. However, it indicates interpretive plausibility and experiential alignment.

Usefulness, i.e., does the theory offer useful interpretations? Our study identifies architectural primitives of RL frameworks that are grounded in existing implementations, i.e., they allow for interpreting the RA in the practical context of RL. Beyond identifying components and component groups, our study allows researchers and practitioners to structure future RL systems and anticipate architectural trade-offs (e.g., highly modular design vs. more integrated functionality).

Threats to validity

Internal validity. The inferred compositions of, and relationships between, categories may reflect researchers' interpretations rather than inherent properties of the studied frameworks. We mitigated this threat through constant comparison and by actively searching for alternative explanations.

Construct validity. Researcher bias could threaten the validity of this study, e.g., architectural components may have been shaped by this effect. To mitigate this threat, we facilitated constant cross-verification steps and discussions among the researchers. Selection bias (e.g., missing important frameworks, and including less relevant ones) could threaten construct validity. To mitigate this threat, we relied on the best practices of grounded theory and extensively analyzed frameworks until we confirmed saturation.

External validity. GT is a non-statistical research genre and therefore, the results cannot be statistically generalized to a general population, e.g., to general machine learning frameworks and closed-source implementations. Since we reached saturation, however, we expect the RA to generalize to RL frameworks.

VII. DISCUSSION

A.
The architectural tendencies of RL frameworks

The 18 frameworks and 28 RA components imply a total of 504 (18 × 28) potential implementations across the sampled frameworks. In total, we find 252 of 504 (50.0%) implemented components in the sampled frameworks and 252 of 504 (50.0%) missing ones. As shown in Fig. 12, RL systems labeled as "frameworks" and "environments" tend to implement complementary functionality. (The figure does not include Utilities, which show a more homogeneous coverage of components across framework- and environment-type RL systems.)

Fig. 12: Implementations of RA components across the sampled RL systems, broken down by implicit, external, and explicit implementation, for systems labeled as Framework vs. Environment. (Utility components not included.)

Framework-type RL systems (e.g., Acme [F13] and RLlib [F12]) tend to implement the Agent components (e.g., Buffer and Function Approximator – Fig. 4) and Framework Orchestrator components (e.g., Lifecycle Manager and Multi-Agent Coordinator – Fig. 3), and exhibit 75 of 90 (83.3%) coverage on such components.³ Environment-type RL systems (e.g., Gymnasium [F1] and PettingZoo [F2]) tend to be restricted to implementing the Environment group (Fig. 5), and exhibit 55 of 56 (98.2%) coverage on such components. These figures highlight the complementary architectural tendencies of RL systems colloquially labeled as "environments" and "frameworks," and hint at the importance of considering both types when designing RL-based software.
We recommend developers of RL-enabled software to consider the complementary feature sets of RL systems that implement environments and those that implement framework components. Complex software solutions may necessitate integrating both types.

B. The role of external libraries

We find 123 of 252 (48.8%) explicitly implemented components (i.e., the responsibility of an RA component is clearly assigned to an existing component); 47 of 252 (18.7%) that are implemented through an external library (i.e., explicit isolation of the responsibility but the implementation is deferred); and 82 of 252 (32.5%) implicitly implemented ones (i.e., responsibilities are lumped into other components). Fig. 12 shows that external implementations can be found in the vast majority of RA component types, especially the ones that are Framework-related. This tendency hints at the existence of external libraries that can act as viable building blocks of RL software. For example, TensorBoard [67] is a recurring library to implement the Reporter component, e.g., in Unity ML-Agents [F3], Isaac Gym [F4], and Dopamine [F17]. Hydra [41] is often used to implement the Configuration Manager component, e.g., in Isaac Lab [F5], BenchMARL [F15], and Mava [F16].

³ Implementation details are available in the data package.

TABLE VIII: Localizing Architectural Design Decisions
ADD | Definition | Affected RA components
Model Architecture | The structural organization of RL models | Agent, Multi-Agent Coordinator
Model Training | The organization of learning across agents | Multi-Agent Coord., Distributed Execution Coord., Learner, Buffer
Checkpoints | Whether RL models' states are saved during learning | Checkpoint Manager
Transfer Learning | Whether to use pre-trained models or train from scratch | Checkpoint Manager, Agent
Distribution Strategy | Whether and how to use distributed training | Distributed Execution Coordinator
Hyperparameter Tuning | Whether to use hyperparameter tuning or not | Hyperparameter Tuner
This finding is in line with recent reports on the fundamental enabler role of open-source software stacks in machine learning (ML) [68] and the fast-moving landscape of ML libraries [69]. In such a landscape, clear architectural guidelines, such as RAs, are in high demand. For example, a recent study by Larios Vargas et al. [70] shows that the impact of a wrongly chosen external library depends on the library's role in the software architecture, and that experts tend to address this problem by evaluating the architectural alignment of external libraries in the prototype phase of software. In our sample, 18 of 47 (38.3%) external libraries originate directly from the same development team or ecosystem. For example, in RLlib [F12], multiple components are realized through libraries from the same team, Anyscale, e.g., Ray Tune [40] to implement the Hyperparameter Tuner and Ray Core [42] to implement the Distributed Execution Coordinator. Similarly, Acme [F13] uses libraries from Google DeepMind, e.g., Launchpad [43] to implement the Distributed Execution Coordinator and Reverb [52] to implement Buffers. We recommend that adopters assess the supporting ecosystem, as reliable external libraries may improve the quality of the developed RL-enabled software.

C. Localizing Architectural Design Decisions (ADD) in RL

The RA in this work allows for localizing the Architectural Design Decisions (ADDs) for RL by Ntentos et al. [12]. ADDs represent the design choices that influence how an RL framework is constructed and operated. By understanding which components are affected by a specific ADD, architects can evaluate the feasibility and impact of a decision more precisely. For example, components relying on external dependencies may not allow for the same liberty as custom implementations (e.g., versioning may be more rigorous for the sake of backward compatibility). Tab. VIII maps ADDs onto the RA components that they primarily affect.
The Model Architecture ADD determines the number of Agents, their internal structure, and whether multi-agent coordination (by the Multi-Agent Coordinator) is required. For example, a monolithic model uses a single Agent, whereas a multi-agent architecture requires multiple Agents and a Multi-Agent Coordinator to manage agent interactions. The Model Training ADD determines where and how learning updates are performed. For example, centralized training with decentralized execution in MARL uses a single Centralized Learner accessing multiple Buffers, and the Multi-Agent Coordinator handles joint action assembly and experience distribution across agents. Distributed training requires the Distributed Execution Coordinator to deploy components across distributed resources. The Checkpoints ADD determines whether, what, and when to save the experiment states. The Checkpoint Manager handles this decision by deciding when (e.g., every N steps) and what (e.g., training states, model parameters) to save. The Transfer Learning ADD determines whether to train from pre-trained models or from scratch. If transfer learning is enabled, the Checkpoint Manager loads the pre-trained model parameters into the Agent. The Distribution Strategy ADD determines where and how to use distributed training. The Distributed Execution Coordinator handles this decision by allocating and deploying components across resources. The Hyperparameter Tuning ADD determines whether to use hyperparameter tuning or not. If hyperparameter tuning is enabled, the Hyperparameter Tuner automates the search over hyperparameter configurations to find the optimal ones. We recommend RL architects use our RA to localize design decisions for better assessment and evaluation of the implications of their decisions.

VIII. CONCLUSION

In this paper, we propose a reference architecture of RL frameworks based on our empirical investigation of 18 widely used open-source implementations.
Our work allows for better-informed design decisions in the development and maintenance of RL frameworks and during the integration of RL frameworks into software systems. We plan to maintain our reference architecture by continuously analyzing new RL frameworks. Future work will focus on a more detailed evaluation of the RA to assess its understandability, effectiveness in solving practical RL problems, and usefulness to practitioners. In addition, we plan to provide a reference implementation that RL system developers can use as a reasonable abstraction and starting point for their implementations.

ACKNOWLEDGMENT

We truly appreciate the insightful remarks of the three anonymous reviewers who helped us improve the original submission; and our colleagues who provided structured feedback in the resonance check phase of this work.

RL FRAMEWORKS

[F1] M. Towers et al., Gymnasium: A standard interface for reinforcement learning environments, GitHub repository: https://github.com/Farama-Foundation/Gymnasium – Accessed on Feb 12, 2026, 2025. DOI: 10.48550/arXiv.2407.17032
[F2] J. Terry et al., "PettingZoo: Gym for Multi-Agent Reinforcement Learning," in Advances in Neural Information Processing Systems, GitHub repository: https://github.com/Farama-Foundation/PettingZoo – Accessed on Feb 12, 2026, vol. 34, Curran Associates, Inc., 2021, pp. 15032–15043.
[F3] A. Juliani et al., Unity: A general platform for intelligent agents, GitHub repository: https://github.com/Unity-Technologies/ml-agents – Accessed on Feb 12, 2026, 2020. DOI: 10.48550/arXiv.1809.02627
[F4] V. Makoviychuk et al., Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning, GitHub repository: https://github.com/isaac-sim/IsaacGymEnvs – Accessed on Feb 12, 2026, 2021. DOI: 10.48550/arXiv.2108.10470
[F5] M. Mittal et al., "Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning," arXiv preprint arXiv:2511.04831, 2025, GitHub repository: https://github.com/isaac-sim/IsaacLab – Accessed on Feb 12, 2026.
[F6] S. Tunyasuvunakool et al., "dm_control: Software and tasks for continuous control," Software Impacts, vol. 6, p. 100022, 2020, GitHub repository: https://github.com/google-deepmind/dm_control – Accessed on Feb 12, 2026. DOI: 10.1016/j.simpa.2020.100022
[F7] C. Beattie et al., "DeepMind Lab," arXiv preprint arXiv:1612.03801, 2016, GitHub repository: https://github.com/google-deepmind/lab – Accessed on Feb 12, 2026.
[F8] M. G. Bellemare et al., "The Arcade Learning Environment: An evaluation platform for general agents," J. Artif. Int. Res., vol. 47, no. 1, pp. 253–279, 2013, GitHub repository: https://github.com/Farama-Foundation/Arcade-Learning-Environment – Accessed on Feb 12, 2026.
[F9] C. Bonnet et al., Jumanji: A diverse suite of scalable reinforcement learning environments in JAX, GitHub repository: https://github.com/instadeepai/jumanji – Accessed on Feb 12, 2026, 2024. arXiv: 2306.09884.
[F10] A. Raffin et al., "Stable-Baselines3: Reliable reinforcement learning implementations," J Mach Learn Res, vol. 22, no. 268, pp. 1–8, 2021, GitHub repository: https://github.com/DLR-RM/stable-baselines3 – Accessed on Feb 12, 2026.
[F11] A. Raffin, RL Baselines3 Zoo, GitHub repository: https://github.com/DLR-RM/rl-baselines3-zoo – Accessed on Feb 12, 2026, 2020.
[F12] E. Liang et al., "RLlib: Abstractions for distributed reinforcement learning," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, GitHub repository: https://github.com/ray-project/ray/tree/master/rllib – Accessed on Feb 12, 2026, vol. 80, PMLR, 2018, pp. 3053–3062.
[F13] M. W. Hoffman et al., "Acme: A Research Framework for Distributed Reinforcement Learning," arXiv preprint arXiv:2006.00979, 2020, GitHub repository: https://github.com/google-deepmind/acme – Accessed on Feb 12, 2026.
[F14] S. Hu et al., "MARLlib: A Scalable and Efficient Multi-agent Reinforcement Learning Library," J Mach Learn Res, 2023, GitHub repository: https://github.com/Replicable-MARL/MARLlib – Accessed on Feb 12, 2026.
[F15] M. Bettini et al., "BenchMARL: Benchmarking Multi-Agent Reinforcement Learning," J Mach Learn Res, vol. 25, no. 217, pp. 1–10, 2024, GitHub repository: https://github.com/facebookresearch/BenchMARL – Accessed on Feb 12, 2026.
[F16] R. de Kock et al., "Mava: A research library for distributed multi-agent reinforcement learning in JAX," arXiv preprint arXiv:2107.01460, 2023, GitHub repository: https://github.com/instadeepai/Mava – Accessed on Feb 12, 2026.
[F17] P. S. Castro et al., Dopamine: A Research Framework for Deep Reinforcement Learning, GitHub repository: https://github.com/google/dopamine – Accessed on Feb 12, 2026, 2018.
[F18] J. Weng et al., "Tianshou: A Highly Modularized Deep Reinforcement Learning Library," J Mach Learn Res, vol. 23, no. 267, pp. 1–6, 2022, GitHub repository: https://github.com/thu-ml/tianshou – Accessed on Feb 12, 2026.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning. Cambridge, MA: Bradford Books, Feb. 1998.
[2] R. Figueiredo Prudencio, M. R. O. A. Maximo, and E. L. Colombini, "A survey on offline reinforcement learning: Taxonomy, review, and open problems," IEEE Trans Neural Netw Learn Syst, vol. 35, no. 8, pp. 10237–10257, 2024. DOI: 10.1109/TNNLS.2023.3250269
[3] I. David and E. Syriani, "Automated Inference of Simulators in Digital Twins," in Handbook of Digital Twins. CRC Press, 2023, ch. 8, ISBN: 978-1-032-54607-0.
[4] K. Dagenais and I. David, "Complex model transformations by reinforcement learning with uncertain human guidance," in 2025 ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems (MODELS), 2025, pp. 209–220. DOI: 10.1109/MODELS67397.2025.00025
[5] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," Int J Rob Res, vol. 32, no. 11, pp. 1238–1274, 2013. DOI: 10.1177/0278364913495721
[6] T. Kim, M. Jang, and J. Kim, "A survey on simulation environments for reinforcement learning," in 2021 18th International Conference on Ubiquitous Robots (UR), IEEE, 2021, pp. 63–67.
[7] X. Liu and I. David, "AI simulation by digital twins: Systematic survey, reference framework, and mapping to a standardized architecture," Softw Syst Model, 2025. DOI: 10.1007/s10270-025-01306-0
[8] J. Terven, "Deep reinforcement learning: A chronological overview and methods," AI, vol. 6, no. 3, 2025. DOI: 10.3390/ai6030046
[9] N. Pippas, E. A. Ludvig, and C. Turkay, "The Evolution of Reinforcement Learning in quantitative finance: A survey," ACM Comput. Surv., vol. 57, no. 11, Jun. 2025, ISSN: 0360-0300. DOI: 10.1145/3733714
[10] P. S. Nouwou Mindom, A. Nikanjam, and F. Khomh, "A comparison of reinforcement learning frameworks for software testing tasks," Empir Softw Eng, vol. 28, no. 5, p. 111, 2023.
[11] A. Schuderer, S. Bromuri, and M. van Eekelen, "Sim-Env: Decoupling OpenAI Gym environments from simulation models," in International Conference on Practical Applications of Agents and Multi-Agent Systems, Springer, 2021, pp. 390–393.
[12] E. Ntentos, S. J. Warnett, and U. Zdun, "Supporting architectural decision making on training strategies in reinforcement learning architectures," in 2024 IEEE 21st International Conference on Software Architecture (ICSA), IEEE, 2024, pp. 90–100.
[13] N. Nahar, H. Zhang, G. Lewis, S. Zhou, and C. Kästner, "A meta-summary of challenges in building products with ML components – collecting experiences from 4758+ practitioners," in IEEE/ACM 2nd Intl Conf on AI Engineering – Software Engineering for AI (CAIN), 2023, pp. 171–183. DOI: 10.1109/CAIN58948.2023.00034
[14] M. M. Morovati, F. Tambon, M. Taraghi, A. Nikanjam, and F. Khomh, "Common challenges of deep reinforcement learning applications development: An empirical study," Empirical Software Engineering, vol. 29, no. 4, p. 95, 2024. DOI: 10.1007/s10664-024-10500-5
[15] F. Tambon, G. Laberge, L. An, A. Nikanjam, P. S. N. Mindom, Y. Pequignot, F. Khomh, G. Antoniol, E. Merlo, and F. Laviolette, "How to certify machine learning based safety-critical systems? A systematic literature review," Autom Softw Eng, vol. 29, no. 2, p. 38, 2022. DOI: 10.1007/s10515-022-00337-x
[16] L. Baier, F. Jöhren, and S. Seebacher, "Challenges in the deployment and operation of machine learning in practice," in 27th European Conf on Information Systems (ECIS), AISeL, 2019, Paper: 163.
[17] M. L. Puterman, "Markov decision processes," Handbooks in Operations Research and Management Science, vol. 2, pp. 331–434, 1990.
[18] D. Kalashnikov et al., "Scalable deep reinforcement learning for vision-based robotic manipulation," in Conference on Robot Learning, PMLR, 2018, pp. 651–673.
[19] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, "Learning to drive in a day," in 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019, pp. 8248–8254.
[20] L. Zhou, S. Pan, J. Wang, and A. V. Vasilakos, "Machine learning on big data: Opportunities and challenges," Neurocomputing, vol. 237, pp. 350–361, 2017. DOI: 10.1016/j.neucom.2017.01.026
[21] A. Dosovitskiy et al., "CARLA: An open urban driving simulator," in Proc of the 1st Annual Conference on Robot Learning, vol. 78, PMLR, 2017, pp. 1–16.
[22] E. Kusmenko, M. Münker, M. Nadenau, and B. Rumpe, "A model-driven generative self play-based toolchain for developing games and players," in Proc. 21st ACM SIGPLAN Intl Conference on Generative Programming: Concepts and Experiences, 2022, pp. 95–107.
[23] P. Czechowski et al., "Deep reinforcement and IL for autonomous driving: A review in the CARLA simulation environment," Applied Sciences, vol. 15, no. 16, 2025. DOI: 10.3390/app15168972
[24] T. Buhet et al., "Conditional vehicle trajectories prediction in CARLA urban environment," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019. DOI: 10.1109/ICCVW.2019.00284
[25] G. Brockman et al., OpenAI Gym, 2016. DOI: 10.48550/arXiv.1606.01540
[26] K. Zhang, Z. Yang, and T. Başar, "Multi-agent reinforcement learning: A selective overview of theories and algorithms," Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
[27] S. Balhara et al., "A survey on deep reinforcement learning architectures, applications and emerging trends," IET Communications, vol. 19, no. 1, 2022. DOI: 10.1049/cmu2.12447
[28] B. Glaser and A. Strauss, Discovery of Grounded Theory: Strategies for Qualitative Research. Routledge, 1967.
[29] B. G. Glaser, "Doing grounded theory: Issues and discussions," Sociology, 1998.
[30] D. A. Tamburri and R. Kazman, "General methods for software architecture recovery: A potential approach and its evaluation," Empirical Software Engineering, vol. 23, no. 3, pp. 1457–1489, 2018.
[31] J. M. Corbin and A. C. Strauss, Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory, 4th ed. SAGE Publications, 2015.
[32] K.-J. Stol, P. Ralph, and B. Fitzgerald, "Grounded theory in software engineering research: A critical review and guidelines," in Proceedings of the 38th International Conference on Software Engineering, ACM, 2016, pp. 120–131. DOI: 10.1145/2884781.2884833
[33] B. Glaser, "Theoretical sensitivity," Advances in the Methodology of Grounded Theory, 1978.
[34] Z. Liu et al., "Acceleration for deep reinforcement learning using parallel and distributed computing: A survey," ACM Comput. Surv., vol. 57, no. 4. DOI: 10.1145/3703453
[35] M. A. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar, "Accelerating the Machine Learning Lifecycle with MLflow," IEEE Data Eng. Bull., vol. 41, pp. 39–45, 2018.
[36] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 473–480.
[37] J. Bergstra et al., "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[38] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," Advances in Neural Information Processing Systems, vol. 25, 2012.
[39] T. Akiba et al., "Optuna: A next-generation hyperparameter optimization framework," in Proc of the 25th ACM SIGKDD Intl Conf on Knowledge Discovery & Data Mining, ser. KDD '19, ACM, 2019, pp. 2623–2631. DOI: 10.1145/3292500.3330701
[40] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, "Tune: A research platform for distributed model selection and training," arXiv:1807.05118, 2018.
[41] O. Yadan, Hydra – a framework for elegantly configuring complex applications, GitHub, 2019. [Online]. Available: https://github.com/facebookresearch/hydra
[42] P. Moritz et al., "Ray: A distributed framework for emerging AI applications," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 561–577.
[43] F. Yang, G. Barth-Maron, P. Stańczyk, M. Hoffman, S. Liu, M. Kroiss, A. Pope, and A. Rrustemi, "Launchpad: A programming model for distributed machine learning research," arXiv:2106.04516, 2021.
[44] A. Muldal, Y. Doron, J. Aslanides, T. Harley, T. Ward, and S. Liu, dm_env: A Python interface for reinforcement learning environments, 2019. [Online]. Available: http://github.com/deepmind/dm_env
[45] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv:1707.06347, 2017.
[46] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[47] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Int Conf on Machine Learning, PMLR, 2018, pp. 1861–1870.
[48] Y. Hirashima, Y. Iiguni, A. Inoue, and S. Masuda, "Q-learning algorithm using an adaptive-sized Q-table," in Proceedings of the 38th IEEE Conference on Decision and Control, vol. 2, 1999, pp. 1599–1604. DOI: 10.1109/CDC.1999.830250
[49] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, vol. 12, 1999.
[50] G. A. Rummery and M. Niranjan, On-line Q-learning Using Connectionist Systems. University of Cambridge, Department of Engineering, Cambridge, UK, 1994, vol. 37.
[51] R. Bellman, "Dynamic programming," Science, vol. 153, no. 3731, pp. 34–37, 1966. DOI: 10.1126/science.153.3731.34
[52] A. Cassirer et al., Reverb: A framework for experience replay, 2021. DOI: 10.48550/arXiv.2102.04736
[53] E. Toledo, L. Midgley, D. Byrne, C. R. Tilbury, M. Macfarlane, C. Courtot, and A. Laterre, Flashbax: Streamlining experience replay buffers for reinforcement learning with JAX, 2023. [Online]. Available: https://github.com/instadeepai/flashbax/
[54] S. M. Ross, Simulation. Academic Press, 2022.
[55] B. P. Zeigler, A. Muzy, and E. Kofman, Theory of Modeling and Simulation: Discrete Event & Iterative System Computational Foundations. Academic Press, 2018.
[56] Box2D, GitHub, 2025. [Online]. Available: https://github.com/erincatto/box2d
[57] Stella, GitHub, 2025. [Online]. Available: https://stella-emu.github.io/
[58] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2012, pp. 5026–5033.
[59] ioquake3, 2025. [Online]. Available: https://ioquake3.org
[60] Orbax, GitHub, 2025. [Online]. Available: https://github.com/google/orbax
[61] Y. Bengio et al., "Curriculum learning," in Proc of the 26th Annual Intl Conf on Machine Learning, ser. ICML '09, Association for Computing Machinery, 2009, pp. 41–48. DOI: 10.1145/1553374.1553380
[62] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 23–30.
[63] O. M. Andrychowicz et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020. DOI: 10.1177/0278364919887447
[64] C. J. C. H. Watkins et al., "Learning from delayed rewards," 1989.
[65] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, "Fully decentralized multi-agent reinforcement learning with networked agents," in Int Conf on Machine Learning, PMLR, 2018, pp. 5872–5881.
[66] K. C. Charmaz, Constructing Grounded Theory. SAGE, 2006.
[67] TensorBoard, 2025. [Online]. Available: https://www.tensorflow.org/tensorboard
[68] X. Tan, K. Gao, M. Zhou, and L. Zhang, "An exploratory study of deep learning supply chain," in Proceedings of the 44th International Conference on Software Engineering, ser. ICSE '22, Pittsburgh, Pennsylvania: ACM, 2022, pp. 86–98. DOI: 10.1145/3510003.3510199
[69] G. Nguyen, S. Dlugolinsky, M. Bobák, V. Tran, Á. López García, I. Heredia, P. Malík, and L. Hluchý, "Machine learning and deep learning frameworks and libraries for large-scale data mining: A survey," Artif. Intell. Rev., vol. 52, no. 1, pp. 77–124, 2019.
[70] E. Larios Vargas, M. Aniche, C. Treude, M. Bruntink, and G. Gousios, "Selecting third-party libraries: The practitioners' perspective," in Proc. of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ACM, 2020, pp. 245–256. DOI: 10.1145/3368089.3409711