Paper deep dive
ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization
Panuganti Chirag Sai, Gandholi Sarat, R. Raghunatha Sarma, Venkata Kalyan Tavva, Naveen M
Abstract
Reducing latency and energy consumption is critical to improving the efficiency of memory systems in modern computing. This work introduces ReLMXEL (Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization), an explainable multi-agent online reinforcement learning framework that dynamically optimizes memory controller parameters using reward decomposition. ReLMXEL operates within the memory controller, leveraging detailed memory behavior metrics to guide decision-making. Experimental evaluations across diverse workloads demonstrate consistent performance gains over baseline configurations, with refinements driven by workload-specific memory access behavior. By incorporating explainability into the learning process, ReLMXEL not only enhances performance but also increases the transparency of control decisions, paving the way for more accountable and adaptive memory system designs.
Tags
Links
- Source: https://arxiv.org/abs/2603.17309v1
- Canonical: https://arxiv.org/abs/2603.17309v1
Full Text
ReLMXEL: Adaptive RL-Based Memory Controller with Explainable Energy and Latency Optimization

Panuganti Chirag Sai, Gandholi Sarat, and R. Raghunatha Sarma (Department of Mathematics and Computer Science, Sri Sathya Sai Institute of Higher Learning; chiragsaipanuganti@sssihl.edu.in, gandholisarat@sssihl.edu.in, rraghunathasarma@sssihl.edu.in); Venkata Kalyan Tavva (Department of Computer Science and Engineering, Indian Institute of Technology Ropar; kalyantv@iitrpr.ac.in); Naveen M (AI Performance Engineer, Red Hat; nmiriyal@redhat.com)

I Introduction

In modern computing systems, Dynamic Random Access Memory (DRAM) is the de facto memory technology and plays a critical role in overall system performance, especially for memory- and compute-intensive workloads such as those encountered in machine learning (ML) training and inference.
Consequently, significant research focuses on improving DRAM efficiency, particularly in reducing latency and energy consumption. The memory controller, which manages communication between the processor and DRAM, is pivotal in achieving these optimizations. A survey by Wu et al. [18] reviews the growing use of machine learning in computer architecture, highlighting reinforcement learning (RL) as a promising technique for designing self-optimizing memory controllers. These controllers are modeled as RL agents that choose DRAM commands based on long-term expected benefits and incorporate techniques such as genetic algorithms and multi-factor state representations to handle diverse objectives like energy and throughput. One prominent approach is the self-optimizing memory controller proposed by Ipek et al. [6], which uses RL to adapt scheduling decisions and outperform static policies across various workloads. Despite these improvements, the lack of transparency in RL-driven decisions hinders their adoption in real-world systems that require explainability, reliability, and trust. To bridge this gap, we introduce Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization (ReLMXEL), a novel multi-agent RL-based memory controller. ReLMXEL dynamically tunes memory policies to optimize latency and energy across diverse workloads, including several that exhibit computational patterns commonly found in machine learning (ML) applications, such as dense linear algebra (GEMM), memory-bound operations (STREAM, mcf), and irregular data access patterns (BFS, omnetpp), while incorporating explainability techniques to make its decisions interpretable. This approach builds upon prior work in adaptive memory systems and aims to balance performance with accountability in complex computing environments.
II Literature Review

Figure 1: Reinforcement Learning Framework [16]

In an RL framework, an agent interacts with the environment over discrete timesteps. At each timestep t, the agent observes the current state S_t, selects an action A_t according to a policy π(a|s), receives a reward R_t, and transitions to a new state S_{t+1}. This process continues iteratively, allowing the agent to learn a policy π(a|s) that maximizes the expected cumulative reward over time. Machine learning approaches generally require large, labeled datasets and assume that data distributions remain stationary. However, memory systems exhibit highly dynamic behavior, with workloads and access patterns changing rapidly over time. Traditional ML methods lack the capability to adapt on the fly and cannot effectively capture this dynamism. In contrast, an RL agent learns through direct interaction with the environment, making decisions based on real-time feedback rather than relying on pre-collected data. This allows RL to handle non-stationary environments by continuously adapting its policy as system conditions evolve. Additionally, RL optimizes long-term cumulative rewards and supports multi-objective optimization tasks such as balancing energy efficiency, bandwidth, and latency. These strengths make RL particularly well suited for memory controller parameter tuning.

II-A Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

The Self-Optimizing Memory Controller by Ipek et al. [6] overcomes the limitations of static DRAM controllers by using reinforcement learning to dynamically adapt command scheduling. It models the controller as an RL agent interacting with an environment composed of processor cores, caches, buses, DRAM banks, and scheduling queues.
The state includes features such as read/write counts and load misses, while actions include Precharge, Activate, Read-CAS, Write-CAS, REF, and NOP commands. The agent receives a reward of 1 for read/write commands and 0 otherwise. SARSA [13, 16] updates Q-values [17] using a Cerebellar Model Articulation Controller (CMAC) function approximator [1] with overlapping coarse-grained Q-tables to handle the large state space. This approach enables adaptability to workload changes, optimizing scheduling decisions. However, it focuses solely on scheduling, neglecting important parameters such as arbitration, refresh policies, page policies, scheduler buffer policies, and the maximum number of permitted active transactions. Furthermore, the lack of explainability in the learned policies limits interpretability and reliability, highlighting the need for memory controllers that balance adaptability with transparency.

II-B Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

The Pythia [2] framework proposes a prefetcher for cache optimization using reinforcement learning. Pythia treats the prefetcher as an RL agent that, for each demand request, observes various types of program context information to make a prefetch decision. After each decision, Pythia receives a numerical reward that evaluates the quality of the prefetch, considering current memory bandwidth usage. This reward strengthens the correlation between the observed program context and the prefetch decisions, helping generate more accurate, timely, and system-aware prefetch requests in the future. The primary objective of Pythia is to discover the optimal prefetching policy that maximizes the number of accurate and timely prefetch requests while incorporating system-level feedback. The state space is a k-dimensional vector of program features, S ≡ (φ_1^S, φ_2^S, …, φ_k^S). The action is the selection of a prefetch offset from a set of pre-determined offsets.
The reward is calculated based on factors like Accurate and Timely, Accurate but Late, Loss of Coverage, Inaccurate, and No Prefetch [2].

II-C Reinforcement Learning using Reward Decomposition

In Explainable Reinforcement Learning via Reward Decomposition [8], the scalar reward in conventional reinforcement learning is decomposed into a reward vector, where each element represents the reward from a specific component. Say we have two possible actions a_1 and a_2 available to the agent in a given state s. The reward vector helps explain why an action a_1 is preferred over another a_2 in a state s. The explanation is provided through the Reward Difference Explanation (RDX), defined as:

Δ(s, a_1, a_2) = Q⃗(s, a_1) − Q⃗(s, a_2),  (1)

wherein each component Δ_c(s, a_1, a_2) represents the difference in expected return with respect to a component c. A positive Δ_c indicates an advantage of a_1 over a_2, and vice versa. When the reward components are numerous, the authors introduce the Minimal Sufficient Explanation (MSX). An MSX is a minimal subset of components whose cumulative advantage justifies the preference of one action over another. Specifically, an MSX for a_1 over a_2 is given by the smallest subset MSX⁺ such that:

∑_{c ∈ MSX⁺} Δ_c(s, a_1, a_2) > d,  (2)

where d is the total disadvantage from negatively contributing components:

d = −∑_{c : Δ_c(s, a_1, a_2) < 0} Δ_c(s, a_1, a_2).  (3)

To verify whether each component in MSX⁺ is necessary, a necessity value v is computed as:

v = ∑_{c ∈ MSX⁺} Δ_c(s, a_1, a_2) − min_{c ∈ MSX⁺} Δ_c(s, a_1, a_2).  (4)

If any subset of negative components has a total disadvantage exceeding v, then all elements of MSX⁺ are deemed necessary, leading to the formal definition:

MSX⁻ = argmin_M |M| s.t. ∑_{c ∈ M} −Δ_c(s, a_1, a_2) > v.  (5)

III ReLMXEL

Figure 2: ReLMXEL Framework

We now propose a strategy, Reinforcement Learning for Memory Controller with Explainable Energy and Latency Optimization (ReLMXEL), that operates within an RL setting. The memory controller serves as the environment, providing metrics such as latency, average power, total energy consumption, bandwidth utilization, bank and bank group switches, and row buffer (page) hits and misses. Latency is tracked per request to reflect internal delays. Average power and total energy are derived from DRAM state transitions and activity counters. Bandwidth utilization captures interface efficiency. Bank and bank group switches are logged to monitor access locality, and row buffer hits and misses indicate the effectiveness of row management. These metrics provide deep visibility into DRAM behavior and serve as observations for the RL agent, which computes per-metric rewards and selects actions to optimize overall DRAM performance.

Algorithm 1 ReLMXEL Algorithm
1: Input: Timesteps T, base seed s, threshold w, learning rate α, discount factor γ
2: Output: All Q-tables Q_i and R_C
3: Initialize ε_old, ε_new, R_C ← 0
4: for i = 1 to N do  ▷ N agents
5:   s_i ← s + i  ▷ Seed per agent
6:   Initialize Q_i(s, a_i, r)
7: end for
8: Initialize current state s_old
9: Select initial action vector a ← (a_1, …, a_N) using
10: the ε-greedy strategy
11: for t = 1 to T do
12:   Apply action a to memory controller
13:   Extract performance metrics (R_{j,obs})_{j=1}^{M}
14:   Compute rewards metric-wise using Eq. (6)
15:   if t < w then
16:     ε ← ε_old
17:   else
18:     ε ← ε_new
19:     R_C ← R_C + R_T  ▷ Cumulative Reward
20:   end if
21:   for i = 1 to N do  ▷ Each agent chooses action
22:     if random number < ε then
23:       a′_i ← random action for agent i
24:     else
25:       a′_i ← argmax_{a′_i} ∑_j Q_i(s_{old,i}, a′_i, r_j)
26:     end if
27:   end for
28:   a′ ← (a′_1, a′_2, …, a′_N)  ▷ Next Action
29:   Observe new state s_new  ▷ New State
30:   for i = 1 to N do
31:     for each reward r_j do
32:       Compute Q_i(s_{old,i}, a_i, r_j) using Eq. (8)
33:     end for
34:   end for
35:   s_old ← s_new
36:   a ← a′
37: end for
38: return All Q-tables Q_i, R_C

The actions consist of configurable DRAM parameters. PagePolicy (Open, OpenAdaptive, Closed, ClosedAdaptive) governs whether a row remains open or is closed immediately after access. Scheduler (FIFO, FR-FCFS, FR-FCFS Grp) defines how memory requests are prioritized and ordered to balance fairness and throughput. SchedulerBuffer (Bankwise, ReadWrite, Shared) determines how request queues are organized: by bank, by read/write separation, or as a shared buffer. Arbiter (Simple, FIFO, Reorder) selects which commands proceed to DRAM based on fixed priorities, order, or dynamic reordering to improve timing efficiency. RespQueue (FIFO, Reorder) controls the order in which responses are sent back to the requester. RefreshPolicy (NoRefresh, AllBank) manages how DRAM refresh operations are performed to maintain data integrity while minimizing interference. RefreshMaxPostponed (0, …, 7) and RefreshMaxPulledin (0, …, 7) allow the controller to delay or advance refreshes within limits to reduce conflicts with memory accesses. RequestBufferSize limits the number of outstanding requests the controller can hold, and MaxActiveTransactions (2^x where x = 0, …, 7) controls the number of concurrent active DRAM commands.
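As a rough illustration of why ReLMXEL assigns one agent per parameter, the parameter lists above can be enumerated and counted. This is a sketch, not code from the paper; RequestBufferSize is omitted because its value range is not enumerated in the text.

```python
import math

# Configurable DRAM parameters transcribed from the text above.
ACTION_SPACE = {
    "PagePolicy": ["Open", "OpenAdaptive", "Closed", "ClosedAdaptive"],
    "Scheduler": ["FIFO", "FR-FCFS", "FR-FCFS Grp"],
    "SchedulerBuffer": ["Bankwise", "ReadWrite", "Shared"],
    "Arbiter": ["Simple", "FIFO", "Reorder"],
    "RespQueue": ["FIFO", "Reorder"],
    "RefreshPolicy": ["NoRefresh", "AllBank"],
    "RefreshMaxPostponed": list(range(8)),
    "RefreshMaxPulledin": list(range(8)),
    "MaxActiveTransactions": [2 ** x for x in range(8)],
}

# A single agent choosing whole configurations would face the product
# of all option counts, while one agent per parameter (as in Algorithm 1)
# only ever ranks that parameter's own options.
joint = math.prod(len(v) for v in ACTION_SPACE.values())  # 221184 joint configs
factored = sum(len(v) for v in ACTION_SPACE.values())     # 41 per-parameter options
```

This factorization is what keeps the per-agent Q-tables small enough for an online tabular method.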
Through iterative interaction, the agent learns to tune DRAM parameters for optimal efficiency. Note that the framework is generalized and can be extended or adapted to various standards (DDR/GDDR/LPDDR, etc.) and generations, and to further policies such as SameBank Refresh, chopped burst length, etc. As described in Algorithm 1, each configurable parameter is associated with a Q-table [17]. The reward is calculated by the function:

R_X = R_target / |R_target − R_observed|,  (6)

wherein the subscript X corresponds to a performance metric, and R_target and R_observed correspond to the ideal value and the value observed at the current timestep, respectively. R_T is defined as:

R_T = ∑_{i=1}^{7} R_{X_i}, where X_i is a performance metric.  (7)

The Q-value [17], denoted Q(s, a), represents the expected cumulative reward for taking an action a in state s and following the current policy. These Q-values are stored in a Q-table, a lookup table organized such that each dimension corresponds to discrete states and possible actions for a specific DRAM parameter. During decision-making, the agent uses the current state and possible actions as indices to retrieve the associated Q-values, enabling efficient evaluation of expected rewards. The model follows the SARSA [16, 13] update rule to continuously improve its policy based on observed transitions:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],  (8)

where s_t and a_t are the current state and action, r_t is the received reward, s_{t+1} is the next state, and a_{t+1} is the next action chosen using the current policy. Here, α is the learning rate (0 < α ≤ 1) and γ is the discount factor (0 ≤ γ ≤ 1).
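The per-metric reward of Eq. (6) and the SARSA update of Eq. (8) can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the dictionary-backed Q-table and the epsilon guard for the exact-match case are assumptions.

```python
def metric_reward(target, observed, eps=1e-9):
    """Per-metric reward (Eq. 6): grows as the observed value approaches
    the target. The eps term guarding division by zero when
    observed == target is an assumption, not specified in the paper."""
    return target / (abs(target - observed) + eps)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA update (Eq. 8) on one (state, action) entry of a
    dictionary-backed Q-table (one such table per reward component)."""
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * Q.get((s_next, a_next), 0.0) - q_sa)

# Total reward R_T (Eq. 7) summed over hypothetical (target, observed) pairs:
readings = [(100.0, 92.0), (1.0, 0.8), (50.0, 49.0)]
r_total = sum(metric_reward(t, o) for t, o in readings)
```

Because the update bootstraps on the action actually taken next, the learned policy tracks the ε-greedy behavior policy, which suits online tuning inside a running controller.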
To guide the learning process, we define a warmup threshold w, representing the initial number of iterations focused on exploration; this allows the algorithm to adequately explore the memory controller parameter space before optimization begins. A base seed is used to generate a unique seed for each agent.

III-A Explainability of ReLMXEL

Following the approach of Juozapaitis et al. [8], in ReLMXEL the conventional scalar RL reward is decomposed into a vector representing system-level performance metrics, and the Q-function is decomposed into individual Q-values for each reward type. For a given state s, an action a_1 is selected over a_2 iff:

∑_c Q_c(s, a_1) > ∑_c Q_c(s, a_2).  (9)

To understand a decision further we use RDX, but this setup would lead us to consider every component of every action pair. To simplify, we apply the MSX as in II-C, which provides a rationale for selecting action a_1 over a_2 if

∑_{c ∈ MSX⁺} Δ_c(s, a_1, a_2) > d,  (10)

where d is the disadvantage from negatively contributing components. Consider an action a_1 that uses the open page policy and improves latency and bandwidth but negatively impacts energy, and another action a_2 that uses the closed page policy and offers a large improvement in energy but negatively impacts latency and bandwidth. MSX identifies the smallest subset of components that adequately justifies the preference for a_2. For example, if the energy improvement is substantial enough to outweigh the latency and bandwidth drawbacks, MSX explains the decision as 'the improvement in energy alone justifies the action, despite losses in other components'. Similarly, consider an action a_3 that uses the Simple arbitration policy and reduces energy consumption significantly but negatively impacts latency and bandwidth usage, while another action a_4 using the Reorder arbitration policy provides moderate improvements in both latency and bandwidth with a slight increase in energy consumption.
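The RDX and MSX computations (Eqs. 1–3) behind such explanations can be sketched directly. The decomposed Q-vectors below are hypothetical numbers chosen to mirror the arbitration scenario, not measured values from the paper.

```python
def rdx(q1, q2):
    """Reward Difference Explanation (Eq. 1): per-component Q gap."""
    return {c: q1[c] - q2[c] for c in q1}

def msx_plus(delta):
    """Minimal Sufficient Explanation (Eqs. 2-3): smallest set of
    positively contributing components whose summed advantage exceeds
    the total disadvantage d of the negative components. Greedily
    taking the largest advantages first yields a smallest such set."""
    d = -sum(v for v in delta.values() if v < 0)
    chosen, total = [], 0.0
    for c, v in sorted(delta.items(), key=lambda kv: -kv[1]):
        if v <= 0 or total > d:
            break
        chosen.append(c)
        total += v
    return chosen if total > d else []

# Hypothetical decomposed Q-vectors for the two arbitration actions:
q_a3 = {"energy": 9.0, "latency": -2.0, "bandwidth": -1.5}  # Simple arbiter
q_a4 = {"energy": -1.0, "latency": 1.0, "bandwidth": 1.0}   # Reorder arbiter
delta = rdx(q_a3, q_a4)  # {'energy': 10.0, 'latency': -3.0, 'bandwidth': -2.5}
print(msx_plus(delta))   # ['energy']: energy alone outweighs d = 5.5
```

An empty result means no positive subset can outweigh the disadvantages, i.e., the preference cannot be justified by Eq. (2).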
MSX could justify action a_3 by explaining: 'The significant reduction in energy consumption is enough to justify a_3 against the moderate improvements in latency and bandwidth of a_4.'

IV Experimental Setup and Results

We performed experiments with DDR4 memory [7] in the DRAMSys simulator [15], featuring a burst length of eight and four bank groups with four banks each, where each bank comprises 32,768 rows and 1,024 columns of 8 bytes per device. The system uses a single-channel, single-rank configuration made up of x8 DRAM devices. The baseline memory controller employs an OpenAdaptive page policy, which outperforms static open and closed policies [5], and uses the widely adopted FR-FCFS scheduling algorithm [12] with a bankwise scheduler buffer supporting up to eight requests. It also uses an all-bank refresh policy with up to eight postponed and eight pulled-in refreshes. The controller manages up to 128 active transactions, and an arbitration unit reorders incoming requests. We consider traces, generated using Intel's Pin tool [11], from the GEMM [9] and STREAM [10] benchmarks and Breadth First Search (BFS). GEMM represents dense linear algebra operations, while STREAM consists of vector-based operations; both exhibit computational patterns characteristic of ML workloads. Additionally, we use traces from the SPEC CPU 2017 [14] suite. The highly memory-intensive applications, namely fotonik_3d_s, mcf_s, lbm_s, and roms_s, stress the memory hierarchy due to their large data sets and frequent memory accesses. The compute-intensive workloads include xalancbmk_s and gcc_s, which involve heavy computation for tasks such as XML transformations and code compilation. The omnetpp_s workload requires intensive processing for network simulations while handling large amounts of simulation data, placing equal strain on the CPU and memory system.
The SPEC CPU 2017 traces are generated using the ChampSim [4] simulator; the traces are captured by monitoring last-level cache misses during simulations that execute at least ten billion instructions. The DRAMSys simulator, integrated with DRAMPower [3], provides performance metrics such as latency, average power consumption, total energy usage, and average and maximum bandwidth. To gain deeper insight into memory behavior, we also extract additional metrics, including the number of bank group switches, which occur when the memory controller switches between different bank groups within the DRAM, and bank switches, which refer to switching between different banks within a bank group. Additionally, we track row buffer hits, instances where the requested data is already in the row buffer, and row buffer misses, where the data is not in the buffer and must be fetched from the corresponding row at additional cost.

IV-A Results

| Workload | Time Steps | Threshold w | Baseline Reward | ReLMXEL Reward | Average Energy (%) | Average Bandwidth (%) | Average Latency (%) |
|---|---|---|---|---|---|---|---|
| STREAM | 20170 | 16000 | 15555.06 | 17597.07 | 3.84 | 8.39 | 0.23 |
| GEMM | 19468 | 17000 | 6572.88 | 7121.46 | 3.83 | 4.95 | 0.01 |
| BFS | 17995 | 14000 | 9673.14 | 10842.41 | 7.66 | 7.22 | -0.03 |
| fotonik_3d | 20770 | 17000 | 4870.89 | 9165.52 | 7.66 | 2.90 | 0.07 |
| xalancbmk | 16494 | 14000 | 3092.9 | 3320.38 | 7.68 | 107.03 | -0.02 |
| gcc | 17863 | 14000 | 9154.29 | 9556.25 | 7.66 | 1.70 | -0.24 |
| roms | 17563 | 14000 | 8017.8 | 13554.84 | 7.67 | 35.63 | 0.08 |
| mcf | 17894 | 14000 | 6013.5 | 6075.53 | 7.67 | 40.19 | -4.43 |
| lbm | 18473 | 15000 | 5496.77 | 14934.6 | 7.67 | 26.73 | 0.05 |
| omnetpp | 16682 | 14000 | 4743.99 | 6688.05 | 4.06 | 138.78 | -0.09 |

TABLE I: Comparison of baseline and ReLMXEL performance

The experiments use a discount factor (γ) of 0.9 and a learning rate (α) of 0.1. These values were chosen based on a design space exploration across γ ∈ {0.9, 0.95, 0.99} and α ∈ {0.01, 0.1, 0.3, 0.5, 0.6, 0.7, 0.8}.
While each workload has its own optimal (γ, α) pair, the combination providing the highest reward across all workloads is used for all subsequent evaluations. We also introduce a trace-split parameter that segments the trace file into fixed-size partitions. After each partition, the model makes decisions about the parameters and takes feedback from SARSA via the reward vector and Q-tables, improving performance for the next timestep. Through experimentation, we set the trace-split parameter to 30,000 and the exploration parameter ε_new to 0.001, as values like 0.01 hinder convergence due to excessive randomness, while 0.0001 limits exploration, slowing recovery from suboptimal choices. The percentage improvements are computed relative to the baseline as follows. For the energy and latency metrics, the improvement is calculated as

Improvement (%) = (Baseline − ReLMXEL) / Baseline × 100,

so that a positive value indicates a reduction compared to the baseline. For the bandwidth metric, the improvement is calculated as

Improvement (%) = (ReLMXEL − Baseline) / Baseline × 100,

so that a positive value indicates an increase compared to the baseline.

Figure 3: Average energy consumption (pJ/10^9) per workload, baseline vs. ReLMXEL.

The % improvements of average energy, bandwidth, and latency in Table I show that ReLMXEL consistently outperforms the baseline across all workloads. ReLMXEL achieves high bandwidth utilization and reduced latency, while also exhibiting slightly better energy efficiency than the baseline in memory-bound workloads such as STREAM and GEMM.
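The two improvement conventions above can be captured in one small helper; the sample readings below are hypothetical raw values, not numbers taken from Table I.

```python
def improvement(baseline, relmxel, higher_is_better=False):
    """Percent improvement relative to the baseline. For energy and
    latency (lower is better), positive means a reduction; for
    bandwidth (higher is better), positive means an increase."""
    if higher_is_better:
        return (relmxel - baseline) / baseline * 100.0
    return (baseline - relmxel) / baseline * 100.0

# Hypothetical raw readings:
energy_gain = improvement(1000.0, 923.4)                      # about 7.66
bandwidth_gain = improvement(50.0, 70.0, higher_is_better=True)  # about 40.0
```

Flipping the sign convention per metric keeps "positive = better" uniform across all three columns of Table I.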
It also performs well in bandwidth utilization and energy efficiency for irregular and graph-based workloads, including BFS, fotonik_3d, and roms, as well as for compute-intensive workloads such as xalancbmk, gcc, and lbm, reflecting optimized computation scheduling. Workloads with high memory traffic or communication demands, including mcf and omnetpp, achieve improvements in energy consumption and bandwidth utilization; however, a slight increase in latency indicates a trade-off between energy efficiency and data transfer overhead.

Figure 4: Average bandwidth utilization (Gb/s) per workload, baseline vs. ReLMXEL.

Figure 5: Average latency (ps/10^6) per workload, baseline vs. ReLMXEL.

Figures 3, 4, and 5 illustrate how ReLMXEL's dynamic tuning incrementally optimizes memory controller parameters through a step-by-step, feedback-driven process that adapts to real-time workload characteristics. Leveraging a multi-agent reinforcement learning framework with explainability, it balances competing objectives to optimize overall system performance. As a result, significant reductions in energy consumption and gains in bandwidth are achieved across diverse workloads, particularly for memory-bound and irregular access patterns, without substantial latency degradation. This minimal impact on latency demonstrates that ReLMXEL successfully navigates the trade-offs inherent in system optimization, proving the effectiveness of its adaptive, feedback-driven parameter tuning in delivering balanced and robust performance improvements.

V Conclusion and Future Directions

The proposed ReLMXEL-based memory controller achieves enhanced efficiency and transparency. The proposed RL framework optimizes memory controller parameters while decomposing rewards to model energy, bandwidth, and latency trade-offs.
Experimental results showed significant performance improvements across diverse workloads, confirming the framework's ability to balance competing system objectives. This integration of adaptive learning with interpretable decision-making marks a key advancement in memory systems, paving the way for future research into self-optimizing, high-performance architectures with explainability. As RL optimizes memory controller parameters and enables adaptive responses to dynamic workloads, RL-based optimization can be extended to heterogeneous memory architectures, such as hybrid non-volatile memory systems, to assess its robustness in real-world scenarios. Integrating RL with hardware-in-the-loop setups allows real-time interaction with actual hardware, bridging the gap between simulation and real-world deployment. Additionally, RL can help in the efficient detection and mitigation of DRAM security threats such as RowHammer attacks, by identifying malicious memory access patterns and adjusting memory access strategies to prevent data corruption or security breaches.

References

[1] J. Albus (1975) A new approach to manipulator control: the Cerebellar Model Articulation Controller (CMAC).
[2] R. Bera, K. Kanellopoulos, A. Nori, T. Shahroodi, S. Subramoney, and O. Mutlu (2021) Pythia: a customizable hardware prefetching framework using online reinforcement learning. In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1121–1137.
[3] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens (2014) DRAMPower: open-source DRAM power & energy estimation tool. http://w.drampower.info, accessed April 2025.
[4] N. Gober, G. Chacon, L. Wang, P. V. Gratz, D. A. Jimenez, E. Teran, S. Pugsley, and J. Kim (2022) The Championship Simulator: architectural simulation for education and competition. arXiv:2210.14324.
[5] Intel Corporation (2024) Performance differences for open-page / close-page policy.
[6] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana (2008) Self-optimizing memory controllers: a reinforcement learning approach. In 2008 International Symposium on Computer Architecture, pp. 39–50.
[7] JEDEC (2021) DDR4 SDRAM standard.
[8] Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez (2019) Explainable reinforcement learning via reward decomposition. In Proceedings of the IJCAI Workshop on Explainable Artificial Intelligence.
[9] A. Lokhmotov (2015) GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication. arXiv:1511.03742.
[10] J. D. McCalpin (1995) Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25.
[11] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn (2004) PIN: a binary instrumentation tool for computer architecture research and education. In Proceedings of the 2004 Workshop on Computer Architecture Education (WCAE '04), New York, NY, USA.
[12] S. Rixner (2004) Memory controller optimizations for web servers. In 37th International Symposium on Microarchitecture (MICRO-37), pp. 355–366.
[13] G. A. Rummery and M. Niranjan (1994) On-line Q-learning using connectionist systems. Vol. 37, University of Cambridge, Department of Engineering, Cambridge, UK.
[14] Standard Performance Evaluation Corporation (2017) SPEC CPU 2017 benchmark suite.
[15] L. Steiner, M. Jung, F. S. Prado, et al. (2022) DRAMSys4.0: an open-source simulation framework for in-depth DRAM analyses. International Journal of Parallel Programming 50, pp. 217–242.
[16] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA.
[17] C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8(3), pp. 279–292.
[18] N. Wu and Y. Xie (2022) A survey of machine learning for computer architecture and systems. ACM Computing Surveys 55(3), pp. 1–39.