Paper deep dive
Quantum entanglement provides a competitive advantage in adversarial games
Peiyong Wang, Kieran Hymas, James Quach
Abstract
Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zero-sum reinforcement learning is particularly challenging, as success requires modelling dynamic interactions between opposing agents rather than static state-action mappings. Here, we conduct a controlled study isolating the role of quantum entanglement in a quantum-classical hybrid agent trained on Pong, a competitive Markov game. An 8-qubit parameterised quantum circuit serves as a feature extractor within a proximal policy optimisation framework, allowing direct comparison between separable circuits and architectures incorporating fixed (CZ) or trainable (IsingZZ) entangling gates. Entangled circuits consistently outperform separable counterparts with comparable parameter counts and, in low-capacity regimes, match or exceed classical multilayer perceptron baselines. Representation similarity analysis further shows that entangled circuits learn structurally distinct features, consistent with improved modelling of interacting state variables. These findings establish entanglement as a functional resource for representation learning in competitive reinforcement learning.
Tags
Links
- Source: https://arxiv.org/abs/2603.10289v1
- Canonical: https://arxiv.org/abs/2603.10289v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/13/2026, 1:09:04 AM
Summary
This paper investigates the role of quantum entanglement in reinforcement learning by comparing quantum-classical hybrid agents in the competitive Markov game 'Pong'. The study evaluates separable, CZ-entangled, and IsingZZ-entangled parameterised quantum circuits (PQCs) as feature extractors within a proximal policy optimisation (PPO) framework. Results demonstrate that entangled circuits consistently outperform separable counterparts and can match or exceed classical multilayer perceptron baselines in low-capacity regimes, suggesting that entanglement serves as a functional resource for representation learning in competitive environments.
Entities (6)
Relation Signals (3)
Entangled circuits → outperform → Separable circuits
confidence 95% · Entangled circuits consistently outperform separable counterparts with comparable parameter counts
Parameterised Quantum Circuit → serves as feature extractor for → Proximal Policy Optimisation
confidence 95% · An 8-qubit parameterised quantum circuit serves as a feature extractor within a proximal policy optimisation framework
Quantum entanglement → improves performance in → Reinforcement Learning
confidence 90% · These findings establish entanglement as a functional resource for representation learning in competitive reinforcement learning.
Cypher Suggestions (2)
Find all quantum architectures that outperform classical baselines · confidence 85% · unvalidated
MATCH (a:Architecture)-[:OUTPERFORMS]->(b:Baseline {type: 'Classical'}) WHERE a.is_quantum = true RETURN a.name, a.performance_metric
Map the relationship between entangling gates and performance · confidence 80% · unvalidated
MATCH (g:Gate)-[:USED_IN]->(a:Architecture), (a)-[:ACHIEVES]->(p:Performance) RETURN g.name, avg(p.value) as avg_performance
Full Text
54,375 characters extracted from source content.
Quantum entanglement provides a competitive advantage in adversarial games

Peiyong Wang 1, Kieran Hymas 1, James Quach 1

1 Manufacturing, CSIRO Clayton, Research Way, Clayton, 3168, VIC, Australia. Contributing authors: Peiyong.Wang@csiro.au; Kieran.Hymas@csiro.au; James.Quach@csiro.au

Abstract

Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zero-sum reinforcement learning is particularly challenging, as success requires modelling dynamic interactions between opposing agents rather than static state–action mappings. Here, we conduct a controlled study isolating the role of quantum entanglement in a quantum–classical hybrid agent trained on Pong, a competitive Markov game. An 8-qubit parameterised quantum circuit serves as a feature extractor within a proximal policy optimisation framework, allowing direct comparison between separable circuits and architectures incorporating fixed (CZ) or trainable (IsingZZ) entangling gates. Entangled circuits consistently outperform separable counterparts with comparable parameter counts and, in low-capacity regimes, match or exceed classical multilayer perceptron baselines. Representation similarity analysis further shows that entangled circuits learn structurally distinct features, consistent with improved modelling of interacting state variables. These findings establish entanglement as a functional resource for representation learning in competitive reinforcement learning.

Keywords: Quantum Machine Learning, Reinforcement Learning, Artificial Intelligence

1 Main

Competitive environments arise in many fields, from quantitative trading [1, 2] to modern reinforcement learning [3–5], where one player's gain is usually another's loss.
In game theory, the simplest abstraction of this antagonism is the two-player zero-sum game, whose value under optimal play is characterised by the minimax theorem [6]. Many competitive decision problems are sequential rather than one-shot, and can be formalised as (zero-sum) stochastic games—also known as Markov games—in which two agents repeatedly act in a shared state with stochastic transitions [7, 8]. This perspective aligns naturally with modern RL, where self-play in competitive games has powered several landmark results [3–5]. Pong provides a particularly controlled example: it is a classic two-player competitive game, and common RL benchmark formulations assign rewards that are largely anti-symmetric between the two sides (e.g., ±1 on scoring), making it a close approximation to a zero-sum Markov game [9–11]. These competitive benchmarks therefore offer a natural testbed for probing whether uniquely quantum resources—most notably entanglement—can yield measurable benefits for learning and decision making.

Quantum computing has long been anticipated to deliver improvements over classical algorithms across a range of domains, including optimisation, quantum system simulation, and machine learning, by offering advantages in computational complexity and predictive performance. A canonical example is Grover's search algorithm, where quantum superposition enables an oracle to evaluate all database entries in parallel, yielding a quadratic speedup with O(√N) time complexity [12]. Beyond superposition, other non-classical quantum resources have also been shown to provide advantages: quantum contextuality underpins the memory advantage of contextual recurrent neural networks (CRNNs) over classical recurrent neural networks (RNNs) [13]. Among these resources, quantum entanglement is widely regarded as a key enabler of quantum advantage.
Entanglement enables models and algorithms to exploit quantum correlations that lack classical counterparts, potentially enhancing the expressive power and learning efficiency of quantum machine learning (QML) architectures. To date, however, systematic empirical evidence for such benefits remains limited. Existing benchmarking studies have primarily focused on supervised learning, particularly (binary) classification tasks [14]. Within this setting, the data reuploading model [15], also known as the quantum Fourier model (QFM) [16], stands out as one of the few quantum models for which entanglement has been shown to improve classification accuracy [14]. These results suggest that entanglement can meaningfully enhance the performance of variational quantum models, motivating further investigation of its impact beyond supervised learning scenarios.

Research on quantum reinforcement learning (QRL) has established several forms of provable quantum advantage, but these proofs typically rely on settings where a quantum agent interacts with a quantum environment or communicates through a quantum channel. For example, [17] demonstrates that access to quantum channels can yield algorithmic speed-ups for QRL agents. In [18], the authors introduced a quantum deep deterministic policy gradient algorithm for quantum state generation and eigenvalue problems. Quantum signals can also be harnessed to enhance the average reward outcomes in the infinite-horizon Markov decision problem [19]. Such results, however, do not readily extend to classical RL environments, which are far more complex and less structured than the idealised models commonly assumed in theoretical quantum machine learning analyses. Most proofs of quantum advantage in machine learning rely on specific structures in the data that can be efficiently utilised by a quantum machine learning model, but not by a classical model.
Popular choices for such structures are those related to cryptography with a (classical) hardness proof, such as data constructed from the decisional Diffie-Hellman assumption [20], from weak pseudorandom functions [21], and classical data constructed from the discrete logarithm problem [22]. The hidden subgroup problem is also a popular choice for proving quantum advantage in learning from classical data [23]. However, in modern artificial intelligence research, the machine learning model is expected to adaptively discover and utilise unknown structures in the training data with as few assumptions or biases as possible [24–26].

Despite this limitation, a growing body of work has examined quantum agents acting within classical environments [27–33]. Prior studies have explored the use of parameterised quantum circuits (PQCs) for Q-function and value-function approximation [27, 30], as policy networks trained using REINFORCE [28], and as components of the actor and critic networks for the soft actor-critic algorithm [33] or the proximal policy optimisation (PPO) algorithm [30]. These works demonstrate that PQCs can, in principle, serve as function approximators within standard RL pipelines. However, they do not address a fundamental question: does quantum entanglement improve agent performance in classical RL tasks? Since entanglement is the resource most often associated with quantum advantage, answering this question is timely for quantum machine learning. No experimental benchmarking study has isolated and evaluated the role of entanglement in PQC-based RL agents operating in classical environments. Existing works typically introduce entanglement implicitly or treat it as a design choice rather than a variable of interest.
Instead of using the PQC solely as a Q-function approximator or a policy function, we adopt an 8-qubit QFM circuit as a backbone, providing features for downstream actor and critic heads, following the construction of the PPO agent [34] in CleanRL [35]. The architecture of our agent can be found in Fig. 1. We compare the performance of our hybrid network with four qualitatively different backbone networks:

• A classical multi-layer perceptron (MLP) with three layers in total: one input, one hidden, and one output layer.
• A separable data reuploading PQC, with only single-qubit gates.
• An entangled data reuploading PQC, whose entangling gate is the controlled-Z (CZ) gate.
• An entangled data reuploading PQC, whose entangling gate is the trainable IsingZZ gate: $R_{ZZ}(\theta) = e^{-i\frac{\theta}{2} Z \otimes Z}$. (1)

Each backbone takes an 8-dimensional observation vector as input, encoding the left and right paddle positions p_l, p_r, the ball position b_x, b_y and velocity v_{b,x}, v_{b,y}, and the scores s_l, s_r. The output is a new 8-dimensional feature vector representing the features extracted from the game state, which is passed to the classical actor-critic framework for policy implementation and scoring.

[Fig. 1 schematic: Pong environment (PufferLib) → 8-dim observation → backbone network (classical or quantum) → 8-dim features → actor network and critic network; panels (a)–(d) depict the four backbone variants.]

Fig. 1: The overall architecture of our quantum-classical hybrid agent.
The observation of the environment is an 8-element vector [p_l, p_r, b_x, b_y, v_{b,x}, v_{b,y}, s_l, s_r], where p_l and p_r denote the positions of the left and right paddles, respectively; b_x and b_y are the coordinates of the ball; v_{b,x} and v_{b,y} are the velocities of the ball in the x- and y-directions, respectively; and s_l and s_r are the scores for the left and right paddles. A backbone network, whether classical or quantum, takes the observation vector as input and produces an 8-dimensional feature vector, which is shared by a classical actor network and a classical critic network. Based on the features provided by the backbone network, the actor network proposes an action, while the critic network provides a scalar evaluation of the current state. The four kinds of backbone network structure studied in this paper are shown in (a) to (d). (a): classical multi-layer perceptron (MLP); (b): 8-qubit separable parameterised quantum circuit; (c): 8-qubit parameterised quantum circuit with fixed entangling gates (controlled-Z gates); and (d): 8-qubit parameterised quantum circuit with trainable entangling gates (the IsingZZ gate $\exp(-i\frac{\theta}{2} Z_i \otimes Z_j)$). At the end of the parameterised quantum circuits, all qubits are measured with the Pauli X observable.

Each of these four types of backbone can be configured to have a different number of trainable parameters. For the classical MLP backbone, the total number of trainable parameters is controlled by the number of neurons in the hidden layer (the dimension of the hidden-layer representation), since the input and output layers both have 8 neurons. For the parameterised quantum circuits, the number of parameters is controlled by the number of layers. For all three types of PQCs, the majority of trainable parameters are encoded in single-qubit gates that can also take input data as arguments.
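The backbone-plus-heads wiring described above can be sketched in plain Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the toy dense layers, the choice of 3 actions (up / stay / down), and all names are ours; the backbone here is a stand-in for either the MLP or a simulated PQC feature extractor.

```python
import math
import random

random.seed(0)

def linear(in_dim, out_dim):
    """A toy dense layer: (weights, biases) with small random values."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b

def apply(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

# Backbone: any map from the 8-dim observation to an 8-dim feature vector.
# A single tanh layer stands in for the MLP or PQC feature extractor.
backbone = linear(8, 8)
actor_head = linear(8, 3)    # 3 actions is an assumption (up / stay / down)
critic_head = linear(8, 1)

def agent(obs):
    feats = [math.tanh(v) for v in apply(backbone, obs)]
    logits = apply(actor_head, feats)       # actor: action preferences
    value = apply(critic_head, feats)[0]    # critic: scalar state value
    return feats, logits, value

obs = [0.5, -0.5, 0.0, 0.2, 0.1, -0.3, 0.0, 0.0]  # [p_l, p_r, b_x, b_y, v_bx, v_by, s_l, s_r]
feats, logits, value = agent(obs)
print(len(feats), len(logits))
```

The key design point carried over from the paper is that the actor and critic share the same backbone features, so representation quality from the feature extractor affects both policy and value estimation.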
Each element of the observation vector x is first mapped to three values using six trainable parameters $w_i^{l,0}, w_i^{l,1}, w_i^{l,2}, b_i^{l,0}, b_i^{l,1}, b_i^{l,2}$:

$\phi_i^l(x_i) = [\,w_i^{l,0} x_i + b_i^{l,0},\; w_i^{l,1} x_i + b_i^{l,1},\; w_i^{l,2} x_i + b_i^{l,2}\,]^T$, (2)

where i indexes both the observation-vector element and the qubit, and l denotes the layer number. The three elements of $\phi_i^l(x_i)$ are then used as the arguments of the U3 gate:

$U(\phi_i^l(x_i)) = U_3(w_i^{l,0} x_i + b_i^{l,0},\; w_i^{l,1} x_i + b_i^{l,1},\; w_i^{l,2} x_i + b_i^{l,2})$, (3)

where

$U_3(\theta, \lambda, \delta) = \begin{pmatrix} \cos\frac{\theta}{2} & -e^{i\delta}\sin\frac{\theta}{2} \\ e^{i\lambda}\sin\frac{\theta}{2} & e^{i(\lambda+\delta)}\cos\frac{\theta}{2} \end{pmatrix}$. (4)

Hence, for separable and CZ-entangled PQCs, each layer has 6 × 8 = 48 trainable parameters. For IsingZZ-entangled PQCs, each layer has 6 × 8 + 8 = 56 trainable parameters.

2 Results

We systematically evaluated the effect of quantum entanglement on the performance of a quantum–classical hybrid reinforcement learning agent in the Pong environment. To isolate the role of entanglement, we compared three classes of quantum backbones (separable, CZ-entangled, and IsingZZ-entangled) against classical MLP baselines across a wide range of parameter counts. All reported results are averaged over 10 independent training runs with different random initialisations. Performance is evaluated using the episodic return at the final stage of training, where the maximum achievable return in Pong is +21 and the minimum is −21. A summary of all configurations and their final performance statistics is provided in Table 1.

2.1 Entangled PQC backbones consistently outperform separable circuits

Across all tested configurations, quantum backbones with entanglement substantially outperform separable PQCs with comparable numbers of parameters. This trend is evident in both the averaged learning curves (Fig. 2) and the final episodic returns reported in Table 1.
Table 1: Averaged performance for our hybrid agent with different parameter configurations. For each configuration, the statistics (mean and standard deviation) of agent performance are calculated from 10 random initialisations of the trainable parameters. For the Pong game, the maximum achievable return is 21 and the minimum is −21; for single runs before smoothing with an exponential moving average, the return values are always integers. For quantum backbones, the number of parameters is shown as number of parameters in a single layer × number of layers. For classical backbones, which all have three layers, the number of parameters is shown as input layer dimension × hidden layer dimension + hidden layer dimension × output layer dimension. The best performance of all configurations comes from the classical backbone with 4096 parameters, which has access to the same "hidden dimension size" as an 8-qubit quantum circuit, whose state vector is 2^8-dimensional. For every configuration other than the classical backbone with 4096 parameters, the minimum reward at the final step of training is −21, while the classical 4096-parameter MLP backbone has a minimum final reward of −5 across 10 random initialisations. The data in the "Final Return Max." column are from the raw training statistics before smoothing with a weighted exponential moving average, unlike those in Fig. 2 and Fig. 5.

| Backbone Type | Param. Num. | Final Return Mean | Final Return Std. | Final Return Max. |
| --- | --- | --- | --- | --- |
| Separable | 48 = 48 × 1 | −17.9 | 4.72 | −8 |
| Separable | 96 = 48 × 2 | −17.9 | 4.68 | −8 |
| Separable | 144 = 48 × 3 | −19.5 | 4.50 | −6 |
| Separable | 192 = 48 × 4 | −21.0 | 0.00 | −21 |
| Separable | 240 = 48 × 5 | −20.4 | 1.80 | −15 |
| Separable | 288 = 48 × 6 | −20.8 | 0.60 | −19 |
| CZ-entangled | 48 = 48 × 1 | −10.0 | 13.39 | 13 |
| CZ-entangled | 96 = 48 × 2 | −2.1 | 15.02 | 17 |
| CZ-entangled | 144 = 48 × 3 | −1.0 | 12.61 | 17 |
| CZ-entangled | 192 = 48 × 4 | −10.0 | 12.07 | 16 |
| CZ-entangled | 240 = 48 × 5 | −14.8 | 4.96 | −3 |
| CZ-entangled | 288 = 48 × 6 | −15.7 | 6.29 | −2 |
| IsingZZ-entangled | 56 = 56 × 1 | −5.0 | 12.63 | 12 |
| IsingZZ-entangled | 112 = 56 × 2 | −9.7 | 10.91 | 9 |
| IsingZZ-entangled | 168 = 56 × 3 | −10.2 | 9.40 | 4 |
| IsingZZ-entangled | 224 = 56 × 4 | −14.4 | 8.34 | 2 |
| IsingZZ-entangled | 280 = 56 × 5 | −13.4 | 11.83 | 19 |
| IsingZZ-entangled | 336 = 56 × 6 | −16.8 | 6.71 | −1 |
| Classical | 64 = 8 × 4 + 4 × 8 | −8.7 | 11.79 | 8 |
| Classical | 128 = 8 × 8 + 8 × 8 | −6.7 | 9.41 | 11 |
| Classical | 256 = 8 × 16 + 16 × 8 | −3.0 | 9.69 | 12 |
| Classical | 336 = 8 × 21 + 21 × 8 | −0.8 | 11.14 | 14 |
| Classical | 4096 = 8 × 2^8 + 2^8 × 8 | 15.0 | 7.03 | 19 |

Separable PQC backbones exhibit consistently poor performance regardless of circuit depth. As shown in Fig. 2(a–f), increasing the number of layers does not lead to meaningful improvements in averaged episodic return; in several cases, performance collapses to the minimum achievable return of −21. This behaviour is also reflected in Table 1, where separable circuits achieve the lowest mean and maximum final returns across all parameter counts.

In contrast, both CZ-entangled and IsingZZ-entangled backbones achieve significantly higher returns. For the same number of layers, entangled PQCs learn faster and converge to better-performing policies, as demonstrated by the separation between entangled and separable learning curves in Fig. 2. Importantly, this improvement cannot be attributed to differences in single-qubit parameterisation, as separable and entangled PQCs share identical gate topologies that differ only in the absence or presence (respectively) of two-qubit entangling operations. This strongly suggests that entanglement is responsible for the observed performance gains.
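The gate definitions in Eqs. (1) and (4) and the parameter counts in Table 1 can be sanity-checked with a short stdlib-only sketch. All function names are ours; the assumption that the 8 extra IsingZZ parameters per layer come from one entangling angle per qubit pair in a ring is ours as well, inferred from the 56-per-layer count.

```python
import cmath
import math

def u3(theta, lam, delta):
    """U3 single-qubit gate as written in Eq. (4)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -cmath.exp(1j * delta) * s],
            [cmath.exp(1j * lam) * s, cmath.exp(1j * (lam + delta)) * c]]

def ising_zz(theta):
    """IsingZZ gate exp(-i (theta/2) Z x Z) from Eq. (1): diagonal, with
    phase e^{-i theta/2} on |00>,|11> and e^{+i theta/2} on |01>,|10>."""
    p, m = cmath.exp(-1j * theta / 2), cmath.exp(1j * theta / 2)
    return [[p, 0, 0, 0], [0, m, 0, 0], [0, 0, m, 0], [0, 0, 0, p]]

def is_unitary(M, tol=1e-9):
    """Check M @ M-dagger == I entry by entry."""
    n = len(M)
    return all(
        abs(sum(M[i][k] * complex(M[j][k]).conjugate() for k in range(n))
            - (1 if i == j else 0)) < tol
        for i in range(n) for j in range(n))

# Parameter counts as reported in Table 1: 6 encoding parameters per qubit
# (48 per layer), plus 8 IsingZZ angles per layer for the trainable variant.
def pqc_params(per_layer, n_layers):
    return per_layer * n_layers

def mlp_params(hidden, in_dim=8, out_dim=8):
    return in_dim * hidden + hidden * out_dim  # biases not counted, per Table 1

print(is_unitary(u3(0.7, 1.1, -0.3)), is_unitary(ising_zz(0.9)))
print(pqc_params(48, 6), pqc_params(56, 6), mlp_params(2 ** 8))
```

The last line reproduces the largest counts in Table 1: 288 for the deepest CZ-entangled circuit, 336 for the deepest IsingZZ circuit, and 4096 for the widest classical MLP.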
Without entanglement, the representations generated by the quantum circuit are restricted to element-wise nonlinear transformations of the input vector. In the entangled case, however, the representation is more flexible and can be extended by non-linear combinations of input elements.

[Fig. 2, panels (a)–(f): averaged episodic return versus training steps (up to 1.0 × 10^7) for separable, CZ-entangled, and IsingZZ-entangled backbones with 1 to 6 quantum layers, respectively.]

Fig. 2: The return of different quantum backbone configurations with respect to the total training steps. The curves for separable, CZ-entangled, and IsingZZ-entangled backbones are averaged over 10 runs with different random initialisations.
The averaged return curve and the max-min shading for returns are calculated and plotted from the raw data smoothed with a weighted exponential moving average. Although entangled quantum backbones, on average, outperform separable ones, it is unclear from the current results whether trainable entangling gates (IsingZZ) achieve better performance than fixed entangling gates (CZ). Even though, on average, the backbone with IsingZZ gates can outperform that with CZ gates (as in (a)), the best performance of quantum circuits with CZ gates exceeds that of backbones with IsingZZ gates.

2.2 Optimal performance occurs at shallow circuit depths

For entangled PQC backbones, increasing circuit depth does not monotonically improve performance. Instead, both averaged and maximum episodic returns peak at relatively shallow depths. For CZ-entangled backbones, the best overall performance is achieved with two to three layers, while deeper circuits lead to reduced average returns and higher variance across random initialisations. A similar pattern is observed for IsingZZ-entangled backbones, where the highest mean performance occurs at a single layer, although the maximum return across runs peaks at deeper layers.

These results suggest that deeper quantum circuits may suffer from optimisation difficulties, consistent with the barren plateau phenomenon [36], which can severely degrade gradient-based training. Reinforcement learning appears particularly sensitive to this effect, as the training signal is noisier and less structured than in supervised learning tasks. Importantly, the presence of trainable entangling gates (IsingZZ) does not guarantee superior performance compared to fixed entanglement (CZ). While IsingZZ backbones occasionally achieve higher peak returns, their average performance is comparable to or slightly worse than that of CZ-entangled circuits, indicating a trade-off between expressivity and trainability.
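The smoothing referred to in the figure captions (a weighted exponential moving average over raw episodic returns) can be sketched as follows. The weight of 0.9 is our assumption for illustration, since the paper does not state the value it used.

```python
def ema_smooth(values, weight=0.9):
    """TensorBoard-style weighted exponential moving average.
    weight=0.9 is an assumed setting, not taken from the paper."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Illustrative raw returns (integers, as noted in Table 1's caption).
raw_returns = [-21, -19, -20, -15, -10, -12, -5, -3]
print([round(s, 2) for s in ema_smooth(raw_returns)])
```

A higher weight damps the run-to-run noise that makes single RL training curves hard to compare, at the cost of lagging behind genuine improvements.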
2.3 Entangled PQCs can outperform classical baselines in the low-parameter regime

In the low-parameter regime, entangled quantum backbones can match or exceed the performance of classical MLP baselines, despite having comparable or even fewer trainable parameters (Fig. 3). Specifically:

• A 1-layer IsingZZ-entangled PQC with 56 parameters outperforms a classical MLP with 64 parameters (Fig. 3(a)).
• 2- and 3-layer CZ-entangled PQCs outperform the same classical baseline after averaging across random initialisations, albeit with slightly higher parameter counts (Fig. 3(b)).

In contrast, separable PQCs fail to outperform the classical baseline in any tested configuration, further reinforcing the central role of entanglement. These results demonstrate that quantum-enhanced feature extraction can provide tangible benefits when model capacity is constrained, a regime that is particularly relevant for near-term quantum hardware.

At higher parameter counts, classical networks ultimately achieve superior absolute performance. A large classical MLP with 4096 parameters consistently outperforms all quantum backbones, reflecting the greater flexibility and optimisation stability of classical models at scale (Fig. 5).
[Fig. 3, panels (a)–(c): averaged episodic return versus training steps (up to 1.0 × 10^7), comparing 1- to 6-layer IsingZZ-entangled, CZ-entangled, and separable backbones, respectively, against the 64-parameter classical baseline.]

Fig. 3: Comparing the (averaged) episodic performance of the three different quantum backbones (separable, CZ-entangled, and IsingZZ-entangled) with the classical MLP with 64 parameters. (a): The performance of the IsingZZ-entangled backbone, compared with the classical MLP backbone with 64 parameters. Each layer of the IsingZZ-entangled backbone has 56 parameters. In this configuration, only the 1-layer variant outperforms the classical baseline, and it does so with fewer parameters (56 vs 64). Increasing the number of layers (and hence the number of parameters) does not guarantee improved performance either. (b): The performance of the CZ-entangled backbone, compared with the classical MLP backbone with 64 parameters.
Among the different layer configurations, those with 2 and 3 layers outperform the classical baseline after averaging results across 10 random initialisations. Since CZ-entangled quantum circuits have 48 parameters per layer, the CZ-entangled backbones that outperform the classical baseline have 96 and 144 parameters, respectively. However, when the number of layers is increased beyond 3, the performance of the quantum-classical hybrid agent drops by half on average at the final iteration. (c): The performance of the separable backbone, compared with the classical MLP backbone. There is little difference in performance across the different layer configurations of the separable backbone, and they all have worse average returns than the 64-parameter MLP backbone.

2.4 Quantum and classical backbones learn fundamentally different representations

[Fig. 4: a CKA similarity heatmap over Separable, CZ-entangled, IsingZZ-entangled, and Classical backbones, with a colour scale from 0.2 to 1.0.]

Fig. 4: The similarities of representations generated by different backbone configurations, calculated with centred kernel alignment (CKA) [37]. The heatmap is divided into approximately sixteen blocks, representing intra-group (diagonal blocks) and inter-group (off-diagonal blocks) similarities. For intra-group similarities, the classical backbone generally produces similar representations, even though the hidden dimension differs. The CZ-entangled backbones produce representations with the most variability between different initialisations (intra-group blocks) and different backbone topologies (inter-group blocks). Generally, for classical neural networks, CKA yields very high similarity scores for representations produced by networks with the same structure, trained on the same dataset, but with different random initialisations, as shown in the bottom-right corner of the heatmap.
However, such intra-group similarity is much harder to find among the representations produced by the quantum backbones. Separable and IsingZZ-entangled PQC backbones have an overall slightly higher intra-group similarity score than the CZ-entangled ones. However, all three types of PQC-based backbone networks produce representations that share little similarity with those from the classical backbone networks.

To better understand the origin of the observed performance differences, we analysed the similarity of learned representations using centred kernel alignment (CKA) [37]. Classical MLP backbones produce highly similar representations across different random initialisations and hidden dimensions, consistent with prior observations in deep learning. In contrast, representations generated by quantum backbones exhibit low intra-group similarity, particularly for CZ-entangled circuits, indicating a high degree of representational diversity.

Moreover, all quantum backbones—separable and entangled—produce representations that are markedly dissimilar from those learned by classical networks. This suggests that PQCs do not simply approximate classical solutions, but instead explore qualitatively different regions of representation space. Among quantum architectures, separable PQCs yield representations that are more similar to classical ones than those produced by entangled circuits. Entangled PQCs, especially those with CZ gates, generate the most distinct representations, aligning with their superior task performance.

3 Discussion

In this work, we presented a controlled experimental study of the role of quantum entanglement in quantum–classical hybrid reinforcement learning agents operating in a classical environment.
By explicitly isolating entanglement as a design variable in parameterised quantum circuit (PQC) backbones, we demonstrated that entanglement is not merely an architectural embellishment, but a functional computational resource that can improve agent performance. Our results show that PQC backbones with entanglement consistently outperform separable PQCs with comparable parameter counts when learning to play Pong. This finding provides direct empirical evidence that entanglement enhances feature extraction for reinforcement learning tasks, addressing a central and previously unresolved question in quantum reinforcement learning research.

3.1 Why entanglement helps in classical reinforcement learning

The Pong environment used in this study provides an 8-dimensional observation vector, where each component encodes a distinct aspect of the game state (paddle positions and velocities, ball position, and ball velocity). Although these variables are presented independently, optimal decision making depends critically on their joint relationships—for example, how paddle motion should respond to the ball's trajectory.

In separable PQC backbones, each observation component is processed independently by a corresponding qubit, and the final representation is obtained through single-qubit measurements. In this setting, there is no mechanism for modelling interactions between different state variables during feature extraction. As a result, separable circuits struggle to construct useful representations, regardless of circuit depth.

Entangling gates fundamentally change this behaviour. Both CZ and IsingZZ entanglement introduce non-classical correlations between qubits, enabling the PQC to encode joint information about multiple input variables. From a functional perspective, entanglement acts as an interaction mechanism analogous to multiplicative feature coupling in classical neural networks.
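This coupling can be made concrete with a minimal two-qubit calculation of our own (not the paper's 8-qubit circuit): encode two inputs with RY rotations from |00⟩, optionally apply CZ, and measure ⟨X⟩ on the first qubit. Without CZ the expectation is sin(x0), independent of x1; with CZ it becomes sin(x0)·cos(x1), a multiplicative combination of the two inputs that no separable circuit can produce from single-qubit measurements.

```python
import math

def expval_x0(x0, x1, entangle):
    """Two qubits: RY(x0), RY(x1) applied to |00>, optional CZ, then <X> on
    qubit 0. All amplitudes stay real, so a 4-entry list suffices."""
    a, b = math.cos(x0 / 2), math.sin(x0 / 2)
    c, d = math.cos(x1 / 2), math.sin(x1 / 2)
    amp = [a * c, a * d, b * c, b * d]   # amplitudes of |00>, |01>, |10>, |11>
    if entangle:
        amp[3] = -amp[3]                 # CZ flips the sign of |11>
    # <X x I> pairs |0q> with |1q>: 2 * (amp00*amp10 + amp01*amp11)
    return 2 * (amp[0] * amp[2] + amp[1] * amp[3])

x0, x1 = 0.7, 1.1
print(expval_x0(x0, x1, entangle=False))  # sin(x0): no dependence on x1
print(expval_x0(x0, x1, entangle=True))   # sin(x0) * cos(x1): joint feature
```

In the Pong setting, x0 and x1 play the role of two observation components (say, paddle and ball coordinates); only the entangled readout carries information about their joint configuration.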
This provides a plausible explanation for the substantial performance gap observed between separable and entangled PQC backbones.

3.2 Representation learning beyond classical solutions

Our representation similarity analysis further supports the interpretation that entangled PQCs learn qualitatively different representations from classical neural networks. While classical MLP backbones converge to highly similar representations across different initialisations and hidden dimensions, quantum backbones, particularly those with entanglement, exhibit low intra-group similarity and minimal overlap with classical representations.

This observation suggests that entangled PQCs do not simply approximate classical feature extractors with fewer parameters. Instead, they explore distinct regions of representation space that are not easily accessible to classical architectures of comparable size. Importantly, the PQC representations that are most dissimilar from classical ones, those produced by CZ-entangled circuits, are also the ones that yield the strongest performance improvements.

These findings align with recent work on the expressive properties of quantum models, which argues that entanglement enables access to richer function spaces even when the number of trainable parameters is limited. Our results extend this perspective to reinforcement learning, where representation quality is critical for downstream policy optimisation.

3.3 Trade-offs between expressivity and trainability

Although entanglement improves performance, our results also highlight important limitations. Increasing circuit depth does not reliably improve learning outcomes, and deeper entangled PQCs often perform worse than their shallower counterparts. This behaviour is consistent with the barren plateau phenomenon, in which gradients vanish as circuit depth increases, severely hindering optimisation.
Interestingly, circuits with trainable entanglement introduced through IsingZZ gates do not consistently outperform those with fixed CZ entanglement. While IsingZZ-entangled circuits occasionally achieve higher peak returns, their average performance is comparable to or worse than that of CZ-based circuits. This suggests a trade-off between expressivity and trainability: additional trainable parameters increase model flexibility but also exacerbate optimisation difficulties. These results indicate that, for reinforcement learning tasks, shallow entangled circuits may represent a sweet spot, offering sufficient expressivity without rendering training intractable.

3.4 Quantum advantage in the low-parameter regime

One of the most practically relevant findings of this study is that entangled PQC backbones can outperform classical MLP baselines in the low-parameter regime. In particular, we observe configurations where entangled PQCs achieve higher average returns than classical networks with similar or even fewer parameters.

This result does not constitute a full quantum advantage in the complexity-theoretic sense. However, it does demonstrate a resource advantage: entanglement enables more effective use of limited parameter budgets. Such regimes are precisely those relevant to near-term quantum hardware, where circuit depth, coherence time, and parameter counts are tightly constrained.

At larger parameter counts, classical networks retain a decisive advantage. A sufficiently large MLP consistently outperforms all quantum backbones tested in this study, reflecting the greater expressive freedom and optimisation stability of classical models. A recent analysis of the Fourier fingerprints of the Quantum Fourier Model (QFM) [38] shows that, due to trainability constraints, QFM circuits admit only a polynomial scaling of independent trainable parameters with the number of qubits.
As a consequence, correlations arise among the model's Fourier coefficients, effectively restricting the number of frequencies that can be independently controlled by the parameters, despite the fact that the total number of accessible frequencies grows exponentially with the number of qubits. Classical models do not have such limitations, giving them more degrees of freedom. This result is neither surprising nor discouraging; rather, it underscores the importance of identifying application domains and operating regimes where quantum resources offer complementary benefits rather than wholesale replacement of classical methods.

3.5 Scope, limitations, and future directions

This study is intentionally narrow in scope. We focused on a low-dimensional classical environment to tightly control experimental variables and enable systematic benchmarking of entanglement. More complex environments, particularly those involving high-dimensional observations such as images or videos, would require either quantum models capable of direct high-dimensional encoding or hybrid pipelines that rely heavily on classical preprocessing.

To better reflect the competitive nature of Pong, an important next step is to move beyond training and evaluating against a single fixed opponent. For example, the agent could be trained and tested against a range of opponents, such as self-play (playing against a copy of itself) or an opponent pool consisting of multiple policies (e.g., different random seeds or saved checkpoints). Evaluating performance across this opponent set would help distinguish genuine improvements in learning from overfitting to one particular opponent behaviour, and would provide a more robust test of whether entanglement improves performance in competitive settings.

Scaling PQC-based reinforcement learning to such settings remains an open challenge.
Current quantum simulators and hardware are not yet capable of efficiently training large-scale quantum models with realistic data volumes. Moreover, optimisation challenges such as barren plateaus become increasingly severe as circuit size grows. These observations motivate several directions for future investigation, including:

• Alternative entanglement patterns and circuit topologies that improve trainability,
• Task domains where low-parameter efficiency is especially valuable,
• Multi-agent/self-play competitive evaluation, where both players learn (self-play or population-based training) and performance is reported not only as mean episodic return but also as robustness and stability under opponent adaptation (Markov-game perspective),
• Training strategies that mitigate barren plateaus in reinforcement learning settings,
• Other hybrid architectures that combine quantum feature extractors with classical representation learning.

Overall, our work demonstrates that quantum entanglement can meaningfully enhance reinforcement learning performance in classical environments, not by outperforming large classical models outright, but by enabling more efficient representation learning under constrained resources. Our findings position entanglement as a tangible and measurable contributor to quantum-enhanced decision making, and provide a concrete experimental foundation for future research in quantum reinforcement learning.

4 Methods

4.1 Proximal Policy Optimisation

Our implementation of the Proximal Policy Optimisation (PPO) algorithm is directly based on CleanRL [35].
The total objective of PPO that is optimised at each iteration is

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\right], \tag{5}$$

which makes use of the entropy $S[\pi_\theta](s_t)$ of the policy at state $s_t$ and the clipped surrogate objective [34]

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\right], \tag{6}$$

where

• $\theta$ denotes the parameters of the policy $\pi_\theta(a_t \mid s_t)$, and $a_t$ is the action taken at time $t$ given the observed environment state $s_t$.
• $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio.
• $\hat{A}_t$ is the estimated advantage. In [34], a truncated advantage estimator was used:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}. \tag{7}$$

Here, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $r_t$ is the reward and $V(s)$ is a learned state-value function. It is trained with a squared-error loss $L_t^{VF}$ between the predicted value and the discounted cumulative reward $V_t^{\mathrm{target}}$ associated with having taken action $a_t$ in state $s_t$, computed from the observations made throughout an experienced trajectory (episode), i.e. a sequence of interactions between an agent and the environment [39]. $\gamma$ is the temporal-difference discount factor, $T$ is the length of the episode, and $\lambda$ is a hyperparameter.
• $c_1, c_2$ are weighting coefficients.

4.2 Centred Kernel Alignment

Let $X \in \mathbb{R}^{n \times p_1}$ denote representations of dimension $p_1$ for $n$ inputs, and let $Y \in \mathbb{R}^{n \times p_2}$ denote representations of dimension $p_2$ for the same $n$ inputs. The similarity between the representations generated by two different neural networks for the same set of inputs can be measured with linear centred kernel alignment (linear CKA). Following [37], after centring $X$ and $Y$ to obtain $\hat{X}$ and $\hat{Y}$, the linear CKA score is

$$\mathrm{CKA}_{\mathrm{linear}} = \frac{\|A\|_F^2}{\|B\|_F\,\|C\|_F}, \tag{8}$$

where $A = \hat{Y}^T\hat{X}$, $B = \hat{X}^T\hat{X}$, $C = \hat{Y}^T\hat{Y}$, and $\|\cdot\|_F$ denotes the Frobenius norm for matrices.

Supplementary information.
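The clipped surrogate objective, the truncated advantage estimator, and the linear CKA score defined in this section can be sketched in a few lines of NumPy. This is our own illustrative reimplementation (the function names `clipped_surrogate`, `gae_advantages`, and `linear_cka` are ours), not the authors' CleanRL-based code:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    # Clipped surrogate objective: E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)],
    # with the probability ratio r_t computed from log-probabilities.
    ratio = np.exp(logp_new - logp_old)
    return np.mean(np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv))

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Truncated advantage: A_t = sum_k (gamma*lam)^k * delta_{t+k}, with
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` has length T+1.
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):   # backward recursion over the episode
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def linear_cka(X, Y):
    # Linear CKA between representations X (n x p1) and Y (n x p2),
    # computed after centring each column.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(Y.T @ X, 'fro') ** 2
            / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')))
```

By the Cauchy-Schwarz inequality the CKA score lies in [0, 1], and `linear_cka(X, X)` is exactly 1, which is why low intra-group scores for the quantum backbones indicate genuinely diverse representations rather than a scaling artefact.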
Table 2: Averaged episodic length for our hybrid agent with different parameter configurations. For each configuration, the statistics (mean and standard deviation) of the episode lengths are calculated from 10 random initialisations of the trainable parameters. For single runs, before smoothing with an exponential moving average, the number of steps in a single episode is always an integer. Note that episodic length is a less informative indicator of agent performance than the actual return obtained from the environment: since the agent plays against a computer-controlled paddle, a small episodic length could mean the agent finishes the opponent quickly, or is finished by the opponent quickly.

| Backbone Type | Params | Episodic Length Mean | Std. | Max. | Min. |
|---|---|---|---|---|---|
| Separable | 48 | 274.5 | 106.21 | 417 | 132 |
| Separable | 96 | 331.5 | 115.96 | 607 | 227 |
| Separable | 144 | 274.5 | 114.40 | 417 | 132 |
| Separable | 192 | 227.0 | 73.59 | 417 | 132 |
| Separable | 240 | 236.5 | 78.91 | 417 | 132 |
| Separable | 288 | 303.0 | 254.20 | 987 | 132 |
| CZ-entangled | 48 | 364.6 | 149.28 | 654 | 132 |
| CZ-entangled | 96 | 431.0 | 170.13 | 892 | 274 |
| CZ-entangled | 144 | 407.3 | 161.46 | 702 | 132 |
| CZ-entangled | 192 | 378.9 | 174.92 | 797 | 132 |
| CZ-entangled | 240 | 322.0 | 127.46 | 607 | 132 |
| CZ-entangled | 288 | 616.5 | 744.95 | 2792 | 132 |
| IsingZZ-entangled | 56 | 369.3 | 168.50 | 607 | 132 |
| IsingZZ-entangled | 112 | 317.1 | 179.60 | 702 | 132 |
| IsingZZ-entangled | 168 | 502.4 | 206.78 | 892 | 227 |
| IsingZZ-entangled | 224 | 697.2 | 628.90 | 2507 | 227 |
| IsingZZ-entangled | 280 | 649.7 | 560.60 | 1937 | 227 |
| IsingZZ-entangled | 336 | 521.5 | 508.85 | 1937 | 227 |
| Classical | 64 | 298.1 | 172.70 | 559 | 132 |
| Classical | 128 | 393.1 | 244.13 | 1034 | 132 |
| Classical | 256 | 421.5 | 173.21 | 892 | 227 |
| Classical | 336 | 473.8 | 280.79 | 1224 | 227 |
| Classical | 4096 | 544.8 | 190.02 | 939 | 274 |
Fig. 5: Comparing the (averaged) episodic performance of the three different quantum backbones (separable, CZ-entangled and IsingZZ-entangled) with respect to classical MLPs with 128, 256, 336 and 4096 parameters. Panels (a)-(l) plot averaged episodic return against training steps for each backbone at 1-6 quantum layers.

Acknowledgements. The authors acknowledge the Quantum Technology Future Science Platform for financial support. KH would also like to acknowledge the Revolutionary Energy Storage Systems Future Science Platform for support.

Declarations

• Conflict of interest/Competing interests: The authors declare no conflict of interest.
• Ethics approval and consent to participate: Not applicable.
• The authors give Nature Machine Intelligence consent for publication.
• Data availability: Data and code are available at https://github.com/peiyong-addwater/QRL-Pong
• Materials availability: Not applicable.
• Code availability: See https://github.com/peiyong-addwater/QRL-Pong
• Author contribution: PW wrote the code, conducted the numerical experiments and wrote the first version of this manuscript. JQ and KH designed and interpreted the results from the numerical experiments and refined the manuscript.

References

[1] Treynor, J.L.: What does it take to win the trading game? Fin. Anal. J. 37(1), 55-60 (1981). https://doi.org/10.2469/faj.v37.n1.55
[2] Treynor, J.: Zero sum. Fin. Anal. J.
55(1), 8-12 (1999). https://doi.org/10.2469/faj.v55.n1.2237
[3] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484-489 (2016). https://doi.org/10.1038/nature16961
[4] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354-359 (2017). https://doi.org/10.1038/nature24270
[5] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., Hassabis, D.: Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815 [cs.AI] (2017)
[6] Neumann, J.: Zur Theorie der Gesellschaftsspiele. Math. Ann. 100(1), 295-320 (1928). https://doi.org/10.1007/bf01448847
[7] Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. U.S.A. 39(10), 1095-1100 (1953). https://doi.org/10.1073/pnas.39.10.1095
[8] Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Machine Learning Proceedings 1994, pp. 157-163. Elsevier (1994). https://doi.org/10.1016/b978-1-55860-335-6.50027-1
[9] Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Int. Res. 47(1), 253-279 (2013)
[10] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning.
arXiv:1312.5602 [cs.LG] (2013)
[11] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nature 518(7540), 529-533 (2015). https://doi.org/10.1038/nature14236
[12] Grover, L.K.: A fast quantum mechanical algorithm for database search (1996). https://arxiv.org/abs/quant-ph/9605043
[13] Anschuetz, E.R., Hu, H.-Y., Huang, J.-L., Gao, X.: Interpretable quantum advantage in neural sequence learning. PRX Quantum 4(2), 020338 (2023). https://doi.org/10.1103/PRXQuantum.4.020338
[14] Bowles, J., Ahmed, S., Schuld, M.: Better than classical? The subtle art of benchmarking quantum machine learning models. arXiv:2403.07059 [quant-ph] (2024)
[15] Pérez-Salinas, A., Cervera-Lierta, A., Gil-Fuster, E., Latorre, J.I.: Data re-uploading for a universal quantum classifier. Quantum 4, 226 (2020). https://doi.org/10.22331/q-2020-02-06-226
[16] Schuld, M., Sweke, R., Meyer, J.J.: Effect of data encoding on the expressive power of variational quantum-machine-learning models. Phys. Rev. A 103(3), 032430 (2021). https://doi.org/10.1103/PhysRevA.103.032430
[17] Saggio, V., Asenbeck, B.E., Hamann, A., Strömberg, T., Schiansky, P., Dunjko, V., Friis, N., Harris, N.C., Hochberg, M., Englund, D., Wölk, S., Briegel, H.J., Walther, P.: Experimental quantum speed-up in reinforcement learning agents. Nature 591(7849), 229-233 (2021). https://doi.org/10.1038/s41586-021-03242-7
[18] Wu, S., Jin, S., Wen, D., Han, D., Wang, X.: Quantum reinforcement learning in continuous action space.
Quantum 9, 1660 (2025). https://doi.org/10.22331/q-2025-03-12-1660
[19] Ganguly, B., Xu, Y., Aggarwal, V.: Quantum speedups in regret analysis of infinite horizon average-reward Markov decision processes. In: Forty-second International Conference on Machine Learning (2025). https://openreview.net/forum?id=BDfBKk9CbE
[20] Sweke, R., Seifert, J.-P., Hangleiter, D., Eisert, J.: On the quantum versus classical learnability of discrete distributions. Quantum 5, 417 (2021). https://doi.org/10.22331/q-2021-03-23-417
[21] Pirnay, N., Sweke, R., Eisert, J., Seifert, J.-P.: Superpolynomial quantum-classical separation for density modeling. Phys. Rev. A 107(4), 042416 (2023). https://doi.org/10.1103/PhysRevA.107.042416
[22] Liu, Y., Arunachalam, S., Temme, K.: A rigorous and robust quantum speed-up in supervised machine learning. Nat. Phys. 17(9), 1013-1017 (2021). https://doi.org/10.1038/s41567-021-01287-z
[23] Wakeham, D., Schuld, M.: Inference, interference and invariance: How the quantum Fourier transform can help to learn from data. arXiv:2409.00172 [quant-ph] (2024)
[24] Sutton, R.S.: The Bitter Lesson. http://w.incompleteideas.net/IncIdeas/BitterLesson.html (2019). [Online; accessed 2025-12-10]
[25] Chollet, F.: On the measure of intelligence. arXiv:1911.01547 [cs.AI] (2019)
[26] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv:2001.08361 [cs.LG] (2020)
[27] Skolik, A., Jerbi, S., Dunjko, V.: Quantum agents in the gym: a variational quantum algorithm for deep Q-learning. Quantum 6, 720 (2022). https://doi.org/10.22331/q-2022-05-24-720
[28] Jerbi, S., Gyurik, C., Marshall, S.C., Briegel, H.J., Dunjko, V.: Parametrized quantum policies for reinforcement learning. In: Advances in Neural Information Processing Systems (2021).
https://openreview.net/forum?id=qGn3Rlgul5F
[29] Nico, M., Christian, U., George, Y., Georgios, K., Christopher, M., Daniel, D.S.: Benchmarking quantum reinforcement learning. arXiv:2501.15893 [quant-ph] (2025)
[30] Jin, Y.-X., Wang, Z.-W., Xu, H.-Z., Zhuang, W.-F., Hu, M.-J., Liu, D.E.: PPO-Q: Proximal policy optimization with parametrized quantum policies or values. arXiv:2501.07085 [quant-ph] (2025)
[31] Lazaro, J., Vazquez, J.-I., Garcia-Bringas, P.: Dissecting quantum reinforcement learning: A systematic evaluation of key components. arXiv:2511.17112 [quant-ph] (2025). https://doi.org/10.48550/arXiv.2511.17112
[32] Kubo, Y., Chalmers, E., Luczak, A.: Combining backpropagation with equilibrium propagation to improve an actor-critic reinforcement learning framework. Front. Comput. Neurosci. 16, 980613 (2022). https://doi.org/10.3389/fncom.2022.980613
[33] Kölle, M., Hagog, M., Ritz, F., Altmann, P., Zorn, M., Stein, J., Linnhoff-Popien, C.: Quantum advantage actor-critic for reinforcement learning. arXiv:2401.07043 [quant-ph] (2024). https://doi.org/10.48550/arXiv.2401.07043
[34] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG] (2017)
[35] Huang, S., Dossa, R.F.J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., Araújo, J.G.M.: CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research 23(274), 1-18 (2022)
[36] McClean, J.R., Boixo, S., Smelyanskiy, V.N., Babbush, R., Neven, H.: Barren plateaus in quantum neural network training landscapes. Nat. Commun. 9(1), 4812 (2018). https://doi.org/10.1038/s41467-018-07090-4
[37] Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited (2019).
https://arxiv.org/abs/1905.00414
[38] Strobl, M., Sahin, M.E., Horst, L., Kuehn, E., Streit, A., Jaderberg, B.: Fourier fingerprints of ansatzes in quantum machine learning. arXiv:2508.20868 [quant-ph] (2025)
[39] Bick, D.: Towards delivering a coherent self-contained explanation of proximal policy optimization. Master's thesis (2021). https://fse.studenttheses.ub.rug.nl/25709/