Paper deep dive
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
Sinan Ibrahim, Grégoire Ouerdane, Hadi Salloum, Henni Ouerdane, Stefan Streif, Pavel Osinenko
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/22/2026, 5:55:42 AM
Summary
The paper introduces a rigorous benchmarking framework for Reinforcement Learning (RL) by extending converse optimality to stochastic, discrete-time, control-affine, nonlinear systems. By providing necessary and sufficient conditions for a prescribed value function and policy to be optimal, the authors enable the systematic generation of benchmark environments with known ground-truth optima, allowing for absolute performance metrics like optimality gaps and regret.
Entities (5)
Relation Signals (3)
Stochastic Converse Optimality → enables → Benchmark Generation
confidence 95% · enabling the systematic generation of benchmark families via homotopy variations and randomized parameters.
Stochastic Converse Optimality → extends → Converse Optimality
confidence 95% · we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise.
Reinforcement Learning → uses → Hamilton-Jacobi-Bellman Equation
confidence 90% · This problem has stimulated the development of alternative, suboptimal methods like reinforcement learning (RL), that trade global optimality for reduced computational complexity.
Cypher Suggestions (2)
Identify mathematical foundations of the framework · confidence 90% · unvalidated
MATCH (f:Framework {name: 'Stochastic Converse Optimality'})-[:BASED_ON]->(m:MathematicalEquation) RETURN m
Find all frameworks related to RL benchmarking · confidence 85% · unvalidated
MATCH (f:Framework)-[:USED_IN]->(r:Field {name: 'Reinforcement Learning'}) RETURN f
Abstract
The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex, as outcomes and benchmarking of performance of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
Tags
Links
- Source: https://arxiv.org/abs/2603.17631v1
- Canonical: https://arxiv.org/abs/2603.17631v1
Full Text
71,781 characters extracted from source content.
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal Policies
Sinan Ibrahim1, Grégoire Ouerdane1, Hadi Salloum2, Henni Ouerdane1, Stefan Streif3, Pavel Osinenko1,4,5
1Skolkovo Institute of Science and Technology, Moscow, Russia. Email: sinan.ibrahim, gregoire.ouerdane, h.ouerdane, p.osinenko@skoltech.ru
2Innopolis University, Innopolis, Russia. Email: h.salloum@innopolis.ru
3Technische Universität Chemnitz, Chemnitz, Germany. Email: stefan.streif@etit.tu-chemnitz.de
4Central University, Moscow, Russia
5Sirius University of Science and Technology, Sochi, Russia
Corresponding author: Pavel Osinenko, p.osinenko@gmail.com.
Abstract
The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex, as outcomes and benchmarking of performance of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
I Introduction
The design of optimal feedback controllers for nonlinear systems is governed by the Hamilton-Jacobi-Bellman (HJB) equation, which is computationally intractable for all but the lowest-dimensional systems due to the curse of dimensionality [1]. This problem has stimulated the development of alternative, suboptimal methods such as reinforcement learning (RL), which trade global optimality for reduced computational complexity. However, without knowledge of true optimal solutions, there is no systematic methodology for testing and evaluating these alternative approaches [2]. The ability to benchmark and verify advanced control algorithms used in RL is complicated by a lack of high-dimensional nonlinear test problems with known optimal solutions. Moreover, the difficulty is increased by the strong dependence of results on reward shaping, the choice of system dynamics, stochasticity, and evaluation protocols [3, 4]. Existing benchmarks aggregate scores over heterogeneous suites [5, 6], while true optima remain unavailable, precluding reliable estimation of optimality gaps, characterization of sample efficiency, and diagnosis of failure modes. Converse optimality addresses this need by solving the converse problem instead of the HJB equation: given a target optimal value function and stage costs, what are the system dynamics and feedback laws for which Bellman's optimality equation holds globally? This approach has been successfully applied to generate RL benchmarks for deterministic systems to test control methods [7, 8, 9]: converse theorems yield a closed-form characterization of drifts that realize a prescribed value function [10]. However, a significant gap remains for stochastic systems, which are unaddressed due to the complexity that noise adds to solving the HJB equation.
The present work builds a new benchmarking pipeline for RL by extending converse optimality to a widely used class of stochastic, discounted, discrete-time, control-affine, nonlinear systems. Our contributions: (i) a discounted stochastic converse optimality framework with a theorem for the case of additive Gaussian noise and a Quadratic–Gaussian (QG) corollary; (ii) constructive benchmark generation via homotopy over control strength; and (iii) a paired evaluation protocol (Common Random Numbers, CRN) with baseline implementations, reporting absolute metrics (optimality gaps, regret) against certified optima.
Figure 1: Training dynamics snapshot: stage reward r = −c over learning for representative algorithms on a fixed fixture under CRN evaluation. The core comparisons in this paper use absolute, oracle-referenced metrics (OptGap, regret).
II Related Work
The need for reproducible evaluation in reinforcement learning has driven the creation of benchmarking tools, yet reproducible RL evaluation remains challenging. Henderson et al. [3] showed that the apparent performance of an algorithm is a function of often overlooked experimental conditions, such as hyperparameters and implementation differences [11]. This lack of reproducibility and insufficient statistical rigor hinders progress in RL. In response, newer benchmarks isolate specific facets of this unreliability. The Procgen benchmark confronts the tendency of models to memorize training environments by using procedural generation [5]. The D4RL benchmark provides datasets that realistically represent distributional shifts for offline RL [6]. A more recent contribution, Open RL Benchmark [12], offers a centralized repository of fully tracked RL experiments. It aggregates over 25,000 runs and documents performance results and experimental conditions. Despite its benefits, Open RL Benchmark faces the same fundamental limitation as its predecessors: the true globally optimal policy is unknown.
This makes it impossible to calculate the optimality gap, the most direct performance metric. Without this ground truth, evaluation is necessarily relative and can be confounded by high variance and sensitivity to protocol choices [4, 3]. Converse optimality enables absolute evaluation by constructing problems with certified optimal solutions. Several works have formalized this idea for both discrete-time [10] and continuous-time [13] deterministic settings, establishing conditions under which a given value function can be viewed as the solution to an optimal control problem. This forms the basis for a new benchmarking paradigm that replaces cumulative reward curves with rigorous statistical testing. Our work extends this paradigm to the stochastic setting, using benchmarks representative of real-world systems to provide a certified optimum and allow an absolute assessment of the optimality gap.
III Problem Setup and Assumptions
We consider a stochastic, discrete-time, control-affine, nonlinear system subject to additive Gaussian noise:
s_{k+1} = f(s_k) + g(s_k) a_k + w_k,
with state s ∈ ℝ^n, action a ∈ ℝ^m, internal dynamics f: ℝ^n → ℝ^n, input-coupling matrix g: ℝ^n → ℝ^{n×m}, and w an independent and identically distributed (i.i.d.), time-independent Gaussian noise process with zero mean and covariance matrix Σ. For a stationary policy π, mapping states to actions through π(s_k) = a_k, the discounted infinite-horizon cost-to-go is
V^π(s) = 𝔼[ Σ_{k=0}^∞ γ^k c(s_k, π(s_k)) | s_0 = s ],  γ ∈ (0,1),
where γ is the discount factor. The function c: ℝ^n × ℝ^m → ℝ_+ is the stage cost, separable into a state component (positive) q: ℝ^n → ℝ and a control component (positive, strictly convex) ρ: ℝ^m → ℝ, such that c(s,a) = q(s) + ρ(a).
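The system and cost above are easy to simulate directly. The sketch below (illustrative only: the problem data, the `rollout_cost` helper, and the stable linear drift standing in for f are all hypothetical, not from the paper) estimates the discounted cost-to-go V^π(s_0) by Monte Carlo:

```python
import numpy as np

def rollout_cost(f, g, policy, q, rho, Sigma, s0, gamma=0.95, horizon=200, rng=None):
    """One noisy rollout of s_{k+1} = f(s_k) + g(s_k) a_k + w_k, w_k ~ N(0, Sigma),
    accumulating the discounted stage cost q(s) + rho(a)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    L = np.linalg.cholesky(Sigma)  # sample w = L z with z ~ N(0, I)
    s, total = np.asarray(s0, float), 0.0
    for k in range(horizon):
        a = policy(s)
        total += gamma**k * (q(s) + rho(a))
        w = L @ rng.standard_normal(len(s))
        s = f(s) + g(s) @ a + w
    return total

# Toy instance (illustrative): stable linear drift, scalar action, quadratic costs.
n, m = 2, 1
A = 0.9 * np.eye(n)
f = lambda s: A @ s
g = lambda s: np.array([[0.0], [1.0]])
q = lambda s: s @ s
rho = lambda a: float(a @ a)
policy = lambda s: np.zeros(m)  # the zero policy, just for demonstration

v_hat = np.mean([rollout_cost(f, g, policy, q, rho, 0.01 * np.eye(n),
                              np.ones(n), rng=np.random.default_rng(i))
                 for i in range(20)])
print(f"Monte Carlo estimate of V^pi(s0): {v_hat:.3f}")
```

Averaging over independent seeded rollouts reduces the variance of the estimate; the point of the benchmark framework, by contrast, is that V* is available in closed form and no such estimation is needed for the oracle.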
The optimal value V* satisfies the stochastic Bellman equation
V*(s) = min_{a∈ℝ^m} { q(s) + ρ(a) + γ 𝔼[ V*(f(s) + g(s)a + w) ] }   (1)
and the optimal action a* is defined stepwise by
a*_k = argmin_{a∈ℝ^m} { q(s_k) + ρ(a) + γ 𝔼[ V*(s_{k+1}) ] },
satisfying also the first-order optimality condition:
∇_a ρ(a)|_{a=a*} + γ ∇_a 𝔼[ V*(f(s) + g(s)a + w) ]|_{a=a*} = 0.   (2)
The stochastic converse optimality problem can then be summarized as follows. Given a chosen optimal value function V*: ℝ^n → ℝ_+, an input-coupling function g, a stage cost function c(s,a) = q(s) + ρ(a), a discount factor γ ∈ (0,1), and a noise distribution w ~ 𝒩(0,Σ), find a function f: ℝ^n → ℝ^n such that the stochastic Bellman equation (1) holds for all s ∈ ℝ^n. To guarantee that the stochastic Bellman equation is well-posed and that the first-order optimality condition is valid, we make the following set of assumptions. These are extensions of standard assumptions in stochastic optimal control to the converse optimality setting.
Assumption 1 (Well-posedness and regularity). The following holds on ℝ^n:
1. q: ℝ^n → ℝ_+ is continuous.
2. ρ: ℝ^m → ℝ_+ is C² and μ-strongly convex: ∇²ρ(a) ⪰ μI for all a.
3. f, g are locally Lipschitz with at most polynomial growth.
4. The candidate value function V: ℝ^n → ℝ is C².
Assumption 2 (Growth and integrability). There exist a constant η > 0 and an integer ℓ ≥ 1 such that for all s ∈ ℝ^n,
|V(s)| + ‖∇V(s)‖ + ‖∇²V(s)‖ ≤ η(1 + ‖s‖^ℓ),
and, for each fixed (s,a) ∈ ℝ^n × ℝ^m, the expectations 𝔼[‖∇V(f(s)+g(s)a+w)‖] and 𝔼[‖∇²V(f(s)+g(s)a+w)‖] are finite under w ~ 𝒩(0,Σ).
IV Stochastic Converse Optimality
We now state the general converse result in the discounted, stochastic setting; a key technical step is the lemma justifying the interchange of expectation and differentiation in the first-order optimality condition.
Lemma 1. Under Assumptions 1 and 2, for all s ∈ ℝ^n and a ∈ ℝ^m,
∇_a 𝔼[ V(f(s)+g(s)a+w) ] = 𝔼[ g(s)^⊤ ∇V(f(s)+g(s)a+w) ].   (3)
Proof. Fix s ∈ ℝ^n and define φ(a,w) = V(f(s)+g(s)a+w). By the chain rule, we have
∇_a φ(a,w) = g(s)^⊤ ∇V(f(s)+g(s)a+w).
To interchange the derivative and the expectation, we verify the conditions of the dominated convergence theorem. Let e_1, …, e_m be the canonical basis vectors of ℝ^m. Without loss of generality, we choose a sequence {t_ν} converging to 0, such that we can define sequences {a_ν^(i)} through a_ν^(i) = a + t_ν e_i, converging to a along each coordinate axis i. By the mean value theorem, for each axis i there exists ξ_ν^(i) on the line segment between a_ν^(i) and a such that
φ(a_ν^(i), w) − φ(a,w) = ∇_a φ(ξ_ν^(i), w) · (a_ν^(i) − a),
and
|φ(a_ν^(i), w) − φ(a,w)| ≤ ‖∇_a φ(ξ_ν^(i), w)‖ · ‖a_ν^(i) − a‖.
Now, let us bound ‖∇_a φ(ξ_ν^(i), w)‖ for each axis i. From Assumption 2, we obtain
‖∇_a φ(ξ_ν^(i), w)‖ ≤ ‖g(s)‖ · ‖∇V(f(s)+g(s)ξ_ν^(i)+w)‖,
and hence
‖∇_a φ(ξ_ν^(i), w)‖ ≤ ‖g(s)‖ · η(1 + ‖f(s)+g(s)ξ_ν^(i)+w‖^ℓ).
Since ξ_ν^(i) lies in a bounded neighborhood of a (for ν sufficiently large), and f, g have at most polynomial growth (Assumption 1.3), there exist constants C_1(s) and C_2(s) such that
‖f(s)+g(s)ξ_ν^(i)+w‖ ≤ C_1(s) + C_2(s)‖ξ_ν^(i)‖ + ‖w‖ ≤ C(s,a) + ‖w‖,
where C(s,a) is independent of ν and ξ_ν^(i) (i.e., uniform over the neighborhood of a). This uniformity ensures the subsequent growth bound
‖∇_a φ(ξ_ν^(i), w)‖ ≤ η‖g(s)‖(1 + (C(s,a)+‖w‖)^ℓ) ≤ K(s,a)(1 + ‖w‖^ℓ),
where K(s,a)(1+‖w‖^ℓ) is independent of ν. This provides a single bound applicable to the entire sequence. The growth bound β: w ↦ K(s,a)(1+‖w‖^ℓ) is measurable. Let μ_Σ be the Gaussian measure. By definition of the expectation, the Lebesgue integral of β weighted by the Gaussian density is
∫_{ℝ^n} β(w) dμ_Σ(w) = 𝔼[K(s,a)(1+‖w‖^ℓ)].
The linearity of the expectation yields
𝔼[K(s,a)(1+‖w‖^ℓ)] = K(s,a)(1 + 𝔼[‖w‖^ℓ]),
and hence
∫_{ℝ^n} β(w) dμ_Σ(w) = K(s,a)(1 + ∫_{ℝ^n} ‖w‖^ℓ dμ_Σ(w)).
As all Gaussian moments are finite, we have
∫_{ℝ^n} ‖w‖^ℓ dμ_Σ(w) = 𝔼[‖w‖^ℓ] < ∞
for all ℓ > 0. Hence ∫_{ℝ^n} β(w) dμ_Σ(w) < ∞, and the growth bound β is Lebesgue-integrable with respect to the Gaussian measure. Since ξ_ν^(i) → a, continuity gives ∇_a φ(ξ_ν^(i), w) → ∇_a φ(a,w) pointwise. Hence, the bound ‖∇_a φ(ξ_ν^(i), w)‖ ≤ K(s,a)(1+‖w‖^ℓ) provides a suitable dominating function, and by the dominated convergence theorem we eventually obtain
lim_{ν→∞} 𝔼[∇_a φ(ξ_ν^(i), w)] = 𝔼[∇_a φ(a,w)].   (∗)
On the other hand, by definition of the gradient, we have
∇_a 𝔼[φ(a,w)] = ( ∂/∂a_1 𝔼[φ(a,w)], …, ∂/∂a_m 𝔼[φ(a,w)] ).
The directional derivative along the i-th axis is
∂/∂a_i 𝔼[φ(a,w)] = lim_{ν→∞} 𝔼[ (φ(a+t_ν e_i, w) − φ(a,w)) / t_ν ].
Along the i-th axis, we also have
(φ(a+t_ν e_i, w) − φ(a,w)) / t_ν = ∇_a φ(ξ_ν^(i), w) · e_i.
Combining these expressions yields
lim_{ν→∞} 𝔼[ (φ(a+t_ν e_i, w) − φ(a,w)) / t_ν ] = lim_{ν→∞} 𝔼[∇_a φ(ξ_ν^(i), w) · e_i] = 𝔼[∇_a φ(a,w) · e_i],
where the last equality follows from the dominated convergence argument (∗) above. But ∇_a φ(a,w) · e_i = ∂/∂a_i φ(a,w). Therefore
∂/∂a_i 𝔼[φ(a,w)] = 𝔼[∂/∂a_i φ(a,w)].
Finally, we get
∇_a 𝔼[φ(a,w)] = ( 𝔼[∂/∂a_1 φ(a,w)], …, 𝔼[∂/∂a_m φ(a,w)] ) = 𝔼[∇_a φ(a,w)]. ∎
Theorem 1 (Stochastic converse optimality). Fix a stage cost c(s,a) = q(s) + ρ(a) with q, ρ satisfying Assumption 1, a discount factor γ ∈ (0,1), a noise covariance Σ ≻ 0, an input coupling g: ℝ^n → ℝ^{n×m}, and a candidate value function V* ∈ C²(ℝ^n) satisfying Assumption 2. Then there exists a drift f: ℝ^n → ℝ^n (i.e., a system s_{k+1} = f(s_k) + g(s_k)a_k + w_k with w_k ~ 𝒩(0,Σ)) for which V* is the optimal value function if and only if there exists a policy π*: ℝ^n → ℝ^m such that for every s ∈ ℝ^n:
∇_a [ ρ(a) + γ 𝔼[ V*(f(s)+g(s)a+w) ] ] |_{a=a*} = 0,   (4)
V*(s) = q(s) + ρ(a*) + γ 𝔼[ V*(f(s)+g(s)a*+w) ],   (5)
where a* = π*(s). When such a π* exists, it is an optimal policy and V* satisfies the stochastic Bellman equation (1) for all s ∈ ℝ^n.
Proof. Sufficiency. Suppose there exist a policy π*: ℝ^n → ℝ^m and a drift f: ℝ^n → ℝ^n such that for every s ∈ ℝ^n, equations (4) and (5) hold, with π*(s_k) = a*_k for all s_k ∈ ℝ^n.
For a fixed s ∈ ℝ^n we define the Bellman integrand:
Φ_s(a) := q(s) + ρ(a) + γ 𝔼[ V*(f(s)+g(s)a+w) ].
By Lemma 1, we have
∇_a Φ_s(a) = ∇_a ρ(a) + γ 𝔼[ g(s)^⊤ ∇V*(f(s)+g(s)a+w) ].
Equations (4)–(5) give ∇_a Φ_s(a)|_{a=a*} = 0 and imply that the Bellman identity holds with action a*. In particular,
V*(s) = q(s) + ρ(a*) + γ 𝔼[ V*(f(s)+g(s)a*+w) ] = Φ_s(a*).
Since the Bellman equation involves the minimization of Φ_s(a) with respect to a, it follows that this minimum is attained at a*. Hence
min_{a∈ℝ^m} Φ_s(a) = Φ_s(a*) = q(s) + ρ(a*) + γ 𝔼[ V*(f(s)+g(s)a*+w) ] = V*(s).
Thus, π* is an optimal policy and V* is the optimal value function for the drift f. Now fix any stationary policy π and define its Bellman operator:
(T^π V)(s) := q(s) + ρ(π(s)) + γ 𝔼[ V(f(s)+g(s)π(s)+w) ].
By definition of the minimum, min_a Φ_s(a) ≤ Φ_s(π(s)), hence V*(s) ≤ (T^π V*)(s) for all s. Applying T^π to both sides of V* ≤ T^π V* and using the monotonicity of T^π (which follows from the positivity of costs), we obtain T^π V* ≤ (T^π)² V*. By induction, V* ≤ (T^π)^k V* for all k ∈ ℕ. Since T^π is a γ-contraction in an appropriate weighted norm (e.g., the sup norm under bounded costs, or a weighted norm under the discounted criterion), the Banach fixed-point theorem gives (T^π)^k V* → V^π, the value function of policy π. Taking limits preserves the inequality, yielding V* ≤ V^π on ℝ^n, proving the optimality of V* and π*.
Necessity. Assume V* is optimal, so for each s the minimizer π*(s) in (1) exists. By the measurable selection theorem [14], π* can be chosen to be measurable, and hence admissible as a feedback policy. At a minimum, we have ∇_a Φ_s(π*(s)) = 0.
Lemma 1 justifies interchanging the derivative and the expectation, yielding (4). The Bellman identity (5) holds by the definition of π*(s) = a* as the minimizer in (1). ∎
V Quadratic–Gaussian Specialization and Constructive Drifts
We now examine the Quadratic–Gaussian (QG) setting, in which the cost functions and the value function are quadratic. In this setting, a* will be shown to be unique. With explicit V* and a*, we can generate and validate numerous instances for RL evaluation. Let b be the certainty equivalence offset. We assume the following:
q(s) = s^⊤ Q s,  ρ(a) = a^⊤ R a,  V*(s) = s^⊤ P s + b,
with Q ⪰ 0, R ≻ 0, P ≻ 0, and w ~ 𝒩(0,Σ).
Certainty-equivalent optimal policy. We start from the Bellman integrand:
Φ_s(a) = q(s) + ρ(a) + γ 𝔼[ V*(f(s)+g(s)a+w) ].
For ease of reading, let z := z(s,a) = f(s) + g(s)a. As V*(s) = s^⊤ P s + b, we have
𝔼[V*(z+w)] = 𝔼[(z+w)^⊤ P (z+w) + b] = z^⊤ P z + 2 z^⊤ P 𝔼[w] + 𝔼[w^⊤ P w] + b.
Since w ~ 𝒩(0,Σ), we have 𝔼[w] = 0 and 𝔼[w^⊤ P w] = tr(PΣ), and we are left with
𝔼[V*(z+w)] = z^⊤ P z + tr(PΣ) + b.
Hence, it follows that
Φ_s(a) = s^⊤ Q s + a^⊤ R a + γ z^⊤ P z + γ tr(PΣ) + γ b,
and, as the other terms do not depend on a,
∇_a Φ_s(a) = ∇_a [ a^⊤ R a + γ z^⊤ P z ].
Expanding z^⊤ P z = f(s)^⊤ P f(s) + 2 a^⊤ g(s)^⊤ P f(s) + a^⊤ g(s)^⊤ P g(s) a and collecting terms,
∇_a Φ_s(a) = ∇_a [ a^⊤ B(s) a + 2γ a^⊤ g(s)^⊤ P f(s) + γ f(s)^⊤ P f(s) ],
where B(s) = R + γ g(s)^⊤ P g(s). Setting ∇_a Φ_s(a) = 0 yields
∇_a Φ_s(a) = 2 B(s) a + 2γ g(s)^⊤ P f(s) = 0.
Solving for a leads to the stationary optimal feedback
a* = −γ B(s)^{-1} g(s)^⊤ P f(s).
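The closed-form feedback can be checked numerically. The sketch below uses illustrative, randomly generated problem data (not the paper's), with g frozen at a single state; it computes a* = −γ B(s)^{-1} g(s)^⊤ P f(s) and confirms both that a* minimizes the a-dependent part of the Bellman integrand and that the minimized value equals f(s)^⊤ H_γ(s) f(s):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, gamma = 3, 2, 0.9

# Illustrative QG problem data: R > 0, P > 0.
R = np.eye(m)
M = rng.standard_normal((n, n)); P = M @ M.T + n * np.eye(n)
G = rng.standard_normal((n, m))          # g(s) frozen at one state
fs = rng.standard_normal(n)              # f(s) at the same state

B = R + gamma * G.T @ P @ G
a_star = -gamma * np.linalg.solve(B, G.T @ P @ fs)

# The a-dependent part of the Bellman integrand: a'Ra + gamma * z'Pz, z = f + G a.
def phi(a):
    z = fs + G @ a
    return a @ R @ a + gamma * z @ P @ z

# a* should beat random perturbations (phi is strictly convex since B > 0).
assert all(phi(a_star) <= phi(a_star + 0.1 * rng.standard_normal(m)) + 1e-12
           for _ in range(100))

# And the minimized value equals f' H_gamma f with the discounted metric.
H = gamma * (P - gamma * P @ G @ np.linalg.solve(B, G.T @ P))
print(np.isclose(phi(a_star), fs @ H @ fs))  # True
```

The same check is what the generator's SPD validation formalizes: B(s) ≻ 0 is exactly what makes the perturbation test above pass at every state.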
As ∇²_a Φ_s(a) = 2B(s) ≻ 0, Φ_s is strictly convex and its minimizer a* is unique.
Discounted metric and energy identity. The minimized a-dependent term
min_{a∈ℝ^m} [ a^⊤ R a + γ z^⊤ P z ] = f(s)^⊤ H_γ(s) f(s)
induces a state-dependent metric:
H_γ(s) = γ ( P − γ P g(s) B(s)^{-1} g(s)^⊤ P ) = γ ( P^{-1} + γ g(s) R^{-1} g(s)^⊤ )^{-1} ≻ 0,   (6)
where the second equality uses the Woodbury identity. Inserting min_{a∈ℝ^m} Φ_s(a) into the Bellman equation (1) leads to
V*(s) = s^⊤ P s + b = q(s) + min_{a∈ℝ^m} [ a^⊤ R a + γ z^⊤ P z ] + γ tr(PΣ) + γ b = q(s) + f(s)^⊤ H_γ(s) f(s) + γ tr(PΣ) + γ b.
Comparing the state-dependent parts gives
s^⊤ P s = q(s) + f(s)^⊤ H_γ(s) f(s),
which reduces the Bellman identity to the quadratic "energy" equation:
s^⊤ (P − Q) s = f(s)^⊤ H_γ(s) f(s).
In the generator, this energy equation is the design equation: it characterizes the family of drifts that make the prescribed V* optimal.
Noise offset. Comparing the constant parts gives
b = γ tr(PΣ) + γ b,
and hence additive Gaussian noise contributes only a constant shift in value, with
b = γ/(1−γ) · tr(PΣ).
Two constructive parameterizations. All drifts satisfying the energy equation admit explicit, equivalent square-root (SR) and metric-normalized (MN) parameterizations that make it easy to inject bounded nonlinear structure. For any nonzero direction field f̃(s), the square-root parameterization is given by
f(s) = sqrt( s^⊤ (P − Q) s ) · H_γ(s)^{-1/2} f̃(s) / ‖f̃(s)‖,  f̃(s) ≠ 0.
Equivalently, for any orthogonal field S(s) with S(s)^⊤ S(s) = I,
f(s) = H_γ(s)^{-1/2} S(s) (P − Q)^{1/2} s.
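A minimal sketch of the metric-normalized parameterization follows; the problem data are illustrative, `spd_sqrt` is a hypothetical helper, and S(s) is taken as a fixed planar rotation. It verifies the energy equation at random states and evaluates the noise offset:

```python
import numpy as np

def spd_sqrt(M):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(7)
n, m, gamma = 2, 1, 0.9
Q = np.eye(n)
A = rng.standard_normal((n, n)); P = A @ A.T + 2 * np.eye(n)  # ensures P - Q > 0
G = rng.standard_normal((n, m))
R = np.eye(m)

B = R + gamma * G.T @ P @ G
H = gamma * (P - gamma * P @ G @ np.linalg.solve(B, G.T @ P))  # discounted metric

# Metric-normalized drift with an orthogonal field S(s): here a fixed rotation.
def f_mn(s, theta=0.8):
    S = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return np.linalg.inv(spd_sqrt(H)) @ S @ spd_sqrt(P - Q) @ s

# Energy equation s'(P-Q)s = f(s)' H f(s) holds at any state.
for _ in range(5):
    s = rng.standard_normal(n)
    fs = f_mn(s)
    assert np.isclose(s @ (P - Q) @ s, fs @ H @ fs)

# Noise offset: b = gamma/(1-gamma) * tr(P Sigma).
Sigma = 0.01 * np.eye(n)
b = gamma / (1 - gamma) * np.trace(P @ Sigma)
print(f"energy identity verified; noise offset b = {b:.4f}")
```

Orthogonality of S is what makes the identity exact: S cancels inside f^⊤ H f, leaving s^⊤ (P − Q) s regardless of how S varies with the state.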
Our benchmarks primarily use this latter equation: orthogonality guarantees the energy equation exactly, while the field S(·) provides a knob for nonlinear coupling and stiffness.
VI Benchmark Generation, Families, and Dataset
We now describe the systematic construction of benchmark instances based on the Quadratic–Gaussian (QG) specialization developed in Section V. The goal is to generate a diverse set of stochastic control problems, each with a certified optimal policy π* and an optimal value function V*, which can be used to evaluate reinforcement learning algorithms. We first define the general problem class and the generator pipeline. We then introduce two benchmark families: (i) a nonlinear linked-arm system with state-dependent actuation and (ii) a dynamically-extended nonholonomic vehicle model. Finally, we describe how validated instances are assembled into a dataset for algorithm evaluation.
VI-A Problem definition and generation of benchmark families
Each benchmark instance is a discrete-time stochastic control problem of the form
s_{k+1} = f(s_k) + g(s_k) a_k + w_k,
with state s ∈ ℝ^n, action a ∈ ℝ^m, internal dynamics f: ℝ^n → ℝ^n, input-coupling matrix g: ℝ^n → ℝ^{n×m}, a discount factor γ ∈ (0,1), and w an independent and identically distributed (i.i.d.), time-independent Gaussian noise process with zero mean and covariance matrix Σ. The corresponding stage cost c: ℝ^n × ℝ^m → ℝ is defined by c(s,a) = q(s) + ρ(a). The functions f, g, q, and ρ satisfy Assumptions 1 and 2. The optimal policy π*: ℝ^n → ℝ^m and the optimal value function V*: ℝ^n → ℝ are analytically known from the converse-optimal construction of Section V. A benchmark family is a parametric collection of control problems:
ℱ_ψ = { M_ψ : ψ ∈ Ψ },
where each control problem M_ψ is defined by a tuple (f_ψ, g_ψ, q_ψ, ρ_ψ, Σ_ψ) that satisfies the QG conditions of Section V, and Ψ is a parameter space.
The certified ground truth (π*_ψ, V*_ψ) follows directly from the construction. We construct two distinct families, ℱ_Arm and ℱ_NVDEx, which are described below. Each family provides a range of difficulty levels through interpretable parameters; we make the physical structure explicit through governing equations. Depending on the structure of the input matrix, two equivalent QG parameterizations are convenient: a state-dependent formulation used for the Arm family, and a constant-input formulation used for the NVDEx family. In addition, the converse-optimal drift is not treated as an abstract matrix field only: its factors are chosen so that they admit a robotics interpretation. In particular, the orthogonal matrix field S(·) is implemented as a product of planar rotations, which preserves the converse-optimal energy identity exactly while producing dynamics that resemble physically meaningful transition and coupling operators in robotic systems. The generator supports multiple families; here we focus on two physically grounded, scalable constructions that stress different failure modes of deep RL. The generator (Algorithm 1) produces instances with a validation trace. Each instance undergoes three checks:
1. Symmetric positive-definiteness (SPD) of B(s):
B(s) = R + γ g(s)^⊤ P g(s) ≻ 0,  s ∈ ℝ^n.
The SPD of B(s) is required for two reasons: (i) it guarantees strict convexity of the Bellman integrand Φ_s(a), so the first-order condition yields a unique global minimizer a*; and (ii) it keeps the discounted metric H_γ(s) positive definite, which is necessary for the energy equation s^⊤(P−Q)s = f(s)^⊤ H_γ(s) f(s) to be well-posed and numerically stable. In practice, numerically verifying B(s) ≻ 0 on a sampling grid over a prescribed bounded test region of ℝ^n guarantees the mathematical consistency of the constructed benchmark instance.
2.
Bellman residual: this check verifies that the constructed (f, g, q, ρ) indeed satisfy the optimality conditions numerically. For each test point s on a grid G, we compute
δ(s) = | V*(s) − ( q(s) + ρ(a*(s)) + γ 𝔼[ V*(f(s)+g(s)a*(s)+w) ] ) |.
If max_{s∈G} δ(s) < ε, where ε is a suitable numerical tolerance threshold, the instance passes validation. This avoids implementation errors and ensures that the benchmark is correctly instantiated.
3. Boundedness: the boundedness check simulates the closed-loop system under the optimal policy π* over a fixed horizon and domain. This verifies that the numerical implementation does not introduce instabilities, that the theoretically optimal behavior translates into practice, and that the benchmark instance is usable for RL evaluation, ensuring trajectories remain well-behaved and do not diverge due to accumulated errors or hidden sensitivities. Instances that fail this test are rejected, guaranteeing that every exported fixture is both mathematically correct and practically stable. Validated instances are exported as YAML fixtures containing parameters, CRN seeds, and reference evaluators for V* and a*.
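In the QG case the residual check needs no Monte Carlo, since 𝔼[V*(z+w)] = z^⊤ P z + tr(PΣ) + b in closed form. A minimal sketch of the check (illustrative problem data, S(s) frozen to the identity, `spd_sqrt` a hypothetical helper):

```python
import numpy as np

def spd_sqrt(M):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(3)
n, m, gamma = 2, 1, 0.9
Q, R = np.eye(n), np.eye(m)
A = rng.standard_normal((n, n)); P = A @ A.T + 2 * np.eye(n)  # P - Q > 0
G = rng.standard_normal((n, m))
Sigma = 0.05 * np.eye(n)

B = R + gamma * G.T @ P @ G
H = gamma * (P - gamma * P @ G @ np.linalg.solve(B, G.T @ P))   # discounted metric
b = gamma / (1 - gamma) * np.trace(P @ Sigma)                   # noise offset
f = lambda s: np.linalg.inv(spd_sqrt(H)) @ spd_sqrt(P - Q) @ s  # MN drift, S = I
a_star = lambda s: -gamma * np.linalg.solve(B, G.T @ P @ f(s))  # certified policy
V_star = lambda s: s @ P @ s + b                                # certified value

def bellman_residual(s):
    # For quadratic V*, E[V*(z+w)] = z'Pz + tr(P Sigma) + b exactly.
    a = a_star(s)
    z = f(s) + G @ a
    expect_V = z @ P @ z + np.trace(P @ Sigma) + b
    return abs(V_star(s) - (s @ Q @ s + a @ R @ a + gamma * expect_V))

grid = [rng.uniform(-2, 2, n) for _ in range(50)]
print(max(bellman_residual(s) for s in grid) < 1e-8)  # True: instance passes
```

For a correctly constructed instance the residual is machine-precision zero at every grid point, so any visible δ(s) directly flags an implementation error, exactly as the validation step intends.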
Algorithm 1 Converse-Optimal Benchmark Generator (QG case)
1: Input: family type, parameter ranges, number of instances N, parameterization (sqrt or metric), nonlinearity knob α, seeds
2: for each hyperparameter tuple ψ do
3:   Sample parameter vector ψ_i from the family-specific distribution P_Ψ
4:   Extract (n, m, Q, R, P, γ, Σ) and input map g from ψ_i
5:   Verify R + γ g(s)^⊤ P g(s) ≻ 0 on domain D
6:   Choose direction field f̃(s) or orthogonal field S(s) (bounded via α)
7:   Construct f via the parameterizations; compute a*(s)
8:   Validate the Bellman identity on a grid; test closed-loop boundedness under a*
9:   Export fixture (YAML) with reference (a*, V*) and CRN seeds
10: end for
11: Output: fixture dataset + validator + reference solvers + plotting scripts
VI-B Serial n-link planar arm systems (ConverseArm)
The first family, ℱ_Arm, models a serial n-link planar arm with revolute joints.
VI-B1 State and action spaces
To describe the arm, we must express the configuration (joint angles) and the rates of change (angular velocities) of each joint. The natural choice is therefore to include, for each joint i:
• θ_i — the angular position (angle) of joint i;
• ω_i — the angular velocity of joint i.
Stacking these variables for all n joints gives the state vector
s = [θ_1, ω_1, θ_2, ω_2, …, θ_n, ω_n]^⊤ ∈ ℝ^{2n}.
In this ordering, each joint occupies two consecutive components: position followed by velocity. Thus, for joint i, the angle θ_i appears at position 2i−1 (odd index), and the angular velocity ω_i appears at position 2i (even index). This convention is convenient when writing the dynamics and the input-coupling matrix, as it allows us to refer to the velocity component of joint i simply as the 2i-th entry of the state vector. The control inputs are the torques applied at each joint:
a = [τ_1, τ_2, …, τ_n]^⊤ ∈ ℝ^n,
where τ_i is the torque commanded at joint i.
Positive torques act to increase the corresponding angular velocity. This choice of state and action spaces is standard for robotic manipulators: the state contains the information needed to predict future evolution (positions and velocities), while the actions represent the physical inputs available to the controller (joint torques). The dimension grows linearly with the number of joints, allowing systematic scaling to higher-dimensional problems.
VI-B2 Dynamics
The discrete-time dynamics are obtained by Euler discretization of the continuous-time equations of motion, with sampling step τ > 0. The kinematic update for each joint angle is straightforward:
θ_{i,k+1} = θ_{i,k} + τ ω_{i,k} + w^θ_{i,k},
where w^θ_{i,k} is the component of the process noise affecting the angle of joint i. The update in angular velocity is more complex because it depends on both the internal dynamics (torques arising from motion, gravity, and coupling between joints) and the applied control torques. In our converse-optimal construction, the internal dynamics are encoded in the drift f_p(s), while the control enters through the input matrix g(s). Specifically, the angular velocity update is
ω_{i,k+1} = [f_p(s_k)]_{2i} + p cos(θ_{i,k}) a_{i,k} + w^ω_{i,k},
where:
• [f_p(s_k)]_{2i} denotes the 2i-th component of the drift vector (recall the state ordering [θ_1, ω_1, …, θ_n, ω_n], so odd indices correspond to angles and even indices to angular velocities);
• p ∈ (0,1] is a control-strength continuation parameter that scales the input.
When $p = 1$, the system has full control authority; as $p \to 0$, control becomes weaker;
• $\cos(\theta_{i,k})$ captures the configuration-dependent mechanical advantage: torque applied at joint i is most effective when the joint angle is near zero (torque acts perpendicular to the link) and becomes ineffective as $\theta_i \to \pm\frac{\pi}{2}$ (torque aligns with the link direction);
• $w^\omega_{i,k}$ is the process noise affecting the angular velocity of joint i.
The complete system can be written in the compact control-affine form
$$s_{k+1} = f_p(s_k) + p\,g(s_k)\,a_k + w_k, \qquad w_k \sim \mathcal{N}(0, \Sigma),$$
where the input matrix $g(s)$ has entries
$$[g(s)]_{2i,i} = \cos\theta_i, \qquad i = 1, \ldots, n,$$
with all other entries zero. This structure ensures that each torque $a_i$ affects only its corresponding angular-velocity channel, scaled by the mechanical advantage $\cos\theta_i$, exactly as in the governing equations above.

VI-B3 Drift construction

The drift $f_p$ is not specified a priori; it is constructed using the converse-optimality framework to ensure that the prescribed value function $V^*(s) = s^\top P s + b$ is optimal. Using the MN parameterization, we set
$$f_p(s) = H_\gamma^{(p)}(s)^{-1/2}\, S_p(s)\, (P - Q)^{1/2}\, s,$$
where $H_\gamma^{(p)}(s)$ is the discounted metric with control strength p:
$$H_\gamma^{(p)}(s) = \gamma\left(P - \gamma p^2 P g(s)\left(R + \gamma p^2 g(s)^\top P g(s)\right)^{-1} g(s)^\top P\right).$$
The orthogonal field $S_p(s)$ is implemented as a product of Givens rotations:
$$S_p(s) = \left[\prod_{i=1}^{n-1} G(\omega_i, \omega_{i+1}; \beta_i(s, p))\right] \times \left[\prod_{i=1}^{n} G(\theta_i, \omega_i; \alpha_i(s, p))\right],$$
with rotation angles
$$\alpha_i(s, p) = \alpha_0\, p \tanh(\kappa_1 \theta_i \omega_i) + g_0\, p \sin\theta_i, \qquad \beta_i(s, p) = \beta_0\, p \tanh(\kappa_2 (\theta_{i+1} - \theta_i)).$$
These angles introduce physically interpretable couplings:
• The $\tanh(\kappa_1 \theta_i \omega_i)$ term creates a velocity-dependent coupling similar to Coriolis forces in robotic manipulators;
• The $\sin\theta_i$ term introduces a gravity-like bias that depends on the joint angle; $g_0$ scales the magnitude of this bias;
• The $\tanh(\kappa_2(\theta_{i+1} - \theta_i))$ term couples neighboring joints, similar to elastic or damping forces in a serial linkage.
The constants $\kappa_1$ and $\kappa_2$ are gain parameters that control the sensitivity of the coupling terms:
• $\kappa_1$ scales the argument of the hyperbolic tangent in $\alpha_i$, determining how strongly the product $\theta_i \omega_i$ (position–velocity coupling) influences the rotation angle. Larger $\kappa_1$ makes the tanh saturate more quickly, so the coupling becomes active even for small deviations from zero;
• $\kappa_2$ scales the argument of the hyperbolic tangent in $\beta_i$, controlling the sensitivity of the neighbor coupling to the angle difference $\theta_{i+1} - \theta_i$. Higher $\kappa_2$ makes the coupling more responsive to small misalignments between adjacent joints.
In both cases, the hyperbolic tangent keeps the coupling bounded: $|\tanh(\cdot)| < 1$, so the rotation angles $\alpha_i$ and $\beta_i$ are confined to intervals determined by $\alpha_0 p$ and $\beta_0 p$, respectively. This boundedness guarantees that the orthogonal field $S_p(s)$ remains well-defined and numerically stable across the entire state space. Crucially, because each G is an orthogonal rotation matrix, the product $S_p(s)$ remains orthogonal ($S_p^\top S_p = I$). This orthogonality guarantees that the energy equation
$$s^\top (P - Q)\, s = f_p(s)^\top H_\gamma^{(p)}(s)\, f_p(s)$$
holds exactly, preserving the converse-optimal construction while injecting nonlinear, physically motivated coupling.
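The orthogonality argument can be checked numerically. The sketch below uses a simplified stand-in for $S_p(s)$ (only the angle–velocity rotations, with toy gains) and verifies that a product of tanh-bounded Givens rotations is orthogonal and that the energy identity holds for an arbitrary SPD metric:

```python
import numpy as np

def givens(n, i, j, angle):
    """n x n Givens rotation acting in the (i, j) coordinate plane."""
    G = np.eye(n)
    c, s = np.cos(angle), np.sin(angle)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def sym_inv_sqrt(M):
    """Symmetric inverse square root of an SPD matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def S_field(s, a0=0.3, k1=1.0):
    """Toy stand-in for S_p(s): tanh-bounded Givens rotations pairing
    each joint angle with its own angular velocity."""
    n = len(s)
    S = np.eye(n)
    for i in range(n // 2):
        angle = a0 * np.tanh(k1 * s[2 * i] * s[2 * i + 1])
        S = S @ givens(n, 2 * i, 2 * i + 1, angle)
    return S

# Energy identity: with f = H^{-1/2} S (P - Q)^{1/2} s and S orthogonal,
# s^T (P - Q) s equals f^T H f for any SPD metric H.
rng = np.random.default_rng(0)
s = rng.normal(size=4)
S = S_field(s)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)                  # arbitrary SPD metric
P, Q = 2.0 * np.eye(4), 0.5 * np.eye(4)        # here (P - Q)^{1/2} = sqrt(1.5) I
f = sym_inv_sqrt(H) @ S @ (np.sqrt(1.5) * s)
lhs = s @ (P - Q) @ s
rhs = f @ H @ f
```

Because the orthogonal factor cancels in $f^\top H f$, the identity holds for any choice of SPD metric, which is exactly why the construction tolerates arbitrary bounded rotations.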
VI-B4 Optimal policy and value

With the drift constructed as above, the optimal action derived from the QG specialization is
$$a_p^* = -\gamma p \left(R + \gamma p^2 g(s)^\top P g(s)\right)^{-1} g(s)^\top P f_p(s).$$
The parameter $p \in (0, 1]$ scales $g \mapsto p g$, yielding a homotopy family that preserves the analytic structure of the optimal action $a_p^*$. The optimal policy provides the control authority that, together with the drift $f_p(s)$, steers the system optimally according to $V^*$.

VI-B5 Difficulty knobs

The family is parameterized by $\psi = (n, p, Q, R, P, \Sigma, \alpha_0, \beta_0, \kappa_1, \kappa_2, g_0)$. To facilitate ordering benchmark instances by expected difficulty, we define a scalar heuristic index
$$D = \lambda_1 n + \lambda_2 p^{-2} + \lambda_3 \operatorname{tr}(\Sigma) + \lambda_4 \alpha_0 + \lambda_5 \kappa(P),$$
with fixed weights $\lambda_i > 0$. This index combines:
• n: dimensionality (more joints means more complex dynamics);
• $p^{-2}$: penalizes weak control authority (small p), which makes stabilization harder;
• $\operatorname{tr}(\Sigma)$: noise power;
• $\alpha_0$: nonlinearity gain of $S_p$ (larger $\alpha_0$ introduces stronger coupling);
• $\kappa(P)$: condition number of P, a proxy for metric ill-conditioning.
Although D is heuristic, it provides a useful relative ordering of instances for ablation studies and difficulty sweeps. While the ConverseArm family emphasizes serial-chain dynamics with state-dependent actuation, the next family takes a different structural form: the NVDEx system features constant input matrices and includes a controlled open-loop instability mechanism, offering complementary challenges to reinforcement learning algorithms.

Figure 2: Converse-optimal benchmark pipeline: sampling, drift construction, validation, fixture export, paired evaluation, and reporting.

VI-C Nonholonomic Vehicle with Dynamic Extension (NVDEx)

The second family, $\mathcal{F}_{\mathrm{NVDEx}}$, models a unicycle/differential-drive vehicle module with dynamic extension, allowing unstable open-loop configurations.
VI-C1 State and action spaces

A standard unicycle model has states $(x, y, \phi)$ representing position and heading, with controls for the linear velocity v (magnitude of velocity in the direction of heading) and the angular velocity $\omega$ (rate of change of heading). However, such a model is too simple for challenging benchmarks: control inputs directly set velocities, leaving little room for complex dynamics. We therefore introduce a dynamic extension: instead of commanding velocities directly, we command their derivatives (accelerations), and the velocities themselves become part of the state. This creates a richer dynamical structure:
• The system has inertia: velocities change only through applied accelerations;
• Multiple modules K can be coupled, increasing the dimensionality.
For $j > 0$, we denote by $v_i^{(j)}$ and $\omega_i^{(j)}$ the internal states in the linear-velocity and angular-velocity extension chains, respectively. They represent "delayed" or "filtered" versions of the velocity, creating inertia-like effects. The integrator depths $r_v$ and $r_\omega$ control how many steps of "memory" lie between the acceleration commands and the actual velocity changes; that is, how many cascaded integrators separate the control input from the actual physical velocity. For $r_v = 1$, the chain contains only the linear velocity $v_i^{(0)}$, which is directly affected by the control; the response is immediate, with low inertia. For $r_v = 2$, the chain contains $v_i^{(0)}$ and $v_i^{(1)}$; control affects $v_i^{(1)}$, which then integrates to influence $v_i^{(0)}$, creating a smoother and more realistic response with inertia. For $r_v = 3$, the chain has three levels, further increasing the smoothing between the command and the actual velocity.
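The delayed response created by integrator depth can be illustrated with a toy chain (noise omitted; function names are hypothetical): with depth r, the physical velocity first moves only after r discrete steps.

```python
import numpy as np

def step_chain(v, u, tau=0.05):
    """One Euler step of an r-level integrator chain (noise omitted).
    v[0] is the physical velocity; the command u enters only at v[-1]."""
    v = v.copy()
    v[:-1] += tau * v[1:]   # each level integrates the level above it
    v[-1] += tau * u        # acceleration command at the top level
    return v

def steps_until_motion(r, u=1.0, tau=0.05, tol=1e-12):
    """Number of steps before the physical velocity v[0] first responds."""
    v = np.zeros(r)
    for k in range(1, 100):
        v = step_chain(v, u, tau)
        if abs(v[0]) > tol:
            return k
    return None
```

Deeper chains delay and smooth the command's effect, which is exactly why $r_v$ and $r_\omega$ act as difficulty knobs.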
Higher integrator depth makes the system more challenging to control: the effect of control actions is delayed and filtered, the state-space dimension increases, and the dynamics become more complex, requiring the controller to anticipate future effects of current actions. Thus, $r_v$ and $r_\omega$ serve as difficulty knobs that increase dynamical complexity while preserving the certified optimal policy structure. For a system with $K \geq 1$ modules and integrator depths $r_v, r_\omega \in \mathbb{N}$, the state vector of module i contains:
• Position $(x_i, y_i)$ and heading $\phi_i$ (the core unicycle states);
• A chain of $r_v$ velocity variables $v_i^{(0)}, \ldots, v_i^{(r_v - 1)}$, where $v_i^{(0)}$ is the actual linear velocity and the higher indices represent intermediate states in the extension;
• A chain of $r_\omega$ angular-velocity variables $\omega_i^{(0)}, \ldots, \omega_i^{(r_\omega - 1)}$, similarly structured.
Thus, the state $s_i \in \mathbb{R}^{3 + r_v + r_\omega}$ of module i is
$$s_i = [x_i, y_i, \phi_i, v_i^{(0)}, \ldots, v_i^{(r_v - 1)}, \omega_i^{(0)}, \ldots, \omega_i^{(r_\omega - 1)}]^\top.$$
The action for module i consists of the accelerations commanded to the velocity chains:
$$a_i = [u_{v,i}, u_{\omega,i}]^\top \in \mathbb{R}^2,$$
where $u_{v,i}$ accelerates the linear-velocity chain and $u_{\omega,i}$ accelerates the angular-velocity chain. For multi-module systems ($K > 1$), states and actions are simply stacked:
$$s = [s_1^\top, s_2^\top, \ldots, s_K^\top]^\top \in \mathbb{R}^{(3 + r_v + r_\omega)K}, \qquad a = [a_1^\top, a_2^\top, \ldots, a_K^\top]^\top \in \mathbb{R}^{2K}.$$
Analogously to the ConverseArm setup, this choice of state and action spaces reflects the structure of a unicycle with dynamic extension: the module's configuration consists of its position and heading; the velocities determine its motion through the kinematic equations; and inertia arises from the integrators placed between acceleration commands and actual velocities.
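A small sketch of this stacked layout (helper names hypothetical) makes the dimension counting concrete:

```python
import numpy as np

# Each module contributes 3 + r_v + r_w states and 2 actions; the global
# state is the plain concatenation over K modules, as described above.

def module_state(x, y, phi, v_chain, w_chain):
    """State of one module: [x, y, phi, v^(0..r_v-1), w^(0..r_w-1)]."""
    return np.concatenate(([x, y, phi], v_chain, w_chain))

def stack_modules(module_states):
    """Global state for K modules is the concatenation of module states."""
    return np.concatenate(module_states)

rv, rw, K = 2, 3, 4
s_i = module_state(0.0, 1.0, 0.5, np.zeros(rv), np.zeros(rw))
s = stack_modules([s_i] * K)
```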
Dimension scales linearly with the number of modules K, allowing systematic progression from single-module to multi-agent benchmarks while maintaining physical interpretability and certified optimality.

VI-C2 Dynamics

With sampling step $\tau > 0$, the discrete-time dynamics reflect the physical relationships between these variables. For each module i, the position and heading evolve according to the actual linear and angular velocities, and are obtained by Euler discretization of the continuous-time kinematic equations:
$$x_{i,k+1} = x_{i,k} + \tau\, v_{i,k}^{(0)} \cos\phi_{i,k} + w^x_{i,k},$$
$$y_{i,k+1} = y_{i,k} + \tau\, v_{i,k}^{(0)} \sin\phi_{i,k} + w^y_{i,k},$$
$$\phi_{i,k+1} = \phi_{i,k} + \tau\, \omega_{i,k}^{(0)} + w^\phi_{i,k}.$$
The velocity chains act as a cascade of integrators: each level feeds the one below it. For $j = 0, \ldots, r_v - 2$:
$$v_{i,k+1}^{(j)} = v_{i,k}^{(j)} + \tau\, v_{i,k}^{(j+1)} + w^{v,j}_{i,k},$$
and similarly for $j = 0, \ldots, r_\omega - 2$:
$$\omega_{i,k+1}^{(j)} = \omega_{i,k}^{(j)} + \tau\, \omega_{i,k}^{(j+1)} + w^{\omega,j}_{i,k}.$$
At the highest level of each velocity chain, the control input directly influences the dynamics; this is where the commanded acceleration enters the system. For the linear-velocity chain, the top level is $v_i^{(r_v - 1)}$, with update
$$v_{i,k+1}^{(r_v - 1)} = [f(s_k)]_{v,i} + \tau\, u_{v,i,k} + w^{v, r_v - 1}_{i,k},$$
where:
• $\tau\, u_{v,i,k}$ is the direct effect of the control input: the sampling step $\tau$ scales the acceleration command $u_{v,i,k}$ into a change in this top-level state over one discrete time step;
• $[f(s_k)]_{v,i}$ comes from the converse-optimal drift construction.
It represents the contribution of the system's internal dynamics, i.e., how the top-level state would evolve even in the absence of control;
• $w^{v, r_v - 1}_{i,k}$ is the process noise affecting this state, drawn from a Gaussian distribution with covariance $\Sigma$.
An analogous equation holds for the angular-velocity chain, with $u_{\omega,i,k}$ the commanded angular acceleration:
$$\omega_{i,k+1}^{(r_\omega - 1)} = [f(s_k)]_{\omega,i} + \tau\, u_{\omega,i,k} + w^{\omega, r_\omega - 1}_{i,k}.$$
After stacking all modules, these equations collapse into the compact control-affine form used throughout the paper:
$$s_{k+1} = f_p(s_k) + p\, G a_k + w_k, \qquad w_k \sim \mathcal{N}(0, \Sigma),$$
with control-strength continuation $p \in (0, 1]$.

VI-C3 Input coupling

Each control input $a_i = [u_{v,i}, u_{\omega,i}]^\top$ influences exactly one state variable (the top of its respective chain), and it does so linearly with coefficient $\tau$. This local, linear influence is encoded in a matrix G that maps the global action vector $a = [a_1^\top, \ldots, a_K^\top]^\top$ to the correct components of the state update: G selects which states receive control inputs and places $\tau$ in those positions. This gives G a block-diagonal structure:
$$G = \operatorname{blkdiag}(\underbrace{G_{\mathrm{mod}}, G_{\mathrm{mod}}, \ldots, G_{\mathrm{mod}}}_{K \text{ times}}) \in \mathbb{R}^{n \times 2K}.$$
The per-module matrix $G_{\mathrm{mod}}$ encodes how the accelerations $u_{v,i}$ and $u_{\omega,i}$ enter the dynamics. These inputs should affect only the top-level states $v_i^{(r_v - 1)}$ and $\omega_i^{(r_\omega - 1)}$, respectively. Therefore, $G_{\mathrm{mod}}$ has $3 + r_v + r_\omega$ rows and 2 columns, where:
• All rows corresponding to $x_i, y_i, \phi_i$ and lower-level velocity states are zero; controls do not directly affect these;
• The row corresponding to $v_i^{(r_v - 1)}$ (the top of the linear-velocity chain) has a 1 in the first column (for $u_{v,i}$) and 0 in the second.
• The row corresponding to $\omega_i^{(r_\omega - 1)}$ (the top of the angular-velocity chain) has a 0 in the first column and a 1 in the second (for $u_{\omega,i}$).
Explicitly, with the sampling step $\tau$ factored out:
$$G_{\mathrm{mod}} = \tau \begin{bmatrix} 0_{(3 + r_v - 1) \times 2} \\ (1,\ 0) \\ 0_{(r_\omega - 1) \times 2} \\ (0,\ 1) \end{bmatrix} \in \mathbb{R}^{(3 + r_v + r_\omega) \times 2}.$$

VI-C4 Drift construction

Following the converse-optimal framework, the drift $f_p(s)$ is constructed using the metric-normalized parameterization
$$f_p(s) = (H_\gamma^{(p)})^{-1/2}\, S_\nu(s)\, (P - Q)^{1/2}\, s,$$
where the discounted metric with control strength p is
$$H_\gamma^{(p)} = \gamma\left(P^{-1} + \gamma p^2 G R^{-1} G^\top\right)^{-1} \succ 0.$$
Note that, unlike in the ConverseArm family, $H_\gamma^{(p)}$ is state-independent because G is constant. Here, $S_\nu(s)$ is an orthogonal field that introduces nonlinear coupling while preserving the energy identity. It is constructed as a product of bounded rotations:
• A rotation in the $(x_i, y_i)$ plane by the local heading $\phi_i$;
• A coupling between $v_i^{(0)}$ and $\omega_i^{(0)}$ via the angle $\nu_i(s) = \alpha \tanh(\kappa\, v_i^{(0)} \omega_i^{(0)})$, which creates an interaction between linear and angular velocities;
• Optional cross-couplings (e.g., between $(x_i, v_i^{(0)})$ and $(y_i, \omega_i^{(0)})$) can be added to increase nonlinearity.
The hyperbolic tangent ensures boundedness: $|\nu_i(s)| < \alpha$, so the rotations remain within controlled limits. Orthogonality ($S_\nu^\top S_\nu = I$) guarantees that the energy equation $s^\top (P - Q)\, s = f_p(s)^\top H_\gamma^{(p)} f_p(s)$ holds exactly.

VI-C5 Controlled open-loop stability

A distinguishing feature of the NVDEx family is the ability to create systems that are locally open-loop unstable yet globally stabilizable by the optimal policy; this tests an algorithm's ability to handle unstable dynamics. It is achieved by tuning the linearization at the origin.
Let $A_0 = H_\gamma^{-1/2} (P - Q)^{1/2}$ denote the linearization matrix at $s = 0$ (since $S_\nu(0) = I$), so that the drift simplifies to
$$f_p(s) \approx H_\gamma^{-1/2} (P - Q)^{1/2} s =: A_0 s.$$
The eigenvalues of $A_0$ determine local stability. Choosing $Q = \beta P$ with $\beta \in [0, 1)$ gives
$$(P - Q)^{1/2} = (P - \beta P)^{1/2} = ((1 - \beta) P)^{1/2} = \sqrt{1 - \beta}\, P^{1/2}.$$
Substituting this into the expression for $A_0$:
$$A_0 = H_\gamma^{-1/2} (P - Q)^{1/2} = \sqrt{1 - \beta}\, \left(H_\gamma^{-1/2} P^{1/2}\right).$$
The spectral radius $\varrho(A_0)$ scales with this factor:
$$\varrho(A_0) = \sqrt{1 - \beta}\; \varrho\!\left(H_\gamma^{-1/2} P^{1/2}\right).$$
Let $\varrho_{\mathrm{base}} := \varrho(H_\gamma^{-1/2} P^{1/2})$; this is a fixed number determined by $H_\gamma$ and P. We tune $\beta$ to reach the desired instability corresponding to a target spectral radius $\varrho_{\mathrm{target}} > 1$:
$$\varrho(A_0) = \sqrt{1 - \beta}\; \varrho_{\mathrm{base}} = \varrho_{\mathrm{target}}.$$
Solving for $\beta$ yields
$$\beta = 1 - \left(\frac{\varrho_{\mathrm{target}}}{\varrho_{\mathrm{base}}}\right)^2.$$
If $\varrho_{\mathrm{base}} \leq \varrho_{\mathrm{target}}$, then $\beta \leq 0$, which is not allowed ($\beta$ must lie in $[0, 1)$). In this case, we first need to increase $\varrho_{\mathrm{base}}$. Since $\varrho_{\mathrm{base}} = \varrho(H_\gamma^{-1/2} P^{1/2})$ depends on $H_\gamma$, which in turn depends on the control cost R, we select a cheaper control by scaling R down by a factor smaller than 1. This shrinks $H_\gamma$, which inflates $H_\gamma^{-1/2}$ and thus increases $\varrho_{\mathrm{base}}$. We scale R down until $\varrho_{\mathrm{base}} > \varrho_{\mathrm{target}}$, then compute $\beta$ as above. Thus, choosing $\varrho_{\mathrm{target}} > 1$ guarantees local open-loop instability, while the optimal policy $\pi_p^*$ stabilizes the system exactly to attain the optimal value $V^*$.

VI-C6 Optimal policy and value

With the drift constructed as above, the optimal action is
$$a_p^* = -\gamma p \left(R + \gamma p^2 G^\top P G\right)^{-1} G^\top P f_p(s).$$
As before, the additive Gaussian noise contributes only a constant offset to the value function, leaving the optimal action unchanged.
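The $\beta$-tuning recipe can be sketched numerically. In this toy sketch, H is an arbitrary SPD stand-in for $H_\gamma$, and shrinking H directly substitutes for the "scale R down" step (which shrinks $H_\gamma$ through the metric formula):

```python
import numpy as np

def rho(M):
    """Spectral radius of a square matrix."""
    return float(max(abs(np.linalg.eigvals(M))))

def sym_inv_sqrt(M):
    """Symmetric inverse square root of an SPD matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

rng = np.random.default_rng(1)
n = 4
P = np.diag(rng.uniform(1.0, 3.0, n))   # SPD value-function weight (diagonal toy)
A = rng.normal(size=(n, n))
H = A @ A.T + 0.1 * np.eye(n)           # SPD stand-in for H_gamma

rho_target = 1.2
P_sqrt = np.sqrt(P)                     # elementwise sqrt is valid for diagonal P
# If rho_base <= rho_target, make control "cheaper"; here we shrink the
# metric directly as a stand-in for scaling R down.
while rho(sym_inv_sqrt(H) @ P_sqrt) <= rho_target:
    H *= 0.5
rho_base = rho(sym_inv_sqrt(H) @ P_sqrt)
beta = 1.0 - (rho_target / rho_base) ** 2
A0 = np.sqrt(1.0 - beta) * sym_inv_sqrt(H) @ P_sqrt
```

Since $\varrho(cA) = c\,\varrho(A)$ for $c > 0$, the resulting $A_0$ hits the target spectral radius exactly, guaranteeing the intended local open-loop instability.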
VI-C7 Difficulty knobs

The family is parameterized by $\psi = (K, r_v, r_\omega, p, Q, R, P, \Sigma, \alpha, \kappa, \varrho_{\mathrm{target}})$. Key difficulty dimensions include:
• K: number of modules (scales the state dimension);
• $r_v, r_\omega$: integrator depths (increase dynamical complexity);
• p: control authority (smaller p makes stabilization harder);
• $\alpha, \kappa$: nonlinearity gains in the coupling $\nu_i(s)$;
• $\varrho_{\mathrm{target}}$: desired open-loop instability margin.
By adjusting these parameters, we can generate benchmark instances ranging from nearly linear, easily controllable systems to highly nonlinear, unstable, high-dimensional challenges, all while preserving the certified optimal action $a_p^*$ and value function $V^*(s)$.

VI-D Dataset construction

From each family, we generate a dataset of validated instances by sampling parameters from specified distributions:
$$\mathcal{D}_N^{(\mathrm{Arm})} = \left\{\psi_i : \psi_i \overset{\text{i.i.d.}}{\sim} P_{\Psi_{\mathrm{Arm}}},\ i = 1, \ldots, N_{\mathrm{Arm}}\right\}, \tag{7}$$
and similarly for $\mathcal{D}_N^{(\mathrm{NVDEx})}$. The full benchmark dataset is the union
$$\mathcal{D} = \mathcal{D}_{N_{\mathrm{Arm}}}^{(\mathrm{Arm})} \cup \mathcal{D}_{N_{\mathrm{NVDEx}}}^{(\mathrm{NVDEx})}. \tag{8}$$
Each instance includes:
• The full parameter tuple $(Q, R, P, \gamma, \Sigma)$ and family metadata;
• Deterministic seeds for initial-state schedules and noise streams;
• Reference evaluators for $V^*$ and $a^*$;
• Validation logs (SPD checks, Bellman residuals, boundedness tests).
The dataset supports paired evaluation under common random numbers (CRN) and enables ablations that separate algorithmic effects from protocol noise.

VII Experimental Setup and Paired Evaluation

We evaluate PPO, A2C, SAC, TD3, and DDPG (Stable-Baselines3) under identical action bounds and matched environment step budgets. On-policy methods run on CPU; off-policy methods use GPU when available. The reward is $r(s, a) = -c(s, a)$.
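A minimal sketch of the oracle-referenced metrics used in the paired evaluation, i.e., the normalized optimality gap and discounted regret defined below (function names are hypothetical, not from the released code):

```python
import numpy as np

def optimality_gap(v_pi, v_star, eps=1e-8):
    """Normalized optimality gap (V_r^pi - V_r^*) / (|V_r^*| + eps)
    for a single start state; eps avoids division by zero."""
    return (v_pi - v_star) / (abs(v_star) + eps)

def discounted_regret(r_star, r_pi, gamma):
    """Discounted regret sum_k gamma^k (r_k^* - r_k^pi) along one CRN
    trajectory, given per-step rewards of the oracle and the policy."""
    k = np.arange(len(r_star))
    return float(np.sum(gamma ** k * (np.asarray(r_star) - np.asarray(r_pi))))
```

Because both quantities are referenced to the analytic optimum, they are absolute measures: a gap of zero means the learned policy matched the oracle on that fixture, not merely that it beat the other baselines.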
For each fixture, algorithms share identical initial-state schedules (ISS) and noise streams (CRN), and evaluation uses the analytic $(a^*, V_r^*)$, where $V_r^*$ is the optimal reward-to-go (analogous to the cost-to-go $V^*$). For a start state $s_0$, the normalized optimality gap is
$$\mathrm{OptGap}(\pi; s_0) = \frac{V_r^\pi(s_0) - V_r^*(s_0)}{|V_r^*(s_0)| + \varepsilon}, \tag{9}$$
with $\varepsilon$ a small constant to avoid division by zero. Discounted regret along the same CRN trajectory is
$$\mathrm{Regret}_T(\pi; s_0) = \sum_{k=0}^{T-1} \gamma^k \left(r(s_k, a_k^*) - r(s_k, \pi(s_k))\right). \tag{10}$$
We report means and 95% confidence intervals via paired bootstrap over CRN trials.

VIII Results

We report three regimes: serial_random (scheduled initial states), serial_fixed (repeated $s_0$), and NVDEx (dynamic extension, longer horizon). These regimes vary occupancy measures and long-horizon conditioning while preserving the oracle $(a^*, V^*)$. As a result, protocol effects (coverage, distribution shift, estimator variance) and algorithmic effects (approximation error, exploration, bootstrapping bias) can be separated and interpreted in absolute units.

TABLE I: Overall performance per algorithm and experiment. Metrics are means and 95% CIs via paired bootstrap under CRN.
Experiment    | Algorithm | OptGap (mean) | OptGap 95% CI          | Regret (mean) | Regret 95% CI
serial_random | A2C       | -2.36e+03     | [-2.79e+03, -1.93e+03] | 8.24e+03      | [6.83e+03, 9.58e+03]
serial_random | DDPG      | -2.29e+03     | [-3.83e+03, -9.12e+02] | 9.28e+03      | [3.50e+03, 1.63e+04]
serial_random | PPO       | -3.32e+02     | [-8.45e+02, -4.92e+01] | 1.51e+03      | [1.92e+02, 3.93e+03]
serial_random | SAC       | -1.65e+02     | [-3.47e+02, -4.63e+01] | 6.90e+02      | [1.72e+02, 1.53e+03]
serial_random | TD3       | -5.85e+01     | [-8.64e+01, -3.57e+01] | 2.14e+02      | [1.25e+02, 3.35e+02]
serial_fixed  | A2C       | -1.11e+03     | [-2.15e+03, -4.58e+02] | 5.78e+04      | [3.12e+04, 9.23e+04]
serial_fixed  | DDPG      | -6.48e+02     | [-1.34e+03, -2.51e+02] | 3.06e+04      | [1.61e+04, 5.22e+04]
serial_fixed  | PPO       | -6.46e+01     | [-9.56e+01, -3.68e+01] | 4.08e+03      | [2.30e+03, 6.14e+03]
serial_fixed  | SAC       | -1.51e+03     | [-2.44e+03, -8.00e+02] | 9.03e+04      | [5.18e+04, 1.33e+05]
serial_fixed  | TD3       | -1.53e+03     | [-2.69e+03, -7.61e+02] | 8.31e+04      | [5.05e+04, 1.23e+05]
NVDEx         | A2C       | -5.60e+03     | [-1.33e+04, -1.06e+03] | 5.94e+02      | [1.34e+02, 1.33e+03]
NVDEx         | DDPG      | -1.46e+04     | [-3.13e+04, -3.37e+03] | 3.22e+03      | [4.76e+02, 7.39e+03]
NVDEx         | PPO       | -1.58e+01     | [-3.07e+01, -4.57e+00] | 2.97e+00      | [7.01e-01, 5.82e+00]
NVDEx         | SAC       | -7.76e+03     | [-1.56e+04, -1.15e+03] | 1.56e+03      | [1.13e+02, 3.13e+03]
NVDEx         | TD3       | -1.22e+04     | [-2.24e+04, -4.20e+03] | 2.78e+03      | [6.96e+02, 5.67e+03]

Figure 3: Serial n-link planar arm (ConverseArm): absolute oracle-referenced performance across algorithms (rows) and control authority p (columns), with separate panels for n. (a) Optimality gap. (b) Regret ($\log_{10}$ scale).

Initialization protocol effects. The performance divergence between the serial_random and serial_fixed configurations underscores the critical role of state-distribution coverage in algorithmic efficacy. Randomized initial conditions facilitate diverse transition sampling in the TD3 and SAC experience replay buffers, enhancing target-estimate stability and data efficiency.
However, in highly nonlinear and unstable dynamical systems, randomizing initial states may introduce substantial challenges, as algorithms must contend with widely varying system behaviors and potential instability regions that complicate policy learning. Conversely, fixed initial states produce narrow occupancy distributions in which PPO's constrained policy updates mitigate distributional shift, conferring an advantage in stationary environments by focusing learning on a consistent operational regime.

Training budget considerations. Disparate resource allocation across algorithms in the NVDEx domain warrants careful interpretation. PPO and A2C received substantially larger environment-interaction budgets, while off-policy methods incorporated initial warm-up periods (learning_starts) that reduced their effective update windows. Although this allocation partially explains PPO's dominance, the magnitude of the performance differentials suggests contributing factors beyond step-count disparities.

Figure 4: Compact result summaries. (a) Macro-averaged optimality gaps and (b) macro-averaged regrets of all algorithms across all experiments (serial n-link planar arm and NVDEx).

Figure 5: Discounted reward bars ($r = -c$) with 95% CIs under CRN. Colors correspond to $p \in \{0.5, 0.6, 0.8\}$.

Dynamics and temporal-horizon interactions. The NVDEx environment employs dynamic extension with extended temporal horizons ($T = 700$), generating stiff, feedback-sensitive closed-loop dynamics, particularly at elevated (p, $\alpha$, $\kappa$) parameter values. These conditions amplify bootstrap estimation errors and critic approximation variance. PPO's on-policy return estimates, regularized through clipping, suggest superior robustness in such challenging regimes.
This pattern recurs in the serial benchmark suite ($T = 512$ with state-dependent actuation), where off-policy methods excel under diverse state distributions (serial_random) while PPO prevails in narrow, stationary occupancies (serial_fixed).

Synthesis. The empirical evidence suggests that PPO is preferable for environments with stationary state distributions (serial_fixed) and for complex dynamical systems with extended horizons (NVDEx), while remaining competitive under diverse initial conditions (serial_random). TD3 and SAC demonstrate particular strength in scenarios with rich replay-buffer coverage. A2C and DDPG underperform consistently across experimental conditions. These performance relationships emerge from complex interactions among initialization protocols, effective training budgets, and environmental characteristics such as temporal horizon and dynamic stiffness.

Figures 3, 4, and 5 summarize performance. Heatmaps show how the optimality gap and regret vary with control authority p and dimension n for ConverseArm; bar graphs aggregate across regimes, emphasizing broad trends. Darker shades correspond to higher regret (lower performance). As n and p increase, the dynamics become more sensitive to control and approximation errors, making it harder for learned policies to match $\pi^*$. This visualization efficiently surfaces emergent patterns: off-policy methods (TD3, SAC) dominate at high p values and larger dimensions, while DDPG generally struggles due to ineffective exploration, instability in value estimation, and sensitivity to reward scaling. PPO, in contrast, performs strongly at moderate dimensions ($n \in \{7, 9\}$) and control authority ($p \leq 0.6$), benefiting from the robustness of its policy updates, but struggles at high authority and dimensionality due to its constrained trust region and sample inefficiency. As p increases (see Figs.
3 and 5), the consistent growth of regret across algorithms may be explained by three related effects:
• Accelerated oracle improvement: the optimal policy $\pi^*$ becomes increasingly effective at dissipating energy in the P-norm as control authority grows. While the optimal return $R^*$ improves rapidly with increasing p, learned policies' returns $R^\pi$ fail to improve at the same rate due to optimization and estimation limitations, widening the regret. This paradox arises because, while stronger control makes the optimal control problem easier, it simultaneously amplifies sensitivity to policy imperfections, destabilizes learning dynamics, and increases function-approximation complexity, making the learning problem significantly harder.
• Amplified sensitivity to imperfections: the Jacobian of the closed-loop dynamics with respect to policy actions scales with p. Consequently, even minor policy errors, exploration noise, or value-function estimation inaccuracies are dramatically amplified in the next-state distribution and in the state-cost term $s^\top Q s$. High control authority thus penalizes modest misestimation and increases the variance of bootstrapping targets.
• Increased approximation complexity: both the discounted metric $H_\gamma^{(p)}(s)$ and the drift dynamics $f_p(s)$ depend nonlinearly on p. The state-dependent actuation $g(s) = \cos\theta$ creates regions of reduced control effectiveness, while the system becomes increasingly stiff and reactive at higher p values. These factors collectively destabilize gradients over long horizons ($T = 512$), cause non-stationary replay distributions, and increase noise in GAE/TD targets, thereby slowing policy improvement.
IX Discussion and Limitations

The central advantage of our benchmarks is calibration: knowing $(a^*, V^*)$ disentangles algorithmic issues (exploration, approximation, bootstrapping) from protocol effects (initial states, noise, termination), which is impossible on standard suites. The continuation parameter p varies control authority smoothly while preserving optimality, revealing when improved oracle performance coincides with increased learning sensitivity. Bounded nonlinear couplings via $S(\cdot)$ induce stiffness and state-dependent conditioning without sacrificing ground truth. However, our assumptions impose limitations. The control-affine structure excludes general control dependencies. Additive i.i.d. Gaussian noise misses heavy-tailed disturbances, heteroscedasticity, and temporal correlations. The QG specialization imposes quadratic costs (excluding non-quadratic objectives and constraints) and linear optimal policies (excluding nonlinear controllers), with quadratic value functions restricting the admissible drifts via the energy equation. These choices were deliberate, as they enable tractable construction of benchmarks with certified optima, but strong performance on our families does not guarantee robustness when these assumptions break. Extensions to broader noise, control, and cost models remain future work.

X Conclusion

We presented a framework that extends converse optimality to discounted stochastic control-affine systems with additive Gaussian noise and used the resulting construction to generate RL benchmarks with analytically known optimal policies and value functions. The same construction can also be used to benchmark classical control approaches. It enables reproducible, ground-truth evaluation via optimality gaps and regret under matched stochasticity and common random numbers. We release validated fixtures and code to support standardized comparisons.
Artifacts and Reproducibility

All fixtures, training/evaluation scripts, plotting code, and a dataset of other benchmarking systems are available at: github.com/converseoptimality/RL-Benchmarking.