Paper deep dive
Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Angad Singh Ahuja
Abstract
Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter.
Tags
Links
- Source: https://arxiv.org/abs/2603.07313v2
- Canonical: https://arxiv.org/abs/2603.07313v2
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/13/2026, 12:32:17 AM
Summary
This paper introduces 'adversarial latent-initial-state POMDPs', a framework for robust reinforcement learning where an adversary selects a hidden initial latent state before an episode. The authors prove a latent minimax principle, derive best-response inequalities with finite-sample concentration bounds, and demonstrate empirically on a Battleship benchmark that targeted exposure to shifted latent distributions significantly reduces robustness gaps.
Entities (4)
Relation Signals (3)
Latent minimax principle → governs → Adversarial latent-initial-state POMDP
confidence 95% · Theoretically, we prove a latent minimax principle... for adversarial latent-initial-state POMDPs.
Adversarial latent-initial-state POMDP → uses → Battleship
confidence 95% · Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps.
PPO → optimizes → Adversarial latent-initial-state POMDP
confidence 90% · On the optimization side, we use PPO together with action masking.
Cypher Suggestions (2)
Find all benchmarks used by the framework · confidence 90% · unvalidated
MATCH (f:Framework)-[:USES]->(b:Benchmark) RETURN f.name, b.name
Identify algorithms applied to frameworks · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:OPTIMIZES]->(f:Framework) RETURN a.name, f.name
Full Text
71,069 characters extracted from source content.
Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Angad Singh Ahuja, Constrained Image-Synthesis Lab
Abstract
Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter. The code repository for this work is available at https://github.com/AngadSingh22/adversarial_pomdp.
1 Motivation and Claim
Many partially observable control problems are dominated not by stepwise stochasticity, but by a hidden condition selected before interaction begins. A diagnosis system may face an unknown fault configuration. A robotics policy may confront an unobserved physical parameter regime. A constrained graphics system may act under a fixed but hidden process condition such as dot gain, substrate behavior, or capture distortion.
In each case, the agent interacts sequentially, but the principal source of uncertainty is a latent variable drawn at the start of the episode and then held fixed. This paper studies that setting through a deliberately restricted adversarial lens. The adversary does not modify transitions online and does not inject arbitrary trajectory-level noise. It acts only once, at time 0, by choosing the hidden initial latent state or a distribution over such latent states. We refer to this problem class as adversarial latent-initial-state POMDPs. Battleship is a particularly clean benchmark for this class. The hidden ship layout is exactly the latent variable. Conditional on the layout, transitions and observations are deterministic. The action space is discrete and constrained in a transparent way by already-fired cells. Strong scripted baselines are available. Most importantly, the hidden placement distribution can be varied in controlled ways to induce latent distribution shift. The central claim of the paper has two parts. The first is theoretical. Adversarial latent-initial-state POMDPs admit a compact theorem package that is both exact and practically interpretable: the defender’s optimization is a genuine finite minimax problem over latent distributions; approximate best responses imply explicit inequalities for the main training diagnostics; and finite-sample concentration guarantees certify when empirical signs of those diagnostics can be trusted. The second part is empirical. In Battleship, training exposure to shifted latent distributions substantially reduces robustness gaps under held-out stress distributions. However, full iterative best response only behaves as intended when the defender is itself optimized strongly enough to be genuinely adversarial. The present benchmark is also motivated, in a modest way, by a broader graphics-for-ML agenda. 
In constrained image synthesis and sequential graphics control problems, one often faces hidden physical or process latents that remain fixed throughout a generation trajectory. Although this paper does not study those domains directly, the formal development is meant to be reusable there. Battleship is used here because it makes the latent-state structure exact and the theoretical analysis verifiable.
2 Related Work
The closest classical background is robust Markov decision process theory, where uncertainty typically enters through transition kernels or rewards and the control objective becomes minimax optimization over a model uncertainty set [6, 8]. That literature is foundational, but its uncertainty model differs from the one studied here. In our setting, the adversary does not alter transition kernels throughout the episode. Instead, it selects a hidden initial latent condition, after which the environment evolves conditionally on that latent. This restriction is not merely cosmetic. It yields a simpler strategic object and, as we show, a cleaner theorem package. Adversarial reinforcement learning has also been studied in forms where the adversary acts throughout the trajectory. Representative examples include robust adversarial reinforcement learning with a disturbance player [10], as well as broader lines of work on adversarial observations, adversarial state perturbations, and online disturbance robustness. These approaches are important and practically relevant, but they are structurally different from hidden-latent robustness. In those settings, the adversary is a trajectory-level controller. In the present paper, the adversary is a selector of initial latent conditions. Partially observable sequential decision-making is naturally framed using POMDPs [7]. In a general POMDP, the canonical sufficient statistic for optimal control is the posterior belief over hidden states.
That perspective matters here because Battleship is fundamentally a search-under-uncertainty problem. It also explains why belief-style scripted baselines are so strong in this benchmark: they approximate posterior occupancy over hidden layouts. Our contribution, however, is not a new posterior filter or a new search baseline. It is a theorem-backed way of thinking about robustness when the hidden variable itself is adversarially chosen. Closer structural relatives include planning problems with static hidden parameters and robust POMDP formulations with ambiguity over latent model components [2, 9]. Our setting is narrower than a general robust POMDP because the defender acts only once by choosing a latent initial condition. That restriction is deliberate: it isolates a hidden-latent subclass in which the defender strategy space, the relevant robustness gaps, and the diagnostic quantities all become especially transparent. On the optimization side, we use PPO [12] together with action masking. This is a natural choice in Battleship because invalid actions are rule-defined rather than learned. For evaluation, we use not only mean episode length but also tail metrics such as the 95th percentile and conditional value-at-risk [11, 3]. This is important because robustness failures under latent shift often appear first in the tails rather than in the mean. The game-theoretic training protocol is related to fictitious play and iterative best-response ideas [1, 4]. However, our defender is intentionally restricted. It does not act as an unrestricted online adversary, nor does it directly optimize over all legal layouts in one step. It learns within a structured latent-selection class and is then evaluated through the induced distribution over layouts.
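The rule-defined action masking mentioned above can be sketched minimally: a cell is a valid target exactly when it has not yet been fired upon, and the policy's scores are restricted to that set before selection. The helper names below are illustrative, not the repository's API.

```python
# Minimal sketch of rule-defined action masking on a flattened board.
# `hit` and `miss` are 0/1 lists over cells; a cell is a valid target
# only if it has never been fired upon (hit == 0 and miss == 0).

def valid_action_mask(hit, miss):
    """Return 1 for cells never fired upon, 0 otherwise."""
    return [1 if h == 0 and m == 0 else 0 for h, m in zip(hit, miss)]

def masked_argmax(scores, mask):
    """Greedy action selection restricted to valid cells."""
    valid = [(s, a) for a, (s, m) in enumerate(zip(scores, mask)) if m == 1]
    if not valid:
        raise ValueError("no valid actions remain")
    return max(valid)[1]

# Tiny 4-cell example: cell 0 was a miss, cell 1 was a hit.
hit = [0, 1, 0, 0]
miss = [1, 0, 0, 0]
mask = valid_action_mask(hit, miss)                  # [0, 0, 1, 1]
action = masked_argmax([0.9, 0.8, 0.2, 0.7], mask)   # cell 3
```

The same construction applies whether the scores are heuristic occupancy estimates or policy logits; masking before selection is what makes invalid actions impossible rather than merely penalized.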
This restricted viewpoint is central to the empirical story: the theoretical diagnostics are meaningful precisely because the defender's role is mathematically clear, but their empirical success depends on whether that restricted defender can be optimized well enough to behave adversarially in practice. Finally, the present work should be read as part of a broader methodological tendency in machine learning toward robustness under hidden nuisance variation. Domain randomization and stress-distribution training pursue a related intuition, namely that controlled exposure to variation can improve out-of-distribution performance. The present paper differs in two ways. First, the latent variable is hidden and fixed rather than directly observed or resampled within an episode. Second, the main contribution is not simply an empirical robustness gain, but a mathematically explicit set of statements connecting the game formulation, the training objectives, and the diagnostics used to interpret them.
3 Methodology, Theoretical Development, and Diagnostics
The problem formulation and the methodological choices are best presented together. The formal game specifies the hidden latent layout, the observation channel, the feasible action set, and the objective. The solution space then specifies how that game is operationalized in learning: the reward surrogate used during optimization, the policy class actually implemented, the self-play protocol, and the evaluation quantities that connect the experiments back to the theory.
Let the board be an $H \times W$ grid. In the implementation and in the analysis below, cells are indexed as $C = \{0, 1, \ldots, HW-1\}$. A legal hidden ship layout is an element $B \in \mathcal{B}$, where $\mathcal{B}$ is the finite set of all placements satisfying the standard legality constraints. The latent layout is treated as hidden because the attacker never observes ship locations directly.
Instead it receives only immediate shot outcomes: miss, hit, or a sunk event for a specific ship. This is the minimum information required to preserve the actual information structure of the game. The underlying state at time $t$ may be written as $s_t = (B, M_t, H_t, U_t)$, where $M_t$ and $H_t$ denote the public miss and hit indicators and $U_t$ denotes the remaining unsunk ship structure. The action set consists of valid grid cells not yet fired upon, $A(s_t) = \{a \in C : M_t(a) = 0 \wedge H_t(a) = 0\}$. Conditioned on $B$, the environment is deterministic. The only randomness under standard play enters through the initial layout draw.
Definition 1 (Battleship POMDP). A Battleship episode is a POMDP $\mathcal{M} = (S, A, O, T, \Omega, R)$ such that a hidden layout $B$ is sampled once at time 0 and then fixed; the public state evolves deterministically through the hit, miss, and sunk updates; the observation kernel $\Omega$ is deterministic conditional on the hidden layout and public shot record; and the default reward is a step penalty until all ships are sunk.
The natural performance criterion is shots-to-win. Let $\tau$ denote the first time all ships are sunk. The game-level objective is $\min_\pi E_\pi[\tau]$. All main theorems below are stated for this evaluation loss, because $E_\pi[\tau]$ is the quantity reported throughout the empirical section. The following lemma explains the exact relation between shots-to-win and the sparse step-penalty reward in the undiscounted setting.
Lemma 1 (Undiscounted step penalty and shots-to-win). Suppose the reward is $r_t = -1$ for $t < \tau$ and $r_\tau = 0$. Then for any policy $\pi$, $J_1(\pi) := E_\pi[\sum_{t=0}^{\tau-1} r_t] = -E_\pi[\tau]$. Therefore maximizing undiscounted expected return is exactly equivalent to minimizing expected shots-to-win.
Remark 1 (Discounted PPO surrogate in the implementation). The Stage-1 and Stage-2 PPO agents in the code are trained with discount factor $\gamma_{\mathrm{train}} = 0.99$.
For the sparse step-penalty reward, the resulting optimization objective is $J_{\gamma_{\mathrm{train}}}(\pi) := E_\pi[\sum_{t=0}^{\tau-1} \gamma_{\mathrm{train}}^t r_t] = -E_\pi\!\left[\frac{1 - \gamma_{\mathrm{train}}^{\tau}}{1 - \gamma_{\mathrm{train}}}\right]$. This discounted surrogate is monotone in $\tau$ on each realized trajectory but is not equal to $-E_\pi[\tau]$ in expectation. The theorem package below should therefore be read as characterizing the evaluation game defined by episode length, while PPO serves as an optimization method for a closely related discounted surrogate.
The exact POMDP sufficient statistic is the posterior belief over legal layouts, $b_t(B) = \mathbb{P}(B \mid h_t)$, where $h_t$ denotes the interaction history up to time $t$. In principle, optimal control could be stated entirely in belief space. In practice, the implemented agents do not maintain that posterior explicitly. Instead, they receive a public board record encoded as a three-channel tensor with channels Hit, Miss, and Unknown, where Unknown is computed deterministically as $\mathrm{Unknown} = 1 - (\mathrm{Hit} + \mathrm{Miss})$. This design is faithful to the implementation and keeps the policy class computationally tractable. Conceptually, however, the posterior remains important: it explains why belief-based baselines are strong and why the hidden-latent view of the game is the correct one. The formal extension that matters for robustness is now immediate.
Definition 2 (Adversarial latent-initial-state POMDP). An adversarial latent-initial-state POMDP is a POMDP in which a latent variable $z \in Z$ is sampled once at time 0 from a defender-chosen distribution $\rho$, after which the transition and observation laws remain fixed conditional on $z$, and the attacker acts under partial observability to optimize expected return. Battleship instantiates this definition exactly, with $z = B \in \mathcal{B}$.
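Returning to Lemma 1 and Remark 1 above, the relation between the two objectives is easy to verify numerically: the undiscounted return of the step-penalty reward is exactly $-\tau$, while the discounted surrogate equals $-(1-\gamma^\tau)/(1-\gamma)$ and is strictly decreasing in $\tau$ without being proportional to it. A self-contained check, not the paper's code:

```python
# Check: for r_t = -1 (t < tau), r_tau = 0, the undiscounted return is
# exactly -tau, while the discounted return matches the closed form
# -(1 - gamma**tau) / (1 - gamma) and is strictly decreasing in tau.

def undiscounted_return(tau):
    return sum(-1 for t in range(tau))

def discounted_return(tau, gamma=0.99):
    return sum(-(gamma ** t) for t in range(tau))

for tau in (10, 50, 100):
    assert undiscounted_return(tau) == -tau
    closed_form = -(1 - 0.99 ** tau) / (1 - 0.99)
    assert abs(discounted_return(tau) - closed_form) < 1e-9

# Monotone in tau (longer episodes give lower surrogate return),
# but the surrogate saturates instead of growing linearly with tau.
assert discounted_return(50) > discounted_return(100)
```

The saturation is the point of Remark 1: two policies with very different expected episode lengths can have much closer discounted returns, which is why the theorems are stated for $E_\pi[\tau]$ rather than for the PPO objective.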
The defender's strategic role is therefore not to perturb the agent online, but to choose a hidden initial condition that changes the induced difficulty distribution over episodes. For any loss variable $X$ under defender distribution $\rho$, write $p95_\rho(X) := \inf\{x \in \mathbb{R} : \mathbb{P}_\rho(X \le x) \ge 0.95\}$ and $\mathrm{CVaR}_{0.10,\rho}(X) := \frac{1}{0.10}\int_{0.90}^{1} F^{-1}_{X,\rho}(u)\,du$, so that $\mathrm{CVaR}_{0.10,\rho}$ is the mean of the worst 10% upper tail when $X$ is interpreted as a loss. We measure robustness relative to a nominal defender distribution $\rho_{\mathrm{nom}}$ and a stress defender distribution $\rho_{\mathrm{stress}}$. For an attacker policy $\pi$, the mean robustness gap is $\Delta_{\mathrm{rob}}(\pi) = E_{\rho_{\mathrm{stress}}}[\tau(\pi)] - E_{\rho_{\mathrm{nom}}}[\tau(\pi)]$. We also use tail-sensitive analogues, $\Delta_{\mathrm{rob}}^{(p95)}(\pi) = p95_{\rho_{\mathrm{stress}}}(\tau(\pi)) - p95_{\rho_{\mathrm{nom}}}(\tau(\pi))$ and $\Delta_{\mathrm{rob}}^{(\mathrm{CVaR})}(\pi) = \mathrm{CVaR}_{0.10,\rho_{\mathrm{stress}}}(\tau(\pi)) - \mathrm{CVaR}_{0.10,\rho_{\mathrm{nom}}}(\tau(\pi))$. These quantities match the evaluation protocol actually used in the experiments.
At this point the main theorem package can be stated. The first result is the central mathematical claim of the paper. Because the horizon, action set, and observation set are finite in Battleship, the set of deterministic history-dependent attacker policies is finite, although large. This permits an exact reduction of the attacker-defender interaction to a finite zero-sum game.
Definition 3 (Deterministic history-dependent attacker policy). Let $\mathcal{H}$ be the finite set of all feasible attacker histories up to a finite horizon $T_{\max}$. A deterministic history-dependent attacker policy is a map $\pi : \mathcal{H} \to A$ such that $\pi(h) \in A(h)$ for every feasible history $h \in \mathcal{H}$. Let $\Pi_{\mathrm{det}}$ denote the finite set of all such deterministic attacker policies. Let the defender class be a compact convex polytope $\mathcal{P} = \mathrm{conv}\{\rho^1, \ldots, \rho^m\}$ of admissible latent distributions over a finite latent set $Z$.
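The tail metrics defined above have direct empirical counterparts over a finite sample of episode lengths: the empirical 95th percentile is the smallest sample value whose empirical CDF reaches 0.95, and the empirical $\mathrm{CVaR}_{0.10}$ is the mean of the largest 10% of losses. A minimal estimator sketch with made-up sample values (illustrative only, not the paper's evaluation code):

```python
import math

def p95(samples):
    """Empirical 95th percentile: smallest x with F_hat(x) >= 0.95."""
    xs = sorted(samples)
    k = math.ceil(0.95 * len(xs))  # points that must lie at or below x
    return xs[k - 1]

def cvar_upper(samples, alpha=0.10):
    """Mean of the worst alpha-fraction of losses (largest values)."""
    xs = sorted(samples, reverse=True)
    k = max(1, math.ceil(alpha * len(xs)))
    return sum(xs[:k]) / k

# Hypothetical shots-to-win samples under stress and nominal defenders.
stress = [90, 95, 100, 100, 100, 88, 99, 100, 97, 100]
nominal = [80, 85, 90, 88, 92, 86, 89, 91, 87, 90]

gap_mean = sum(stress) / len(stress) - sum(nominal) / len(nominal)
gap_p95 = p95(stress) - p95(nominal)
gap_cvar = cvar_upper(stress) - cvar_upper(nominal)
```

Note that on a 10x10 board the loss is capped at 100 shots, so stress-distribution tails saturate at the cap; this is visible later in the p95 and CVaR columns of the Stage-1 table.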
For each $\pi \in \Pi_{\mathrm{det}}$ and each latent $z \in Z$, define the latent-conditioned loss $L(\pi, z) := E[\tau(\pi) \mid z]$.
Theorem 1 (Latent minimax principle). Assume that the horizon $T_{\max}$ is finite, the action and observation alphabets are finite, the latent set $Z$ is finite, and the defender class $\mathcal{P}$ is a compact convex polytope of latent distributions. Then $\min_{\mu \in \Delta(\Pi_{\mathrm{det}})} \max_{\rho \in \mathcal{P}} V(\mu, \rho) = \max_{\rho \in \mathcal{P}} \min_{\mu \in \Delta(\Pi_{\mathrm{det}})} V(\mu, \rho)$, where $V(\mu, \rho) = \sum_{\pi \in \Pi_{\mathrm{det}}} \sum_{z \in Z} \mu(\pi)\,\rho(z)\,L(\pi, z)$. In particular, adversarial latent-state training is a genuine finite minimax problem over attacker mixtures and defender distributions.
The immediate corollary is that worst-case defenders can be sought at extreme points.
Corollary 1 (Extreme-point defenders). For any fixed attacker mixture $\mu$, the defender optimization $\max_{\rho \in \mathcal{P}} V(\mu, \rho)$ attains its maximum at an extreme point of $\mathcal{P}$.
This matters conceptually because it justifies training against defender distributions. In this setting, distributions are not a heuristic relaxation of the true game; they are the natural strategy objects induced by the latent minimax formulation. The next result family is more directly tied to training logs. The exact best-response interpretation is too strong for practical optimization, so we formulate approximate certificates.
Definition 4 (Defender $\varepsilon$-best response). Fix an attacker policy $\pi$ and defender class $\mathcal{P}$. A defender distribution $\rho_\varepsilon \in \mathcal{P}$ is an $\varepsilon_D$-best response to $\pi$ if $\sup_{\rho \in \mathcal{P}} E_\rho[\tau(\pi)] - E_{\rho_\varepsilon}[\tau(\pi)] \le \varepsilon_D$.
Definition 5 (Attacker $\varepsilon$-best response to a defender mixture). Fix an attacker policy class $\Pi$ and let $\nu = \lambda \rho_D + (1 - \lambda) \rho_U$ with $\lambda \in (0, 1)$. A policy $\pi_\varepsilon \in \Pi$ is an $\varepsilon_A$-best response to $\nu$ if $E_\nu[\tau(\pi_\varepsilon)] - \inf_{\pi \in \Pi} E_\nu[\tau(\pi)] \le \varepsilon_A$.
Theorem 2 (Approximate best-response certificates). Let $\rho_U$ denote the nominal UNIFORM defender distribution.
If $\rho_k$ is an $\varepsilon_D$-best response to attacker $\pi_{k-1}$ over a defender class containing $\rho_U$, then $E_{\rho_k}[\tau(\pi_{k-1})] - E_{\rho_U}[\tau(\pi_{k-1})] \ge -\varepsilon_D$. Equivalently, $\mathrm{defender\_adversarial}_k \ge -\varepsilon_D$. If, further, $\pi_k$ is an $\varepsilon_A$-best response to the mixture $\nu_k = \lambda \rho_k + (1 - \lambda) \rho_U$, then $\lambda\,(E_{\rho_k}[\tau(\pi_k)] - E_{\rho_k}[\tau(\pi_{k-1})]) + (1 - \lambda)\,(E_{\rho_U}[\tau(\pi_k)] - E_{\rho_U}[\tau(\pi_{k-1})]) \le \varepsilon_A$. Equivalently, $\lambda \cdot \mathrm{attacker\_adaptation}_k + (1 - \lambda) \cdot \mathrm{uniform\_drift}_k \le \varepsilon_A$.
The reported Stage-2 experiments use the default two-way attacker mixture $\nu_k = 0.5\,\rho_k + 0.5\,\rho_U$, which matches the certificate above. These inequalities explain how to read the Stage-2 diagnostics, but only after one specifies the optimization errors $\varepsilon_D$ and $\varepsilon_A$. In the experiments those quantities are not observed directly, so the logged diagnostics should be interpreted as theorem-motivated proxies rather than as certified duality gaps.
Because the experiments report empirical means from finitely many evaluation episodes, one also wants a statistical statement ensuring that the observed sign of a diagnostic is trustworthy. Since Battleship episode length is bounded by the finite board size, concentration bounds are available. In the Battleship benchmark considered here, valid masking ensures that no cell can be fired at more than once. Therefore, in the no-truncation setting, $\tau \le HW$ almost surely. If evaluation uses an external truncation cap $T_{\mathrm{cap}}$, then all bounds below hold with $T_{\max} = \min(HW, T_{\mathrm{cap}})$. We state the theorem using a generic $T_{\max}$ so that the result remains valid under either evaluation protocol.
Theorem 3 (Finite-sample sign certification).
Suppose evaluation episodes are independent and episode lengths are almost surely bounded in $[0, T_{\max}]$, where in Battleship one may take $T_{\max} = HW$ in the no-truncation setting and $T_{\max} = \min(HW, T_{\mathrm{cap}})$ when an external truncation cap is imposed. Let $\hat{\Delta}_D = \hat{E}_{\rho_k}[\tau(\pi_{k-1})] - \hat{E}_{\rho_U}[\tau(\pi_{k-1})]$ be estimated from $n_D$ independent episodes under $\rho_k$ and $n_U$ independent episodes under $\rho_U$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, $|\hat{\Delta}_D - \Delta_D| \le T_{\max}\sqrt{\frac{\log(4/\delta)}{2 n_D}} + T_{\max}\sqrt{\frac{\log(4/\delta)}{2 n_U}}$, where $\Delta_D = E_{\rho_k}[\tau(\pi_{k-1})] - E_{\rho_U}[\tau(\pi_{k-1})]$. Consequently, if $|\hat{\Delta}_D| > T_{\max}\sqrt{\frac{\log(4/\delta)}{2 n_D}} + T_{\max}\sqrt{\frac{\log(4/\delta)}{2 n_U}}$, then the sign of $\hat{\Delta}_D$ matches the sign of $\Delta_D$ with probability at least $1 - \delta$.
Now define the weighted attacker residual $\hat{R}_k := \lambda\,(\hat{E}_{\rho_k}[\tau(\pi_k)] - \hat{E}_{\rho_k}[\tau(\pi_{k-1})]) + (1 - \lambda)\,(\hat{E}_{\rho_U}[\tau(\pi_k)] - \hat{E}_{\rho_U}[\tau(\pi_{k-1})])$ and its population counterpart $R_k := \lambda\,(E_{\rho_k}[\tau(\pi_k)] - E_{\rho_k}[\tau(\pi_{k-1})]) + (1 - \lambda)\,(E_{\rho_U}[\tau(\pi_k)] - E_{\rho_U}[\tau(\pi_{k-1})])$. If the four empirical means are computed from independent samples of sizes $n_{D,a}$, $n_{D,b}$, $n_{U,a}$, and $n_{U,b}$ respectively, then with probability at least $1 - \delta$, $|\hat{R}_k - R_k| \le \lambda T_{\max}\left(\sqrt{\frac{\log(8/\delta)}{2 n_{D,a}}} + \sqrt{\frac{\log(8/\delta)}{2 n_{D,b}}}\right) + (1 - \lambda) T_{\max}\left(\sqrt{\frac{\log(8/\delta)}{2 n_{U,a}}} + \sqrt{\frac{\log(8/\delta)}{2 n_{U,b}}}\right)$. Consequently, if $|\hat{R}_k|$ exceeds the right-hand side, then the sign of $\hat{R}_k$ matches the sign of $R_k$ with probability at least $1 - \delta$.
In the reported Stage-2 pipeline, $\lambda = 0.5$, the defender diagnostic uses $n_U = 100$ and $n_D = 50$, and the attacker residual uses $n_{U,a} = 100$, $n_{U,b} = 100$, $n_{D,a} = 100$, and $n_{D,b} = 50$.
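At the reported sample sizes the certification radii of Theorem 3 can be computed directly. The sketch below plugs in $T_{\max} = 100$ (a 10x10 board, no truncation) and $\delta = 0.05$; these concrete choices are ours for illustration, not values fixed by the paper.

```python
import math

def hoeffding_radius(t_max, n, log_term):
    """Deviation bound for one bounded-mean estimate:
    t_max * sqrt(log_term / (2 n))."""
    return t_max * math.sqrt(log_term / (2 * n))

delta, t_max = 0.05, 100  # illustrative: delta and a 10x10 board cap

# Defender diagnostic: two means, union bound -> log(4/delta) per mean,
# with n_D = 50 and n_U = 100 as in the reported pipeline.
r_defender = (hoeffding_radius(t_max, 50, math.log(4 / delta))
              + hoeffding_radius(t_max, 100, math.log(4 / delta)))

# Attacker residual: four means with lambda = 0.5 -> log(8/delta) per mean,
# sample sizes (n_Da, n_Db, n_Ua, n_Ub) = (100, 50, 100, 100).
lam = 0.5
log8 = math.log(8 / delta)
r_attacker = (lam * (hoeffding_radius(t_max, 100, log8)
                     + hoeffding_radius(t_max, 50, log8))
              + (1 - lam) * (hoeffding_radius(t_max, 100, log8)
                             + hoeffding_radius(t_max, 100, log8)))
```

Both radii come out in the tens of shots, far larger than the single-digit diagnostics in the Stage-2 tables, which is exactly why the text treats the bound as a worst-case sanity check rather than a row-wise certificate.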
The resulting Hoeffding radii are conservative, so we use the theorem as a worst-case sanity bound rather than as a tight row-wise certificate. The final structural proposition explains why defender-shift analysis should not rely only on single-cell occupancy marginals.
Proposition 1 (Marginal insufficiency for fixed policies). There exist adversarial latent-initial-state POMDPs with product latent space $Z = \prod_{i=1}^{d} Z_i$, two defender distributions $\rho$ and $\rho'$ with identical one-coordinate marginals over that product latent variable, and a deterministic attacker policy $\pi$ such that $E_\rho[\tau(\pi)] \ne E_{\rho'}[\tau(\pi)]$. Hence one-coordinate marginals are not sufficient to characterize adversarial hardness for fixed policies, and higher-order latent structure can matter even when marginals agree.
Remark 2. This proposition is intentionally modest. It does not claim that identical one-coordinate marginals change the value of the fully optimal attacker in every setting. It claims only that low-order marginals are insufficient to determine hardness for fixed policies, which is already enough to justify the use of structure-aware defender metrics in the empirical analysis.
The empirical diagnostics are therefore not ad hoc. The quantity defender_adversarial operationalizes the defender certificate theorem, while the weighted residual $R_k$ operationalizes the attacker certificate theorem. In the actual experiments, these objects are best understood as theorem-motivated diagnostics for empirically estimated quantities rather than as formal certificates of small optimization error. There is also a useful multiobjective reading of the attacker-training regimes. If one regards nominal loss and adversarial loss as two objectives, then training against a mixture $\lambda \rho_D + (1 - \lambda) \rho_U$ is a scalarization of the nominal-versus-adversarial tradeoff. This leads to the following small but useful consequence.
Corollary 2 (Supported weakly Pareto-optimal points under mixture training). Fix a defender distribution $\rho_D$ and a nominal defender distribution $\rho_U$. Any exact minimizer of $\lambda E_{\rho_D}[\tau(\pi)] + (1 - \lambda) E_{\rho_U}[\tau(\pi)]$ for some $\lambda \in (0, 1)$ attains a supported weakly Pareto-optimal point of the achievable nominal-versus-adversarial loss region.
The solution-space choices are now easy to state. The reported experiments use the sparse step-penalty reward together with discounted PPO as a surrogate for the evaluation objective $E[\tau]$. A shaped reward family is conceptually possible and is discussed in the solution-space specification, but it is not the focus of the reported results. The attacker policy actually implemented for the main experiments is a feedforward masked PPO policy over the public three-channel board tensor rather than a recurrent explicit-belief controller. This is a methodological choice rather than a property of the problem itself. The baseline self-play protocol is alternating optimization: hold the defender fixed while training the attacker, then hold the attacker fixed while training the defender. More elaborate game-theoretic stabilizers are possible, but they are not needed to define the present benchmark.
For the algorithmic summaries, it is useful to write the training loops in operator form. In Stage 1, let $r \in \{A, B, C\}$ denote the chosen training regime and let $\rho_g^{(A)} = \rho_U$, $\rho_g^{(B)} = \rho_{\mathrm{mix}}$, $\rho_g^{(C)} = \rho_{b(g)}$, where $\rho_{\mathrm{mix}}$ is the fixed scripted training mixture and $b(g)$ is the alternating block schedule with $\rho_{b(g)} \in \{\rho_U, \rho_{\mathrm{stress}}\}$. If $D_g$ denotes the masked rollout batch collected at attacker iterate $\pi_g$, we may summarize the update as $D_g \sim \mathrm{Rollout}(\pi_g, \rho_g^{(r)}, N_g)$, $\pi_{g+1} = \mathrm{PPO}_A(\pi_g, D_g)$.
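The Stage-1 operator form above reduces to a simple regime-conditioned loop. The sketch below fixes only the control flow; `rollout` and `ppo_update` are deliberate stubs, and the block length of 5 in the regime-C schedule $b(g)$ is an illustrative assumption, not a value taken from the paper.

```python
# Control-flow sketch of Stage-1 training. The stubs stand in for masked
# rollout collection and a PPO improvement step; the point is the
# regime-dependent choice of latent distribution per iteration.

RHO_UNIFORM, RHO_MIX, RHO_STRESS = "uniform", "mix", "stress"

def training_distribution(regime, g):
    if regime == "A":
        return RHO_UNIFORM                 # nominal-only training
    if regime == "B":
        return RHO_MIX                     # fixed scripted mixture
    # Regime C: alternating block schedule b(g); block length 5 is
    # illustrative only.
    return RHO_UNIFORM if (g // 5) % 2 == 0 else RHO_STRESS

def rollout(policy, rho, n):               # stub: collect n masked episodes
    return {"policy": policy, "rho": rho, "n": n}

def ppo_update(policy, batch):             # stub: one PPO improvement step
    return policy + 1

def stage1(regime, generations=20):
    policy, schedule = 0, []
    for g in range(generations):
        rho = training_distribution(regime, g)
        schedule.append(rho)
        policy = ppo_update(policy, rollout(policy, rho, n=64))
    return policy, schedule
```

Running `stage1("C", 10)` yields five uniform blocks followed by five stress blocks, which is the alternating exposure pattern that drives the regime-C gap reversal in the Stage-1 results.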
In Stage 2, the defender and attacker updates may be written as $\pi_{D,k} = \mathrm{Train}_D(\pi_{A,k-1}; \tau_D)$, $\rho_k = \mathrm{Extract}(\pi_{D,k})$, $\nu_k = \lambda \rho_k + (1 - \lambda) \rho_U$, followed by $\pi_{A,k} = \mathrm{Train}_A(\pi_{A,k-1}, \nu_k; \tau_A)$. The logged quantities are empirical estimates of the theorem-level diagnostics, so we distinguish them with hats: $\widehat{\mathrm{defender\_adversarial}}_k := \hat{E}_{\rho_k}[\tau(\pi_{A,k-1})] - \hat{E}_{\rho_U}[\tau(\pi_{A,k-1})]$, $\widehat{\mathrm{attacker\_adaptation}}_k := \hat{E}_{\rho_k}[\tau(\pi_{A,k})] - \hat{E}_{\rho_k}[\tau(\pi_{A,k-1})]$, and $\widehat{\mathrm{uniform\_drift}}_k := \hat{E}_{\rho_U}[\tau(\pi_{A,k})] - \hat{E}_{\rho_U}[\tau(\pi_{A,k-1})]$. For clarity, the two main training procedures are summarized below.
Algorithm 1 (Stage-1 attacker training)
1: Input: regime $r \in \{A, B, C\}$, initialization $\pi_0$, rollout sizes $\{N_g\}_{g=0}^{G_1-1}$, and evaluation schedule $\mathcal{I}_{\mathrm{eval}}$
2: for $g = 0, \ldots, G_1 - 1$ do
3: Set the training latent distribution $\rho_g^{(r)}$: $\rho_g^{(A)} = \rho_U$, $\rho_g^{(B)} = \rho_{\mathrm{mix}}$, $\rho_g^{(C)} = \rho_{b(g)}$.
4: Sample masked rollouts $D_g \sim \mathrm{Rollout}(\pi_g, \rho_g^{(r)}, N_g)$.
5: Update the attacker with PPO: $\pi_{g+1} \leftarrow \mathrm{PPO}_A(\pi_g, D_g)$.
6: if $g + 1 \in \mathcal{I}_{\mathrm{eval}}$ then
7: Estimate the evaluation losses $\hat{J}_U(\pi_{g+1}) = \hat{E}_{\rho_U}[\tau(\pi_{g+1})]$ and $\hat{J}_S(\pi_{g+1}) = \hat{E}_{\rho_{\mathrm{stress}}}[\tau(\pi_{g+1})]$.
8: end if
9: end for
10: Return $\pi_{G_1}$
Algorithm 2 (Restricted iterative best response)
1: Input: initial attacker $\pi_{A,0}$, generations $G$, budgets $(\tau_D, \tau_A)$, and mixture weight $\lambda$
2: for $k = 1, \ldots, G$ do
3: Train a defender against the frozen attacker: $\pi_{D,k} \leftarrow \mathrm{Train}_D(\pi_{A,k-1}; \tau_D)$.
4: Extract the induced latent distribution: $\rho_k \leftarrow \mathrm{Extract}(\pi_{D,k})$.
5: Form the attacker-training mixture $\nu_k \leftarrow \lambda \rho_k + (1 - \lambda) \rho_U$.
6: Update the attacker: $\pi_{A,k} \leftarrow \mathrm{Train}_A(\pi_{A,k-1}, \nu_k; \tau_A)$.
7: Estimate $\hat{J}_{U,k}^- = \hat{E}_{\rho_U}[\tau(\pi_{A,k-1})]$, $\hat{J}_{D,k}^- = \hat{E}_{\rho_k}[\tau(\pi_{A,k-1})]$, $\hat{J}_{U,k}^+ = \hat{E}_{\rho_U}[\tau(\pi_{A,k})]$, $\hat{J}_{D,k}^+ = \hat{E}_{\rho_k}[\tau(\pi_{A,k})]$, $\hat{J}_{S,k}^+ = \hat{E}_{\rho_{\mathrm{stress}}}[\tau(\pi_{A,k})]$.
8: Log $\widehat{\mathrm{defender\_adversarial}}_k \leftarrow \hat{J}_{D,k}^- - \hat{J}_{U,k}^-$, $\widehat{\mathrm{attacker\_adaptation}}_k \leftarrow \hat{J}_{D,k}^+ - \hat{J}_{D,k}^-$, and $\widehat{\mathrm{uniform\_drift}}_k \leftarrow \hat{J}_{U,k}^+ - \hat{J}_{U,k}^-$.
9: end for
10: Return $\{(\pi_{A,k}, \rho_k)\}_{k=1}^{G}$
For the reported Stage-2 runs, the attacker mixture weight is fixed at $\lambda = 0.5$. Generation-wise evaluation uses 100 episodes for each scripted defender mode, 50 episodes for the pre-update $D_k$ evaluation, and 100 episodes for the post-update $D_k$ evaluation.
4 Results
The experiments are organized around four questions. The first asks whether stress exposure reduces the robustness gap under scripted latent shift. The second asks whether restricted iterative best response discovers harder latent distributions. The third asks whether the attacker adapts without sacrificing too much nominal performance. The fourth asks whether defender budget is the main determinant of whether the Stage-2 game behaves in the intended minimax fashion. The clearest answer comes from the first question. Table 1 reports the Stage-1 results aggregated across three seeds. Regime B, which trains on a fixed mixture of nominal and shifted defenders, sharply reduces the mean robustness gap between SPREAD and UNIFORM from approximately 10.3 shots to 3.1 shots at equal environment-step budget. This is the strongest empirical support for the paper's main claim. Regime C, which alternates more aggressively between nominal and stress exposure, drives the gap negative, but at the cost of worse nominal performance.
This is consistent with the scalarization view introduced above: stronger adversarial emphasis can move the solution farther along the nominal-versus-stress tradeoff frontier, but not necessarily in a way that improves both objectives simultaneously.
Table 1: Stage-1 results aggregated across three seeds. Nominal = UNIFORM; stress = SPREAD. Lower is better.
Regime | UNIFORM mean | SPREAD mean | Gap $\Delta_{\mathrm{rob}}$ | SPREAD p95 | SPREAD CVaR_0.10
A (uniform-only) | 90.00 ± 1.72 | 100.33 ± 1.18 | 10.33 ± 0.54 | 100.00 ± 0.00 | 100.00 ± 0.00
B (fixed mixture) | 91.33 ± 0.62 | 94.47 ± 0.62 | 3.13 ± 1.17 | 100.00 ± 0.00 | 99.33 ± 1.15
C (alternating stress) | 93.33 ± 2.04 | 84.44 ± 1.62 | -8.89 ± 2.42 | 99.00 ± 1.73 | 98.89 ± 1.92
These Stage-1 gains should be interpreted as relative robustness gains rather than as evidence of strong absolute gameplay. The strongest scripted baselines in this benchmark remain substantially better than the learned attackers at the current budgets. That point matters and should be made plainly. The present paper is about robustness under hidden latent shift, not about solving Battleship optimally. The Stage-2 results are more nuanced and, in some ways, more informative. Table 2 gives the final three-seed iterative best-response diagnostics. The quantity defender_adversarial is the key empirical bridge to the theory. Under an exact defender best response it cannot be negative, and under an $\varepsilon_D$-best response it is bounded below by $-\varepsilon_D$. Because $\varepsilon_D$ is not directly observed, negative entries should be read as evidence of limited defender optimization rather than as formal violations of the minimax formulation. The displayed UNIFORM, SPREAD, and $D_k$ columns report post-update means for $A_k$, whereas the diagnostic columns use the pre/post definitions stated in the table caption.
Table 2: Final Stage-2 iterative best-response diagnostics across three independent seeds.
The columns UNIFORM, SPREAD, and D_k report post-update means E_{ρ_U}[τ(π_k)], E_{ρ_stress}[τ(π_k)], and E_{ρ_k}[τ(π_k)]. The diagnostic columns are computed from both pre- and post-update evaluations: defender_adversarial_k = E_{ρ_k}[τ(π_{k−1})] − E_{ρ_U}[τ(π_{k−1})], attacker_adaptation_k = E_{ρ_k}[τ(π_k)] − E_{ρ_k}[τ(π_{k−1})], and uniform_drift_k = E_{ρ_U}[τ(π_k)] − E_{ρ_U}[τ(π_{k−1})]. Defender budgets are restricted to 50k steps per generation, with 100 scripted episodes per mode and 50 pre-update D_k episodes.

Seed   Gen (k)   A_k on UNIFORM   A_k on SPREAD   A_k on D_k   defender_adversarial   attacker_adaptation   uniform_drift
42     1         90.3             91.8            90.5         −2.15                  −1.37                 −0.96
42     2         89.8             90.9            90.0         −0.14                  −0.45                 0.06
42     3         89.1             89.1            88.5         0.26                   0.73                  0.18
123    1         88.8             92.3            89.1         1.47                   1.49                  −0.31
123    2         89.2             87.1            89.6         1.93                   1.17                  0.35
123    3         89.2             82.8            87.5         −1.35                  −0.23                 0.53
777    1         89.7             88.5            89.3         −1.93                  −0.50                 −1.04
777    2         89.3             86.3            89.5         0.18                   0.05                  −0.13
777    3         87.9             83.2            88.6         3.19                   3.38                  −0.88

The Stage-2 table also clarifies the third experimental question. The attacker does adapt in some generations, but that adaptation is not uniformly clean. Theorem 2 says that the correct object is the weighted combination of attacker adaptation and nominal drift, which in the reported runs uses λ = 0.5. In other words, a small amount of nominal degradation can still be acceptable if it is compensated by sufficient improvement against the learned defender. At the sample sizes used here, the finite-sample theorem is intentionally conservative, so we treat it as a worst-case sanity check rather than as a sharp row-wise sign certificate. The empirical picture is therefore not one of failure, but of partial success under a budget-limited adversarial regime. That budget-limited interpretation becomes much clearer in the defender-budget ablation.
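The three diagnostic columns are plain differences of estimated mean episode lengths. A minimal bookkeeping sketch (function and argument names are ours, not the paper's code):

```python
def ibr_diagnostics(J_U_pre, J_D_pre, J_U_post, J_D_post):
    """Generation-k diagnostics from four estimated mean episode lengths.

    J_U_pre  = E_{rho_U}[tau(pi_{k-1})],  J_D_pre  = E_{rho_k}[tau(pi_{k-1})],
    J_U_post = E_{rho_U}[tau(pi_k)],      J_D_post = E_{rho_k}[tau(pi_k)].
    """
    defender_adversarial = J_D_pre - J_U_pre   # >= -eps_D under an eps_D-best response
    attacker_adaptation = J_D_post - J_D_pre   # post- minus pre-update loss on rho_k
    uniform_drift = J_U_post - J_U_pre         # post- minus pre-update loss on rho_U
    return defender_adversarial, attacker_adaptation, uniform_drift
```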
Table 3 varies defender optimization budget while keeping the attacker side fixed. At 50k defender steps, the defender frequently fails to remain reliably adversarial, especially for seed 42. At 200k steps, the first generation for seed 42 becomes positively adversarial. This is exactly the sort of evidence one would hope for if the negative defender_adversarial values were primarily due to insufficient defender optimization rather than to a conceptual flaw in the game formulation.

Table 3: Defender-budget ablation. Positive defender_adversarial implies that the learned defender induces a harder distribution than the nominal distribution for the current attacker.

Budget (τ_D)   Seed   Gen (k)   defender_adversarial   attacker_adaptation
50k            42     1         −1.42                  −0.87
50k            42     2         −2.02                  1.40
50k            42     3         0.02                   1.68
50k            123    1         1.33                   −1.37
50k            123    2         0.34                   1.21
50k            123    3         0.19                   −0.91
200k           42     1         0.10                   −1.69

The defender-shift metrics support the interpretation that the scripted defender families correspond to real structural shifts rather than only relabeled samplers. Table 4 reports centroid distance, cluster score, marginal entropy, and quadrant mass asymmetry. SPREAD is the strongest geometric shift in this family, which is consistent with its role as the main stress distribution in the Stage-1 results.

Table 4: Scripted defender-shift metrics.

Defender   CentroidDistMean   ClusterScore   MarginalEntropy   QuadrantMassStd
UNIFORM    0.000              18.05          0.451             0.012
EDGE       1.866              18.04          0.449             0.013
CLUSTER    2.266              21.55          0.399             0.016
SPREAD     2.947              15.97          0.416             0.026
PARITY     1.502              18.50          0.434             0.034

The results therefore support a differentiated conclusion. Stage 1 provides a strong and stable robustness result: exposure to shifted latent distributions substantially reduces robustness gaps. Stage 2 provides a more conditional result: iterative best response is meaningful, but only when the defender is optimized strongly enough to act as a true adversary. This is not a weakness of the theory.
On the contrary, it is precisely where the theory earns its place: the diagnostic quantities are interpretable because they are tied to explicit certificate theorems.

5 Discussion and Future Work

The theoretical and empirical halves of the paper point to the same conclusion, but in different ways. The theoretical development shows that adversarial latent-state training is not merely an intuitive adaptation of robust RL language. In finite-horizon hidden-latent problems, it is a genuine minimax problem over attacker mixtures and defender latent distributions. This matters because it clarifies what the defender is, what the attacker is optimizing against, and what can and cannot be concluded from empirical diagnostics. The approximate best-response theorems then explain how to read imperfect training runs: not as contradictions of the game formulation, but as evidence about optimization quality and representational limitations.

The experiments, in turn, support two claims of different strengths. The first claim is comparatively strong: training exposure to latent shift reduces robustness gaps under latent shift. This is supported clearly by Stage 1 and is stable across three seeds. The second claim is more cautious: full iterative best response can discover harder latent distributions and yield meaningful attacker adaptation, but only when the defender side is trained strongly enough. This is why the defender-budget ablation matters so much. It turns what could otherwise be dismissed as unstable self-play into a concrete scientific result about when the adversarial training game is or is not being solved well enough to justify its interpretation.

A further reason to take this formulation seriously is that it is not intrinsically tied to Battleship. The hidden-layout interpretation is especially clear there, but the same mathematics may apply whenever a sequential controller acts under a fixed hidden process condition.
This is where the graphics-for-ML motivation becomes relevant. In sequential halftoning, print planning, or physically constrained image-generation loops, one can imagine hidden process latents such as substrate behavior, optical blur, dot gain, viewing condition, or calibration drift being selected at the beginning of a trajectory and then remaining fixed. If those settings are cast as sequential decision problems, the present theorem package offers a natural robustness language: the adversary chooses the hidden process condition, the controller acts sequentially, and the resulting diagnostics have a principled interpretation. We therefore see one natural direction for future work as the transfer of this adversarial latent-state framework to sequential graphics problems, especially DRL-based halftoning under hidden process shift.

There are also purely methodological next steps. One is to strengthen the defender in Stage 2 so that the empirical game comes closer to the exact best-response regime described by the theory. Another is to test whether the theorem-guided diagnostics can be made useful online, for instance as stopping criteria or adaptive budget-allocation signals during attacker-defender training. A third is to move beyond a single benchmark family. The current paper argues that Battleship is enough to make the latent-state structure exact and analyzable. A subsequent paper should test how much of the theory survives in other hidden-latent sequential tasks.

6 Limitations

The first limitation is absolute performance. The learned attackers are not yet close to the strongest scripted baselines in this benchmark. That is not fatal for the current paper, because the central claim is comparative rather than absolute, but it places a natural limit on how strongly one should interpret the practical performance of the learned policies.

The second limitation concerns the optimization bridge from theory to implementation.
The theoretical statements are exact for the evaluation loss E[τ] under exact best responses and become approximate under ε-best responses. The implemented PPO agents, however, optimize a discounted surrogate with γ_train = 0.99, and the optimization errors ε_A, ε_D are not directly observed. The experiments strongly suggest that the defender is often the optimization bottleneck. This means that Stage-2 failures should not be over-read as failures of the minimax formulation itself. At the same time, it means the empirical IBR evidence is only as convincing as the defender optimization budget, defender class, and surrogate-optimization quality are strong.

The third limitation is scope. Battleship is a single benchmark family. It is a particularly useful one for this question, because the hidden-latent structure is exact and transparent, but it is still a single family. The graphics-for-ML motivation discussed above is therefore a plausible research trajectory rather than an empirical result of the present paper. Similarly, the marginal-insufficiency proposition is intentionally limited to fixed policies; the paper does not claim that low-order marginals are insufficient to determine the value of a fully optimized attacker in every hidden-latent problem.

The fourth limitation is representational. The formal POMDP sufficient statistic is the full posterior over legal layouts, whereas the implemented neural policy observes only the public three-channel board tensor. This is the right implementation choice for tractability, but it means the policy class is not an explicit belief-state controller. Part of the gap to the strongest scripted baselines may therefore be representational rather than purely optimization-based.

7 Conclusion

This paper introduced adversarial latent-initial-state POMDPs as a restricted but mathematically tractable class of robustness problems.
For this class, it proved a latent minimax principle, an extreme-point characterization of worst-case defenders, approximate best-response certificates, finite-sample sign-certification results, and a structural marginal-insufficiency proposition. The novelty claim is deliberately narrow: it is not the existence of minimax equilibria in finite games per se, but the specialization to hidden-latent initial-state problems together with theorem-guided diagnostics that can be mapped back to concrete training and evaluation code. These results are modest in scope, but they serve a clear purpose: they give exact mathematical meaning to the diagnostics used during adversarial latent-state training.

Empirically, the Battleship benchmark shows that training exposure to shifted latent distributions substantially reduces robustness gaps under held-out stress distributions. At the same time, iterative best response is shown to be sensitive to defender budget, which is precisely the behavior the approximate-certificate theorems suggest one should expect. The contribution of the paper is therefore not that it solves Battleship, nor that it proves a broad universal theorem about all adversarial training. Its contribution is narrower and, we believe, more useful: it identifies a specific hidden-latent robustness setting, proves the right structural statements for that setting, and shows that those statements help explain both the successes and the failures of the training dynamics observed in practice.

8 Appendixes

Appendix A Notation Table

Table 5: Notation used in the paper.

Symbol        Meaning
H, W          Board height and width.
C             Set of grid cells, indexed in the implementation as 0, 1, …, HW−1.
B             Hidden Battleship layout.
ℬ             Set of legal Battleship layouts.
M_t, H_t      Public miss and hit indicator grids at time t.
U_t           Remaining unsunk ship structure at time t.
s_t           Full underlying state at time t, typically written as (B, M_t, H_t, U_t).
Ω                        Observation kernel of the Battleship POMDP.
τ                        Episode length, i.e. shots-to-win.
z                        Generic hidden latent variable.
Z                        Latent state space.
ρ                        Defender distribution over latent states.
ρ_U                      Nominal UNIFORM defender distribution.
ρ_D                      Learned or stress defender distribution.
ρ_nom, ρ_stress          Nominal and stress defender distributions in the robustness evaluation.
γ_train                  Discount factor used by the PPO training surrogate, fixed to 0.99 in the reported runs.
ℋ                        Set of feasible attacker histories up to a finite horizon.
Π_det                    Finite set of deterministic history-dependent attacker policies.
Δ(Π_det)                 Simplex of randomized mixtures over deterministic attacker policies.
P                        Defender class, assumed to be a compact convex polytope of latent distributions.
L(π, z)                  Latent-conditioned expected loss of attacker policy π when the latent state is z.
V(μ, ρ)                  Bilinear expected loss under attacker mixture μ and defender distribution ρ.
Δ_rob                    Mean robustness gap between stress and nominal distributions.
CVaR_0.10                Conditional value-at-risk at the worst 10% tail.
R_k                      Weighted Stage-2 attacker residual appearing in the attacker certificate and concentration bound.
λ                        Mixture weight on the defender-induced distribution in Stage 2, fixed to 0.5 in the reported runs.
defender_adversarial_k   Difference in expected shots-to-win between the learned defender distribution and the nominal defender at generation k, evaluated against the pre-update attacker.
attacker_adaptation_k    Difference in expected shots-to-win on the learned defender distribution between the post-update and pre-update attackers at generation k.
uniform_drift_k          Difference in expected shots-to-win on the nominal defender distribution between the post-update and pre-update attackers at generation k.
Appendix B Proofs

B.1 Proof of the latent minimax principle

(S) The horizon, action alphabet, observation alphabet, and latent set are finite, and the defender class P is a compact convex polytope.
(G) Reduce the attacker-defender interaction to a finite zero-sum game and then apply von Neumann's minimax theorem.

Because the horizon T_max is finite and both the action and observation alphabets are finite, the number of feasible attacker histories is finite. A deterministic history-dependent attacker policy assigns to each such history a legal cell in the finite grid-action set. Therefore the number of deterministic history-dependent attacker policies is finite. Denote this finite set by Π_det. Because the latent set Z is finite by assumption, and because the defender class P is a compact convex polytope in the finite-dimensional simplex over Z, there exist finitely many extreme points ρ^1, …, ρ^m such that

P = conv{ρ^1, …, ρ^m}.

Fix any deterministic attacker policy π ∈ Π_det and any extreme defender distribution ρ^j. By definition,

L(π, ρ^j) = Σ_{z ∈ Z} ρ^j(z) L(π, z).

This defines a finite matrix game whose attacker pure strategies are indexed by Π_det and whose defender pure strategies are indexed by the extreme points ρ^1, …, ρ^m. Now let μ ∈ Δ(Π_det) be any mixed attacker strategy, and let ρ ∈ P be any defender distribution. Since ρ lies in the convex hull of the extreme points, it can be written as a convex combination of them. The expected loss under (μ, ρ) is then

V(μ, ρ) = Σ_{π ∈ Π_det} μ(π) Σ_{z ∈ Z} ρ(z) L(π, z) = Σ_{π ∈ Π_det} Σ_{z ∈ Z} μ(π) ρ(z) L(π, z).

This expression is bilinear in (μ, ρ). Because the game is finite and zero-sum, von Neumann's minimax theorem applies directly [13]. Therefore

min_{μ ∈ Δ(Π_det)} max_{ρ ∈ P} V(μ, ρ) = max_{ρ ∈ P} min_{μ ∈ Δ(Π_det)} V(μ, ρ).
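As a numerical sanity check on this reduction (our sketch, not part of the paper), one can scan attacker and defender mixtures on a grid for a toy 2×2 loss matrix and confirm that the upper and lower game values coincide; the inner optimization on each side is attained at a pure strategy, i.e. an extreme point:

```python
import numpy as np

def minimax_gap(L, grid=1001):
    """Upper minus lower game value for a 2x2 zero-sum latent game.

    L[i, j] is the loss when the attacker plays pure policy i and the defender
    plays extreme point j. Mixtures are scanned on a grid; the inner optimum on
    each side is attained at a pure strategy because the payoff is bilinear.
    """
    ps = np.linspace(0.0, 1.0, grid)
    # Attacker mixes rows; worst-case defender response is an extreme point (column).
    upper = min(max(p * L[0, 0] + (1 - p) * L[1, 0],
                    p * L[0, 1] + (1 - p) * L[1, 1]) for p in ps)
    # Defender mixes columns; best attacker response is a pure policy (row).
    lower = max(min(q * L[0, 0] + (1 - q) * L[0, 1],
                    q * L[1, 0] + (1 - q) * L[1, 1]) for q in ps)
    return upper - lower

# Matching-pennies-style latent game: both values equal 1/2, so the gap vanishes.
gap = minimax_gap(np.array([[0.0, 1.0], [1.0, 0.0]]))
```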
Thus the finite-history reduction supplies finitely many attacker pure strategies, the defender polytope supplies finitely many extreme-point pure strategies, and the bilinear payoff V satisfies the hypotheses of von Neumann's theorem, proving the latent minimax equality.

B.2 Proof of the extreme-point corollary

(S) Fix an attacker mixture μ and optimize only over the defender polytope P.
(G) Show that the defender maximizer can be chosen at an extreme point because the payoff is linear in ρ.

Fix an attacker mixture μ. The map ρ ↦ V(μ, ρ) is linear in ρ, since V(μ, ρ) = Σ_{π ∈ Π_det} Σ_{z ∈ Z} μ(π) ρ(z) L(π, z) is affine, hence linear up to the simplex constraint. A standard fact from convex analysis is that a linear functional over a compact convex polytope attains its maximum at an extreme point. Therefore the defender optimum is realized at an extreme point of P.

B.3 Proof of the approximate best-response certificates

(S) Assume ρ_k is an ε_D-best response to π_{k−1} and π_k is an ε_A-best response to the mixture ν_k = λρ_k + (1−λ)ρ_U.
(G) Derive the two diagnostic inequalities by comparison with the nominal defender ρ_U and the previous attacker iterate π_{k−1}.

We begin with the defender certificate. By definition, ρ_k is an ε_D-best response to π_{k−1} if

sup_{ρ ∈ P} E_ρ[τ(π_{k−1})] − E_{ρ_k}[τ(π_{k−1})] ≤ ε_D.

Rearranging,

E_{ρ_k}[τ(π_{k−1})] ≥ sup_{ρ ∈ P} E_ρ[τ(π_{k−1})] − ε_D.

Because the nominal defender distribution ρ_U belongs to P,

sup_{ρ ∈ P} E_ρ[τ(π_{k−1})] ≥ E_{ρ_U}[τ(π_{k−1})].

Combining the two inequalities gives

E_{ρ_k}[τ(π_{k−1})] ≥ E_{ρ_U}[τ(π_{k−1})] − ε_D.

Subtracting E_{ρ_U}[τ(π_{k−1})] from both sides yields

E_{ρ_k}[τ(π_{k−1})] − E_{ρ_U}[τ(π_{k−1})] ≥ −ε_D.

This is exactly the claimed lower bound for defender_adversarial_k. We now prove the attacker certificate. Let ν_k = λρ_k + (1−λ)ρ_U.
By definition, π_k is an ε_A-best response to ν_k if

E_{ν_k}[τ(π_k)] − inf_{π ∈ Π} E_{ν_k}[τ(π)] ≤ ε_A.

Since π_{k−1} ∈ Π, we have

inf_{π ∈ Π} E_{ν_k}[τ(π)] ≤ E_{ν_k}[τ(π_{k−1})].

Therefore

E_{ν_k}[τ(π_k)] ≤ E_{ν_k}[τ(π_{k−1})] + ε_A.

Expanding the definition of ν_k,

λ E_{ρ_k}[τ(π_k)] + (1−λ) E_{ρ_U}[τ(π_k)] ≤ λ E_{ρ_k}[τ(π_{k−1})] + (1−λ) E_{ρ_U}[τ(π_{k−1})] + ε_A.

Subtract the two terms on the right-hand side from both sides. This yields

λ (E_{ρ_k}[τ(π_k)] − E_{ρ_k}[τ(π_{k−1})]) + (1−λ) (E_{ρ_U}[τ(π_k)] − E_{ρ_U}[τ(π_{k−1})]) ≤ ε_A.

The defender part follows from comparing the approximate defender response to the admissible nominal distribution ρ_U, and the attacker part follows from comparing the approximate attacker response to the admissible baseline policy π_{k−1}. Together these yield the two Stage-2 certificate inequalities.

B.4 Proof of finite-sample sign certification

(S) The empirical episode lengths are independent and bounded in [0, T_max].
(G) Use Hoeffding's inequality and a union bound to derive confidence radii for the defender diagnostic and the weighted attacker residual, then convert those radii into sign-certification statements.

We first prove the bound for the defender diagnostic. Let X_1, …, X_{n_D} be the independent episode lengths collected under ρ_k when evaluating π_{k−1}, and let Y_1, …, Y_{n_U} be the independent episode lengths collected under ρ_U when evaluating π_{k−1}. By assumption, 0 ≤ X_i ≤ T_max and 0 ≤ Y_j ≤ T_max almost surely. Define

X̄ = (1/n_D) Σ_{i=1}^{n_D} X_i,   Ȳ = (1/n_U) Σ_{j=1}^{n_U} Y_j.

Then Δ̂_D = X̄ − Ȳ and Δ_D = E[X̄] − E[Ȳ]. By Hoeffding's inequality [5], for any η_D > 0,

P(|X̄ − E[X̄]| > η_D) ≤ 2 exp(−2 n_D η_D² / T_max²).
Similarly, for any η_U > 0,

P(|Ȳ − E[Ȳ]| > η_U) ≤ 2 exp(−2 n_U η_U² / T_max²).

Choose

η_D = T_max √(log(4/δ) / (2 n_D)),   η_U = T_max √(log(4/δ) / (2 n_U)).

Then

P(|X̄ − E[X̄]| > η_D) ≤ δ/2,   P(|Ȳ − E[Ȳ]| > η_U) ≤ δ/2.

By a union bound, with probability at least 1 − δ,

|X̄ − E[X̄]| ≤ η_D   and   |Ȳ − E[Ȳ]| ≤ η_U

simultaneously. On that event,

|Δ̂_D − Δ_D| = |(X̄ − Ȳ) − (E[X̄] − E[Ȳ])| = |(X̄ − E[X̄]) − (Ȳ − E[Ȳ])| ≤ |X̄ − E[X̄]| + |Ȳ − E[Ȳ]| ≤ η_D + η_U.

Substituting the chosen values of η_D and η_U yields the claimed bound. The sign-certification statement is now immediate. If

|Δ̂_D| > η_D + η_U,

then the interval [Δ̂_D − (η_D + η_U), Δ̂_D + (η_D + η_U)] does not contain zero. Since Δ_D lies inside that interval with probability at least 1 − δ, the sign of Δ_D must match that of Δ̂_D on that event. For the weighted attacker residual, write

R̂_k = λ(X̄_1 − X̄_2) + (1−λ)(Ȳ_1 − Ȳ_2),

where X̄_1, X̄_2, Ȳ_1, Ȳ_2 are independent empirical means of bounded episode lengths in [0, T_max], and let

R_k = λ(E[X̄_1] − E[X̄_2]) + (1−λ)(E[Ȳ_1] − E[Ȳ_2]).

Apply Hoeffding's inequality separately to each empirical mean, choose each deviation bound so that its failure probability is at most δ/4, and apply a union bound. With

η_{X_1} = T_max √(log(8/δ) / (2 n_{D,a})),   η_{X_2} = T_max √(log(8/δ) / (2 n_{D,b})),

and

η_{Y_1} = T_max √(log(8/δ) / (2 n_{U,a})),   η_{Y_2} = T_max √(log(8/δ) / (2 n_{U,b})),

the same algebra as above gives

|R̂_k − R_k| ≤ λ(η_{X_1} + η_{X_2}) + (1−λ)(η_{Y_1} + η_{Y_2}),

which is exactly the weighted confidence radius stated in the theorem. Each diagnostic is a signed combination of bounded empirical means, so separate Hoeffding bounds plus a union bound produce explicit confidence radii.
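Under these assumptions the radii are explicit, and plugging in plausible evaluation sizes shows why row-wise certification is conservative. A sketch (ours), taking T_max = 100 shots, n_D = 50, and n_U = 100 as an example:

```python
import math

def sign_certified(delta_hat, n_D, n_U, T_max, delta=0.05):
    """Hoeffding sign check for the defender diagnostic.

    Returns (certified, radius): whenever |delta_hat| > radius, the empirical
    sign matches the population sign with probability at least 1 - delta.
    """
    eta_D = T_max * math.sqrt(math.log(4.0 / delta) / (2.0 * n_D))
    eta_U = T_max * math.sqrt(math.log(4.0 / delta) / (2.0 * n_U))
    radius = eta_D + eta_U
    return abs(delta_hat) > radius, radius
```

With δ = 0.05 and these sample sizes the combined radius is roughly 36 shots, far larger than observed diagnostics of a few shots, which is consistent with treating the finite-sample theorem as a worst-case sanity check rather than a sharp row-wise certificate.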
Whenever the observed empirical value lies outside its corresponding radius, the empirical sign agrees with the population sign on the high-probability event.

B.5 Proof of marginal insufficiency for fixed policies

(S) Construct two distributions on a product latent space that agree on one-coordinate marginals but differ in their correlation structure.
(G) Exhibit a fixed deterministic attacker whose loss depends on that correlation, proving that one-coordinate marginals are insufficient.

We prove the proposition by explicit construction. Let the latent space be

Z = {0,1}²,   z = (z_1, z_2).

Consider two defender distributions

ρ⁺: (0,0) and (1,1) each with probability 1/2,
ρ⁻: (0,1) and (1,0) each with probability 1/2.

Under both distributions,

P(z_1 = 1) = 1/2,   P(z_2 = 1) = 1/2.

Therefore the one-coordinate marginals are identical. Now define a two-step adversarial latent-initial-state POMDP as follows. At step 1, the attacker observes z_1 exactly. At step 2, the attacker outputs a guess ẑ_2 ∈ {0,1} for the second coordinate. Define the episode length by

τ = 0 if ẑ_2 = z_2,   τ = 1 if ẑ_2 ≠ z_2.

Equivalently, shorter episodes correspond to correct prediction and longer episodes correspond to incorrect prediction. Consider the deterministic attacker policy

π^same: ẑ_2 = z_1.

Under ρ⁺, the only possible latent states are (0,0) and (1,1), so always z_2 = z_1. Hence π^same is always correct and therefore

E_{ρ⁺}[τ(π^same)] = 0.

Under ρ⁻, the only possible latent states are (0,1) and (1,0), so always z_2 = 1 − z_1. Hence π^same is always incorrect and therefore

E_{ρ⁻}[τ(π^same)] = 1.

Thus

E_{ρ⁺}[τ(π^same)] ≠ E_{ρ⁻}[τ(π^same)],

even though ρ⁺ and ρ⁻ have identical one-coordinate marginals.
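The construction is small enough to verify mechanically. A sketch (ours) that checks both the matching marginals and the differing expected losses:

```python
# Two latent distributions on {0,1}^2 with identical one-coordinate marginals.
rho_plus = {(0, 0): 0.5, (1, 1): 0.5}   # perfectly correlated coordinates
rho_minus = {(0, 1): 0.5, (1, 0): 0.5}  # perfectly anti-correlated coordinates

def marginal(rho, coord, value):
    """P(z_coord = value) under the latent distribution rho."""
    return sum(p for z, p in rho.items() if z[coord] == value)

def expected_loss(rho):
    """E_rho[tau(pi_same)] for the fixed policy that guesses z2_hat = z1."""
    return sum(p * (0 if z[0] == z[1] else 1) for z, p in rho.items())
```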
This proves that one-coordinate marginals are not sufficient to characterize adversarial hardness for fixed policies. The construction isolates the missing information in the joint dependence between z_1 and z_2: matching one-coordinate marginals does not prevent different expected losses for a fixed policy.

B.6 Proof of the supported weakly Pareto-optimality corollary

(S) Let π⋆ minimize a weighted sum of adversarial and nominal losses with strictly positive weights.
(G) Rule out strict improvement in both objectives and then identify the supporting hyperplane exposed by the weighted sum.

Let

f(π) = E_{ρ_D}[τ(π)],   g(π) = E_{ρ_U}[τ(π)],

and suppose π⋆ is an exact minimizer of λ f(π) + (1−λ) g(π) for some λ ∈ (0,1). If π⋆ were not weakly Pareto-optimal, then there would exist π̃ such that

f(π̃) < f(π⋆)   and   g(π̃) < g(π⋆).

Multiplying by the strictly positive weights λ and (1−λ) and adding yields

λ f(π̃) + (1−λ) g(π̃) < λ f(π⋆) + (1−λ) g(π⋆),

contradicting optimality of π⋆. Because π⋆ minimizes a nontrivial linear functional of the two objectives, the point (f(π⋆), g(π⋆)) lies on a supporting hyperplane of the achievable loss region with outward normal (λ, 1−λ). Hence it is a supported weakly Pareto-optimal point.

Appendix C Hyperparameters and Training Metadata

The exact PPO hyperparameters used for all Stage-1 and Stage-2 attacker training runs in the uploaded paper are listed below.

Table 6: Proximal Policy Optimization (PPO) hyperparameters.
Hyperparameter                       Value
Learning rate                        3 × 10^−4
n_steps (per environment)            2048
Batch size                           1024
Number of epochs                     5
Discount factor (γ)                  0.99
GAE lambda                           0.95
Clip range                           0.2
Entropy coefficient                  0.01
Max gradient norm                    0.5
Number of parallel environments      16
Optimizer                            Adam
Total environment steps (Stage 1)    2,000,000
Total environment steps (Stage 2)    1,000,000 per generation

Appendix D Additional Baselines

The absolute performance scale of three scripted baselines is summarized below. While the primary focus of this paper is the comparison between learned attackers and learned defenders, establishing absolute bounds on optimal and random play is crucial for calibration. The Battleship domain allows us to define several canonical non-learned strategies:

• Random: This baseline fires uniformly at random at any valid, un-targeted cell. It maintains no state and leverages no game rules beyond the legality of actions. Its performance essentially bounds the worst possible coherent play.

• ProbMap (Heuristic Probability Map): This baseline maintains a dense probability map of all possible ship placements given the current hit/miss history. It always targets the cell with the highest marginal probability of containing a ship. After a hit, it aggressively targets adjacent cells (parity tracking). ProbMap is an excellent proxy for strong human play because it greedily minimizes stepwise uncertainty.

• Particle Belief: This baseline maintains an explicit particle filter tracking a set of valid board configurations consistent with the history constraint. Given a large number of particles, it approximates the exact Bayesian posterior over hidden layouts. In partially observable settings like Battleship, an exact belief state is the sufficient statistic for optimal control; therefore, Particle Belief represents an extremely strong, pseudo-optimal search baseline.
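To make the ProbMap idea concrete, the following sketch (our simplification, not the paper's baseline) computes the single-ship marginal placement count that such a heuristic maximizes; the real baseline additionally handles hits, multiple ships, and parity targeting:

```python
import numpy as np

def placement_probability_map(H, W, ship_len, misses):
    """Per-cell count of legal single-ship placements consistent with misses.

    Enumerates horizontal and vertical placements of one ship of length
    ship_len on an H x W board, skips any placement that covers a missed
    cell, and accumulates how many surviving placements cover each cell.
    A ProbMap-style player fires at the argmax cell of this map.
    """
    counts = np.zeros((H, W), dtype=int)
    for r in range(H):
        for c in range(W):
            for dr, dc in ((0, 1), (1, 0)):  # horizontal, then vertical
                cells = [(r + i * dr, c + i * dc) for i in range(ship_len)]
                if cells[-1][0] >= H or cells[-1][1] >= W:
                    continue  # placement runs off the board
                if any(cell in misses for cell in cells):
                    continue  # inconsistent with an observed miss
                for rr, cc in cells:
                    counts[rr, cc] += 1
    return counts
```

On an empty board the count peaks at central cells, which is why probability-map players open near the middle of the grid.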
As shown in Table 7, the performance gap between random play and belief-based play is nearly 50 shots. Furthermore, while the learned policies presented in the main text demonstrate significant relative robustness gains under adversarial shift, they do not yet attain the absolute efficiency of the belief-state searchers on the nominal Uniform distribution.

Table 7: Baseline absolute performance (mean shots-to-win ± standard deviation) against UNIFORM and SPREAD.

Scripted Policy   UNIFORM      SPREAD
Random            96.0 ± 4.2   95.5 ± 4.8
ProbMap           44.7 ± 8.9   49.5 ± 7.8
Particle Belief   48.2 ± 9.3   51.1 ± 7.9

Appendix E Training Dynamics and IBR Diagnostics

This appendix provides the final generation-wise diagnostic figure, as well as additional training and tail-performance curves that confirm the robustness and optimization claims made in the main text.

Figure 1: Training dynamics and IBR diagnostics averaged over independent seeds. (a) Stage-1 attacker learning curves: fixed-mixture and alternating stress approaches readily acquire nominal proficiency without catastrophic gameplay degradation. (b) Stage-1 tail optimization curves: direct targeted adversarial exposure effectively limits extreme worst-case trajectory vulnerabilities, explicitly lowering tail severity (CVaR_0.10 and p95). (c) Stage-2 final IBR generation diagnostics: tracked IBR metrics show that sufficient optimization bandwidth can yield positive defender_adversarial shifts and subsequent attacker_adaptation; these curves should be read as theorem-motivated diagnostics rather than as exact row-wise certificates.

References

[1] G. W. Brown (1951) Iterative solution of games by fictitious play. Technical report, RAND Corporation.
[2] M. Chen, E. Frazzoli, D. Hsu, and W. S. Lee (2016) POMDP-lite for robust robot planning under uncertainty. In 2016 IEEE International Conference on Robotics and Automation, pp. 5427–5433.
[3] Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems.
[4] J. Heinrich and D. Silver (2016) Deep reinforcement learning from self-play in imperfect-information games. arXiv:1603.01121.
[5] W. Hoeffding (1963) Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), pp. 13–30.
[6] G. N. Iyengar (2005) Robust dynamic programming for Markov decision processes. Mathematics of Operations Research 30 (2), pp. 257–280.
[7] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1–2), pp. 99–134.
[8] A. Nilim and L. El Ghaoui (2005) Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53 (5), pp. 780–798.
[9] T. Osogami (2015) Robust partially observable Markov decision process. In Proceedings of the 32nd International Conference on Machine Learning.
[10] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta (2017) Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 2817–2826.
[11] R. T. Rockafellar and S. Uryasev (2000) Optimization of conditional value-at-risk. Journal of Risk 2 (3), pp. 21–42.
[12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
[13] J. von Neumann (1928) Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100 (1), pp. 295–320.